web crawler sample

All replies

  • Thanks for sharing, Bnaya.
    Monday, February 06, 2012 3:50 PM
  • Hi,

    I love the TPL approach to a Crawler/Spider and I am trying to implement my own for a client project. Do you happen to know how I would go about detecting that it has finished/completed?

    I see that it is waiting for:

    Task.WaitAll(downloader.Completion, linkParser.Completion);

    and in your example you wait for a period of time before calling downloader.Complete() and linkParser.Complete(). How would you do this in a real-world crawler?

    It needs to wait until there are no more links to process and the downloader is not going to send any more. Can this be done?

    Tuesday, August 28, 2012 1:29 PM
    The timeout is there to keep the sample from being a real crawler (to keep it legal).
    A real-life crawler on the web can continue forever.

    If you want to stop the operation once some condition is satisfied, you can link an ActionBlock to any block that
    carries the indication of that condition, and have it call Complete() on the blocks you want to stop.
    (When you define the links between the blocks, you can set them up so that completion propagates to the
    target block; that way you can call Complete() on the root block and it will propagate through the entire network.)

    In the monitoring ActionBlock you can apply logic based on state such as counters, etc.


    Bnaya Eshet

    Tuesday, August 28, 2012 1:44 PM
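
The monitoring idea in the reply above could look roughly like the following. This is only a minimal sketch under assumptions: the block names, the `maxPages` stop condition, and the string "content" are all hypothetical stand-ins, and no real downloading is done. A monitoring ActionBlock counts what flows past and calls Complete() on the root block; the links use PropagateCompletion so the rest of the network shuts down after draining.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class MonitorSketch
{
    // Returns how many items the worker processed before the network shut down.
    public static int Run(int maxPages)
    {
        int seen = 0, processed = 0;
        var propagate = new DataflowLinkOptions { PropagateCompletion = true };

        // Stand-in for the downloader; no real HTTP request is made here.
        var downloader = new TransformBlock<string, string>(url => "content of " + url);
        var broadcaster = new BroadcastBlock<string>(content => content);
        var worker = new ActionBlock<string>(content => { Interlocked.Increment(ref processed); });

        // Monitoring block: once the (hypothetical) condition is met,
        // complete the root block and let completion propagate downstream.
        var monitor = new ActionBlock<string>(content =>
        {
            if (Interlocked.Increment(ref seen) == maxPages)
                downloader.Complete();
        });

        downloader.LinkTo(broadcaster, propagate);
        broadcaster.LinkTo(worker, propagate);
        broadcaster.LinkTo(monitor, propagate);

        // Post returns false once the block has been completed.
        for (int i = 0; i < 1000; i++)
            if (!downloader.Post("http://example.com/" + i)) break;

        // Safety net in case the condition was never met; Complete() is idempotent.
        downloader.Complete();

        Task.WaitAll(worker.Completion, monitor.Completion);
        return processed;
    }

    public static void Main()
    {
        Console.WriteLine("processed " + Run(3) + " pages");
    }
}
```

Note that Complete() only stops the block from accepting new input; items that were already accepted still flow through, so the worker may process more than `maxPages` items.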
  • Hi, thanks for getting back to me so soon.

    It really is only for indexing my client's content and pushing it into a Lucene.net index. Nothing malicious.

    Do you know of any simple example of this completion propagation?


    Here is my layout

    ///////////////////////////////////////////////////////////////////////
    //                   downloader <-------------------------           //
    //                       |                                |          //
    //               contentBroadcaster                       |          //
    //              /                  \                      |          //
    //      htmlParser                  linkParsers--->linkBroadcaster   //
    //          |                                             |          //
    //       indexWriter <------------------------------------           //
    ///////////////////////////////////////////////////////////////////////
    disposeAll = new CompositeDisposable(
                    downloader.LinkTo(contentBroadcaster, downloaded => downloaded != null),
                    contentBroadcaster.LinkTo(htmlParser),
                    contentBroadcaster.LinkTo(linkParser),
                    linkParser.LinkTo(linkBroadcaster),
                    linkBroadcaster.LinkTo(downloader, linkFilter),
                    htmlParser.LinkTo(indexWriter)
                );
    Can you help? Thanks
    Tuesday, August 28, 2012 2:19 PM
  • Just use something like this:

    block.LinkTo(actionBlock, new DataflowLinkOptions { PropagateCompletion = true });

    and keep only the disposable of the downloader (the root block).


    Bnaya Eshet

    Tuesday, August 28, 2012 2:37 PM
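
For a simple example of completion propagation, here is a minimal sketch on a straight chain of three blocks. The blocks and their string transformations are hypothetical placeholders and not the layout from the question; in particular there is no feedback loop here. Each link is created with PropagateCompletion = true, so calling Complete() on the root is enough for the last block's Completion task to finish once everything has drained.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks.Dataflow;

public static class PropagationSketch
{
    public static List<string> Run(IEnumerable<string> urls)
    {
        var results = new List<string>();
        var propagate = new DataflowLinkOptions { PropagateCompletion = true };

        // Hypothetical stand-ins for download, parse and index steps.
        var downloader = new TransformBlock<string, string>(url => "content of " + url);
        var parser = new TransformBlock<string, string>(content => content.ToUpperInvariant());
        // The default degree of parallelism is 1, so a plain List is safe here.
        var writer = new ActionBlock<string>(doc => results.Add(doc));

        downloader.LinkTo(parser, propagate);
        parser.LinkTo(writer, propagate);

        foreach (var url in urls)
            downloader.Post(url);

        // Complete only the root; completion travels down the links.
        downloader.Complete();
        writer.Completion.Wait();
        return results;
    }

    public static void Main()
    {
        foreach (var doc in Run(new[] { "a", "b" }))
            Console.WriteLine(doc);
    }
}
```

In a layout with a feedback link back to the root block, that link would presumably be left without PropagateCompletion, and something else still has to decide when to call Complete() on the root.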
  • I guess what I am struggling to see is how I could detect that I am finished. I need to know that there are no new links to download and all downloads have finished.
    Tuesday, August 28, 2012 2:51 PM
    This is actually more of a logical problem than a technical one.

    Because each download may yield multiple links, the crawl forms a graph (which can be circular).
    You could keep a counter that is increased by (link count - 1) for each downloaded page, so that it decreases
    when a page has no links (you should skip circular links). Another direction is to monitor the
    downloader's buffer count; if it stays empty for a while, you may assume that the network has finished executing.

    And I am sure there are more ideas to think of.


    Bnaya Eshet

    Tuesday, August 28, 2012 3:02 PM
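
One way to read the counter suggestion above is as a count of pending URLs: add one for every new link that is queued, subtract one when a page has been fully handled, and call Complete() when the count reaches zero. The net change per page is then (new links - 1), which matches the description. The sketch below is only an illustration under assumptions: the `Site` dictionary is a hypothetical in-memory link graph standing in for real downloads, a single ActionBlock plays the whole pipeline, and a visited set is used to skip circular links.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks.Dataflow;

public static class PendingCounterSketch
{
    // Hypothetical link graph: page -> links. Note the cycle a -> b -> a.
    static readonly Dictionary<string, string[]> Site = new Dictionary<string, string[]>
    {
        ["a"] = new[] { "b", "c" },
        ["b"] = new[] { "a", "d" },
        ["c"] = new string[0],
        ["d"] = new string[0],
    };

    // Crawls from startUrl and returns the number of distinct pages visited.
    public static int Crawl(string startUrl)
    {
        var visited = new ConcurrentDictionary<string, bool>();
        int pending = 0;
        ActionBlock<string> downloader = null;

        void Enqueue(string url)
        {
            if (!visited.TryAdd(url, true)) return;  // already seen: skip circular links
            Interlocked.Increment(ref pending);      // one more page outstanding
            downloader.Post(url);
        }

        downloader = new ActionBlock<string>(url =>
        {
            // Queue the page's links first, then mark this page as done,
            // so the counter cannot reach zero while work is still being added.
            foreach (var link in Site[url])
                Enqueue(link);

            if (Interlocked.Decrement(ref pending) == 0)
                downloader.Complete();               // nothing queued, nothing in flight
        });

        Enqueue(startUrl);
        downloader.Completion.Wait();
        return visited.Count;
    }

    public static void Main()
    {
        Console.WriteLine("visited " + Crawl("a") + " pages");
    }
}
```

In a multi-block network the decrement would have to happen only after the link parser has queued all links for that page. The other suggestion, polling the downloader's InputCount until it stays at zero for a while, is simpler to bolt on but remains a heuristic, since work may still be in flight in other blocks.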
  • That's great, it confirms what I was thinking. I will come up with some solution.

    Thanks for taking the time to assist. I appreciate it.

    Tuesday, August 28, 2012 3:04 PM