Saturday, January 28, 2012 10:10 AM
I have published a post (and sample code) about creating a web crawler application using the TPL Dataflow library.
Monday, February 06, 2012 3:50 PM
Thanks for sharing, Bnaya.
Tuesday, August 28, 2012 1:29 PM
I love the TPL approach to a Crawler/Spider and I am trying to implement my own for a client project. Do you happen to know how I would go about detecting that it has finished/completed?
I see that it is waiting for:
and in your example you wait for a period of time before calling the downloader.Complete() and linkParser.Complete() methods. How would you do this in a real-world example?
It needs to wait until there are no more links to process and the downloader is no longer going to send any more. Can this be done?
Tuesday, August 28, 2012 1:44 PM
The timeout is there to avoid behaving like a real crawler (to keep it legal); a real-life crawler on the web can continue forever.
If you want to stop the operation after some condition is satisfied, you can use an ActionBlock linked to any block that may carry the completion indication, and call Complete on the blocks that you want to stop.
When you define the links between the blocks, you can define a link in a way that propagates completion to the target block; this way you can call Complete on the root block and it will propagate through the entire network.
In the monitoring ActionBlock you can apply logic based on state, such as counters, etc.
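The completion propagation described above can be sketched as follows (a minimal example, not the thread's actual code; the two block names are illustrative, and the target framework needs the System.Threading.Tasks.Dataflow package):

```csharp
using System;
using System.Threading.Tasks.Dataflow;

class PropagationDemo
{
    static void Main()
    {
        // Hypothetical two-block pipeline standing in for the crawler network.
        var downloader = new TransformBlock<string, string>(url => "<html></html>");
        var parser = new ActionBlock<string>(html => Console.WriteLine("parsed"));

        // PropagateCompletion makes Complete() (and faults) flow from the
        // source block to the target automatically.
        downloader.LinkTo(parser,
            new DataflowLinkOptions { PropagateCompletion = true });

        downloader.Post("http://example.com/");
        downloader.Complete();      // complete the root block...
        parser.Completion.Wait();   // ...and completion propagates downstream.
    }
}
```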
Tuesday, August 28, 2012 2:19 PM
Hi, thanks for getting back to me so soon.
It really is only for indexing my clients content and pushing it into a Lucene.net Index. Nothing malicious.
Do you know of any simple example of this completion propagation?
Here is my layout
//////////////////////////////////////////////////////////////////////
// downloader <------------------------------                       //
//     |                                    |                       //
// contentBroadcaster                       |                       //
//    /         \                           |                       //
// htmlParser    linkParser--->linkBroadcaster                      //
//     |                                    |                       //
// indexWriter <-----------------------------                       //
//////////////////////////////////////////////////////////////////////
disposeAll = new CompositeDisposable(
    downloader.LinkTo(contentBroadcaster, downloaded => downloaded != null),
    contentBroadcaster.LinkTo(htmlParser),
    contentBroadcaster.LinkTo(linkParser),
    linkParser.LinkTo(linkBroadcaster),
    linkBroadcaster.LinkTo(downloader, linkFilter),
    htmlParser.LinkTo(indexWriter)
);

Can you help? Thanks
Tuesday, August 28, 2012 2:37 PM
Define the links so that they propagate completion, and only keep the disposable of the downloader (the root block).
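Applied to the layout in the previous post, that would mean passing a DataflowLinkOptions with PropagateCompletion set on each link. A sketch (it reuses the poster's block names, which are not defined here; `propagate` is a hypothetical local):

```csharp
// Sketch only: downloader, contentBroadcaster, htmlParser, linkParser,
// linkBroadcaster, indexWriter and linkFilter come from the post above.
var propagate = new DataflowLinkOptions { PropagateCompletion = true };

downloader.LinkTo(contentBroadcaster, propagate, downloaded => downloaded != null);
contentBroadcaster.LinkTo(htmlParser, propagate);
contentBroadcaster.LinkTo(linkParser, propagate);
linkParser.LinkTo(linkBroadcaster, propagate);
linkBroadcaster.LinkTo(downloader, propagate, linkFilter);
htmlParser.LinkTo(indexWriter, propagate);
```

One caveat with this particular graph: the linkBroadcaster-to-downloader link is a cycle, and propagating completion around a cycle can behave surprisingly, so you may prefer to propagate completion only along the forward links and complete the downloader explicitly.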
Tuesday, August 28, 2012 2:51 PM
I guess what I am struggling to see is how I could detect that I am finished. I need to know that there are no new links to download and all downloads have finished.
Tuesday, August 28, 2012 3:02 PM
This is actually more of a logical problem than a technical one.
Because each download may contain multiple links, the crawl forms a graph (which can be circular).
You can think of a counter that is increased by the number of links on each page and decreased when a page has been processed (you should avoid circular links). Another direction may be to monitor the downloader's buffer count: if it stays empty for a while, you may assume that the network execution has completed.
And I am sure there are more ideas to think of.
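The counter idea can be sketched with an interlocked pending-pages count: increment when a link is queued, decrement when its page has been processed, and complete the root block when the count reaches zero. A minimal, self-contained sketch (the block contents are hypothetical stand-ins, not the thread's crawler; a real crawler would also need a visited-URL set to break cycles):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks.Dataflow;

class CounterCompletionDemo
{
    static int _pending;                                  // pages in flight
    static TransformBlock<string, string[]> _downloader;  // url -> links found

    static void Enqueue(string url)
    {
        // Increment BEFORE posting, so the count never drops to zero
        // while a newly discovered page is still on its way in.
        Interlocked.Increment(ref _pending);
        _downloader.Post(url);
    }

    static void Main()
    {
        // Fake "download": the root page yields two links, leaves yield none.
        _downloader = new TransformBlock<string, string[]>(url =>
            url == "root" ? new[] { "a", "b" } : new string[0]);

        var linkParser = new ActionBlock<string[]>(links =>
        {
            foreach (var link in links)
                Enqueue(link);                 // new work discovered

            // This page is done; if it was the last one in flight,
            // complete the root and let completion propagate.
            if (Interlocked.Decrement(ref _pending) == 0)
                _downloader.Complete();
        });

        _downloader.LinkTo(linkParser,
            new DataflowLinkOptions { PropagateCompletion = true });

        Enqueue("root");
        linkParser.Completion.Wait();
        Console.WriteLine("crawl finished");
    }
}
```

Because child links are counted before the parent page is marked done, the counter cannot prematurely hit zero between a page finishing and its links being queued.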
Tuesday, August 28, 2012 3:04 PM
That's great; it confirms what I was thinking. I will come up with a solution.
Thanks for taking the time to assist. I appreciate it