I have published a post (and sample code)
about creating a web crawler application using the TPL Dataflow library.
I love the TPL approach to a Crawler/Spider and I am trying to implement my own for a client project. Do you happen to know how I would go about detecting that it has finished/completed?
I see that it is waiting for:
and in your example you wait for a period of time before calling the downloader.Complete() and linkParser.Complete() methods. How would you do this in a real-world example?
It needs to wait until there are no more links to process and the downloader is not going to send any more. Can this be done?
The timeout is there to avoid being a real crawler (to keep it legal).
A real-life crawler on the web can continue forever.
If you want to stop the operation after some conditions are satisfied, you can use an ActionBlock linked to any block that may
carry the indication that the completion condition has been met, and call Complete on the blocks that you want to stop.
(When you define the links between the blocks, you can define a link in a way that propagates completion to the
target block; this way you can call Complete on the root block and it will propagate to the entire network.)
In the monitoring ActionBlock you can apply logic based on state such as counters, etc.
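A minimal sketch of the idea above, assuming a toy network rather than the crawler itself: `root`, `worker`, `monitor`, `RunAsync` and the item limit are all hypothetical names chosen for illustration. The links are created with `PropagateCompletion = true`, and a monitoring ActionBlock counts messages and calls `Complete` on the root once the limit is reached.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class CompletionPropagationSketch
{
    // Hypothetical helper: feeds up to itemCount numbers into a small network
    // and stops the whole network once the monitor has seen `limit` of them.
    // Assumes limit <= itemCount, otherwise nothing ever completes the root.
    public static async Task<int> RunAsync(int limit, int itemCount)
    {
        // Completion (or a fault) of the source is forwarded to the target.
        var propagate = new DataflowLinkOptions { PropagateCompletion = true };

        var root = new BufferBlock<int>();
        var broadcaster = new BroadcastBlock<int>(x => x);

        int processed = 0;
        var worker = new ActionBlock<int>(x => Interlocked.Increment(ref processed));

        // The monitoring block: applies the stop condition (a simple counter here)
        // and completes the root. ActionBlock runs one message at a time by default,
        // so the plain ++ is safe.
        int seen = 0;
        var monitor = new ActionBlock<int>(x =>
        {
            if (++seen == limit) root.Complete();
        });

        root.LinkTo(broadcaster, propagate);
        broadcaster.LinkTo(worker, propagate);
        broadcaster.LinkTo(monitor, propagate);

        for (int i = 0; i < itemCount; i++)
        {
            // Post returns false once the root has been completed.
            if (!root.Post(i)) break;
        }

        // Both leaves complete only because completion flowed down from the root.
        await Task.WhenAll(worker.Completion, monitor.Completion);
        return processed;
    }
}
```

Note that `Complete` only stops a block from accepting new messages; anything already buffered is still delivered, so the returned count can be larger than the limit.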
Hi, thanks for getting back to me so soon.
It really is only for indexing my client's content and pushing it into a Lucene.net index. Nothing malicious.
Do you know of any simple example of this completion propagation?
Here is my layout:
///////////////////////////////////////////////////////////////////////
//   downloader <-------------------------                           //
//       |                               |                           //
//   contentBroadcaster                  |                           //
//     /               \                 |                           //
// htmlParser         linkParsers--->linkBroadcaster                 //
//     |                                           |                 //
// indexWriter <------------------------------------                 //
///////////////////////////////////////////////////////////////////////
disposeAll = new CompositeDisposable(
    downloader.LinkTo(contentBroadcaster, downloaded => downloaded != null),
    contentBroadcaster.LinkTo(htmlParser),
    contentBroadcaster.LinkTo(linkParser),
    linkParser.LinkTo(linkBroadcaster),
    linkBroadcaster.LinkTo(downloader, linkFilter),
    htmlParser.LinkTo(indexWriter)
);

Can you help? Thanks
This is actually more of a logical problem than a technical one.
Because each download may have multiple links, it will create a graph (which can be circular).
You can think of a counter which is increased by the link count - 1 for each page, and so decreases
when a page has no links (you should avoid circular ones). Another direction may be to monitor the
downloader buffer count, and if it stays empty for a while you may assume that the network execution has completed.
And I am sure that there are more ideas to think of.
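The counter idea could be sketched roughly as follows. This is not the crawler from the post: the "site" is a hypothetical in-memory dictionary of page to links, and `CrawlAsync`, `visit`, `pending` and `visited` are made-up names. Each processed page changes the counter by (new links - 1), already visited links are skipped so that circular links do not inflate the count, and the block is completed when the counter reaches zero.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class PendingCounterSketch
{
    // Hypothetical stand-in for a crawl: `site` maps a page to the links on it.
    // Returns the number of distinct pages discovered.
    public static async Task<int> CrawlAsync(IDictionary<string, string[]> site, string start)
    {
        var visited = new ConcurrentDictionary<string, bool>();
        int pending = 1; // the start page is the only outstanding work item

        TransformManyBlock<string, string> downloader = null;

        Func<string, IEnumerable<string>> visit = url =>
        {
            // Keep only links not seen before; this is what breaks circular links.
            string[] fresh = site.TryGetValue(url, out var links)
                ? links.Where(l => visited.TryAdd(l, true)).ToArray()
                : new string[0];

            // This page is done (-1) and each new link is new work (+fresh.Length).
            // A page with no new links therefore decreases the counter by one.
            if (Interlocked.Add(ref pending, fresh.Length - 1) == 0)
                downloader.Complete(); // nothing queued, nothing in flight

            return fresh;
        };

        downloader = new TransformManyBlock<string, string>(visit);

        // Feedback link: discovered links go back into the downloader.
        downloader.LinkTo(downloader);

        visited.TryAdd(start, true);
        downloader.Post(start);

        await downloader.Completion;
        return visited.Count;
    }
}
```

For example, with pages A -> {B, C}, B -> {A, C} and C -> {} the counter goes 1, 2, 1, 0 and the block completes after three pages, even though B links back to A.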