web crawler sample
-
Saturday, January 28, 2012 10:10
I have published a post (with sample code)
about creating a web crawler application using the TPL Dataflow library.
http://blogs.microsoft.co.il/blogs/bnaya/archive/2012/01/28/tpl-dataflow-walkthrough-part-5.aspx
Bnaya Eshet
All Replies
-
Monday, February 6, 2012 15:50 Owner
Thanks for sharing, Bnaya.
-
Tuesday, August 28, 2012 13:29
Hi,
I love the TPL approach to a crawler/spider and I am trying to implement my own for a client project. Do you happen to know how I would go about detecting that it has finished?
I see that it is waiting on:
Task.WaitAll(downloader.Completion, linkParser.Completion);
and in your example you wait for a period of time before calling the downloader.Complete() and linkParser.Complete() methods. How would you do this in a real-world example?
It needs to wait until there are no more links to process and the downloader is not going to send any more. Can this be done?
-
Tuesday, August 28, 2012 13:44
The timeout is there so that the sample does not behave like a real crawler (to keep it legal).
A real-life crawler on the web can continue forever. If you want to stop the operation once some condition is satisfied, you can link an ActionBlock to any block that can indicate
that the completion condition has been met, and call Complete on the blocks that you want to stop.
(When you define the links between the blocks, you can define a link so that it propagates completion to the
target block; this way you can call Complete on the root block and it will propagate through the entire network.)
In the monitoring ActionBlock you can apply logic based on state such as counters, etc.
Bnaya Eshet
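
A minimal sketch of the monitoring idea described above, under stated assumptions: the names `MonitorBlockSketch`, `worker`, `monitor`, `total` and `limit` are hypothetical and not taken from the original sample, and the "condition" is simply "a given number of items has been seen". It assumes the condition is eventually met; otherwise the final await would never finish.

```csharp
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class MonitorBlockSketch
{
    // Returns how many items the monitor saw before the network shut down.
    public static async Task<int> RunAsync(int total, int limit)
    {
        int processed = 0;

        // Stand-in for a block that does real work (e.g. a downloader).
        var worker = new TransformBlock<int, int>(n => n);

        // Monitoring block: keeps a counter and stops the worker once the
        // (hypothetical) stop condition is satisfied.
        var monitor = new ActionBlock<int>(n =>
        {
            processed++;
            if (processed == limit)
            {
                worker.Complete(); // worker declines any further input
            }
        });

        // Completion of the worker flows on to the monitor.
        worker.LinkTo(monitor, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < total; i++)
        {
            worker.Post(i); // returns false once the worker has been completed
        }

        await monitor.Completion;
        return processed;
    }
}
```

Items that the worker accepted before `Complete` was called still flow through, so the monitor may see more than `limit` items; the counter only decides when to stop accepting new work.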
-
Tuesday, August 28, 2012 14:19
Hi, thanks for getting back to me so soon.
It really is only for indexing my client's content and pushing it into a Lucene.net index. Nothing malicious.
Do you know of any simple example of this completion propagation?
Here is my layout:
Here is my layout
///////////////////////////////////////////////////////////////////////
//  downloader <-------------------------                            //
//      |                                |                           //
//  contentBroadcaster                   |                           //
//     /        \                        |                           //
// htmlParser  linkParsers--->linkBroadcaster                        //
//     |                                 |                           //
// indexWriter <------------------------------------                 //
///////////////////////////////////////////////////////////////////////
disposeAll = new CompositeDisposable(
    downloader.LinkTo(contentBroadcaster, downloaded => downloaded != null),
    contentBroadcaster.LinkTo(htmlParser),
    contentBroadcaster.LinkTo(linkParser),
    linkParser.LinkTo(linkBroadcaster),
    linkBroadcaster.LinkTo(downloader, linkFilter),
    htmlParser.LinkTo(indexWriter)
);
Can you help? Thanks
-
Tuesday, August 28, 2012 14:37
Just use something like this:
block.LinkTo(actionBlock, new DataflowLinkOptions { PropagateCompletion = true });
and keep only the disposable of the downloader (the root block).
Bnaya Eshet
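
As an illustration of the `PropagateCompletion` option mentioned above, here is a minimal sketch on a straight, non-cyclic chain. The block names echo the layout in the question, but the block bodies are placeholders (no real download or parsing), and `PropagateCompletionSketch` is a hypothetical name. A network with a feedback loop, like the one in the question, would still need its own rule for when to call `Complete` on the root.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class PropagateCompletionSketch
{
    public static async Task<List<string>> RunAsync(IEnumerable<string> urls)
    {
        var indexed = new List<string>();

        // Placeholder stages: a real crawler would download and parse here.
        var downloader = new TransformBlock<string, string>(url => "content of " + url);
        var htmlParser = new TransformBlock<string, string>(html => html.ToUpperInvariant());
        var indexWriter = new ActionBlock<string>(doc => indexed.Add(doc));

        // With PropagateCompletion set, completing a source also completes
        // its target once the source has handed over all of its output.
        var propagate = new DataflowLinkOptions { PropagateCompletion = true };
        downloader.LinkTo(htmlParser, propagate);
        htmlParser.LinkTo(indexWriter, propagate);

        foreach (var url in urls)
        {
            downloader.Post(url);
        }

        // Complete only the root block; the rest of the chain follows.
        downloader.Complete();

        // Waiting on the last block is enough to know the chain has drained.
        await indexWriter.Completion;
        return indexed;
    }
}
```

Without the option, `indexWriter.Completion` would never finish here, because nothing would ever call `Complete` on the downstream blocks.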
-
Tuesday, August 28, 2012 14:51
I guess what I am struggling to see is how I could detect that I am finished. I need to know that there are no new links to download and that all downloads have finished.
-
Tuesday, August 28, 2012 15:02
This is actually more of a logical problem than a technical one.
Because each download may yield multiple links, it creates a graph (which can be circular).
You can think of a counter that is increased by the link count - 1 for each page and decreased
when a page has no links (you should avoid circular ones). Another direction may be to monitor the
downloader's buffer count; if it stays empty for a while, you may assume that the network has finished executing.
I am sure there are more ideas to think of.
Bnaya Eshet
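
One way the counter idea could be sketched, under stated assumptions: this is not the code from the original sample. It uses a single `ActionBlock` in place of the whole network, a hypothetical in-memory `FakeSite` dictionary in place of real downloads, and a "seen" set to avoid circular links. The counter goes up by one for every newly queued link and down by one when a page is done, which nets out to "link count - 1" per page.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class CrawlCompletionSketch
{
    // Hypothetical in-memory "site": page -> links found on that page.
    // Note the cycle a -> b -> a.
    static readonly Dictionary<string, string[]> FakeSite = new Dictionary<string, string[]>
    {
        ["a"] = new[] { "b", "c" },
        ["b"] = new[] { "a", "d" },
        ["c"] = new string[0],
        ["d"] = new[] { "c" },
    };

    // Crawls from the root and returns the number of distinct pages visited.
    public static async Task<int> CrawlAsync(string root)
    {
        var seen = new ConcurrentDictionary<string, bool>();
        int pending = 0; // pages queued or currently being processed
        ActionBlock<string> downloader = null;

        void Enqueue(string url)
        {
            if (!seen.TryAdd(url, true)) return;  // skip links already queued (breaks cycles)
            Interlocked.Increment(ref pending);   // count the page before it is posted
            downloader.Post(url);
        }

        downloader = new ActionBlock<string>(url =>
        {
            foreach (var link in FakeSite[url])
            {
                Enqueue(link);
            }
            // This page is done. If nothing else is queued or running,
            // no new links can appear, so the crawl is finished.
            if (Interlocked.Decrement(ref pending) == 0)
            {
                downloader.Complete();
            }
        });

        Enqueue(root);
        await downloader.Completion;
        return seen.Count;
    }
}
```

Because a page's own count is released only after all of its links have been queued, the counter cannot reach zero while work is still outstanding. In a multi-block network the same rule would apply, with the decrement placed at the point where a page's links have all been fed back to the downloader.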
-
Tuesday, August 28, 2012 15:04
That's great, it confirms what I was thinking. I will come up with some solution.
Thanks for taking the time to assist. I appreciate it.

