Looking for an architecture or starting point for business problem

  • Question

  • Hello All:

    I am in the initial stages of discovery for a new project and hope someone here can suggest a 'jumping-off' point.  My client is looking to build an on demand web spider type of application to access county courthouse records.  They often need to go to a county court website, look up a subject and then manually transcribe the results of the county site search into their own system.  Obviously, this is a time- and labor-intensive process.

    Like most others, I've built screen-scraping software before but this problem is a little different.  In the past, I've used a template-based approach that defined a format for the site I'm scraping and then specific instructions as to how to retrieve the data required.  This would be fine if the site format could be relied upon to stay consistent or if the domain of possible sites was small.  This is not the case since there are 1660+ county courts now online with more coming.  Maintaining a set of templates isn't likely to be an effective solution as the sites can change often and drastically -- not to mention new sites continually coming on board. 

    Ideally, the customer would like some sort of inference engine that can learn where in the markup of a particular site a given piece of information (the date of birth, for example) is kept.  I've got the basic concept of how I'm going to contact the sites and how parameters are going to be passed/persisted using WF, but my stumbling block comes once I've retrieved the HTML.  The basic idea is that the software should examine the HTML markup and look for 'tags' (i.e. keys that denote certain information is present in the markup).  Once it finds a tag, it should try to resolve a value for it.  For example, in the markup the code will look for a tag named 'Date of Birth' (or dob, birth date, etc.).  Once it finds this tag, it looks through its children to see if any text nodes happen to be dates.  If it doesn't find any matching candidates, it moves to the parent container's siblings, their children, and so on down the tree.  If it resolves the value correctly, the user can tell the system it's correct and the system learns the steps it took to find the text value representing the date of birth.  The next time, it would try to resolve the date of birth value using the past hints and, if it doesn't find it, would repeat the search.  It would do this for each tag the user defines.  The idea is that eventually there need not be a user at all and this could run in the background whenever these records need to be retrieved.
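
    To make the search described above concrete, here is a minimal sketch in Python using only the standard library's html.parser.  It is an illustration under stated assumptions, not a finished design: the label synonyms, the US-style date pattern, and the names Node, TreeBuilder, parse and resolve are all made up for this example.  It finds the first node whose text contains a label, looks for a date in that node's subtree, and widens the search one ancestor at a time.  The number of levels it had to climb is returned as the "hint" that could be stored once the user confirms the value.

```python
import re
from html.parser import HTMLParser

# Assumptions for illustration: label synonyms and a US-style date pattern.
# Note that a bare substring test on "dob" would also match unrelated words.
LABELS = ("date of birth", "birth date", "dob")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


class Node:
    """One element of the parsed page, holding its own direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a small element tree and tolerates sloppy markup."""

    VOID = {"br", "img", "input", "hr", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in self.VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


def walk(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def resolve(root):
    """Return (date_text, levels_up) for the first label found, or None."""
    label = next(
        (n for n in walk(root) if any(l in n.text.lower() for l in LABELS)),
        None,
    )
    scope, levels_up = label, 0
    while scope is not None:
        # Search the current scope; each step up adds siblings and their children.
        for n in walk(scope):
            match = DATE_RE.search(n.text)
            if match:
                return match.group(0), levels_up
        scope, levels_up = scope.parent, levels_up + 1
    return None


page = (
    "<table><tr><td><b>Date of Birth</b></td><td>01/02/1970</td></tr>"
    "<tr><td>Filed</td><td>03/04/2009</td></tr></table>"
)
print(resolve(parse(page)))  # ('01/02/1970', 2)
```

    One known weakness of this sketch: when the search widens, it takes the first date in document order inside the larger scope, which may be a date that sits before the label rather than the one belonging to it.  A real implementation would need a better notion of "nearest" than this.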

    Does anyone have any ideas where to look for more information on something like this?  I've examined some of the AI articles online but they seem to solve very domain-specific problems or I'm not getting the gist of what they are saying.

    Any help is greatly appreciated.

    Wednesday, May 5, 2010 6:51 PM

All replies

  • Greetings,

    You wrote: "My client is looking to build an on demand web spider type of application to access county courthouse records."


    In this scenario you need a very good database design. Suppose the user enters a person's date of birth: there has to be business logic to retrieve the corresponding records from the database. How the information is displayed mainly depends on how you design the screen.

    A possible suggestion: build this application on a database-driven architecture.

    Take Care



    Helping People To Solve Technical Problems
    Thursday, May 6, 2010 3:47 PM
  • I agree with PL. You could also consider using BizTalk Server for integration. It has a built-in adapter for handling HTTP requests as well. There is plenty of documentation about what BizTalk is and how to use it. It comes at a cost, though, which you might need to consider.


    Thursday, May 6, 2010 4:03 PM
  • I'd start managing the client's expectations as early as possible, mate.

    Even if it is possible to get the thing to learn some logic on how to find specific fields then I reckon it'll be very fragile.  If you can't trust it over 90% then there's no point in totally automating it because the users won't trust the results.

    So I'd focus on delivering something that showed two screens side by side - the page found and the results form.

    Good luck!!

    Thursday, May 6, 2010 5:44 PM
  • Thanks Andy -- and I do agree that showing the two screens in a side-by-side display is a good idea.  In fact, that is what the client envisions as the 'learning' portion of the system -- with the page open, the user looks at what was returned by the auto parse and decides if it is correct or not.  If so, the field is 'learned' for that particular url.




    Thursday, May 6, 2010 8:57 PM
  • Not too sure what you are referring to here...the subject details that the user can enter are already being captured today.  This web spider would need to crawl a remote site not affiliated with my customer and parse the remote site's markup to match (or confirm) that the date of birth is the same and the civil case in question belongs to the search subject.  With the exception of the training UI, the idea is that the application should eventually be interface-less and just run as a service once fully trained.

    I might not be understanding what you are suggesting.




    Thursday, May 6, 2010 9:05 PM
  • Thanks Shival, but I'd say BizTalk is off the table for this project.  In part, the county websites don't offer integration -- I'd say they actively discourage it :).  I know BizTalk does more than just system-to-system integration, but I don't feel the customer will shoulder the licensing cost when making the HTTP request certainly isn't the challenging part of the project.  The difficulty comes in extracting the results once the request completes.




    Thursday, May 6, 2010 9:11 PM
  • I would have thought that the database design was the least of your problems here.

    Maybe some sort of heuristic approach could be applied.  Such an approach is used with robots for sure.  I think also maybe container packing and some computer games.  My concern would be the time and steep learning curve involved in learning a completely new approach to programme design.

    Your outlined approach looks reasonable.

    Friday, May 7, 2010 9:54 AM
  • @Andy -- Yes, I agree... the db design is the trivial part of this thing.  A heuristic-based approach is what the client is advocating and, as you've said, it is a new approach (for me anyway) with regard to writing business software.  It is worth mentioning that there are 3-4 other companies doing this somehow.  My client is in a different market from them so they won't be competing.  I originally suggested using these other companies' services to accomplish this goal (they sell this service as a product) but management does not feel it is cost-effective.  They run several hundred searches a day and the cost is $4-$5 per search if purchased through the service.  Their labor cost now is still cheaper per search (meaning: an employee going to the site and transcribing the info costs less per search than the service cost).




    Friday, May 7, 2010 12:30 PM
  • Yep.  I do business systems myself.  Intellectually I know roughly how the heuristic thing works.  It keeps on doing the try-and-learn thing until you say "yes, that's right", and then you weight however it got to that decision.  Beyond that, it's all Greek.
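
    For what it's worth, the "weight however it got that decision" part can be sketched quite simply.  This is only a rough illustration in Python, not how any real product does it: the class name StrategyWeights, the strategy names and the reward/penalty values are all invented for the example.  Each field keeps a score per lookup strategy, a confirmation raises the score, a rejection lowers it, and the next attempt tries the strategies in score order.

```python
from collections import defaultdict


class StrategyWeights:
    """Per-field tally of which lookup strategy the user confirmed."""

    def __init__(self):
        # weights[field][strategy] -> score, defaulting to 0.0
        self.weights = defaultdict(lambda: defaultdict(float))

    def ranked(self, field, strategies):
        # Highest score first; sorted() is stable, so ties keep the given order.
        scores = self.weights[field]
        return sorted(strategies, key=lambda s: -scores[s])

    def confirm(self, field, strategy, reward=1.0):
        # The user said the value found by this strategy was right.
        self.weights[field][strategy] += reward

    def reject(self, field, strategy, penalty=0.5):
        # The user said the value found by this strategy was wrong.
        self.weights[field][strategy] -= penalty


# Hypothetical strategy names, tried in this order before any feedback exists.
strategies = ["same_node", "next_sibling", "parent_subtree"]

weights = StrategyWeights()
weights.reject("dob", "same_node")
weights.confirm("dob", "parent_subtree")
print(weights.ranked("dob", strategies))
# ['parent_subtree', 'next_sibling', 'same_node']
```

    Whether plain additive scores like these are good enough is an open question; they are just the simplest thing that matches the description.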

    Quite how you find someone who is a heuristic screen scraping expert I dunno, seems pretty specialist to me.  Maybe some sort of consultancy advice.

    Some university lecturers do consultancy work on the side.  Maybe it's worth a couple of phone calls. 

    Have you tried games dev forums?

    Failing that, it occurs to me that people must work in development for these not-quite-competing companies and maybe one of them would like to earn some weekend money.  They probably sign non-disclosure agreements and suchlike, so I'm not suggesting outright piracy.  This seems the sort of thing where an hour's worth of knowledgeable advice could go a long way.

    That's all I've got mate. 

    Good luck!

    Friday, May 7, 2010 2:04 PM