Best way to parse HTML table into XML

  • Question

  • I would like to extract the data elements from tables within HTML pages. The output should be an XML file. What is the best way to do that? I am using VB.NET 3.5.
    Wednesday, February 10, 2010 9:24 PM

Answers

  • You can try regular expressions to extract the data from the HTML source and then write it out as XML.

    Also, the XmlDocument class from the .NET Framework can read HTML documents, provided the HTML is well-formed XML.

    The article linked below extracts tables from an HTML page and stores the data in a DataSet using regular expressions. You can adapt the method to your own needs by changing the DataSet to an XmlDocument, or by writing the DataSet directly to an XML file.

    Extract Tables from HTML page and store it in data set using Regular Expressions


    The code at the above link is C#, but if you find it difficult to translate to VB.NET, you can use the online code converter linked below to do it for you.

    kaymaf


    If that is what you want, take it. If not, ignore it and don't complain.

    CODE CONVERTER SITE: http://www.carlosag.net/Tools/CodeTranslator/

    Wednesday, February 10, 2010 10:30 PM
  • Because I was unwilling to use regular expressions (having been warned that when you parse HTML with them "the unholy child weeps the blood of virgins") and because I haven't yet explored XSLT, I just sat down and wrote .NET code that extracts the table HTML from the page. It took a while, but it works, and now I have what I need. Nothing wrong with .NET code; it has a lot of string-parsing power built in! I do plan to explore XSLT; I just couldn't do it right away since I was under a tight deadline.
    • Marked as answer by Gary Frank Sunday, February 28, 2010 6:00 PM
    Sunday, February 28, 2010 6:00 PM
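
As a rough illustration of the route the accepted answer describes (ordinary parsing code instead of regular expressions), here is a minimal sketch. It is written in Python only to keep it short and runnable, not in the VB.NET the thread uses, and it is not the poster's actual code. The names TableExtractor and tables_to_xml, and the output element names tables, table, row, and cell, are assumptions made for this example. It also assumes that tables are not nested.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableExtractor(HTMLParser):
    """Collects the text of every table cell, grouped by table and row.

    Hypothetical helper for illustration; assumes tables are not nested.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row a list of strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._close_row()      # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()     # tolerate a missing </td> or </th>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None and self.tables:
            self.tables[-1].append(self._row)
        self._row = None


def tables_to_xml(html):
    """Return an XML string holding the cell text of every table in html."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    root = ET.Element("tables")
    for rows in parser.tables:
        table_el = ET.SubElement(root, "table")
        for cells in rows:
            row_el = ET.SubElement(table_el, "row")
            for text in cells:
                ET.SubElement(row_el, "cell").text = text
    return ET.tostring(root, encoding="unicode")


# Sloppy markup on purpose: the header cells are never closed.
sample = "<table><tr><th>Name<th>Qty</tr><tr><td>Apple</td><td>3</td></tr></table>"
print(tables_to_xml(sample))
```

The point of the design is that an event-based HTML parser tolerates unclosed tags that would trip up both a regular expression and a strict XML parser. In .NET the same shape would need an HTML-tolerant parser or hand-written string handling, as the poster chose.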

All replies

  • try here please:

    http://www.25hoursaday.com/StoringAndQueryingXML.html
    Just Be Humble Malange!
    Wednesday, February 10, 2010 9:34 PM
  • I'm a bit reluctant to use regular expressions. 
    I read the following at the website http://www.mail-archive.com/debian-user@lists.debian.org/msg564629.html

    "Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp."

    The site claims that "Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML".

    So I'm thinking it would be better to use stable, tested, robust .NET code to do it -- if somebody has already written some, so that I don't have to. Has anyone done this? Care to share it?
    Thursday, February 11, 2010 2:21 PM
  • As kaymaf said, you could use Xml.XmlDocument if the HTML is XML compliant. If not, you can try Windows.Forms.HtmlDocument, but HtmlDocument is much harder to use.


    Bill Gates look out!
    Thursday, February 11, 2010 2:33 PM
  • Try this without regular expressions:
    http://blogs.techrepublic.com.com/programming-and-development/?p=2265&tag=nl.e055

    kaymaf

    If that is what you want, take it. If not, ignore it and don't complain.

    CODE CONVERTER SITES:

    http://www.carlosag.net/Tools/CodeTranslator/

    http://www.developerfusion.com/tools/convert/csharp-to-vb/

    Wednesday, February 24, 2010 12:42 AM
  • You asked for the best way:

    http://msdn.microsoft.com/en-us/library/ms256069.aspx

    Be aware that it is not my preferred way; I would go for the DOM (the document classes already described in this thread).


    Success
    Cor
    Wednesday, February 24, 2010 6:12 AM
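
The DOM route mentioned in this thread (loading the page with an XML parser, which only works when the HTML is XML compliant) can be sketched roughly as follows. This is again Python rather than VB.NET, chosen only so the example is short and runnable. The function names local_name and xhtml_tables_to_xml and the output element names are assumptions made for this example. It assumes the tables are not nested and that the page uses no HTML-only entities such as &nbsp;, which a plain XML parser rejects.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1]


def xhtml_tables_to_xml(xhtml):
    """Copy the cell text of every table in well-formed XHTML into new XML.

    Raises xml.etree.ElementTree.ParseError when the markup is not
    well-formed XML, which is the limitation discussed in the thread.
    """
    doc = ET.fromstring(xhtml)
    root = ET.Element("tables")
    for table in doc.iter():
        if local_name(table.tag) != "table":
            continue
        table_el = ET.SubElement(root, "table")
        for tr in table.iter():
            if local_name(tr.tag) != "tr":
                continue
            row_el = ET.SubElement(table_el, "row")
            for cell in tr:
                if local_name(cell.tag) in ("td", "th"):
                    # itertext() also picks up text inside nested tags like <b>.
                    text = "".join(cell.itertext())
                    ET.SubElement(row_el, "cell").text = " ".join(text.split())
    return ET.tostring(root, encoding="unicode")


well_formed = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<table><tr><td>A</td><td><b>B</b> c</td></tr></table>"
    "</body></html>"
)
print(xhtml_tables_to_xml(well_formed))

# Typical real-world HTML with unclosed cells is rejected outright.
try:
    xhtml_tables_to_xml("<table><tr><td>A<td>B</tr></table>")
except ET.ParseError as err:
    print("not XML compliant:", err)
```

The second call shows why this route is only an option for XML-compliant pages: the parser refuses the document instead of guessing at the missing closing tags.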