parse and extract text from HTML webpage

    Question

  • Hello there

    I am looking for an efficient and easy way (open-source tools) for a C# developer to parse an HTML web page and extract its content as free text. The page's elements and content change from time to time. I am not sure, but I think regex would require a lot of code and skill. I searched the internet and found a tool called C# HTML parser (.NET), but I do not know whether it would help me parse the page and extract it into a text structure. Can you help me out and provide some references?

    thanks

     

    Friday, October 15, 2010 2:39 PM

Answers

  • No one told you to browse the page. You can instantiate a WebBrowser just for parsing the HTML. The following writes all the text from an HTML document. I put the HTML inline in the code, but it is not difficult to load it from a file or a URI instead.

    // WebBrowser lives in System.Windows.Forms and must run on an STA thread.
    WebBrowser browser = new WebBrowser();
    browser.DocumentText = ""; // Creates the empty document.
    HtmlDocument doc = browser.Document;
    doc.OpenNew(false); // Clears the document so new content can be written.
    doc.Write("<html><body><p><span>This</span> is a sample document.<p>Some tags are not closed.</p>");
    Console.WriteLine(doc.Body.InnerText); // Text only, tags stripped.
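
The snippet above needs a single-threaded apartment (STA) thread, which a console application or a background worker does not have by default. Below is a minimal sketch of one way to wrap the same calls so they can be used outside a form. The names `HtmlToText` and `Extract` are made up for illustration, and the sketch assumes that `browser.Document` is available right after the empty document is created, as in the snippet above.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

static class HtmlToText // hypothetical helper name, not part of any library
{
    // Runs the WebBrowser-based extraction on its own STA thread and
    // returns the text of the document body.
    public static string Extract(string html)
    {
        string text = null;
        Exception error = null;

        var worker = new Thread(() =>
        {
            try
            {
                using (var browser = new WebBrowser())
                {
                    browser.ScriptErrorsSuppressed = true;
                    browser.DocumentText = ""; // Creates the empty document.

                    HtmlDocument doc = browser.Document;
                    if (doc == null)
                        throw new InvalidOperationException("No document was created.");

                    doc.OpenNew(false);
                    doc.Write(html);
                    text = doc.Body == null ? "" : doc.Body.InnerText;
                }
            }
            catch (Exception ex)
            {
                error = ex; // Rethrown on the calling thread below.
            }
        });

        worker.SetApartmentState(ApartmentState.STA); // WebBrowser requires STA.
        worker.Start();
        worker.Join();

        if (error != null)
            throw new InvalidOperationException("HTML parsing failed.", error);
        return text;
    }

    static void Main()
    {
        Console.WriteLine(Extract("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

This only runs on Windows with the System.Windows.Forms assembly referenced. Creating one thread per call is simple but not cheap, so for many pages it may be better to reuse one STA thread.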
    • Marked as answer by sager79 Friday, October 15, 2010 9:47 PM
    Friday, October 15, 2010 4:07 PM

All replies

  • On 15.10.2010 at 16:39, Elsamelghi wrote:


    Hello there

    I am looking for an efficient and easy way (open-source tools)
    for a C# developer to parse an HTML web page and extract its content as free text.
    The page's elements and content change from time to time.
    I am not sure, but I think regex would require a lot of code and skill.
    I searched the internet and found a tool called C# HTML parser (.NET),
    but I do not know whether it would help me parse the page and extract it into a text structure. Can you help me out and provide some references?

    Loading the HTML into a WebBrowser control
    and reading Document.Body.InnerText is probably
    the most straightforward way. MSHTML will do all
    the parsing for you.

    Chris

    Friday, October 15, 2010 3:00 PM
  • Hi Chris

    I do not need to load the HTML document into a WebBrowser. I just need to extract the web page's information as text, and I do not know in advance what kind of text or HTML elements the page contains. I am looking for a tool that cleans the HTML text of tags and deletes hyperlink text.

    I am trying to extract information from HTML and do some work on it (NLP). The issue is that the content on the web page changes from time to time, including the structure of the page itself.

    Any ideas? Are there any tools that can be used with C# to do this task for me? What about the C# HTML parser (.NET) created by Majestic12: can this tool do the job, and does it work well in a C# project?

    thanks

    Friday, October 15, 2010 3:24 PM
  • Why do you not want to load the document in a WebBrowser control? You can then navigate in the document easily.
    Friday, October 15, 2010 3:40 PM
  • Hi Louise

    I am not looking to extract information (text) from HTML while browsing the actual web page.

    Friday, October 15, 2010 3:53 PM
  • You probably mean the HTML Agility Pack

    http://htmlagilitypack.codeplex.com/
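
A minimal sketch of how the HTML Agility Pack could cover the tag stripping and hyperlink-text removal asked about earlier in the thread. It assumes the third-party `HtmlAgilityPack` package is referenced in the project. The names `PlainText` and `FromHtml` are made up for illustration, and dropping every `<a>` element is just one reading of "delete hyperlink text".

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party package, assumed to be referenced

static class PlainText // hypothetical helper name, not part of the library
{
    // Returns the text of an HTML string without scripts, styles or link text.
    public static string FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // The parser tolerates unclosed tags.

        // SelectNodes returns null, not an empty list, when nothing matches.
        HtmlNodeCollection unwanted =
            doc.DocumentNode.SelectNodes("//script|//style|//a");
        if (unwanted != null)
        {
            foreach (HtmlNode node in unwanted)
                node.Remove();
        }

        // InnerText leaves entities such as &amp; encoded, so decode them,
        // then collapse runs of whitespace into single spaces.
        string text = WebUtility.HtmlDecode(doc.DocumentNode.InnerText);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    static void Main()
    {
        string html = "<p>Read <a href='x'>this link</a> now &amp; then.</p>";
        Console.WriteLine(FromHtml(html));
    }
}
```

Because the nodes are picked by tag name and not by position, the sketch does not depend on the layout of a particular page, which matters when the page structure changes from time to time.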


    Good luck
    Cor
    Friday, October 15, 2010 4:28 PM
  • Actually, if the original poster wants the text of the whole document, they can get it like this:

    // Requires a reference to the Microsoft.mshtml interop assembly.
    WebBrowser browser = new WebBrowser(); // This is what you have
    HtmlDocument doc = browser.Document;   // This gives you the browser contents (once a document has loaded)
    // DomDocument exposes the underlying MSHTML COM object.
    string content =
        ((mshtml.IHTMLDocument3)doc.DomDocument).documentElement.innerText;

    Thursday, October 21, 2010 4:26 PM