How to read specific text from html

  • Question

  • Dear Friends,

    I want to read some specific text from an HTML page.

    For example:

    <h1 class="Review"></h1>
    <div class="customer_cmts" style="overflow-x: hidden; overflow-y: scroll; height: 147px;">
    <ul> <li style="margin-bottom:8px;"><a href="/tripbox/reviews/travel.aspx" class="more">It was an nice experience fly with ethihad airways<br />thanks alon k larry</a>
    </li>
    </ul>
    </div>

    From this HTML, I want to read the text inside the div and load it into a database for customer analysis.

    So far I have managed to get the whole page into a single string variable, but I have no idea how to read the text from the div.

    My code:

     

    string sURL3 = "http://siteurl/";
    string sResp = "";

    try
    {
        // Request the page and read the whole response body into sResp
        HttpWebRequest oWebReq = (HttpWebRequest)WebRequest.Create(sURL3);
        HttpWebResponse oWebResp = (HttpWebResponse)oWebReq.GetResponse();
        StreamReader oStream = new StreamReader(oWebResp.GetResponseStream(), System.Text.Encoding.ASCII);
        sResp = oStream.ReadToEnd();

        if (sResp.Length > 0)
        {
            string res = sResp.ToString();
        }
    }
    catch (WebException)
    {
        // The request failed; sResp stays empty
    }


    - Thiru
    Wednesday, October 12, 2011 9:42 AM

Answers

  • For HTML parsing you can use HTMLDocument (one of the IE components).
    Here is a sample:

    using System;
    using System.Runtime.InteropServices;
    using System.Windows.Forms;
    
    namespace WindowsFormsApplication1
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                var html = @"<h1 class='Review'></h1><div class='customer_cmts' style='overflow-x: hidden; overflow-y: scroll; height: 147px;'>
                    <ul> <li style='margin-bottom:8px;'><a href='/tripbox/reviews/travel.aspx' class='more'>It was an nice experience 
                    fly with ethihad airways<br />thanks alon k larry</a></li></ul></div>";
                var txt = Helper.GetCustomerText(html);
                System.Diagnostics.Trace.WriteLine(txt);
            }
        }
    
        public class Helper
        {
            [ComImport, Guid("25336920-03F9-11CF-8FD0-00AA00686F13")]
            class HTMLDocument { }
    
            public static string GetCustomerText(string html)
            {
                dynamic doc = new HTMLDocument();
                doc.write(html);
                doc.close();
                dynamic  tags = doc.getElementsByTagName("DIV");
                for(var i=0; i < (int) tags.length; i++)
                {
                    dynamic tag = tags[i];
                    if(string.Equals(tag.className, "customer_cmts", StringComparison.OrdinalIgnoreCase))
                        return System.Text.RegularExpressions.Regex.Replace((string) tag.innerText, "\\s{2,}", " ");
                }
                return null;
            }
        }
    }
    
    

     

    • Proposed as answer by Malobukv Wednesday, October 12, 2011 10:57 AM
    • Marked as answer by Allen_MSDN Monday, October 17, 2011 1:21 AM
    Wednesday, October 12, 2011 10:56 AM

All replies

  • An important thing to keep in mind with this kind of parsing is that you need to make sure you get valid HTML from the response. If you are sure that your HTML will be well formed, you can use a regular expression to extract a specific pattern from it.
    If you cannot find a suitable expression, I would suggest using the WebBrowser control. Once the page body is loaded in the WebBrowser control, you can easily find the HTML nodes you need and do whatever you want with them.
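
    As a rough sketch of the regular-expression route (not code from the original poster): the class name RegexSketch and the method ExtractCommentText are hypothetical, and the pattern assumes the div carries class customer_cmts exactly as in the question and contains no nested div. On arbitrary real-world HTML a proper parser is likely to be more robust.

    ```csharp
    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    public static class RegexSketch
    {
        // Hypothetical helper: returns the plain text of the first
        // <div class="customer_cmts"> in html, or null if none is found.
        // Assumption: the target div contains no nested <div>.
        public static string ExtractCommentText(string html)
        {
            Match match = Regex.Match(
                html,
                "<div[^>]*class=[\"']customer_cmts[\"'][^>]*>(.*?)</div>",
                RegexOptions.IgnoreCase | RegexOptions.Singleline);
            if (!match.Success)
                return null;

            // Treat <br> as a word separator, then drop all remaining tags
            string inner = Regex.Replace(match.Groups[1].Value, "<br\\s*/?>", " ", RegexOptions.IgnoreCase);
            inner = Regex.Replace(inner, "<[^>]+>", string.Empty);

            // Decode entities such as &amp; and collapse runs of whitespace
            inner = WebUtility.HtmlDecode(inner);
            return Regex.Replace(inner, "\\s+", " ").Trim();
        }
    }
    ```

    The string read from the response (sResp in the question) could be passed straight to ExtractCommentText.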

     

    Hope this will help you a bit...


    Sumit Kumar
    Wednesday, October 12, 2011 9:54 AM
  • There is no particular way of reading it, but you can use Substring to accomplish this.

    However, to use Substring you must be very sure about the tags, i.e. that the div tag will always be exactly:

    <div class="customer_cmts" style="overflow-x: hidden; overflow-y: scroll; height: 147px;">
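
    A minimal sketch of that Substring idea, assuming the opening tag always starts with the exact text shown above and that the div contains no nested div. The names SubstringSketch and ExtractDivHtml are made up for illustration. It returns the inner HTML, so the tags inside the div still have to be stripped afterwards.

    ```csharp
    using System;

    public static class SubstringSketch
    {
        // Hypothetical helper: returns the markup between the opening
        // customer_cmts div tag and the next </div>, or null if not found.
        public static string ExtractDivHtml(string html)
        {
            const string openTag = "<div class=\"customer_cmts\"";

            int start = html.IndexOf(openTag, StringComparison.OrdinalIgnoreCase);
            if (start < 0)
                return null;

            // Skip past the end of the opening tag (style attribute etc.)
            int contentStart = html.IndexOf('>', start);
            if (contentStart < 0)
                return null;
            contentStart++;

            // Assumption: the first </div> after the opening tag closes it
            int end = html.IndexOf("</div>", contentStart, StringComparison.OrdinalIgnoreCase);
            if (end < 0)
                return null;

            return html.Substring(contentStart, end - contentStart);
        }
    }
    ```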
    
    

     


    --------------------------------------------------------

    Surender Singh Bhadauria

     

    Wednesday, October 12, 2011 9:54 AM
  • Hello Thiru

    See this code. This is the HTML side:

    <body>
        <form id="form1" runat="server"> 
       <div id="divbody" runat="server" >
        Hello Friends this is testing
      </div>
        </form>
    </body>
    

    Get the value in the code-behind:

    protected void Page_Load(object sender, EventArgs e)
    {
        // divbody is the server-side div declared with runat="server"
        string divvalue = divbody.InnerHtml;
        Response.Write("This is From Response.Write:  " + divvalue);
    }
    
    




    Wednesday, October 12, 2011 11:03 AM
  • Use HTML Agility Pack.

    What is exactly the Html Agility Pack (HAP)?

    This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
    Link for download and more information: http://htmlagilitypack.codeplex.com/


    -Jai
    Wednesday, October 12, 2011 11:24 AM
  • For HTML parsing, I am using MILHTMLParser (find it on CodeProject: http://www.codeproject.com/KB/dotnet/apmilhtml.aspx).

    I found it very helpful after testing a lot of HTML parsers.

    Here is some simple code that parses a web page, puts it in a tree control, and looks for all HTML tags with class "p":

     

            
    
            private void ProcessHTML(string html)
            {
    
                tvwDOM.Nodes.Clear();
    
                mDocument = MIL.Html.HtmlDocument.Create(html, false);
    
                BuildTree(mDocument.Nodes, tvwDOM.Nodes);
    
            }
    
            private void SimpleUse(string ServerURL,string login, string password)
            {
                
                string SourceHTML = "";
    
                WebClient WebClient = new WebClient();
                WebClient.Credentials = new System.Net.NetworkCredential(login, password);
                SourceHTML = WebClient.DownloadString(ServerURL);
    
                ProcessHTML(SourceHTML);
    
    
                foreach (HtmlNode node in mDocument.Nodes.FindByAttributeNameValue("class", "p"))
                {
                    if ((node) is MIL.Html.HtmlElement)
                    {
                     ......                
     
                    }
                }
    
           }
    
    


     

    Wednesday, October 12, 2011 2:30 PM
  • Thirusen,

    You can use HTML Agility Pack (you can get it from this link: http://htmlagilitypack.codeplex.com/). Add its DLL to your project and try the code given below:

    objRequest = WebRequest.Create(listItemUrl);
    objRequest.Credentials = CredentialCache.DefaultCredentials;
    objResponse = objRequest.GetResponse();

    string strResult = "";
    using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
    {
        strResult = sr.ReadToEnd();
        // Close and clean up the StreamReader
        sr.Close();
    }

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(strResult);
    strResult = doc.DocumentNode.SelectSingleNode("//div[@class='customer_cmts']").InnerHtml;
    return strResult;
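
    Since the question asks for the text rather than the markup, here is a possible variation that reads InnerText instead of InnerHtml and guards against the div being missing (SelectSingleNode returns null when nothing matches). This is only a sketch: the CommentReader class and GetCommentText method are hypothetical names, it needs the HTML Agility Pack reference, and the XPath assumes the class attribute is exactly customer_cmts.

    ```csharp
    using System;
    using System.Text.RegularExpressions;
    using HtmlAgilityPack;

    public static class CommentReader
    {
        // Hypothetical helper, not part of the HTML Agility Pack itself.
        public static string GetCommentText(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@class='customer_cmts']");
            if (div == null)
                return null; // no matching div in this page

            // Replace each <br> with a space so adjacent lines do not run together
            HtmlNodeCollection breaks = div.SelectNodes(".//br");
            if (breaks != null)
            {
                foreach (HtmlNode br in breaks)
                    br.ParentNode.ReplaceChild(doc.CreateTextNode(" "), br);
            }

            // Decode entities and collapse whitespace left over from the markup
            string text = HtmlEntity.DeEntitize(div.InnerText);
            return Regex.Replace(text, "\\s+", " ").Trim();
        }
    }
    ```

    The returned string is what would then be stored in the database.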





    AshitaP

    • Proposed as answer by AshitaP Wednesday, March 21, 2012 6:45 PM
    Wednesday, March 21, 2012 6:45 PM