HTML decoding

  • Question

  • How do I decode HTML into a normal string?

    I'm using the HttpUtility.HtmlDecode() method, but it is giving me a string that still contains HTML tags.


    Regards Kumar Gaurav.
    Tuesday, November 22, 2011 7:31 PM

All replies

  • To many people, HTML is a normal string.  Can you clarify what you mean by a "normal" string?  For example, do you mean the rendered page?  Some of the text rendered by the HTML comes from attributes, while other text is the inner text of an HTML element.

    --
    Mike
    Tuesday, November 22, 2011 7:48 PM
  • > How to decode the html into normal string? I'm using HttpUtility.HTMLdecode() method, but it is giving me string with html tags

    The following method works even for invalid HTML:

    using System.Windows.Forms;
    ...
    var txt = GetText("<b><div>text1<i>text2<ul><li>item1<li>item2");
    ...
    // Must run on an STA thread (for example the UI thread of a Windows Forms app),
    // because WebBrowser wraps the Internet Explorer ActiveX control.
    string GetText(string html)
    {
        using (var wb = new WebBrowser())
        {
            wb.DocumentText = html;
            // Pump window messages until the control has finished loading the document.
            while (wb.ReadyState != WebBrowserReadyState.Complete)
                Application.DoEvents();
            return wb.Document.Body.InnerText;
        }
    }
    

    Tuesday, November 22, 2011 7:51 PM
  • Hi urprob,

    I agree with Mike: I could not find anything wrong with the encoded HTML. Please try the following code. If the result is not what you want, please show us your desired format.

    using (StreamWriter sw = new StreamWriter(@"C:\EncodedHtml.txt"))
    using (WebResponse response = WebRequest.Create("http://windows.live.com").GetResponse())
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        // HtmlEncode escapes markup characters ('<' becomes "&lt;"); it does not remove tags.
        sw.Write(HttpUtility.HtmlEncode(sr.ReadToEnd()));
    }

    Or do you want to get the text content of the HTML page, as Malobukv suggested? If so, I think that is a completely different matter from HTML encoding.

    Have a nice day,
    Leo Liu [MSFT]
    MSDN Community Support | Feedback to us
    Wednesday, November 23, 2011 7:24 AM
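
A small sketch of why encoding and decoding alone leave the tags in place. It assumes .NET 4 or later, where System.Net.WebUtility is available; the class name EncodingDemo and its two methods are made up for illustration. HtmlEncode and HtmlDecode only convert between markup characters and entities such as &lt; and neither of them removes elements.

```csharp
using System.Net;

public static class EncodingDemo
{
    // Escapes markup characters so the text can be displayed literally in a page.
    public static string Encode(string html)
    {
        return WebUtility.HtmlEncode(html);
    }

    // Reverses the escaping; tags come back as tags, they are not stripped.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

With this sketch, EncodingDemo.Encode("<div>Hello</div>") gives &lt;div&gt;Hello&lt;/div&gt;, and decoding that string returns the original markup, tags included.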
  • > the following method works even for invalid html [...]

    It is giving me an error about some ActiveX control.

    Regards Kumar Gaurav.
    Wednesday, November 23, 2011 4:19 PM
  • > I agree with Mike that I could not figure out anything wrong with the encoded html code, please check out the following code. [...]

    It's not working, sir; all the HTML characters are still there.

    I'm making a feed reader. I read the feed using XmlDocument and then read the InnerText of the "description" element, which is in HTML format. I want to convert it into a normal string.

    For example, if I get this in the description:

    <div>Hello</div>

    then I want just "Hello" out of it.

    I hope you understand my problem.


    Regards Kumar Gaurav.
    Wednesday, November 23, 2011 4:21 PM
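
A minimal sketch of the feed-reading step described above, assuming an RSS 2.0 layout (rss/channel/item/description). The FeedDemo class and FirstDescription method are hypothetical names, not taken from the thread. It illustrates that the XML parser has already unescaped the entities by the time InnerText is read, which is probably why HtmlDecode appears to change nothing: the description is real markup at that point, and the tags have to be removed by an HTML parser or similar.

```csharp
using System.Xml;

public static class FeedDemo
{
    // Hypothetical helper: returns the raw description of the first <item>,
    // or null when the feed has no such element.
    public static string FirstDescription(string rssXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(rssXml);
        XmlNode node = doc.SelectSingleNode("/rss/channel/item/description");
        // InnerText is already XML-unescaped, so HTML tags show up as real tags.
        return node == null ? null : node.InnerText;
    }
}
```

For a feed whose description element contains &lt;div&gt;Hello&lt;/div&gt;, FirstDescription returns <div>Hello</div>.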
  • Sorry Kumar, I did not realize you had asked this in both the RSS thread and this thread.  I understand your question better now.

    If the HTML is simple, such as the single div you have shown, then regular expressions may be sufficient.  I suspect, though, that the HTML in many feeds is complex.  I have heard of many people using something called the HTML Agility Pack (http://htmlagilitypack.codeplex.com/).  I cannot vouch for it as I have not used it, but it looks like it may help you if the HTML is somewhat complex.

    --
    Mike
    • Marked as answer by Leo Liu - MSFT Tuesday, November 29, 2011 2:04 AM
    Thursday, November 24, 2011 12:14 AM
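
A minimal sketch of the regular-expression approach Mike mentions, for simple HTML only. It assumes .NET 4 or later for System.Net.WebUtility; SimpleHtmlText and StripTags are hypothetical names. The pattern treats anything between < and > as a tag, so it will misbehave on comments, script or style content, and attribute values that contain >. For that kind of input a real HTML parser is the safer choice.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class SimpleHtmlText
{
    // Hypothetical helper: removes tags from simple, well-behaved HTML.
    public static string StripTags(string html)
    {
        // Drop everything that looks like a tag: '<', any non-'>' characters, '>'.
        string withoutTags = Regex.Replace(html, "<[^>]*>", string.Empty);
        // Turn entities such as &amp; back into plain characters, then trim.
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }
}
```

SimpleHtmlText.StripTags("<div>Hello</div>") returns "Hello", which is the result asked for in the question.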
  • > It is giving me error of some active x control

    You didn't say that you have ActiveX in the HTML. OK, try the following code.
    If that does not help, cut all of the <object> tags out of the HTML before calling GetText.

    using System.Runtime.InteropServices;
    ...
    var txt = Html.GetText("<b><div>text1<i>text2<ul><li>item1<li>item2");
    ...
    public class Html
    {
        // CLSID of the MSHTML HTMLDocument COM class, the parser used by Internet Explorer.
        [ComImport, Guid("25336920-03F9-11CF-8FD0-00AA00686F13")]
        class HTMLDocument { }

        public static string GetText(string html)
        {
            // dynamic (C# 4) makes the calls below late-bound COM calls.
            dynamic doc = new HTMLDocument();
            doc.write(html);
            doc.close();
            return doc.body.innerText;
        }
    }

    • Marked as answer by Leo Liu - MSFT Tuesday, November 29, 2011 2:04 AM
    Thursday, November 24, 2011 3:29 AM
  • > if i get in description <div>Hello</div> then i want just "Hello" out of it.

    So this is a completely different matter from HTML encoding.
    As Mike suggested, the HTML Agility Pack library is really powerful and easy to use. I have been using it for some time.
    Below is a code snippet I wrote to meet your requirement. Don't forget to download the library and add a reference to it.

    using System.Windows.Forms;
    using HAP = HtmlAgilityPack;
    ...
    HAP.HtmlDocument htmlDoc = new HAP.HtmlDocument();
    htmlDoc.LoadHtml("<div id=\"wrapper\" onclick=\"history.go(-1);\">inner text.</div>");

    // InnerText of the document node returns the text without any tags.
    MessageBox.Show(htmlDoc.DocumentNode.InnerText);

    Have a nice day,

    Leo Liu [MSFT]
    MSDN Community Support | Feedback to us
    • Marked as answer by Leo Liu - MSFT Tuesday, November 29, 2011 2:04 AM
    Thursday, November 24, 2011 6:21 AM
  • > the HTML Agility Pack library is really powerful and easy to use. [...] don't forget to download the library and add reference to it.


    HTML Agility Pack is a great library, but it's redundant.
    As mentioned above, HTML parsing can be done using HTMLDocument (a component of Internet Explorer).

    Thursday, November 24, 2011 7:11 AM
  • But AFAIK, if we want to use the HtmlDocument component, it must be accompanied by a WebBrowser component.
    Please point out if that is incorrect, or could you revise the earlier code to use the HtmlDocument component?
    Thanks.


    Leo Liu [MSFT]
    MSDN Community Support | Feedback to us
    Thursday, November 24, 2011 8:00 AM