Issue reading text from HTML page

  • Question

  • Hi all, I am reading a string from an HTML file. In the HTML it appears as 2x2&nbsp;s, so I understand that it contains a non-breaking space. After reading it into my C# app, how do I check whether the string contains &nbsp; or not? I have tried a couple of approaches, but none of them worked; they cannot even detect that the string contains &nbsp;. When I compare "2x2 s" with the string I received from the HTML, which also displays as "2x2 s", the comparison fails. Any ideas?

     if(strText.Contains("\\u00A0"))
           strText = Regex.Replace(strText, "\u00A0", " ");

    Thanks


    Rupesh Shukla


    Monday, December 31, 2018 4:39 AM

All replies

  • Have you tried “\u00A0” instead of “\\u00A0”?
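
    A minimal sketch of the difference between the two literals, assuming the string really does contain U+00A0. The class name, helper names, and sample text are hypothetical and not taken from the original post:

```csharp
using System;

public static class NbspEscapeDemo
{
    // Hypothetical helper: true when the text contains U+00A0 (no-break space).
    public static bool ContainsNbsp(string text)
    {
        // '\u00A0' is one character. "\\u00A0" would instead be the six
        // literal characters: backslash, u, 0, 0, A, 0.
        return text.IndexOf('\u00A0') >= 0;
    }

    // Hypothetical helper: replaces every U+00A0 with an ordinary space.
    public static string NormalizeNbsp(string text)
    {
        return text.Replace('\u00A0', ' ');
    }

    public static void Main()
    {
        string sample = "2x2\u00A0s"; // assumed input, built with a real no-break space

        Console.WriteLine("\\u00A0".Length);                  // 6
        Console.WriteLine("\u00A0".Length);                   // 1
        Console.WriteLine(sample.Contains("\\u00A0"));        // False
        Console.WriteLine(ContainsNbsp(sample));              // True
        Console.WriteLine(NormalizeNbsp(sample) == "2x2 s");  // True
    }
}
```

    A plain string.Replace is enough here; Regex.Replace is not needed for a single fixed character.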

    Monday, December 31, 2018 6:48 AM
  • Have you tried “\u00A0” instead of “\\u00A0”?

    Hi Viorel, thanks for the reply. Yes, I tried "\u00A0" and also @" \u00A0", but neither of them works.

    Thanks


    Rupesh Shukla

    Monday, December 31, 2018 3:23 PM
  • How do you read the HTML from the page? You can use HtmlAgilityPack, which loads the page into a DOM (XML). I think the &nbsp; could be found there.
    Monday, December 31, 2018 4:00 PM
  • Then investigate the real contents of the strings. Set a breakpoint and type strText.ToCharArray() into the Watch window. Expand the array to see the character codes and letters. If you see the Evaluate button, click it. Check the codes that appear after “2x2”.
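
    If the Watch window is not convenient, the same inspection can be sketched in code. This is only an illustration; DescribeChars is a hypothetical helper and the sample strings are assumed:

```csharp
using System;
using System.Linq;

public static class CharDump
{
    // Hypothetical helper: lists each UTF-16 code unit of the text as "U+XXXX".
    public static string DescribeChars(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // Assumed input containing a no-break space after "2x2".
        Console.WriteLine(DescribeChars("2x2\u00A0s"));
        // U+0032 U+0078 U+0032 U+00A0 U+0073

        // The same text with an ordinary space, for comparison.
        Console.WriteLine(DescribeChars("2x2 s"));
        // U+0032 U+0078 U+0032 U+0020 U+0073
    }
}
```

    Comparing the two dumps shows exactly which code follows “2x2”, whether it is U+00A0, U+0020, or something else entirely.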

    Monday, December 31, 2018 4:56 PM
  • Hi Viorel,

    That also didn't work. When I tried different encodings on both texts and saved them as ANSI, I could see that one of the texts shows a different character, but it still appears as just a space.

    Thanks


    Rupesh Shukla

    Monday, December 31, 2018 9:32 PM
  • Can you provide the URL and the text you want to extract?
    Monday, December 31, 2018 9:39 PM
  • If you are reading the actual HTML, then the non-breaking spaces will appear as the six-character string "&nbsp;".  As in, "if(strText.Contains("&nbsp;") )".  The U+00A0 character is only going to appear in the rendered output, not in the HTML.
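
    A minimal sketch of that case, assuming the string was read from the raw markup and still holds the entity. It uses System.Net.WebUtility.HtmlDecode to turn the entity into U+00A0 before normalizing; the class and helper names are hypothetical:

```csharp
using System;
using System.Net;

public static class EntityDemo
{
    // Hypothetical helper: decodes HTML entities, then maps U+00A0 to a plain space.
    public static string DecodeAndNormalize(string rawHtmlText)
    {
        string decoded = WebUtility.HtmlDecode(rawHtmlText); // "&nbsp;" becomes U+00A0
        return decoded.Replace('\u00A0', ' ');
    }

    public static void Main()
    {
        string raw = "2x2&nbsp;s"; // assumed text as it appears in the HTML source

        Console.WriteLine(raw.Contains("&nbsp;"));              // True
        Console.WriteLine(raw.Contains("\u00A0"));              // False
        Console.WriteLine(DecodeAndNormalize(raw) == "2x2 s");  // True
    }
}
```

    Decoding first and normalizing afterwards handles both situations: text that still carries the entity and text that already contains the U+00A0 character.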

    Tim Roberts | Driver MVP Emeritus | Providenza & Boekelheide, Inc.

    Wednesday, January 2, 2019 9:24 PM
  • If I was doing this I would use Selenium and use XPath to extract the text I'm interested in.  It can extract only text.  C# supports Selenium.  You can open a file or a url.

    • Edited by mogulman52 Wednesday, January 2, 2019 10:25 PM
    Wednesday, January 2, 2019 10:17 PM
  • Hi Pintu Shukla,

    Is there any update? Did you resolve the issue?

    Best regards,

    Zhanglong


    MSDN Community Support

    Tuesday, January 8, 2019 1:36 AM
    Moderator