Quickie HTML / Regex question ....

  • Question

  • Good evening, Gurus ...

    While writing a website data scraper, I came across an unexpected situation.

    While browsing the site, I used Internet Explorer to save (as HTML) a page of interest. Shortly thereafter I opened the saved HTML file with Notepad. I selected one or two lines and cut-and-pasted them into a .cs file as comments.

    The interesting thing is that upon closer examination of the .cs file, I discovered that the argument symbol '&' (in a hyperlink statement) was followed immediately by something else: "amp;". That's blowing my Regex away.

    So my question is: is that "amp;" REALLY up there on the website, or is it an artifact of saving with IE and then opening with Notepad? That is, if I could actually view that line on the host machine, would I see that "amp;", or is it being placed there by IE / Notepad?

    Thanks for your help.


    • Edited by Lincoln_MA Wednesday, September 12, 2018 5:26 AM
    Wednesday, September 12, 2018 5:23 AM

Answers

  • That's expected - &amp; is how the '&' sign is written in URLs inside HTML. Reference Link

    Actually, different browsers treat it differently.

    Below is the explanation from the reference link:

    The ampersand character ("&") declares the beginning of an entity reference (e.g., &reg; for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&amp;"). For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.
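
    To make the quoted rule concrete, here is a minimal sketch of decoding the entity before running any further pattern matching on the URL. It is written in Python only for brevity; the sample line and variable names are made up for illustration, and it assumes the scraped markup really contains "&amp;" as the quoted text says it should. In C#, System.Net.WebUtility.HtmlDecode performs the same kind of decoding.

```python
import html
import re

# Hypothetical sample line, shaped like the href in the quoted example.
raw_line = '<a href="http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user">link</a>'

# Extract the attribute value exactly as it appears in the markup.
match = re.search(r'href="([^"]*)"', raw_line)
encoded_url = match.group(1)

# Decode HTML entities so "&amp;" becomes a plain "&".
decoded_url = html.unescape(encoded_url)

print(encoded_url)
print(decoded_url)
```

    With the entity decoded first, a pattern that expects a bare '&' between query parameters can be applied to decoded_url instead of the raw markup.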


    Thanks,
    Vivek Bansod


    • Edited by Vivek Bansod Wednesday, September 12, 2018 12:27 PM
    • Marked as answer by Lincoln_MA Wednesday, September 12, 2018 3:46 PM
    Wednesday, September 12, 2018 12:26 PM

All replies

  • I would recommend that you use an HTML reader to read the HTML rather than trying to view or work with it directly as raw text. An HTML reader will handle the HTML encoding/decoding for you. I have used HtmlAgilityPack in the past to scrape arbitrarily complex sites with relative ease. You'll probably end up creating quite a few helper methods for the specific site you're parsing but it is pretty clean and easy to work with.
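
    As a rough illustration of the "let a parser do the decoding" idea, the sketch below uses Python's standard-library html.parser rather than HtmlAgilityPack, so it is a stand-in for the approach and not HtmlAgilityPack's API. The LinkCollector class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over attribute values with entities already decoded.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical sample markup with an encoded ampersand in the query string.
sample = '<p><a href="http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user">guest</a></p>'

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

    The collected links contain a plain '&', so no entity handling is needed in the scraping code itself.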

    Michael Taylor http://www.michaeltaylorp3.net

    Wednesday, September 12, 2018 1:46 PM
    Moderator
  • Thank you, Sir.

    It seems strange to me that the ampersand must be followed by an 'amp;' token (we don't, for instance, require a question mark to be followed by a 'que;' token … similarly, we don't expect a '.' to be followed by a 'dot;' token), but that shall forever remain one of life's (and HTML's) sweet mysteries.

    Again, thank you for your help, Sir.

    Wednesday, September 12, 2018 3:52 PM
  • Michael, Thank you for the tip, Sir.

    I'll check 'HtmlAgilityPack' out this morning.  I hope to begin using it this afternoon.

    Thanks again … Very Much Appreciated.

    bill

    Wednesday, September 12, 2018 3:58 PM