To regex - or not to regex?

  • Question

  • Hi,

    I have yet another regex query - this time regarding the retrieval of a URL from HTML.  I have a string called HTML, with a number of links in it.  The format for these links is something like:

    <a href="/index/welcome.php?action=one&amp;section=27">one</a>

    Now, I only want the link which is called one (in the link text - so between the > and the </a>), and from that link I want everything between the " marks.  Can anyone help me do this?  Thanks

    Friday, August 25, 2006 11:29 PM

Answers

  • You could use regex, but it is expensive. If you are going to be doing this kind of thing throughout your application then yes, use Regex; otherwise you could use the String.Substring method:

     

    // everything after the first '>' in the string
    string theString = theHtmlString.Substring(theHtmlString.IndexOf(">") + 1);

     

    Untested, but it should be thereabouts.

     

    Saturday, August 26, 2006 12:16 PM
    Moderator
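For the link in the original question, the Substring idea can be spelled out a little further. The following is only a minimal sketch, assuming the link text is followed directly by `</a>` and the attribute is written exactly as `href="..."`. The class name `LinkFinder` and the method `FindHref` are made up for illustration and are not from the thread.

```csharp
using System;

static class LinkFinder
{
    // Returns the href of the first <a> element whose link text equals linkText,
    // or null if no such link is found. Uses plain string searching, no regex.
    public static string FindHref(string html, string linkText)
    {
        // Locate the tail of the wanted link, e.g. ">one</a>".
        int textPos = html.IndexOf(">" + linkText + "</a>", StringComparison.Ordinal);
        if (textPos < 0) return null;

        // Search backwards from there for the nearest href attribute.
        const string marker = "href=\"";
        int hrefPos = html.LastIndexOf(marker, textPos, StringComparison.Ordinal);
        if (hrefPos < 0) return null;

        // If another tag starts between the href and the link text,
        // the href belongs to a different element, so give up.
        if (html.IndexOf('<', hrefPos, textPos - hrefPos) >= 0) return null;

        int start = hrefPos + marker.Length;
        int end = html.IndexOf('"', start);
        if (end < 0 || end > textPos) return null;

        return html.Substring(start, end - start);
    }
}
```

Calling `LinkFinder.FindHref(html, "one")` on the link from the question would return `/index/welcome.php?action=one&amp;section=27`. Note that the value is still HTML-encoded (`&amp;`), so it may need decoding before being used as a URL.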
  • From your posts, Martin, I think you're overusing regex. I HATE regex; it's so unnecessarily hard when Substring, TrimStart, and TrimEnd could all work. Just trim the "a href" and "/a" parts and you're done. Or, as ahmedilyas said, use Substring.

    One day a few weeks back, I needed to do a similar thing and spent more than an hour trying to figure out how to get regex to work. Then, as I was doing some other stuff, I saw Substring and used that instead, and it worked perfectly fine.
    Saturday, August 26, 2006 1:35 PM

All replies

  • Hi,
    something like this should do it:
    \<a\s+href\s*=\s*\"(.*)\"\s*\>(.*)\<\/a\>

    Hope it helps.
    Saturday, August 26, 2006 12:20 AM
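One thing to watch with the pattern above: `.*` is greedy, so when several links sit on the same line the first group can run from the first link's opening quote to the last link's closing quote. A common workaround is to replace `.*` with a character class that cannot cross the delimiter (`[^"]*` for the href, `[^<]*` for the link text). The sketch below compares the two. The class `GreedyDemo` and its members are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyDemo
{
    // The pattern from the post (without the unneeded backslashes), using greedy ".*".
    const string Greedy = @"<a\s+href\s*=\s*""(.*)""\s*>(.*)</a>";

    // Same shape, but each group is limited to characters that cannot
    // run past its closing delimiter.
    const string Bounded = @"<a\s+href\s*=\s*""([^""]*)""\s*>([^<]*)</a>";

    // Returns the first captured href, or null if nothing matches.
    public static string FirstHref(string html, bool bounded)
    {
        Match m = Regex.Match(html, bounded ? Bounded : Greedy);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

With the input `<a href="/a">one</a> <a href="/b">two</a>`, the bounded pattern captures `/a`, while the greedy one captures `/a">one</a> <a href="/b`.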
  • Hi

    That didn't work I'm afraid - it outputs the whole HTML string.  I'll give a better example of what I want:

    I may have a link like this:  <a href="/index/page.php?title=Welcome234" title="Edit section: Welcome">edit</a>

    There may be hundreds of links like this in the HTML string, but what differs between them is the link itself and the link title. The program gets user input (a string called "find") for the title to find. The user inputs the part of the title after "Edit section: " (note the space), and the regex should output the link for that title.  How can I do this?

    Thanks

     

    Saturday, August 26, 2006 11:08 AM
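A possible way to approach the request above is to build the pattern from the user's input at run time, passing the input through `Regex.Escape` so that characters such as `+` or `(` in a title are matched literally. This is only a sketch, assuming the `href` attribute always appears before `title` inside the tag, as in the example link. `SectionLinkFinder` and `FindLink` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class SectionLinkFinder
{
    // Returns the href of the first link whose title is "Edit section: <find>",
    // or null if there is no such link.
    public static string FindLink(string html, string find)
    {
        // [^>]*? keeps the match inside a single tag; [^"]* keeps the
        // captured href inside its quotes.
        string pattern =
            @"<a\s[^>]*?href\s*=\s*""([^""]*)""[^>]*?title\s*=\s*""Edit section: "
            + Regex.Escape(find) + "\"";

        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For the example link, `SectionLinkFinder.FindLink(html, "Welcome")` would return `/index/page.php?title=Welcome234`. Because of `RegexOptions.IgnoreCase`, the title comparison is case-insensitive; drop that option if an exact-case match is wanted.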
  • Well, regex is meant for pattern searching, as you may know, and it is the proper way of doing things in the long run, but it is expensive (can you blame it?), with its special meanings, pattern keywords, etc. It is good but hard; this is where you need to practice and read about it ;-)

     

    Saturday, August 26, 2006 1:38 PM
    Moderator
  • This pattern should work. 

    \<a[^>]*\>([^<]*)\<\/a\>

    It is defined as:

    <a
    Any character not in ">"  zero or more times
    >
    Capture
      Any character not in "<"   zero or more times
    End Capture
    </a>

    Saturday, August 26, 2006 2:40 PM
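The pattern above captures the link text but not the URL. One way it might be extended, shown here only as a sketch, is to add a second capture for the `href` value and then loop over all matches, picking the one whose text is wanted. The group names `href` and `text`, and the class `LinkTextDemo`, are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkTextDemo
{
    // The pattern from the post, with named groups added:
    // "href" holds the attribute value, "text" holds the link text.
    const string Pattern =
        @"<a\s[^>]*?href\s*=\s*""(?<href>[^""]*)""[^>]*>(?<text>[^<]*)</a>";

    // Returns the href of the first link whose text equals linkText, or null.
    public static string HrefForText(string html, string linkText)
    {
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            if (m.Groups["text"].Value == linkText)
                return m.Groups["href"].Value;
        }
        return null;
    }
}
```

Because the link text group is `[^<]*`, this only handles links whose text contains no nested tags (for example, no `<b>` inside the `<a>`).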
  • Hi

    Thanks for all your help - I've used substring this time though - because I understand it! 

     

    Saturday, August 26, 2006 3:49 PM