RegEx Request: Get all the <img src=""> tags in a web page.

    Question

  • Hello,

    How do I get all the <img src=""> tags on a web page, including the value of each src? I need to do this in my C# application. I want to fetch a web page and parse it to get only the <img src=""> tags, then put them in an array or a database. I want to compare the <img src=""> tags of two pages and check whether they are similar. If they are not, an error or prompt should appear.

    Please help.

    Thanks and more power.


    Tuesday, March 13, 2007 2:01 AM
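
    The comparison step described above does not depend on how the src values are obtained. As a rough sketch, assuming each page's src values have already been collected into a list of strings, and assuming "similar" means the same values in the same order, the check could look like this in C#. The names ImgSrcComparer, SameSources, and MissingFrom are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ImgSrcComparer
{
    // True when both pages list exactly the same src values in the same order.
    // Comparison is ordinal (case-sensitive), since URL paths can be case-sensitive.
    public static bool SameSources(IEnumerable<string> first, IEnumerable<string> second)
    {
        return first.SequenceEqual(second, StringComparer.Ordinal);
    }

    // Distinct src values that appear in 'expected' but not in 'actual'.
    // Useful for building the error or prompt text when the pages differ.
    public static List<string> MissingFrom(IEnumerable<string> expected, IEnumerable<string> actual)
    {
        return expected.Except(actual, StringComparer.Ordinal).ToList();
    }
}
```

    If order should not matter, comparing the two results of MissingFrom (called once in each direction) against empty lists would be one alternative.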

Answers

  • Here is an example. Each match contains just the address; alternatively, you can extract the address from the named group "Url".

    Input Text:
    <html><body>....<p>
    <img src='http://www.somedomain.com/somepic.jpg' />
    <br><img src='http://www.microsoft.com/somepic.jpg' />
    </body></html>

    Regular Expression:
    (?<=img\s+src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])

    Group Captures:
    Groups: (0) (Url)


    Match (1):
           0 : http://www.somedomain.com/somepic.jpg
         Url : http://www.somedomain.com/somepic.jpg

    Match (2):
           0 : http://www.microsoft.com/somepic.jpg
         Url : http://www.microsoft.com/somepic.jpg


    Tuesday, March 13, 2007 2:55 PM
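
    For the C# application mentioned in the question, the pattern above can be applied with the standard System.Text.RegularExpressions classes. The following is a minimal sketch, not code from the answer: the class name ImgSrcExtractor, the method name GetImageSources, and the use of RegexOptions.IgnoreCase are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgSrcExtractor
{
    // The pattern from the answer, unchanged. It relies on a variable-length
    // lookbehind (\s+), which .NET supports but some other regex engines reject.
    private const string Pattern =
        @"(?<=img\s+src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])";

    // Returns the src value of every match in document order.
    // IgnoreCase is an assumption here so that <IMG SRC=...> is also found.
    public static List<string> GetImageSources(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.IgnoreCase))
        {
            // Group 0 and the named group "Url" hold the same text for this pattern.
            urls.Add(m.Groups["Url"].Value);
        }
        return urls;
    }
}
```

    Calling ImgSrcExtractor.GetImageSources on the Input Text above should return the two addresses listed under Match (1) and Match (2).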

All replies

  • I would suggest taking a look at another post I made here a few days ago about extracting tags like the ones you're looking for (I think/hope).

    http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1323055&SiteID=1

    Good luck.

    Tuesday, March 13, 2007 1:29 PM
  • Hi,

    I'm trying to do the same thing in Adobe Flex (get the src of all images from a page), but when I execute the expression on your Input Text it returns null. It might be because of that "Url"... Could you give me a hint, please?

    Best Regards.
    Wednesday, April 22, 2009 2:55 PM
  • It could be because the regex is severely flawed: if there are attributes before the src attribute, it won't work. <Url> is a named capture and should not affect the match.

    (?<=img\s+.+src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])

    I modified it to work for me. His answer is wrong and should not be marked as the proper answer because it does not take into account all the variations of the image tag.
    • Edited by nullsoldier Thursday, November 12, 2009 6:42 PM correction
    • Proposed as answer by nullsoldier Thursday, November 12, 2009 6:42 PM
    Thursday, November 12, 2009 6:41 PM
  • I don't take offense at your statement, but I did specify that it needed to be an exact match only in the data.

    I agree it doesn't capture every incarnation of every attribute, and your solution goes a long way towards that. I, though, would have created regex patterns that treat such attributes as generic key-value pairs. I would then pipe the pairs into a dynamic LINQ entity and set up a dictionary where I could peruse the attributes at my coding leisure. I provide a similar example on my blog, entitled:

    Regex To Linq to Dictionary in C#
    William Wegerson (www.OmegaCoder.Com)
    Thursday, November 12, 2009 9:05 PM
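
    The key-value idea described above might be sketched as follows. This is not the code from the blog post; it is an illustration under stated assumptions: the class ImgAttributeParser, its method ParseImgTags, and both patterns are hypothetical, attribute values are assumed to be quoted, and no attribute value is assumed to contain a ">" character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ImgAttributeParser
{
    // Assumed pattern: one whole <img ...> tag, up to the first '>'.
    private static readonly Regex ImgTag =
        new Regex(@"<img\b[^>]*>", RegexOptions.IgnoreCase);

    // Assumed pattern: name = 'value' or name = "value". The closing quote
    // must be the same character as the opening one (\k<Quote>).
    private static readonly Regex AttributePair = new Regex(
        @"(?<Key>[\w-]+)\s*=\s*(?<Quote>[\x27\x22])(?<Value>.*?)\k<Quote>");

    // Returns one dictionary per <img> tag, keyed by attribute name
    // (case-insensitive). If a name repeats, the first value is kept.
    public static List<Dictionary<string, string>> ParseImgTags(string html)
    {
        return ImgTag.Matches(html)
            .Cast<Match>()
            .Select(tag => AttributePair.Matches(tag.Value)
                .Cast<Match>()
                .GroupBy(m => m.Groups["Key"].Value, StringComparer.OrdinalIgnoreCase)
                .ToDictionary(g => g.Key,
                              g => g.First().Groups["Value"].Value,
                              StringComparer.OrdinalIgnoreCase))
            .ToList();
    }
}
```

    With this shape the position of src inside the tag no longer matters: each dictionary can be asked for "src" directly, for example with TryGetValue, and the other attributes remain available.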
  • Hello,

    I used a tool called PowerGREP to test the nullsoldier and OmegaMan regexes against the following text:

    <img src="http://Www.imagesondemand.drt/picture11.gif" sadjklhskdhaskjdhsakjldhsakjld >
    <img src="http://Www.imagesondemand.drt/picture22.gif" sadjklhskdhaskjdhsakjldhsakjld >
    <img src="http://Www.imagesondemand.drt/picture333.gif" sadjklhskdhaskjdhsakjldhsakjld />
    <img alt="dsfjksdhfjk" src="http://Www.imagesondemand.drt/picture4444.gif" sadjklhskdhaskjdhsakjldhsakjld >
    <img alt="" src="http://Www.imagesondemand.drt/picture5555.gif" sadjklhskdhaskjdhsakjldhsakjld >
    <img Border="" src="http://Www.imagesondemand.drt/picture6666.gif" sadjklhskdhaskjdhsakjldhsakjld >

    Using OmegaMan's regex I get:

    http://Www.imagesondemand.drt/picture11.gif
    http://Www.imagesondemand.drt/picture22.gif
    http://Www.imagesondemand.drt/picture333.gif

    Using nullsoldier's regex I get:

    http://Www.imagesondemand.drt/picture4444.gif
    http://Www.imagesondemand.drt/picture5555.gif
    http://Www.imagesondemand.drt/picture6666.gif

    And using this regex: (?<=img+.+src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])

    I get:

    http://Www.imagesondemand.drt/picture11.gif
    http://Www.imagesondemand.drt/picture22.gif
    http://Www.imagesondemand.drt/picture333.gif
    http://Www.imagesondemand.drt/picture4444.gif
    http://Www.imagesondemand.drt/picture5555.gif
    http://Www.imagesondemand.drt/picture6666.gif

    So thanks for the help... ;)

    Wednesday, October 27, 2010 2:23 PM
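
    The comparison above was run in PowerGREP; it can be repeated with the .NET regex engine along these lines. This is only a sketch: the class PatternComparison and the names Original, Modified, Combined, Sample, and Extract are hypothetical labels for the three patterns and the sample text quoted in this thread.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // The three patterns exactly as posted in the thread.
    public const string Original = @"(?<=img\s+src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])";
    public const string Modified = @"(?<=img\s+.+src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])";
    public const string Combined = @"(?<=img+.+src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])";

    // The six test lines from the post, one tag per line.
    public static readonly string Sample = string.Join("\n", new[]
    {
        "<img src=\"http://Www.imagesondemand.drt/picture11.gif\" sadjklhskdhaskjdhsakjldhsakjld >",
        "<img src=\"http://Www.imagesondemand.drt/picture22.gif\" sadjklhskdhaskjdhsakjldhsakjld >",
        "<img src=\"http://Www.imagesondemand.drt/picture333.gif\" sadjklhskdhaskjdhsakjldhsakjld />",
        "<img alt=\"dsfjksdhfjk\" src=\"http://Www.imagesondemand.drt/picture4444.gif\" sadjklhskdhaskjdhsakjldhsakjld >",
        "<img alt=\"\" src=\"http://Www.imagesondemand.drt/picture5555.gif\" sadjklhskdhaskjdhsakjldhsakjld >",
        "<img Border=\"\" src=\"http://Www.imagesondemand.drt/picture6666.gif\" sadjklhskdhaskjdhsakjldhsakjld >"
    });

    // Runs one pattern over the input and returns the "Url" group of each match.
    public static List<string> Extract(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups["Url"].Value)
                    .ToList();
    }
}
```

    Run against Sample, Original and Modified should each yield three URLs and Combined all six, in line with the PowerGREP results reported above. Modified skips the first three lines because it needs at least one character between the whitespace after img and src, which a tag that starts with src does not have.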