Regular Expression to Find Links

  • Question

  • User-860009756 posted

    I am looking for a regular expression that will extract all links from href="" attributes.

    Thursday, January 7, 2010 2:38 AM

Answers

  • User-952121411 posted

    I found a bunch of RegEx expressions to do exactly what you need.  Take a look at the following link:

    http://regexlib.com/Search.aspx?k=href&c=-1&m=-1&ps=20

    Here are a couple of descriptions and expressions from the link above:

    "Will locate a URL in a webpage. It'll search in 2 ways: first it will try to locate an href=, and then go to the end of the link. If there is no href=, it will search for the end of the file name instead (.asp, .htm and so on), and then take the data between the "xxxxxx" or 'xxxxxx'."

    (("|')[a-z0-9\/\.\?\=\&]*(\.htm|\.asp|\.php|\.jsp)[a-z0-9\/\.\?\=\&]*("|'))|(href=*?[a-z0-9\/\.\?\=\&"']*)

    "This will match just about everything after href=. It's good if you just need a list of all the href= values."

    href=[\"\']?((?:[^>]|[^\s]|[^"]|[^'])+)[\"\']?

    "This regex will extract the link and the link title for every a href in HTML source. Useful for crawling sites. Note that this pattern will also allow for links that are spread over multiple lines."

    <a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>
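
    One caveat on the second pattern above: its alternation `[^>]|[^\s]|[^"]|[^']` accepts any character at all (a character rejected by one branch is accepted by another), so the capture runs to the end of the input instead of stopping at the closing quote. Here is a minimal C# sketch of that behaviour next to a tighter pattern. The tighter pattern and the names `HrefPatternDemo`, `Greedy`, `Tight` and `FirstCapture` are my own illustrative assumptions, not something taken from regexlib.com.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefPatternDemo
{
    // Second pattern from the list above, unchanged.
    public const string Greedy = @"href=[\""\']?((?:[^>]|[^\s]|[^""]|[^'])+)[\""\']?";

    // Hypothetical tighter variant: stop at the first quote, whitespace or '>'.
    public const string Tight = @"href\s*=\s*[""']?([^""'\s>]+)";

    // Returns capture group 1 of the first match, or "" when nothing matches.
    public static string FirstCapture(string pattern, string html)
    {
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string html = "<a href=\"http://asp.net\">ASP.NET</a>";
        Console.WriteLine(FirstCapture(Greedy, html)); // http://asp.net">ASP.NET</a>
        Console.WriteLine(FirstCapture(Tight, html));  // http://asp.net
    }
}
```

    The tighter variant will also stop early on URLs that contain unencoded spaces, so treat it as a starting point rather than a complete solution.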

     

    Lastly, when I get in a bind with a RegEx I am already working with and have a specific question, I usually post to the following forum:

    Regular Expressions (MSDN Forums):

    http://social.msdn.microsoft.com/Forums/en-US/regexp/threads

    Hope this helps!

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Friday, January 8, 2010 10:07 AM

All replies

  • User191633014 posted

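    Use this:

            // Needs: using System.Collections.Generic; using System.Text.RegularExpressions;
            // The named group "Links" lazily captures the text between the double quotes after Href=.
            Regex x = new Regex(@"Href\s*?=\s*?""(?<Links>.*?)""", RegexOptions.IgnoreCase);
            string s = "Href=\"http://forums.asp.net/t/1511527.aspx\" 8888888888888 Href = \"http://asp.net\"  ";
            List<string> allLinks = new List<string>();
            MatchCollection mx = x.Matches(s);
            foreach (Match MItem in mx)
                allLinks.Add(MItem.Groups["Links"].Value);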
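
    The pattern above only matches double-quoted values. If the pages may also use single quotes, one possible variation is to capture the opening quote and require the same character to close the value with a backreference. This is a sketch under that assumption; the class name `HrefExtractor`, the method `ExtractLinks` and the group name `q` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // (?<q>["']) remembers the opening quote; \k<q> requires the same quote to close.
    private static readonly Regex HrefRegex = new Regex(
        @"href\s*=\s*(?<q>[""'])(?<Links>.*?)\k<q>",
        RegexOptions.IgnoreCase);

    // Returns the value of every quoted href attribute found in the input.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefRegex.Matches(html))
            links.Add(m.Groups["Links"].Value);
        return links;
    }

    public static void Main()
    {
        string s = "Href=\"http://forums.asp.net/t/1511527.aspx\" 8888 href = 'http://asp.net'";
        foreach (string link in ExtractLinks(s))
            Console.WriteLine(link);
    }
}
```

    Unquoted values (href=page.htm) are still skipped by this version.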


    Thursday, January 7, 2010 3:07 AM
  • User-860009756 posted

    Will give it a try.

    I decided that I first need to get the entire <a></a> tag, in case it has a title attribute or something else I can use as data. The code below does that well, including any links wrapped around an image or other tag.

    Collapse the whitespace first:

    Dim R As New Regex("\s+")
    Return R.Replace(strText, " ")

    Then get the <a> tags:

    Dim Links As New List(Of String)
    Dim x As New Regex("<a.*?href=[""'](?<url>.*?)[""'].*?>(?<name>.*?)</a>", RegexOptions.IgnoreCase)
    Dim mx As MatchCollection = x.Matches(PageData)
    System.Web.HttpContext.Current.Response.Write(mx.Count)
    For Each MItem As Match In mx
        Links.Add(MItem.Value)
    Next
    Return Links

    From there, I think I will strip all the other attributes of the tag and be left with just the URL, which I will save to the database.

    So I will try stripping the URLs next.
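
    For what it's worth, the pattern in this post already defines the named groups url and name, so the URL can be read straight from each match without a second stripping pass. Below is a minimal sketch of that idea, written in C# like the earlier reply. It assumes the same pattern and the same whitespace collapsing; the class name `AnchorParser` and the method `Parse` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorParser
{
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Same pattern as the VB snippet above, with the named groups url and name.
    private static readonly Regex Anchor = new Regex(
        @"<a.*?href=[""'](?<url>.*?)[""'].*?>(?<name>.*?)</a>",
        RegexOptions.IgnoreCase);

    // Returns (url, name) pairs for every anchor found in the page.
    public static List<KeyValuePair<string, string>> Parse(string pageData)
    {
        // Collapse runs of whitespace so tags split across lines sit on one line.
        string flat = Whitespace.Replace(pageData, " ");
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Anchor.Matches(flat))
            result.Add(new KeyValuePair<string, string>(
                m.Groups["url"].Value, m.Groups["name"].Value));
        return result;
    }

    public static void Main()
    {
        string page = "<p><a title=\"Home\"\n   href=\"http://asp.net\">ASP.NET\n site</a></p>";
        foreach (var pair in Parse(page))
            Console.WriteLine(pair.Key + " | " + pair.Value); // http://asp.net | ASP.NET site
    }
}
```

    The whole tag is still available through m.Value if the title attribute or other data is needed later.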




    Friday, January 8, 2010 2:53 PM
  • User-860009756 posted

    Hey Stefan,

    This works. It returns 90 links, including CSS links, while the <a> tag function returns 81 <a> tags on the page. I will parse those tags through this regular expression to get the valid links, since I don't need the CSS files etc.

    Public Shared Function GetHTMLAnchorLinks(ByVal PageData As String) As List(Of String)
        Dim Links As New List(Of String)
        'Dim x As New Regex("href=[\""\']?((?:[^>]|[^\s]|[^""]|[^'])+)[\""\']?", RegexOptions.IgnoreCase)
        Dim x As New Regex("Href\s*?=\s*?""(?<Links>.*?)""", RegexOptions.IgnoreCase)
        Dim mx As MatchCollection = x.Matches(PageData)
        System.Web.HttpContext.Current.Response.Write(mx.Count)
        For Each MItem As Match In mx
            'Read the named group so only the URL is stored, not the whole Href="..." match
            Links.Add(MItem.Groups("Links").Value)
        Next
        Return Links
    End Function


    Though it should be:

    Public Shared Function GetHTMLAnchorLinks(ByVal PageData As String) As String


    Like this:

    Private Shared _RegExHref As String = "Href\s*?=\s*?""(?<Links>.*?)"""

    'Returns the URL of every double-quoted href found in the page
    Public Shared Function GetHTMLAnchorLinks(ByVal PageData As String) As List(Of String)
        Dim Links As New List(Of String)
        Dim x As New Regex(_RegExHref, RegexOptions.IgnoreCase)
        Dim mx As MatchCollection = x.Matches(PageData)
        System.Web.HttpContext.Current.Response.Write(mx.Count)
        For Each MItem As Match In mx
            'Read the named group so only the URL is stored, not the whole Href="..." match
            Links.Add(MItem.Groups("Links").Value)
        Next
        Return Links
    End Function

    'Returns the URL of the last href found in a single <a> tag, or "" if there is none
    Public Shared Function GetHTMLAnchorLink(ByVal AnchorData As String) As String
        'TODO: See about simplifying
        Dim x As New Regex(_RegExHref, RegexOptions.IgnoreCase)
        Dim mx As MatchCollection = x.Matches(AnchorData)
        System.Web.HttpContext.Current.Response.Write(mx.Count)
        Dim HrefLink As String = ""
        For Each MItem As Match In mx
            HrefLink = MItem.Groups("Links").Value
        Next
        Return HrefLink
    End Function
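
    On the TODO about simplifying: since a single <a> tag normally carries one href, Regex.Match can stand in for the Matches loop. The sketch below (in C#, like the first reply) assumes that, and also adds a hypothetical `IsStylesheet` check for skipping the CSS links mentioned above. The names `AnchorLinks` and `GetHtmlAnchorLink` are illustrative. Note one difference: the loop in this post keeps the last match, while this version returns the first.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorLinks
{
    // Same pattern as _RegExHref above; compiled once and reused.
    private static readonly Regex HrefRegex = new Regex(
        @"Href\s*?=\s*?""(?<Links>.*?)""", RegexOptions.IgnoreCase);

    // Returns the URL of the first href in the fragment, or "" when none is found.
    public static string GetHtmlAnchorLink(string anchorData)
    {
        Match m = HrefRegex.Match(anchorData);
        return m.Success ? m.Groups["Links"].Value : string.Empty;
    }

    // Hypothetical filter: true when the URL path (query and fragment removed) ends in .css.
    public static bool IsStylesheet(string url)
    {
        string path = url.Split('?', '#')[0];
        return path.EndsWith(".css", StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        string tag = "<a href=\"http://asp.net/default.aspx\" title=\"Home\">Home</a>";
        Console.WriteLine(GetHtmlAnchorLink(tag));               // http://asp.net/default.aspx
        Console.WriteLine(IsStylesheet("/styles/site.css?v=2")); // True
    }
}
```

    Checking the file extension is only a rough test; stylesheets served from URLs without a .css suffix would slip through.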




    Friday, January 8, 2010 3:11 PM
  • User-952121411 posted

    Great! Glad to see you got it working. For future reference, the RegExLib site has a bunch of searchable regular expressions that can be used quickly without having to write them from scratch.

    Regular Expression Library: 

    http://regexlib.com/

     

    Friday, January 8, 2010 3:36 PM