Regular Expression to Find Links

Question
User-860009756 posted
Looking for a Regular Expression that will extract all links out of Href=""
Thursday, January 7, 2010 2:38 AM
Answers
User-952121411 posted
I found a bunch of RegEx expressions to do exactly what you need. Take a look at the following link:
http://regexlib.com/Search.aspx?k=href&c=-1&m=-1&ps=20
Here are a couple of descriptions and expressions from the link above:
"Will locate a URL in a webpage. It'll search in 2 ways - first it will try to locate a href=, and then go to the end of the link. If there is no href=, it will search for the end of the file instead (.asp, .htm and so on), and then take the data between the "xxxxxx" or 'xxxxxx'"
(("|')[a-z0-9\/\.\?\=\&]*(\.htm|\.asp|\.php|\.jsp)[a-z0-9\/\.\?\=\&]*("|'))|(href=*?[a-z0-9\/\.\?\=\&"']*)
"This will match just about everything after href=. It's good if you just need a list of all the href= values"
href=[\"\']?((?:[^>]|[^\s]|[^"]|[^'])+)[\"\']?
"This regex will extract the link and the link title for every a href in HTML source. Useful for crawling sites. Note that this pattern will also allow for links that are spread over multiple lines."
<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>
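As a minimal sketch of how that third pattern could be applied in C#: the class and method names below are hypothetical, and RegexOptions.Singleline is an assumption added so that "." also matches line breaks. Group 1 holds the href value and group 2 holds the link text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class AnchorExtractor
{
    // The third pattern quoted above, written as a C# verbatim string
    // (inside @"...", a doubled "" stands for one literal quote).
    private static readonly Regex AnchorPattern = new Regex(
        @"<a[\s]+[^>]*?href[\s]?=[\s""']+(.*?)[""']+.*?>([^<]+|.*?)?<\/a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one pair per anchor: Key is the href value, Value is the link text.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        foreach (Match m in AnchorPattern.Matches(html))
        {
            results.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return results;
    }
}
```

Calling AnchorExtractor.Extract on a page's HTML gives a list that can be looped over to read each link and its text. A regex like this is only an approximation of HTML parsing, so unusual markup may be missed.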
Lastly, when I get in a bind with a RegEx I am already working with and have a specific question, I usually post to the following forum:
Regular Expressions (MSDN Forums):
http://social.msdn.microsoft.com/Forums/en-US/regexp/threads
Hope this helps!
- Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
Friday, January 8, 2010 10:07 AM
All replies
User191633014 posted
Use this:
Regex x = new Regex(@"Href\s*?=\s*?""(?<Links>.*?)""", RegexOptions.IgnoreCase);
string s = "Href=\"http://forums.asp.net/t/1511527.aspx\" 8888888888888 Href = \"http://asp.net\" ";
List<string> allLinks = new List<string>();
MatchCollection mx = x.Matches(s);
foreach (Match MItem in mx)
allLinks.Add(MItem.Groups["Links"].Value);
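One detail worth spelling out is the difference between the whole match and the named group. The sketch below uses the same pattern; the class and method names are hypothetical. It shows that Match.Value includes the attribute name and quotes, while Groups["Links"].Value is only the URL.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class HrefDemo
{
    private static readonly Regex HrefPattern = new Regex(
        @"Href\s*?=\s*?""(?<Links>.*?)""", RegexOptions.IgnoreCase);

    // Whole matches, e.g. Href = "http://asp.net" (attribute name and quotes included).
    public static List<string> WholeMatches(string html)
    {
        var list = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
            list.Add(m.Value);
        return list;
    }

    // Only the text captured by the named group, e.g. http://asp.net
    public static List<string> LinkValues(string html)
    {
        var list = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
            list.Add(m.Groups["Links"].Value);
        return list;
    }
}
```

If only the URLs are going to be stored, the named group is the value to keep.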
Thursday, January 7, 2010 3:07 AM -
User-860009756 posted
Will give it a try.
I decided that first I need to get the entire <a></a> tag, in case it has a title attribute or something else I can use as data. This does that well, including any links wrapped around an image or other tag.
After removing whitespace:
Dim R As New Regex("\s+")
Return R.Replace(strText, " ")
Get <a> tags:
Dim Links As New List(Of String)
Dim x As New Regex("<a.*?href=[""'](?<url>.*?)[""'].*?>(?<name>.*?)</a>", RegexOptions.IgnoreCase)
Dim mx As MatchCollection = x.Matches(PageData)
System.Web.HttpContext.Current.Response.Write(mx.Count)
For Each MItem As Match In mx
Links.Add(MItem.Value)
Next
Return Links
Then I think from there I will strip all other attributes of the tag and be left with just the URL, which I will save to the database.
So I will try stripping the URLs next.
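The two steps in this post (collapse whitespace, then match whole <a> tags) could be sketched in C# roughly as follows. The class and method names are hypothetical. The pattern is the one from the post, and the named groups url and name are read directly, so no second pass is needed to strip the other attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class AnchorTagParser
{
    private static readonly Regex AnchorPattern = new Regex(
        @"<a.*?href=[""'](?<url>.*?)[""'].*?>(?<name>.*?)</a>",
        RegexOptions.IgnoreCase);

    // Replaces every run of whitespace (including line breaks) with one space,
    // so tags that span several lines end up on a single line.
    public static string CollapseWhitespace(string text)
    {
        return Regex.Replace(text, @"\s+", " ");
    }

    // Returns one pair per <a> tag: Key is the url group, Value is the name group.
    public static List<KeyValuePair<string, string>> Parse(string pageData)
    {
        var results = new List<KeyValuePair<string, string>>();
        string normalized = CollapseWhitespace(pageData);
        foreach (Match m in AnchorPattern.Matches(normalized))
        {
            results.Add(new KeyValuePair<string, string>(
                m.Groups["url"].Value, m.Groups["name"].Value));
        }
        return results;
    }
}
```

Because the whitespace is collapsed first, "." never has to match a line break, which is why the sketch does not set RegexOptions.Singleline.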
Friday, January 8, 2010 2:53 PM -
User-860009756 posted
Hey Stefan,
This works and returns 90 links, including CSS links. The <a> tag function returns 81 <a> tags on the page, so I will parse them through this regular expression to get the valid links. I don't need the CSS files etc.
Public Shared Function GetHTMLAnchorLinks(ByVal PageData As String) As List(Of String)
Dim Links As New List(Of String)
'Dim x As New Regex("href=[\"\']?((?:[^>]|[^\s]|[^"]|[^'])+)[\"\']?", RegexOptions.IgnoreCase)
Dim x As New Regex("Href\s*?=\s*?""(?<Links>.*?)""", RegexOptions.IgnoreCase)
Dim mx As MatchCollection = x.Matches(PageData)
System.Web.HttpContext.Current.Response.Write(mx.Count)
For Each MItem As Match In mx
'Use the named group so only the URL is stored, not the whole Href="..." match
Links.Add(MItem.Groups("Links").Value)
Next
Return Links
End Function
Though it should be
Public Shared Function GetHTMLAnchorLinks(ByVal PageData As String) As String
like this:
Private Shared _RegExHref As String = "Href\s*?=\s*?""(?<Links>.*?)"""
Public Shared Function GetHTMLAnchorLinks(ByVal PageData As String) As List(Of String)
Dim Links As New List(Of String)
Dim x As New Regex(_RegExHref, RegexOptions.IgnoreCase)
Dim mx As MatchCollection = x.Matches(PageData)
System.Web.HttpContext.Current.Response.Write(mx.Count)
For Each MItem As Match In mx
'Use the named group so only the URL is stored, not the whole Href="..." match
Links.Add(MItem.Groups("Links").Value)
Next
Return Links
End Function
Public Shared Function GetHTMLAnchorLink(ByVal AnchorData As String) As String
'TODO: See about simplifying
Dim x As New Regex(_RegExHref, RegexOptions.IgnoreCase)
Dim mx As MatchCollection = x.Matches(AnchorData)
System.Web.HttpContext.Current.Response.Write(mx.Count)
Dim HrefLink As String = ""
For Each MItem As Match In mx
'Use the named group so only the URL is returned, not the whole Href="..." match
HrefLink = MItem.Groups("Links").Value
Next
Return HrefLink
End Function
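For dropping the CSS links mentioned above, one possible approach is to filter the extracted href values by file extension. This is a minimal sketch in C# under that assumption. The class and method names are hypothetical, and it only looks at the part of the href before any "?" or "#".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only.
public static class LinkFilter
{
    // Keeps every href whose path does not end in ".css".
    public static List<string> WithoutStylesheets(IEnumerable<string> hrefs)
    {
        return hrefs.Where(h => !IsStylesheet(h)).ToList();
    }

    private static bool IsStylesheet(string href)
    {
        // Ignore any query string or fragment before checking the extension.
        int cut = href.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? href.Substring(0, cut) : href;
        return path.EndsWith(".css", StringComparison.OrdinalIgnoreCase);
    }
}
```

An alternative is to read href values only from <a> tags, which skips <link> stylesheet references without any extension check.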
Friday, January 8, 2010 3:11 PM -
User-952121411 posted
Great! Glad to see you got it working. For future reference, the RegExLib site has a bunch of searchable regular expressions that can be used quickly without having to write them from scratch.
Regular Expression Library:
Friday, January 8, 2010 3:36 PM