Html Tags and Regex
-
Monday, April 03, 2006 10:15 AM
I am writing a function that downloads an HTML page and extracts the values I want from it. The relevant part of the page is shown below. I want to get all the values highlighted in red:
<td class='tblDown' width='13%'>0012</td><td class='tblDown' width='12%'>3A</td><td class='tblDown' align='right' width='10%'>0.215</td>
<td class='tblDown' align='right' width='10%'>0.220</td>
<td class='tblDown' align='right' width='10%'>0.210</td>
This is the code I wrote, but I can't get the values I want. Can someone help me modify it? With the code below I am only able to get the value 0.215, because the second value (0.220) and the third value (0.210) are on new lines. Is there any way to match a newline in the pattern? I need help urgently.
(Moderator: Thread moved to the Regular Expression Forum and title tweaked so the thread is easier to find in a search.)
string HTMLText;
string url = "http://bursa.n2nconnect.com/BursaStockSearchAll.htm";
WebClient client=new WebClient();
byte[] data=client.DownloadData(url);
HTMLText= System.Text.ASCIIEncoding.ASCII.GetString(data);
HTMLText = System.Text.RegularExpressions.Regex.Replace(HTMLText, "\n", "");
HTMLText = System.Text.RegularExpressions.Regex.Replace(HTMLText, "(?:<[^>]*?>)", " ");
HTMLText = System.Text.RegularExpressions.Regex.Replace(HTMLText, " +", " ").Trim();
Regex exp = new Regex("(0012) (3a) *(\\d+\\.?\\d*) *(\\d+\\.?\\d*) *(\\d+\\.?\\d*) ", RegexOptions.IgnoreCase);
Match match = exp.Match(HTMLText);
if (match.Success)
{
    MessageBox.Show(match.Groups[1].Value + " " + match.Groups[3].Value);
}
else
{
    MessageBox.Show("The entire Fund ID not found!");
}
All Replies
-
Monday, April 03, 2006 1:52 PM
Try this pattern:
<td class='tblDown' width='13%'>0012</td><td class='tblDown' width='12%'>3A</td><td class='tblDown' align='right' width='10%'>(.+)</td>\s*<td class='tblDown' align='right' width='10%'>(.+)</td>\s*<td class='tblDown' align='right' width='10%'>(.+)</td>
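For what it's worth, here is a minimal sketch of how a pattern like the one above could be used from C#. It is only an illustration, not a tested drop-in for the original page: the class and method names (ExtractPrices, ExtractValues) are made up for this example, the sample string is just the three lines quoted in the question, and ([^<]+) is swapped in for (.+) so that a capture cannot run past the end of its own cell. The \s* between cells is what lets the match continue across the line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public class ExtractPrices
{
    // Hypothetical helper: returns the three right-aligned cell values that
    // follow the "0012" and "3A" cells, or null when the row is not found.
    public static string[] ExtractValues(string html)
    {
        // One right-aligned cell; [^<]+ keeps the capture inside the cell.
        const string cell = @"<td class='tblDown' align='right' width='10%'>([^<]+)</td>";

        // \s* matches spaces, tabs, \r and \n, so cells may sit on separate lines.
        string pattern =
            @"<td class='tblDown' width='13%'>0012</td>\s*" +
            @"<td class='tblDown' width='12%'>3A</td>\s*" +
            cell + @"\s*" + cell + @"\s*" + cell;

        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        if (!m.Success)
        {
            return null;
        }
        return new string[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    public static void Main()
    {
        // Sample input copied from the question, with Windows line breaks.
        string html =
            "<td class='tblDown' width='13%'>0012</td><td class='tblDown' width='12%'>3A</td>" +
            "<td class='tblDown' align='right' width='10%'>0.215</td>\r\n" +
            "<td class='tblDown' align='right' width='10%'>0.220</td>\r\n" +
            "<td class='tblDown' align='right' width='10%'>0.210</td>";

        string[] values = ExtractValues(html);
        Console.WriteLine(values == null ? "not found" : string.Join(" ", values));
    }
}
```

Because the pattern runs against the raw HTML, the three Regex.Replace calls that strip newlines and tags are not needed with this approach. Note also that the code in the question only displays Groups[1] and Groups[3]; the second and third values would be in Groups[4] and Groups[5] of that pattern.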
-
Monday, April 03, 2006 6:36 PM
An alternative approach is to navigate to the URL using the Internet Explorer (WebBrowser) control. Once the page has loaded, you can navigate the DOM of the page programmatically. This would allow you to loop through the elements in a table and inspect the value in each cell.
The advantage of this approach is that you do not have to worry about the various HTML attributes (as you would if you tried to interpret the HTML using regular expressions).
The principle is outlined in the following article:
http://www.codeproject.com/csharp/mshtml_automation.asp

