Limit Lookahead Maybe?
-
Wednesday, July 18, 2012 1:29 PM
I have the following HTML Code:
<Table> <tr> <td>1</td> <td>Person Name</td> </tr> </Table>
<Table> <tr> <td>1</td> <td>Person Name</td> </tr> <tr> <td>2</td> <td>Person Name</td> </tr> </Table>
<Table> <tr> <td>1</td> <td>Person Name</td> </tr> </Table>
<Table> <tr> <td>1</td> <td>Person Name</td> </tr> <tr> <td>2</td> <td>Person Name</td> </tr> </Table>
<Table> <tr> <td>1</td> <td>Person Name</td> </tr> </Table>
The goal is to retrieve the data in the two TableData areas within each table, except that sometimes there is a second TableRow (Multiple Records).
My RegEx currently finds <Table>, then everything up to <td>, then (?<recordNum>everything up to <), then everything up to the next <td>, then (?<name>everything up to <), and then moves past </Table> so that it can start over on the next table. That way I collect the "1" from each table. In RegexBuddy, if I highlight recordNum, all of the "1"s highlight, and if I highlight name, all of the names highlight.
But it skips the second and later records in the tables that have more than one (there can be up to about 20 records). Do I need to create recordNum1, recordNum2, ... and name1, name2, ... and manually repeat a zero-or-one group over and over?
On the same note, and possibly moving in the right direction, I'd like to do a forward lookahead only within one Regex result (Between <Table> and </Table>). Something like this:
<Table> (?(?=((?:.*?[\s]*?)*?</tr>[\s]*?<tr>) (then iterate through multiple records))|
(get single records) )) (?:.*?[\s]*?)*?</Table>
The idea is to use an if/then/else forward lookahead:
start(?(?=(if)(then)|(else)))finish
If, after <Table>, you find (anything) up until </tr> followed by (whitespace only), followed by <tr>, that essentially states there are indeed multiple record rows... The problem is that it always evaluates as TRUE, because it finds the pattern further down, outside the span I care about (between <Table> and </Table>). So I wanted to limit the lookahead.
Then iterate through the multiple records. This is where my question lies: can I have a delimited <name> capture, or a dynamic name(x), or do I have to manually put in iterations of name1, name2, ...?
- Edited by Suamere Wednesday, July 18, 2012 1:40 PM Regex Fix
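One common way to keep a lookahead from seeing past </Table> is to replace the free-running ".*?" with a guarded dot, (?:(?!</table>).)*?, which refuses to step over a closing table tag. The following is only a minimal sketch under assumptions not stated in the post: .NET regex, tags without attributes, and no nested tables. The class and member names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original post.
public static class BoundedLookaheadSketch
{
    private const RegexOptions Options =
        RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Unbounded: ".*?" may run past </table>, so a second row in a LATER
    // table is enough to make the lookahead succeed.
    public static readonly Regex Unbounded =
        new Regex(@"<table>(?=.*?</tr>\s*<tr>)", Options);

    // Bounded: each character is consumed only if "</table>" does not start
    // at that position, so the lookahead stays inside the current table.
    public static readonly Regex Bounded =
        new Regex(@"<table>(?=(?:(?!</table>).)*?</tr>\s*<tr>)", Options);

    // Counts the <table> tags for which the given lookahead succeeds.
    public static int CountTablesFlaggedMultiRow(Regex probe, string html)
    {
        return probe.Matches(html).Count;
    }
}
```

Run against the five sample tables, the bounded pattern should flag only the two tables that really have a second row, while the unbounded one also flags single-row tables that merely precede a multi-row table.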
All Replies
-
Wednesday, July 18, 2012 1:59 PM
Hi,
I don't know any single one line regexp solution for this, but you can do it with nested loops:
using System;
using System.IO;
using System.Text.RegularExpressions;

string input = File.ReadAllText(@"c:\temp\test.html");
// In each pattern, group 2 holds the content between the opening and closing tag.
string tablePattern = @"<(t|T)able>(.*?)</(t|T)able>";
string rowPattern = @"<(tr|TR)>(.*?)</(tr|TR)>";
string cellPattern = @"<(td|TD)>(.*?)</(td|TD)>";

// Outer loop: tables. Middle loop: rows of that table. Inner loop: cells of that row.
foreach (Match m in Regex.Matches(input, tablePattern, RegexOptions.Singleline))
{
    foreach (Match mm in Regex.Matches(m.Groups[2].ToString(), rowPattern, RegexOptions.Singleline))
    {
        foreach (Match mmm in Regex.Matches(mm.Groups[2].ToString(), cellPattern, RegexOptions.Singleline))
        {
            Console.WriteLine("cell content: {0}", mmm.Groups[2].ToString());
        }
    }
}
But in my opinion it's better to look for a free HTML parser. Any unexpected attribute, capital letter, etc. can break your regexp in no time...
-
Wednesday, July 18, 2012 4:41 PM
Certainly, thanks for the reply.
I actually have a very high-end project using VB.NET and plan to develop programmatic iterations. But the biggest part of my question is more about learning advanced RegEx, and I'm sure it's possible.
And you are certainly correct that there are good regex precautions: either limit the pattern to literals or loosen it with repetition quantifiers (*, +, ?), which I plan to do, or else convert the input to lowercase first (LCase) for simplicity.
The question stands:
The <name> capture will capture one instance per Regex group. Is there a way to make it capture more than one? Or to dynamically make name1, name2, ...? Or do I have to manually build out the whole thing?
Also, is it possible to stop a lookahead from finding True beyond the Regex group, in this case <Table>...</Table>, so it doesn't find True in future iterations?
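On the first point, the .NET regex engine keeps every value a group matched inside a repetition, exposed through Group.Captures, so numbered names such as name1, name2 are not needed there. Below is a minimal sketch, assuming .NET, exactly two cells per row, and tags without attributes. The class name, method name, and the "num: name" output format are hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original thread.
public static class RepeatedCaptureSketch
{
    // One match per table. The row sub-pattern is repeated with "+", so each
    // named group collects one Capture per row of that table.
    private static readonly Regex TablePattern = new Regex(
        @"<table>\s*(?:<tr>\s*<td>(?<recordNum>\d+)</td>\s*<td>(?<name>[^<]*)</td>\s*</tr>\s*)+</table>",
        RegexOptions.IgnoreCase);

    // Returns one list per table; each entry is "recordNum: name".
    public static List<List<string>> ReadTables(string html)
    {
        var tables = new List<List<string>>();
        foreach (Match table in TablePattern.Matches(html))
        {
            // Both groups sit in the same repeated sub-pattern, so the two
            // collections have the same length and line up by index.
            CaptureCollection nums = table.Groups["recordNum"].Captures;
            CaptureCollection names = table.Groups["name"].Captures;

            var rows = new List<string>();
            for (int i = 0; i < nums.Count; i++)
            {
                rows.Add(nums[i].Value + ": " + names[i].Value.Trim());
            }
            tables.Add(rows);
        }
        return tables;
    }
}
```

Because the pattern has to reach </table> through whitespace and complete rows only, a match cannot spill into the next table. Note that Group.Value alone returns just the last capture; the full list is only available through Captures.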
-
Friday, July 20, 2012 9:10 AM Moderator
Hi Suamere,
Welcome to the MSDN Forum.
How about this method: http://msdn.microsoft.com/en-us/library/360dye2a
Reads XML schema and data into the DataSet using the specified file.
Solving this with a regular expression is more difficult than the approach above.
I hope this will be helpful.
Best regards,
Mike Feng
-
Friday, July 20, 2012 12:00 PM
On Wed, 18 Jul 2012 13:29:26 +0000, Suamere wrote:
> I have the following HTML Code: <Table>
> The goal is to retrieve the data in the two TableData areas within each table, except that sometimes there is a second TableRow (Multiple Records).

I'm not certain exactly what you want to do, but the following regex matches the Number and Name into named capturing groups, and doesn't skip anything in the example you've given:

(?<=<table>.*)(?:<tr>.*?<td>(?<TDNum>\d+)</td>.*?<td>(?<TDData>[^<]+).*?</tr>)+(?=.*</table>)

Note the RegexOptions:
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace
(dot matches newline | Case insensitive | Free-spacing)
Ron
Edit: In Regex Buddy, the following works also, and is a bit simpler (same options as above):
(?<=<Table>.*)<tr>.*?<td>
(?<TDNum>\d+)</td>.*?<td>
(?<TDData>[^<]+).*?</tr>
(?=.*</Table>)
- Edited by Ron Rosenfeld Friday, July 20, 2012 12:09 PM
- Marked As Answer by Suamere Friday, July 20, 2012 2:36 PM
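For reference, the simpler pattern from the answer can be driven from C# roughly as follows. This is a sketch, not Ron's own code: the class name, method name, and output format are hypothetical, and the closing lookahead is written with the full </table> tag. It relies on .NET allowing ".*" inside a lookbehind.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper for illustration; only the pattern comes from the answer.
public static class RowMatchSketch
{
    // IgnorePatternWhitespace lets the pattern span several lines;
    // Singleline lets "." cross line breaks in the input.
    private static readonly Regex RowPattern = new Regex(
        @"(?<=<table>.*)<tr>.*?<td>
          (?<TDNum>\d+)</td>.*?<td>
          (?<TDData>[^<]+).*?</tr>
          (?=.*</table>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase |
        RegexOptions.IgnorePatternWhitespace);

    // Returns one "TDNum: TDData" string per matched row, in document order.
    public static List<string> ReadRows(string html)
    {
        var rows = new List<string>();
        foreach (Match row in RowPattern.Matches(html))
        {
            rows.Add(row.Groups["TDNum"].Value + ": " + row.Groups["TDData"].Value);
        }
        return rows;
    }
}
```

Each match here is a single row, so the result is a flat list. The lookbehind and lookahead only confirm that some <table> comes earlier and some </table> comes later in the input; they do not say which table a row belongs to, so grouping rows by table needs a separate step.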
-
Friday, July 20, 2012 2:36 PM
So starting with a lookbehind and ending with a lookahead. That's a pretty good method that you'd think was common sense. Thanks a lot for the replies!
-
Tuesday, July 24, 2012 12:40 PM

