RegEx.Replace
-
Thursday, June 16, 2011 12:33 AM
Hi all --
I have a PowerShell script that I'm trying to write to go through a poorly formatted XML file and look for any nodes that have the word "Date" as part of the node name. For example:
<System><SystemName>Acme</Systemname><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System>
The above pattern is repeated hundreds of times throughout the file... for about 70MB worth of data.
The real file has lots more nodes and no linefeeds or anything... so it all appears on one line.
What I need to do is scan the file and look for any nodes that end in "Date" where the value is not 4 digits and replace with a 4 digit value.
Here is what I have so far... but it looks like the replace is only changing the first occurrence and not all other matches after the first match.
Using the example above, it should find the closing </SystemDate> and closing </FileDate> node and see that the digit is only 3 characters and replace with 9999.
$infile=get-content z:\system.txt
write-host $infile.Length
$regex = New-Object System.Text.RegularExpressions.Regex ">\d\d\d</(.*Date)"
$replace = $regex.Replace($infile,"9999")
write-host $infile.Length
write-host $replace.Length
set-content -Value $replace z:\new_system.txt
Any help would be appreciated!
All Replies
-
Thursday, June 16, 2011 8:12 AM
instead of .* use [^>]*Date
.* is greedy and will match much more than you're after
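To illustrate the point about greediness, here is a minimal PowerShell sketch run against the sample record from the question. It uses the static [regex]::Matches call instead of the New-Object form in the original script, and the variable names are only for illustration.

```powershell
# Sample record copied from the question.
$xml = '<System><SystemName>Acme</Systemname><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System>'

# Greedy: .* runs on to the LAST "Date" in the input, so everything from
# the first three-digit value to the final closing Date tag is one match.
$greedy = [regex]::Matches($xml, '>\d\d\d</(.*Date)')

# Negated class: [^>]* cannot cross a '>', so each closing tag is matched separately.
$bounded = [regex]::Matches($xml, '>\d\d\d</([^>]*Date)')

$greedy.Count    # 1
$bounded.Count   # 2
```

The single long match from the greedy pattern is why the output looks as if only the first occurrence changed. Either pattern still matches the closing tag text as well as the digits, so a replacement of just "9999" would also remove that text; the replacement string has to put back whatever should be kept.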
-
Thursday, June 16, 2011 2:25 PM
Unfortunately, I don't have much experience in PowerShell (I'm learning it), but I can tell you how I'd solve this problem in VB.NET. Perhaps the regular expression will be sufficient:
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim l_fixedString = Regex.Replace(
            input:="<System><SystemName>Acme</Systemname><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System>",
            pattern:="(?<OpenTag><(?<TagName>\w*Date\w*)>)(?<ShortDate>\d{3})(?<CloseTag></\k<TagName>>)",
            replacement:="${OpenTag}0${ShortDate}${CloseTag}")
        Console.WriteLine(l_fixedString)
        Console.WriteLine("Press any key...")
        Console.ReadKey(True)
    End Sub
End Module
What's going on here?
First, I am matching an open tag. If your XML is valid, then this should be a '<' followed by a string of alphanumeric characters (no spaces) followed by a '>'. We are further requiring that the word 'Date' appears somewhere within that tag name (note that regular expressions are case-sensitive by default). We are also capturing the tag name ('?<TagName>') so that we can reuse it in the close tag ('\k<TagName>').
Also, we are naming the complete open tag, the content (three digits) and the close tag.
In our replacement string, we are using substitutions. This allows us to insert the open tag (by calling its name), the content (prepended with a '0' to make it four digits) and the close tag.
I could explain in more detail, but Microsoft has already done a good job in their documentation. See these topics:
Naming sub-expressions (?<Name>)
Back references to a named sub-expression (\k<Name>)
Substitutions (${...} in our replacement string)
For more information generally, start with that documentation. After that, I'd highly recommend "Mastering Regular Expressions" by Jeffrey Friedl (O'Reilly Media, Inc). You can check out his accompanying website here: http://regex.info
-
Sunday, June 19, 2011 2:19 AM
Will l_fixedString return the entire XML file if the entire file is on one line? I.e., my file has no formatting in it... so the regex needs to fix not only the first occurrence on the line, but every occurrence after that.
-
Monday, June 20, 2011 5:07 AM Moderator
Well, I think a simple way to resolve this problem is to read the date nodes from the XML file one by one, fix each with the regular expression, then write the fixed date node back to the XML file. Because the file contains about 70MB worth of data, reading the whole file into a single string and running one replace over it may cause a performance issue.
Have a nice day.
Paul Zhou [MSFT]
MSDN Community Support | Feedback to us
Get or Request Code Sample from Microsoft
Please remember to mark the replies as answers if they help and unmark them if they provide no help.

-
Monday, June 20, 2011 2:55 PM
I think I've figured out the Powershell version:
$infile = get-content C:\test.txt
$regex = New-Object System.Text.RegularExpressions.Regex "(?<OpenTag><(?<TagName>\w*Date\w*)>)(?<ShortDate>\d{3})(?<CloseTag></\k<TagName>>)"
$replace = $regex.Replace($infile, '${OpenTag}0${ShortDate}${CloseTag}')
set-content -Value $replace C:\test2.txt
get-content C:\test2.txt
This should fix every occurrence in the document, as long as the XML is valid. If there are missing closing tags, tag pairs with different casing or other XML violations, then this regex will probably not work as expected. If the XML was generated by a tool, then it's probably valid and I wouldn't worry about it too much.
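On the earlier question of whether every occurrence on a single line gets fixed: Regex.Replace replaces all matches in its input string, not just the first. The sketch below checks this on an in-memory string instead of a file. The shortened record, the repeat count, and the variable names are assumptions made for illustration, and the tag pattern ends in Date with no trailing \w*.

```powershell
# Hypothetical shortened record, repeated to imitate a file with no line breaks.
$record  = '<System><SystemDate>313</SystemDate><FileDate>394</FileDate></System>'
$oneLine = $record * 3

$pattern = '(?<OpenTag><(?<TagName>\w*Date)>)(?<ShortDate>\d{3})(?<CloseTag></\k<TagName>>)'

# How many three-digit Date values there are before the fix.
$before = [regex]::Matches($oneLine, $pattern).Count

# Single quotes stop PowerShell from expanding ${OpenTag} itself, so the
# substitution reaches the regex engine untouched.
$fixed = [regex]::Replace($oneLine, $pattern, '${OpenTag}0${ShortDate}${CloseTag}')

# How many are left afterwards.
$after = [regex]::Matches($fixed, $pattern).Count

$before   # 6
$after    # 0
```

For a real file, [System.IO.File]::ReadAllText returns the whole content as one string, which avoids relying on how Get-Content splits the input into lines.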
-
Tuesday, June 21, 2011 12:47 PM
Cyborgx372,
\w*Date\w* will find Date anywhere within the tag while the author wants to find tags *ending* with Date. Drop the last \w*. Also, though the author's example shows 3 digits, he implied that the number could be 1-3 digits. So, change \d{3} to \d{1,3}. Unfortunately, I do not have an easy solution for taking a variable length match and converting it to a fixed length (0 padded) string. Normally, I would use MatchEvaluator, but the author is using Powershell (note: search for MatchEvaluator and Powershell and you will find a complicated script that might be a good start).
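One possible way to handle the variable-length case in PowerShell is to cast a script block to a MatchEvaluator delegate and do the padding inside it. This is only a sketch under assumptions: the sample input (including the EndDate node) is made up for illustration, and it zero-pads to four digits as in the replies above instead of substituting 9999.

```powershell
# Hypothetical input: a one-digit value, a three-digit value, and an
# already valid four-digit value that must be left alone.
$xml = '<System><SystemDate>7</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate><EndDate>2011</EndDate></System>'

# Tag names must END in Date (no trailing \w*), and the value is 1 to 3 digits.
$pattern = '(?<OpenTag><(?<TagName>\w*Date)>)(?<ShortDate>\d{1,3})(?<CloseTag></\k<TagName>>)'

# The script block is called once per match and returns the replacement text.
$padder = [System.Text.RegularExpressions.MatchEvaluator] {
    param($m)
    $m.Groups['OpenTag'].Value +
        $m.Groups['ShortDate'].Value.PadLeft(4, [char]'0') +
        $m.Groups['CloseTag'].Value
}

$fixed = [regex]::Replace($xml, $pattern, $padder)
$fixed
# <System><SystemDate>0007</SystemDate><SystemNumber>3</SystemNumber><FileDate>0394</FileDate><EndDate>2011</EndDate></System>
```

To write 9999 instead of a padded value, the middle term of the script block could simply be the literal string '9999'.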
Les Potter, Xalnix Corporation, Yet Another C# Blog
-
Tuesday, June 21, 2011 3:59 PM
Thanks for everyone's help. I'm still working on this problem and will update the post if it's still not working --
-
Thursday, June 23, 2011 7:54 AM Moderator
Hi,
Any update? Would you mind letting us know the results of the suggestions?
If the suggestions are helpful for you, please mark answers and close this thread.
If not, or if you have any other concerns, please feel free to let us know.
Paul Zhou [MSFT]
-
Thursday, June 23, 2011 6:58 PM
Hi, I'm still working on this issue.
I need to load/search/replace the file as a data file and not XML.
I'm hoping I can revisit this next week.
-
Thursday, June 23, 2011 7:09 PM
Hello,
Although it may appear to be on one line, it may be a multiline issue. I'm not familiar with PowerShell regex, but I'd look into the multiline and replace-all options.
Adam
Ctrl+Z

