RegEx.Replace

  • Thursday, June 16, 2011 12:33 AM
     
     

    Hi all --

     

    I'm trying to write a PowerShell script that goes through a poorly formatted XML file and looks for any nodes that have the word "Date" as part of the node name, e.g.:

    <System><SystemName>Acme</Systemname><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System>

    The above pattern is repeated hundreds of times throughout the file, for about 70MB worth of data.

    The real file has lots more nodes and no linefeeds or anything... so it all appears on one line.

    What I need to do is scan the file and look for any nodes that end in "Date" where the value is not 4 digits and replace with a 4 digit value.

    Here is what I have so far... but it looks like the replace is only changing the first occurrence and not the other matches after the first match.

    Using the example above, it should find the closing </SystemDate> and closing </FileDate> nodes, see that each value is only 3 digits, and replace it with 9999.

    $infile=get-content z:\system.txt
    write-host $infile.Length
    $regex = New-Object System.Text.RegularExpressions.Regex ">\d\d\d</(.*Date)"
    $replace = $regex.Replace($infile,"9999")
    write-host $infile.Length
    write-host $replace.Length
    set-content -Value $replace z:\new_system.txt

    Any help would be appreciated!

     


All Replies

  • Thursday, June 16, 2011 8:12 AM
     
     

    Instead of .* use [^>]* in front of Date, i.e. [^>]*Date.

     

    .* is greedy, so it will match much more than you're after: everything up to the last "Date" on the line.
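
    To make the difference concrete, here is a minimal sketch using the sample line from the question. The variable names are only for illustration. It also puts the delimiters and tag name back in the replacement, since the original pattern consumes them as part of the match.

```powershell
# Sample line copied from the question (the mismatched </Systemname> is in the original).
$xml = '<System><SystemName>Acme</Systemname><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System>'

# Greedy: .* runs on to the last "Date" on the line, so the whole span is one match.
$greedy = [regex]'>\d\d\d</(.*Date)'
$greedy.Matches($xml).Count

# Negated class: [^>]* cannot cross a tag boundary, so each closing tag matches on its own.
$bounded = [regex]'>\d\d\d</([^>]*Date)>'
$bounded.Matches($xml).Count

# Re-emit the delimiters and the captured tag name so that only the digits change.
$fixed = $bounded.Replace($xml, '>9999</$1>')
$fixed
```

    The single-quoted strings matter here: in a double-quoted PowerShell string, $1 would be expanded as a variable before the regex engine ever saw it.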

  • Thursday, June 16, 2011 2:25 PM
     
    Unfortunately, I don't have much experience in powershell (I'm learning it), but I can tell you how I'd solve this problem in VB.NET. Perhaps the regular expression will be sufficient:
    Imports System.Text.RegularExpressions
    
    Module Module1
    
      Sub Main()
        Dim l_fixedString = Regex.Replace(
          input:="<System><SystemName>Acme</Systemname><SystemDate>313</SystemDate><SystemNumber>3</SystemNumber><FileDate>394</FileDate></System>",
          pattern:="(?<OpenTag><(?<TagName>\w*Date\w*)>)(?<ShortDate>\d{3})(?<CloseTag></\k<TagName>>)",
          replacement:="${OpenTag}0${ShortDate}${CloseTag}")
        Console.WriteLine(l_fixedString)
    
        Console.WriteLine("Press any key...")
        Console.ReadKey(True)
      End Sub
    
    End Module
    
    

    What's going on here?
    First, I am matching an open tag. If your XML is valid, then this should be a '<' followed by a string of alphanumeric characters (no spaces) followed by a '>'. We are further requiring that the word 'Date' appears somewhere within that tag name (note that regular expressions are case-sensitive by default). We are also capturing the tag name ('?<TagName>') so that we can reuse it in the close tag ('\k<TagName>').
    Also, we are naming the complete open tag, the content (three digits) and the close tag.
    In our replacement string, we are using substitutions. This allows us to insert the open tag (by calling its name), the content (prepended with a '0' to make it four digits) and the close tag.
    I could explain in more detail, but Microsoft has already done a good job in their documentation. Follow these links:
    Naming sub-expressions (?<Name>)
    Back references to a named sub-expression (\k<Name>)
    Substitutions (${...} in our replacement string)
    For more information generally, start here:
    After which, I'd highly recommend "Mastering Regular Expressions" by Jeffrey Friedl (O'Reilly Media, Inc). You can check out his accompanying website here: http://regex.info
  • Sunday, June 19, 2011 2:19 AM
     
     

    Will l_fixedString return the entire XML file if the entire file is on one line? I.e., my file has no formatting in it, so does the regex fix not only the first occurrence on the line, but every occurrence after that?

     

  • Monday, June 20, 2011 5:07 AM
    Moderator
     
     
     

    Well, I think a simple way to resolve this problem is to read the date nodes from the XML file one by one, fix each one with the regular expression, and then write the fixed date node back to the XML file. Because the file contains about 70MB worth of data, reading the whole file into one string and replacing everything in a single pass may cause a performance issue.

     

    Have a nice day.


    Paul Zhou [MSFT]
    MSDN Community Support | Feedback to us
    Get or Request Code Sample from Microsoft
    Please remember to mark the replies as answers if they help and unmark them if they provide no help.

  • Monday, June 20, 2011 2:55 PM
     

    I think I've figured out the Powershell version:

    $infile = get-content C:\test.txt
    $regex = New-Object System.Text.RegularExpressions.Regex "(?<OpenTag><(?<TagName>\w*Date\w*)>)(?<ShortDate>\d{3})(?<CloseTag></\k<TagName>>)"
    $replace = $regex.Replace($infile, '${OpenTag}0${ShortDate}${CloseTag}')
    set-content -Value $replace C:\test2.txt
    
    get-content C:\test2.txt

     This should fix every occurrence in the document, as long as the XML is valid. If there are missing closing tags, tag pairs with different casing or other XML violations, then this regex will probably not work as expected. If the XML was generated by a tool, then it's probably valid and I wouldn't worry about it too much.
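
    One caveat: get-content returns an array with one string per line when the file contains line breaks, so it can be safer to read the file as a single string. A minimal sketch, assuming .NET's File.ReadAllText is acceptable for a file of this size; the temp-file paths and sample content are hypothetical and only there so the snippet runs on its own:

```powershell
# Hypothetical sample file in the temp directory, so the sketch runs anywhere.
$inPath  = Join-Path ([System.IO.Path]::GetTempPath()) 'system_sample.txt'
$outPath = Join-Path ([System.IO.Path]::GetTempPath()) 'system_sample_fixed.txt'
$sample  = "<System><SystemDate>313</SystemDate><FileDate>394</FileDate></System>`n" +
           "<System><SystemDate>2011</SystemDate><FileDate>128</FileDate></System>"
[System.IO.File]::WriteAllText($inPath, $sample)

# ReadAllText returns the whole file as one string, line breaks included.
$text = [System.IO.File]::ReadAllText($inPath)

# Same pattern as above; single quotes keep PowerShell from touching the ${...} substitutions.
$regex = [regex]'(?<OpenTag><(?<TagName>\w*Date\w*)>)(?<ShortDate>\d{3})(?<CloseTag></\k<TagName>>)'

# Regex.Replace rewrites every match in the input string, not only the first one.
$fixed = $regex.Replace($text, '${OpenTag}0${ShortDate}${CloseTag}')
[System.IO.File]::WriteAllText($outPath, $fixed)
$fixed
```

    The four-digit value in the second record is left alone because \d{3} must be followed directly by the closing tag.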

  • Tuesday, June 21, 2011 12:47 PM
     
     

    Cyborgx372,

    \w*Date\w* will find Date anywhere within the tag while the author wants to find tags *ending* with Date.  Drop the last \w*.  Also, though the author's example shows 3 digits, he implied that the number could be 1-3 digits.  So, change \d{3} to \d{1,3}.  Unfortunately, I do not have an easy solution for taking a variable length match and converting it to a fixed length (0 padded) string.  Normally, I would use MatchEvaluator, but the author is using Powershell (note: search for MatchEvaluator and Powershell and you will find a complicated script that might be a good start).
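
    As a possible starting point for the PowerShell side: a script block can be cast to a MatchEvaluator delegate and passed to Regex.Replace. A minimal sketch under the assumptions above (tag names ending in Date, 1 to 3 digits, zero padding to four digits); the sample string and the variable names are made up for illustration:

```powershell
# Hypothetical sample covering 1-digit, 3-digit, non-Date and already-4-digit values.
$xml = '<System><SystemDate>7</SystemDate><FileDate>394</FileDate><SystemNumber>3</SystemNumber><EndDate>2011</EndDate></System>'

# Tag names ending in "Date" that hold 1 to 3 digits; \k<TagName> requires the same name in the closing tag.
$pattern = '<(?<TagName>\w*Date)>(?<ShortDate>\d{1,3})</\k<TagName>>'

# Script block cast to a MatchEvaluator; it receives each Match and returns the replacement text.
$padToFour = [System.Text.RegularExpressions.MatchEvaluator] {
    param($m)
    $tag    = $m.Groups['TagName'].Value
    $digits = $m.Groups['ShortDate'].Value.PadLeft(4, '0')
    "<$tag>$digits</$tag>"
}

$fixed = [regex]::Replace($xml, $pattern, $padToFour)
$fixed
```

    If the requirement is a fixed 9999 rather than zero padding, the evaluator is not needed and a plain replacement string is enough.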


    Les Potter, Xalnix Corporation, Yet Another C# Blog
  • Tuesday, June 21, 2011 3:59 PM
     
     

    Thanks for everyone's help. I'm still working on this problem and will update the post if it is still not working --

     

  • Thursday, June 23, 2011 7:54 AM
    Moderator
     
     

    Hi,

    Any update? Would you mind letting us know the results of the suggestions?

    If the suggestions are helpful for you, please mark answers and close this thread.

    If not, or if you have any concerns, please feel free to let us know.


    Paul Zhou [MSFT]

  • Thursday, June 23, 2011 6:58 PM
     
     

    Hi, I'm still working on this issue.

    I need to load/search/replace the file as a data file and not XML.   

    I'm hoping I can revisit this next week.

     

  • Thursday, June 23, 2011 7:09 PM
     
     

    Hello,

    Although it may appear to be on one line, it may be a multiline issue. I'm not familiar with PowerShell regex, but I'd look into the multiline and replace-all options.

    Adam


    Ctrl+Z