RegExp Help

    Question

  • Hi,

    I need a regular expression pattern that could help me parse what I call tokens. Currently, I am parsing a string character by character using a scanner approach (sort of how a compiler works). However, I would think a RegExp would perform better.

     

    Here is the scenario: Everything that is used within a “()” is called a token. A token could have nested tokens. An example would be:

    Say I have a token called (CurrentUser) and another token called (Date). Each token can support embedded attributes inside it. For example, Date might have format=”Long|short|longtime|shortime”.

     

    So if I have the string “(CurrentUser) logged on (Date: format=”short”)”, I should be able to replace it with “John Smith logged on 03/30/2008”. That was simple; however, it could get more complex. For example, take “(CurrentUser)  logged on (Date: format=”(defaultFormat)” )”. The way my algorithm works is that it first finds the inner tokens:

    (CurrentUser) : John Smith

    (defaultFormat) : Long

     

    Step 1: “(CurrentUser)  logged on (Date: format=”(defaultFormat)” )”

    Step 2: “(CurrentUser)  logged on (Date: format=”long” )”

    At this point, I start processing from the start again and do string replacement on the tokens.

    I am hoping to find a creative RegExp pattern that I could use to parse it in a better way.

     

    Thanks,

    Friday, March 28, 2008 2:23 AM

Answers

  • You could start with a catch-all regular expression

    [(][:A-Za-z=\"\ ]*[)]

    that will match all the token(s) in your string.

    Or you could create an array of all the tokens in your application (CurrentUser, Date, etc.) and match on each one, substituting each valid token name for {0} in this template:

    [(]{0}[:a-z=\"\ ]*[)]

    Then, on each match, process the key/value pair and do a string.Format for each.
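
    A minimal sketch of the per-token idea, assuming the {0} in the template is meant as a string.Format placeholder for the token name. The helper name FindToken and the tokenNames array are made up for illustration, and RegexOptions.IgnoreCase is added because the template only lists a-z. It is written as top-level statements so it can be pasted into a scratch program.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical list of known token names; only CurrentUser and Date appear in the thread.
string[] tokenNames = { "CurrentUser", "Date" };

// The per-token template; {0} is filled in with the token name by string.Format.
const string template = @"[(]{0}[:a-z=""\ ]*[)]";

// Hypothetical helper: finds the first occurrence of one named token in the input.
Match FindToken(string input, string tokenName)
{
    // Regex.Escape guards against token names that contain regex metacharacters.
    string pattern = string.Format(template, Regex.Escape(tokenName));
    return Regex.Match(input, pattern, RegexOptions.IgnoreCase);
}

string sample = "(CurrentUser) logged on (Date: format=\"short\")";
foreach (string name in tokenNames)
{
    Match m = FindToken(sample, name);
    Console.WriteLine(name + " -> " + (m.Success ? m.Value : "(not found)"));
}
```

    Because the character class does not include parentheses, this still will not match a token whose attribute value is itself a token.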
    Friday, March 28, 2008 4:06 PM
  • I'm not sure what other possible tokens look like, or whether other tokens may also have a name followed by additional info, the way Date does. With that said, I came up with the following, which provides some flexibility and could be applied to other tokens with (optional) additional info.

     

    1. RegEx to grab the token name and content (note this uses the @ verbatim-string prefix):

    @"(?<=\()((?<TokenName>\w*)|(?<TokenName>\w*)(?<TokenContent>.*))(?=\))"

     

    2. RegEx to grab anything enclosed in double quotes (i.e. the date format); this one does not use @:

    "(?<=\").*(?=\")"

     

    Here's my code:

     

    Code Snippet

    using System;
    using System.Text.RegularExpressions;

    // TokenName is the leading word; TokenContent is whatever follows it, up to the last ")"
    Regex reToken = new Regex(@"(?<=\()((?<TokenName>\w*)|(?<TokenName>\w*)(?<TokenContent>.*))(?=\))");
    // Anything between the first and the last double quote
    Regex reDblQuoteContent = new Regex("(?<=\").*(?=\")");

    string str1 = "(CurrentUser) logged on (Date: format=\"short\")";
    string str2 = "(CurrentUser) logged on (Date: format=\"(defaultFormat)\")";

    // Get all the matches
    MatchCollection mc = reToken.Matches(str1);

    // Show the string
    Console.WriteLine("String = " + str1);

    // Iterate over each match in the collection
    foreach (Match m in mc)
    {
        // Get the matched groups and get details by group name
        GroupCollection gc = m.Groups;
        string token = gc["TokenName"].Value;
        string content = gc["TokenContent"].Value;
        string dateFormat = null;

        // Show the current matched group
        Console.WriteLine("\nCurrent Matched Group = " + gc[0].Value);

        // Get the date format if the token is "Date"
        if (token == "Date" && content != String.Empty)
        {
            // Only set the format if the content has a quoted value, otherwise it remains null
            Match quoted = reDblQuoteContent.Match(content);
            if (quoted.Success)
            {
                dateFormat = quoted.Value;
            }
        }

        Console.WriteLine("TokenName: " + token);
        Console.WriteLine("TokenContent: " + content);
        Console.WriteLine("DateFormat: " + (dateFormat ?? "Undefined"));
    }

     

    BTW TokenContent grabs everything after the TokenName is established. That's why, in the output below, you'll see the ":" included that was part of "Date:"

     

    Output for str1:

    String = (CurrentUser) logged on (Date: format="short")

     

    Current Matched Group = CurrentUser
    TokenName: CurrentUser
    TokenContent:
    DateFormat: Undefined

     

    Current Matched Group = Date: format="short"
    TokenName: Date
    TokenContent: : format="short"
    DateFormat: short

     

    Output for str2:

    String = (CurrentUser) logged on (Date: format="(defaultFormat)")

     

    Current Matched Group = CurrentUser
    TokenName: CurrentUser
    TokenContent:
    DateFormat: Undefined

     

    Current Matched Group = Date: format="(defaultFormat)"
    TokenName: Date
    TokenContent: : format="(defaultFormat)"
    DateFormat: (defaultFormat)

     

    I hope that was useful! If you use the code you'll have to change the references of "str1" to "str2" when you want to test that string out. I imagine you'll come up with a file reading routine to feed each line through the above as a function or such.
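
    As a rough sketch of the "function" idea, the same two patterns can be wrapped in a helper that takes one line and returns the parsed pieces. ParseLine and the three-element array layout are assumptions for illustration, not part of the snippet above. The TrimStart call is one way to drop the ":" that TokenContent picks up. It is written as top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Same two patterns as in the snippet above.
Regex reToken = new Regex(@"(?<=\()((?<TokenName>\w*)|(?<TokenName>\w*)(?<TokenContent>.*))(?=\))");
Regex reDblQuoteContent = new Regex("(?<=\").*(?=\")");

// Hypothetical helper: returns one { name, content, quotedValue } array per token on the line.
List<string[]> ParseLine(string line)
{
    List<string[]> results = new List<string[]>();
    foreach (Match m in reToken.Matches(line))
    {
        string name = m.Groups["TokenName"].Value;
        // Drop the leading ':' and spaces that TokenContent captures after the name.
        string content = m.Groups["TokenContent"].Value.TrimStart(':', ' ');
        Match quoted = reDblQuoteContent.Match(content);
        // Empty string when the token has no quoted attribute value.
        string quotedValue = quoted.Success ? quoted.Value : "";
        results.Add(new[] { name, content, quotedValue });
    }
    return results;
}

foreach (string[] t in ParseLine("(CurrentUser) logged on (Date: format=\"(defaultFormat)\")"))
{
    Console.WriteLine("name=" + t[0] + ", content=" + t[1] + ", quoted=" + t[2]);
}
```

    One caveat with this pattern: the .* in TokenContent is greedy, so a line with several attribute-carrying tokens would be captured as one long match running up to the last ")".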

    Friday, March 28, 2008 9:05 PM

All replies

  •  omarazam wrote:
    You could start with a catch-all regular expression

    [(][:A-Za-z=\"\ ]*[)]

    that will match all the token(s) in your string.

    Or you could create an array of all the tokens in your application (CurrentUser, Date, etc.) and match on each one, substituting each valid token name for {0} in this template:

    [(]{0}[:a-z=\"\ ]*[)]

    Then, on each match, process the key/value pair and do a string.Format for each.

     

    The above works nicely to capture both tokens and balsaim (the OP) could use this if he knows what to expect. I started with something similar, but the issue I found with it is when dealing with nested tokens - that's the tricky part. The nested () throws off the regex.

     

    String 1 result is:

    (CurrentUser)
    (Date: format="short")

    This is great and something we can work with.

     

    But for string 2, with the nested token, the result is:

    (CurrentUser)
    (defaultFormat)

    This is OK, but my concern is that we no longer know whether the second token belongs to a "Date" token. If that is always the case, then we can stop right here and use this shorter solution, since we're sure of what we're working with. If not, then we're in a bind. That is mainly why I decided to drill down and group things out. Both are potential options :)
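
    One way around the nesting problem is to lean on the fact that a pattern which excludes parentheses always hits the innermost token first, and simply repeat the replacement until nothing changes. That mirrors the step-by-step approach described in the question. The sketch below is only an illustration: the pattern, the values table, the Resolve and Expand helpers, the fixed date, and the two date formats are all assumptions, and it is written as top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Matches a token with no parentheses inside it, i.e. an innermost token.
// The optional part accepts a single format="..." attribute (an assumed syntax).
Regex reInnermost = new Regex(@"\((?<name>\w+)(?::\s*format=""(?<format>\w+)"")?\s*\)");

// Hypothetical lookup table; the values are the ones used in the question.
Dictionary<string, string> values = new Dictionary<string, string>
{
    { "CurrentUser", "John Smith" },
    { "defaultFormat", "Long" },
};

// Hypothetical resolver: Date is formatted on demand, other tokens come from the table.
string Resolve(Match m)
{
    string name = m.Groups["name"].Value;
    if (name == "Date")
    {
        // Fixed date so the result is reproducible; real code would likely use DateTime.Now.
        DateTime when = new DateTime(2008, 3, 30);
        bool isLong = m.Groups["format"].Value.ToLowerInvariant() == "long";
        return when.ToString(isLong ? "MMMM d, yyyy" : "MM/dd/yyyy", CultureInfo.InvariantCulture);
    }
    string value;
    // Unknown tokens are left as they are.
    return values.TryGetValue(name, out value) ? value : m.Value;
}

// Replaces innermost tokens repeatedly until the text stops changing.
string Expand(string input)
{
    // The pass limit guards against a value that expands to itself.
    for (int pass = 0; pass < 10; pass++)
    {
        string next = reInnermost.Replace(input, Resolve);
        if (next == input) break;
        input = next;
    }
    return input;
}

Console.WriteLine(Expand("(CurrentUser) logged on (Date: format=\"(defaultFormat)\")"));
```

    On the first pass only (CurrentUser) and (defaultFormat) match, because the outer Date token still contains parentheses. On the second pass the Date token has become innermost and is resolved with its attribute intact, so its type is never lost.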

    Friday, March 28, 2008 9:27 PM