Fillpoint matching

  • Thursday, October 22, 2009 3:06 PM RobA2345
     
    Hi all,

    I'm hoping someone can help me with this problem. For my website I send periodic emails out to subscribers. To add a personal feel I have fillpoints in the email which I replace with the relevant data. The format is [-->FillpointName<--]. I match them with this regex:

    private static Regex _fillPointMatcher = new Regex(@"([-->[A-Za-z]*<--])", RegexOptions.IgnoreCase);
    
    
    The problem I have is that if I state a monetary value in the html like "The price is &pound;[-->Price<--]" then my regex matches "pound;[-->Price<--]" instead of just "[-->Price<--]".

    Does anyone know why this is happening and how to fix it? I'm useless at regexes.

    Thanks Rob

Answers

  • Thursday, October 22, 2009 4:06 PM Ahmad Mageed
     Answer
    You've finally stumbled upon some input that revealed an error in your pattern. To fix your pattern simply escape the first square bracket: @"(\[-->[A-Za-z]*<--])"
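
    To see the difference on the input from the question, here is a minimal sketch. The class and field names are made up for illustration; the two patterns are the original one and the fix suggested above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FillPointDemo
{
    // Original pattern from the question: the leading '[' opens a character class.
    public static readonly Regex Broken =
        new Regex(@"([-->[A-Za-z]*<--])", RegexOptions.IgnoreCase);

    // Suggested fix: the leading '[' is escaped, so it matches a literal bracket.
    public static readonly Regex Fixed = new Regex(@"(\[-->[A-Za-z]*<--])");

    public static void Main()
    {
        string html = "The price is &pound;[-->Price<--]";

        Console.WriteLine(Broken.Match(html).Value); // pound;[-->Price<--]
        Console.WriteLine(Fixed.Match(html).Value);  // [-->Price<--]
    }
}
```

    The closing "]" can stay unescaped because outside a character class it has no special meaning, although writing "\]" does no harm and may read more clearly.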

    I'll explain why this happened, starting from the basics and up to the conclusion. A regex consists of special metacharacters that are used for matching. When you want to match a character that also holds special meaning in regex, you need to let the regex engine know not to treat that character specially, and to treat it literally instead. To do so, you have to "escape" it with a backslash.

    Your pattern uses [A-Za-z] - this represents a character class that matches any character inside, or a range of characters. A character class is expressed by square brackets []. A range is expressed by the dash "-" character. So far so good, right?

    Your FillPoint format is [-->FillPointName<--] - you can see where I'm going with this :) So, what really happened? Well, your pattern didn't mean what you wanted it to mean. As written, your pattern is really the character class "[-->[A-Za-z]" followed by "*", then the literal text "<--]". In other words, the character class starts at the very first "[", and the inner "[A-Za-z]" is never a class of its own: its "[" is just one more character inside the outer class, and its "]" is what closes that class.

    Why is this a problem? Why have you gotten lucky so far, only to be bitten by it now?

    Let's look at the beginning: "[-->[A-Za-z]". The "-->" part inside the class is interpreted as "match any character in the range from a dash ("-") to a greater-than symbol (">")". Here are the character values of the two endpoints:

    Dash (-): 45
    Greater Than (>): 62

    Guess what? A semicolon (";") has the value 59, so it falls within the range 45 to 62 and was captured. The letters of "pound" were captured too, because A-Z and a-z belong to the same class, while the "&" (38) lies outside it, which is why the match starts at "pound". I suspect most of your other fill points were preceded by a space or some other character outside the class. A space is 32.
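
    If you want to check this yourself, the sketch below prints the character values and tests each character against the range on its own. The helper name InDashToGreaterThan is hypothetical, and the sample characters are just a few I picked.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeDemo
{
    // True when the single character c is matched by the range class [-->],
    // i.e. when its value lies between '-' (45) and '>' (62).
    public static bool InDashToGreaterThan(char c)
    {
        return Regex.IsMatch(c.ToString(), @"^[-->]$");
    }

    public static void Main()
    {
        foreach (char c in new[] { '-', ';', '<', '>', ' ', '&' })
        {
            Console.WriteLine("'{0}' = {1}, in range: {2}",
                c, (int)c, InDashToGreaterThan(c));
        }
    }
}
```

    Digits and characters such as ".", "/", ":" and "=" also sit between 45 and 62, so any of them directly in front of a fill point would have been swallowed as well.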

    Escaping the first "[" in your pattern as "\[" tells the regex engine that you don't want to start a character class there, but to match a literal "[" in the text. Once you do that, the "-->" no longer signifies a range and the real character class starts at [A-Za-z]. The same applies to parentheses, +, * and so on: to match them literally you need to escape them as \(, \*, \+. The only exception is when they occur inside a character class: [(*+] is valid without escaping. Take a look at the regular expressions page for more details. Regex.Escape and Regex.Unescape are helpful for escaping, but apply them only to the parts you want treated literally, not to your entire pattern, otherwise it won't match correctly.
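
    As a rough illustration of applying Regex.Escape only to the literal parts, the sketch below escapes the two delimiters and leaves the character class alone. BuildMatcher and its parameters are names I made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Builds a fill point pattern: only the literal delimiters are escaped,
    // the character class in the middle is left as regex syntax.
    public static Regex BuildMatcher(string open, string close)
    {
        string pattern = Regex.Escape(open) + "[A-Za-z]*" + Regex.Escape(close);
        return new Regex(pattern);
    }

    public static void Main()
    {
        Console.WriteLine(Regex.Escape("[-->")); // \[-->

        Regex matcher = BuildMatcher("[-->", "<--]");
        Console.WriteLine(matcher.Match("&pound;[-->Price<--]").Value); // [-->Price<--]

        // Escaping the entire pattern turns the class and the '*' into literals,
        // so it no longer matches a real fill point.
        string overEscaped = Regex.Escape(@"[-->[A-Za-z]*<--]");
        Console.WriteLine(Regex.IsMatch("[-->Price<--]", overEscaped)); // False
    }
}
```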

    BTW, your [A-Za-z] already covers both upper and lower case, so the RegexOptions.IgnoreCase option is redundant. It won't cause problems, but it's not needed. You could either remove it, or keep it and shorten the class to [A-Z].
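
    The two alternatives should behave the same for plain ASCII names. A small sketch to compare them (the field names are illustrative only):

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaseDemo
{
    // Both cases spelled out in the class, no option needed.
    public static readonly Regex ExplicitBothCases =
        new Regex(@"\[-->[A-Za-z]*<--]");

    // Upper case only in the class, with IgnoreCase covering lower case.
    public static readonly Regex UpperWithIgnoreCase =
        new Regex(@"\[-->[A-Z]*<--]", RegexOptions.IgnoreCase);

    public static void Main()
    {
        foreach (string text in new[] { "[-->Price<--]", "[-->PRICE<--]", "[-->price<--]" })
        {
            Console.WriteLine("{0}: {1} / {2}", text,
                ExplicitBothCases.IsMatch(text),
                UpperWithIgnoreCase.IsMatch(text));
        }
    }
}
```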

    Hopefully you've found this somewhat informative and easy to follow.
    Document my code? Why do you think it's called "code"?

All Replies

  • Tuesday, November 03, 2009 11:46 AM RobA2345
     
    I never got my notification that you had replied so I didn't even know!

    Anyway, just writing to thank you for responding; it is appreciated. FYI the regex I went with in the end was Regex(@"\[-->\s*(?<tag>\w*)\s*<--\]"), which takes into account the points you raised.

    Thanks.
    Rob
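
    For anyone landing on this thread later, here is a minimal sketch of how a pattern like that might be used to do the actual replacement. Only the regex comes from the post above; the Fill method, the dictionary of values and the sample template are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FillPointReplacer
{
    // Pattern from the post above; the fill point name is captured in the "tag" group.
    private static readonly Regex FillPointMatcher =
        new Regex(@"\[-->\s*(?<tag>\w*)\s*<--\]");

    // Replaces each fill point with its value; unknown names are left untouched.
    public static string Fill(string template, IDictionary<string, string> values)
    {
        return FillPointMatcher.Replace(template, match =>
        {
            string name = match.Groups["tag"].Value;
            string value;
            return values.TryGetValue(name, out value) ? value : match.Value;
        });
    }

    public static void Main()
    {
        var values = new Dictionary<string, string>
        {
            { "Name", "Rob" },
            { "Price", "9.99" }
        };

        string template = "Dear [--> Name <--], the price is &pound;[-->Price<--].";
        Console.WriteLine(Fill(template, values));
        // Dear Rob, the price is &pound;9.99.
    }
}
```

    Leaving unknown fill points in place is only one possible choice; throwing or substituting an empty string would work just as well depending on how strict the mailer should be.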