Can't understand look-ahead and look-behind assertions

  • Tuesday, October 13, 2009 7:06 AM Motaro
     

    I can't understand how look-ahead and look-behind assertions work. For instance, the MSDN documentation says:

    "Zero-width positive look-ahead assertion. Continues match only if the subexpression matches AT THIS POSITION ON THE RIGHT."
    "\w(?=\d)" What is this position?

    "Zero-width negative look-ahead assertion. Continues match only if the subexpression does not match AT THIS POSITION ON THE RIGHT."
    And an example follows: "\b(?!un)\w+\b"

    Why is the assertion on the right side of the pattern in the first case and on the left side in the second, while the documentation says "on the right"?

    Please explain to me how these features work!


    There is no knowledge that is not power.

Answers

  • Tuesday, October 13, 2009 12:06 PM xalnix
    "On the right" does not refer to where the "look-ahead" subpattern occurs in the pattern.  "On the right" refers to the characters in the text being matched that lie to the right of the regex position pointer.

    The MSDN documentation does not say much about the "current position" in Regex.  I'm not even sure of the correct terminology, but the concept is pretty straightforward.  As Regex scans your text for a match of the supplied pattern, it keeps track of its position in the string.  As the match is constructed, the position advances past the matched characters.  But with a look-ahead or any other "zero-width" assertion, the matching characters are not consumed and the position stays where it was.  This allows multiple assertions to be made on the same sequence of characters.
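
    The "position" idea can be made concrete with a short sketch. It uses Python's re module rather than .NET's Regex class (an assumption made only so the snippet is easy to run; the look-ahead syntax is the same for these patterns), and the sample string "abc3" is made up for illustration.

```python
import re

text = "abc3"  # hypothetical sample string, not from the thread

# A look-ahead on its own matches at a position but consumes nothing:
# the match starts and ends at the same index.
m = re.search(r"(?=\d)", text)
lookahead_span = m.span()

# Several assertions can inspect the same characters from the same position.
# Both look-aheads are tested at index 0, then \w+ consumes from index 0.
m2 = re.match(r"(?=\w*\d)(?=abc)\w+", text)
consumed = m2.group()

print(lookahead_span)  # (3, 3)
print(consumed)        # abc3
```

    The span (3, 3) has zero length: the assertion succeeded just before the digit, and the position did not move.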

    But, back to the two example patterns...

    \w(?=\d)

    This pattern simply says match a single word character ([a-zA-Z0-9_] in ASCII text) but only if it is immediately followed by a digit character ([0-9] in ASCII text).  Each match found will be only 1 character long, because the digit is checked but not consumed.  So some examples...

    abc3def  - only c matches because it's the only character followed by a digit (i.e., on the right)
    abc3456 - there are 4 matches here, c, 3, 4, and 5, because each is followed by a digit (to the right)
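
    The two example strings above can be checked with a small script. This is a sketch in Python's re module, assumed here as a stand-in for .NET's Regex.Matches; for these ASCII inputs the pattern behaves the same way.

```python
import re

pattern = re.compile(r"\w(?=\d)")

# Each match is one character long; the digit that follows is checked
# by the look-ahead but is not part of the match.
first = pattern.findall("abc3def")
second = pattern.findall("abc3456")

print(first)   # ['c']
print(second)  # ['c', '3', '4', '5']
```

    Note that 3, 4, and 5 can each be matched in turn precisely because the look-ahead left them unconsumed for the next match attempt.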

    \b(?!un)\w+\b
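    This pattern is a little more confusing because both the (?!un) portion of the pattern and the \w+ portion of the pattern apply to the same point in the string, i.e., they overlap.  The \b puts the position at the start of a word, (?!un) checks that the next two characters are not "un" without consuming them, and \w+ then matches the word starting from that same position.  Consider the example strings...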

    fish live under water - matches fish, live, and water
    he was totally undone over the matter - matches he, was, totally, over, the, and matter

    If it's easier to understand this way, think of there being two patterns, \b(?!un) and \b\w+\b, and that both patterns are applied to the same position in the text.  Both must succeed, but only \b\w+\b will actually consume characters.  The negative look-ahead is "zero-width".
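
    Here is a sketch of both views, again in Python's re module as an assumed stand-in for .NET's Regex. The helper starts_word_not_un is hypothetical and exists only to illustrate the "two patterns at one position" idea; the real engine evaluates the single combined pattern.

```python
import re

pattern = re.compile(r"\b(?!un)\w+\b")

words_1 = pattern.findall("fish live under water")
words_2 = pattern.findall("he was totally undone over the matter")

print(words_1)  # ['fish', 'live', 'water']
print(words_2)  # ['he', 'was', 'totally', 'over', 'the', 'matter']


def starts_word_not_un(text, pos):
    """Hypothetical helper: apply both sub-patterns at the same position."""
    guard = re.compile(r"\b(?!un)")  # zero-width: checks, consumes nothing
    word = re.compile(r"\b\w+\b")    # the part that actually consumes
    if guard.match(text, pos) is None:
        return None
    m = word.match(text, pos)
    return m.group() if m else None


sample = "fish live under water"
at_fish = starts_word_not_un(sample, 0)    # position of "fish"
at_under = starts_word_not_un(sample, 10)  # position of "under"

print(at_fish)   # fish
print(at_under)  # None
```

    At index 10 the guard fails because the next two characters are "un", so the word pattern is never tried there.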




    Les Potter, Xalnix Corporation, Yet Another C# Blog
    • Marked As Answer by Motaro Thursday, October 29, 2009 8:28 AM

All Replies

  • Thursday, October 29, 2009 8:29 AM Motaro
     
    xalnix, thanks a lot! Sorry for the late reply. Now I have a firm understanding of it. Thanks! :)
    There is no knowledge that is not power.