.NET's Regex class problem looking for words like "ca+"

  • Question

  • hi,

    i have a C#-generated regex definition that looks for any of the following forms of a word

    • calcium
    • ca
    • ca++
    • ca2+
    • ca+
    • and so on...

    it is a single regex definition that will attempt to find the above forms. though there is no problem finding "calcium " and "ca " (which just happen to be the forms without a plus (+) character), the .NET Regex class will not find any of the other forms with the plus like "ca++".

    there is no error when i pass the definition to the constructor, merely a false result from IsMatch().

    an extract of the regex that looks for "ca+" is:

     

    (?i)ca\+(?=\ )|ca\+\s


     

    here it is working in RegexBuddy

    http://i269.photobucket.com/albums/jj61/beugnen/development/regex/regexplusproblem.png

     

    the full regex is:

    (?i)(?:\b(?:ca(?!\+)|calcium(?!\+))\b)|(?:(?:ca\+(?=\ )|ca\+\s|ca\+\+(?=\ )|ca\+\+\s|ca\+2(?=\ )|ca\+2\s|ca2\+(?=\ )|ca2\+\s))

    ...again working in RegexBuddy

    http://i269.photobucket.com/albums/jj61/beugnen/development/regex/regexplusprobeminnet.png

    i know .NET is parsing it ok because "ca" and "calcium" work.

     

    here is a snippet which is failing:

    private void foo()
    {
        var regex = new Regex(@"(?i)ca\+(?=\ )|ca\+\s");

        // this assertion fails: IsMatch returns false for "ca+"
        Debug.Assert(regex.IsMatch("ca+"));
    }

    thanks in advance

     

    UPDATE: we've updated our regex to  (^|\s)ca\+(?!\S)  with great success. the only issue now is that it will return a single leading space in the match if present in the stream. otherwise it knows to exclude scenarios where words are joined together. see my reply to wolfgang below.

     


    MickyD | http://mickyd.wordpress.com/ Help others by voting my post as 'Helpful' if you think it is so.
    Friday, May 28, 2010 5:08 AM

All replies

  • Debug.Assert(regex.IsMatch("ca+ ")); //add whitespace
    • Proposed as answer by WolfgangKluge Friday, May 28, 2010 7:27 AM
    • Marked as answer by SamAgain Thursday, June 3, 2010 9:04 AM
    • Unmarked as answer by Michael A. Duncan Wednesday, July 14, 2010 11:54 PM
    • Unproposed as answer by Michael A. Duncan Wednesday, July 14, 2010 11:54 PM
    Friday, May 28, 2010 6:28 AM
  • Hi,

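    as polishchuk mentioned, both alternatives of your pattern require a whitespace character after "ca+": ca\+(?=\ ) searches for "ca+" followed by a blank, which is excluded from the match, and ca\+\s searches for "ca+" followed by any whitespace character (blank, \t \r \n \v \f \x85 and any Unicode character defined with \p{Z}), which is included in the match. The input "ca+" has nothing after the plus, so neither alternative can match.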

    Greetings,


    Wolfgang Kluge
    gehirnwindung.de
    • Marked as answer by SamAgain Thursday, June 3, 2010 9:04 AM
    • Unmarked as answer by Michael A. Duncan Thursday, July 15, 2010 12:39 AM
    Friday, May 28, 2010 7:37 AM
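
The explanation above can be checked with a minimal sketch. It uses the pattern from the question unchanged; the class name TrailingWhitespaceDemo and the helper FirstMatch are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingWhitespaceDemo
{
    // The pattern from the question, unchanged.
    private static readonly Regex Original = new Regex(@"(?i)ca\+(?=\ )|ca\+\s");

    // Hypothetical helper: returns the matched text, or null when nothing matches.
    public static string FirstMatch(string input)
    {
        Match m = Original.Match(input);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Nothing follows the plus, so neither alternative can succeed.
        Console.WriteLine(FirstMatch("ca+") == null);   // prints True

        // A blank follows: the lookahead alternative matches and leaves the blank out.
        Console.WriteLine(FirstMatch("ca+ ").Length);   // prints 3

        // A tab follows: only the \s alternative applies, and it consumes the tab.
        Console.WriteLine(FirstMatch("ca+\t").Length);  // prints 4
    }
}
```

The differing lengths show that the second alternative pulls the whitespace character into the match, while the lookahead in the first one only checks for it.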
  • Hi,

       Could you provide a full list of all the possible forms of strings you want to match?


    Please mark the right answer at right time.
    Thanks,
    Sam
    Monday, May 31, 2010 8:40 AM
  • hi sam,

    see my first post at the top of this thread.


    MickyD
    Wednesday, July 14, 2010 10:35 PM
  • hi polish',

    that helps, but sadly i don't want to have to be searching for a trailing space that may not be present in the input stream.


    MickyD
    Wednesday, July 14, 2010 11:56 PM
  • hi wolfgang,

    sorry i've had my alerts turned off and just read this now.

    your regex is quite good but we're not quite there yet. plugging it in, i get these results:

    LEADING AND TRAILING WHITESPACE IS SHOWN HERE AS UNDERSCORES (_).

    result legend: MATCH - success | MATCH - false match | FAILED to match

    ca+
    ca+ dog
    dog ca+
    ca+ dog ca+
    ca+ cat
    ca+ cat ca+
    ca+ cat ca+ dog
    ca+ cat ca+ dog ca+
    ca+ cat ca+ dog ca+ x

    ____ca+ ca+ca+ ca+ ca++ +ca+ xca+ ca+x ca+

    it seems it can't find 'ca+' when it's the only thing in a line, or when ca+ is the last thing in a line. stick a space at the end of the line and it works, but i don't want that.

    the last line is particularly bad. e.g. "ca+ca+" is wrong because that's two ca+'s joined together, and "xca+" is wrong too because i don't want the 'x'.

    i want to match "ca+" and only "ca+". it must either be the only text in the input stream or be surrounded by whitespace.

    good stuff otherwise wolfgang.

    UPDATE

    the best regex that we've come up with so far is

    (^|\s)ca\+(?!\S)

    ...unfortunately the ONLY thing wrong with it is that it returns a single leading space if present. otherwise it solves all the probs.


    MickyD
    Thursday, July 15, 2010 12:32 AM
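
The leading-space behaviour described in the update can be reproduced with a short sketch. It also shows one possible workaround, which is an assumption of this sketch and not something proposed in the thread: wrap the word in a named group and read the group instead of the whole match. The group name term and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingSpaceDemo
{
    // The pattern from the update, with the word wrapped in a named group.
    // The group name "term" is an assumption made for this sketch.
    private static readonly Regex Grouped = new Regex(@"(?:^|\s)(?<term>ca\+)(?!\S)");

    // Whole match: may include one leading whitespace character,
    // because \s consumes it.
    public static string WholeMatch(string input)
    {
        Match m = Grouped.Match(input);
        return m.Success ? m.Value : null;
    }

    // Named group: only the term itself, without the whitespace.
    public static string TermOnly(string input)
    {
        Match m = Grouped.Match(input);
        return m.Success ? m.Groups["term"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine("[" + WholeMatch("dog ca+") + "]"); // prints [ ca+]
        Console.WriteLine("[" + TermOnly("dog ca+") + "]");   // prints [ca+]
        Console.WriteLine(TermOnly("ca+ca+") == null);        // prints True
    }
}
```

Reading a group keeps the pattern's logic as it is, but every caller has to remember to use the group and not Match.Value.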
  • (^|\s+)ca\+(?!\S)


    Regards
    • Marked as answer by SamAgain Thursday, July 15, 2010 5:05 AM
    • Unmarked as answer by Michael A. Duncan Thursday, July 15, 2010 5:29 AM
    Thursday, July 15, 2010 4:07 AM
  • thanks, but that just matches more spaces, which sadly i didn't want in the first place.

     

     


    MickyD
    Thursday, July 15, 2010 5:29 AM
  • to reiterate in case i was not clear:

     

    1. i'm looking for the medical term "ca+"
    2. "ca+" could be anywhere in a line of text
    3. "ca+" must be "by itself": it must not be part of, or in the middle of, another word
    4. if there is anything before "ca+" it must be whitespace
    5. if there is anything after "ca+" it must be whitespace
    6. "ca+" may have whitespace before it
    7. "ca+" may have whitespace after it
    8. the regex if matched must return and only return "ca+". no whitespace must be returned

    e.g.

    "ca+" = ok

    "ca+ca+" = bad

    "ca++" = bad

    "_ca++" = bad

    "_ca++_" = bad

    "catdogmouseca++" = bad

    "ca++misspiggy" = bad


    MickyD
    Thursday, July 15, 2010 5:42 AM
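
The requirements and examples above can be turned into a small table-driven check that any candidate pattern is run against. This is only a sketch: the class RequirementCheck and the method FailingInputs are hypothetical names, the underscores in the examples are assumed to stand for single spaces, and the "dog ca+ cat" case is added here to cover rules 2 and 8.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RequirementCheck
{
    // Each row: the input first, then the exact matches the rules call for.
    private static readonly string[][] Cases =
    {
        new[] { "ca+", "ca+" },
        new[] { "dog ca+ cat", "ca+" },   // added for this sketch (rules 2 and 8)
        new[] { "ca+ca+" },
        new[] { "ca++" },
        new[] { " ca++" },                // "_ca++" in the post
        new[] { " ca++ " },               // "_ca++_" in the post
        new[] { "catdogmouseca++" },
        new[] { "ca++misspiggy" },
    };

    // Returns the inputs on which the candidate pattern violates the rules.
    public static List<string> FailingInputs(string pattern)
    {
        var failures = new List<string>();
        foreach (string[] row in Cases)
        {
            string input = row[0];
            string[] expected = row.Skip(1).ToArray();
            string[] actual = Regex.Matches(input, pattern)
                                   .Cast<Match>()
                                   .Select(m => m.Value)
                                   .ToArray();
            if (!actual.SequenceEqual(expected))
                failures.Add(input);
        }
        return failures;
    }

    public static void Main()
    {
        // The candidate proposed earlier in the thread: it returns the
        // leading whitespace, which rule 8 forbids.
        foreach (string input in FailingInputs(@"(^|\s+)ca\+(?!\S)"))
            Console.WriteLine("fails on: [" + input + "]");  // prints fails on: [dog ca+ cat]
    }
}
```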
  • Hi,

    try this one

    (?<=\s|^)ca\+(?=\s|$)

    Greetings,


    Wolfgang Kluge
    gehirnwindung.de
    Thursday, July 15, 2010 7:04 AM
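
A brief sketch of why this pattern returns only "ca+": both the lookbehind and the lookahead are zero-width, so they test the neighbouring characters without adding them to the match. The .NET engine accepts the variable-width lookbehind (?<=\s|^); some other regex engines do not. The class name LookaroundDemo and the helper TermPositions are hypothetical, and the inputs are taken from the test lines earlier in the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // The pattern from the reply above, unchanged.
    private static readonly Regex Term = new Regex(@"(?<=\s|^)ca\+(?=\s|$)");

    // Hypothetical helper: start index of every stand-alone "ca+" in the input.
    public static List<int> TermPositions(string input)
    {
        var positions = new List<int>();
        foreach (Match m in Term.Matches(input))
        {
            // m.Value is exactly "ca+" here; the whitespace is never consumed.
            positions.Add(m.Index);
        }
        return positions;
    }

    public static void Main()
    {
        // Three stand-alone terms; "cat" is not a match.
        Console.WriteLine(string.Join(",", TermPositions("ca+ cat ca+ dog ca+"))); // prints 0,8,16

        // Joined or decorated forms are all rejected.
        Console.WriteLine(TermPositions("ca+ca+ xca+ ca+x ca++").Count);           // prints 0
    }
}
```

The original regex in the question used (?i) for case-insensitive matching; this sketch leaves it out because the reply above does.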
  • Brilliant!! that worked perfectly. thank-you so much Wolfgang! :)

    thank-you everyone else for their help too.


    MickyD
    Thursday, July 15, 2010 8:04 AM