Regular expression to find the exact match of word in a paragraph

    Question

  •  

    Hi,

    I want to find exact matches of a word in a paragraph or sentence (for a spell check) and replace each exact match with another given word.

    Example (I have included all the conditions, even though it looks weird):

    ben bench 01ben ben05 he-ben ben, 

    ben-he ben? Ben ben <ben> ben’s (ben) ‘ben’ $ben 

    <a href=http://ben.com>ben</a> “ben” $ben  ben $ben$. 

    When I say replace "ben" with "Mike", the resulting paragraph should look like this:

    Mike bench 01ben ben05 he-ben Mike,

    ben-he Mike? Mike Mike <Mike> Mike’s (Mike) ‘Mike’ $Mike

    <a href=http://ben.com>Mike</a> “Mike” $Mike Mike $Mike$.

    I need a Regular expression to match the word and replace it.

    I tried the following, but the expression is wrong, so it didn't work:

     

    paragraph.replace(new RegExp("(^|[^[a-zA-Z]])(" + "ben" + ")([^[a-zA-Z]]|$)", "g"), '$1' + "Mike" + '$3')

    Please help. I am using .NET regex.


    Wednesday, June 22, 2011 10:40 AM

Answers

  • Hi,

    it already works with unicode characters (use café instead of ben). That's not the problem.

    The unicode-range (you have to use the correct ranges, not the full range!) is for finding characters that are allowed in front of or after the searched word.

    E.g. you can use (I write it as a built-in regex-object - it's more clear)

    /([\s,.\$\\<>‘’“”()\u0021-\u0027\u00A0\u2000-\u200A\u2028\u2029\u202F]|^)ben(?=[\s,.\$\\?<>‘’“”()\u0021-\u0027\u00A0\u2000-\u200A\u2028\u2029\u202F]|$)/g

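    There's no built-in shorthand for this because \w and \b are defined for ASCII only in JavaScript. \s is specified to include Unicode white space such as \u00A0, but some older browsers (probably) treat it as only [ \t\r\n\v\f], so I list the Unicode spaces explicitly.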
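
    As an alternative to listing the allowed neighbours, here is a sketch of a different technique: Unicode property escapes combined with lookbehind. It assumes an ECMAScript 2018 engine, which was not available when this answer was written. It inverts the test and rejects a match when a letter, digit, underscore or hyphen touches the word. The function name replaceCafe and the sample text are made up for illustration.

```javascript
// Hypothetical helper for this sketch: replaces "café" only when it stands alone.
function replaceCafe(text, newWord) {
  // \p{L} = any Unicode letter, \p{N} = any Unicode number; both need the "u" flag.
  var re = /(?<![\p{L}\p{N}_\-])café(?![\p{L}\p{N}_\-])/gu;
  // A function replacement keeps any "$" in newWord literal.
  return text.replace(re, function () { return newWord; });
}

console.log(replaceCafe("café cafés un-café café, (café)", "thé"));
// → "thé cafés un-café thé, (thé)"
```

    Punctuation, quotes and white space of any script count as boundaries here without being listed one by one.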

    Greetings,


    Wolfgang Kluge
    gehirnwindung.de
    • Marked as answer by Paul Zhou Wednesday, June 29, 2011 8:15 AM
    Friday, June 24, 2011 3:40 PM

All replies

  • But the .NET Regex class has no 'replace' function with that kind of signature.

    It looks like JavaScript to me.

    For the record, the regex for the .NET Framework would be

    using System;
    using System.Text.RegularExpressions;

    string paragraph = @"ben bench 01ben ben05 he-ben ben, 
    ben-he ben? Ben ben <ben> ben’s (ben) ‘ben’ $ben 
    <a href=http://ben.com>ben</a> “ben” $ben ben $ben$. ";
    // Whole-word match that is not adjacent to '-', '/' or '.'; IgnoreCase also catches "Ben".
    Regex regex = new Regex( @"\b(?<![-/.])ben(?![-/.])\b", RegexOptions.IgnoreCase );
    string output = regex.Replace( paragraph, "Mike" );
    Console.WriteLine( output );
    
    

    and the output of this is

    Mike bench 01ben ben05 he-ben Mike, 
    ben-he Mike? Mike Mike <Mike> Mike’s (Mike) ‘Mike’ $Mike 
    <a href=http://ben.com>Mike</a> “Mike” $Mike Mike $Mike$. 


    The Regex explained:

    \b: word boundary, a zero-width position between a word character and a non-word character (or the start/end of the string)

    (?<![-/.]): match only if the word is not preceded by '-', '/' or '.' (to avoid replacement in the URL or in words like "he-ben")

    ben: the literal word (nothing more to say here)

    (?![-/.]): match only if the word is not followed by '-', '/' or '.' (to avoid replacement in the URL or in words like "ben-he")

    \b: word boundary again, this time at the end of the word
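
    To see each part of the pattern at work, here is a small sketch in JavaScript rather than C#. It assumes an engine that implements ECMAScript 2018 lookbehind (current browsers and Node.js do; browsers from 2011 did not), and the sample sentence is made up for illustration.

```javascript
// Same pattern as the .NET example; the "i" flag plays the role of RegexOptions.IgnoreCase.
// The "/" is escaped only because this is a regex literal.
var wholeWord = /\b(?<![-\/.])ben(?![-\/.])\b/gi;

// Shortened sample text (an assumption for this sketch, not the full paragraph).
var sample = "ben bench he-ben ben, ben-he ben? Ben http://ben.com ben.";

var replaced = sample.replace(wholeWord, "Mike");
console.log(replaced);
// → "Mike bench he-ben Mike, ben-he Mike? Mike http://ben.com ben."
```

    The final "ben." stays unchanged because (?![-/.]) also rejects a following full stop.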

    Wednesday, June 22, 2011 12:43 PM
  • Thanks. Yes, I wanted it for jQuery. I tested the expression and it works in Expresso, but when I build it from the old word in jQuery, no result is written. It seems to be a syntax error. Am I missing anything when concatenating?

    <html>

    <body>

    <script type="text/javascript">

    var str = "ben bench 01ben ben05 he-ben ben, ben-he ben? Ben ben <ben> ben’s"

    + "(ben) ‘ben’ $ben <a href=http://ben.com>ben</a> “ben” $ben ben $ben$.";

    var oldWord = "ben";

    var newWord = "Mike";

    var result = str.replace(new RegExp("\b(?<![-/.])" + oldWord + "(?![-/.])\b", "g"), newWord));

    document.write(result);

    </script>

    </body>

    </html>

    Also, if "ben" is followed by a full stop at the end of a sentence, as in

    <a href=http://ben.com>ben</a> “ben” $ben ben $ben$ ben.

    then the last ben is not replaced.



    Wednesday, June 22, 2011 1:20 PM
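  • The problem is that JavaScript regex doesn't support the lookbehind group "(?<!)". Also, inside a string literal "\b" is a backspace character, so it has to be written as "\\b", and your line has one closing parenthesis too many.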

    You may try

     

    var result = str.replace(new RegExp("\\b" + oldWord + "\\b", "g"), newWord );
    

     

    but that also turns things like "he-ben" into "he-Mike".

    I don't know a better way, but maybe Wolfgang has an idea.

    Good luck.

    P.S.: be careful if "oldWord" contains characters that are special in a regex. Usually you need to escape them before you can put them into a regex.
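
    A minimal sketch of such escaping follows. JavaScript's RegExp constructor does not escape its input for you, so the helper name escapeRegExp is an assumption of this sketch, as is the sample text.

```javascript
// Hypothetical helper (not a built-in): prefixes every regex metacharacter
// with a backslash so the word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

var oldWord = "$ben$";                                 // contains the metacharacter "$"
var pattern = new RegExp(escapeRegExp(oldWord), "g");  // becomes /\$ben\$/g
var result = "pay $ben$ now".replace(pattern, "Mike");
console.log(result);
// → "pay Mike now"
```

    Without the escaping step, "$ben$" would be read as "ben" followed by an end-of-input anchor and would not match the sample text at all.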


    Wednesday, June 22, 2011 1:47 PM
  • How do we escape the special (regex) characters in the old word? Is there a method that escapes regex special characters?

     

    Wolfgang, do you have any other solution?

    Does anybody have an idea or a solution for this? Please reply.


    Wednesday, June 22, 2011 2:02 PM
  • Hi,

    what an honor ;)

    JavaScript allows (?!) and (?=), but not (?<!) and (?<=). Only lookahead, no lookbehind in JavaScript.

    I just answered the other question. This could help here, too. You have to adapt it a bit. Instead of testing against \s, you have to define all characters allowed (or all not allowed) in front of/after the word.

    e.g. \s,.\$\\\?<>‘’“”()

    var r = new RegExp("([\\s,.\\$\\\\\\?<>‘’“”()]|^)(ben)(?=[\\s,.\\$\\\\\\?<>‘’“”()]|$)", "g");
    var t = "ben bench 01ben ben05 he-ben ben, ben-he ben? Ben ben <ben> ben’s (ben) ‘ben’ $ben <a href=http://ben.com>ben</a> “ben” $ben ben $ben$.";
    
    alert(t.replace(r, "$1Mike"));
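
    The same idea can be wrapped in a reusable function. This is only a sketch: the names BOUNDARY, escapeRegExp and replaceWholeWord and the ignoreCase parameter are assumptions, and the boundary class is copied from the expression above.

```javascript
// Characters allowed directly before or after the word (same class as above).
var BOUNDARY = "[\\s,.\\$\\\\\\?<>‘’“”()]";

// Hypothetical helper: makes regex metacharacters in the word literal.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function replaceWholeWord(text, oldWord, newWord, ignoreCase) {
  var flags = ignoreCase ? "gi" : "g";
  var re = new RegExp(
    "(" + BOUNDARY + "|^)" + escapeRegExp(oldWord) + "(?=" + BOUNDARY + "|$)",
    flags
  );
  // Put the captured leading character back; using a function keeps "$" in newWord literal.
  return text.replace(re, function (match, before) {
    return before + newWord;
  });
}

var sample = "ben bench 01ben ben05 he-ben ben, ben-he ben? Ben ben";
console.log(replaceWholeWord(sample, "ben", "Mike", true));
// → "Mike bench 01ben ben05 he-ben Mike, ben-he Mike? Mike Mike"
```

    Passing true for ignoreCase is what turns "Ben" into "Mike"; with the plain "g" flag it would be left alone.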
    

    You can also define unicode blocks. See http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane for more information (this can help with the other question, too). Just use

    [\u0000-\uFFFF] to define such a block.

    Greetings,


    Wolfgang Kluge
    gehirnwindung.de

    Thursday, June 23, 2011 9:46 PM
  • Thanks Wolfgang Kluge,

    When "id.d" said it was not possible, I moved to a word-boundary (\b) approach (that's the reason I created another question thread).

    Your solution worked, but I didn't understand where and how to apply [\u0000-\uFFFF] for Unicode characters.

    Can you please modify the regex and post a version that handles words containing Unicode characters (the Unicode character can be at the start, in the middle, or at the end of the word)?

     

    Friday, June 24, 2011 12:11 PM