
How to find misspelled words with regex

  • Thursday, November 05, 2009 11:14 AM tommyolesen
     
    Hi all

    I need a RegEx that will find a specific word within a long string. The issue is that the word may be misspelled, and I need to find it anyway. I would like to tolerate a certain percentage of errors when looking for the word. For example:

    The complete string: Hello, this is my comp/et string to look at
    The word to search for: complete

    Let’s say that I wish to accept a maximum of two wrong letters in the above string; then the RegEx should match the word complete. However, if I accept only one wrong letter, it shouldn’t find it. Ideally the RegEx would also handle stray whitespace and missing letters. For example:

    The complete string: Hello, this is my com plee string to look at
    The word to search for: complete

    This should match the word as well, even though there is whitespace between ‘m’ and ‘p’ and the letter ‘t’ is missing. Is this possible at all with RegEx, or should I be looking at an alternative way to solve it?

    Thanks, Tommy

Answers

  • Thursday, November 05, 2009 1:31 PM OmegaMan (MVP, Moderator)

    Regex is not designed to be a tokenizer, and that is where it will fall short in this situation.

    Going by your example, the word common would count as a misspelling of complete, yet it is not one. In the plee example, plee has nothing to do with com; it would need its own rule to flag it as a problem. Hopefully no one will be writing about Comanches, either; that would not bring comfort to your parser.

    If you are willing to create multiple patterns and string C#/VB logic behind them to handle each situation, then sure, it can be done. But no one or two patterns will handle this.
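
To give a feel for what "multiple patterns plus logic" can look like, here is a minimal sketch. It is written in Python rather than C#/VB purely for brevity, and the function names `build_patterns` and `find_fuzzy` are made up for illustration. It assumes that a "wrong letter" means a letter that was replaced or dropped, so each relaxed position becomes `\S?` (at most one non-whitespace character). It does not cope with inserted characters such as the space in "com plee".

```python
import re
from itertools import combinations

def build_patterns(word, max_wrong):
    """Return one regex per way of relaxing up to max_wrong positions of word."""
    patterns = []
    for count in range(max_wrong + 1):
        for relaxed in combinations(range(len(word)), count):
            # A relaxed position accepts any single non-space character or nothing.
            parts = [r"\S?" if i in relaxed else re.escape(ch)
                     for i, ch in enumerate(word)]
            patterns.append("".join(parts))
    return patterns

def find_fuzzy(word, text, max_wrong):
    """Return the text matched by the first pattern that hits, or None."""
    for pattern in build_patterns(word, max_wrong):
        match = re.search(pattern, text)
        if match:
            return match.group(0)
    return None

text = "Hello, this is my comp/et string to look at"
print(len(build_patterns("complete", 2)))  # 37 patterns for a single 8-letter word
print(find_fuzzy("complete", text, 2))     # comp/et
print(find_fuzzy("complete", text, 1))     # None
```

Even for one eight-letter word with two allowed errors this generates 37 patterns, and the count grows quickly with word length and error budget.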

    Maybe this is a class assignment or your own sandbox work, and it can be done for a few words, but each word will need its own logic processing and the return on the amount of work done will not be worth it.... IMHO GL
    William Wegerson (www.OmegaCoder.Com)
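
On the question of an alternative to RegEx: one commonly used option, not something proposed in this thread, is to score candidates by edit distance (Levenshtein distance) instead of matching patterns. The sketch below is a Python illustration under that assumption; the function names `best_substring_distance` and `contains_within` are hypothetical. It computes the smallest number of single-character substitutions, insertions, and deletions needed to turn the search word into some substring of the text, then compares that with the allowed error count.

```python
def best_substring_distance(word, text):
    """Smallest Levenshtein distance between word and any substring of text."""
    # prev[i] = cost of matching the first i letters of word against the best
    # substring of text that ends just before the current character.
    prev = list(range(len(word) + 1))
    best = prev[-1]
    for ch in text:
        curr = [0]  # a match may start anywhere in text, so zero letters cost 0
        for i, w in enumerate(word, start=1):
            curr.append(min(
                prev[i - 1] + (w != ch),  # substitution, or a free exact match
                prev[i] + 1,              # extra character present in text
                curr[i - 1] + 1,          # letter of word missing from text
            ))
        best = min(best, curr[-1])
        prev = curr
    return best

def contains_within(word, text, max_errors):
    """True if some substring of text is within max_errors edits of word."""
    return best_substring_distance(word, text) <= max_errors

print(best_substring_distance("complete", "Hello, this is my comp/et string to look at"))   # 2
print(best_substring_distance("complete", "Hello, this is my com plee string to look at"))  # 2
print(contains_within("complete", "Hello, this is my comp/et string to look at", 1))        # False
```

Unlike the pattern approach, this treats an inserted space, a missing letter, and a wrong letter uniformly as one edit each, and the same loop works for any search word.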
