Regex class replace results in duplicated characters

  • Question

  • Hi There
    I'm trying to write a routine to deal with homoglyphs (characters that look similar to other characters e.g. "Å" & "A") when matching one string to another.

    My initial attempt used Regex to replace the homoglyphs with their counterparts:

    text = System.Text.RegularExpressions.Regex.Replace(text, "[ÁÀÂÄÃÅ]", "A")

    This worked fine for simple accented characters, but when I added characters with higher code points (for example "𝔸", MATHEMATICAL DOUBLE-STRUCK CAPITAL A, U+1D538), the replace produced duplicated characters.

    Here is some example code to show you what I mean:

    Debug.Print(System.Text.RegularExpressions.Regex.Replace("𝔸", "[𝔸]", "A")) ' prints "AA" instead of "A"

    Can anyone tell me why this is happening and, more importantly, how I can stop it?


    FYI, I tried the same thing using alternation ("Á|À|Â|Ä|Ã|Å|𝔸") and the performance was bad.

    Alternatively, is there a better way to achieve this for homoglyph matching?

    Thanks in advance

    Paul

    Tuesday, April 30, 2019 8:09 AM

Answers

  • Check this too:

    // Requires: using System.Globalization; using System.Linq; using System.Text;
    string example = "Some text ÁÀÂÄÃÅȘȚ áàâäãåșț 𝔸.";
    string result = string.Concat(
        example.Normalize( NormalizationForm.FormKD ).Where( c =>
        {
            var uc = char.GetUnicodeCategory( c );
            return uc != UnicodeCategory.NonSpacingMark && uc != UnicodeCategory.Control;
        } ) );
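
    For matching one string against another, the same normalization step could be wrapped in a helper. The following is only a minimal sketch: the class name HomoglyphFolding and the methods Fold and LooksEqual are hypothetical names chosen for illustration, not part of the framework. It relies on compatibility decomposition (FormKD), so it only folds characters that Unicode defines a decomposition for.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class HomoglyphFolding
{
    // Hypothetical helper: decompose (FormKD), then drop combining marks and control characters.
    public static string Fold(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormKD);
        return string.Concat(decomposed.Where(c =>
        {
            UnicodeCategory uc = char.GetUnicodeCategory(c);
            return uc != UnicodeCategory.NonSpacingMark && uc != UnicodeCategory.Control;
        }));
    }

    // Hypothetical helper: two strings "look equal" if their folded forms are identical.
    public static bool LooksEqual(string a, string b) =>
        string.Equals(Fold(a), Fold(b), StringComparison.Ordinal);

    public static void Main()
    {
        // \u00C5 = Å, \u00F6 = ö, \U0001D538 = 𝔸
        Console.WriteLine(Fold("\u00C5ngstr\u00F6m \U0001D538"));        // Angstrom A
        Console.WriteLine(LooksEqual("\u00C5ngstr\u00F6m", "Angstrom")); // True
        // \u0410 is CYRILLIC CAPITAL LETTER A; it has no decomposition to Latin "A".
        Console.WriteLine(LooksEqual("\u0410", "A"));                    // False
    }
}
```

    Two caveats with this sketch: look-alikes from other scripts (such as the Cyrillic letter in the last line) are not folded, and the Control filter also removes tabs and line breaks, which may or may not be wanted.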
    

    • Marked as answer by SignOut Tuesday, April 30, 2019 2:29 PM
    Tuesday, April 30, 2019 12:32 PM

All replies

  • Hi,

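     The length of "𝔸" is 2: .NET strings are UTF-16, and U+1D538 is stored as a surrogate pair. The character class [𝔸] therefore contains two separate code units, and each one is replaced with "A". You can try String.Replace instead: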

     Dim text As String = "𝔸"
     Console.WriteLine(text.Replace("𝔸", "A"))
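
     To illustrate why the length matters for Regex, here is a minimal C# sketch. The class and method names (SurrogateDemo, ReplaceWithClass, ReplaceWithSequence) are hypothetical and only for illustration. It shows that a character class treats the two UTF-16 halves of "𝔸" as separate members, while a pattern that lists "𝔸" outside the brackets matches it as one two-char sequence. The combined pattern has not been benchmarked, so its performance is an open question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SurrogateDemo
{
    // U+1D538 (𝔸) lies outside the Basic Multilingual Plane, so UTF-16 stores it as two chars.
    public const string DoubleStruckA = "\U0001D538";

    // \u00C1\u00C0\u00C2\u00C4\u00C3\u00C5 = ÁÀÂÄÃÅ (each fits in a single char).
    private const string AccentedClass = "[\u00C1\u00C0\u00C2\u00C4\u00C3\u00C5]";

    // Character class: each surrogate half is its own class member, so each half is replaced.
    public static string ReplaceWithClass(string text) =>
        Regex.Replace(text, "[" + DoubleStruckA + "]", "A");

    // Single-char homoglyphs stay in the class; the surrogate pair is matched as a sequence.
    public static string ReplaceWithSequence(string text) =>
        Regex.Replace(text, AccentedClass + "|" + DoubleStruckA, "A");

    public static void Main()
    {
        Console.WriteLine(DoubleStruckA.Length);                          // 2
        Console.WriteLine(char.IsSurrogatePair(DoubleStruckA, 0));        // True
        Console.WriteLine(ReplaceWithClass(DoubleStruckA));               // AA
        Console.WriteLine(ReplaceWithSequence("\u00C5-" + DoubleStruckA)); // A-A
    }
}
```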

    Best Regards,

    Alex


    Tuesday, April 30, 2019 8:55 AM
  • Hi Viorel

    Many thanks, this seems to be perfect for what I am trying to achieve.

    Paul

    Tuesday, April 30, 2019 2:30 PM