Replacing national characters in a string

  • Question

  • Hello,

    I'm trying to replace national characters in names with an acceptable transliteration.
    We have "funny" characters in Sweden (as many other countries do) which don't work in an international environment, so we need to translate them using a predefined translation table.

    This means that I need to translate, for example:
    Ö with OE
    Ä with AE
    Å with AA

    (In all, the translation table has around 25-30 characters.)

    I would like a quick and simple solution for this translation, maybe by using regex?

    Many, many years ago I saw a function which replaced characters with other characters. I can't find it anymore, but I have recreated it the way I think it worked.
    This function works as expected, but is it kind to computer resources, and is it quick enough?

        Function ReplaceNationalCharacters(ByVal inValue As String, ByVal ParamArray replacechars() As String) As String
    
            For i As Integer = 0 To UBound(replacechars) Step 2
                inValue = Replace(inValue, replacechars(i), replacechars(i + 1))
    
            Next
    
            Return inValue
    
        End Function
    When I run it on a name or word like this,
    Dim result As String =  ReplaceNationalCharacters("Smörgåsbär", "ö","oe","ä","ae","å","aa")
    ...I get "Smoergaasbaer" from "Smörgåsbär", which is correct.

    But.... is there a better, faster method?

    Thanks in advance
    Nils
    Thursday, May 7, 2009 2:13 PM

All replies

  • Hi,

    Just for information, the .NET Framework already has classes for similar operations.

    For example, here is a function that removes all diacritics from a string in PowerShell:

    function Remove-Diacritics([string]$String)
    {
        $objD = $String.Normalize([Text.NormalizationForm]::FormD)
        $sb = New-Object Text.StringBuilder
     
        for ($i = 0; $i -lt $objD.Length; $i++) {
            $c = [Globalization.CharUnicodeInfo]::GetUnicodeCategory($objD[$i])
            if($c -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
              [void]$sb.Append($objD[$i])
            }
          }
     
        return("$sb".Normalize([Text.NormalizationForm]::FormC))
    }

    Grégory Schiro - PowerShell & MOF
    Thursday, May 7, 2009 2:27 PM
  • Hi Grégory,

    I think your example only strips the dots and umlauts, right? (I had never heard the word "diacritics" *smile*)
    This means that "ö" is translated to "o", which I can't use.

    As an example, we have two very common last names in Sweden, "Jönsson" and "Jonsson" (like "Smith" in the US).

    I can't translate "Jönsson" to "Jonsson" since that is a real, valid spelling of another name.
    The result has to be distinguishable from the other name, which means I need to see "Joensson" as the non-national form.
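
    To make the collision concrete, here is a minimal C# sketch of the same normalization approach as the PowerShell function above. The class name DiacriticsDemo and the method name RemoveDiacritics are made up for illustration; only the Normalize and GetUnicodeCategory calls come from the framework.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsDemo
{
    // Decompose to FormD, drop the combining marks, recompose to FormC.
    public static string RemoveDiacritics(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        // Both names come out identical, so the distinction is lost.
        Console.WriteLine(RemoveDiacritics("Jönsson"));
        Console.WriteLine(RemoveDiacritics("Jonsson"));
    }
}
```

    Both lines print Jonsson, which is why stripping diacritics cannot replace a translation table here.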

    Thanks anyway for trying to help me
    Nils
    Thursday, May 7, 2009 3:38 PM
  • I think your routine is probably faster and better than Regex, but here is a cool little example to show off some of the Regex flexibility (using a Lambda for the MatchEvaluator)...

        string[] sources = {
            "smörgåsbär",
            "Jönsson"
        };
        Dictionary<char, string> lookup = new Dictionary<char, string>();
        lookup.Add('ö', "oe");
        lookup.Add('Ö', "OE");
        lookup.Add('ä', "ae");
        lookup.Add('Ä', "AE");
        lookup.Add('å', "aa");
        lookup.Add('Å', "AA");
        foreach (string source in sources)
        {
            string result = Regex.Replace(source, "[öÖäÄåÅ]", (param) =>
                {
                    return lookup[param.Value[0]];
                });
            Console.WriteLine("{0}", result);
        }

    Les Potter, Xalnix Corporation, Yet Another C# Blog
    • Edited by xalnix Thursday, May 7, 2009 8:50 PM typo
    Thursday, May 7, 2009 8:49 PM
  • Hi,
    I think you can use Regex for your requirement. Performance will be good.
    Please use the following code:
    string value = Regex.Replace(stringInput, "ö", "oe");
    and so on for each character.

    -- Thanks Ajith R [Mark as Answer if it is Helpful.]
    Friday, May 8, 2009 8:41 AM
  • Hello Ajith,

    Yes, I can use Regex.Replace, but just like my example this would require a separate call for each replacement.

    I wish I had something like Public Function Replace(szValue As String, arrSearchArray, arrReplacementArray) and thought (hoped) there was something out there.

    I'm more and more inclined to think that the method I described in my question is the closest I can get, even if it makes an individual call for each replacement.

    Thanks
    Nils
    Friday, May 8, 2009 12:21 PM
    After thinking about it a little, the basic approach I described above might perform better than your first approach (you should compare to be sure). The reason is that your method scans the whole string once for every national character you might expect to find. My approach should make only one pass and then look up only the characters it actually finds in the input string. Of course, you should set up the lookup dictionary only once for the life of the application.

    Here's another bit of code that improves on my first; just add all the character translations you'd like. This example works with any single-character to multi-character-string translation, not just your specific character set...

                ConvertChars convert = new ConvertChars();  // do this once somewhere
                string[] sources = {
                        "smörgåsbär",
                        "Jönsson"
                         };
                foreach (string source in sources)
                    Console.WriteLine("{0} {1}", source, convert.Convert(source));
    
    ...
    
        public class ConvertChars
        {
            private Dictionary<char, string> _lookup = new Dictionary<char, string>();
            private string _pattern = "";
    
            public ConvertChars()
            {
                _lookup.Add('ö', "oe");
                _lookup.Add('Ö', "OE");
                _lookup.Add('ä', "ae");
                _lookup.Add('Ä', "AE");
                _lookup.Add('å', "aa");
                _lookup.Add('Å', "AA");
                // add any others here
    
                _pattern = "[";
                foreach (char key in _lookup.Keys)
                    _pattern += key;
                _pattern += "]";           
            }
            public string Convert(string input)
            {
                return Regex.Replace(input, _pattern, (param) => _lookup[param.Value[0]]);
            }
        }
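
    The single-pass idea does not strictly need Regex. As a minimal sketch, assuming the same kind of lookup table, one can walk the string once and append either the replacement or the original character to a StringBuilder. The class name TableReplacer and its members are hypothetical, and whether this is actually faster than the Regex version would have to be measured.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TableReplacer
{
    // Illustrative table only; extend it with the rest of the 25-30 characters.
    private static readonly Dictionary<char, string> Lookup = new Dictionary<char, string>
    {
        { 'ö', "oe" }, { 'Ö', "OE" },
        { 'ä', "ae" }, { 'Ä', "AE" },
        { 'å', "aa" }, { 'Å', "AA" }
    };

    // One pass over the input: replace characters found in the table, copy the rest.
    public static string Convert(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length + 8);
        foreach (char c in input)
        {
            string replacement;
            if (Lookup.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Convert("smörgåsbär")); // smoergaasbaer
        Console.WriteLine(Convert("Jönsson"));    // Joensson
    }
}
```

    This assumes the input uses precomposed characters (a single char for "ö"). If the data may arrive in decomposed form, calling input.Normalize(NormalizationForm.FormC) first would be one way to handle it.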
    

    Les Potter, Xalnix Corporation, Yet Another C# Blog
    Friday, May 8, 2009 12:58 PM
  • yuck... *smile*

    Isn't it enough with one language? Do I have to learn additional programming languages...? *smile*

    ok, ok... I gave it a try and actually it works! ....but how can I determine which is the fastest...???

    This is my VB ConvertChars class
    Imports System.Collections.Generic
    Imports System.Text.RegularExpressions
    
    Public Class ConvertChars
        Private lookup As New Dictionary(Of String, String)
        Private pattern As String = "["
    
        Sub New()
            With lookup
                .Add("Ö", "OE")
                .Add("Ä", "AE")
                .Add("Å", "AA")
            End With
    
            For Each key As String In lookup.Keys
                pattern += key
    
            Next
            pattern += "]"
    
        End Sub
    
        Public Function Convert(ByVal inValue As String) As String
            Dim regex As New Regex(Me.pattern)
            Dim eval As New MatchEvaluator(AddressOf ReplaceChar)
            Return regex.Replace(inValue, eval)
    
        End Function
    
        Private Function ReplaceChar(ByVal m As Match) As String
            Return lookup(m.ToString)
        End Function
    
    End Class
    ...and then I call it from another module (initiating the class only once!)
        Sub Main()
            Dim test As New ConvertChars
            Dim res As String = test.Convert("SMÖRGÅSBORD")
    
        End Sub
    Slightly different...
    I didn't understand how to translate this C# line:

    return Regex.Replace(input, _pattern, (param) => _lookup[param.Value[0]]);

    But.... is it faster, better, more friendly (it's more code lines....)

    Cheers
    Nils
    Tuesday, May 12, 2009 3:23 PM
  • (param) => _lookup[param.Value[0]]

    ... is a lambda expression, available in C# 3.0 (.NET Framework 3.5) and later. Your implementation without a lambda is probably just as good. To see whether your first approach performs better or worse than your last approach, generate some test data and run both approaches on it. Record the time at the start and at the end and display the duration (the difference). Unless you are processing a lot of data, you may see no significant difference. Depending upon how Regex implements Replace, you may or may not see a performance gain over your original method.
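
    The timing comparison could be sketched roughly like this, using System.Diagnostics.Stopwatch. The class ReplaceTiming, its method names, the sample string and the iteration count are all assumptions made for illustration, and the C# String.Replace loop only stands in for the VB Replace loop from the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class ReplaceTiming
{
    // Shortened, illustrative table shared by both approaches.
    private static readonly Dictionary<char, string> Lookup = new Dictionary<char, string>
    {
        { 'ö', "oe" }, { 'ä', "ae" }, { 'å', "aa" }
    };
    private static readonly Regex Pattern = new Regex("[öäå]");

    // Original approach: one String.Replace pass per table entry.
    public static string ByRepeatedReplace(string input)
    {
        foreach (KeyValuePair<char, string> pair in Lookup)
            input = input.Replace(pair.Key.ToString(), pair.Value);
        return input;
    }

    // Regex approach: one pass, with a dictionary lookup per match.
    public static string ByRegex(string input)
    {
        return Pattern.Replace(input, m => Lookup[m.Value[0]]);
    }

    // Runs the converter repeatedly and returns the elapsed milliseconds.
    public static long Time(Func<string, string> convert, string input, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            convert(input);
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        const string sample = "smörgåsbär Jönsson";
        const int iterations = 100000;
        // Warm up both paths so JIT compilation is not part of the measurement.
        ByRepeatedReplace(sample);
        ByRegex(sample);
        Console.WriteLine("Replace loop: {0} ms", Time(ByRepeatedReplace, sample, iterations));
        Console.WriteLine("Regex:        {0} ms", Time(ByRegex, sample, iterations));
    }
}
```

    The numbers depend on the machine, the runtime version and the shape of the data, so it is worth running this against realistic names and the full translation table rather than a short sample.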


    Les Potter, Xalnix Corporation, Yet Another C# Blog
    Tuesday, May 12, 2009 3:44 PM