array.sort messes up UTF8 encoding of strings

  • Question

  • Hi,

    I need to sort a one-dimensional array of about 400,000 strings.

    These strings are in UTF-8 because they contain French characters such as é, è and à.

    When performing an array.sort a string like "Elément d'enquête" becomes "Elément d'enquête" ...

    I tried changing the culture info and so on, but with no effect...

    Any help/tip would be strongly appreciated.

    Chris.

    Hi again,

    thx for the replies so far...

    the code is very simple: 

    Dim Records() As String = Split(File.ReadAllText(Datafile, System.Text.Encoding.UTF8), vbCrLf)
    Array.Sort(Records)

    I need to read the text file with "System.Text.Encoding.UTF8" because otherwise the strings are messed up from the start...

    examples of strings are:

    ANMQAY;Elément d'enquête - AF: Vol qualifié;TMM007;Contrôle approfondi - rédiger AI

    ANMIIC;AF: Perte d'objet parfaitement identifié;TMM016;Consulter commentaire

    • Edited by Chris Dr Tuesday, May 30, 2017 1:09 PM
    Tuesday, May 30, 2017 11:27 AM

Answers

  • Without seeing the code used it is hard to know what the issue is.  This test shows that the data is not changed.

            Dim a() As String = {"3Elément d'enquête", "2Elément d'enquête", "1Elément d'enquête"}
            Array.Sort(a)

    • Marked as answer by Chris Dr Tuesday, May 30, 2017 1:47 PM
    Tuesday, May 30, 2017 11:52 AM
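
The test above can be turned into an explicit check. The following is only a sketch: the module and function names are made up for illustration, and it assumes that "not changed" means the sorted array holds exactly the same strings, compared ordinally, as the original.

```vb
Imports System
Imports System.Linq

Public Module SortIntegrityCheck
    ' Hypothetical helper: returns True when Array.Sort changed only the
    ' order of the strings and not their contents.
    Public Function SortPreservesContent(original() As String) As Boolean
        Dim sorted() As String = CType(original.Clone(), String())
        Array.Sort(sorted)

        ' Put both arrays in the same ordinal order, then compare element by element.
        Dim a = original.OrderBy(Function(s) s, StringComparer.Ordinal)
        Dim b = sorted.OrderBy(Function(s) s, StringComparer.Ordinal)
        Return a.SequenceEqual(b, StringComparer.Ordinal)
    End Function

    Public Sub Main()
        Dim records() As String = {"3Elément d'enquête", "2Elément d'enquête", "1Elément d'enquête"}
        Console.WriteLine(SortPreservesContent(records)) ' Prints: True
    End Sub
End Module
```

Running the same check on the array read from the real data file would show whether the strings are already damaged before the sort.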
  • Chris,

    Using the standard sample on MSDN there is no problem, so you are probably doing something unusual that we can't see.

    Imports System.Collections
    Public Module Example
        Public Sub Main
            ' Create and initialize a new array.
            Dim words() As String = {"Elément", "d'enquête", "é", "è", "à"}
            ' Instantiate a new custom comparer.
            Dim revComparer As New ReverseComparer()
    
            ' Display the values of the array.
            Console.WriteLine("The original order of elements in the array:")
            DisplayValues(words)
    
            ' Sort a section of the array using the default comparer.
            Array.Sort(words, 1, 3)
            Console.WriteLine("After sorting elements 1-3 by using the default comparer:")
            DisplayValues(words)
    
            ' Sort a section of the array using the reverse case-insensitive comparer.
            Array.Sort(words, 1, 3, revComparer)
            Console.WriteLine("After sorting elements 1-3 by using the reverse case-insensitive comparer:")
            DisplayValues(words)
    
            ' Sort the entire array using the default comparer.
            Array.Sort(words)
            Console.WriteLine("After sorting the entire array by using the default comparer:")
            DisplayValues(words)
    
            ' Sort the entire array by using the reverse case-insensitive comparer.
            Array.Sort(words, revComparer)
            Console.WriteLine("After sorting the entire array using the reverse case-insensitive comparer:")
            DisplayValues(words)
            Console.ReadKey()
        End Sub
    
        Public Sub DisplayValues(arr() As String)
            For i As Integer = arr.GetLowerBound(0) To arr.GetUpperBound(0)
                Console.WriteLine("   [{0}] : {1}", i, arr(i))
            Next
            Console.WriteLine()
        End Sub
    End Module
    Public Class ReverseComparer : Implements IComparer
        ' Call CaseInsensitiveComparer.Compare with the parameters reversed.
        Function Compare(x As Object, y As Object) As Integer _
                     Implements IComparer.Compare
            Return New CaseInsensitiveComparer().Compare(y, x)
        End Function
    End Class


    Success
    Cor

    • Marked as answer by Chris Dr Tuesday, May 30, 2017 1:47 PM
    Tuesday, May 30, 2017 12:07 PM
  • When you mention UTF8 I suspect you are using some Bytes somewhere in there...

    Why don't you show the code where the Strings are sorted?

    As the other comments showed there is no localization problem with Array.Sort, so it's probably somewhere else!

    • Marked as answer by Chris Dr Tuesday, May 30, 2017 1:48 PM
    Tuesday, May 30, 2017 12:13 PM
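
To illustrate the suspicion about bytes: one common way accented characters get garbled is that UTF-8 bytes are decoded with a different encoding somewhere along the way. The sketch below reproduces that effect on purpose. ISO-8859-1 is used only as an example of a wrong encoding, the function name is made up, and nothing here claims this is what happened to the original file.

```vb
Imports System
Imports System.Text

Public Module MojibakeDemo
    ' Encodes the text as UTF-8, then decodes those bytes with the wrong
    ' encoding (ISO-8859-1) to show how the garbling arises.
    Public Function DecodeUtf8AsLatin1(text As String) As String
        Dim utf8Bytes() As Byte = Encoding.UTF8.GetBytes(text)
        Return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes)
    End Function

    Public Sub Main()
        Dim original As String = "Elément d'enquête"
        Dim garbled As String = DecodeUtf8AsLatin1(original)

        ' Each accented letter is two bytes in UTF-8, so it becomes two characters.
        Console.WriteLine(garbled.Length - original.Length) ' Prints: 2

        ' Reversing the wrong step restores the text, because ISO-8859-1 maps bytes one to one.
        Dim repaired As String = Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(garbled))
        Console.WriteLine(repaired = original) ' Prints: True
    End Sub
End Module
```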
  • ALL,

    thx very much for your input!

    Further analysis of our input file (based on your suggestions) shows that it is NOT Array.Sort that causes the problem. The real cause must be somewhere in the code that produces the input file...

    So, i will keep on looking...

    Thx again!

    • Marked as answer by Chris Dr Tuesday, May 30, 2017 1:51 PM
    Tuesday, May 30, 2017 1:51 PM
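
When the input file itself is the suspect, one possible first check is whether its bytes are valid UTF-8 at all. The sketch below uses a strict `UTF8Encoding` that throws on invalid byte sequences. The function name is made up, and the file path is taken from a hypothetical command-line argument.

```vb
Imports System
Imports System.IO
Imports System.Text

Public Module Utf8Validation
    ' Returns True when the bytes decode as UTF-8 without any invalid sequence.
    Public Function IsValidUtf8(data() As Byte) As Boolean
        ' The second argument (throwOnInvalidBytes) makes the decoder raise an
        ' exception instead of silently substituting U+FFFD.
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            strictUtf8.GetString(data)
            Return True
        Catch ex As DecoderFallbackException
            Return False
        End Try
    End Function

    Public Sub Main(args() As String)
        ' args(0) is a hypothetical path to the data file to check.
        If args.Length > 0 Then
            Console.WriteLine(IsValidUtf8(File.ReadAllBytes(args(0))))
        End If
    End Sub
End Module
```

A result of True only rules out invalid bytes. Text that was garbled earlier and then saved again as UTF-8 is still valid UTF-8, so it would pass this check.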

All replies

  • I know it's been answered, but I thought I'd chime in since something like this happened to me. In my case it was Unicode files that came from the source with non-printing characters like ChrW(&H200B).

    The string below contains that character between "8" and "9". In edit mode, moving the cursor with the arrow keys acts oddly: it stops between "8" and "9".

    12345678​90 

    Copying and pasting it into Notepad does the same thing here.
    Wednesday, May 31, 2017 6:17 AM
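
A rough way to locate such characters in code is sketched below. The list of code points treated as invisible is an assumption and not exhaustive, the module and function names are made up, and `AscW`/`ChrW` come from the Microsoft.VisualBasic namespace that VB projects import by default.

```vb
Imports System
Imports System.Collections.Generic

Public Module InvisibleCharacters
    ' Code points treated as invisible here (assumed list, not exhaustive):
    ' zero width space, zero width non-joiner, zero width joiner, byte order mark.
    Private ReadOnly Invisible As Integer() = {&H200B, &H200C, &H200D, &HFEFF}

    ' Returns the zero-based positions of control or invisible characters in the text.
    Public Function FindInvisible(text As String) As List(Of Integer)
        Dim positions As New List(Of Integer)
        For i As Integer = 0 To text.Length - 1
            Dim c As Char = text(i)
            If Char.IsControl(c) OrElse Array.IndexOf(Invisible, AscW(c)) >= 0 Then
                positions.Add(i)
            End If
        Next
        Return positions
    End Function

    Public Sub Main()
        Dim sample As String = "12345678" & ChrW(&H200B) & "90"
        For Each pos As Integer In FindInvisible(sample)
            Console.WriteLine("U+{0:X4} at index {1}", AscW(sample(pos)), pos)
        Next
        ' Prints: U+200B at index 8
    End Sub
End Module
```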
  • Yep, you have an unprintable character in there.

    Let's see how to get to it with the code:

                        ' Requires Imports System.IO for StreamReader.
                        Dim oneLine As String = ""
                        Using reader As StreamReader = New StreamReader(theFileName)
                            ' Read the first line from the file (empty string if the file is empty)
                            oneLine = If(reader.ReadLine(), "")
                        End Using
                        ' Display the raw string in the RichTextBox
                        RichTextBox1.AppendText("Raw string from File:" & Environment.NewLine)
                        RichTextBox1.AppendText(oneLine & Environment.NewLine)
    
                        ' Replace every character that is not a letter, digit or whitespace with its code
                        Dim interpretedString As String = ""
                        For Each c As Char In oneLine
                            If Char.IsLetterOrDigit(c) OrElse Char.IsWhiteSpace(c) Then
                                interpretedString &= c
                            Else
                                ' AscW gives the Unicode code point; Asc goes through the ANSI
                                ' code page and would not show the real code for ChrW(&H200B)
                                interpretedString &= "ChrW(" & AscW(c) & ")"
                            End If
                        Next
                        RichTextBox1.AppendText("Interpreted:" & Environment.NewLine)
                        RichTextBox1.AppendText(interpretedString & Environment.NewLine)

    It's actually very useful when you are doing communication between devices: control characters (EOT, ETX, SOH...) are not printable, but you still need to write them to trace files...



    Wednesday, May 31, 2017 8:11 AM
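
For the trace-file case mentioned above, a variation is to write the standard ASCII mnemonics instead of numeric codes. This is only a sketch: the angle-bracket notation and the module and function names are assumptions chosen for illustration.

```vb
Imports System
Imports System.Text

Public Module TraceFormatting
    ' Standard ASCII mnemonics for code points 0 to 31.
    Private ReadOnly ControlNames As String() = {
        "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
        "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
        "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
        "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US"}

    ' Replaces ASCII control characters with a readable <NAME> marker.
    Public Function ToTraceString(raw As String) As String
        Dim sb As New StringBuilder()
        For Each c As Char In raw
            Dim code As Integer = AscW(c)
            If code < 32 Then
                sb.Append("<").Append(ControlNames(code)).Append(">")
            ElseIf code = 127 Then
                sb.Append("<DEL>")
            Else
                sb.Append(c)
            End If
        Next
        Return sb.ToString()
    End Function

    Public Sub Main()
        Console.WriteLine(ToTraceString(ChrW(1) & "HDR" & ChrW(2) & "DATA" & ChrW(3) & ChrW(4)))
        ' Prints: <SOH>HDR<STX>DATA<ETX><EOT>
    End Sub
End Module
```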