Some Unicode characters (big endian) are not handled properly at debug/run time, although they can be typed in the editor (VS 2015 Community Edition)

  • Question

  • Hi,      (VS 2015 Community Edition, MFC)

     I have some Unicode characters in a file. I was using VS 2005, but it did not support all Unicode characters, so I moved to Visual Studio 2015 Community Edition. There I can type all of the Unicode characters manually in the editor (Unicode big endian). I store these characters in an array (wchar_t arr[500] or a new allocation). Some elements of that array cannot be updated with certain Unicode characters, while others can, as described below.

    𠀐亙𠀃𠀃亙亙𠀐𠀐Val𪛕𨕥

    I stored the same characters in a file. I can type them manually in the VS 2015 editor, but at debug time some of them give a wrong result (𠀐, 𠀃, 𪛕, 𨕥), so I cannot use these characters in checks (if ... else).

    e.g. if(chArr[for_count] == L'𠀐') // always gives the wrong result

    The other characters work without problems. (I have already called the _wsetlocale function as well.)

    At the same time, I want to write those problem characters to a file after checking them. So I would like to know whether a compiler update, an editor update, or a newer VS version would help, so that I can move forward.

    Regards,

    Satheesh

    /********************************************************************/

    wcstring is a wchar_t array that contains 𠀐亙𠀃𠀃亙亙𠀐𠀐Val𪛕𨕥

    System::Text::Encoding^ encodingWr = System::Text::Encoding::BigEndianUnicode;
    StreamWriter^ writer = gcnew StreamWriter("Converted.txt", true, encodingWr );
    //String^ line = reader->ReadLine();

    for(int ct = 0; ct< ctTot; ct++)
    {

        int ln = wcstring[ct]; // correct number

        wh = /*(wchar_t)*/ wcstring[ct]; //wrong

        str.Format(_T("UNNUM %d %lc"), ln, wh);

            /* https://docs.microsoft.com/en-us/cpp/text/how-to-convert-between-various-string-types?view=vs-2017*/
            // Convert a wide character CStringW to a
            // System::String.
            String ^systemstringw = gcnew String(str);
            //systemstringw += " (System::String)";
            //Console::WriteLine("{0}", systemstringw);
            //delete systemstringw;

        writer->WriteLine(systemstringw);
            delete systemstringw;

        OutputDebugString(str);

    }

    /********************************************************************/

    These characters come through correctly: 亙, V, a, l. I can print them and get their values.

    These do not: 𠀐, 𠀃, 𠀃, 𠀐, 𠀐, 𪛕, 𨕥

    I need to get the value of every one of these characters and also print them. I also need to change some Unicode characters in the array while the program runs, and at print time the output is not correct.

    • Edited by satheesh_in Tuesday, February 12, 2019 3:11 PM codes
    Tuesday, February 12, 2019 1:48 PM

Answers

  • Well, the characters that you are having problems with are encoded as surrogate pairs, because, surprise, UTF-16 is a variable-length encoding.

    So your:

    if(chArr[for_count] == L'𠀐')

    is comparing a single two-byte array element against a 4-byte character (two UTF-16 code units, stored little endian in memory). This comparison is also giving you a warning:

    1>c:\users\archa\source\repos\meh\meh\main.cpp(4): warning C4066: characters beyond first in wide-character constant ignored

    telling you that the character needs more than one wchar_t element to hold it. If you don't see this warning then you must have disabled warnings; it is a level 3 warning, so it is visible by default.

    This is verifiable using the following code:

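    #include <cstdio>
    #include <cwchar>	// wprintf
    
    int wmain()
    {
    	// One character in the source, but two wchar_t elements plus the terminator.
    	wchar_t st[] = L"𠀐";
    
    	// sizeof yields size_t, so print it with %zu rather than %d.
    	wprintf(L"Array size %zu\n", sizeof(st)/sizeof(st[0]));
    	wprintf(L"Element 1 0x%04hx\n", st[0]);	// high surrogate
    	wprintf(L"Element 2 0x%04hx\n", st[1]);	// low surrogate
    	wprintf(L"Element 3 0x%04hx\n", st[2]);	// null terminator
    
    	return 0;
    }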

    this gives the following output:

    Array size 3
    Element 1 0xd840
    Element 2 0xdc10
    Element 3 0x0000

    So this tells you that the array has 3 elements: the first two together encode one character and the last is the null terminator. If you then take the surrogate pair, with high surrogate 0xd840 and low surrogate 0xdc10, and convert it into the proper code point, you get:

    0xd840 - 0xd800 = 0x040 (10 bit number)

    0xdc10 - 0xdc00 = 0x010 (10 bit number)

    concatenate them together:

    00 0100 0000 (0x040 from the high surrogate)

    00 0001 0000 (0x010 from the low surrogate)

    00 0100 0000 00 0001 0000

    regrouping gives:

    0001 0000 0000 0001 0000 (20 bit number)

    0x10010

    Then add 0x10000:

    0x10010 + 0x10000 = 0x20010

    This gives the Unicode codepoint U+20010.
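
    The arithmetic above can be sketched as a small function. This is only an illustration: the name combine_surrogates is made up for this example, it uses the standard char16_t and char32_t types instead of wchar_t, and it assumes the two inputs really are a high and a low surrogate (no validation).

```cpp
// Combine a UTF-16 surrogate pair into one Unicode code point.
// Assumes `high` is in 0xD800..0xDBFF and `low` is in 0xDC00..0xDFFF;
// this sketch does not validate its input.
inline char32_t combine_surrogates(char16_t high, char16_t low)
{
    const char32_t high_bits = static_cast<char32_t>(high) - 0xD800; // upper 10 bits
    const char32_t low_bits  = static_cast<char32_t>(low)  - 0xDC00; // lower 10 bits
    return ((high_bits << 10) | low_bits) + 0x10000;                 // add the offset
}
```

    Calling combine_surrogates(0xD840, 0xDC10) returns 0x20010, matching the hand calculation.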

    So just comparing a single array element like you are doing will fail.
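
    One way to make such a comparison work is to encode the code point you are looking for into its UTF-16 code units and compare all of them at once. The following is a rough sketch, not a definitive implementation: encode_utf16 and code_point_at are hypothetical helper names, the text is held in a std::u16string rather than the wchar_t array from the question, and the code point is assumed to be a valid Unicode scalar value.

```cpp
#include <cstddef>
#include <string>

// Encode one code point as UTF-16: one code unit for the BMP,
// a surrogate pair for U+10000 and above.
// Assumes cp is a valid Unicode scalar value (not checked here).
inline std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;                            // 20-bit value
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}

// True if the code point `cp` starts at code-unit index `i` of `text`.
inline bool code_point_at(const std::u16string& text, std::size_t i, char32_t cp)
{
    if (i > text.size()) return false;
    const std::u16string units = encode_utf16(cp);
    return text.compare(i, units.size(), units) == 0;
}
```

    With this, a check such as code_point_at(text, i, 0x20010) compares both code units of the pair, whereas chArr[i] == L'𠀐' only ever looks at one.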

    I suggest you read up properly on UTF-16 and surrogates, and actually look up where these characters are before you carry on trying to work on text like this.


    This is a signature. Any samples given are not meant to have error checking or show best practices. They are meant to just illustrate a point. I may also give inefficient code or introduce some problems to discourage copy/paste coding. This is because the major point of my posts is to aid in the learning process.

    • Edited by Darran Rowe Tuesday, February 12, 2019 6:03 PM
    • Marked as answer by satheesh_in Thursday, March 7, 2019 6:10 AM
    Tuesday, February 12, 2019 5:59 PM
  • In order to enumerate characters, consider a special enumerator:


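       // Assumes: using namespace System; using namespace System::Globalization;
       String ^ s = gcnew String( L"𠀐亙𠀃𠀃亙亙𠀐𠀐Val𪛕𨕥" );

       // Each text element is one character as the user sees it, even when
       // it occupies two UTF-16 code units (a surrogate pair).
       TextElementEnumerator ^ te = StringInfo::GetTextElementEnumerator( s );
       while( te->MoveNext() )
       {
          String ^ element = te->GetTextElement();
          Console::WriteLine( element->Length ); // 2 for a surrogate pair, 1 for a single BMP character
       }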

    Instead of comparing wchar_t values, you can compare the returned substrings:


       if( element == L"𠀐" )
       {
           . . .
       }

    To write the file:


       File::WriteAllText( "UTF8.txt", s, Encoding::UTF8 );
       File::WriteAllText( "Unicode.txt", s, Encoding::Unicode );
       File::WriteAllText( "BigEndianUnicode.txt", s, Encoding::BigEndianUnicode );

    Then these files can be viewed in Visual Studio and Word.
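
    For code that does not use .NET, a big-endian UTF-16 file can also be produced by hand. The following is a minimal sketch under stated assumptions: to_utf16be_bytes and write_utf16be_file are hypothetical names, the text is held in a std::u16string, and the file starts with the byte order mark FE FF (a choice made for this example).

```cpp
#include <fstream>
#include <string>
#include <vector>

// Serialize UTF-16 code units as big-endian bytes, preceded by the
// byte order mark FE FF. Writing a BOM is an assumption of this sketch.
inline std::vector<unsigned char> to_utf16be_bytes(const std::u16string& text)
{
    std::vector<unsigned char> bytes;
    bytes.reserve(2 + text.size() * 2);
    bytes.push_back(0xFE);
    bytes.push_back(0xFF);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<unsigned char>(unit >> 8));   // high byte first
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF)); // then low byte
    }
    return bytes;
}

// Write the bytes to `path` in binary mode; returns false on failure.
inline bool write_utf16be_file(const std::string& path, const std::u16string& text)
{
    const std::vector<unsigned char> bytes = to_utf16be_bytes(text);
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(bytes.data()),
              static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

    Surrogate pairs need no special handling here: each code unit is written high byte first, so U+20010 comes out as the bytes D8 40 DC 10 after the BOM.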




    • Edited by Viorel_MVP Tuesday, February 12, 2019 7:05 PM
    • Marked as answer by satheesh_in Thursday, March 7, 2019 6:10 AM
    Tuesday, February 12, 2019 7:02 PM

All replies

  • Thank you for the reply. I had wondered before why there was a '\0' at the end; now I understand exactly what it means. I had not imagined any of this and was thinking about it differently: I assumed that, whether the text is LE or BE, the compiler handles everything in the background and the user never sees it, that is, something taking more than one element is still treated as a single character. The compiler evidently does a lot of tedious work internally.

    From your explanation I now see how much is going on behind the scenes. I really appreciate your knowledge and experience. I understand that this is a difficult task.

    So my idea is to build each character as a string and collect those strings in a string array, then take the elements as strings and compare them. It may fail, but I will try. Once I have a result I will post it here.

    My aim is to keep it simple, even if it is complicated in the background.

    1. Load a Unicode array (UTF-16 LE/BE or UTF-8) from a Unicode file.

    2. Change elements of that array to my own Unicode values (LE/BE/8).

    3. Sometimes repeat step 2 with other manually initialized Unicode values in the same encoding.

    4. Write the array values to a new file.

    5. Repeat step 1 on the new file without losing any of what I saved.

    That is all.
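
    Treating each character as a short string, as proposed above, could look roughly like the following. This is only a sketch: replace_all is a hypothetical helper, the text is assumed to be well-formed UTF-16 in a std::u16string, and reading or writing the file is left out.

```cpp
#include <cstddef>
#include <string>

// Replace every occurrence of `from` with `to`. Each argument is a
// character held as a string, so it may be one or two code units long.
// Returns the number of replacements made.
inline std::size_t replace_all(std::u16string& text,
                               const std::u16string& from,
                               const std::u16string& to)
{
    if (from.empty()) return 0;
    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::u16string::npos) {
        text.replace(pos, from.size(), to); // the text may grow or shrink here
        pos += to.size();                   // continue after the inserted text
        ++count;
    }
    return count;
}
```

    Because a replacement can change the number of code units, a fixed-size array such as wchar_t arr[500] is awkward for step 2; a resizable string avoids overwriting half of a surrogate pair.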

    Tuesday, February 12, 2019 7:39 PM
  • Hi,

        Thank you for the reply; this one is also valuable and I really appreciate it. I did not know about this and was struggling with the elements of a CStringW object because some characters take more bytes. With ASCII it is easy, and Unicode would be easy too if it did not depend on LE/BE.

    Both replies were helpful to me. From the first I learned, in practice, about the byte length of characters in LE. I had not thought about that before; it was something hidden from me until now.

    And now, with your reply, it becomes even easier for me. Let me try it in practice.

    This is the first time I have seen TextElementEnumerator. I think it will work better.

    Time well spent :)

    This is what I am aiming for:

    1. Load a Unicode array (UTF-16 LE/BE or UTF-8) from a Unicode file.

    2. Change elements of that array to my own Unicode values (LE/BE/8).

    3. Sometimes repeat step 2 with other manually initialized Unicode values in the same encoding.

    4. Write the array values to a new file.

    5. Repeat step 1 on the new file without losing any of what I saved.

    Once I have it working exactly, I will post it here.

    Tuesday, February 12, 2019 7:54 PM