How to store and read in foreign characters to use in generating file names?

  • Question

  • Hi all,

    I am about to undertake my second automated test program attempt.  I can write VB code to launch my application, and create and save documents by a name.

    One of the things I have to test is that we can handle saving and opening documents that use foreign characters.  So Chinese, Cyrillic, Japanese, etc.

    I am suspecting that I cannot just save these in Notepad and then read them in and use them.  :)

    I assume I will have to use Unicode of some sort.

    My plan is simply to have a "text file" of various key words and numbers in various languages, read them in, create and save documents with them, and then open the documents.

    Can someone point me in a direction to get started?

    Thanks,

    Steve

    Thursday, July 19, 2018 8:05 PM

Answers

  • I am suspecting that I cannot just save these in Notepad and then read them in and use them.  :)

    Yes, you can, if you are careful. When you choose Save As in Notepad, look at the Encoding dropdown at the bottom of the dialog. If it is set to ANSI, you will only be able to save characters in the default code page for your version of Windows. But if you change it to one of the Unicode options, such as UTF-8, you will be able to save Cyrillic, Chinese, Japanese, and so on.

    If you process the file using Visual Basic, you can use all the characters as long as you are careful to specify an encoding that matches what is in the file when you open it. For instance, if you open the file using a StreamReader, be sure to specify the Encoding like this:

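    Dim sr As New StreamReader(filename, System.Text.Encoding.UTF8)

    UTF-8 happens to be the default for StreamReader, but naming it makes the intent explicit, and the same overload accepts any other encoding, for example System.Text.Encoding.Unicode for a file saved as UTF-16.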
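
    As a rough illustration of the above, here is a minimal sketch that reads a word list line by line. The file name words.txt and the module name are assumptions made for the example, not anything from the original program.

    ```vbnet
    Imports System.IO
    Imports System.Text

    Module ReadWordList
        Sub Main()
            ' Hypothetical path; point this at the file saved from Notepad.
            Dim wordListPath As String = "words.txt"

            ' Name the encoding explicitly so it matches how the file was saved.
            ' Swap in Encoding.Unicode if the file was saved as UTF-16.
            Using sr As New StreamReader(wordListPath, Encoding.UTF8)
                While Not sr.EndOfStream
                    Dim word As String = sr.ReadLine()
                    ' Report the length rather than the text, so the result
                    ' does not depend on what the console can display.
                    Console.WriteLine(word.Length.ToString() & " UTF-16 code units read")
                End While
            End Using
        End Sub
    End Module
    ```

    The Using block closes the reader even if an exception is thrown partway through the file.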

    Note that the preceding affects the content of the file, not the file name. The name is taken from the String that you pass, and Strings in .NET are always Unicode (UTF-16), so the foreign characters will appear in the file name without doing anything special.
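
    To make that concrete, the following sketch writes a few files whose names come straight from Strings. The folder name and the sample keywords are assumptions chosen for illustration, and File.WriteAllText only stands in for whatever the application under test does when it saves a document.

    ```vbnet
    Imports System.IO
    Imports System.Text

    Module UnicodeFileNames
        Sub Main()
            ' Hypothetical output folder under the user's temp directory.
            Dim folder As String = Path.Combine(Path.GetTempPath(), "unicode-name-test")
            Directory.CreateDirectory(folder)

            ' Sample keywords; in a real test these would come from the word list.
            Dim keywords As String() = {"Привет", "日本語", "中文"}

            For Each keyword As String In keywords
                ' The String is passed through unchanged as the file name.
                Dim fullPath As String = Path.Combine(folder, keyword & ".txt")

                ' The encoding argument governs the file's content only.
                File.WriteAllText(fullPath, keyword, Encoding.UTF8)

                Console.WriteLine(keyword & ".txt exists: " & File.Exists(fullPath).ToString())
            Next
        End Sub
    End Module
    ```

    The source file itself must be saved in a Unicode encoding for the literals to survive compilation.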

    EDIT: Note that things only work well if you stay within Unicode's Basic Multilingual Plane (the range whose code points fit in 16 bits). If you go beyond that, some things start breaking. For instance, I have seen problems where someone was saving the file names in a database and one of the names contained a broccoli emoji (U+1F966). This was lost when reading back from the database unless the collation in the database was set to one of the "wide" options that store 32-bit characters instead of 16-bit ones.
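
    The 16-bit boundary can be seen directly in .NET, where a code point above U+FFFF occupies two Char values (a surrogate pair). The sketch below shows this for the broccoli emoji; the module and variable names are only illustrative.

    ```vbnet
    Module SurrogatePairs
        Sub Main()
            ' U+1F966 lies outside the BMP, so .NET stores it as two Char values.
            Dim broccoli As String = Char.ConvertFromUtf32(&H1F966)

            Console.WriteLine(broccoli.Length)                             ' 2
            Console.WriteLine(Char.IsSurrogatePair(broccoli, 0))           ' True
            Console.WriteLine(Char.ConvertToUtf32(broccoli, 0) = &H1F966)  ' True

            ' A BMP character such as U+4E2D needs only one Char.
            Dim bmpChar As String = ChrW(&H4E2D)
            Console.WriteLine(bmpChar.Length)                              ' 1
        End Sub
    End Module
    ```

    Any code or storage layer that treats each 16-bit unit as a whole character can split or drop such a pair, which is one way a name like this gets lost.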


    Sunday, July 22, 2018 9:20 AM

All replies

  • Thanks, Alberto!

    I have my program mostly working.  In fact, it works, it's just that the software I'm testing is so robust it is hard to find any failures to verify that my program will catch failures!  :)

    A couple of things I will point out for future readers:

    If you want to use Console.WriteLine to output the characters, you will need to tell the console to use UTF-8 for its output.  You can do this with this command:

    Console.OutputEncoding = System.Text.Encoding.UTF8

    Also, even though you output UTF-8, the font in your CMD window may not support displaying the characters correctly.  If you launch Command Prompt (CMD), right-click the icon in the upper-left-hand corner, and click Properties, you can change the font.  I discovered that the NSimSun font appears to display all of the foreign characters I am testing.
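
    Put together, a minimal sketch of the console setup might look like this. The sample strings are assumptions for illustration, and whether the glyphs actually render still depends on the console font as described above.

    ```vbnet
    Module ConsoleUtf8
        Sub Main()
            ' Switch the console to UTF-8 before writing any non-ASCII text.
            Console.OutputEncoding = System.Text.Encoding.UTF8

            ' Sample strings; unsupported glyphs show up as boxes or question marks.
            Console.WriteLine("Cyrillic: Привет")
            Console.WriteLine("Japanese: 日本語")

            ' Print the active code page to confirm the setting took effect.
            Console.WriteLine("Code page now in use: " & Console.OutputEncoding.CodePage.ToString())
        End Sub
    End Module
    ```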

    As Alberto said, Notepad supports UTF-8 output; you just have to do a Save As and change the default encoding.  Notepad++ also supports UTF-8 output.

    I am simply reading this in as a string as follows:

    ' fileReader is the StreamReader opened on the UTF-8 word list.
    While Not fileReader.EndOfStream
        Dim inputLine As String = fileReader.ReadLine()
        Console.WriteLine("Input String: " & inputLine)

        ' [process the string code]

    End While

    This so far has been a pretty easy exercise.

    Steve

    Thursday, July 26, 2018 4:06 PM