How to read and write a text file without screwing up the encoding

  • Question

  • I am writing a program that reads in a text file, makes some modifications, and then writes the modified version to a new text file.

    The problem is that any non-English characters in the text file get turned into gibberish. I know this has to do with the encoding of the text files, but I can't figure out what I am supposed to do to keep this from happening.

    Monday, January 16, 2006 12:45 AM

Answers

  • Sounds like you are having a problem with the byte order mark (BOM) disappearing when you write out the file.  The problem is that when you read in from the StreamReader, the BOM is used to interpret the characters in the file, but is not stored explicitly, so it will not get re-output when you write the lines back out.  There is a section on Unicode and Encoding at http://www.personalmicrocosms.com/html/dotnettips.html which gives some more details.

    Thanks,
    Luke Hoban
    Visual C# IDE Program Manager

    Tuesday, January 17, 2006 7:49 PM
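
    One way to act on the answer above is to ask the StreamReader which encoding it detected from the BOM and hand that same encoding to the StreamWriter. The following is a minimal sketch, not code from the thread: the class and method names and the `fallback` parameter are made up for illustration. `fallback` is an assumption about what to use when the file has no BOM (for example an ANSI file); the reader cannot guess that case on its own.

```csharp
using System.IO;
using System.Text;

static class EncodingPreservingCopy
{
    // Hypothetical helper (not from the thread): copies inputPath to outputPath
    // line by line, writing with the encoding the reader detected from the BOM.
    // 'fallback' is used when the input has no BOM; choosing it is up to the caller.
    public static Encoding CopyPreservingEncoding(string inputPath, string outputPath, Encoding fallback)
    {
        // Third argument 'true' turns on detection from byte order marks.
        using (var reader = new StreamReader(inputPath, fallback, true))
        {
            // Detection only happens on the first read, so peek before
            // asking for CurrentEncoding.
            reader.Peek();
            Encoding detected = reader.CurrentEncoding;

            // Writing with the detected encoding makes the writer emit that
            // encoding's preamble (BOM), if it has one.
            using (var writer = new StreamWriter(outputPath, false, detected))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    writer.WriteLine(line); // real code would modify the line here
                }
            }
            return detected;
        }
    }
}
```

    A call might look like `EncodingPreservingCopy.CopyPreservingEncoding(file_in, file_out, Encoding.Default)`. Note that `Encoding.Default` is the system ANSI code page on the .NET Framework but UTF-8 on .NET Core and later, so for ANSI files containing Japanese text an explicit code page (probably 932, Shift-JIS) may be the better fallback. `WriteLine` also normalizes line endings to `Environment.NewLine`.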

All replies

  • What class do you use to save your updates to the text file? Better yet, can you provide some code so we can see what is causing the problem?

    -chris
    Monday, January 16, 2006 6:44 AM
  • Here is my code for opening and writing to the files:

    // file_in and file_out are strings holding the input and output paths.
    System.IO.StreamReader filein = new System.IO.StreamReader(file_in);
    System.IO.StreamWriter fileout = new System.IO.StreamWriter(file_out);
    string line_in, line_out;

    while ((line_in = filein.ReadLine()) != null)
    {
        line_out = line_in;
        fileout.WriteLine(line_out);
    }

    filein.Close();
    fileout.Close();

    In my real code, line_out is set to a string that is returned from a function, but this simplified version produces the same garbled output with non-English characters.

    The input text files can be expected to use a variety of text encodings. Going by what Notepad reports, I have seen some in ANSI, some in Unicode, and some in UTF-8.
    Strangely enough, even when the input file is UTF-8, this code will not output the file properly. A friend of mine analyzed the input and output files for me, and told me that the output file was missing 3 bytes at the very beginning of the file that were present in the input file.

    The most common non-English characters in these files are Japanese, if that matters.
    Monday, January 16, 2006 6:36 PM
  • And ignore any missing ';' in the code above. When I try to edit the post, it doesn't actually show anything for me to edit...
    Monday, January 16, 2006 6:38 PM
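
    The 3 missing bytes mentioned above match the size of the UTF-8 byte order mark, EF BB BF. A small sketch can confirm this by writing the same text with and without a BOM and inspecting the first bytes of each file. The class and method names here are hypothetical, made up for this illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDemo
{
    // Writes one line of Japanese text as UTF-8, with or without a BOM.
    public static void WriteSample(string path, bool withBom)
    {
        // UTF8Encoding(true) emits the EF BB BF preamble; UTF8Encoding(false) does not.
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(withBom)))
        {
            writer.WriteLine("\u65E5\u672C\u8A9E");
        }
    }

    // Returns the first 'count' bytes of a file (fewer if the file is shorter).
    public static byte[] FirstBytes(string path, int count)
    {
        byte[] all = File.ReadAllBytes(path);
        byte[] head = new byte[Math.Min(count, all.Length)];
        Array.Copy(all, head, head.Length);
        return head;
    }
}
```

    With `withBom` set to true the file starts with EF BB BF; with false it starts directly with the encoded text. The single-argument `new StreamWriter(path)` constructor behaves like the second case, since it writes UTF-8 without a BOM.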