Write to file -- Chinese characters

  • Question

  • Dear all,

    I don't know why, when I write Chinese characters to a text file, the file ends up containing the garbled characters below. Do you know how to solve this?

    Wednesday, February 15, 2012 2:50 AM

Answers

  • Hi Raymond Chiu,

      When we use System.IO.StreamReader to read a text file that contains Chinese characters, the result can be garbled text (StreamWriter has a similar problem). This happens because the encoding of the file differs from the encoding used by the StreamReader or StreamWriter.
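      The mismatch can be reproduced without touching a file. The sketch below is only an illustration: the class and method names are hypothetical, and ISO-8859-1 stands in for "some other encoding" because it is available on every .NET runtime.

```csharp
using System.Text;

public static class MismatchDemo
{
    // Encodes text with one encoding, then decodes the same bytes with another.
    // This is what happens when a writer and a reader disagree on the encoding.
    public static string Reinterpret(string text, Encoding writeWith, Encoding readWith)
    {
        byte[] bytes = writeWith.GetBytes(text);
        return readWith.GetString(bytes);
    }
}
```

      With matching encodings, `MismatchDemo.Reinterpret("中文", Encoding.UTF8, Encoding.UTF8)` returns the original two characters. Decoding the same UTF-8 bytes with `Encoding.GetEncoding("iso-8859-1")` returns six unrelated characters, one per byte.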

      To solve this problem, we need to know the encoding of the text file, so that we can create a StreamReader or StreamWriter with the matching encoding. Reading or writing with the correct encoding avoids the garbled text. In practice, when a text file is generated with an encoding that differs from the system default (on a Chinese system the default character set is GB2312), a Byte Order Mark (BOM) specific to that encoding should be written at the beginning of the file. It plays a role similar to the "MZ" header of a PE file.

      For an introduction to encodings and the BOM, see for example http://en.wikipedia.org/wiki/Code_page.

    Encoding               BOM value
    UTF-8                  EF BB BF
    UTF-16 big endian      FE FF
    UTF-16 little endian   FF FE
    UTF-32 big endian      00 00 FE FF
    UTF-32 little endian   FF FE 00 00
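
      The values in this table can be checked against the framework itself, because every Encoding reports its BOM through GetPreamble(). A minimal sketch; the class and method names are hypothetical:

```csharp
using System;
using System.Text;

public static class BomTable
{
    // Returns the BOM of an encoding as space-separated hex bytes.
    // An encoding that emits no BOM yields an empty string.
    public static string BomHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble()).Replace("-", " ");
    }
}
```

      For example, `BomTable.BomHex(Encoding.UTF8)` returns "EF BB BF" and `BomTable.BomHex(Encoding.BigEndianUnicode)` returns "FE FF". Big-endian UTF-32 has no static property on Encoding, so it is created with `new UTF32Encoding(true, true)`.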


      You can use the following code to implement the principle above:

    using System;
    using System.Text;
    using System.IO;
    namespace Farproc.Text
    {
        /// <summary>
        /// Detects the encoding of a text file from its BOM.
        /// </summary>
        public static class TxtFileEncoding
        {
            
    
            public static Encoding GetEncoding(string fileName)
            {
                return GetEncoding(fileName, Encoding.Default);
            }
            
    
            public static Encoding GetEncoding(FileStream stream)
            {
                return GetEncoding(stream, Encoding.Default);
            }
            
            
            public static Encoding GetEncoding(string fileName, Encoding defaultEncoding)
            {
                // Open read-only and dispose the stream even if detection throws.
                using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
                {
                    return GetEncoding(fs, defaultEncoding);
                }
            }
            
            /// <returns>The encoding indicated by the BOM, or defaultEncoding if no BOM is found.</returns>
            public static Encoding GetEncoding(FileStream stream, Encoding defaultEncoding)
            {
                Encoding targetEncoding = defaultEncoding;
                if(stream != null && stream.Length >= 2)
                {
                
                    byte byte1 = 0;
                    byte byte2 = 0;
                    byte byte3 = 0;
                    byte byte4 = 0;
                    
                    // Remember the caller's position so it can be restored afterwards.
                    long origPos = stream.Position;
                    stream.Seek(0, SeekOrigin.Begin);
                    
                    int nByte = stream.ReadByte();
                    byte1 = Convert.ToByte(nByte);
                    byte2 = Convert.ToByte(stream.ReadByte());
                    if(stream.Length >= 3)
                    {
                        byte3 = Convert.ToByte(stream.ReadByte());
                    }
                    if(stream.Length >= 4)
                    {
                        byte4 = Convert.ToByte(stream.ReadByte());
                    }
                    
    
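                    // The 4-byte BOMs must be checked first, because the
                    // UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
                    if(stream.Length >= 4 && byte1 == 0xFF && byte2 == 0xFE && byte3 == 0x00 && byte4 == 0x00)//UTF-32 LE
                    {
                        targetEncoding = Encoding.UTF32;
                    }
                    else if(stream.Length >= 4 && byte1 == 0x00 && byte2 == 0x00 && byte3 == 0xFE && byte4 == 0xFF)//UTF-32 BE
                    {
                        targetEncoding = new UTF32Encoding(true, true);
                    }
                    else if(byte1 == 0xFE && byte2 == 0xFF)//UTF-16 BE
                    {
                        targetEncoding = Encoding.BigEndianUnicode;
                    }
                    else if(byte1 == 0xFF && byte2 == 0xFE)//UTF-16 LE
                    {
                        targetEncoding = Encoding.Unicode;
                    }
                    else if(stream.Length >= 3 && byte1 == 0xEF && byte2 == 0xBB && byte3 == 0xBF)//UTF-8
                    {
                        targetEncoding = Encoding.UTF8;
                    }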
                      
                    stream.Seek(origPos, SeekOrigin.Begin);
                }
                return targetEncoding;
            }
        }
    } 

      You can then use the TxtFileEncoding class to detect the encoding of a text file and use the result when reading or writing it.

        string fileName = @"e:\a.txt";

        // write
        StreamWriter sw = new StreamWriter(fileName, false, Encoding.BigEndianUnicode);
        sw.Write("this is String");
        sw.Close();

        // read
        Encoding fileEncoding = TxtFileEncoding.GetEncoding(fileName, Encoding.GetEncoding("GB2312"));
        Console.WriteLine("This txt file encoding is: " + fileEncoding.EncodingName);
        StreamReader sr = new StreamReader(fileName, fileEncoding);

        Console.WriteLine("This txt file content is: " + sr.ReadToEnd());
        sr.Close();
        Console.ReadLine();
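
      As an alternative to detecting the BOM by hand, StreamReader can do it itself when its detectEncodingFromByteOrderMarks argument is true. The sketch below assumes that built-in detection is sufficient for your files; the class name, method name, and out parameter are hypothetical.

```csharp
using System.IO;
using System.Text;

public static class BomAwareReader
{
    // Reads a whole text file. If the file starts with a BOM that StreamReader
    // recognizes, that encoding is used; otherwise the fallback is used.
    public static string ReadAllText(string path, Encoding fallback, out Encoding used)
    {
        using (StreamReader reader = new StreamReader(path, fallback, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding reflects the detected encoding only after a read.
            used = reader.CurrentEncoding;
            return text;
        }
    }
}
```

      For files without a BOM, the fallback passed in (for example Encoding.GetEncoding("GB2312") on a Chinese system) decides how the bytes are decoded, just as defaultEncoding does in TxtFileEncoding.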
    

    I hope it will help you resolve your problem.

    Sincerely,

    Jason Wang

    orichisonic http://blog.csdn.net/orichisonic



    • Edited by orichisonic Thursday, February 16, 2012 6:08 AM
    • Marked as answer by Lie You Tuesday, February 21, 2012 2:51 AM
    Thursday, February 16, 2012 5:44 AM

All replies

  • TextWriter tw = new StreamWriter(fileName, false, Encoding.Unicode);

    Wednesday, February 15, 2012 3:10 AM
  • Check this link:

    http://phrogram.com/forums/p/1661/5641.aspx

    Wednesday, February 15, 2012 5:56 AM
  • This is a very useful link, and I am sure you will find an answer there:

    http://stackoverflow.com/questions/336781/how-to-read-a-chinese-text-file-from-c

    Thanks

    Wednesday, February 15, 2012 8:02 AM
  • TextWriter tw = new StreamWriter(path, false, System.Text.Encoding.UTF8);
    int line = 0;

    while (line < printStringArray.Length)
    {

          tw.WriteLine(this.getPrintLine(printStringArray[line]));
          line++;
    }
    tw.Close();
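
    When all lines are already in an array, the same result can probably be had more compactly with File.WriteAllLines. This is a sketch under that assumption; the class and method names are hypothetical. Encoding.UTF8 writes the EF BB BF BOM at the start of the file.

```csharp
using System.IO;
using System.Text;

public static class Utf8LineFile
{
    // Writes the lines as UTF-8 with a BOM, replacing any existing file.
    public static void Write(string path, string[] lines)
    {
        File.WriteAllLines(path, lines, Encoding.UTF8);
    }

    // Reads the lines back with the same encoding that wrote them.
    public static string[] Read(string path)
    {
        return File.ReadAllLines(path, Encoding.UTF8);
    }
}
```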


    Peter Koueik

    • Proposed as answer by Peter Koueik Saturday, February 18, 2012 6:36 PM
    Wednesday, February 15, 2012 9:21 AM