Encoding files

  • Question

  • Hi all,

    I have a file, File1.txt, and several questions about its encoding:

    1. How can I detect the encoding of File1.txt?

    2. What encoding does the data have when I read the file with File.ReadAllBytes?

    3. What encoding does the data have when the file is embedded as a binary resource and read through a generated property such as:

    internal static byte[] file1 {
        get {
            object obj = ResourceManager.GetObject("file1", resourceCulture);
            return ((byte[])(obj));
        }
    }

     


    Thanks in advance

     

    Tuesday, May 19, 2009 7:37 AM

Answers

  • Byte order mark    Description
    EF BB BF           UTF-8
    FF FE              UTF-16, little endian
    FE FF              UTF-16, big endian
    FF FE 00 00        UTF-32, little endian
    00 00 FE FF        UTF-32, big endian
    2B 2F 76           UTF-7 (when followed by one of the bytes 38, 39, 2B or 2F)

    Table 1: Unicode Signature Byte Sequences
    Byte Sequence                    Encoding
    FE FF                            UTF-16BE
    FF FE (not followed by 00 00)    UTF-16LE
    00 00 FE FF                      UTF-32BE
    FF FE 00 00                      UTF-32LE
    EF BB BF                         UTF-8
    0E FE FF                         SCSU
    FB EE 28                         BOCU-1 (U+FEFF must be removed after conversion)
    2B 2F 76 38 2D or                UTF-7 (only the first sequence can be removed before conversion;
    2B 2F 76 38 or                   otherwise U+FEFF must be removed after conversion)
    2B 2F 76 39 or
    2B 2F 76 2B or
    2B 2F 76 2F
    DD 73 66 73                      UTF-EBCDIC

    Detecting a Unicode signature is only the simplest part of heuristic encoding detection; a complete detector must also recognize legacy encodings, which carry no signature.

    Note: Windows and .NET use UTF-16 with little-endian byte order internally (Encoding.Unicode in .NET).
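
    Questions 2 and 3 both hand you a plain byte[]: File.ReadAllBytes and a byte[] resource property return the raw bytes of the file without decoding them, so the encoding still has to be detected from the data. A minimal sketch of applying the signature table to such an array might look like the following. The class and method names (BomSniffer, DetectBom, Decode) are made up for illustration, only the five common signatures are covered, and a file without a BOM falls back to whatever encoding the caller supplies.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a leading BOM, or null when no known BOM is present.
    public static Encoding DetectBom(byte[] data)
    {
        // UTF-32 LE (FF FE 00 00) is tested before UTF-16 LE (FF FE)
        // because the shorter signature is a prefix of the longer one.
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Encoding.UTF32;
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return new UTF32Encoding(true, true);
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return Encoding.UTF8;
        if (StartsWith(data, 0xFF, 0xFE)) return Encoding.Unicode;
        if (StartsWith(data, 0xFE, 0xFF)) return Encoding.BigEndianUnicode;
        return null;
    }

    // Decodes the bytes to a string, skipping the BOM when one is found
    // and using the caller's fallback encoding otherwise.
    public static string Decode(byte[] data, Encoding fallback)
    {
        Encoding enc = DetectBom(data);
        if (enc == null) return fallback.GetString(data);
        int skip = enc.GetPreamble().Length;
        return enc.GetString(data, skip, data.Length - skip);
    }

    private static bool StartsWith(byte[] data, params byte[] signature)
    {
        if (data == null || data.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

    For example, BomSniffer.Decode(File.ReadAllBytes("File1.txt"), Encoding.Default) would cover question 2, and passing the byte[] from the resource property would cover question 3. Note the known ambiguity: a UTF-16 LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32 LE by signature alone.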

    Tuesday, May 19, 2009 10:19 AM

All replies

  • Hi,



    // Requires: using System.IO; using System.Text;
    public static Encoding GetFileEncoding(string srcFile)
    {
        // Detect the byte order mark, if any; otherwise assume the system default.
        Encoding enc = Encoding.Default;
        byte[] buffer = new byte[4];
        int read;
        using (FileStream file = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
        {
            read = file.Read(buffer, 0, buffer.Length);
        }

        if (read >= 3 && buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF)
            enc = Encoding.UTF8;
        // UTF-32 LE must be tested before UTF-16 LE: both start with FF FE.
        else if (read >= 4 && buffer[0] == 0xFF && buffer[1] == 0xFE && buffer[2] == 0 && buffer[3] == 0)
            enc = Encoding.UTF32;
        else if (read >= 2 && buffer[0] == 0xFF && buffer[1] == 0xFE)
            enc = Encoding.Unicode;
        else if (read >= 2 && buffer[0] == 0xFE && buffer[1] == 0xFF)
            enc = Encoding.BigEndianUnicode;
        // UTF-7 signature is 2B 2F 76 ("+/v").
        else if (read >= 3 && buffer[0] == 0x2B && buffer[1] == 0x2F && buffer[2] == 0x76)
            enc = Encoding.UTF7;

        return enc;
    }

    I referred to this link and modified the code a bit: http://www.west-wind.com/WebLog/posts/197245.aspx (Rick Strahl)

    I think the buffer values in the link above were incorrect, but I'm not sure. The changed code works for me. Give it a shot.
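
    As an alternative to comparing the bytes by hand, StreamReader can do the BOM check itself when its detectEncodingFromByteOrderMarks argument is true. The sketch below is only an illustration: the class and method names (ReaderDetection, DetectWithStreamReader) are made up, and the fallback encoding is simply returned unchanged when the file has no BOM, so this does not recognize legacy encodings.

```csharp
using System.IO;
using System.Text;

public static class ReaderDetection
{
    // Lets StreamReader inspect the BOM; returns the fallback when no BOM is found.
    public static Encoding DetectWithStreamReader(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            // CurrentEncoding is only updated after the first read,
            // so Peek() forces the reader to examine the start of the file.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

    Calling ReaderDetection.DetectWithStreamReader("File1.txt", Encoding.Default) would then stand in for the manual comparison, at the cost of covering only the signatures StreamReader knows about.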

    Thanks,
    Jagadish Krishnan

    • Proposed as answer by liurong luo Thursday, May 21, 2009 11:36 AM
    Tuesday, May 19, 2009 10:13 AM
  • Thanks for the info. Yes, I understood that from the replies to the post on the other web site. I have yet to work through byte order marks properly; I just gave you a pointer so that you could take it from there. Thanks again, Jagadish Krishnan
    Tuesday, May 19, 2009 10:28 AM