ISO-8859-2 to UTF8 conversion problems

  • Question

  • Hi,

    I'm writing an IMAP client that receives e-mails written in many languages, so I want to convert everything to UTF-8. Everything is fine with quoted-printable and base64, but I have problems with the 7bit and 8bit content-transfer-encodings when the charset is ISO-8859-2. Maybe my conversion methods are wrong. It would be great if somebody could take a look.

    Here is my code:

    SslStream secureStream;
    StreamReader r = new StreamReader(secureStream);
    
    string line = r.ReadLine();
    
    Encoding enc = Encoding.GetEncoding("iso-8859-2");
    
    UTF8Encoding utf8 = new UTF8Encoding();
    byte[] src = enc.GetBytes(line);
    byte[] dst = Encoding.Convert(enc, utf8, src);
    line = utf8.GetString(dst);

    Tuesday, August 17, 2010 2:18 PM

Answers

  • Hi,
    what you wrote cannot work, the problem being that there is an extra hidden conversion.

    What you get off an SslStream is a stream of bytes. When you perform a ReadLine on your StreamReader, that stream of bytes must be converted to a string. This happens according to whatever encoding the StreamReader is using: by default it's UTF-8, but it can also decode little-endian and big-endian Unicode as long as the stream contains the right byte order marks. Due to how UTF-8 works, ASCII will also work.

    At this point, you have a Unicode string, which may contain garbage if the input wasn't in one of the encodings your StreamReader could use. All the steps you perform afterwards will either cancel out or produce more junk.

    If you want to decode from iso-8859-2, you may try to set the encoding of your StreamReader before you attempt to read anything:

    SslStream secureStream; // assumed to be connected and authenticated already
    Encoding enc = Encoding.GetEncoding("iso-8859-2");
    StreamReader r = new StreamReader(secureStream, enc); // apply the correct encoding
    string line = r.ReadLine();

    // line is already in Unicode here, no need to mess with that any further.
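
    To see the difference without a live connection, here is a minimal sketch that feeds the same raw bytes through a StreamReader with two different encodings. The class and method names are made up for illustration, and a MemoryStream stands in for the SslStream. The RegisterProvider call is an assumption about the runtime: .NET Core and .NET 5+ need it for iso-8859-2, while .NET Framework has the code page built in.

```csharp
using System;
using System.IO;
using System.Text;

static class DecodeDemo
{
    // Hypothetical helper: decodes the first line of raw bytes with the given charset.
    public static string ReadFirstLine(byte[] raw, string charset)
    {
        // Needed on .NET Core / .NET 5+ for code pages such as iso-8859-2;
        // .NET Framework ships them built in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        using (var reader = new StreamReader(new MemoryStream(raw), Encoding.GetEncoding(charset)))
        {
            return reader.ReadLine();
        }
    }
}
```

    With the bytes BF 65 62 79 0D 0A ("żeby" plus CRLF in ISO-8859-2), ReadFirstLine(raw, "iso-8859-2") returns "żeby", while ReadFirstLine(raw, "utf-8") returns a replacement character (U+FFFD) followed by "eby", because a lone BF is not valid UTF-8. Once that replacement has happened, the original byte is gone, so no later conversion can bring it back.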

    HTH
    --mc

    • Proposed as answer by Louis.fr Tuesday, August 17, 2010 11:13 PM
    • Marked as answer by eagle-eagle Wednesday, August 18, 2010 6:48 AM
    Tuesday, August 17, 2010 3:15 PM

All replies

  • Thanks a lot!
    Wednesday, August 18, 2010 6:49 AM
  • Hi again,

    I ran into another problem. I create the StreamReader once and use it to read data from the IMAP server (Gmail in my case). The problem appears when e-mails are in different encodings. When I create the StreamReader with 'utf-8' and try to read mail encoded in ISO-8859-2, I get question marks instead of special characters. Once the StreamReader is created I can't change its encoding. The charset is written in the e-mail header, so I don't know which encoding to use from the beginning. Any ideas how to solve this?

    Monday, August 23, 2010 12:59 PM
  • It depends on what you are trying to do. The main problem here is that a StreamReader buffers its reads from the underlying stream, so it's not trivial to synchronize the position of the StreamReader in order to swap StreamReaders "on the fly". Another common issue is that closing or disposing a StreamReader usually closes the underlying stream as well, and this is definitely something you don't want to happen.

    So, we need to decouple the incoming stream (the SslStream, in your case) from your StreamReader. This can be accomplished essentially in two ways:

    1) You can just grab the whole response from the SslStream as bytes (paying attention to the separators so that you can detect when the response is finished), then create a MemoryStream on the portion of the buffer that needs to be decoded (MemoryStream has a constructor overload that takes an offset and a count into a byte array). Finally, create a StreamReader on the MemoryStream and decode using the appropriate encoding for the MIME part.

    This has the obvious disadvantage that you must store the whole message in memory before you can decode it. That's not a big issue, as it's rather unusual for a message to contain more than a few kilobytes of text: the bulk of large messages is made up of images and attachments that need different treatment anyway.

    2) You could derive a custom Stream class that only reads up to some specific byte sequence. It requires some work, but it's not really too complicated. At that point you can just attach a StreamReader to your custom Stream and decode on the fly. You can then safely dispose the StreamReader, as it will only close your custom stream, not the underlying SslStream.
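
    Option 1 can be sketched roughly as follows. This is only an illustration: PartDecoder and DecodePart are made-up names, and finding the offset and length of the MIME part inside the buffer is assumed to have been done already. The RegisterProvider call is only needed on .NET Core and .NET 5+.

```csharp
using System.IO;
using System.Text;

static class PartDecoder
{
    // Hypothetical helper: decodes `count` bytes of `buffer`, starting at `offset`,
    // using the charset named in the MIME part's header.
    public static string DecodePart(byte[] buffer, int offset, int count, string charset)
    {
        // Needed on .NET Core / .NET 5+ for code pages such as iso-8859-2.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding enc = Encoding.GetEncoding(charset);

        // Read-only view over the relevant slice; the buffer itself is not copied.
        using (var ms = new MemoryStream(buffer, offset, count, false))
        using (var reader = new StreamReader(ms, enc))
        {
            return reader.ReadToEnd();
        }
    }
}
```

    Because every part gets its own short-lived StreamReader, each one can use a different encoding, and disposing it only closes the MemoryStream.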

    All this sounds more complicated than it really is... if you have questions, just ask.
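
    For option 2, here is a minimal sketch of such a wrapper. Note that it is a simpler variant than the one described: it stops after a known number of bytes instead of at a byte sequence, which fits IMAP literals because the server announces their length as {n} before sending the data. BoundedReadStream is a made-up name, not an existing framework class.

```csharp
using System;
using System.IO;

// Hypothetical helper: exposes at most `length` bytes of an inner stream
// and never closes the inner stream.
sealed class BoundedReadStream : Stream
{
    private readonly Stream inner;
    private long remaining;

    public BoundedReadStream(Stream inner, long length)
    {
        this.inner = inner;
        this.remaining = length;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (remaining <= 0) return 0; // report end of stream at the boundary
        int n = inner.Read(buffer, offset, (int)Math.Min(count, remaining));
        remaining -= n;
        return n;
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }

    // Dispose is intentionally not overridden: the inner stream stays open.
}
```

    A StreamReader created over a BoundedReadStream sees end-of-stream after the given number of bytes, so it cannot buffer past the part being decoded, and disposing it leaves the SslStream positioned right after that part.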

    HTH
    --mc

    Monday, August 23, 2010 7:54 PM
  • I spent some time with many different encodings, and now I see that maybe it is not necessary to change the StreamReader. As you probably know, most e-mails are encoded in quoted-printable. When the charset is ISO, everything is OK. The problems start when it's UTF-8. I get strings like this:

     

    =?UTF-8?Q?some_encoded_text_including_=D2=EF_etc?= 

    The ISO charsets use a single byte for each special character. In UTF-8, many characters are represented by two bytes (for example, =C5=BCeby is "żeby"). I found a lot of C# examples that decode quoted-printable, but I didn't find any that work with UTF-8...

    I tried to cut =C5 and =BC from the string and do:

     

    Byte[] b = new Byte[2];
    b[0] = byte.Parse("C5", System.Globalization.NumberStyles.HexNumber);
    b[1] = byte.Parse("BC", System.Globalization.NumberStyles.HexNumber);
    Encoding enc = Encoding.GetEncoding("utf-8");
    string result = enc.GetString(b);

    and that works. The last problem is that I don't know how to tell whether a character is represented by one byte or two. Two escaped values next to each other could be two different characters or just the two bytes of one character.
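
    One way to sidestep the question, sketched here under assumptions: turn every =XX escape into a byte first, collect all the bytes, and only then decode the whole array in one call. The Encoding object then works out where each character starts and ends. QDecoder and DecodeQ are made-up names, and the input is assumed to be just the text between "?Q?" and "?=" of an encoded word like the one above, with well-formed hex escapes.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class QDecoder
{
    // Hypothetical helper: decodes the text part of a Q-encoded word.
    public static string DecodeQ(string text, Encoding charset)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '=' && i + 2 < text.Length)
            {
                // "=XX" is one raw byte written as two hex digits.
                bytes.Add(byte.Parse(text.Substring(i + 1, 2),
                                     NumberStyles.HexNumber, CultureInfo.InvariantCulture));
                i += 2;
            }
            else if (c == '_')
            {
                bytes.Add(0x20); // in Q-encoding, '_' stands for a space
            }
            else
            {
                bytes.Add((byte)c); // everything else is plain ASCII
            }
        }
        // Decode all bytes together so multi-byte characters are reassembled.
        return charset.GetString(bytes.ToArray());
    }
}
```

    For example, DecodeQ("=C5=BCeby", Encoding.UTF8) returns "żeby". The same function can be used with a single-byte charset by passing a different Encoding.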

     

    -----------------

    Unfortunately, I have to change the StreamReader encoding anyway.

    Tuesday, August 24, 2010 8:14 AM
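  • If the first byte is between 00 and 7F, the character is coded on one byte.

    If it's between C0 and DF, the character is coded on two bytes (C0 and C1 never appear in valid UTF-8).

    Between E0 and EF, three bytes.

    Between F0 and F7, four bytes (valid UTF-8 only uses F0 to F4).

    Bytes between 80 and BF are continuation bytes and never start a character.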
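
    As a rough sketch, the lead-byte rule can be written as a small function. Utf8Lead and SequenceLength are made-up names. The function is slightly stricter than a plain range check: it returns 0 for C0, C1 and F5 to FF, which cannot occur in well-formed UTF-8, and for continuation bytes (80 to BF).

```csharp
static class Utf8Lead
{
    // Returns the length in bytes of the UTF-8 sequence that starts with `lead`,
    // or 0 if the byte cannot start a sequence (continuation or invalid byte).
    public static int SequenceLength(byte lead)
    {
        if (lead <= 0x7F) return 1;                  // ASCII
        if (lead >= 0xC2 && lead <= 0xDF) return 2;  // two-byte sequence
        if (lead >= 0xE0 && lead <= 0xEF) return 3;  // three-byte sequence
        if (lead >= 0xF0 && lead <= 0xF4) return 4;  // four-byte sequence
        return 0;
    }
}
```

    For =C5=BCeby this gives 2 for C5 (so BC belongs to the same character) and 1 for each of the bytes of "eby". In practice it is usually simpler to collect all the bytes and let Encoding.UTF8.GetString find the boundaries.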

    Tuesday, August 24, 2010 11:51 AM