Mail bodies declared as code page 50220 do not seem to be in that encoding

  • Question

  • I ran across something in MAPI email data which I don't fully understand.

    The application which appears to have created the data is MS Outlook itself. We received it in MSG format. Anyway, the odd thing is that the body was not coming out correctly, and it turns out to be an encoding issue. The message body was supposedly encoded using code page 50220.

    This page: https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

    Says that 50220's .NET name is ISO-2022-JP.

    If I do decode the data as ISO-2022-JP, using Java, I get an invalid character (diamond with question mark) in the result, so that's obviously wrong.

    If I decode the data as MS50220, which intuitively seems like the right encoding, I don't get the black diamond, but instead, get the square box indicating that my font doesn't have the code point. On closer inspection, the result contained U+E290, a character in the private use area of Unicode.

    If I decode the data as "MSISO2022JP", which is Java charset naming jargon for "ISO-2022-JP but with the same modification to the JIS X encodings which Microsoft code page 932 is doing for Shift_JIS", I can finally decode this character as 鄧, which I am able to confirm is the correct character, as we have seen it in some other context.

    Reading some documentation on these character sets, "MSISO2022JP" is not a superset of MS50220. But here I have a situation where MSISO2022JP is clearly returning the right result, where MS50220 doesn't, even though the data is stored with a marker saying to use MS50220.

    My questions are:

    • Were there multiple versions of each code page throughout history?
    • Is Outlook putting the wrong value into the metadata in some situations?
    • Most importantly: Is it safe to substitute MSISO2022JP in all cases where one would use MS50220, assuming that one is only decoding the data and not intending to encode new data in that encoding?


    Thursday, February 16, 2017 6:03 AM

All replies

  • Hello Trejkaz:

    Thank you for contacting Microsoft Support. By "MSG", do you mean the .msg file format specified in the MS-OXMSG specification (https://msdn.microsoft.com/en-us/library/cc463912(v=exchg.80).aspx), which my team supports?

    Thanks.


    Tarun Chopra | Escalation Engineer | Open Specifications Support Team

    Thursday, February 16, 2017 4:21 PM
  • Yes, that is right.
    Thursday, February 16, 2017 10:05 PM
  • Hello Trejkaz,

    Thank you for your inquiry about Microsoft Office Specifications. We have created an incident to investigate this issue. A member of the Open Specifications team will contact you shortly.

    Thanks.


    Tarun Chopra | Escalation Engineer | Open Specifications Support Team

    Friday, February 17, 2017 6:19 AM
  • Hi Trejkaz, I am the engineer who will be working with you on this issue. I am currently researching the problem and will provide you with an update soon. Thank you for your patience.

    Josh Curry (jcurry) | Escalation Engineer | Open Specifications Support Team

    Friday, February 17, 2017 4:55 PM
    Moderator
  • Hi Trejkaz, would you be able to provide the sample .msg file for review? If you can, please send it to dochelp(at)microsoft(dot)com to my attention and reference this forum thread. Also, can you verify what version of Outlook generated the file? Thanks.

    Josh Curry (jcurry) | Escalation Engineer | Open Specifications Support Team

    Friday, February 17, 2017 9:22 PM
    Moderator
  • Hi Trejkaz, I've been reviewing the message file that you provided. Although I can't answer your specific questions, I think I can explain what the cause is from what I see in the message.

     

     The first thing I noticed is that there does not appear to be anything that specifies the code page as 50220. Looking at the message source I see the following:

     

    <meta http-equiv=Content-Type content="text/html; charset=iso-2022-jp">

    From the MSDN page that you referenced previously, I see that there are actually 2 code pages associated with that charset. Is 50222 associated with MSISO2022JP?

     

    Code Page Identifiers

    https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

     

    Identifier | .NET Name   | Additional Information
    50220      | iso-2022-jp | ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)
    50222      | iso-2022-jp | ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)

     

    Here is what I noticed in the message that I think is the key to understanding what's going on. This is the data in question from the recipients list.

     

    The first 3 bytes of that data are an escape sequence that tells the decoder to switch to a different encoding, JIS X 0208-1983. They are followed by three 2-byte characters, and finally by another escape sequence that switches back to ASCII. I believe the issue is that only MSISO2022JP recognizes this escape sequence correctly; the others may not support it.

     

    Bytes             | Meaning
    1B 24 42          | ESC $ B
    7B 7D 3B 56 3B 56 | (three 2-byte characters)
    1B 28 42          | ESC ( B

     

    Additional information about that can be found here: https://en.wikipedia.org/wiki/ISO/IEC_2022
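
    The byte breakdown above can be reproduced with a small tokenizer. This is only a sketch: the class and method names are made up for illustration, and it recognizes just the four escape sequences defined for plain ISO-2022-JP (ESC ( B, ESC ( J, ESC $ @, ESC $ B). It does not look anything up in a character table; it only shows where the mode switches happen and how the bytes pair up.

```java
import java.util.ArrayList;
import java.util.List;

public class Iso2022JpTokens {

    /** Splits a byte stream into escape sequences and character codes. */
    static List<String> tokenize(byte[] data) {
        List<String> tokens = new ArrayList<>();
        boolean twoByte = false; // false: one byte per character, true: 2-byte JIS X 0208 pairs
        int i = 0;
        while (i < data.length) {
            int b0 = data[i] & 0xFF;
            if (b0 == 0x1B && i + 2 < data.length) {
                char c1 = (char) (data[i + 1] & 0xFF);
                char c2 = (char) (data[i + 2] & 0xFF);
                boolean toTwoByte = c1 == '$' && (c2 == '@' || c2 == 'B');
                boolean toOneByte = c1 == '(' && (c2 == 'B' || c2 == 'J');
                if (toTwoByte || toOneByte) {
                    twoByte = toTwoByte;
                    tokens.add("ESC " + c1 + " " + c2);
                    i += 3;
                    continue;
                }
            }
            if (twoByte && b0 != 0x1B && i + 1 < data.length) {
                // In 2-byte mode every character is a lead/trail pair.
                tokens.add(String.format("%02X %02X", b0, data[i + 1] & 0xFF));
                i += 2;
            } else {
                // Single-byte mode, an unrecognized ESC, or a dangling last byte.
                tokens.add(String.format("%02X", b0));
                i += 1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The bytes quoted in this thread.
        byte[] sample = {0x1B, 0x24, 0x42, 0x7B, 0x7D, 0x3B, 0x56, 0x3B, 0x56, 0x1B, 0x28, 0x42};
        tokenize(sample).forEach(System.out::println);
    }
}
```

    For the sample bytes this yields one escape sequence, three 2-byte codes (7B 7D, 3B 56, 3B 56) and the closing escape sequence, which matches the description above.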



    Josh Curry (jcurry) | Escalation Engineer | Open Specifications Support Team

    Wednesday, March 8, 2017 6:11 PM
    Moderator
  • Actually, even standard ISO-2022-JP recognises the escape sequence. Demo:

    http://fiddlybits.org/charsets/iso-2022-jp/decode?utf8=%E2%9C%93&form%5Bdata%5D=1B+24+42+7B+7D+3B+56+3B+56+1B+28+42%0D%0A%0D%0A&form%5Btype%5D=hex&commit=Submit

    The demo shows the escape sequence switching to JIS X 0208-1983, but it also reports that 7B 7D is not actually assigned in JIS X 0208-1983. The table confirms this:

    http://fiddlybits.org/charsets/jis-x-0208-1983/table

    But the email does actually have the property PR_INTERNET_CPID set to 50220. So ignoring what it says in the HTML itself, the MAPI metadata is saying to use codepage 50220. Microsoft's code pages do tend to add characters over the standard ones, so of course that is happening here too.

    So both MS50220 and MSISO2022JP do contain an entry for 7B 7D, but they contain different values. MS50220 has a private use character which may have been used to represent the character in the distant past before Unicode contained it(?). MSISO2022JP has a Unicode character which appears to render correctly.
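
    One way to make sense of the private use value is a worked calculation. The sketch below rests on an assumption that is not confirmed anywhere in this thread: that the MS50220 table treats rows 0x75 to 0x7E of the 94x94 grid as a user-defined area and maps it linearly onto the private use area starting at U+E000. The class and method names are hypothetical. Under that assumption, 7B 7D lands exactly on the U+E290 reported in the original question.

```java
public class UserDefinedAreaSketch {

    /**
     * ASSUMPTION: rows 0x75..0x7E are a user-defined area mapped row by row,
     * 94 cells per row, onto the private use area starting at U+E000.
     */
    static int toPrivateUse(int lead, int trail) {
        if (lead < 0x75 || lead > 0x7E || trail < 0x21 || trail > 0x7E) {
            throw new IllegalArgumentException("not in the assumed user-defined area");
        }
        int row = lead - 0x75;   // 0-based row within the assumed area
        int cell = trail - 0x21; // 0-based cell within the row
        return 0xE000 + row * 94 + cell;
    }

    public static void main(String[] args) {
        // (0x7B - 0x75) * 94 + (0x7D - 0x21) = 6 * 94 + 92 = 656 = 0x290
        System.out.printf("U+%04X%n", toPrivateUse(0x7B, 0x7D)); // prints U+E290
    }
}
```

    If the assumption holds, the private use result is not a stale mapping for this particular character but a generic placeholder for the whole row, which would explain why the two tables disagree.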

    Java's documentation claims that its tables were generated using a program running on Windows against the real encoder, which is surely the truth.

    So the question is really whether MS50220 got updated at some point between when Java implemented MS50220 and the present. But of course, code pages are supposed to be very much immutable, so you would hope that people wouldn't do things like that...

    Monday, March 13, 2017 3:14 AM
  • Hi Trejkaz. As a test I used the Encoding class from .NET to try to decode that series of bytes using code page 50220 and I believe that I got the correct characters back because they are the same ones that Outlook displays for that message.

     

    1B 24 42 7B 7D 3B 56 3B 56 1B 28 42

     

    志志
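
    For comparison, a similar test can be run from Java against several candidate charsets at once. This is a rough sketch, not a definitive harness: the class and method names are invented, and the charset names "x-windows-50220" and "x-windows-iso2022jp" are, to the best of my knowledge, the names under which the JDK registers what this thread calls MS50220 and MSISO2022JP. The code skips any name the running JVM does not support, so treat the list as an assumption to adjust.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class CompareDecoders {

    /** Formats a string as a space-separated list of code points. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    /** Decodes the same bytes with every supported charset in the list. */
    static Map<String, String> decodeAll(byte[] data, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (Charset.isSupported(name)) {
                // new String(...) substitutes U+FFFD for anything it cannot decode.
                results.put(name, codePoints(new String(data, Charset.forName(name))));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        byte[] sample = {0x1B, 0x24, 0x42, 0x7B, 0x7D, 0x3B, 0x56, 0x3B, 0x56, 0x1B, 0x28, 0x42};
        // Charset names are assumptions; unsupported ones are silently skipped.
        decodeAll(sample, "ISO-2022-JP", "x-windows-50220", "x-windows-iso2022jp")
                .forEach((name, cps) -> System.out.println(name + ": " + cps));
    }
}
```

    Printing code points rather than the characters themselves avoids confusing a missing glyph in the font with a wrong decoding result.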

     

    I can't comment on the history of code page 50220 or why Java's implementation doesn't appear to be complete. But I don't believe that there is anything wrong with the message itself. Since the problem is specific to Java, I would suggest that you seek assistance from one of the many Java developer communities that exist.

    http://www.oracle.com/technetwork/java/community/index.html


    Josh Curry (jcurry) | Escalation Engineer | Open Specifications Support Team

    Wednesday, March 15, 2017 8:45 PM
    Moderator
  • I'm not sure that this is necessarily Java's fault, since they say they got the encoding data out of Windows itself. I'm inclined to believe this, as it seems unlikely that they would write a program to deliberately give different code points...

    There was this incident in the past as well:

    • CP932 was historically one of IBM's.
    • Windows used the same ID, 932, for the same encoding, at some point in the past, presumably for easy compatibility with IBM. (No problem with doing that, honestly.)
    • At some point after that, but I'm not sure when, Windows changed the behaviour of the encoding to include more characters. (This is a problem.)
    • IBM then wanted compatibility with that, but IBM follow the strict rules that you can never update a code page once you have defined it, so they ended up defining CP943.

    So I'm sure what's going on here is that something similar has happened to Windows 50220 - some prior version of Windows treated it one way, and then Windows changed the behaviour.

    It's entirely possible that this is even part of the same incident - 50220 might have been defined in terms of 932. If it were, back when 932 changed, the exact same changes would have occurred in 50220.

    Getting access to historical Windows builds for testing this hypothesis would be amazing, but it doesn't seem like the kind of thing which would be easy to do without resorting to piracy. :(

    Thursday, December 27, 2018 11:14 PM
  • Hi Trejkaz, 

    Thanks for the follow-up on this old conversation. Josh has moved to another team and I will assist you. Since we don't retain data for very long from our work on these issues, I would ask you to email dochelp, referencing the URL for this thread and my name so that you can send me the .msg file in question and I can do some investigation. 

    To be clear, are you asking if the Windows 50220 code page was changed at some point after the Java implementation was based on it and if this is what is causing the issue with your decoding using Java? If so, I might be able to use an older version of Windows to verify this.

    Please confirm this is what you are asking.

    Best regards,
    Tom Jebo
    Sr Escalation Engineer
    Microsoft Open Specifications

    Friday, December 28, 2018 2:02 AM
    Moderator
  • That's right. If I can get a hint as to which version, it might be possible to decide which encoding to actually use based on how old the data was. For instance, maybe "MSISO2022JP" is correct for Windows 2000 and later, and "MS50220" is correct for anything earlier - but I just have no way to know without doing an exhaustive manual search of my own.

    As far as the original MSG... I'll see what I can do. Maybe it's still in my sent folder from ages ago.

    Friday, December 28, 2018 2:08 AM
  • Thanks, I'll wait to hear from you on dochelp.

    Tom

    Friday, December 28, 2018 2:11 AM
    Moderator
  • Based on our offline conversation, I'm updating the forum for the benefit of the community.

    As we discussed, the MSISO2022JP code page you found for Java should be tried first. If it produces a bad character transform, fall back to the MS50220 code page.

    The historical changes to the 50220 code page in Windows reach very far back. The older versions will likely not be needed for transforming email body characters in the versions of Exchange covered by the Open Specifications.
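
    The "try one, then fall back" approach described above could look roughly like the following in Java. This is a sketch under several assumptions: the class and method names are invented, "bad character transform" is interpreted here as a strict decoding error (malformed or unmappable input), and the charset names in main are what I believe the JDK uses for MSISO2022JP and MS50220. Charsets the running JVM does not provide are skipped.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class FallbackDecoder {

    /** Tries each charset in order and returns the first strict decode that succeeds. */
    static String decodeWithFallback(byte[] data, String... charsetNames) {
        CharacterCodingException lastFailure = null;
        for (String name : charsetNames) {
            if (!Charset.isSupported(name)) {
                continue; // not available in this JVM
            }
            try {
                return Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data))
                        .toString();
            } catch (CharacterCodingException e) {
                lastFailure = e; // remember why this candidate failed, then try the next
            }
        }
        throw new IllegalArgumentException("no candidate charset could decode the data", lastFailure);
    }

    public static void main(String[] args) {
        byte[] sample = {0x1B, 0x24, 0x42, 0x7B, 0x7D, 0x3B, 0x56, 0x3B, 0x56, 0x1B, 0x28, 0x42};
        try {
            // Order follows the recommendation in this thread; names are assumptions.
            String text = decodeWithFallback(sample, "x-windows-iso2022jp", "x-windows-50220");
            text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        } catch (IllegalArgumentException e) {
            System.out.println("Could not decode: " + e.getMessage());
        }
    }
}
```

    Note that a decode which succeeds but yields private use characters would not trigger the fallback in this sketch; a stricter version might also check the result for code points in the U+E000 to U+F8FF range.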

    Thanks,

    Tom

    Tuesday, January 8, 2019 10:15 PM
    Moderator