Outlook AppointmentItem's InternetCodePage does not return an accurate value if Arabic or Russian is used

  • Question

  • Hi,

    I am trying to read AppointmentItem.InternetCodePage, but it does not return the encoding that is actually used in the appointment. I am writing in Arabic (Windows-1256). For testing purposes I have had to hard-code Windows-1256 so that the text shows correctly when I read AppointmentItem.RTFBody and write it back.

    I have already had a discussion on the Office Dev Center about which encoding to use for Arabic, and the only one that gave me correct results was Windows-1256 (based on feedback obtained here: https://social.msdn.microsoft.com/Forums/office/en-US/ed82246d-dd3d-42ed-a370-8a3acf127922/outlook-appointmentitems-rtfbody-byte-content-has-incorrect-values-for-nonlatin-language?forum=outlookdev).

    Ideally, there should be a way to guess which encoding is in use (InternetCodePage flag or something else). Has anyone got any idea please?

    Thanks,

    Hicham


    • Edited by Hisham7G Monday, July 10, 2017 8:17 AM
    Friday, July 7, 2017 4:13 PM

All replies

  • InternetCodePage is only used for ANSI stores. Unicode stores do not need it. The RTF encoding is specified in the RTF stream itself by the \ansicpg control word (e.g. "\ansicpg1252").
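
    A minimal sketch of reading that control word straight from the byte array, without converting the whole blob to a string. It is not part of the original reply; the class and method names (RtfHeader, FindAnsiCodePage) are made up for illustration, and it only looks at the first \ansicpg it finds.

```csharp
using System.Text;

public static class RtfHeader
{
    // Returns the number that follows the first \ansicpg control word,
    // or null when the blob contains none (or no digits follow it).
    public static int? FindAnsiCodePage(byte[] rtf)
    {
        byte[] tag = Encoding.ASCII.GetBytes(@"\ansicpg");
        for (int i = 0; i + tag.Length <= rtf.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < tag.Length && match; j++)
            {
                match = rtf[i + j] == tag[j];
            }
            if (!match) continue;

            // Read up to five ASCII digits after the control word.
            int pos = i + tag.Length;
            int value = 0;
            int digits = 0;
            while (pos < rtf.Length && digits < 5 &&
                   rtf[pos] >= (byte)'0' && rtf[pos] <= (byte)'9')
            {
                value = value * 10 + (rtf[pos] - (byte)'0');
                pos++;
                digits++;
            }
            return digits > 0 ? (int?)value : null;
        }
        return null;
    }
}
```

    In an add-in this would be called as RtfHeader.FindAnsiCodePage(appointment.RTFBody as byte[]), assuming RTFBody is non-null.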

    Dmitry Streblechenko (MVP)
    http://www.dimastr.com/redemption
    Redemption - what the Outlook
    Object Model should have been
    Version 5.5 is now available!

    Friday, July 7, 2017 4:50 PM
  • Thanks for your reply Dmitry. I now know where to, theoretically, get the encoding from RTF (i.e. ansicpg).

    However, practically speaking, I am still getting the wrong encoding. I have tested with the following two languages for example:

    1. Arabic:

    In a blank appointment, I write السلام and check AppointmentItem.RtfBody. The RTF indicates: ansicpg1252 instead of 1256 (for Windows Arabic)

    2. Russian:

    In a blank appointment, I write русский алфавит and check AppointmentItem.RtfBody. The RTF indicates: ansicpg1252 instead of 1251 (for Windows Cyrillic: Russian, Bulgarian, etc).

    I intentionally use a blank appointment to make sure there is no mixture of languages: either pure Arabic or pure Russian.

    It seems 1252 is fixed because my default Outlook (and Windows) language setting is English.

    I'd be glad if you could assist please. The Russian test can perhaps be easily reproduced on your end if you have English Windows/Outlook and switch to Russian when creating an appointment. Alternatively, I can share the RTF content with you for русский алфавит.

    Regards,

    Hicham


    • Edited by Hisham7G Monday, July 10, 2017 1:23 PM
    Monday, July 10, 2017 1:21 PM
  • UPDATE:

    Dmitry, the only flag I can see that indicates the language correctly is the last "\lang" control word (there can be many).

    For the Arabic test I can see \lang1025 and for the Russian one it is \lang1049. Each one appears just before the text in question.

    It seems those numbers (1025 and 1049 for instance) correspond to the language IDs used by Office (see the first link below). They're also used by SharePoint (see the second link).

    https://technet.microsoft.com/en-us/library/cc179219.aspx

    https://technet.microsoft.com/en-us/library/cc287874.aspx

    So I think I should be getting the encoding from there instead of from ansicpg (which points to the system's default). But I am not sure which \lang to extract, as there can be many. For now, based on my tests, the last one is what I need, but I'm not comfortable with this hacky way of doing it. Any ideas?

    Regards,

    Hicham




    • Edited by Hisham7G Monday, July 10, 2017 3:57 PM
    Monday, July 10, 2017 3:52 PM
  • Are you sure that in those cases Outlook does not encode the Unicode characters as "\u1234 ?" escapes?
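
    For reference, a rough sketch of what decoding such escapes looks like. In RTF, \uN carries a UTF-16 code unit as a signed 16-bit decimal number and is followed by a fallback character for readers that do not understand it. The sketch is not from the original reply: the names (RtfUnicodeEscapes, Decode) are hypothetical, it assumes \uc1 (exactly one fallback character, as in the header quoted later in this thread), and it works on a fragment that is already plain ASCII text.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class RtfUnicodeEscapes
{
    // \uN, an optional delimiter space, then one optional fallback
    // character (either a literal such as '?' or a \'hh hex escape).
    private static readonly Regex Escape =
        new Regex(@"\\u(-?\d{1,5}) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])?");

    public static string Decode(string rtfFragment)
    {
        return Escape.Replace(rtfFragment, m =>
        {
            int n = int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
            // Negative values are the signed form of code units above 32767;
            // masking to 16 bits recovers the code unit.
            return ((char)(n & 0xFFFF)).ToString();
        });
    }
}
```

    For example, Decode(@"\u1088?\u1091?") yields the two Cyrillic letters U+0440 and U+0443, the first two letters of "русский".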

    Dmitry Streblechenko (MVP)

    Monday, July 10, 2017 6:14 PM
  • Hi Dmitry,

    No. I tested using:

    1. Unicode (string rtfBody = System.Text.Encoding.Unicode.GetString(Appointment.RTFBody)) and the RTF text I got was all in Chinese although I was writing in Arabic or Russian:

    Sample:

    屻瑲ㅦ慜敤汦湡ㅧ㈰尵湡楳慜獮捩杰㈱㈵畜ㅣ慜敤晦ㄳ〵尷敤晦尰瑳桳摦换㍨㔱㘰獜獴晨潬档ㄳ〵尶瑳桳桦捩㍨㔱㘰獜獴晨楢ㄳ〵尷敤汦湡㉧㔰尷敤汦湡晧㉥㔰尷桴浥汥湡㉧㔰尷桴浥汥湡晧 .......... etc
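
    A note on why this looks Chinese: the RTF source is single-byte text, and decoding it as UTF-16 glues every two bytes into one code unit, most of which land in the CJK range. The small sketch below (hypothetical names, added for illustration) reproduces the start of the sample from the six bytes of {\rtf1.

```csharp
using System.Text;

public static class Utf16Misread
{
    // Decodes single-byte RTF text as if it were UTF-16LE,
    // which pairs up neighbouring bytes into one code unit each.
    public static string DecodeAsUtf16(byte[] rtf)
    {
        return Encoding.Unicode.GetString(rtf);
    }

    // The bytes of the RTF signature "{\rtf1": 7B 5C 72 74 66 31.
    public static byte[] RtfSignature()
    {
        return Encoding.ASCII.GetBytes(@"{\rtf1");
    }
}
```

    DecodeAsUtf16(RtfSignature()) gives the code units U+5C7B, U+7472 and U+3166, which are the first three characters of the sample above. So the data is intact; it is only being read with the wrong encoding.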

    2. UTF-8 (string rtfBody = System.Text.Encoding.UTF8.GetString(Appointment.RTFBody)) and the RTF text I got was readable, but ansicpg was still Latin (1252, see below) although I tested in Arabic and Russian (separately).

    Sample:

    {\rtf1\adeflang1025\ansi\ansicpg1252\uc1\adeff31507\deff0\stshfdbch31506\stshfloch31506\stshfhich31506\stshfbi31507\deflang2057\deflangfe2057\t

    ....etc

    --> The moment I write back to RtfBody from the Unicode-converted string or the UTF-8-converted one, the text gets messed up.

    Solution:

    I have come across the article here (https://www.editpadlite.com/unicode.html) and it seems Outlook is not using Unicode at all. It's rather using Windows ANSI code pages. 

    Here's what I've implemented so far and it seems to work:

    First step: convert RtfBody (bytes) into a string using System.Text.Encoding.Default. The content (string) is not necessarily correct but it will be readable in a way that allows us to get the language tags from RTF. So if I write back the content into RtfBody (bytes) using System.Text.Encoding.Default, the text will get messed up again if it's not Latin alphabet.

    Second step: the readable RTF we got in step one (although not necessarily the most correct representation) will have language tags. Dmitry suggested ansicpg but that unfortunately didn't point to the right encoding. So what I do is scan, in RTF, for all "langXXXX" (could be \alangXXXX, \langXXXX or any other alternative) and find the non-Latin language ID (XXXX) that is recognisable by Office SDK (see table in https://technet.microsoft.com/en-us/library/cc179219.aspx). Note that XXXX is not a code page or encoding but rather a language ID that Office SDK uses. If no recognisable non-Latin language is found then we fall back to English (Latin encoding).

    Third step: I map the language ID (XXXX) retrieved from step two, to its corresponding Windows code page (see Windows only code pages here https://msdn.microsoft.com/en-us/library/microsoft.office.interop.outlook._sharingitem.internetcodepage(v=office.14).aspx).

    For example: for Russian, XXXX would be 1049 (so RTF will have a lang1049 somewhere). Knowing that it's Russian, we can then use the Windows version of Russian (Cyrillic more globally) code page which is Windows-1251.

    Last step: Once we have the code page, we can create the right encoding and use it for GetString (so we can customise the RTF) or GetBytes (so we can write back to RtfBody).
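
    The language-to-code-page part of the steps above could be sketched roughly as follows. This is an illustration under assumptions, not the exact code I run: the names (RtfLanguageGuess, GuessCodePage) are hypothetical, the table holds only the two pairs mentioned in this thread, the last recognised language ID wins, and the scan uses ISO-8859-1 instead of Encoding.Default so that the result does not depend on the machine's locale.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class RtfLanguageGuess
{
    // Language ID -> Windows code page. Only the pairs named in this thread;
    // a real table would need every language you expect to meet.
    private static readonly Dictionary<int, int> CodePageByLanguageId =
        new Dictionary<int, int>
        {
            { 1025, 1256 }, // Arabic  -> Windows-1256
            { 1049, 1251 }, // Russian -> Windows-1251
        };

    private const int FallbackCodePage = 1252; // Latin

    // Control words that end in "lang" plus a number: \lang, \alang, \deflang, ...
    private static readonly Regex LangWord = new Regex(@"\\[a-z]*lang(\d{1,5})");

    public static int GuessCodePage(byte[] rtf)
    {
        // One char per byte, used for scanning control words only.
        string text = Encoding.GetEncoding("iso-8859-1").GetString(rtf);

        int result = FallbackCodePage;
        foreach (Match m in LangWord.Matches(text))
        {
            int languageId = int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
            int codePage;
            if (CodePageByLanguageId.TryGetValue(languageId, out codePage))
            {
                result = codePage; // later matches override earlier ones
            }
        }
        return result;
    }
}
```

    The returned number can then be passed to System.Text.Encoding.GetEncoding(int) for the last step. On .NET Core and later, the Windows code pages are only available after registering the code pages encoding provider; on .NET Framework they work out of the box. This remains a heuristic: a body that mixes several non-Latin languages would need per-run handling.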

    Conclusion: there is no direct way for Outlook (Office SDK) to handle converting RTF content to text and converting it back to bytes. I had to implement it manually (get language from RTF, then get code page from that, to finally get the right encoding that will allow us to encode and decode properly without the non-Latin text getting messed up).

    I hope that helps and thank you for your collaboration :) - please let me know if you see any flaws in my approach.

    Hicham


    Wednesday, July 12, 2017 4:26 PM
  • Hi Hisham7G,

    Thanks for sharing the solution.

    As you know, RTFBody is in ASCII encoding. I would suggest you check whether using ASCIIEncoding directly works.

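    // Requires: using System.Diagnostics;
    //           using Outlook = Microsoft.Office.Interop.Outlook;
    private void GetRTFBodyForMail()
    {
        // Only handle the case where the open item is a mail item.
        if (Application.ActiveInspector().CurrentItem is Outlook.MailItem)
        {
            Outlook.MailItem mail =
                Application.ActiveInspector().CurrentItem as Outlook.MailItem;

            // RTFBody is exposed as a byte array.
            byte[] byteArray = mail.RTFBody as byte[];

            // Decode the bytes as ASCII and print the resulting RTF source.
            System.Text.Encoding encoding = new System.Text.ASCIIEncoding();
            string RTF = encoding.GetString(byteArray);
            Debug.WriteLine(RTF);
        }
    }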

    # MailItem.RTFBody Property (Outlook)

    https://msdn.microsoft.com/VBA/Outlook-VBA/articles/mailitem-rtfbody-property-outlook

    In addition, for this thread and your previous thread below, I would suggest you mark the solution as the answer so that others who run into the same issue can find it easily. This is the recommended way to close a thread.

    # Outlook AppointmentItem's RtfBody byte content has incorrect values for non-Latin language

    https://social.msdn.microsoft.com/Forums/azure/en-US/ed82246d-dd3d-42ed-a370-8a3acf127922/outlook-appointmentitems-rtfbody-byte-content-has-incorrect-values-for-nonlatin-language?forum=outlookdev

    Best Regards,

    Edward


    • Marked as answer by Hisham7G Friday, July 14, 2017 1:30 PM
    Friday, July 14, 2017 8:46 AM
  • Hi Edward,

    Thanks for your reply. Your suggested solution (ASCII) actually works well when converting from bytes (RTFBody) to string and vice versa.

    I was getting different results during my tests because of the third-party library I was using for RTF: RtfDomParser, available from NuGet (https://www.nuget.org/packages/RtfDomParser/). I had to use that library because I wanted to insert a few items properly, hence the need for a proper RTF DOM parser. Parsing RTF through that library messes up the RTF content a bit, so for me even ASCII wasn't working. When I eliminated that extra parsing step, I totally forgot to test ASCII again.

    But anyway, in my case, I definitely need that RtfDomParser library, so the only way for me to maintain valid RTF is to follow the workaround I've implemented (RTF-level processing to overcome the limitations found in the library). However, I've marked your reply as the valid answer because it solves the problem I am raising here (readers don't necessarily have to deal with the bigger problem I have).

    And by the way, for future reference, could you point me to the best RTF parsing library you or Microsoft recommends? I couldn't find any; that's why I opted for RtfDomParser (which, for instance, does not use Windows ANSI code pages hence the need for changing the library to fit my purpose)

    Thanks!

    Hicham

    • Edited by Hisham7G Friday, July 14, 2017 1:41 PM
    Friday, July 14, 2017 1:37 PM
  • Absolutely not. RTF body is not ASCII encoded. It is not encoded at all. That is why OOM exposes RTF body as a byte array - it is a binary blob. Do not assume any encoding.

    Dmitry Streblechenko (MVP)

    Friday, July 14, 2017 2:47 PM
  • I am confused because Edward's code (with ASCII) is the only variation that has worked for me. You can try it yourself: use a different encoding (not ASCII) and non-Latin text will get messed up (encoding RtfBody to string then decoding to bytes again and writing it back to RtfBody). That makes me think there is an ASCII element there (just like the example below from MSDN).

    https://msdn.microsoft.com/VBA/Outlook-VBA/articles/mailitem-rtfbody-property-outlook

    Thanks

    Hicham

    Friday, July 14, 2017 4:00 PM
  • You got lucky that the ASCII -> UTF-16 conversion did not corrupt your data in your particular case. Generally, that is not the case, and you should expect corruption in the future. There is a good reason why Outlook returns and takes a byte array and not a string.

    It is a binary array and must be treated as such.
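
    To make the risk concrete, here is a small illustrative sketch (not from the original reply; the names are hypothetical). ASCII only covers bytes 0x00 to 0x7F, so a round trip through an ASCII string is lossless only while every byte in the blob stays in that range. The ISO-8859-1 method is included purely for comparison, because that encoding maps each of the 256 byte values to its own character; it is not a claim about how the blob is encoded.

```csharp
using System.Text;

public static class RoundTrip
{
    // bytes -> ASCII string -> bytes. Any byte above 0x7F cannot be
    // represented and comes back as 0x3F ('?').
    public static byte[] ThroughAscii(byte[] data)
    {
        return Encoding.ASCII.GetBytes(Encoding.ASCII.GetString(data));
    }

    // bytes -> ISO-8859-1 string -> bytes. Every byte value has its own
    // character, so the bytes come back unchanged.
    public static byte[] ThroughLatin1(byte[] data)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetBytes(latin1.GetString(data));
    }
}
```

    With the input { 0x7B, 0x5C, 0xE0, 0x7D }, ThroughAscii returns 0x3F in place of 0xE0, while ThroughLatin1 returns the input unchanged. Keeping the data as a byte array end to end avoids the question altogether.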


    Dmitry Streblechenko (MVP)

    Friday, July 14, 2017 4:58 PM
  • I understand that point Dmitry. Thanks. But what confuses me is the following: at some point, Outlook will have to convert that byte array into text in order to display correct content to the user. How does Outlook do that in a perfect way? I know RTF body is a binary array for a reason but also there has to be a way (and we developers should be able) to convert it into a string with no issues. That's how Office applications do it after all.

    Thanks

    Hicham

    Monday, July 17, 2017 10:05 AM
  • Outlook never converts the whole blob to text - it parses it.

    Dmitry Streblechenko (MVP)

    Monday, July 17, 2017 2:28 PM
  • Okay understood. So it parses it as bytes and that will generate text. I use parsing as well (RtfDomParser third-party library) but before that I admit that I convert to text (not recommended as per your reply).

    So do you happen to know a good parsing library that takes bytes instead of string so I pass RtfBody directly to it? Or even better: do you know what Outlook uses (if it's exposed to developers anyway).

    Thanks,

    Hicham

    Monday, July 17, 2017 3:49 PM