Character encoding of PR_HTML data in TNEF file

  • Question

  • I've come across a situation where I cannot accurately determine the character encoding of the data in the PR_HTML attribute of a TNEF file.

    In one case I have a TNEF file where the Locale is 1031 (Germany), the Codepage is 1252, and the text in PR_HTML contains a "Content-Type" tag that indicates ISO-8859-1 encoding. It turns out that the PR_HTML data is in fact encoded in ISO-8859-1.

    In another case I have another TNEF file where the locale is 1033 (USA), the Codepage is 1252 and again the text in PR_HTML contains a "Content-Type" tag that indicates ISO-8859-1 encoding. However, decoding the PR_HTML data with the ISO-8859-1 encoding results in corrupt extended characters. Decoding with UTF-8 gives the correct results.
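
    The corruption in the second case looks like what typically happens when UTF-8 bytes are decoded as ISO-8859-1. A minimal Python sketch of that symptom, using a made-up sample string rather than data from the actual TNEF file:

```python
# Hypothetical sample text; not taken from the real PR_HTML data.
sample = "Grüße"
raw = sample.encode("utf-8")  # the bytes as they would sit in the property

# ISO-8859-1 maps every byte to some character, so this never raises,
# even when the bytes are really UTF-8. Each extended character turns
# into two unrelated characters instead.
as_latin1 = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 recovers the original text.
as_utf8 = raw.decode("utf-8")

print(repr(as_latin1))
print(repr(as_utf8))
```

    Because the ISO-8859-1 decode always succeeds, a wrong "Content-Type" declaration cannot be detected by waiting for a decoding error.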

    I attempted to find information on this by reading through the TNEF specs as well as the Exchange attribute specs; however, I could not find anything definitive regarding the character encoding of PR_HTML.

    How should I be determining the correct character encoding of the data in the PR_HTML attribute?

    Friday, March 11, 2011 6:42 PM

All replies

  • Hi, Alan,

      Thanks for your question.  One of our team members will work on it and respond to you soon.

     


    Hongwei Sun -MSFT
    Friday, March 11, 2011 7:07 PM
  • Alan,

    I am the engineer who has taken ownership of your inquiry. I am currently investigating this and will update you as things progress.

    Friday, March 11, 2011 7:15 PM
  • Alan,

    The character set is indeed ISO 8859-1 (Latin alphabet no. 1). Taken as a whole, however, the PR_HTML attribute contains the message body in HTML form. This property is described in [MS-OXPROPS] section 2.809 and is defined as follows:

    2.809   PidTagHtml 

    Canonical name: PidTagHtml 

    Description: Contains the message body text in HTML format. 

    Property ID: 0x1013 

    Data type: PtypBinary, 0x0102 

    Area: General Message Properties 

    Defining Reference: [MS-OXCMSG] section 2.2.1.44.9 

    Consuming references: [MS-OXBBODY], [MS-OXCMAIL], [MS-OXOPOST], [MS-OXORMMS], [MS-OXORSS], [MS-OXOSMMS] 

    Alternate names: PR_HTML, ptagHtml

    I hope this assists you.

    • Proposed as answer by King Salemno Monday, March 14, 2011 2:28 PM
    Monday, March 14, 2011 2:28 PM
  • You did not answer my question.

    I asked how to properly determine the character encoding of the HTML body, not where the HTML body is located.

    I cannot trust the Locale or Codepage properties as they do not accurately indicate what the encoding of the HTML body is. Do you know of a property that accurately defines that encoding every time?

    Monday, March 14, 2011 6:00 PM
  • Alan,

    I am looking into this.

    Friday, March 18, 2011 12:32 PM
  • Alan,

    If the Content-Type meta tag exists, its charset parameter value SHOULD match the charset parameter value of the Content-Type header.
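
    Since the two files described in the question show that the declared charset does not always match the bytes, a consumer may want a defensive decode. The following Python sketch is one possible heuristic, not something taken from the TNEF or Exchange specifications: it tries strict UTF-8 first, then the header charset, then the charset found in the meta tag, then a fallback code page. The function names `meta_charset` and `decode_html_body`, the `header_charset` parameter, and the cp1252 fallback are all assumptions made for illustration.

```python
import re

# Finds "charset=..." in the raw bytes. Searching bytes is safe here because
# the declaration itself is plain ASCII in any ASCII-compatible encoding.
_META_CHARSET = re.compile(
    rb"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def meta_charset(html: bytes):
    """Return the charset named in the HTML (lowercased), or None."""
    match = _META_CHARSET.search(html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def decode_html_body(html: bytes, header_charset=None, fallback="cp1252"):
    """Heuristically decode an HTML body; returns (text, encoding_used)."""
    # Assumption: text in a single-byte encoding that contains extended
    # characters rarely happens to be valid UTF-8, so a clean strict
    # decode is treated as strong evidence that the bytes are UTF-8.
    try:
        return html.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Otherwise trust the declarations, header first, then the meta tag.
    for name in (header_charset, meta_charset(html), fallback):
        if not name:
            continue
        try:
            return html.decode(name), name
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or undecodable bytes

    # Last resort: latin-1 maps every byte, so this cannot fail.
    return html.decode("latin-1"), "latin-1"
```

    With this ordering, a body that declares ISO-8859-1 but actually holds UTF-8 bytes is decoded as UTF-8, while a body that really is ISO-8859-1 fails the strict UTF-8 check and is decoded using the declared charset. The heuristic can still guess wrong on short or unusual input, so it is a fallback rather than a replacement for a reliable encoding property.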

    • Marked as answer by King Salemno Thursday, March 24, 2011 6:44 AM
    Thursday, March 24, 2011 6:43 AM