Display of character 55357 in docx is governed by ascii font, instead of cs

  • Question

  • Hi,

      I have a display issue in Word: the cat head character (55357) in a docx file is displayed using the value of the rFonts/ascii definition, but this character is definitely not in the ASCII range.

      Could you explain why the ascii definition is the one that counts here?

      In the sample document I have two identical text runs, where the only difference is the value of the ascii font. When I open the document in Word, one of the characters shows as it should (with font Segoe UI Symbol), while the other is just a square because of the Times New Roman font.

      I expected the ascii font value not to be taken into consideration when the 55357 character is shown.

      I can send a sample docx if you wish. Until then, the sample paragraph is given below (the 55357 character is replaced by ??, as the forum engine does not accept the request with that character in the text).

    Thanks for your help,

      Sándor Kolumbán

    <w:p w14:paraId="5AA6D36F" w14:textId="79FF250F" w:rsidR="00167EEB" w:rsidRDefault="00B77E4D" w:rsidP="00B77E4D">
      <w:pPr>
        <w:spacing w:after="0" w:line="240" w:lineRule="auto"/>
      </w:pPr>
      <w:r w:rsidRPr="007B6943">
        <w:rPr>
          <w:rFonts w:ascii="Segoe UI Symbol" w:eastAsia="Times New Roman" w:hAnsi="Segoe UI Symbol" w:cs="Segoe UI Symbol"/>
          <w:color w:val="000000"/>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
          <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
          <w:lang w:eastAsia="en-GB"/>
        </w:rPr>
        <w:t>??(char)55357</w:t>
      </w:r>
      <w:r w:rsidRPr="007B6943">
        <w:rPr>
          <w:rFonts w:ascii="Times New Roman" w:eastAsia="Times New Roman" w:hAnsi="Segoe UI Symbol" w:cs="Segoe UI Symbol"/>
          <w:color w:val="000000"/>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
          <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
          <w:lang w:eastAsia="en-GB"/>
        </w:rPr>
        <w:t>??(char)55357</w:t>
      </w:r>
      <w:bookmarkStart w:id="0" w:name="_GoBack"/>
      <w:bookmarkEnd w:id="0"/>
    </w:p>
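
To confirm which rFonts slot actually differs between the two runs, the comparison can be scripted. This is a minimal sketch, assuming a trimmed copy of the sample paragraph: the `xmlns:w` declaration is added so the fragment parses on its own, a `?` stands in for the emoji, and the helper names `run_fonts` and `differing_slots` are hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
FONT_SLOTS = ("ascii", "hAnsi", "eastAsia", "cs")

# Trimmed copy of the sample paragraph. The xmlns:w declaration is added
# here (assumption) so the fragment is parseable without the full document.
SAMPLE = f"""
<w:p xmlns:w="{W_NS}">
  <w:r>
    <w:rPr>
      <w:rFonts w:ascii="Segoe UI Symbol" w:eastAsia="Times New Roman" w:hAnsi="Segoe UI Symbol" w:cs="Segoe UI Symbol"/>
    </w:rPr>
    <w:t>?</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:rFonts w:ascii="Times New Roman" w:eastAsia="Times New Roman" w:hAnsi="Segoe UI Symbol" w:cs="Segoe UI Symbol"/>
    </w:rPr>
    <w:t>?</w:t>
  </w:r>
</w:p>
""".strip()


def run_fonts(paragraph_xml):
    """Return one {slot: font name} dict per w:r element in the paragraph."""
    root = ET.fromstring(paragraph_xml)
    result = []
    for run in root.iter(f"{{{W_NS}}}r"):
        fonts = run.find(f"{{{W_NS}}}rPr/{{{W_NS}}}rFonts")
        attrs = fonts.attrib if fonts is not None else {}
        result.append({slot: attrs.get(f"{{{W_NS}}}{slot}") for slot in FONT_SLOTS})
    return result


def differing_slots(fonts_a, fonts_b):
    """List the rFonts slots whose values differ between two runs."""
    return [slot for slot in FONT_SLOTS if fonts_a[slot] != fonts_b[slot]]


first, second = run_fonts(SAMPLE)
print(differing_slots(first, second))  # ['ascii']
```

Only the ascii slot differs, so any difference in rendering between the two runs has to come from that attribute.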

    Tuesday, July 19, 2016 8:09 AM

Answers

  • Hi Sándor, I have concluded my research for this issue. Here is an explanation of the behavior…

     

    MS-OI29500 section 2.1.87, item b, details how Word chooses which font from the rFonts element to use for the text in a given run. According to the table that follows it, Word should use the font assigned to the High and Low Surrogate blocks, which is the eastAsia (or eastAsiaTheme) font. In practice it does not; it chooses the ascii font instead. I have filed a request to have the table corrected.

     

    It may not seem obvious why this character falls into the High/Low Surrogate category and that was one of the things that took quite a bit of research to figure out. But if you are interested, the following explanation should help…

     

    • The character in the run (U+1F63C) belongs to the Emoticons block of the Unicode standard, which spans U+1F600 to U+1F64F. This is significant because these code points lie above U+FFFF, so the character needs more than 2 bytes to store.

    • Word stores the character in the XML file using UTF-8 encoding. If you examine the XML file with a hex editor, you'll see that it is stored as the 4-byte sequence F0 9F 98 BC.

    • When Word reads the value from the file into memory, it converts it to a pair of UTF-16 code units. Each one is a 2-byte value, and together they form a High (0xD83D) and Low (0xDE3C) Surrogate pair. The value 55357 in your question is the decimal form of the high surrogate, 0xD83D.

     

    If you are interested in how the Unicode/UTF-8/UTF-16 encoding conversions work, please take a look at definitions D91 (UTF-16 encoding form) and D92 (UTF-8 encoding form) in Chapter 3: Conformance of the Unicode standard.
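
The byte values quoted above can be reproduced in a few lines. This is a minimal sketch rather than anything Word itself runs: the helper name `utf16_surrogates` is hypothetical, and the arithmetic is the standard UTF-16 surrogate split, cross-checked against Python's own encoder.

```python
def utf16_surrogates(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits select the low surrogate
    return high, low


CAT = 0x1F63C
char = chr(CAT)

utf8_bytes = char.encode("utf-8")
high, low = utf16_surrogates(CAT)

print(utf8_bytes.hex().upper())   # F09F98BC
print(hex(high), hex(low))        # 0xd83d 0xde3c
print(high)                       # 55357

# Cross-check against Python's UTF-16 encoder (big-endian, no BOM).
assert char.encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

The last printed value shows that 55357 is not a character in its own right; it is only the first half of the pair that encodes U+1F63C.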


    Josh Curry (jcurry) | Escalation Engineer | Open Specifications Support Team

    Tuesday, August 9, 2016 6:43 PM
    Moderator

All replies

  • Hi Sandor,

    Thank you for contacting Microsoft Open Protocols support.  We have received the request and someone from the team will be in touch.

    Thank you,

    Nathan Manis

    Microsoft Open Specifications Support

    Tuesday, July 19, 2016 3:29 PM
    Moderator
  • Hello Sandor:

    I'm researching this for you. Please send your .docx file to my attention at: dochelp@microsoft.com

    Regards.


    Tarun Chopra | Escalation Engineer | Open Specifications Support Team

    Tuesday, July 19, 2016 3:49 PM
  • Hello Tarun,

    I have sent the file. If there is any problem with receiving it, please let me know.

    Thanks.

    Tuesday, July 19, 2016 3:52 PM
  • Hi Tarun,

      Did you manage to get closer to the solution of this issue?

    Cheers,

      Kolumbán

    Friday, August 5, 2016 9:03 AM
  • Hi Sándor,

    As of right now, the only thing we know for certain is that if the cs element is included in the run properties, Word will choose the correct font. We are still investigating why it does not choose the font specified by the cs attribute of the rFonts element on its own, but we believe that it has something to do with the Unicode range the character falls into. I will let you know when I have more information. Thank you for your continued patience.
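
For anyone who wants to try that workaround programmatically, here is a rough sketch that adds an empty `<w:cs/>` element to a run's properties with Python's standard library. The helper names `w` and `add_cs_flag` are hypothetical, the run fragment is a trimmed stand-in for the sample, and placing `w:cs` before `w:lang` is an assumption about the expected child order in `w:rPr`.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag):
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def add_cs_flag(run):
    """Add an empty <w:cs/> to the run's rPr unless one is already present."""
    rpr = run.find(w("rPr"))
    if rpr is None:
        rpr = ET.Element(w("rPr"))
        run.insert(0, rpr)
    if rpr.find(w("cs")) is not None:
        return
    cs = ET.Element(w("cs"))
    lang = rpr.find(w("lang"))
    if lang is not None:
        # Assumption: w:cs precedes w:lang in the rPr child order.
        rpr.insert(list(rpr).index(lang), cs)
    else:
        rpr.append(cs)


# Trimmed stand-in for the second run of the sample; "?" replaces the emoji.
RUN = f"""<w:r xmlns:w="{W_NS}">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:cs="Segoe UI Symbol"/>
    <w:sz w:val="24"/>
    <w:lang w:eastAsia="en-GB"/>
  </w:rPr>
  <w:t>?</w:t>
</w:r>"""

run = ET.fromstring(RUN)
add_cs_flag(run)
print([child.tag.split("}")[1] for child in run.find(w("rPr"))])
# ['rFonts', 'sz', 'cs', 'lang']
```

Whether Word then picks the cs font for a given character still has to be checked by opening the resulting document.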


    Josh Curry (jcurry) | Escalation Engineer | Open Specifications Support Team

    Friday, August 5, 2016 7:11 PM
    Moderator
  • Hi Josh,

    No problem, I see that this is a delicate issue. Let me know when you have something about why the cs attribute is not chosen. If I know why the ascii font is chosen, that is good enough for me.

    Best regards,

    Sándor

    Saturday, August 6, 2016 3:20 PM
  • Hi Josh,

    Thanks for the detailed answer. I think I will manage with this information.

    Cheers,

    Sándor

    Wednesday, August 10, 2016 8:34 AM