How does the lang attribute affect the classification of characters in PPTX files?

  • Question

  • Within the OOXML specifications, Annex J provides information about bidirectional support. Annex J.7 specifies that, before displaying text contained in WordprocessingML documents, a consumer must use the Unicode Bidirectional Algorithm to resolve the classification of the characters in each line. Annex J.7 does not say what should be done for text contained in DrawingML documents. However, Annex J.2 states that certain properties are shared to provide identical bidirectional support for WordprocessingML and DrawingML. Because of this "identical bidirectional support", I assume here that the Unicode Bidirectional Algorithm should also be used to resolve the classification of characters for text in DrawingML documents.

    In the following discussion, capital letters represent Hebrew letters, and the Unicode Bidirectional Algorithm is used to resolve the classification of characters. The texts are in a PPTX file. The paragraph formatting is identical in each example; only the number of text runs, their formatting, and their content differ.

    If the Hebrew text is in a single text run, PowerPoint works as expected:

    <a:r>
      <a:rPr lang="he-IL"/>
      <a:t>FIRST SECOND</a:t>
    </a:r>

    The space between the words is resolved to class R by rule N1. Therefore, the output is:

    DNOCES TSRIF
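
    The N1 step above can be sketched in a few lines of Python. This is only a toy model, not the full Unicode Bidirectional Algorithm: it assumes a left-to-right paragraph, treats uppercase letters as class R (the convention used in this post), other letters as L, and everything else as neutral. The function names are made up for illustration.

```python
def classify(ch):
    """Toy classes: uppercase = R (stand-in for Hebrew), other letters = L, rest = N."""
    if ch.isalpha():
        return "R" if ch.isupper() else "L"
    return "N"


def resolve_neutrals(classes):
    """Rules N1/N2 for an LTR paragraph: sor, eor and the embedding direction are L."""
    out = list(classes)
    i = 0
    while i < len(out):
        if out[i] != "N":
            i += 1
            continue
        j = i
        while j < len(out) and out[j] == "N":
            j += 1
        before = out[i - 1] if i > 0 else "L"
        after = out[j] if j < len(out) else "L"
        # N1: both neighbours agree -> take their direction; N2: otherwise -> L
        out[i:j] = [before if before == after else "L"] * (j - i)
        i = j
    return out


def display(text):
    """Visual order in an LTR paragraph: reverse each maximal run of R characters."""
    classes = resolve_neutrals([classify(c) for c in text])
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and classes[j] == classes[i]:
            j += 1
        run = text[i:j]
        out.append(run[::-1] if classes[i] == "R" else run)
        i = j
    return "".join(out)


print(display("FIRST SECOND"))  # DNOCES TSRIF
```

    Because the space sits between two R characters, it joins them into a single right-to-left run, which is then reversed as a whole.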

    However, PowerPoint behaves in an undocumented way if the space between the words is marked as English:

    <a:r>
      <a:rPr lang="he-IL"/>
      <a:t>FIRST</a:t>
    </a:r>
    <a:r>
      <a:rPr lang="en-US"/>
      <a:t> </a:t>
    </a:r>
    <a:r>
      <a:rPr lang="he-IL"/>
      <a:t>SECOND</a:t>
    </a:r>

    It looks as if the lang attribute adds a left-to-right mark before the space, because the output is:

    TSRIF DNOCES
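
    For reference, the Bidi_Class values involved in this comparison can be inspected with Python's standard unicodedata module. This is a minimal sketch: the two Hebrew letters are arbitrary stand-ins for the capital-letter words, and bidi_classes is a helper name introduced here.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode Bidi_Class of each character, in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]


ALEF, BET = "\u05d0", "\u05d1"  # Hebrew letters standing in for FIRST / SECOND
LRM = "\u200e"                   # LEFT-TO-RIGHT MARK

print(bidi_classes(ALEF + " " + BET))        # ['R', 'WS', 'R']
print(bidi_classes(ALEF + LRM + " " + BET))  # ['R', 'L', 'WS', 'R']
```

    With the mark inserted, the space sits between a strong L and a strong R, so rule N1 no longer applies and rule N2 gives it the embedding direction. In a left-to-right paragraph (an assumption, since the paragraph direction is not shown in the snippets) that separates the two Hebrew words, which matches the output above.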

    Simply adding a left-to-right mark does not seem to be how the lang attribute works, however, because:

    <a:r>
      <a:rPr lang="he-IL"/>
      <a:t>FIRS</a:t>
    </a:r>
    <a:r>
      <a:rPr lang="en-US"/>
      <a:t>T</a:t>
    </a:r>
    <a:r>
      <a:rPr lang="en-US"/>
      <a:t> </a:t>
    </a:r>
    <a:r>
      <a:rPr lang="en-US"/>
      <a:t>S</a:t>
    </a:r>
    <a:r>
      <a:rPr lang="he-IL"/>
      <a:t>ECOND</a:t>
    </a:r>

    and

    <a:r>
      <a:rPr lang="he-IL"/>
      <a:t>FIRS</a:t>
    </a:r>
    <a:r>
      <a:rPr lang="en-US"/>
      <a:t>T S</a:t>
    </a:r>
    <a:r>
      <a:rPr lang="he-IL"/>
      <a:t>ECOND</a:t>
    </a:r>

    both produce the output:

    DNOCES TSRIF
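
    One way to compare the two markups above is to read the runs with Python's xml.etree.ElementTree and coalesce neighbouring runs that carry the same lang value. Whether PowerPoint coalesces runs like this is an assumption, not something the specification states; the sketch only shows that, if it does, both variants reduce to the same three runs. The helper names (paragraph, runs, merge_by_lang) are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS = {"a": A_NS}


def paragraph(*pairs):
    """Build a minimal a:p fragment from (lang, text) pairs."""
    body = "".join(
        f'<a:r><a:rPr lang="{lang}"/><a:t>{text}</a:t></a:r>' for lang, text in pairs
    )
    return f'<a:p xmlns:a="{A_NS}">{body}</a:p>'


def runs(paragraph_xml):
    """Return (lang, text) for each a:r child of the paragraph."""
    result = []
    for r in ET.fromstring(paragraph_xml).findall("a:r", NS):
        rpr = r.find("a:rPr", NS)
        lang = rpr.get("lang") if rpr is not None else None
        result.append((lang, r.findtext("a:t", default="", namespaces=NS)))
    return result


def merge_by_lang(run_list):
    """Coalesce adjacent runs that share the same lang (assumed behaviour)."""
    return [
        (lang, "".join(text for _, text in group))
        for lang, group in groupby(run_list, key=lambda run: run[0])
    ]


split = paragraph(("he-IL", "FIRS"), ("en-US", "T"), ("en-US", " "),
                  ("en-US", "S"), ("he-IL", "ECOND"))
joined = paragraph(("he-IL", "FIRS"), ("en-US", "T S"), ("he-IL", "ECOND"))

print(merge_by_lang(runs(split)))
print(merge_by_lang(runs(split)) == merge_by_lang(runs(joined)))  # True
```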

    Due to the unexpected behaviour of PowerPoint, my questions are:

    1) What algorithm does PowerPoint use for the classification of characters, if it is not the Unicode Bidirectional Algorithm?

    2) What undocumented effect does the lang attribute have on the classification of characters in PPTX files? When is the effect applied?

    3) What other attributes or elements in PPTX files have an undocumented effect on the classification of characters? What are the effects of those attributes and elements, and when are they applied?

    Thank you for the information!

      Tero

    Thursday, February 7, 2013 1:11 PM

Answers

  • Hi Tero,

    Sorry for the late response to your last post.  I'll reiterate your questions here with my responses:

    >>1) What algorithm PowerPoint uses for the classification of characters, if it is not Unicode Bidirectional Algorithm?
    In general the UBA (Unicode Bidirectional Algorithm) is used.  However, rich text processing applications like PowerPoint (and Word), unlike pure text processing products, will use the install language and/or w:lang attribute when processing neutral characters.  I meant to be more explicit about this implication of the information I shared with you earlier, but apparently was not.

    >>2) What undocumented effect the lang-attribute has to the classification of characters within PPTX files? When the effect is applied?
    The effect of the lang attribute on the directionality of characters is not considered a normative part of the standard.  Appendix I, in which the lang attribute is discussed with respect to bidirectionality, is informative.  This is what I was alluding to in my last response.

    >>3) What other attributes or elements exist within PPTX files that have undocumented effect to the classification of characters? What are the effects of those attributes and elements and when the effects are applied?
    Beyond the effects of the specific case you have brought up, none are known.  If you find others, please feel free to post here and we will investigate.

    Thanks,
    Tom

    Monday, May 6, 2013 7:11 PM
    Moderator

All replies

  • Hi Tero

    Thanks for contacting Microsoft Support. A support engineer will contact you to assist further. 

    Thanks


    Tarun Chopra | Escalation Engineer | Open Specifications Support Team

    Thursday, February 7, 2013 9:19 PM
  • Hi Tero,

    I will be investigating this for you. 

    Best regards,
    Tom Jebo
    Escalation Engineer
    Microsoft Open Specifications

    Thursday, February 7, 2013 9:57 PM
    Moderator
  • Hi Tero,

    After reviewing the algorithm/rules and the PowerPoint behavior, it appears that the presence of Hebrew characters in the run marked with lang="en-US" causes the run to be at the same embedding level and the space to be treated as right-to-left due to its surrounding characters.  The isolated space marked with lang="en-US", however, seems (as you've observed) to be treated as a new embedding level, causing a change in direction and a subsequent break between the two Hebrew runs around it.

    I need more time to get more definitive information but these are my first observations.

    Tom

    Saturday, February 9, 2013 2:28 AM
    Moderator
  • Hi Tom,

    Thank you for your thoughts. Spaces are not the only characters that the language setting affects. For example, take a look at numbers. The text runs:

    <a:r>
      <a:rPr lang="en-US"/>
      <a:t>abc123FIRST456def</a:t>
    </a:r>
    <a:r>
      <a:rPr lang="he-IL"/>
      <a:t>abc123FIRST456def</a:t>
    </a:r>

    produce the outputs "abc123TSRIF456def" and "abc456TSRIF123def", respectively. Both differ from the expected output, "abc123456TSRIFdef".

    In this case the difference in the outputs cannot be explained by the surrounding characters alone. I hope that you can get a definitive answer on how the lang attribute affects embedding levels.
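
    The expected output quoted above can be reproduced with a small sketch of the relevant rules: W7 (a European number that follows a strong L becomes L), I1 (implicit levels at an even embedding level) and L2 (reordering). It assumes a left-to-right paragraph and input containing only letters and digits, and it uses the uppercase-means-Hebrew convention from this thread. The function names are illustrative only.

```python
def classify(ch):
    """Toy classes: digits = EN, uppercase = R (stand-in for Hebrew), else L."""
    if ch.isdigit():
        return "EN"
    return "R" if ch.isupper() else "L"


def levels_ltr(text):
    """Implicit levels for an LTR paragraph (base level 0), letters and digits only."""
    levels, last_strong = [], "L"  # sor is L when the base level is 0
    for ch in text:
        cls = classify(ch)
        if cls == "EN":
            if last_strong == "L":
                cls = "L"          # W7: EN after a strong L becomes L
        else:
            last_strong = cls
        # I1 at even level 0: L stays at 0, R goes to 1, EN goes to 2
        levels.append({"L": 0, "R": 1, "EN": 2}[cls])
    return levels


def reorder(text, levels):
    """L2: from the highest level down to 1, reverse each run at or above that level."""
    chars = list(text)
    for lvl in range(max(levels), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] < lvl:
                i += 1
                continue
            j = i
            while j < len(chars) and levels[j] >= lvl:
                j += 1
            chars[i:j] = chars[i:j][::-1]
            i = j
    return "".join(chars)


text = "abc123FIRST456def"
print(reorder(text, levels_ltr(text)))  # abc123456TSRIFdef
```

    Under these rules the digits after the Hebrew word stay inside the right-to-left run, while the digits after "abc" become part of the left-to-right text. Neither PowerPoint output follows that pattern.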

      Tero

    Tuesday, February 19, 2013 8:28 AM
  • Thanks Tero,

    I'm still investigating (and waiting for some input from our DrawingML team).  I'll let you know what I find.

    Tom

    Tuesday, February 19, 2013 10:45 PM
    Moderator
  • Hi Tero,

    Thanks again for your patience.  We continue to discuss this.  I hope to be able to post a summary of our findings soon.

    Tom

    Monday, March 4, 2013 7:48 PM
    Moderator
  • Hi Tero,

    Because Appendix I (Annex J in the ECMA-376 spec) is informative rather than normative, it does not constrain the behavior of any application.  What might be useful to you is this blog post by Murray Sargent:

    Tailoring the Unicode Bidi Algorithm

    which could help explain the behavior you are seeing.  Specifically, refer to the first section, "Keyboard-Driven Bidi Algorithm".

    Tom


    Friday, March 8, 2013 11:21 PM
    Moderator
  • Hi Tom,

    Thank you for the reply. Unfortunately, it does not answer the questions I asked in the original post. How PowerPoint implements the Office Open XML format with respect to bidi text is still a mystery.

    I do not doubt that the handling of Bidi text is complex. If it were simple, there would not have been any need for the questions.

      Tero

    Tuesday, March 12, 2013 10:06 AM
  • Hi Tom,

    Thank you for the more detailed answer.

    Because the effect of the lang attribute is not specified in the standard, interoperability between different products that use the standard cannot be achieved. The problem lies in the fact that the meaning of text stored as the standard specifies is ambiguous. For example, consider text stored as "A n B n C n n", where n denotes a neutral character.

    Two programs might show the text as "A n B n C n n" and "n n A n B n C", respectively. Because the standard does not specify exactly how the text should be shown, both products show it correctly as far as the standard is concerned. However, the user of the products will very probably think that only one of them works correctly.

    Do you know of a forum where I could ask how PowerPoint handles neutral characters?

      Tero

    Tuesday, May 21, 2013 10:45 AM
  • I would post this to the Office developer forums or Office IT Pro discussion forums.   

    Tom

    Tuesday, May 21, 2013 7:57 PM
    Moderator