Determining font script from major/minorFont and lang and text in DrawingML

  • Question

  • I can't quite figure out how to do something and need a fairly urgent reply. I don't think this is as much about implementation as it is about a lack of documentation. I need to determine the mapping between a font's script code/name (such as Jpan or Thai) below and the language tag (such as ja-JP or th-TH).

     

    I have the following in my theme1.xml:

          <a:minorFont>
            <a:latin typeface="Segoe UI"/>
            <a:ea typeface=""/>
            <a:cs typeface=""/>
            <a:font script="Jpan" typeface="MS Pゴシック"/>
            <a:font script="Hang" typeface="맑은 고딕"/>
            <a:font script="Hans" typeface="宋体"/>
            <a:font script="Hant" typeface="新細明體"/>
            <a:font script="Arab" typeface="Arial"/>
            <a:font script="Hebr" typeface="Arial"/>
            <a:font script="Thai" typeface="Cordia New"/>
            <a:font script="Ethi" typeface="Nyala"/>
            <a:font script="Beng" typeface="Vrinda"/>
            <a:font script="Gujr" typeface="Shruti"/>
            <a:font script="Khmr" typeface="DaunPenh"/>
            <a:font script="Knda" typeface="Tunga"/>
            <a:font script="Guru" typeface="Raavi"/>
            <a:font script="Cans" typeface="Euphemia"/>
            <a:font script="Cher" typeface="Plantagenet Cherokee"/>
            <a:font script="Yiii" typeface="Microsoft Yi Baiti"/>
            <a:font script="Tibt" typeface="Microsoft Himalaya"/>
            <a:font script="Thaa" typeface="MV Boli"/>
            <a:font script="Deva" typeface="Mangal"/>
            <a:font script="Telu" typeface="Gautami"/>
            <a:font script="Taml" typeface="Latha"/>
            <a:font script="Syrc" typeface="Estrangelo Edessa"/>
            <a:font script="Orya" typeface="Kalinga"/>
            <a:font script="Mlym" typeface="Kartika"/>
            <a:font script="Laoo" typeface="DokChampa"/>
            <a:font script="Sinh" typeface="Iskoola Pota"/>
            <a:font script="Mong" typeface="Mongolian Baiti"/>
            <a:font script="Viet" typeface="Arial"/>
            <a:font script="Uigh" typeface="Microsoft Uighur"/>
            <a:font script="Geor" typeface="Sylfaen"/>
          </a:minorFont>

     

    I have a run in Simplified Chinese, shown below, which displays in 宋体 (called "SimSun" in English):

                <a:r>
                  <a:rPr lang="zh-CN" smtClean="0">
                    <a:solidFill>
                      <a:schemeClr val="bg1"/>
                    </a:solidFill>
                  </a:rPr>
                  <a:t>你好</a:t>
                </a:r>

    It displays correctly.

     

    However, if I change lang="zh-CN" to lang="en-US", the run displays in my <a:latin typeface="Segoe UI"/> font.

     

    If I use Chinese text with an English language attribute of lang="en-US", it still displays in the SimSun font.

     

                <a:r>
                  <a:rPr lang="en-US" dirty="0" smtClean="0">
                    <a:solidFill>
                      <a:schemeClr val="bg1"/>
                    </a:solidFill>
                  </a:rPr>
                  <a:t>你好</a:t>
                </a:r>

    So again, the question is: how do lang IDs/tags in runs map to script codes in minorFont/majorFont font elements in DrawingML? Also, does anything else determine which font displays?

     

    I'm well aware of RFC 4646/BCP 47 for the lang attribute. The script codes ("Hans", "Ethi", etc.) appear to be from ISO 15924, but that is not explicitly stated, nor is any kind of mapping from RFC 4646 (or 4647) defined in any of the documentation. Section 20.1.4.1.16 of ISO/IEC 29500 states under the script attribute: "Specifies the script, or language, in which the typeface is supposed to be used. The possible values for this attribute are defined by the W3C XML Schema string datatype." This just tells us it could be any string at all, like "gingersnap". It does not say that the actual values listed in the attribute are semantic and have a mapping.

    It appears that WG4 has also noticed this is a problem and has started discussing it (see .doc on http://mailman.vse.cz/pipermail/sc34wg4/2011-March/002204.html) - but the focus appears to be on WordprocessingML, not DrawingML.

    Sunday, July 10, 2011 4:49 AM

All replies

  • Hi Okatu,

    Thank you for your question. A member of the protocol documentation team will respond to you soon.

     


    Josh Curry (jcurry) | Escalation Engineer | US-CSS DSC Protocols Team
    Sunday, July 10, 2011 6:02 PM
    Moderator
  • Hi Okatu,

    I'll be investigating your question for you.  I'll get back to you as soon as I have an answer for you.

    Best regards,
    Tom Jebo
    Escalation Engineer
    Microsoft Open Specifications

    Monday, July 11, 2011 3:17 PM
  • Thanks Tom.

    I've done some research on what PowerPoint is implementing and wanted to share my results with you.

     

    Research Scenario:

    1. I created a .PPTM in PowerPoint 2010 (x64/SP1) and a VBA script which inserted a single textbox for every available MsoLanguageID (for example msoLanguageIDEdo, msoLanguageIDChineseHongKongSAR, etc.). The ".Text" string of the ".TextFrame.TextRange" was the string value of the MsoLanguageID (so inside the textbox would be the text "msoLanguageIDEdo", for example). I then ran a VBA script to find what font the TextBox was displaying in.
    2. After this, I looped through the .PPTX in Visual Studio and grabbed the "lang" value (and "altLang", if it exists) of each TextBox.
    3. Then I compared the display fonts of the TextBoxes against the <minorFont>.<font script="X"> values of my theme1.xml.
    4. From there, I manually compared all lang values (IETF language tags) against the IANA subtag registry at http://www.iana.org/assignments/language-subtag-registry. I believe its script subtags are the same as the ISO 15924 codes.
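    Step 2 above can be sketched in a few lines of Python instead of Visual Studio. This is only a minimal illustration, assuming the slide XML has already been read out of the .pptx package; the function name run_languages and the sample fragment are made up for the example.

```python
import xml.etree.ElementTree as ET

# DrawingML main namespace (the "a:" prefix in the snippets in this thread).
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def run_languages(xml_text):
    """Return (lang, altLang, text) for every <a:r> run in a DrawingML fragment."""
    root = ET.fromstring(xml_text)
    results = []
    for run in root.iter(f"{{{A_NS}}}r"):
        rpr = run.find(f"{{{A_NS}}}rPr")
        text = run.findtext(f"{{{A_NS}}}t", default="")
        # Either attribute may be missing, in which case None is reported.
        lang = rpr.get("lang") if rpr is not None else None
        alt_lang = rpr.get("altLang") if rpr is not None else None
        results.append((lang, alt_lang, text))
    return results


# Hypothetical sample paragraph, modelled on the runs quoted in this thread.
sample = (
    f'<a:p xmlns:a="{A_NS}">'
    '<a:r><a:rPr lang="zh-CN" smtClean="0"/><a:t>你好</a:t></a:r>'
    '<a:r><a:rPr lang="ii-CN" altLang="ii-CN" dirty="0"/><a:t>msoLanguageIDYi</a:t></a:r>'
    "</a:p>"
)

for lang, alt_lang, text in run_languages(sample):
    print(lang, alt_lang, text)
```

    The resulting tuples can then be compared by hand against the script attributes in theme1.xml, as in steps 3 and 4.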

    I used a slightly different <minorFont> set this time, here it is:

          <a:minorFont>
            <a:latin typeface="Calibri"/>
            <a:ea typeface=""/>
            <a:cs typeface=""/>
            <a:font script="Jpan" typeface="MS Pゴシック"/>
            <a:font script="Hang" typeface="맑은 고딕"/>
            <a:font script="Hans" typeface="宋体"/>
            <a:font script="Hant" typeface="新細明體"/>
            <a:font script="Arab" typeface="Times New Roman"/> <!--used = yes-->
            <a:font script="Hebr" typeface="Arial"/> <!--used = yes-->
            <a:font script="Thai" typeface="Cordia New"/> <!--used = yes-->
            <a:font script="Ethi" typeface="Nyala"/>  <!--used = yes-->
            <a:font script="Beng" typeface="Vrinda"/> <!--used = yes-->
            <a:font script="Gujr" typeface="Shruti"/> <!--used = yes-->
            <a:font script="Khmr" typeface="DaunPenh"/> <!--used = yes-->
            <a:font script="Knda" typeface="Tunga"/> <!--used = yes-->
            <a:font script="Guru" typeface="Raavi"/> <!--used = yes-->
            <a:font script="Cans" typeface="Euphemia"/>  <!--used = yes-->
            <a:font script="Cher" typeface="Plantagenet Cherokee"/> <!--used = yes-->
            <a:font script="Yiii" typeface="Microsoft Yi Baiti"/>  <!--used in PowerPoint = NO-->
            <a:font script="Tibt" typeface="Microsoft Himalaya"/> <!--used = yes-->
            <a:font script="Thaa" typeface="MV Boli"/> <!--used = yes-->
            <a:font script="Deva" typeface="Mangal"/> <!--used = yes-->
            <a:font script="Telu" typeface="Gautami"/> <!--used = yes-->
            <a:font script="Taml" typeface="Latha"/> <!--used = yes-->
            <a:font script="Syrc" typeface="Estrangelo Edessa"/> <!--used = yes-->
            <a:font script="Orya" typeface="Kalinga"/> <!--used = yes-->
            <a:font script="Mlym" typeface="Kartika"/> <!--used = yes-->
            <a:font script="Laoo" typeface="DokChampa"/> <!--used = yes-->
            <a:font script="Sinh" typeface="Iskoola Pota"/> <!--used = yes-->
            <a:font script="Mong" typeface="Mongolian Baiti"/>  <!--used in PowerPoint  = NO-->
            <a:font script="Viet" typeface="Candara"/>  <!--used = yes-->
            <a:font script="Uigh" typeface="Microsoft Uighur"/>   <!--used in PowerPoint  = NO-->
          </a:minorFont>

     

    Conclusion:

    The results are in an Excel file on http://dl.dropbox.com/u/16442383/font%20list.xlsx for you to view. The conclusion is that "almost everything" maps to IANA sub tags.

    • Not everything, though, as PowerPoint uses some non-IETF (or at least inconsistent) tags. Sometimes a three-letter tag is used. Sometimes a two-letter tag plus a four-letter script is used.
    • In one case, I was expecting the "Ethi" script to be used based off the language tag, but it wasn't.
    • Not all scripts are used by (or available to) the PowerPoint client - in particular "Uigh", "Mong" and "Yiii".
    • It appears that the display mechanism generally uses IANA script groupings, unless it is an East Asian font, in which case it is a combination of the lang attribute and Unicode.

     

    My conclusion is that I believe I can figure out about 80-90% of what is going on. However, it would be great to have confirmation from you on the exact font selection algorithm and mapping.


    Tuesday, July 12, 2011 6:38 PM
  • Tom, it's been a week and I haven't heard anything from you, despite my detailed follow up on my own investigation. By not responding, have you given up on assisting?
    Monday, July 18, 2011 7:22 AM
  • Josh, do you think you could get someone else to help? Tom has not responded at all.
    Tuesday, July 19, 2011 6:31 AM
  • Is there a reason all my attempts to solicit a reply - any reply at all - are being ignored?
    Tuesday, July 19, 2011 8:07 PM
  • Okatu,

    I apologize for not responding to your post yesterday.  I'm waiting for a response from our Word team, this seems a bit more complicated than we had at first thought.

    Thank you for your patience.

    Best regards,
    Tom Jebo
    Escalation Engineer
    Microsoft Open Specifications

    Wednesday, July 20, 2011 4:00 PM
  • Thank you Tom for the reply! Note that I'm not asking about Word here, I'm asking about PowerPoint. Word appears to have richer features for determining which script to use; it's DrawingML that is my concern.
    Wednesday, July 20, 2011 4:34 PM
  • I should have said "Office" team.  They share a lot of this DrawingML code so there's some overlap here.

    Tom

    Wednesday, July 20, 2011 5:29 PM
  •  Okatu,

    Here's an overview of the algorithm from the PowerPoint team which may help a little:

    Algorithm:
       a. Call MSO APIs to convert a LCID to a script tag. It uses language or alt language (if the former is missing). And it requires a language and slot match; otherwise it returns the default typeface.
       b. Use the script tag to search the major (or minor) font collection. If found, then OK.
       c. If not found, we return the default typeface, depending on whether it’s latin, or ea, or cs.
    For each run, we run this three times: once each for latin, ea, and cs. So each run has three typefaces at runtime. In the Ribbon font gallery, however, OArt text shows the theme font name: e.g. a piece of English text tagged “Mlym” (Malayalam) appears as “Kartika (Body)” in the font gallery, but it actually renders on the slide in Calibri.
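    Steps a to c, run once per slot, can be sketched roughly as follows in Python. This is only an illustration of the description above, not PowerPoint's code: the LANG_TO_SCRIPT table (including which slot each script belongs to) is a hypothetical stand-in for the undocumented MSO LCID-to-script conversion, and all names are invented for the example.

```python
# Hypothetical stand-in for the MSO LCID -> script conversion. Both the tags
# and the slot each script is tied to are assumptions made for illustration.
LANG_TO_SCRIPT = {
    "zh-CN": ("Hans", "ea"),
    "ja-JP": ("Jpan", "ea"),
    "th-TH": ("Thai", "cs"),
    "mn-MN": ("Cyrl", "latin"),  # per the reply above: LCID 0x450 maps to "Cyrl"
}

# Cut-down minorFont collection modelled on the theme1.xml in this thread.
MINOR_FONT = {
    "latin": "Calibri",
    "ea": "",
    "cs": "",
    "scripts": {
        "Hans": "宋体",
        "Jpan": "MS Pゴシック",
        "Thai": "Cordia New",
        "Mong": "Mongolian Baiti",
    },
}


def typeface_for_slot(theme, slot, lang=None, alt_lang=None):
    """Pick the typeface for one slot ('latin', 'ea' or 'cs') of a run."""
    tag = lang or alt_lang  # step a: alt language only if lang is missing
    entry = LANG_TO_SCRIPT.get(tag)
    if entry is not None:
        script, script_slot = entry
        # step b: needs a language/slot match and a matching <a:font script=...>
        if script_slot == slot and script in theme["scripts"]:
            return theme["scripts"][script]
    return theme[slot]  # step c: default typeface for this slot


def typefaces_for_run(theme, lang=None, alt_lang=None):
    """Run the lookup three times, once per slot."""
    return {
        slot: typeface_for_slot(theme, slot, lang, alt_lang)
        for slot in ("latin", "ea", "cs")
    }


print(typefaces_for_run(MINOR_FONT, lang="zh-CN"))
print(typefaces_for_run(MINOR_FONT, lang="mn-MN"))
```

    Under these assumptions a zh-CN run gets 宋体 only in its ea slot, and an mn-MN run never reaches the “Mong” entry because its script tag resolves to “Cyrl”, which the theme does not list.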

    Regarding the three mismatch cases:
    1. Why doesn't “Mong” work? Most likely, its LCID (0x450) is converted to script “Cyrl”.

    2. Why doesn't “Yiii” work? We can’t reproduce this. Can you provide a .pptx sample document to help us reproduce the problem?

    3. Why doesn't “Uigh” work? Do you mean VBA doesn’t support it? If so, you are right. But you can set it in the XML using “UG-CN”.

    Best regards,
    Tom Jebo
    Escalation Engineer
    Microsoft Open Specifications


    Saturday, July 23, 2011 7:00 PM
  • Thanks Tom.

     

    • Regarding point A: How are the LCIDs mapped to script tags? Does it use IANA scripts or...? I can't imagine this would be proprietary information, as that would defeat the whole purpose of having an open format for DrawingML and any type of globalization. That said, it doesn't appear to be documented anywhere. As demonstrated in my follow-up post, I can sort of figure things out, but I need the standard itself to document the mapping.

     

    • Regarding point C: Is it that there is only one default typeface (latin, ea or cs) at a time in <rPr>? Is the default script found based on the Unicode values of the characters in the run text or something else?

     

    • Regarding the sentence that begins with "For each run": I'm not sure I understand what you're saying. Can you elaborate?

     

    • Regarding your questions 1, 2 and 3: those are examples of where VBA doesn't create this using any of the MsoLanguageIDs - i.e. if you create a textbox in all languages that PowerPoint supports automated input from, these 3 never appear (using PowerPoint 2010 x64).
    Saturday, July 23, 2011 9:32 PM
  • Ok, Okatu, I'm still looking into these. 

    Tom

    Tuesday, July 26, 2011 3:50 PM
  • Any updates?
    Monday, August 1, 2011 7:42 PM
  • Hi Okatu,

    I will be posting some answers today.  Thanks for your patience.

    Tom

    Wednesday, August 3, 2011 1:59 PM
  • Okatu,

    Here are the answers I mentioned:

    >> Regarding point A: How are the LCIDs mapped to script tags? Does it use IANA scripts or...? I can't imagine this would be proprietary information, as that would defeat the whole purpose of having an open format for DrawingML and any type of globalization. That said, it doesn't appear to be documented anywhere. As demonstrated in my follow-up post, I can sort of figure things out, but I need the standard itself to document the mapping.

    Please send an email to dochelp at microsoft.com and I will discuss this with you further. 

    >> Regarding point C: Is it that there is only one default typeface (latin, ea or cs) at a time in <rPr>? Is the default script found based on the Unicode values of the characters in the run text or something else?

    For each theme, there is a default for each of them:


          <a:latin typeface="Calibri"/>
          <a:ea typeface=""/>
          <a:cs typeface=""/>

    There is no need to calculate the default script. We figure out whether it’s latin or ea or cs from characters, then we know which one from above to use.

    >> Regarding the sentence that begins with "For each run": I'm not sure I understand what you're saying. Can you elaborate?

    This is PowerPoint runtime behavior. E.g. for each <a:r>


          <a:r>
             <a:rPr lang="ii-CN" altLang="ii-CN" dirty="0" smtClean="0"/>
             <a:t>ZmsoLanguageIDYi</a:t>
             <!-- Latin-typeface = the default latin typeface; use this for layout / glyph at runtime -->
             <!-- CS-typeface = the default cs typeface; use this for layout / glyph at runtime -->
          </a:r>

    >> Regarding your questions 1, 2 and 3: those are examples of where VBA doesn't create this using any of the MsoLanguageIDs - i.e. if you create a textbox in all languages that PowerPoint supports automated input from, these 3 never appear (using PowerPoint 2010 x64).

    For “Mong” and “Yiii”, they work in PowerPoint 2010, and they are msoLanguageIDMongolian and msoLanguageIDYi. But for “Uigh”, you are right – there is no MsoLanguageID for this one. You can still use 1152 (0x480) as follows:


          ActivePresentation.Slides(1).Shapes(1).TextFrame.TextRange.LanguageID = 1152
     
    Best regards,
    Tom Jebo
    Escalation Engineer
    Microsoft Open Specifications

     

    Wednesday, August 3, 2011 5:14 PM
  • >>Please send an email to dochelp at microsoft.com and I will discuss this with you further.

    Done.

     

    >>There is no need to calculate the default script. We figure out whether it’s latin or ea or cs from characters, then we know which one from above to use.

    I guess this is the question. How do you figure out whether it's latin/ea/cs from the characters? Is it just looking for a particular set of codepages based on all the text, some of the text, or something else? Is it multiple codepages?

     

    >>For “Mong” and “Yiii”, they work in PowerPoint 2010, and they are msoLanguageIDMongolian and msoLanguageIDYi. But for “Uigh”, you are right – there is no MsoLanguageID for this one. You can still use 1152 (0x480)

    Confirmed for Uighur, but for Mongolian and Yi, the display font remains Calibri. For example:

    Sub CreateLangEntryBox()
        Dim p As Presentation: Set p = ActivePresentation
        Dim S As slide: Set S = p.Slides(1)
        Dim sh As Shape: Set sh = S.Shapes.AddShape(msoShapeRectangle, 200, 200, 200, 50)
        sh.Name = "Mongolian"
        With sh.TextFrame.TextRange
            .Text = "Mongolian"
            .LanguageID = msoLanguageIDMongolian
        End With
    End Sub

    Maybe this is like Chinese, where the text in the shape needs to be written in the script for the font to be displayed? I'm not sure how I would enter either of those scripts in VBA, but if you can confirm that this is like Chinese, where it only displays in Calibri if *not* in the expected script, that would be fine. Also, I believe this may be related to your ">>This is PowerPoint runtime behavior" comment. Is that correct?

     

    Friday, August 5, 2011 5:37 PM
  • Thanks Okatu, I'll get back to you soon.

    Tom

    Friday, August 5, 2011 6:17 PM
  • Okatu,

    Thanks for your patience.  Here are answers to your follow-up questions:

    >>I guess this is the question. How do you figure out whether it's latin/ea/cs from the characters? Is it just looking for a particular set of codepages based on all the text, some of the text, or something else? Is it multiple codepages?

    We use Unicode subranges plus some Windows APIs to decide this. Some typical examples:
    a. South Asia + Bidi -> CS.
    b. Surrogates are classified as FE.
    c. The FE check also uses the Windows API GetStringTypeExW.

    That said, we use Unicode rather than codepages / locales when dealing with text layout and glyphs.
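    To make the idea concrete, here is a rough Python sketch of classifying characters into latin, ea (FE) and cs by Unicode range. The ranges below are illustrative assumptions only, not PowerPoint's actual table, and the sketch does not call GetStringTypeExW; slot_for_char and split_by_slot are made-up names.

```python
from itertools import groupby

# Illustrative ranges only (an assumption, not PowerPoint's real table).
CS_RANGES = [
    (0x0590, 0x08FF),  # Hebrew through Arabic Extended-A (right-to-left scripts)
    (0x0900, 0x0DFF),  # Devanagari through Sinhala (South Asian scripts)
]
EA_RANGES = [
    (0x1100, 0x11FF),  # Hangul Jamo
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3),  # Hangul Syllables
]


def slot_for_char(ch):
    """Classify one character as 'latin', 'ea' or 'cs'."""
    cp = ord(ch)
    if cp > 0xFFFF:
        # Needs a surrogate pair in UTF-16, so treat it as FE (rule b).
        return "ea"
    if any(lo <= cp <= hi for lo, hi in CS_RANGES):
        return "cs"  # rule a
    if any(lo <= cp <= hi for lo, hi in EA_RANGES):
        return "ea"
    return "latin"  # everything else falls to the latin slot in this sketch


def split_by_slot(text):
    """Split text into consecutive (slot, substring) pieces."""
    return [(slot, "".join(chars)) for slot, chars in groupby(text, key=slot_for_char)]


print(split_by_slot("Hi 你好"))
```

    Each piece would then be drawn with the latin, ea or cs typeface resolved for the run.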

    >>Confirmed for Uighur, but for Mongolian and Yi, the display font remains Calibri. For example:
     
    Sub CreateLangEntryBox()
         Dim p As Presentation: Set p = ActivePresentation
         Dim S As slide: Set S = p.Slides(1)
         Dim sh As Shape: Set sh = S.Shapes.AddShape(msoShapeRectangle, 200, 200, 200, 50)
         sh.Name = "Mongolian"
         With sh.TextFrame.TextRange
             .Text = "Mongolian"
             .LanguageID = msoLanguageIDMongolian
         End With
     End Sub
     
    Probably you want to set the language first, and then the text, something like:

    ActivePresentation.Slides(1).Shapes(2).TextFrame.TextRange.LanguageID = msoLanguageIDYi
    ActivePresentation.Slides(1).Shapes(2).TextFrame.TextRange.Text = "am I yiii"

    Please note that Mongolian corresponds to “Cyrl”.

    Best regards,
    Tom Jebo
    Escalation Engineer
    Microsoft Open Specifications

     

     

     

     

    • Marked as answer by Todd Main Monday, August 15, 2011 8:51 PM
    Thursday, August 11, 2011 3:24 PM
  • This is perfect guidance, it really helps me a lot. Thanks so much!
    Monday, August 15, 2011 8:51 PM
  • Hi BoulderPika, thank you for your question. A member of the protocol documentation team will respond to you soon.

    Josh Curry (jcurry) | Escalation Engineer | Open Specifications Support Team

    Wednesday, February 29, 2012 2:55 PM
    Moderator
  • Hi BoulderPika,

    I think I understand your questions but it's been a while since I've looked at this.  Let me review it and get back to you shortly.

    Best regards,
    Tom Jebo
    Escalation Engineer
    Microsoft Open Specifications

    Thursday, March 1, 2012 4:28 PM
    Moderator
  • Hi BoulderPika,

    Please check out the post I just made:

    http://social.msdn.microsoft.com/Forums/en-US/os_openXML-ecma/thread/1bf1f185-ee49-4314-94e7-f4e1563b5c00

    The table should give you what you need to translate Unicode ranges to latin, cs and ea.

    Tom

    Monday, March 5, 2012 4:19 PM
    Moderator
  • Hi BoulderPika,

    I'll take a look at the file.

    Tom

    Tuesday, March 6, 2012 7:57 PM
    Moderator
  • Hi BoulderPika,

    Thanks for your patience on this.  We have been discussing this because there are implications that go beyond this specific case.  I will try to get you an answer by the end of this week. 

    Tom

    Tuesday, March 13, 2012 4:22 PM
    Moderator
  • BoulderPika,
     
    Yes, from what you’ve described, your interpretation seems correct and the specification itself is correct (aside from the pending defect report information, relevant parts of which I've already posted here).  PowerPoint is following the specification with respect to your font question.  We have discussed your scenario questions about selecting a font when the theme typeface (i.e. <a:ea typeface=""/>) is blank. Here is the result of our discussion:
     
    You correctly traced back to the a:fontScheme element and its children.  In this case, the file (typeface attribute of the a:ea element) has not requested the use of a particular font face for that type of text and the implementation is free to select an appropriate font for its context, given its particular needs and constraints.
     
    This gets into the realm of font substitution, fallback fonts and font linking.  This area is rather complicated and is a typical task for application developers dealing with text.  It arises even when font faces are specified, as the font requested is not always available in a given environment (e.g., which fonts are installed by default on various OSes, use of user-installed fonts).  Industry-wide, there is no common way to handle font selection, partly because applications sometimes have different needs for how detailed they need to be in choosing a font.  Often, developers will let the OS figure it out, but applications with more specific needs, such as Office, often put a lot of work into logic for finding the most appropriate font.
     
    So, it is up to the developer to decide which font to select based on the type of text, the fonts available on the system, and any particular design goals of the application.  An application might decide to use another major font, or the corresponding minor font if it is defined, or fall back to a “hard-coded” font appropriate for that font type (e.g., East Asian), or do further analysis of the text run and the available fonts on the system to find a more appropriate fallback.  There is a globalization article on MSDN that discusses these topics and that you may find helpful in deciding how your application should handle fonts: http://msdn.microsoft.com/en-us/goglobal/bb688134.
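    As one possible shape for such a fallback chain, here is a small Python sketch. It is only an example of the kind of logic an application might choose; the HARD_CODED table, the font names in it and the resolve_typeface function are hypothetical and not something the specification or Office defines.

```python
# Hypothetical last-resort fonts per slot; an application would pick its own.
HARD_CODED = {"latin": "Arial", "ea": "MS Gothic", "cs": "Arial"}


def resolve_typeface(slot, major, minor, installed):
    """Resolve a blank or unavailable major typeface for one slot.

    major and minor map slot names to typeface names ("" means not specified);
    installed is the set of font names available on the system.
    """
    candidates = [
        major.get(slot, ""),  # the typeface the theme asked for, if any
        minor.get(slot, ""),  # the corresponding minor font, if defined
        HARD_CODED[slot],     # an application-defined default for the slot
    ]
    for name in candidates:
        if name and name in installed:
            return name
    return None  # nothing usable: leave the choice to the OS


major = {"latin": "Calibri Light", "ea": "", "cs": ""}
minor = {"latin": "Calibri", "ea": "SimSun", "cs": ""}
installed = {"Calibri", "SimSun", "Arial"}

print(resolve_typeface("ea", major, minor, installed))
```

    A real application would likely go further, for example by checking which installed fonts actually cover the characters in the run.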

    Best regards,
    Tom Jebo
    Escalation Engineer
    Microsoft Open Specifications

    Tuesday, March 27, 2012 7:21 PM
    Moderator