Scandinavian letters formatting bugged.

  • Question

  • I'm a student working part time in a company, and at the moment I'm working on a project that automatically generates some Word documents for our customers.

    I've used Open XML in the past to solve a lot of the automation issues, but with this particular document I've encountered an issue I've been unable to solve.

    The document is supposed to show the customer company's name, and since our customers are mainly from Denmark, that name will of course often contain the letters "æ", "ø" and "å".

    I use code that looks like this:

    using DocumentFormat.OpenXml;
    using DocumentFormat.OpenXml.Wordprocessing;

    public static void InsertIntoBookmark(BookmarkStart bookmarkStart, string text, String fontSize, Boolean bold)
    { // see the comments in the non-overloaded version
        String ascii = "Arial";
        String val = fontSize;
        // Walk the siblings between the bookmark start and end, removing the old content.
        OpenXmlElement elem = bookmarkStart.NextSibling();
        while (elem != null && !(elem is BookmarkEnd))
        {
            OpenXmlElement nextElem = elem.NextSibling();
            elem.Remove();
            elem = nextElem;
        }
        DocumentFormat.OpenXml.Wordprocessing.Text tex = new DocumentFormat.OpenXml.Wordprocessing.Text(text);
        DocumentFormat.OpenXml.Wordprocessing.RunProperties rPr = null;
        if (!bold) rPr = new DocumentFormat.OpenXml.Wordprocessing.RunProperties(new RunFonts() { Ascii = ascii }, new FontSize() { Val = val });
        else rPr = new DocumentFormat.OpenXml.Wordprocessing.RunProperties(new RunFonts() { Ascii = ascii }, new FontSize() { Val = val }, new Bold());
        // Attach the run properties to the run, then insert the run right after the bookmark start.
        DocumentFormat.OpenXml.Wordprocessing.Run r = new DocumentFormat.OpenXml.Wordprocessing.Run(rPr, tex);
        bookmarkStart.Parent.InsertAfter<DocumentFormat.OpenXml.Wordprocessing.Run>(r, bookmarkStart);
    }

    And it actually works fine with pretty much anything. EXCEPT those evil Scandinavian letters.

    I've made a small picture below that illustrates the problem.
    The upper part is the text as the program generates it in the document.
    The lower part is the same text after I reformatted it in Word to how it should look.

    Word actually changes the font from Arial to Times New Roman on JUST the "ø" and nothing else.
    But it gets better. When I look in document.xml in the .docx file, the misformatted part looks like this:

    <w:bookmarkStart w:name="AktoerInterntSagsNavn3" w:id="47" />
    <w:r><w:rPr><w:rFonts w:ascii="Arial" /><w:sz w:val="18" /><w:b /></w:rPr>
    <w:t>Ringkøbing skole og musikskole</w:t>
    </w:r><w:bookmarkEnd w:id="47" />

    So how the !"¤% is it possible for Word to choose to use Times New Roman only on æ, ø and å?

    Any suggestions and tips would be insanely appreciated.
    Also please understand, that although I'm a bit of a novice in this field and the code might not be the prettiest thing ever seen, I'm not at liberty to show all of the document or all of the code due to the type of work I'm doing here :)

    Also, I apologise if I haven't been able to explain the problem well enough. I'm not used to explaining myself in English. Follow-up questions are very welcome.

    - Kaspar Kjeldsen

    Wednesday, January 30, 2013 12:50 PM


All replies

  • Hi Kaspar

    No worries about your English :-)  Your problem description is quite clear.

    I understand the restrictions under which you work, but would it be possible for you to generate a simpler document that demonstrates the behavior? It need be only a couple of lines of (nonsense) text. Preferably with all three problem characters in it.

    I'd like to take a look at this and, if I see the same and can't track down the problem, possibly pass it "up the line".

    I'm not too familiar with Danish, but I am wondering if we're looking at a problem with character recognition. You are specifying the ASCII font, not ANSI (or Unicode: hAnsi). And I remember from the "good old DOS days" that ASCII supports only a limited set of characters / character page. I think it's possible that Word is looking at these characters, not finding them in ASCII Arial, and is therefore substituting the Normal style font for hAnsi. If you look in styles.xml and theme1.xml, what fonts are listed as the document defaults and the Normal style default?

    In theme1.xml you're looking for the elements <a:majorFont> and <a:minorFont>.

    In styles.xml you're looking for <w:docDefaults><w:rPrDefault><w:rPr><w:rFonts> as well as <w:style w:styleId="Normal">

    My best guess, at this point, is that Word is performing a font substitution because it can't find these characters in the ASCII code page...
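
    The document defaults can also be read from code instead of unzipping the .docx. The following is only a sketch, assuming the Open XML SDK already used in the question; the class and method names (DefaultFontInspector, Describe) are made up for illustration. It reports the w:ascii and w:hAnsi fonts from w:docDefaults and falls back to the theme attributes (for example minorHAnsi, which points at <a:minorFont> in theme1.xml) when no literal font name is set.

```csharp
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the Open XML SDK.
public static class DefaultFontInspector
{
    // Returns a short description of the default run fonts in w:docDefaults.
    public static string Describe(Styles styles)
    {
        var fonts = styles?.DocDefaults?.RunPropertiesDefault?
                           .RunPropertiesBaseStyle?.RunFonts;
        if (fonts == null)
        {
            return "no w:rFonts in w:docDefaults";
        }

        // Prefer the literal font name; otherwise show the theme font reference.
        string ascii = fonts.Ascii?.Value ?? fonts.AsciiTheme?.InnerText ?? "(unset)";
        string hAnsi = fonts.HighAnsi?.Value ?? fonts.HighAnsiTheme?.InnerText ?? "(unset)";
        return "ascii=" + ascii + ", hAnsi=" + hAnsi;
    }
}
```

    With an opened WordprocessingDocument named doc (an assumed variable), the call would be DefaultFontInspector.Describe(doc.MainDocumentPart.StyleDefinitionsPart.Styles). If hAnsi turns out to be Times New Roman while the run only sets w:ascii, that would be consistent with the substitution described above.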

    Cindy Meister, VSTO/Word MVP, my blog

    Wednesday, January 30, 2013 2:59 PM
  • Thanks for the quick reply :) 
    I've left work for the day, but I'll be back Friday, and I'll see if I can cook up a small example document I can share. I don't really have the opportunity to recreate the problem at home.

    Final note. I noticed that when I opened the document in Word and tried to just delete the text and write it again (instead of selecting all the text and choosing font and style), the problem persisted even in Word, and not just in my program. I could literally write a line like this "This is æ test øf åll chars", one char at a time, and watch in dismay as æ, ø and å magically turned to Times New Roman.

    Anyway, I'll be back with an update Friday.

    Wednesday, January 30, 2013 3:07 PM
  • Hi Kaspar

    Before you go to the trouble of creating a small sample, explore my suggestion that the problem may be due to the way you're setting the font. See also the explanation about font formatting in the Answer in this discussion:

    noting especially this statement: w:hAnsi corresponds to any character in the Unicode range that does not fall into one of the categories above

    You're setting the ASCII font, only. I believe you need to set the hAnsi attribute, too/instead.

    Cindy Meister, VSTO/Word MVP, my blog

    Thursday, January 31, 2013 9:18 AM
  • Thanks again for all the replies.

    It was the lack of a w:hAnsi attribute.
    I tried this dirty quick fix this morning:

    String wrun = "<w:rPr><w:rFonts w:ascii=\"Arial\" w:hAnsi=\"Arial\"/><w:b/><w:sz w:val=\"{0}\"/><w:szCs w:val=\"{1}\"/></w:rPr><w:t>{2}</w:t>";
    wrun = String.Format(wrun, 18, 18, "Testing ÆØÅ Done testing");
    DocumentFormat.OpenXml.Wordprocessing.Run r = new DocumentFormat.OpenXml.Wordprocessing.Run(wrun);

    bookmarkStart.Parent.InsertAfter<DocumentFormat.OpenXml.Wordprocessing.Run>(r, bookmarkStart);

    And it worked like a charm.

    Now I just need to clean up :)
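
    A cleaned-up version could set the same attributes through the typed SDK classes instead of a raw XML string. This is only a sketch under that assumption: BookmarkRuns and BuildRun are hypothetical names, and the font and size are simply the ones from the quick fix. RunFonts.HighAnsi is the SDK property that maps to the w:hAnsi attribute.

```csharp
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the Open XML SDK.
public static class BookmarkRuns
{
    // Builds a run whose font applies both to the ASCII range (w:ascii)
    // and to other Unicode characters such as æ, ø and å (w:hAnsi).
    public static Run BuildRun(string text, string fontName, string halfPoints, bool bold)
    {
        var rPr = new RunProperties();
        rPr.Append(new RunFonts { Ascii = fontName, HighAnsi = fontName });
        if (bold)
        {
            rPr.Append(new Bold());
        }
        // w:sz and w:szCs are both given in half-points ("18" = 9 pt).
        rPr.Append(new FontSize { Val = halfPoints });
        rPr.Append(new FontSizeComplexScript { Val = halfPoints });

        return new Run(rPr, new Text(text));
    }
}
```

    The run can then be inserted as before: bookmarkStart.Parent.InsertAfter(BookmarkRuns.BuildRun("Testing ÆØÅ Done testing", "Arial", "18", true), bookmarkStart);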

    Friday, February 1, 2013 7:58 AM
  • Hi Kaspar

    Glad we tracked it down so quickly :-)

    Tip: If you're applying the same formatting combination more than once, it would make sense to create a style definition for that formatting in styles.xml, then reference that style (the style id in pPr, the ParagraphProperties) in order to apply the formatting. That should streamline both your code and the Word document.

    Or, if this is formatting for a Run, create a Character rather than Paragraph style and reference the style id in rPr.
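
    As a rough illustration of the character-style variant, assuming the Open XML SDK from the question: the style id "CustomerName", the display name and the class name below are invented for this sketch, and the formatting values are the ones used earlier in the thread.

```csharp
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the Open XML SDK.
public static class CustomerNameStyle
{
    // Assumed style id; any id that is unique within styles.xml would do.
    public const string CharacterStyleId = "CustomerName";

    // Builds a character style carrying the font, bold and size settings once.
    public static Style BuildCharacterStyle()
    {
        var style = new Style
        {
            Type = StyleValues.Character,
            StyleId = CharacterStyleId,
            CustomStyle = true
        };
        style.Append(new StyleName { Val = "Customer Name" });
        style.Append(new StyleRunProperties(
            new RunFonts { Ascii = "Arial", HighAnsi = "Arial" },
            new Bold(),
            new FontSize { Val = "18" },
            new FontSizeComplexScript { Val = "18" }));
        return style;
    }

    // Builds a run that only references the style id in its rPr.
    public static Run BuildStyledRun(string text)
    {
        return new Run(
            new RunProperties(new RunStyle { Val = CharacterStyleId }),
            new Text(text));
    }
}
```

    The style would be appended once to the document's styles, for example mainPart.StyleDefinitionsPart.Styles.Append(CustomerNameStyle.BuildCharacterStyle()) with mainPart being the MainDocumentPart, after which every run only needs the w:rStyle reference.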

    Cindy Meister, VSTO/Word MVP, my blog

    Friday, February 1, 2013 8:25 AM