Wrong values when reading the word count from a Microsoft Word document with OpenXML

  • Saturday, August 18, 2012 12:44 AM
     

    Hello everyone,

    I'm having a problem with the OpenXML library. I can open the file fine, but the values my code returns are wrong. For instance, I create a new Word file, type "TEST TEST TEST    " (with some empty characters after the last word), and the code reports 5 words instead of 3. The character count is slightly off as well.

    Here is the relevant part of the code:

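    using (WordprocessingDocument document = WordprocessingDocument.Open(Path, false))
    {
        // Words is the count stored in the extended properties part (app.xml),
        // not a live count. The part or the element may be missing, so guard against null.
        string wordsText = document.ExtendedFilePropertiesPart?.Properties?.Words?.Text;
        int _wordCount;
        if (!int.TryParse(wordsText, out _wordCount))
        {
            _wordCount = -1; // value missing or not a number
        }
        Console.WriteLine(_wordCount);
    }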
    

    Does anyone know the reason for this? Inside Word, the word count is displayed correctly.

    Thanks.

    Regards,

    Bruno


    "Racing, competing, is in my blood, is part of me, is part of my life" - Ayrton Senna da Silva

All Replies

  • Tuesday, August 21, 2012 2:15 AM
    Moderator
     
     

    Hi Bruno,

    Thanks for posting in the MSDN Forum.

    I can reproduce your issue. I will ask some experts to look into it and see whether they can help. There may be some delay, so I appreciate your patience.

    Have a good day,

    Tom


    Tom Xu [MSFT]
    MSDN Community Support | Feedback to us

  • Wednesday, August 22, 2012 8:47 PM
    Moderator
     

    Hi Bruno,

    I couldn't reproduce your result. Your code, run in my project, printed a word count of 2, which is exactly the value stored in the 'app.xml' extended properties part of my test document. Here is that part's XML from my system:
    
    
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
      <Template>Normal.dotm</Template>
      <TotalTime>3</TotalTime>
      <Pages>1</Pages>
      <Words>2</Words>
      <Characters>13</Characters>
      <Application>Microsoft Office Word</Application>
      <DocSecurity>0</DocSecurity>
      <Lines>1</Lines>
      <Paragraphs>1</Paragraphs>
      <ScaleCrop>false</ScaleCrop>
      <HeadingPairs>
        <vt:vector size="2" baseType="variant">
          <vt:variant>
            <vt:lpstr>Title</vt:lpstr>
          </vt:variant>
          <vt:variant>
            <vt:i4>1</vt:i4>
          </vt:variant>
        </vt:vector>
      </HeadingPairs>
      <TitlesOfParts>
        <vt:vector size="1" baseType="lpstr">
          <vt:lpstr></vt:lpstr>
        </vt:vector>
      </TitlesOfParts>
      <Company></Company>
      <LinksUpToDate>false</LinksUpToDate>
      <CharactersWithSpaces>14</CharactersWithSpaces>
      <SharedDoc>false</SharedDoc>
      <HyperlinksChanged>false</HyperlinksChanged>
      <AppVersion>14.0000</AppVersion>
    </Properties>
    
    If you look in the document.xml part (word/document.xml), you will see only one text string, "TEST TEST TEST", so the Words value in the extended properties is evidently not counting that text the way you expect; some other text is being treated as a countable 'word'. Consider unpacking your test document to see which text strings are in the document.xml part and which values are in the app.xml part.
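
    One way to do that inspection without unzipping by hand is sketched below. It is only a sketch: it assumes the usual part names that Word writes (docProps/app.xml and word/document.xml), and the class and method names (DocxInspector, ReadAppProperty, ReadTextRuns) are made up for illustration. It uses only System.IO.Compression and LINQ to XML rather than the Open XML SDK.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
static class DocxInspector
{
    // XML namespaces of the extended properties part and the main document part.
    static readonly XNamespace Ext =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of one child of <Properties> in docProps/app.xml
    // (for example "Words" or "Characters"), or null if it is absent.
    public static string ReadAppProperty(ZipArchive package, string name)
    {
        ZipArchiveEntry entry = package.GetEntry("docProps/app.xml"); // assumed part name
        if (entry == null) return null;
        using (Stream s = entry.Open())
        {
            XElement root = XDocument.Load(s).Root;
            return root?.Element(Ext + name)?.Value;
        }
    }

    // Returns the content of every w:t element in word/document.xml, in document order.
    public static string[] ReadTextRuns(ZipArchive package)
    {
        ZipArchiveEntry entry = package.GetEntry("word/document.xml"); // assumed part name
        if (entry == null) return new string[0];
        using (Stream s = entry.Open())
        {
            return XDocument.Load(s).Descendants(W + "t").Select(t => t.Value).ToArray();
        }
    }

    static void Main(string[] args)
    {
        // args[0] is the path to the .docx file to inspect.
        using (var package = new ZipArchive(File.OpenRead(args[0]), ZipArchiveMode.Read))
        {
            Console.WriteLine("Words: " + ReadAppProperty(package, "Words"));
            Console.WriteLine("Characters: " + ReadAppProperty(package, "Characters"));
            foreach (string run in ReadTextRuns(package))
                Console.WriteLine("[" + run + "]"); // brackets make trailing spaces visible
        }
    }
}
```

    Printing each run between brackets makes it easier to spot trailing spaces or runs that contain nothing but white space.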

    Best Regards, Chris Jensen

  • Wednesday, March 27, 2013 1:05 PM
     
     
    Sorry for the huge delay. cjatms, I don't agree with your result either. In that case, the correct output would be 3, since the document contains three words followed by some empty characters.

    "Racing, competing, is in my blood, is part of me, is part of my life" - Ayrton Senna da Silva

  • Thursday, March 28, 2013 11:45 AM
    Moderator
     
     

    Hi Bruno

    Word uses an algorithm to define what a "word" is. You haven't given us any detail about what the "further" or "empty" characters actually are, such as their ANSI (character) codes, which makes it difficult for us to give a reasonable response to your complaint.
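
    As an illustration of the kind of detail that would help, here is a minimal sketch that lists the code of every character in a string. The class name CharDump and the sample text are hypothetical; in practice the string would be the text taken from the document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for listing character codes.
static class CharDump
{
    // Describes each UTF-16 code unit as "U+XXXX kind", where kind says
    // whether .NET classifies it as white space, a control character, or other.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        foreach (char c in text)
        {
            string kind = char.IsWhiteSpace(c) ? "white space"
                        : char.IsControl(c) ? "control"
                        : "other";
            lines.Add(string.Format("U+{0:X4} {1}", (int)c, kind));
        }
        return lines;
    }

    static void Main()
    {
        // Made-up sample: three words followed by a space,
        // a non-breaking space (U+00A0) and a tab.
        foreach (string line in Describe("TEST TEST TEST \u00A0\t"))
            Console.WriteLine(line);
    }
}
```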

    One thing you should certainly do is compare the number of words Word reports in its UI with the number Open XML gives you. If they are the same, then the problem is with your input. If they are different, then we have to take another look...
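
    For a third point of comparison, a naive count can be computed directly from the text. The sketch below simply counts runs of non-white-space characters; that rule is an assumption made for illustration and is not Word's own algorithm, so the two may disagree on punctuation, hyphens, or special characters. The class name NaiveWordCount is hypothetical.

```csharp
using System;

// Hypothetical helper; a simple white-space rule, not Word's algorithm.
static class NaiveWordCount
{
    // Counts maximal runs of non-white-space characters.
    // Passing a null separator array makes Split use white space as the delimiter.
    public static int Count(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    static void Main()
    {
        // The text from the original question: three words and trailing spaces.
        Console.WriteLine(Count("TEST TEST TEST    ")); // prints 3
    }
}
```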


    Cindy Meister, VSTO/Word MVP, my blog