Getting formatting from Word documents using C# interop

    Question

  • I am using Word Interop and C# to build a program at work, and one of its features is getting a word count.

    Now, this can't be Word's own word count, as I need to emulate the word count of a CAT tool used at work.

    One of the issues I found is that the CAT tool uses text formatting to split up words. This means that if I have the word 1st with st superscripted, Word will count one word (as there is nothing separating the two), while the CAT tool counts two words because of the change in text format.

    The thing is, the CAT tool keeps track of the format changes, and that information breaks the word.

    So, I could go word by word, character by character, and check all possibilities (font, bold, italic, etc.), but that would be really slow when working with multiple documents, each with 1000s of words.

    Does anyone know a better solution?


    Luís Rodrigues
    Thursday, January 05, 2012 6:25 PM

All replies

  • Hi Luis

    I'm not sure I understand exactly how you need to proceed with the Word document content... You need to pick up each change in formatting?

    Which Word version are we dealing with here? Mainly, I'm concerned with the file format in which these documents have been saved... *.doc or *.docx/*.docm?


    Cindy Meister, VSTO/Word MVP
    Friday, January 06, 2012 8:07 AM
    Moderator
  • Hi Cindy

    Thanks for replying

    Basically, those documents arrive from the customer, so they can be in any number of formats (.doc, .docx, .rtf).

    Probably not .docm, although I wouldn't completely rule it out; it has never happened so far, though.

    Basically, what I want is to count words, but a word wouldn't just be anything separated by spaces; it would also be anything separated by formatting.

    So, as per my example, 1st with st superscripted would need to be two words, or be converted in plain text to 1 st (with a space in between).

    I am trying to mimic the word count of a CAT tool used at work, which replaces any change in format with tags so it can convert back correctly after translation, but in it the tags end up splitting words too.


    Luís Rodrigues
    • Edited by 537mfb Friday, January 06, 2012 9:36 AM
    Friday, January 06, 2012 9:35 AM
  • Hi Luis

    OK, so this means you will need to open the files in the Word application. But I suppose you can choose the version of Word you use to open them? Can you open them in 2010 or 2007?

    If yes, then the approach I'd look at would be to use Document.Content.WordOpenXML to extract the content into a string. The content will be in the Office Open XML "flat package" format, meaning it should contain everything.

    You should then be able to "parse" the string to get the information you need.

    If you look at such a string, you should see that all the text is in <w:t> elements. If there's formatting, then it will break the <w:t> into parts - one part for each formatting change. So all that you'd need to do in addition to extracting all the w:t elements would be to check for the punctuation and spaces that otherwise delineate "words" in the text.
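
    The idea above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name RunWordCounter and the separator list are hypothetical, and it assumes that every run (w:r) boundary counts as a word break and that only whitespace separates words inside a run. What the CAT tool really treats as a separator (punctuation, for instance) would still have to be worked out.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RunWordCounter
{
    // WordprocessingML main namespace used by w:body, w:r and w:t.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumption: only these characters separate words inside a run.
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n', '\u00A0' };

    // Counts words in a WordOpenXML string, treating every run (w:r)
    // boundary as a word break in addition to whitespace.
    public static int CountWords(string wordOpenXml)
    {
        XDocument doc = XDocument.Parse(wordOpenXml);

        int count = 0;
        // Restrict the search to the main document body so that other
        // parts of the flat package are not visited.
        foreach (XElement run in doc.Descendants(W + "body").Descendants(W + "r"))
        {
            // A run may hold several w:t elements; join them first.
            string text = string.Concat(run.Elements(W + "t").Select(t => t.Value));
            count += text.Split(Separators, StringSplitOptions.RemoveEmptyEntries).Length;
        }
        return count;
    }
}
```

    With interop, the input string would come from something like document.Content.WordOpenXML. For a paragraph holding 1 followed by a superscripted st, this sketch yields 2, where Word itself reports 1.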


    Cindy Meister, VSTO/Word MVP
    • Marked as answer by 537mfb Friday, January 06, 2012 3:29 PM
    Friday, January 06, 2012 1:02 PM
    Moderator
  • Hi Cindy

    Thanks for the reply

    We use Office 2007 at work (sorry for not mentioning it).

    I have actually been experimenting with exactly that since this morning (it is 15:30 here now), but using Range.get_XML() instead.

    It gives me what I want, but there is some issue grabbing the text for part of my test document (I haven't exactly figured out what's so special about that text). get_XML takes longer to process and in some cases even throws an exception (invalid operation with end of line, or something like that).

    I will try Content.WordOpenXML instead and see if that fixes it all.

    Once again, thanks for replying

    Luis Rodrigues


    Luís Rodrigues
    Friday, January 06, 2012 3:34 PM
  • Hi Luis

    I almost suggested get_XML, but remembered in time that it won't give you anything new in Word 2007 that's not part of Word 2003. So if there were any text in content controls, for example, you'd not pick that up.

    No idea if this is also the reason behind the exceptions you're seeing...


    Cindy Meister, VSTO/Word MVP
    Friday, January 06, 2012 4:23 PM
    Moderator
  • Hi Cindy

    I don't know what that exception and delay were about with get_XML, but your suggestion to use WordOpenXML was spot on.

    I get everything I need, although it took me a while to wrap my head around the format itself.

    I had weird things, like words being split into 2 ranges for no apparent reason (no change in format or paragraph).

    But I have that sorted out and am moving along to figure out what that CAT tool sees as a word or not. At least now I have all the data I will need.

    I just need to understand it all.
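
    For anyone hitting the same "split for no apparent reason" issue, one possible way to handle it is to merge neighbouring runs whose run properties (w:rPr) are identical, so that only a real change in formatting breaks a word. The sketch below is a guess at that approach, not the code actually used here: the class name MergedRunWordCounter is hypothetical, and it assumes whitespace is the only other separator. A likely cause of such splits is that Word starts a new run for bookkeeping such as revision ids, but that is an assumption too.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class MergedRunWordCounter
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumption: only these characters separate words inside a segment.
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n', '\u00A0' };

    // Returns the text segments of one paragraph. A new segment starts only
    // when a run's w:rPr differs from the previous run's w:rPr.
    public static List<string> SegmentsByFormatting(XElement paragraph)
    {
        var segments = new List<string>();
        var current = new StringBuilder();
        XElement previousProps = null;
        bool first = true;

        // Descendants also reaches runs nested in hyperlinks or content
        // controls. Caveat: runs inside text boxes would be seen twice.
        foreach (XElement run in paragraph.Descendants(W + "r"))
        {
            XElement props = run.Element(W + "rPr");
            if (!first && !XNode.DeepEquals(previousProps, props))
            {
                segments.Add(current.ToString());
                current.Clear();
            }
            foreach (XElement t in run.Elements(W + "t"))
                current.Append(t.Value);
            previousProps = props;
            first = false;
        }
        if (!first) segments.Add(current.ToString());
        return segments;
    }

    public static int CountWords(string wordOpenXml)
    {
        // PreserveWhitespace keeps runs that contain nothing but a space.
        XDocument doc = XDocument.Parse(wordOpenXml, LoadOptions.PreserveWhitespace);
        return doc.Descendants(W + "body")
                  .Descendants(W + "p")
                  .SelectMany(SegmentsByFormatting)
                  .Sum(s => s.Split(Separators, StringSplitOptions.RemoveEmptyEntries).Length);
    }
}
```

    The comparison looks only at w:rPr, so attributes on the run element itself do not cause a split. Paragraph boundaries always end a segment, because each paragraph is processed on its own.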

    Thanks a million


    Luís Rodrigues
    Tuesday, January 10, 2012 11:00 AM