Hello - Interop.Word.Document's attribute WordOpenXML is slow for large documents...

  • Question

  • Hello

    I am trying to read the current active document's WordOpenXML.

    The problem is that when I read the WordOpenXML property of a large document [Interop.Word.Document.WordOpenXML], Word hangs for several seconds, even though the read is done on another thread. (Of course, a separate thread does not help for this particular property: WordOpenXML lives within the application, which is reachable from "all" threads, so every thread that uses the property hangs, sort of...) "Locked" was the word I was looking for, thanks Cindy. :)

    The document saved on disk takes 5 MB.

    The WordOpenXML string that is returned, stored on disk in a normal .txt file, takes 37 MB.

    I do know Word compresses the files within the document.docx (the XML files are compressed), but from 37 MB to 5 MB? Hm...

    Anyway, is there another way of reading the whole content of the current ActiveDocument that does not freeze Word for as long as it takes to read the WordOpenXML property? It can take 20-30 seconds for all I care, as long as I can read the whole content without the application being frozen, so the user can continue to work while the search is running...

    Using Word 2007 and docx files.

    Suggestions are more than welcome!

    PS: How in the world can I find "Word for developers" forum, I needed to google to come into the right spot. Clicking "Forums" on this page, gives me a result of most recent posts in any forum and not an overview of all forums I can enter... Clicking...

    Edit: Also, the text "Word is publishing <document.docx> page 1 of N" shows up and counts upwards while I grab the WordOpenXML property... Strange? No? I guess that is normal when reading WordOpenXML: the document is saved, then grabbed...





    • Edited by colaohye Monday, July 8, 2013 10:11 AM typo
    Friday, July 5, 2013 12:31 PM

Answers

  • <<PS: Tested WordOpenXML with 500 pages of pure text, =rand(10,10), and it was fast. Tested again by adding a large image to the content, which of course is the "only" reason a document ever can be 5MB when saved to disc... And it hangs... So no function/property that just grabs the plain text (ignoring headers/footer)? :)>>

    WordOpenXML will always return valid Open XML for the given range. If you use it on Document.Content then, yes, you'll get the headers, footers and all sorts of other stuff, in the flat OPC file format. You could try doing something like this (pseudo-code, off the top of my head!) to drop the last paragraph mark, which drops the sectPr element (the section properties that reference the headers/footers) for a one-section document:

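    ' Assumes Doc is the Word.Document being read (e.g. the active document)
    Dim rng As Word.Range = Doc.Content
    ' Pull the end back one character to exclude the final paragraph mark
    rng.MoveEnd(Unit:=Word.WdUnit.wdCharacter, Count:=-1)
    Dim xml As String = rng.WordOpenXML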

    If the document has more than one section, then you'd probably have to loop the sections, dropping the section breaks.
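
If the images are the real cost, the binary parts can also be dropped from the returned string afterwards. This does not shorten the time Word needs to produce the string; it only trims what the search then has to hold and scan. Below is a minimal Python sketch, assuming the string follows the usual flat OPC layout (a `pkg:package` root whose `pkg:part` children carry either `pkg:xmlData` or `pkg:binaryData`). The function name and the sample package are made up for illustration.

```python
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
ET.register_namespace("pkg", PKG_NS)  # keep the pkg: prefix when serializing

def drop_binary_parts(flat_opc_xml: str) -> str:
    """Return the flat OPC XML without parts that hold base64 binary data."""
    root = ET.fromstring(flat_opc_xml)
    for part in list(root.findall(f"{{{PKG_NS}}}part")):
        # Images and other binaries are stored in a pkg:binaryData child.
        if part.find(f"{{{PKG_NS}}}binaryData") is not None:
            root.remove(part)
    return ET.tostring(root, encoding="unicode")

# Hypothetical two-part package: one XML part, one (truncated) image part.
sample = (
    f'<pkg:package xmlns:pkg="{PKG_NS}">'
    '<pkg:part pkg:name="/word/document.xml" pkg:contentType="application/xml">'
    "<pkg:xmlData><body>Hello</body></pkg:xmlData></pkg:part>"
    '<pkg:part pkg:name="/word/media/image1.png" pkg:contentType="image/png">'
    "<pkg:binaryData>iVBORw0KGgo=</pkg:binaryData></pkg:part>"
    "</pkg:package>"
)
trimmed = drop_binary_parts(sample)
print(trimmed)
```

The trimmed string still contains relationships that point at the removed media, so it is only suitable for searching text, not for inserting back into Word.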


    Cindy Meister, VSTO/Word MVP, my blog

    • Marked as answer by colaohye Monday, July 8, 2013 10:06 AM
    Monday, July 8, 2013 8:36 AM
    Moderator

All replies

  • Hi colaohye

    Would you have access to Word 2010 (or 2013), to run some tests? The 2007 version was a "transition" version: the original release of Word 2007 worked with the old binary format in the background. Later, the application code was changed to work with the new file format. It's possible, therefore, that you'd see better results in a later version, where Word Open XML is the "native" file format and there's not a lot of "conversion" going on in the background.

    That's just a guess, however. In any case, there will be a conversion, going from the Zip to the OPC flat file format, so it might not make that big a difference...
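
The Zip-to-flat-file conversion may also account for part of the 5 MB versus 37 MB gap in the question: in the flat OPC format, binary parts such as images are written as base64 text, and the XML is no longer compressed. Below is a small Python sketch with made-up data (random bytes standing in for an already-compressed image, repeated markup standing in for document.xml). The numbers are illustrative only and are not a measurement of any real Word document.

```python
import base64
import os
import zlib

def size_report(binary_part: bytes, xml_part: bytes) -> dict:
    """Compare the weight of one binary part and one XML part in each container."""
    return {
        # Zip package: a compressed image barely shrinks further, the XML is deflated.
        "zip_binary": len(binary_part),
        "zip_xml": len(zlib.compress(xml_part)),
        # Flat OPC: the image becomes base64 text, the XML is stored uncompressed.
        "flat_binary": len(base64.b64encode(binary_part)),
        "flat_xml": len(xml_part),
    }

# Hypothetical stand-ins for an image part and the main document part.
image = os.urandom(300_000)
markup = b"<w:p><w:r><w:t>Lorem ipsum</w:t></w:r></w:p>" * 5_000
report = size_report(image, markup)
print(report)
```

Base64 alone adds about a third to every binary part, and the repetitive XML loses the large compression gain it had inside the zip.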

    About the only alternative that occurs to me would be to try something like:

    - copy/paste the content to another (new, not-visible) document and pick up the Word Open XML from that. Note that you'd want to use Select All, copy in order to pick up the last paragraph mark.

    - close the document, make a copy, re-open the document and work with the copy you made.
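
For the second option, the copy on disk can be read without involving Word at all, since a .docx is a zip package. Below is a minimal Python sketch, assuming the copy is a regular .docx whose main part is word/document.xml. The function name and the tiny in-memory package are hypothetical, and the copy only reflects what was last saved.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_paragraphs(source):
    """Return the text of every w:p in the main document part of a .docx.

    `source` is a path or a binary file object. Images live in separate
    package parts (word/media/...), so they are never loaded here.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join the w:t text runs; drawings and other markup are skipped.
        paragraphs.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return paragraphs

# Hypothetical stand-in for the copied file: a tiny in-memory package.
body = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>First paragraph</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
fake_copy = io.BytesIO()
with zipfile.ZipFile(fake_copy, "w") as package:
    package.writestr("word/document.xml", body)
    package.writestr("word/media/image1.png", b"\x89PNG not a real image")

print(docx_paragraphs(fake_copy))  # ['First paragraph', 'Second paragraph']
```

Headers and footers sit in their own parts (typically word/header1.xml, word/footer1.xml and so on) and could be read the same way. Paragraphs nested inside text boxes would be reported twice by this simple loop.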

    Since Word is doing a conversion of the document content, it must lock the document during the process. It would mess things up quite a bit, if the user were allowed to edit while Word processes the content!

    <<PS: How in the world can I find "Word for developers" forum, I needed to google to come into the right spot. Clicking "Forums" on this page, gives me a result of most recent posts in any forum and not an overview of all forums I can enter... Clicking...>>

    There are a few possibilities. One is, while you're looking at the list of messages, at the top right you'll see an arrow (not the one next to Quick Access, the one UNDER Quick Access, next to the first message in the list). This will display a pane at the left of the message list with a list of forums. You want the topic "Microsoft Office for Developers". Uncheck "Select All" and check "Word".

    I never use this, however. Instead, I have a link in my "Favorites" (call it a bookmark) to the forum. I've always used links to open the forums I moderate, but I ran into the same issue as you when I wanted to view another forum, recently, which is how I found out about the "pain pane"! My link:

    http://social.msdn.microsoft.com/Forums/office/en-US/home?forum=worddev

    Another way to get there is to find a message in the list that "lives" in the forum you want to go to. There will be a "breadcrumb" telling you the forum hierarchy for that message. Those "breadcrumbs" are links, so you can click "Word for Developers" to get here (takes you to the same place as my link, above).


    Cindy Meister, VSTO/Word MVP, my blog

    Friday, July 5, 2013 3:03 PM
    Moderator
  • >>>"(not the one next to Quick Access, the one UNDER Quick Access,"

    Nice, designers going crazy... Thanks though! :)

    I have thought about copying the content, but this will replace the clipboard... This process is running in the background as long as a type of document is active (type, special template). Would have to paste content on the clipboard first to a new hidden document, then back to the main document copy and paste it to another hidden document, then... Yes, hm...

    PS: Tested WordOpenXML with 500 pages of pure text, =rand(10,10), and it was fast. Tested again by adding a large image to the content, which of course is the "only" reason a document ever can be 5MB when saved to disc... And it hangs... So no function/property that just grabs the plain text (ignoring headers/footer)? :)

    Ok, hopefully it's a bit better in 2010/2013 then... The company is upgrading to 2010 within the year, hopefully...

    Thanks! :)

    Monday, July 8, 2013 7:54 AM
  • My mistake: I should not ignore headers and footers, but ignore images in the content, headers and footers. :)

    Solution for now: set a limit on the file size for my process, so it only runs on documents smaller than 1.5 MB, which should be 95% of all documents... and the freeze is "gone": it lasts 0.5-1.0 seconds on my computer for a document of about 1.5 MB...
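
A size gate like this can be sketched in a few lines. The Python below is only an illustration of the check, with a hypothetical function name and throwaway files. It assumes the document has been saved, since the size on disk reflects the last save rather than unsaved edits.

```python
import os
import tempfile

LIMIT_BYTES = int(1.5 * 1024 * 1024)  # the 1.5 MB cut-off mentioned in the post

def small_enough(path, limit=LIMIT_BYTES):
    """True when the saved file is below the limit and worth scanning."""
    return os.path.getsize(path) < limit

# Throwaway files standing in for a small and a large document.
with tempfile.TemporaryDirectory() as folder:
    small = os.path.join(folder, "small.docx")
    large = os.path.join(folder, "large.docx")
    with open(small, "wb") as f:
        f.write(b"\0" * 1_000_000)
    with open(large, "wb") as f:
        f.write(b"\0" * 5_000_000)
    results = (small_enough(small), small_enough(large))

print(results)  # (True, False)
```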

    Thanks, I'll experiment a bit with the Range object... seems like a good idea. :)

    Thanks again.



    • Edited by colaohye Monday, July 8, 2013 10:08 AM typo
    Monday, July 8, 2013 10:03 AM