How to read large Word document tables very fast

  • Question

  • Hi,

    I am creating a Word add-in that has to read a large number of tables (around 500 tables, each with 50 rows and 50 columns). Basically, I have to create an XML document containing the data from all the tables. I have tried these two approaches.

    • Currently I read the tables sequentially, iterating over each table and then over each row, one by one. But this approach takes a long time to parse and read each table.

      

    • I was trying a second approach: on the first pass, I create the XML for each table and store it in that table as hidden text. When the user opens the document again and clicks the add-in button, I don't have to parse each table again; I only need to read the already-created XML from the hidden text and dump it into an XML document. The problem is: what if the user edits a table in the document? If there were an event that let me capture what the user has changed, it would be easy to go directly to that table and create its XML again.

    I find the second approach quite good, but it has the limitation described above. Please help me.

    Or is there any other approach I can follow to get rid of the performance issue?

    thanks

    shashank 

    Saturday, June 16, 2012 9:10 PM

Answers

  • Are you using Word Automation to read the tables? Did you try the Microsoft Open XML SDK? The limitation of the Open XML SDK is that you can only use it against .docx files; older Word formats are not compatible. For creating or reading large documents, the Open XML SDK is pretty quick. Here is the MSDN link:

    MSDN LINK


    --Krishna

    • Marked as answer by shashank Tuesday, June 19, 2012 9:40 AM
    • Unmarked as answer by shashank Wednesday, July 18, 2012 7:43 AM
    • Marked as answer by shashank Friday, July 20, 2012 1:17 PM
    Sunday, June 17, 2012 5:17 AM
  • Open XML would be a good option for this requirement. In particular, the Open XML SDK is good at addressing concerns about performance and memory usage. This article shows how to use APIs like OpenXmlReader to parse and read large Excel files quickly; you could write your own solution for Word along the same lines: http://blogs.msdn.com/b/brian_jones/archive/2010/05/27/parsing-and-reading-large-excel-files-with-the-open-xml-sdk.aspx
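
    The forward-only reading idea can be sketched without the SDK. The following is not OpenXmlReader; it is an analogous approach using `iterparse` from Python's standard library, shown only to illustrate the technique. It assumes the input is WordprocessingML (for example the main document part) and that tables are not nested; `iter_tables` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by w:tbl, w:tr, w:tc and w:t.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TBL, TR, TC, T = (f"{{{W}}}{name}" for name in ("tbl", "tr", "tc", "t"))


def iter_tables(source):
    """Yield each table as a list of rows of cell text, one table at a time.

    Assumption: `source` is a binary file-like object (or a path) containing
    WordprocessingML, and no table is nested inside another table.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == TBL:
            yield [
                ["".join(t.text or "" for t in tc.iter(T)) for tc in tr.findall(TC)]
                for tr in elem.findall(TR)
            ]
            # Drop the finished table's children so memory use stays flat.
            elem.clear()


if __name__ == "__main__":
    sample = (
        f'<w:document xmlns:w="{W}"><w:body><w:tbl><w:tr>'
        "<w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>"
        "<w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc>"
        "</w:tr></w:tbl></w:body></w:document>"
    )
    for table in iter_tables(io.BytesIO(sample.encode("utf-8"))):
        print(table)
```

    Because each table is handed over and then cleared, only one table's elements are populated at any moment, which is the property that matters for very large documents.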

    As to the second option you mentioned, I don't think there's a specific event for changes to table content, so it's not flexible enough to handle user edits.

    thanks,


    Forrest Guo | MSDN Community Support | Feedback to manager

    • Marked as answer by shashank Tuesday, June 19, 2012 9:40 AM
    • Unmarked as answer by shashank Wednesday, July 18, 2012 7:43 AM
    • Marked as answer by shashank Friday, July 20, 2012 1:17 PM
    Tuesday, June 19, 2012 7:23 AM
    Moderator
  • Hi Shashank

    I support what Krishna suggests: your best bet is going to be to work with Word's native Open XML file format. This can be done using the Open XML SDK or, since your project is an add-in and the file is already open in Word, you can work with standard XML tools alone.

    In the open document you can extract the XML for a particular range of text using either the Range.XML or the Range.WordOpenXML property. The XML property returns the Word 2003 XML, which is less verbose, but doesn't support anything introduced in later versions of Word. It's also not as well documented.

    The Range.WordOpenXML property returns the full representation of the Range as a valid document in Open XML. The result is an Open XML file package in the "flat file" format. This means that, instead of separate XML files ("parts") in a ZIP package - which is how a document file is stored - you have one single XML that contains all the "parts" in one string. The resulting XML is complex and verbose compared to the result of the Range.XML property. But extracting the table data should be pretty much the same in both cases.

    So what you want is something like string tableXML = TableObject.Range.WordOpenXML; (or .XML). Then you can load that string into a standard XmlDocument, XmlReader or whatever you want to use and process it (transform?) to produce your XML.
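
    A minimal sketch of the "load the string and process it" step, written in Python for brevity; the same steps map onto System.Xml in a .NET add-in. It assumes the string is the flat-file package that Range.WordOpenXML returns for one table. `table_rows` and the `SAMPLE` string are hypothetical; a real package contains more parts and far more markup.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace; the flat-file package wraps it in pkg:* elements.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def table_rows(table_xml: str) -> list[list[str]]:
    """Return the cell text of the first w:tbl found in the XML string."""
    root = ET.fromstring(table_xml)
    tbl = root.find(f".//{{{W}}}tbl")
    if tbl is None:
        return []
    rows = []
    for tr in tbl.findall("w:tr", NS):  # direct child rows only
        row = []
        for tc in tr.findall("w:tc", NS):  # direct child cells only
            # A cell holds paragraphs and runs; join every w:t text node.
            row.append("".join(t.text or "" for t in tc.iter(f"{{{W}}}t")))
        rows.append(row)
    return rows


# Hand-written stand-in for the string returned by Range.WordOpenXML.
SAMPLE = (
    '<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">'
    '<pkg:part pkg:name="/word/document.xml"><pkg:xmlData>'
    f'<w:document xmlns:w="{W}"><w:body><w:tbl>'
    "<w:tr><w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc></w:tr>"
    "</w:tbl></w:body></w:document>"
    "</pkg:xmlData></pkg:part></pkg:package>"
)

if __name__ == "__main__":
    print(table_rows(SAMPLE))
```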

    If the only purpose of your add-in (or this functionality in your add-in) is to extract the table information, note that you can also work with the closed file - it needn't be opened in Word in order to get at this information.
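
    Reading the closed file could look roughly like this. The sketch uses only Python's standard library rather than the Open XML SDK, and it assumes the main document part is stored under the conventional name word/document.xml inside the ZIP package. `read_tables` is a hypothetical helper; nested tables would be reported both on their own and as part of the enclosing cell's text.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TBL, TR, TC, T = (f"{{{W}}}{name}" for name in ("tbl", "tr", "tc", "t"))


def read_tables(docx):
    """Return every table in a .docx as rows of cell text.

    `docx` may be a file path or a binary file-like object. Word does not
    need to be running, and the document is never opened in Word.
    """
    with zipfile.ZipFile(docx) as package:
        # Assumption: the main part uses the conventional name.
        root = ET.fromstring(package.read("word/document.xml"))
    return [
        [
            ["".join(t.text or "" for t in tc.iter(T)) for tc in tr.findall(TC)]
            for tr in tbl.findall(TR)
        ]
        for tbl in root.iter(TBL)
    ]


if __name__ == "__main__":
    # Build a tiny stand-in package in memory so the sketch runs anywhere.
    body = (
        f'<w:document xmlns:w="{W}"><w:body><w:tbl><w:tr>'
        "<w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>"
        "</w:tr></w:tbl></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", body)
    buffer.seek(0)
    print(read_tables(buffer))
```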

    As your first stop for learning about working with Open XML file formats, I recommend the OpenXMLDeveloper.org site and its forums. There's also a forum on MSDN if you decide to work with the SDK: http://social.msdn.microsoft.com/Forums/en-US/oxmlsdk/threads


    Cindy Meister, VSTO/Word MVP

    • Marked as answer by shashank Friday, July 20, 2012 1:17 PM
    Wednesday, July 18, 2012 12:41 PM
    Moderator

All replies

  • Hi Krishna,

    Thanks for the prompt reply. Yes, I am using Word automation to read the tables. It would be nice if you could guide me through that approach, because it is very difficult for us to switch to the Open XML SDK.

    thanks

    shashank


    shashank

    Sunday, June 17, 2012 8:14 AM
  • Thanks, Krishna and Guo.

    It would be nice if you could point me to similar samples for Word documents, or any sample to start with.

    Thanks a lot again.


    shashank

    Tuesday, June 19, 2012 9:47 AM
  • Can anybody suggest another good algorithm?

    thanks


    shashank

    Wednesday, July 18, 2012 7:44 AM
  • Hi Cindy, all,

    Thank you so much. So you are saying that I will have to extract the XML or WordOpenXML, then do an XSL transformation to get only the required data. That is fine.

    what would you say about performance while approaching this algorithm. because i will have to  iterate on each table and extract xml or openxml and do XSLT  and moving to next table. i know this is my homework, but i just want a fair idea . Will this algorithm lead my algorithm ?  

    thanks


    shashank

    Friday, July 20, 2012 5:45 AM
  • >>what would you say about performance while approaching this algorithm. because i will have to  iterate on each table and extract xml or openxml and do XSLT  and moving to next table. i know this is my homework, but i just want a fair idea . Will this algorithm lead my algorithm ?<<

    I'm afraid I don't understand the question...

    However, the quickest approach would be to get the entire XML from the document and then work with that, looping through or identifying the tables. It's certainly faster because you make only one call to the Word API. To get the entire XML / WordOpenXML for the document: Document.Content.WordOpenXML
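
    As a rough sketch of that single-call idea: once the whole document's XML is in one string, every table can be found and written to the target XML in one pass, with no further calls into Word. The code is Python for brevity, and the output element names (tables, table, row, cell) are made up for illustration, as is the helper name `tables_to_xml`.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TBL, TR, TC, T = (f"{{{W}}}{name}" for name in ("tbl", "tr", "tc", "t"))


def tables_to_xml(document_xml: str) -> str:
    """Convert every table in the document XML into one small XML string.

    Assumption: `document_xml` is the string obtained once from
    Document.Content.WordOpenXML (or any WordprocessingML document).
    """
    root = ET.fromstring(document_xml)
    out = ET.Element("tables")  # hypothetical output vocabulary
    # iter() walks the tree in document order, so tables keep their order.
    for index, tbl in enumerate(root.iter(TBL), start=1):
        table_el = ET.SubElement(out, "table", index=str(index))
        for tr in tbl.findall(TR):
            row_el = ET.SubElement(table_el, "row")
            for tc in tr.findall(TC):
                cell_el = ET.SubElement(row_el, "cell")
                cell_el.text = "".join(t.text or "" for t in tc.iter(T))
    return ET.tostring(out, encoding="unicode")


if __name__ == "__main__":
    sample = (
        f'<w:document xmlns:w="{W}"><w:body>'
        "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
        "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
        "</w:body></w:document>"
    )
    print(tables_to_xml(sample))
```

    The parsing cost still grows with document size, but it is paid in memory on a plain string instead of through thousands of per-cell calls to the Word object model.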


    Cindy Meister, VSTO/Word MVP

    Friday, July 20, 2012 1:00 PM
    Moderator
  • Thanks Krishna, Cindy and Guo. Open XML is the best technique to achieve this.

    shashank

    Friday, July 20, 2012 1:16 PM