Import Word tables using C#

  • Question

  • Hi all.

    I have a Word 2003 document with a lot of text and a lot of tables. I need to import some of that into Excel. So far I have been using the interop, but it takes too long. I think the problem is that it switches between Word and C# for every cell in the table. I have tried copying the whole Word document to a string, which is super fast, but it is a bit chaotic to search the string and export the things I want.

    Then I thought of using Open XML to open the .doc file and save it as a .docx file, and I hoped that the Open XML format would have some useful functions for this?

    Regards.

    Thursday, January 5, 2012 12:26 PM

All replies

  • Here are all my different tries..

    First attempt:

    Microsoft.Office.Interop.Word._Application wApp = new Microsoft.Office.Interop.Word.Application();
    _Document wd = wApp.Documents.Open(filePatch);
    foreach (Table tb in wd.Tables)
    {
        // Look through every cell.
    }

    Second attempt: I hoped that this would store one table in a variable, look through every cell, and then store the next table in the variable. However, this method is just as slow as my first attempt.

    Microsoft.Office.Interop.Word._Application wApp = new Microsoft.Office.Interop.Word.Application();
    _Document wd = wApp.Documents.Open(filePatch);
    // Word collections are 1-based, so the last valid index is Count (hence <=).
    for (int j = 1; j <= wd.Tables.Count; j++)
    {
        Microsoft.Office.Interop.Word.Table tb = wd.Tables[j];
        // Look through every cell.
    }


    And my final attempt: it stores the whole Word file in a string. This is really fast, but I can't find a way to sort it so that I only get the "cells" that I want.

    Microsoft.Office.Interop.Word._Application wApp = new Microsoft.Office.Interop.Word.Application();
    _Document wd = wApp.Documents.Open(filePatch);
    string allText;
    wd.ActiveWindow.Selection.WholeStory();
    wd.ActiveWindow.Selection.Copy();
    IDataObject data = Clipboard.GetDataObject();
    allText = data.GetData(DataFormats.Text).ToString();

     

    I really hope you can help me.

     

    • Edited by the_julle Thursday, January 5, 2012 2:32 PM
    Thursday, January 5, 2012 2:28 PM
  • Hi Julie

    Open XML can't work with *.doc files. That's the old binary file format. Open XML is a completely different file format (a zipped package of XML files).

    The most efficient method, using the Interop, would probably be to extract the Word table information into an array, then assign the array to a Range in Excel. That's usually pretty fast on the Excel side.

    A way to speed things up on the Word side is to use the table's ConvertToText method to convert the table to a delimited string (think of the content of a *.csv file for an example). Read that string into memory and convert it to an array. The only potential problem is cells that contain paragraph marks, as Word also uses paragraph marks as the record (row) delimiters. So you'd need to replace the paragraph marks inside cells with a different character before converting the table to text. And before you dump it in Excel you'd need to change that character to the new line character Excel uses (Char 10, as I recall).
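
    As a minimal sketch of the "delimited string to array" step (not code from this thread): assuming the table text has already been read into a string, with rows separated by paragraph marks ('\r') and cells by a separator of your choice, it could be turned into a rectangular array like this. The class name DelimitedTable and the method name ToGrid are made up for illustration.

```csharp
using System;
using System.Linq;

public static class DelimitedTable
{
    // Splits text such as "a\tb\rc\td\r" into a rectangular object[,] grid.
    // Assumption: '\r' separates rows and 'separator' separates cells.
    // Completely empty lines are dropped; short rows are padded with "".
    public static object[,] ToGrid(string text, char separator)
    {
        string[][] rows = text
            .Split(new[] { '\r' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Split(separator))
            .ToArray();

        int columnCount = rows.Length == 0 ? 0 : rows.Max(r => r.Length);
        var grid = new object[rows.Length, columnCount];

        for (int r = 0; r < rows.Length; r++)
        {
            for (int c = 0; c < columnCount; c++)
            {
                grid[r, c] = c < rows[r].Length ? rows[r][c] : string.Empty;
            }
        }
        return grid;
    }
}
```

    A two-dimensional object array of this shape is what you would then assign in one call to the Value2 property of an Excel Range with the same number of rows and columns, instead of writing cell by cell.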

    Another possibility would be to use Table.Range.XML to pick up the table content as XML, then transform that to an array. But you'd have to work through all the "extraneous" information (formatting, etc.) to boil the XML down to just the data in the table cells.


    Cindy Meister, VSTO/Word MVP
    Friday, January 6, 2012 1:17 PM
    Moderator
  • Hi Cindy.

    Thank you very much for answering.

    Okay, I'll have to take a deeper look into the interop, since XML can't be used (what if I save the .doc file as a .docx file? Will it still be in the binary file format?). I'm new to C#, so I have some follow-ups to your suggestions:
    - Does that mean that I can read the Word file into an array and convert it into a table in Excel? Could you give me a code example? You don't need to think about converting characters, as I have already made some functions to handle that.

    Again, thank you!

    Friday, January 6, 2012 3:58 PM
  • Hi Julie

    Before I can answer your questions I need to know more about the environment in which your solution will be used.

    Are you doing this "server-side" or will there be user-interaction with Word/Excel?

    Is this all to be done for Office 2003? (You propose saving in docx file format, but would this only be for the convenience of getting at the XML?)

    Word 2003 does have its own XML vocabulary that you can access while the file is open in Word 2003. A Word 2003 binary file (*.doc) can also be saved with the extension .xml, which will save it as an XML document in WordProcessingML.

    WordProcessingML is essentially a precursor of Word Open XML (much is very similar). Essentially, the content of Table.Range.XML will be the same as the XML for the table that you can pull out of the 2003 document saved as *.xml. All you need is standard XML tools to work with WordProcessingML. Neither the Open XML SDK nor System.IO.Packaging is required.

    If you save a Word 2003 file as *.docx then it will be Word Open XML (the 2007/2010 kind of file format). At that point, you can use System.IO.Packaging or the Open XML SDK to access the content.

    <<I'm new to C#, so I have some follow-ups to your suggestions:
    - Does that mean that I can read the Word file into an array and convert it into a table in Excel? >>

    I'm not sure to which suggestion you're referring in this question. The interop one wouldn't read the entire Word file, just pick up the table content, which you'd need to turn into an array, then drop into Excel.

    There's no way to directly put content into Excel from XML or from Word - some kind of conversion process is always required. A table in Excel's XML is completely different from a table in Word's XML. 


    Cindy Meister, VSTO/Word MVP
    Friday, January 6, 2012 4:19 PM
    Moderator
    There will be no user interaction with Word / Excel. I'm programming a Windows application in which the user can specify the file path of the Word file to read from. The Word file is a 2003 document (.doc), so yes, you're right; I only proposed saving the file in the .docx format for the convenience of getting at the XML.

    I did not know about the XML for the .doc format. How do I make use of it so that I can access and save each table in the Word document? It would be really nice if I could save the table as a table object, so that it preserves its structure (cells).

    As already mentioned, at this moment I look through tables using the Interop:

    foreach (Table tb in wd.Tables)
    {
        // Look through every cell.
    }

     

    Can I do something like this with the XML for .doc formats?

     

    Monday, January 9, 2012 11:49 AM
  • Hi Julie

    If you have the document open in Word 2003, then you can pick up the WordProcessingML using something like this:

    foreach (Table tb in wd.Tables)
    {
        string tbContent = tb.Range.XML;
    }

     

    As you can see, the XML property of the Range object gives you a string. That string contains the valid WordProcessingML. You can work with that as you would with any XML, using an XMLReader, Linq, whatever you like. All in memory (rather than working with the API), so it will be tons faster.

     

    You could speed things up even more by using

       document.Content.XML

    Because then you could use the XML tools to loop through the tables, as well.

     

    I don't remember exactly how the XML "tree" looks for tables, but if you create a simple document with a small table, then you can extract the string and investigate it. Your starting point would be the w:body element - everything outside of that you can ignore.
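
    To give a rough idea of what that could look like (a sketch, not tested against a real Word 2003 document): assuming the string uses the Word 2003 namespace http://schemas.microsoft.com/office/word/2003/wordml, with tables as w:tbl, rows as w:tr, cells as w:tc, and text in w:t elements inside the paragraphs (w:p), LINQ to XML can loop through the tables in memory. The class and method names (WordMlTables, ReadTables, CellText) are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class WordMlTables
{
    // Assumed namespace of Word 2003 WordprocessingML.
    static readonly XNamespace W = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Returns one entry per table; each table is an array of rows,
    // and each row is an array of cell texts.
    public static List<string[][]> ReadTables(string wordMl)
    {
        // Start at w:body; everything outside it is ignored.
        XElement body = XDocument.Parse(wordMl).Descendants(W + "body").First();

        // Descendants (not Elements) because tables may sit inside wrapper
        // elements. Note that nested tables show up as separate entries.
        return body.Descendants(W + "tbl")
            .Select(tbl => tbl.Elements(W + "tr")
                .Select(tr => tr.Elements(W + "tc")
                    .Select(tc => CellText(tc))
                    .ToArray())
                .ToArray())
            .ToList();
    }

    // Joins the text runs of each paragraph; paragraphs are separated by '\n'.
    static string CellText(XElement cell)
    {
        return string.Join("\n", cell.Elements(W + "p")
            .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value))));
    }
}
```

    With the document open, this would be called as WordMlTables.ReadTables(document.Content.XML), so Word is asked for the XML only once and all looping over tables, rows, and cells happens in memory.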

     

    I think you can find more information on OpenXMLDeveloper.org. Before Office 2007 came out, they handled WordProcessingML there, and I believe there's still some information about working with it.


    Cindy Meister, VSTO/Word MVP
    Monday, January 9, 2012 3:54 PM
    Moderator
  • But this will save each table as one long string. That makes it pretty hard to get the exact cells that I want?

    After I have stored a table in the variable tb, I look at cells 2 and 3 and check what values they have. If one of them has a specific value, I look deeper into the table and store SOME of the values from some of the cells.

    if (tb.Range.Cells[2].Range.Text.StartsWith("TEST") || tb.Range.Cells[3].Range.Text.StartsWith("TEST"))
    
    


    Won't that become a difficult task to do with a string?

    Thank you very much for your time and suggestions.

     

    • Edited by the_julle Tuesday, January 10, 2012 2:28 PM
    Tuesday, January 10, 2012 2:17 PM
  •  

    Hello Julie,

    In Cindy Meister’s post on January 9 she suggested that you might find additional help in the content at www.OpenXMLDeveloper.org.

    The content there includes articles and blog posts about WordprocessingML. An example is the following:

    Search and Replace Text in an Open XML WordprocessingML Document
    http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/05/12/search-and-replace-text-in-an-open-xml-wordprocessingml-document.aspx

    You can find more content at the site that may be helpful. Incidentally, in the comments at the end of the blog mentioned above blogger Eric White includes his email address.

    All told, because of the complexity of your application and the unacceptability of the expert assistance provided to this date your issue falls into the paid support category which requires a more in-depth level of support.  Please visit the below link to see the various paid support options that are available to better meet your needs. http://support.microsoft.com/default.aspx?id=fh;en-us;offerprophone

    Please click on “Mark as answer” if the information posted here helps you resolve your issue.

    Regards,
    Chris Jensen
    Senior Technical Support Lead

    Tuesday, January 17, 2012 2:32 PM
    Moderator
  • Hi Julie

    When working with XML you use tools such as XPath or Linq in order to pick up the exact elements you want to work with. You don't parse the string as a string. Programming XML is a major paradigm shift from object model programming. It's a different world-view.
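
    To illustrate the difference (a sketch only, under the same assumption about the Word 2003 namespace and the w:tbl / w:tc / w:p / w:t element names): the test on cells 2 and 3 that you do through the object model could be written with LINQ to XML roughly as follows. TableFilter and FindTables are hypothetical names, and the prefix "TEST" is passed in as a parameter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TableFilter
{
    // Assumed namespace of Word 2003 WordprocessingML.
    static readonly XNamespace W = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Returns the cell texts of every table whose 2nd or 3rd cell
    // (counted in document order) starts with the given prefix.
    public static List<List<string>> FindTables(string wordMl, string prefix)
    {
        return XDocument.Parse(wordMl)
            .Descendants(W + "tbl")
            .Select(tbl => tbl.Descendants(W + "tc").Select(tc => CellText(tc)).ToList())
            // Skip(1).Take(2) picks the cells the 1-based object model calls 2 and 3.
            .Where(cells => cells.Skip(1).Take(2)
                .Any(text => text.StartsWith(prefix, StringComparison.Ordinal)))
            .ToList();
    }

    // Concatenates the text runs of a cell, one line per paragraph.
    static string CellText(XElement cell)
    {
        return string.Join("\n", cell.Elements(W + "p")
            .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value))));
    }
}
```

    Each matching table comes back as a list of cell texts, so picking SOME of the cells is then a matter of indexing into that list rather than searching a string. Tables that contain nested tables would need extra care, since Descendants also returns the cells of the inner tables.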

    You're either going to have to stick with the Word APIs, which are slower but you at least understand how to use them, or you're going to need to expand your horizons and learn XML programming. Or you need to pass this task on to someone who already has these skills.


    Cindy Meister, VSTO/Word MVP
    Tuesday, January 17, 2012 4:41 PM
    Moderator