What is the fastest way to read paragraphs in Word from C#?

  • Question

  • What is the fastest way to read paragraphs in Word from C#?
    At the moment it takes hours to read a 1000-page file.

    Example:

    foreach (Microsoft.Office.Interop.Word.Paragraph MyParagraph in MyWordDocument.Paragraphs)
    {
        string Text = MyParagraph.Range.Text;
    }

    Friday, January 26, 2018 1:53 PM

All replies

  • A for each loop is the fastest way to read individual paragraphs. With 1,000 pages to process, I'm not surprised it takes a while. Perhaps you should consider whether you can be more selective about which paragraphs get read (e.g. limiting to specific Styles and/or content, which you could use Find for).

    Cheers
    Paul Edstein
    [MS MVP - Word]

    Saturday, January 27, 2018 12:28 AM
  • How do programs that convert from Word to PDF do it?
    I need to do the same without waiting for hours.
    Monday, January 29, 2018 10:22 AM
  • Your question is as clear as mud. What are you trying to achieve?

    Cheers
    Paul Edstein
    [MS MVP - Word]

    Monday, January 29, 2018 12:14 PM
  • How do programs that convert from Word to PDF do it?
    I need to do the same without waiting for hours.

    Use Word's ability to save a document in PDF format.

    See https://msdn.microsoft.com/en-us/library/microsoft.office.interop.word._document.saveas.aspx and use the WdSaveFormat enumeration member wdFormatPDF



    • Edited by RLWA32 Monday, January 29, 2018 12:29 PM added link
    Monday, January 29, 2018 12:20 PM
  • The PDF format does not work for me because it does not treat paragraphs in the same way. I need to read the paragraphs in Word quickly. Word itself does it, and so do other conversion programs. How do they do that?
    Monday, January 29, 2018 3:32 PM
  • Well, then don't use out-of-process COM interop.  The fastest way would be for you to open the word document directly and read the data yourself through reference to the Word document's file format documentation.  This will not be a trivial task.

    Monday, January 29, 2018 3:43 PM
  • So continuing from where RLWA32 left off, open the file as XML: a .docx is a ZIP package, so rename it with a .zip ending, extract it, find the correct parts (the main body is normally in word/document.xml) and read from there. There are some tools that let you do this without needing to do everything from scratch.
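
    A minimal sketch of that idea in C#, using only System.IO.Compression and LINQ to XML (no Word, no Open XML SDK). It assumes the main body lives at the conventional part name word/document.xml, which is usual but not guaranteed, and it only works for .docx, not .doc. The class and method names (DocxParagraphReader, ReadParagraphs) are made up for illustration. Paragraphs nested inside text boxes would be counted both on their own and as part of their containing paragraph.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: reads paragraph text straight from the .docx package.
public static class DocxParagraphReader
{
    // Main WordprocessingML namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the concatenated <w:t> text of every <w:p> in the main document part.
    public static List<string> ReadParagraphs(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main part uses its conventional name.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found in package.");

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);
                return xml.Descendants(W + "p")
                          .Select(p => string.Concat(
                              p.Descendants(W + "t").Select(t => t.Value)))
                          .ToList();
            }
        }
    }

    // Convenience overload for a file on disk.
    public static List<string> ReadParagraphs(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            return ReadParagraphs(fs);
        }
    }
}
```

    Usage would be something like `List<string> paragraphs = DocxParagraphReader.ReadParagraphs(path);` where path points at a .docx file. Headers, footers, footnotes and so on live in other parts of the package and are not read by this sketch.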
    Tuesday, January 30, 2018 12:10 AM
  • OK, but how is it done? With time and many coffees it can be done, but is there any example?
    Tuesday, January 30, 2018 5:28 AM
  • As I said earlier, it might help if you actually took the time to explain what you're trying to achieve. Presumably there is some purpose behind reading the various paragraphs...

    Cheers
    Paul Edstein
    [MS MVP - Word]

    Tuesday, January 30, 2018 6:13 AM
  • It is as clear as mud... What I intend is to read all the paragraphs...
    If it takes hours to read the Word file, someone has made a mistake, and it was not me.
    Tuesday, January 30, 2018 6:21 AM
  • How does Word itself read the paragraphs? It clearly does not use this method. Is it being hidden from us to keep us at a disadvantage?
    Wednesday, January 31, 2018 4:45 AM
  • Reading the paragraphs is what you're doing. Unless you're reading them for the sake of reading them, that's not what you're trying to achieve...

    Cheers
    Paul Edstein
    [MS MVP - Word]

    Wednesday, January 31, 2018 7:29 AM
  • If Word reads fast, it is because it uses another method. What is that method?
    The same goes for conversion programs. How can it take hours to read the paragraphs of a 10 MB file? It is not possible unless there is a big programming problem.

    Let's not ramble. I have stated the problem clearly, and there are people who know how to read the paragraphs fast.
    Wednesday, January 31, 2018 9:08 AM
  • Programs that convert from Word to PDF typically attempt to convert *everything* in the document, i.e. ordinary text, formatting, tables, inline and floating images, displayed objects and so on. Doing that requires either
     a. a lot of understanding of the structure of a .docx (or worse, a .doc) and of how to create the equivalent types of object in PDF, or
     b. a library or converter that will do most of the work for you. In a sense, the Word object model is such a library: it lets you open a document in one line and save it as a PDF in another. There are doubtless other libraries out there that do not depend on Word.

    A 3rd-party library/converter (as in (b)) works because whoever wrote it had the understanding described in (a) and then either wrote "raw" code in C/C#/whatever, used one or more helper libraries, or both. For .docx documents using C# the most useful library is probably the Open XML SDK. However, you still have to understand what you need to get from the .docx.

    Let's start with your example code. It gets the plain text of each paragraph in the main body of the document. It does not get the headers/footers/text in floating objects, and so on. It won't tell you if a paragraph is inside a table or which cell it is in. But if all you need is the plain text of the paragraphs, there are a number of ways you can consider doing that.

    I have done some simple performance tests for some of them - I cannot attempt them all. These were done using an old 2.66GHz quad core Core2 processor with 16 GB memory running 64-bit Windows 10 and (where appropriate) 32-bit Word 2016. I used a 10000-page document containing approximately 133000 paragraphs (i.e. probably not as many as you might expect in such a document) of gibberish - a total of just over 4000000 "words". No tables, no inline or floating objects. Opening this document using Word (i.e. manually) takes about a minute.

    If you want the fastest method, skip to point (e), but ensure you read the comments.

    a. use the VBA equivalent of your code. This uses the Word object model, in-process.

    It processed the entire document (already open) in 26 seconds. The code was

    Sub readParasUsingForEach()
    Dim p As Word.Paragraph
    Dim s As String
    Debug.Print Now()
    For Each p In ActiveDocument.Paragraphs
      s = p.Range.Text
    Next
    Debug.Print Now()
    End Sub

    b. use VBA but iterate using a regular for loop. This uses the Word object model, in-process. 

    It processed the first 500 paragraphs in 10 seconds, the next 500 in 15 seconds, and continued to slow down. Paragraphs 2500-3000 took 60 seconds. (We can guess that Word might even be counting the paragraphs from the beginning for each iteration, which spells doom for processing a long document - in fact, my original tests never finished.) The code I used was:

    Sub readParasUsingFor()
    Dim c As Long
    Dim j As Long
    Dim lng As Long
    Dim s As String
    Debug.Print Now()
    With ActiveDocument.Paragraphs
      c = .Count
      For lng = 0 To 5
        For j = 1 To 500
          s = .Item(lng * 500 + j).Range.Text
        Next
        DoEvents
        Debug.Print Now()
      Next
    End With
    End Sub

    c. use a .NET VSTO Addin with code like yours. This uses VSTO wrappers that invoke the Word object model via COM Interop. I believe this generates a .dll and that the calls are in-process, but I am not certain about that. This was tested using Visual Studio debug mode.

    After opening the document it processed the entire document in 50 seconds.

    The relevant code was like this (you don't need the line that sets the count, which takes about 3 seconds, by the way). I use the i++ and j++ lines to give me places to set breakpoints.

            private void ThisAddIn_Startup(object sender, System.EventArgs e)
            {
                Word._Document MyWordDocument = this.Application.Documents.Add("d:\\test\\2018012801 speed test para read\\doc1.docx");
                int i = 0;
                int j = MyWordDocument.Paragraphs.Count;
                foreach (Word.Paragraph MyParagraph in MyWordDocument.Paragraphs)
                {
                    i++;
                    if (i==30000)
                    {
                        i = 0;
                        j++;
                    }
                    string myText = MyParagraph.Range.Text;
                }
                MyWordDocument.Close();
            }


    d. Use COM Interop and the Word object model, but out-of-process. In this case I wrote a simple Console App. This was tested using Visual Studio debug mode.

    After creating the Word object (which actually took about 60 seconds) and opening the Word document (another 180 seconds!) this processed the entire document in around 28 minutes. However, the processing speed per paragraph does not seem to vary much over that time, unlike example (b).

    Out-of-process is (in essence) where the process has to make Word calls to a different EXE, and for that to work, data has to be "marshalled" between processes. All that stuff is done "under the hood" so nothing in the code really indicates that it is going on, but that is one reason why it is almost invariably going to take longer than in-process code.

    e. use the Office Open XML SDK. This does not need Word, but will only work with documents such as .docx, not the older .doc type.

    I adapted the code at 

    https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/how-to-retrieve-paragraphs-from-an-office-open-xml-document 

    and ran it, removing the paragraph-by-paragraph console logging.

    This only took 2 seconds to process the entire document. Even with logging it was faster than everything except (a). However, I have to warn you that it is not really doing quite the same thing as the other examples. It's really just grabbing the content of all the <w:t> text Elements within <w:p> paragraph Elements. But for a real document, you might also need to take account of where the <w:t> nodes are. For example, they may be nested inside a piece of XML that records a change such as an insertion or deletion, and in a real-world situation you might have to reconstruct the "up-to-date" text by including some <w:t> elements and omitting others.
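
    As a rough illustration of that last point, here is a sketch in C# (LINQ to XML only) that rebuilds the "current" text of one paragraph by skipping any <w:t> that sits under a tracked deletion (<w:del>) or under the source of a tracked move (<w:moveFrom>), while keeping text under <w:ins>. As far as I know, deleted text is normally stored in <w:delText> rather than <w:t>, so the <w:del> check is mostly defensive. The names TrackedChangeText and CurrentText are invented for this example, and real documents have more cases than this handles.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: paragraph text as it would read with tracked changes accepted.
public static class TrackedChangeText
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string CurrentText(XElement paragraph)
    {
        return string.Concat(
            paragraph.Descendants(W + "t")
                     // Omit text that belongs to a deletion or to the old
                     // location of moved content; inserted text is kept.
                     .Where(t => !t.Ancestors().Any(a =>
                         a.Name == W + "del" || a.Name == W + "moveFrom"))
                     .Select(t => t.Value));
    }
}
```

    You would call CurrentText on each <w:p> element instead of concatenating every <w:t> blindly.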

    If you wanted to go faster, you might find that one of the open source projects that does Word document conversion has code you can re-use. I am sure there are a lot, but for example a project such as LibreOffice must have all the code necessary to (a) read .docx (and .doc) and probably (b) to create .pdf. Whether it is easy code to re-use is another matter.


    Peter Jamieson


    Wednesday, January 31, 2018 4:38 PM
  • I appreciate your effort. Your answer is very extensive and checking each point thoroughly takes time. Here is my quick reply:

    1.- Word reads a large file in seconds. If I use the method they recommend it can take hours, let alone for 10,000 pages. Anyone can check this. That means Word does not use the method it recommends to others. You say that third-party libraries can be used, but Microsoft does not use them. So what does Microsoft use? It is not what they recommend to others.

    2.- You use the same method as me, which is the only one that exists, and you say it took 26 seconds to process the document. Where does that value come from? As I say, it can take hours.
    Your code is the same as mine:
    For Each p In ActiveDocument.Paragraphs
       s = p.Range.Text
    Next

    c. This reuses the same method:
    MyWordDocument.Paragraphs.Count;
                
    foreach (Word.Paragraph MyParagraph in MyWordDocument.Paragraphs)
                    
    string myText = MyParagraph.Range.Text;

    d.- This reuses the same method again, but this time called from an exe.

    e.- Office Open XML SDK. I need to process both .doc and .docx, and if Microsoft does not use it, why should I? It would be easier to tell us which method they use. Anyway, for .docx this is the best answer, but the code that they recommend (ActiveDocument.Paragraphs) is different.

    Conclusion:
    1.- Point (e) is interesting, but it does not support .doc and I would have to make big changes to my code.

    2.- Why does the standard method take 26 seconds for you and hours for me, if it is the same code and we know that the standard method is very slow?

    3.- Can someone from Microsoft Word explain to us how the paragraphs are really read, or is it a business secret? If it is a secret, let them say so and the controversy is over; if it is not, please explain how the document is read so quickly.

    My reading is still slow, even though you use the only known method and yours is fast.

    Very grateful for your response.

    Wednesday, January 31, 2018 6:38 PM
  • Microsoft Word is proprietary software owned by Microsoft and licensed for use by end-users.  Outside of whatever is available with Open XML, it does not publicly make available any documented library, engine, or API that a developer can use to create, read, or modify Word documents, other than its documented Object Model for use with COM.  Microsoft has publicly disclosed the file formats for Word documents.

    If out-of-process COM interop is too slow and you don't want to write your own code by reference to the documented file formats then you can search the internet (Google is your friend) for open-source or commercial libraries that can provide the desired access to Word documents.





    • Edited by RLWA32 Wednesday, January 31, 2018 7:19 PM
    Wednesday, January 31, 2018 7:08 PM
  • The previous answer clarifies the situation. We're talking about Microsoft using different code, that's why it reads so fast.

    I have been running tests and have verified that asking for the page to which each paragraph belongs (I need it for sizing) makes the reading 4 times slower.

    It is true that if you only do this:
    foreach (Microsoft.Office.Interop.Word.Paragraph MyParagraph in MyWordDocument.Paragraphs)
    {
        string Text = MyParagraph.Range.Text;
    }

    the reading takes about 40 minutes per 10000 pages.

    If you also ask for the page to which the paragraph belongs, the reading is 4 times slower (about 160 minutes):

    string MyText = null;
    foreach (Microsoft.Office.Interop.Word.Paragraph MyParagraph in MyWordDocument.Paragraphs)
    {
        Microsoft.Office.Interop.Word.Range MyRange = MyParagraph.Range;
        MyText = MyRange.Text;
        int Page = (int)MyRange.Information[Microsoft.Office.Interop.Word.WdInformation.wdActiveEndAdjustedPageNumber];
    }

    I am also forced to do more processing per paragraph, which is why the time increases to an unbearable level.

    I still have the same problem. Does anyone have an idea that does not involve limiting the reading to .docx?

             

    • Edited by zequion1 Wednesday, January 31, 2018 7:41 PM
    Wednesday, January 31, 2018 7:37 PM
  • <<
    3.- Can someone from Microsoft Word explain to us how the paragraphs are really read, or is it a business secret? If it is a secret, let them say so and the controversy is over; if it is not, please explain how the document is read so quickly.
    >>

    I think if the information was openly available, you would be able to find it quite easily on the Microsoft site, or someone who actually works for Microsoft would probably have pointed you to it by now. As background, Microsoft does do Open Source, e.g. the Office Open XML SDK has been open source for a while AFAIK, but they still clearly prefer to keep their Office code as proprietary code. At one time I think there was a mechanism where organisations could get hold of Office code under very strict conditions, but I do not know if that still exists. The same probably applies to the converter code they developed to allow earlier versions of Word to read the .docx formats.

    In my view, that tells you that you have to use something else to achieve your objective. If my (e) is ruled out because of .doc, then you either have to find someone else's code (e.g. from another Open Source project), or you have to use one of the approaches that uses the Word object model.

    If you use the object model, you absolutely do need to find out why your code is taking so long to run. But the time will certainly vary depending on how exactly you do that - e.g. when you posted your question I don't think you said whether you were using out-of-process or in-process. In my tests, in-process was around 40 times faster than out-of-process, so I think you should try in-process. Further, since Word *has to be present* when you use the object model, I would have to consider using VBA for that, which is an awful lot simpler than using a .NET Addin. If you do that, I strongly recommend that you ensure your loops occasionally execute the DoEvents method, which makes it slightly easier to stop your code running without crashing Word.

    (Not to be taken seriously, but the code for one version of Word is publicly available, although unlikely to be useful for any practical purpose today. It's the code for Word for Windows 1.1a, which you can find at http://www.computerhistory.org/atchm/microsoft-word-for-windows-1-1a-source-code/  Unlike the later .doc format, I don't think Microsoft has ever published the specification for the earlier .doc format used in that version.)


    Peter Jamieson


    • Edited by Peter Jamieson Wednesday, January 31, 2018 7:56 PM
    • Marked as answer by zequion1 Thursday, February 1, 2018 6:30 AM
    Wednesday, January 31, 2018 7:49 PM
  • Are the Word documents in question stored on a network server or on the local drive?
    Thursday, February 1, 2018 3:27 AM
  • Asking for the page number of each paragraph makes it 4 times slower. The files are read from a local drive.

    Thank you for your answers. You can close this post if you want.

    • Edited by zequion1 Thursday, February 1, 2018 6:31 AM
    Thursday, February 1, 2018 6:30 AM
  • FWIW expect any method of retrieving page number information to be slow. Page numbers are not stored inside the .doc/.docx. Word has to communicate with the "current" printer driver to lay out pages, and it seems to be quite a slow process. 

    With VBA, adding the equivalent code to retrieve the page number massively increased the execution time (far more than a factor of 4), and became slower as the process continued.

    In some cases, using

    Application.ScreenUpdating = False

    at the beginning of the process and setting it to True again at the end (or if an error occurs) can make a difference. Sometimes it can be large. Sometimes, computing the page count via the ActiveDocument.ComputeStatistics function can itself take a long time but results in more consistent retrieval performance as you get further into the document. It is worth noting that an obvious-looking way to return the page count, ActiveDocument.Content.Information(wdActiveEndAdjustedPageNumber), is not reliable - the value seems to indicate how far Word has got with paginating until it reaches the final page.

    Another way to determine page number information would be to iterate through the pages, retrieving the .Range.Start of each page. Then you could compare the start and/or end of the paragraph range with the page positions as you retrieved the paragraphs.

    Some simple code to do that would be

    Dim r As Range
    Set r = ActiveDocument.Range(0,0)

    to start with, then use a loop containing

    Set r = r.GoToNext(What:=WdGoToItem.wdGoToPage)

    to go to the next page. 

    Even that gets slower as it proceeds, but with ScreenUpdating off it seems to be tolerable.
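
    Once the page start positions have been collected, the comparison itself is cheap. Here is a sketch in C# (since the original code is C#) that assumes you have stored each page's start offset in an ascending int array, with the first entry normally 0, and looks up a paragraph's page with a binary search. PageLookup and PageOf are names I have made up; nothing here calls Word.

```csharp
using System;

// Hypothetical helper: maps a character position to a 1-based page number.
public static class PageLookup
{
    // pageStarts[i] is the start offset of page i + 1, in ascending order.
    // Returns 0 if position lies before the first recorded page start.
    public static int PageOf(int[] pageStarts, int position)
    {
        int idx = Array.BinarySearch(pageStarts, position);
        if (idx < 0)
        {
            // Not an exact match: ~idx is the index of the first larger
            // element, so the page containing position is the one before it.
            idx = ~idx - 1;
        }
        return idx + 1;
    }
}
```

    In the paragraph loop you would then call something like `PageLookup.PageOf(pageStarts, MyParagraph.Range.Start)` instead of asking Word for Information on every paragraph.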



    Peter Jamieson

    Thursday, February 1, 2018 11:58 AM
  • Paul,

    I was wondering if you would have time to give suggestion/solution to this post. Thanks..Nam

    Thursday, November 21, 2019 2:11 AM