Bursting document using Interop.Word

  • Question

  • A preexisting process combines multiple customer invoices into a single .doc file. I need to extract or copy each invoice, which may span multiple pages, and create a new document for each one using the Interop.Word assembly, as this is part of a WinForms app. Any general guidance would be very much appreciated.
    Doyle
    Saturday, January 28, 2012 6:39 PM

All replies

  • Hi Doyle

    The first thing you need to do is determine what kind of thing uniquely identifies the start of a new invoice.

    Please confirm: These documents are in the old Word 97-2003 *.doc file format and not in the new 2007-2010 *.docx file format?

    Which version of Word will you have available to work on the files?

    Are you supposed to be doing this in a server environment?


    Cindy Meister, VSTO/Word MVP
    Sunday, January 29, 2012 8:21 AM
    Moderator
  • Thank you so much for your assistance Cindy,

    The page starting each invoice may be identified by the presence of the string "Page 1".

    This is the Word 97-2003 .doc file format.

    I am not running on a server. This is a WinForms app sitting on each client, so I don't think I am out of bounds on licensing. You bring up an excellent point, as I don't presently know which Word version will start when the Word application fires up. I just checked the Version property on my dev box and, of course, got 14 for Word 2010.

    This work is for a single Fortune 500 company, so my firm could require that they be on 2007 or 2010 if that makes a significant difference to the approach to this effort, time to completion, or performance. I will certainly rely on your advice for that decision.

    I cannot change the initial starting point of Word 97-2003, as that is what is exported out of Crystal Reports.

    My folks here will agree to laying a version requirement on the client rather than coding to multiple versions of Word.

    Thank you Cindy.

    Doyle
    Monday, January 30, 2012 7:58 PM
  • Hi Doyle

    At this point, I don't think a version requirement is necessary, although something could come up as we go that would make it desirable.

    My question was more along the lines of whether we could leverage the Open XML File formats rather than needing to automate the Word application. But since we're dealing with the old file format, that's not possible.

    The string "Page 1" that you want to use to identify the start of each document... I take it it's in the Header/Footer area? If yes, that's not going to be good enough as it's not really possible to associate a header/footer with a "virtual page" opened in the UI. Your human eye can do it, but the programming interface can't. We need something else... If it's static text in the document body, then it can work.


    Cindy Meister, VSTO/Word MVP
    Tuesday, January 31, 2012 12:42 PM
    Moderator
  • Cindy,

    The text is static, not in a header or footer!

    I should say that this is my first effort at Word automation and the interop namespace.

    Doyle
    Tuesday, January 31, 2012 2:58 PM
  • Cindy, this is a stub I did:

            private void button1_Click(object sender, EventArgs e)
            {
                // get the path to each .doc file to be processed
                string[] pdfFilePaths = Directory.GetFiles(@"C:\TrioInvoicePdfs\ToBeProcessed\", "*.doc");
    
                Word._Application iWordApp = (Word._Application) new Word.Application();
                iWordApp.Visible = false;
               
                object missing = System.Type.Missing;           
    
                // for each .doc file found, find each invoice and create a separate .doc file containing each invoice
                // and save the .doc file as a pdf
    
                foreach (string pdfFilePath in pdfFilePaths)
                {                
                    object fileObject = (object)pdfFilePath;
    
                    Word._Document iDocument = null;
    
                    iDocument = (Word._Document)iWordApp.Documents.Open(ref fileObject,
                        ref missing, ref missing, ref missing, ref missing,
                        ref missing, ref missing, ref missing, ref missing,
                        ref missing, ref missing, ref missing, ref missing,
                        ref missing, ref missing, ref missing);
                   
                    iDocument.Activate();
                              
                    // create the new .doc files for each invoice
                   
    
                    // some trial code to use on the newly created .docs in above process
    
                    object saveFormatAsObject = (object)Word.WdSaveFormat.wdFormatPDF;                
                    string newFilePath = pdfFilePath.Replace("ToBeProcessed", "ProcessedArchive");
                    newFilePath = newFilePath.Replace(".doc", ".pdf");
                    object newFilePathAsObject = (object)newFilePath;
    
                    iDocument.SaveAs(ref newFilePathAsObject, ref saveFormatAsObject, ref missing,
                            ref missing, ref missing, ref missing,
                            ref missing, ref missing,
                            ref missing, ref missing,
                            ref missing, ref missing, ref missing,
                            ref missing, ref missing, ref missing);               
    
                    iDocument.Close(missing, missing, missing);  
                   
                }
    
                iWordApp.Quit();
            }
    


    Doyle
    Tuesday, January 31, 2012 8:12 PM
  • Hi Doyle

    OK, then you'll need to use the Range.Find() method to pick up that string. My inclination would be to create an array or collection to store the Range objects so that, once all have been found, you can go back and "loop" them. This will be trickier than a straightforward "Find", since you'll want the Range you're working with to extend from one "Page 1" to the start of the next.

    Follow-up question on this: What's between the end of one document and the beginning of the next? I'm guessing there's something that's forcing a new page, but you need to know what because you won't want that in the resulting document (forcing a new blank page at the end). You may want to turn on the display of non-printing characters...
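
    One way to answer that question in code, rather than by eye, is to dump the character codes at the end of a candidate range. The sketch below is only an illustration: TailInspector and DescribeTail are hypothetical names, and it works on a plain string such as the value of Range.Text. In Word's text stream a manual page break normally shows up as character 12 and a paragraph mark as character 13, but that is an assumption worth confirming against the actual Crystal Reports export.

    ```csharp
    using System;
    using System.Collections.Generic;

    static class TailInspector
    {
        // Hypothetical helper: lists the character code of each of the last
        // `count` characters of `text` (for example, the value of Range.Text).
        public static string DescribeTail(string text, int count)
        {
            if (string.IsNullOrEmpty(text) || count <= 0)
            {
                return string.Empty;
            }

            int start = Math.Max(0, text.Length - count);
            var codes = new List<string>();
            for (int i = start; i < text.Length; i++)
            {
                codes.Add(((int)text[i]).ToString());
            }
            return string.Join(" ", codes);
        }
    }
    ```

    For example, TailInspector.DescribeTail("Total\f\r", 3) returns "108 12 13": the letter "l", then a form feed (12), then a carriage return (13).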

    Take a look at the Range.Find.Execute method in the API documentation to get a feel for the parameters it offers. In this scenario, you can pass System.Type.Missing to most of them.

    Start by declaring a Word.Range object that contains the entire body of the document:
      Word.Range rngDoc = iDocument.Content;

    And another for doing the actual search:
      Word.Range rngSearch = rngDoc.Duplicate;

    At the beginning, they point to the same thing, but that will change, which is why it's a good idea to have both.

    When rngSearch.Find.Execute is successful, the content of rngSearch will change to include what was found. At this point, you need a new Range object that you assign to the array/collection.

    The next step is to set the search range to search from the END of the last "hit" to the end of the original range. If you just started over again, you'd go into a loop and continually pick up that first "hit":
        object oCollapseEnd = Word.WdCollapseDirection.wdCollapseEnd;
        rngSearch.Collapse(ref oCollapseEnd);
        rngSearch.End = rngDoc.End;

    Now repeat from the Find.Execute step onwards until Find no longer returns True.
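
    Put together, the search steps above might look roughly like the following. This is a sketch under assumptions, not tested against the actual invoices: InvoiceFinder and FindAll are hypothetical names, and the caller is assumed to pass in the open document and the marker text.

    ```csharp
    using System.Collections.Generic;
    using Word = Microsoft.Office.Interop.Word;

    static class InvoiceFinder
    {
        // Hypothetical helper: returns one Range per occurrence of `marker`
        // in the main body of `doc`.
        public static List<Word.Range> FindAll(Word._Document doc, string marker)
        {
            object missing = System.Type.Missing;
            object findText = marker;
            object forward = true;
            object wrap = Word.WdFindWrap.wdFindStop; // stop at the end of the range
            object collapseEnd = Word.WdCollapseDirection.wdCollapseEnd;

            Word.Range rngDoc = doc.Content;
            Word.Range rngSearch = rngDoc.Duplicate;
            var hits = new List<Word.Range>();

            rngSearch.Find.ClearFormatting();

            // Execute returns true while another occurrence is found.
            while (rngSearch.Find.Execute(
                ref findText, ref missing, ref missing, ref missing, ref missing,
                ref missing, ref forward, ref wrap, ref missing, ref missing,
                ref missing, ref missing, ref missing, ref missing, ref missing))
            {
                // rngSearch now covers the hit; store an independent copy of it.
                hits.Add(rngSearch.Duplicate);

                // Continue from the end of this hit to the end of the document.
                rngSearch.Collapse(ref collapseEnd);
                rngSearch.End = rngDoc.End;
            }

            return hits;
        }
    }
    ```

    Bear in mind that a plain search for "Page 1" also matches the beginning of "Page 10", "Page 11" and so on, so a longer marker or the MatchWholeWord parameter may be needed for long invoices.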

    The array/collection will contain Ranges that each point to an instance of "Page 1" in the text. You loop through these in pairs and extend the Range of the first to the starting point of the Range of the second (except for the last, which you extend to the end point of the document). Something like this:
        rngFirst.End = rngSecond.Start - 1;
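
    The pairing rule can be tried out on plain integer positions before any Range objects are involved. In this minimal sketch, InvoiceSpans and Compute are hypothetical names; `starts` is assumed to hold the starting position of each invoice in document order and `docEnd` the end position of the document.

    ```csharp
    using System;
    using System.Collections.Generic;

    static class InvoiceSpans
    {
        // Hypothetical helper: turns the starting position of each invoice
        // into (start, end) pairs. Each span ends one position before the
        // next one starts; the last span runs to the end of the document.
        public static List<Tuple<int, int>> Compute(IList<int> starts, int docEnd)
        {
            var spans = new List<Tuple<int, int>>();
            for (int i = 0; i < starts.Count; i++)
            {
                bool isLast = (i == starts.Count - 1);
                int end = isLast ? docEnd : starts[i + 1] - 1;
                spans.Add(Tuple.Create(starts[i], end));
            }
            return spans;
        }
    }
    ```

    With starts of 0, 500 and 1000 and a document end of 1500, this yields the spans (0, 499), (500, 999) and (1000, 1500).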

    And put that range into a new, empty document (iWordApp and missing as declared in your stub):
        Word.Document newDoc = iWordApp.Documents.Add(ref missing, ref missing, ref missing, ref missing);
        newDoc.Content.FormattedText = rngFirst.FormattedText;
        newDoc.SaveAs(/* etc. */);
        newDoc.Close(/* params */);
        newDoc = null;


    Cindy Meister, VSTO/Word MVP
    Wednesday, February 1, 2012 8:14 AM
    Moderator
  • Thank you, Cindy, for your reply. You gave me some good information from which I have been able to develop a little code. I think we may have had a small misunderstanding: the "Page 1" string is on the first page of an invoice, but it is not the first character string on that page. So I back up from the first hit to zero for the first page and back up all the others by the same amount.

    As you mentioned in paragraph two of your prior post, I am having difficulties with the page breaks. On the first iteration of the final loop in the code below, a document is created and saved! It does, however, have an additional page with a box (which encloses basic customer info, not populated on the unwanted page) and a couple of returns. The second iteration blows up at inewDoc.Content.FormattedText = rng.FormattedText;. I have spent some hours looking at the API and trying to find a way to hunt down and remove the last page break in the ranges, but have had no luck. Thanks for the help.

            private void button1_Click(object sender, EventArgs e)
            {
                try
                {
                    // get the path to each .doc file to be processed
                    string[] pdfFilePaths = Directory.GetFiles(@"C:\TrioInvoicePdfs\ToBeProcessed\", "*.doc");
    
                    Word._Application iWordApp = (Word._Application)new Word.Application();
                    iWordApp.Visible = false;
    
                    object missing = System.Type.Missing;        
                    
                    // for each .doc file found, find each invoice and create a separate .doc file containing each invoice
                    // and save the .doc file as a pdf
                    
                    List<int> rangeStartingPositions = new List<int>();
                    List<int> rangeEndingPositions = new List<int>();
                    object fileObject;         
    
                    foreach (string pdfFilePath in pdfFilePaths)
                    {
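                        fileObject = (object)pdfFilePath;

                        // start each file with empty position lists; otherwise hits from
                        // the previous file would carry over into this one
                        rangeStartingPositions.Clear();
                        rangeEndingPositions.Clear();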
    
                        Word._Document iDocument = null;
    
                        iDocument = (Word._Document)iWordApp.Documents.Open(ref fileObject,
                            ref missing, ref missing, ref missing, ref missing,
                            ref missing, ref missing, ref missing, ref missing,
                            ref missing, ref missing, ref missing, ref missing,
                            ref missing, ref missing, ref missing);
                        
                        iDocument.Activate();
    
                        Word.Range rngDoc = iDocument.Content;
                        Word.Range rngSearch = rngDoc.Duplicate;                   
    
                        rngSearch.Find.ClearFormatting();
                        rngSearch.Find.Text = "Page 1";
                        rngSearch.Find.Forward = true;                    
    
                        do
                        {
                            rngSearch.Find.Execute(
                                ref missing, ref missing, ref missing, ref missing, ref missing,
                                ref missing, ref missing, ref missing, ref missing, ref missing,
                                ref missing, ref missing, ref missing, ref missing, ref missing);
    
                            if (rngSearch.Find.Found)
                            {                            
                                rangeStartingPositions.Add(rngSearch.Start);
                                rangeEndingPositions.Add(rngSearch.End);                            
                            }
                        }
                        while (rngSearch.Find.Found);                   
    
                        // adjust range starting point as the string Page 1 is not at the first of each invoice
    
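                        // nothing to split if the marker was not found in this file
                        if (rangeStartingPositions.Count == 0)
                        {
                            iDocument.Close(ref missing, ref missing, ref missing);
                            continue;
                        }

                        int initialRangeOneStartingPoint = rangeStartingPositions[0];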
    
                        for (int i = 0; i < rangeStartingPositions.Count; i++)
                        {
                            // starting position of first range goes to zero and others are reduced by the same amount
                            rangeStartingPositions[i] = rangeStartingPositions[i] - initialRangeOneStartingPoint;
                        } 
                          
                        // ending position for each range is set the starting position of next range reduced by 1
                        // except for the last range 
                        for (int j = 0; j < rangeEndingPositions.Count; j++)
                        {
                            if (j < rangeEndingPositions.Count - 1)
                            {
                                rangeEndingPositions[j] = rangeStartingPositions[j + 1] - 1;
                            }
                            else
                            {
                                rangeEndingPositions[j] = rngDoc.End;
                            }
                        }
    
                        object startingPosAsObject;
                        object endingPosAsObject;
                        object saveFormatAsObject = (object)Word.WdSaveFormat.wdFormatDocument;
                        object filePathAsObject;
    
                        for (int k = 0; k < rangeStartingPositions.Count; k++)
                        {
                            startingPosAsObject = rangeStartingPositions[k];
                            endingPosAsObject = rangeEndingPositions[k];
                            //endingPosAsObject = rangeEndingPositions[k] - 30;
    
                            Word.Range rng = iDocument.Range(ref startingPosAsObject, ref endingPosAsObject);                                                                       
    
                            Word._Document inewDoc = iWordApp.Documents.Add(ref missing, ref missing, ref missing, ref missing);
                            inewDoc.Content.FormattedText = rng.FormattedText;                        
    
                            string datetimeAsString = DateTime.Now.ToString("s");
                            datetimeAsString = datetimeAsString.Replace(":", "-");
                            datetimeAsString = datetimeAsString.Replace("T", "-");
                            string filepath = "C:\\TrioInvoicePdfs\\ProcessedArchive\\" + datetimeAsString + "-" + k.ToString() + ".doc";
                            filePathAsObject = (object)filepath;
                            inewDoc.SaveAs(ref filePathAsObject, ref saveFormatAsObject, ref missing,
                                    ref missing, ref missing, ref missing,
                                    ref missing, ref missing,
                                    ref missing, ref missing,
                                    ref missing, ref missing, ref missing,
                                    ref missing, ref missing, ref missing);
    
                            inewDoc.Close(missing, missing, missing);         
    
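                        }

                        // close the source document before moving on to the next file
                        iDocument.Close(ref missing, ref missing, ref missing);
                    }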
    
                    iWordApp.Quit();
                }
                finally
                { 
                    
                }
            }                   


    Doyle
    Friday, February 3, 2012 11:00 PM
  • Hi Doyle

    Good, you've got the basics. But, as they say, the devil is in the details...

    <<I think we may have had a small misunderstanding as the "Page 1" string is on the first page of an invoice but it is not the first character string on that page>>

    I suspected that was the case, but I can only work with the information you give me...

    What is the first thing on each "Page 1" of each document? Once you have the unique text it's possible to work backwards - probably in a more reliable way than counting characters.
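
    If no unique text turns up, one alternative to counting characters is to ask Word which page a hit is on and then jump to the start of that page. This swaps in a different technique from the text-based one described above and is only a sketch: PageStart and StartOfPageContaining are hypothetical names, and the result depends on how Word paginates the document on the machine in question, so it would need checking against real invoices.

    ```csharp
    using Word = Microsoft.Office.Interop.Word;

    static class PageStart
    {
        // Hypothetical helper: returns the document position at which the
        // page containing the end of `hit` begins.
        public static int StartOfPageContaining(Word._Document doc, Word.Range hit)
        {
            object missing = System.Type.Missing;

            // Page number (as laid out by Word) at the end of the hit.
            int page = (int)hit.get_Information(
                Word.WdInformation.wdActiveEndPageNumber);

            // Jump to the start of that page and report its position.
            object what = Word.WdGoToItem.wdGoToPage;
            object which = Word.WdGoToDirection.wdGoToAbsolute;
            object count = page;
            Word.Range pageStart = doc.GoTo(ref what, ref which, ref count, ref missing);

            return pageStart.Start;
        }
    }
    ```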

    You also haven't answered my question about what's between the end of one document and the start of the next? In order for this to work, we have to know that. See my question in my previous reply.


    Cindy Meister, VSTO/Word MVP
    Saturday, February 4, 2012 8:38 AM
    Moderator
  • Hi Cindy and thanks,

    <<What is the first thing on each "Page 1" of each document? Once you have the unique text it's possible to work backwards - probably in a more reliable way than counting characters.>>

    I believe by "document" you mean "invoice" within the document. The first thing on the first page of an invoice is a logo. However, the logo appears on each page of the invoice. This is because a Crystal Reports page "header" was used in the application, and the report maintains this when exported to Word. There is no unique text at the start of an invoice. The only unique text is in what was the Page Count object in Crystal, which on the first page of each invoice reads "Page 1 of x".

    <<You also haven't answered my question about what's between the end of one document and the start of the next? In order for this to work, we have to know that. See my question in my previous reply.>>

    Cindy, from what I can see, there is nothing "between". The next invoice is just another page with the page break at the top, like all invoices following the first invoice. It will contain the unique string "Page 1". In an attempt to simplify, would it help if we considered the end of the line containing the string "Subtotal of Current Charges" to be the ending point? This is true in the images I am including, but it can vary due to runtime modifications to the original Crystal report object; perhaps I can handle that with a switch. Also, I have discovered that all text is inside text boxes.

    On this post I am including 2 images - the first invoice and an intermediate invoice.

    I will make another post with an image of the final invoice in the document, due to the limit of two images per post.

    Thanks so much for your help.

    Doyle
    Monday, February 6, 2012 8:47 PM
  • Cindy, an image of the final invoice in a document:


    Doyle
    Monday, February 6, 2012 8:51 PM
  • Hi Doyle

    Sometimes the complicated Microsoft method isn't the best method.

    I had the same challenge you did, but some googling found this open-source solution:

    http://www.pdfsam.org/

    Ten seconds later, I was done.

    Sunday, April 29, 2012 12:55 PM