Extract embedded documents from a Word document

    Question

  •  

    Hi,

     

    I have a Word document that contains attached (i.e. embedded) documents such as Word, PowerPoint, and PDF files.

    I have to extract those embedded documents through code.

     

     

     

    To extract an embedded Word document, I used the following code:

    word = new Microsoft.Office.Interop.Word.Application();

    doc = word.Documents.Open(ref fileName, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
    doc.Activate();

    int attachmentsCount = doc.InlineShapes.Count;

    // Word collections are 1-based, so iterate from 1 to Count inclusive.
    for (int i = 1; i <= attachmentsCount; i++)
    {
        embedDoc = doc.InlineShapes[i];
        tempDoc = (Document)embedDoc.OLEFormat.Object;
        tempDoc.SaveAs(ref tempFileName, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
    }

     

    I need help extracting other types of documents (in general, any type of attachment).

    Please advise.

    Monday, September 03, 2007 4:34 AM

Answers

  • Hi Ayyanar,

     

    Not every type of file can be extracted from a Word document. To extract an embedded OLE object, we need support from the application associated with the file. With support from Word, Excel, and PowerPoint, the steps to extract the files are:

    1.       Add the corresponding PIA to the project. For example, to extract a worksheet, add the Microsoft Excel 12.0 Object Library.

    2.       Use the DoVerb method to activate the OLE object.

    3.       Use Marshal.GetActiveObject to get the running instance of the application.

    4.       Save the application's active file using the reference obtained in step 3.

    The following code works on my side. Note: make sure no existing Excel or PowerPoint process is running before extracting the files.

    Code Snippet

    object VerbIndex = 1;
    object missing = Type.Missing;
    Word.Document doc = app.ActiveDocument as Word.Document;

    foreach (Word.InlineShape inlineShape in doc.InlineShapes)
    {
        if (inlineShape.OLEFormat.ProgID != null)
        {
            switch (inlineShape.OLEFormat.ProgID)
            {
                case "PowerPoint.Show.12":
                    inlineShape.OLEFormat.DoVerb(ref VerbIndex);
                    PowerPoint.Application ppt = Marshal.GetActiveObject("PowerPoint.Application") as PowerPoint.Application;
                    ppt.ActivePresentation.SaveAs(@"C:\testPPT.pptx", Microsoft.Office.Interop.PowerPoint.PpSaveAsFileType.ppSaveAsPresentation, Microsoft.Office.Core.MsoTriState.msoTrue);
                    ppt.Quit();
                    break;

                case "Excel.Sheet.12":
                    inlineShape.OLEFormat.DoVerb(ref VerbIndex);
                    Excel.Application excel = Marshal.GetActiveObject("Excel.Application") as Excel.Application;
                    excel.ActiveWorkbook.SaveAs(@"C:\testBOOK.xlsx", missing, missing, missing, missing, missing, Microsoft.Office.Interop.Excel.XlSaveAsAccessMode.xlNoChange, missing,
                        missing, missing, missing, missing);
                    excel.Workbooks.Close();
                    excel.Quit();
                    break;

                case "Word.Document.12":
                    Word.Document document = inlineShape.OLEFormat.Object as Word.Document;
                    object fileName = @"C:\testDOC.docx";
                    document.SaveAs(ref fileName, ref missing, ref missing, ref missing, ref missing, ref missing,
                        ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
                        ref missing, ref missing, ref missing);
                    break;

                default:
                    break;
            }
        }
    }

    As for PDF files, I am afraid extraction is not achievable via VSTO.

     

     

    Thanks

    Ji

    Thursday, September 06, 2007 10:10 AM
    Moderator
  • Hi,

     

    Any file that can be embedded in a document as an OLE object can be extracted.  However, we may not be able to provide you with a simple code example if the technology doesn't belong to us (such as Adobe Acrobat files). 

     

    What we are doing with Office objects is activating the object and then taking advantage of the exposed IDispatch interface so that we can use COM interop to communicate directly with the object's programming model.  As it happens, the Office applications generally expose SaveAs methods that we can call to save the files in question. Going through the Office programming model in this fashion is a handy shortcut that enables saving embedded objects with very little code. 

     

    I suspect that Adobe Acrobat exposes a similar programming model because there is an Adobe Acrobat Type Library.  You will have to browse the Type Library to see if it exposes some sort of Save or Save as method.  If it does, you can add it as a reference to your project (via the COM References tab of the Add Reference dialog in Visual Studio) and take a similar approach as Ji suggests in his post above.

     

    Basically what the code above does is it gets the embedded object via OLEFormat.Object and activates it--at which point the programming model can be accessed.  However, there are a couple of issues with the code above that should be pointed out.

     

    First, since what you are trying to do is activate the object, there is no need to call OLEFormat.DoVerb.  Instead, call OLEFormat.Activate--it is much more straightforward and you can't call it incorrectly.  If you do call DoVerb, the correct argument to pass would be the wdOLEVerb.wdOLEVerbPrimary constant.  I'm not sure which constant '1' maps to in the code above, but you should always call the primary verb to activate.  Doing this will ensure that the object activates correctly.  But again, calling the Activate method directly is the better choice.

     

    Second, there is no need to use Marshal.GetActiveObject.  Once you have activated the embedded object, you can cast OLEFormat.Object directly to the appropriate type--which is safer and more straightforward than going through the ROT (which is what Marshal.GetActiveObject does).

     

    Here is an example of how to do this for an embedded Excel document:

     

    Code Snippet

    Word.InlineShape embeddedWorkbook = this.InlineShapes[1];

    embeddedWorkbook.OLEFormat.Activate();

    Excel.Workbook workbook = (Excel.Workbook)embeddedWorkbook.OLEFormat.Object;

    workbook.SaveAs("test.xlsx", Missing, Missing, Missing, Missing, Missing,

        Excel.XlSaveAsAccessMode.xlNoChange, Missing, Missing, Missing, Missing, Missing);

    Excel.Application excel = workbook.Application;

    excel.Quit();

     

     

     

    As I mentioned earlier, utilizing the embedded object's programming model to perform the save is something of a shortcut.  There is a more involved solution that will work with any embedded object.  In order for the object to be embedded in the first place, it must support one of the COM IPersist interfaces (e.g. IPersistStorage, IPersistStreamInit, IPersistFile, etc.).  Therefore, an embedded object can always be extracted by calling Marshal.QueryInterface on the OLEFormat.Object (to determine the appropriate persistence interface), casting accordingly and then calling the appropriate method.  Depending on which persistence interface you use, you may need to call some additional methods to expose the appropriate storage over the top of a file.  Also, depending on the type of embedded object, you may still need to activate the object before you can successfully QueryInterface for the persistence interfaces.

     

    Some of the persistence interfaces have predefined Runtime Callable Wrappers (RCWs)--see System.Runtime.InteropServices and System.Runtime.InteropServices.ComTypes namespaces.  Also Visual Studio 2008 introduces the Microsoft.VisualStudio.OLE.Interop namespace which includes RCWs for pretty much all OLE interfaces.  If you can't find the RCW definition you need, you will need to create the RCW yourself. 
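
    As a rough illustration of the persistence-interface route, here is a minimal sketch for the simplest case only, IPersistFile, which already has an RCW in System.Runtime.InteropServices.ComTypes. The class and method names (PersistFileSketch, TrySaveViaPersistFile) are hypothetical, and it is an assumption that a given embedded object answers QueryInterface for IPersistFile at all; many objects will only offer IPersistStorage, which needs considerably more plumbing than shown here.

```csharp
using System.Runtime.InteropServices.ComTypes;
using Word = Microsoft.Office.Interop.Word;

static class PersistFileSketch
{
    // Hypothetical helper: tries to save an embedded OLE object through
    // IPersistFile. Returns false when the object does not expose that
    // interface, in which case another IPersist* interface is needed.
    public static bool TrySaveViaPersistFile(Word.InlineShape shape, string path)
    {
        if (shape.OLEFormat == null)
        {
            return false; // nothing to extract from this shape
        }

        // Some objects only answer QueryInterface once they are running.
        shape.OLEFormat.Activate();
        object embedded = shape.OLEFormat.Object;

        // An 'as' cast on a COM wrapper performs a QueryInterface under the
        // hood and yields null if the interface is not supported.
        IPersistFile persistFile = embedded as IPersistFile;
        if (persistFile == null)
        {
            return false;
        }

        // fRemember = false: save a copy without changing the file the
        // object considers current.
        persistFile.Save(path, false);
        return true;
    }
}
```

    A caller would loop over doc.InlineShapes and fall back to the programming-model approach (or to IPersistStorage) whenever this returns false. As noted in the surrounding text, the file written this way is whatever the object chose to persist and is not guaranteed to open as a normal document.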

     

    One thing should be mentioned here: the extracted bits represent the object in whatever state it was in when it was persisted.  There is no requirement that these bits be equivalent to a document file, so you should not expect to be able to write them to a file and then turn around and open the file in the application.  It is quite possible that there is an actual document file wrapped within the persistence blob and you may be able to extract it if the persistence format is documented.  Otherwise, the only thing you can do with the blob is to run the appropriate application, get the appropriate IPersistXXX interface and call Load, passing in the persisted bits.  This will depersist the object into its previously persisted state.

     

    In any event, this approach is definitely non-trivial and requires a solid understanding of both COM and COM Interop, but it will work for any embedded object.

     

    Sincerely,

     

    Geoff Darst

    Microsoft VSTO Team

    Thursday, September 06, 2007 5:08 PM
    Answerer

All replies

  • Hi

    Your explanations are very helpful, Geoff.
    Thanks for the detailed information.

    I have the same problem as Ayyanar.
    I need to extract various embedded objects from Word documents.
    Is there a code example that shows how other embedded objects, such as .wav files or even "Packages", could be extracted?
    I am especially worried about the "Packages", as I was not even able to determine what they actually contain.
    I guess this could only be done using the second method you mentioned (using
    Marshal.QueryInterface on the OLEFormat.Object).


    Thanks for your help.
    Bernhard


    Wednesday, October 17, 2007 1:35 PM
  • For Geoff's answer, see the following thread:

    http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=2281409&SiteID=1
    Sunday, October 21, 2007 9:27 AM
  • Hi Geoff,

    To deal with an embedded OLE object I am using the same technique you propose here (i.e. directly casting the embedded workbook to Excel.Workbook), and everything works fine as long as no instance of Excel is already running for some other reason. However, if there is an already-running Excel instance, my code (in my case Worksheet.Select) fails with 0x800A03EC.

    What could be the problem?

    cheers!
    Monday, December 03, 2007 7:32 AM
  • Hi,

    Is there any way to get the underlying interface without actually activating the object?
    I have a series of OLE objects on an Excel sheet that I need to process, and activating them takes a lot of time.

    regards

    Wednesday, December 05, 2007 1:00 PM
  • If you are still interested in getting the original file out of an embedded OLE object created with the "Packager", such as an arbitrary executable or any other file you care to mention, then drop me a line at mailscanner@ecs.soton.ac.uk and I will tell you how to do it.

    I analysed the file format by hand, it's really very simple. Just a few lines of code in any language of your choice, and very fast (no fancy Windows API calls or anything like that needed).

    Let me know if you want any help. I wrote it so that MailScanner (www.mailscanner.info) could extract embedded files from within Microsoft Office documents and subject them to all the same tests that every other file in an email message has to pass.

    --
    Jules
    mailscanner@ecs.soton.ac.uk

    Monday, April 07, 2008 9:10 PM
  • I have the same problem.

    Have you figured it out? How can package objects be extracted from Word?

    Please share your code snippet.

    Please reply as soon as possible.

     

     

    Thursday, August 19, 2010 6:47 AM
  • Hello Abhimanyu,

    I understand that this is an older post.

    There is a sample available that demonstrates how to extract the embedded files from Office 2007 format files (.docx, .xlsx, etc.). This sample does not activate the OLE objects; it just extracts the bits from the oleObject.bin file.

    You can download the sample from the following location:

    http://code.msdn.microsoft.com/CSOfficeDocumentFileExtract-e5afce86
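
    For readers who cannot reach the sample, the general idea can be sketched as follows. This is not the linked sample's code; it is a minimal illustration that relies only on the fact that Office 2007 format files are ZIP packages and that embedded objects are stored as parts under an "embeddings" folder. The class and method names (EmbeddedPartExtractor, ExtractEmbeddings) are hypothetical, and the sketch assumes .NET 4.5 or later with a reference to System.IO.Compression.FileSystem.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class EmbeddedPartExtractor
{
    // Copies every part stored under an "embeddings" folder of an Open XML
    // package (.docx, .xlsx, .pptx) into outputDir without starting Office.
    // Returns the number of parts written.
    public static int ExtractEmbeddings(string packagePath, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        int count = 0;

        using (ZipArchive archive = ZipFile.OpenRead(packagePath))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                // Word stores these parts under word/embeddings/; Excel and
                // PowerPoint use xl/embeddings/ and ppt/embeddings/.
                if (entry.FullName.IndexOf("/embeddings/", StringComparison.OrdinalIgnoreCase) < 0)
                {
                    continue;
                }

                // Skip directory entries, which have an empty Name.
                if (entry.Name.Length == 0)
                {
                    continue;
                }

                // entry.Name is the file name only, so the target stays inside outputDir.
                string target = Path.Combine(outputDir, entry.Name);
                entry.ExtractToFile(target, true); // overwrite existing files
                count++;
            }
        }

        return count;
    }
}
```

    Calling EmbeddedPartExtractor.ExtractEmbeddings(@"C:\in.docx", @"C:\out") (both paths are placeholders) writes out each embedded part. Embedded Office documents are typically stored as ordinary .docx/.xlsx/.pptx parts and open directly; other objects appear as oleObject*.bin files, which are the raw persisted OLE data and still need to be unpacked according to the owning application's format.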

    Thanks,

    Sreerenj G Nair

    Wednesday, October 03, 2012 2:07 PM