Extract all types of embedded and attachment files from word document

  • Question

  • Hi,

    I'm using VSTO interop v15 to convert my Office documents to PDF.

    I am stuck on converting a Word document that contains embedded files and attachments.

    My requirement is to extract those documents (embedded files and attachments) and convert them to PDF.

    Framework: .NET 4.5

    Language: C#

    Thanks



    • Edited by chandu537 Thursday, October 26, 2017 6:21 PM
    Thursday, October 26, 2017 6:09 PM


All replies

  • Hello,

    I think Open XML is better than Office Interop to extract embedded files.

    Please visit the similar thread:

    Extract embedded document with the word document

    You could download sample code from Extract embedded files from Office documents (CSOfficeDocumentFileExtractor)

    If it doesn't work for you, please let me know.

    Regards,

    Celeste


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com.

    Friday, October 27, 2017 2:22 AM
    Moderator
  • Hello,

    I went through the link you mentioned earlier, but it didn't help me much, so I thought of creating a new thread.

    Basically, my application is a scheduler that converts documents (Office documents and text files) to PDF.

    I have a byte array that holds a Word document (the original document), and it contains embedded objects (Office documents, packages, etc.).

    When I convert the document to PDF, the embedded objects are rendered as images and I cannot access them.

    What I'm trying to achieve is this: I need to extract all the embedded objects and place them, together with the original document as another embedded object, into a new Word document.

    The original document has to be converted to PDF before we embed it into the new Word document.

    If you have any other suggestions, please let me know. I will be happy to look.

     

    Friday, October 27, 2017 2:11 PM
  • Hello,

    What code do you use now? Do you get any error?

    Could you get all the embedded objects using the code from Extract embedded files from Office documents (CSOfficeDocumentFileExtractor)?

    Here is the code I modified to extract embedded files from a selected Word document. All extracted files are stored in D:\test; you could then use Office Interop to insert these files as embedded files into a new document.

            // Requires: using System.IO; using System.IO.Packaging; (reference WindowsBase)
            // Hypothetical method name; the original post omitted the method signature.
            private void ExtractEmbeddedFiles()
            {
                Microsoft.Office.Core.FileDialog fd =
                    Globals.ThisAddIn.Application.get_FileDialog(Microsoft.Office.Core.MsoFileDialogType.msoFileDialogOpen);
                fd.AllowMultiSelect = true;
                fd.Filters.Clear();
                fd.Filters.Add("Word Files", "*.docx;*.docm");
                fd.Filters.Add("All Files", "*.*");
    
                if (fd.Show() != 0)
                {
                    //fd.Execute(); // Open the file 
                    foreach (string fileName in fd.SelectedItems)
                    {
                        // Open the document package read-only; disposing it closes the package.
                        using (Package pkg = Package.Open(fileName, FileMode.Open, FileAccess.Read))
                        {
                            string embeddingPartString = "/word/embeddings/";
                            foreach (PackagePart pkgPart in pkg.GetParts())
                            {
                                if (pkgPart.Uri.ToString().StartsWith(embeddingPartString))
                                {
                                    string fileName1 = pkgPart.Uri.ToString().Remove(0, embeddingPartString.Length);
                                    string filePath = "D:\\test\\" + fileName1;
    
                                    // Write the stream to the file. ReadWriteStream closes both streams.
                                    System.IO.FileStream writeStream = new System.IO.FileStream(filePath, FileMode.Create, FileAccess.Write);
                                    ReadWriteStream(pkgPart.GetStream(), writeStream);
    
                                    // If the file is a structured storage file stored as a oleObjectXX.bin file
                                    // Use Ole10Native class to extract the contents inside it.
                                    if (fileName1.Contains("oleObject"))
                                    {
                                        // The Ole10Native class is defined in Ole10Native.cs file
                                        Ole10Native.ExtractFile(filePath, "D:\\test");
                                    }
                                }
                            }
                        }
                    }
                }
            }
    
            // Copies readStream to writeStream in small chunks, then closes both streams.
            private void ReadWriteStream(Stream readStream, Stream writeStream)
            {
                int Length = 256;
                Byte[] buffer = new Byte[Length];
                int bytesRead = readStream.Read(buffer, 0, Length);
                // write the required bytes
                while (bytesRead > 0)
                {
                    writeStream.Write(buffer, 0, bytesRead);
                    bytesRead = readStream.Read(buffer, 0, Length);
                }
                readStream.Close();
                writeStream.Close();
            }
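
    As a side note, on .NET Framework 4 and later the manual buffer loop in ReadWriteStream could probably be replaced by Stream.CopyTo. The following is only a minimal sketch under that assumption; the class name StreamCopy and the method SaveStream are hypothetical names and are not part of the sample.

```csharp
using System.IO;

static class StreamCopy
{
    // Hypothetical helper: copies everything from source into a new file at
    // destinationPath, then disposes both streams (as ReadWriteStream does).
    public static void SaveStream(Stream source, string destinationPath)
    {
        using (source)
        using (var target = new FileStream(destinationPath, FileMode.Create, FileAccess.Write))
        {
            source.CopyTo(target);
        }
    }
}
```

    With such a helper, the extraction loop would call StreamCopy.SaveStream(pkgPart.GetStream(), filePath) instead of creating the FileStream by hand.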

    Regards,

    Celeste



    Monday, October 30, 2017 8:10 AM
    Moderator
  • Hi,

    This looks good. When I try to extract a package or a PDF, it is saved as an oleObject .bin file. Is there any workaround for this?

    Before we extract the embedded objects, I need to check (with an if condition) whether the document contains embedded files at all. If it does, we can run the code you posted above.

    I tried performing the check using the packaging GetParts() method, but the problem is that I need to save the file to disk first and then extract the files, which means a lot of read and write I/O operations. I have a byte array. Is there any workaround?

    Tuesday, October 31, 2017 7:07 PM
  • Did you test the code sample in Extract embedded files from Office documents (CSOfficeDocumentFileExtractor), or the code in my post, with your files? Could you get all the expected embedded objects?

    I suggest sharing your code and your file here so that we can reproduce your issue. You could upload the file to OneDrive and share the link here. Please visit Share OneDrive files and folders.



    Wednesday, November 1, 2017 6:18 AM
    Moderator
  • Hi,

    I am using the same code from Extract embedded files from Office documents (CSOfficeDocumentFileExtractor) to extract the documents. The code you posted above is similar to CSOfficeDocumentFileExtractor.

    If the Word document contains charts and equations, those are also treated as embedded files and extracted (which should not happen).

    Equations are extracted as .bin files.
    Charts are extracted as Excel documents.
    This is the code I am using to check whether the document contains embedded files:

    // Requires: using System.IO; using System.IO.Packaging; (reference WindowsBase)
    public bool CheckEmbeddedFilesExists(byte[] bytearray)
            {
                string fileName = "D:\\test.docx";
                // WriteAllBytes replaces any existing file; File.OpenWrite would not truncate it.
                File.WriteAllBytes(fileName, bytearray);
    
                string extension = Path.GetExtension(fileName).ToLower();
                string embeddingPartString;
    
                if ((extension == ".docx") || (extension == ".dotx") || (extension == ".docm") || (extension == ".dotm"))
                {
                    embeddingPartString = "/word/embeddings/";
                }
                else if ((extension == ".xlsx") || (extension == ".xlsm") || (extension == ".xltx") || (extension == ".xltm"))
                {
                    // Excel packages keep their parts under /xl/, not /excel/.
                    embeddingPartString = "/xl/embeddings/";
                }
                else
                {
                    embeddingPartString = "/ppt/embeddings/";
                }
    
                try
                {
                    // Open the package file read-only; the using block closes it on every path.
                    using (Package pkg = Package.Open(fileName, FileMode.Open, FileAccess.Read))
                    {
                        // Look for any part stored under the embeddings folder.
                        foreach (PackagePart pkgPart in pkg.GetParts())
                        {
                            if (pkgPart.Uri.ToString().StartsWith(embeddingPartString))
                            {
                                return true;
                            }
                        }
                    }
                    return false;
                }
                finally
                {
                    // Remove the temporary file even when returning early.
                    File.Delete(fileName);
                }
            }



    To extract the embedded files I have used the Office Extractor DLL:

    public string[] ExtractEmbeddedFiles(byte[] bytearray)
            {
                string tempPath = "D:\\test1.docx";
                // Save the byte array to disk so the extractor can read it.
                File.WriteAllBytes(tempPath, bytearray);
    
                // Office Extractor DLL
                var extractor = new OfficeExtractor.Extractor();
                // "DestinationFolder" is a placeholder for the output folder.
                var files = extractor.SaveToFolder(tempPath, "DestinationFolder");
    
                return files;
            }



    This gives me the paths of all the extracted files in a string array, on which I can then perform my operation.
    Thursday, November 2, 2017 6:05 PM
  • Hello,

    >>If the word document contains charts and equations that too are considered as embedded files and are being Extracted(which should not happen).

    This is the expected result. Charts and equations are embedded objects.

    The code you are using checks whether the part URI starts with "/word/embeddings/", so it matches every embedded file. Please see the picture below for all the embedded files.

          // Get the embedded files names.
                foreach (PackagePart pkgPart in pkg.GetParts())
                {
                    if (pkgPart.Uri.ToString().StartsWith(embeddingPartString))
                    {
                        return true;
                    }
                }

    If you only want to extract embedded files such as Word documents, spreadsheets, and presentations, you could restrict the check to parts whose names match "/word/embeddings/*.docx", "/word/embeddings/*.xlsx", or "/word/embeddings/*.pptx". Please note that we are unable to distinguish between a chart and a normal Excel file.

    Besides, there is no need to check the file extension, because your requirement is to convert Word documents. You could hard-code embeddingPartString as "/word/embeddings/".
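
    The name filter described above could be sketched roughly as follows. This is only an illustration, assuming the document is available as a byte array and that the project references WindowsBase (System.IO.Packaging). The class EmbeddedFileFilter and its methods are hypothetical names, and the list of extensions is an assumption you would adjust. As noted, an .xlsx that backs a chart cannot be told apart from a normal workbook this way.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;
using System.Linq;

static class EmbeddedFileFilter
{
    private const string EmbeddingPrefix = "/word/embeddings/";

    // Assumed set of extensions that count as "real" embedded Office files.
    private static readonly string[] OfficeExtensions = { ".docx", ".xlsx", ".pptx" };

    // True when the part URI is under /word/embeddings/ and ends with one of the extensions.
    public static bool IsOfficeEmbedding(string partUri)
    {
        return partUri.StartsWith(EmbeddingPrefix, StringComparison.OrdinalIgnoreCase)
            && OfficeExtensions.Any(ext => partUri.EndsWith(ext, StringComparison.OrdinalIgnoreCase));
    }

    // Opens the package from memory (no temporary file) and lists the matching part URIs.
    public static List<string> ListOfficeEmbeddings(byte[] docxBytes)
    {
        using (var stream = new MemoryStream(docxBytes))
        using (Package pkg = Package.Open(stream, FileMode.Open, FileAccess.Read))
        {
            return pkg.GetParts()
                      .Select(part => part.Uri.ToString())
                      .Where(IsOfficeEmbedding)
                      .ToList();
        }
    }
}
```

    An empty list from ListOfficeEmbeddings would then mean that no part under /word/embeddings/ has one of the listed extensions, so oleObject .bin parts (equations, packages) are ignored by this particular check.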

    Regards,

    Celeste



    Friday, November 3, 2017 6:21 AM
    Moderator
  • Hi,


    Is there any other way to restrict the check so that it only detects embedded files and attachments (not chart and equation objects)? My method CheckEmbeddedFilesExists always returns true.

    My requirement is to check the document before converting it to PDF. If CheckEmbeddedFilesExists returns true, it will then extract the documents as we discussed.

    Embedded files can also be stored as a Package.

    Monday, November 6, 2017 2:02 PM
  • Hello,

    You may use the following method. It would skip charts, equations and 97-2003 format files. They are EmbeddedObjectPart.

                        using (WordprocessingDocument document = WordprocessingDocument.Open(fileName, false))
                        {
                            if( document.MainDocumentPart.GetPartsCountOfType<EmbeddedPackagePart>()>0)
                            {
                                MessageBox.Show("There is embeddedPackage like Excel spreadsheet, Word Document");
                            }
                            //// extract files
                            //foreach (EmbeddedPackagePart pkgPart in document.MainDocumentPart.GetPartsOfType<EmbeddedPackagePart>())
                            //{
                            //    if (pkgPart.Uri.ToString().StartsWith(embeddingPartString))
                            //    {
                            //        string fileName1 = pkgPart.Uri.ToString().Remove(0, embeddingPartString.Length);
                            //        // Get the stream from the part
                            //        System.IO.Stream partStream = pkgPart.GetStream();
                            //        string filePath = "D:\\test\\" + fileName1;
    
                            //        // Write the stream to the file.
                            //        System.IO.FileStream writeStream = new System.IO.FileStream(filePath, FileMode.Create, FileAccess.Write);
                            //        ReadWriteStream(pkgPart.GetStream(), writeStream);
                            //    }
                            //}
                        }

    Regards,

    Celeste



    Tuesday, November 7, 2017 8:26 AM
    Moderator
  • Hi,

    - Correct me if I'm wrong: does the code sample above use a reference to the DocumentFormat.OpenXml DLL?

    - I think we can remove the Office Extractor DLL and its code, and use the commented-out "extract files" code above to extract the files instead?

    * Just one more query: as I'm using VSTO interop v15, do I need Microsoft Office installed on the machine (a server) that runs my application (a console application)?

    Tuesday, November 7, 2017 7:22 PM
  • Hello,

    >>In the above code sample are we using Reference of DocumentFormat.OpneXML DLL ?

    Yes. WordprocessingDocument is in DocumentFormat.OpenXml. A reference to WindowsBase is also needed.

    >>I think we can remove the Office Extractor DLL and its  code with the Comment Code "Extract Files"  which is mentioned above  to extract files? 

    Yes, I agree with you. The Office Extractor DLL is used to extract packages like "object1.bin".

    >>As i'm using VSTO interop v15 do I need Microsoft office installed on the running machine(on server)  to Run my Application(Console Application)? 

    Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when Office is run in this environment.

    Please visit Considerations for server-side Automation of Office for more information.

    I suggest you create a WinForms application or console application and use the Open XML library to extract embedded files.

    Regards,

    Celeste



    Wednesday, November 8, 2017 1:34 AM
    Moderator
  • Hi,

    I tested most of my cases with the code you provided:

                        using (WordprocessingDocument document = WordprocessingDocument.Open(fileName, false))
                        {
                            if (document.MainDocumentPart.GetPartsCountOfType<EmbeddedPackagePart>() > 0)
                            {
                                MessageBox.Show("There is embeddedPackage like Excel spreadsheet, Word Document");
                            }
                        }

    - If the embedded object is a package, the check does not detect it.

    - Regarding whether Microsoft Office needs to be installed on the server:

    To extract embedded files I will be using Open XML, but my primary operation (converting documents to PDF) and most of the existing code are written with Interop (Microsoft.Office.Interop).

    Wednesday, November 8, 2017 7:03 PM
  • Hello,

    Microsoft does not recommend or support server-side Automation of Office. This means that if you use the Office.Interop DLLs on the server side, many of the errors mentioned in the link above may occur.

    Microsoft strongly recommends that developers find alternatives to Automation of Office if they need to develop server-side solutions. Because of the limitations to Office's design, changes to Office configuration are not enough to resolve all issues. 

    The Open XML library is recommended for manipulating Office files on the server side.

    Regards,

    Celeste



    Thursday, November 9, 2017 2:18 AM
    Moderator
  • Hi,

    Apologies for the delay. You were right about not using Microsoft Office on the server with VSTO Interop.

    I see you have recommended Open XML, which we are already using for extracting embedded files.

    As you know, my primary requirement is to convert Office documents (Word, Excel, PowerPoint) and text files (.txt) to PDF.

    -- Sub-requirement: if the Word document has embedded files, we extract them. (We already found the solution for this part with your help. Thanks for that.)

    But I guess I now have to rewrite my whole code with Open XML. Any suggestions for getting started with Open XML?

    For example:

    - An example of converting a Word document (byte array) to PDF using Open XML.

    - Which DLLs do I need?

    - Does Open XML need Microsoft Office on the server?

    Tuesday, November 14, 2017 6:41 AM
  • Hello,

    To get started with Open XML, please visit Welcome to the Open XML SDK 2.5 for Office.

    Unfortunately, Open XML does not support converting files to PDF, so you may need to use a third-party library. Sorry, but we cannot recommend any third-party libraries.

    You do not need to install MS Office on the server, because the Open XML library manipulates Office files directly through their packages and XML files.

    Regards,

    Celeste



    Thursday, November 16, 2017 2:19 AM
    Moderator
  • Hi,

    Thanks for your help and valuable suggestions over time.

    Before I mark this as answered, just one query regarding extracting the embedded files: the code you provided is not able to extract a package object.

                    // extract files
                    using (MemoryStream stream = new MemoryStream(fileData))
                    {
                        using (WordprocessingDocument document = WordprocessingDocument.Open(stream, false))
                        {
                            foreach (EmbeddedPackagePart pkgPart in document.MainDocumentPart.GetPartsOfType<EmbeddedPackagePart>())
                            {
                                if (pkgPart.Uri.ToString().StartsWith(embeddingPartString))
                                {
                                    string fileName1 = pkgPart.Uri.ToString().Remove(0, embeddingPartString.Length);
                                    // Get the stream from the part
                                    using (System.IO.Stream partStream = pkgPart.GetStream())
                                    using (var packagePartMemoryStream = new MemoryStream())
                                    {
                                        // The target must be a file path, not just the folder.
                                        string filePath = "D:\\test\\" + fileName1;
                                        partStream.CopyTo(packagePartMemoryStream);
                                        // Write the stream to the file.
                                        File.WriteAllBytes(filePath, packagePartMemoryStream.ToArray());
                                    }
                                }
                            }
                        }
                    }

    Thursday, November 16, 2017 11:51 AM
  • Hello,

    The package object is stored as an OLE object in the main document, so you could use document.MainDocumentPart.Document.Descendants<Ovml.OleObject>():

    using System.IO;
    using DocumentFormat.OpenXml.Packaging;
    using Ovml = DocumentFormat.OpenXml.Vml.Office;
    
                        // Folder inside the package that holds the embedded parts.
                        string embeddingPartString = "/word/embeddings/";
    
                        using (WordprocessingDocument document = WordprocessingDocument.Open(fileName, false))
                        {
                            if (document.MainDocumentPart.GetPartsCountOfType<EmbeddedPackagePart>() > 0)
                            {
                                MessageBox.Show("There is embeddedPackage like Excel Spreadsheet, Word Document");
                            }
    
                            // Walk every OLE object in the document body; packages have ProgId "Package".
                            foreach (Ovml.OleObject emb in document.MainDocumentPart.Document.Descendants<Ovml.OleObject>())
                            {
                                if (emb.ProgId == "Package")
                                {
                                    MessageBox.Show("Package");
                                    // Resolve the relationship id to the part that stores the object.
                                    OpenXmlPart part = document.MainDocumentPart.GetPartById(emb.Id);
                                    string fileName1 = part.Uri.ToString().Remove(0, embeddingPartString.Length);
                                    string filePath = "D:\\test\\" + fileName1;
    
                                    // Write the stream to the file. ReadWriteStream closes both streams.
                                    System.IO.FileStream writeStream = new System.IO.FileStream(filePath, FileMode.Create, FileAccess.Write);
                                    ReadWriteStream(part.GetStream(), writeStream);
    
                                    // If the file is a structured storage file stored as a oleObjectXX.bin file
                                    // Use Ole10Native class to extract the contents inside it.
                                    if (fileName1.Contains("oleObject"))
                                    {
                                        // The Ole10Native class is defined in Ole10Native.cs file
                                        Ole10Native.ExtractFile(filePath, "D:\\test\\");
                                    }
                                }
                            }
                        }

    Regards,

    Celeste



    • Marked as answer by chandu537 Friday, November 17, 2017 10:34 AM
    Friday, November 17, 2017 6:25 AM
    Moderator
  • Hi chandu,

    Do you still have any issue with the thread below?

    #How to insert values and OLE Object as embedded file inside a (.dotx) template

    https://social.msdn.microsoft.com/Forums/vstudio/en-US/f747c861-5446-4332-b453-a2125db9f1ad/how-to-insert-values-and-ole-object-as-embedded-file-inside-a-dotx-template?forum=vsto 

    If so, I would suggest you keep following up there.

    >>Just a query regarding  Extracting the embedded files the code you provided is not able to extract a package object.

    For this issue, you may want to post a new thread, so that the community can focus on it.

    Regards,

    Tony


    Help each other

    Friday, November 17, 2017 7:39 AM
  • Hi Tony,

    Thanks for your suggestion. I will certainly open a new thread.

    Apologies for not following up on the other thread. I'm new to this forum and still getting used to it.



    Friday, November 17, 2017 1:10 PM
  • Hi,

    I am not able to extract PDF and vsdx files.

    Can you please help me on this?

    Thursday, November 14, 2019 12:54 PM
  • Hi,

    I am not able to extract Visio and PDF files from Word.

    Can you please help me?

    Thursday, November 14, 2019 12:57 PM