How can I get information about all the elements in a .doc or .docx document using C++?

  • Question

  • I need to convert a Word document into a PDF-like layout file in my C++ program, so I need to get all the information about the elements in the document.


    For example, I have a Word document with two pages. The first page contains two words, “Hello word”; the second page contains a picture.

    What I want to get for the first page is: the text content, font type, font color, font size, coordinates, and other attributes.

    What I want to get for the second page is: the picture data, size, and coordinates.

    With all this information, I hope to create a layout file that looks the same as the original Word document.

    Is there any Office API that can give me the data I need?

    Looking forward to your reply.


    • Edited by John-Junior Wednesday, August 21, 2019 1:59 AM
    Wednesday, August 21, 2019 1:56 AM

All replies

  • I found the following example on Office Dev Center “C++ app automates Word (CppAutomateWord)” 

    https://code.msdn.microsoft.com/office/CppAutomateWord-28938be1

    According to this example, I can create a Word document through the Word.Application COM object’s "Documents" interface and then add a paragraph to it.

    This example is useful, but it's too simple to meet my usage requirements. 

    Is there any official documentation for the Word.Application COM object?

    I want to know what I can do and what data I can get with the Word.Application COM object.

    Looking forward to your reply.

    Wednesday, August 21, 2019 8:59 AM
  • The Word.Application object *is* Word, i.e. the desktop version of it on Windows or Mac.

    The key fact is that your C++ program would have to run on a machine that has Word installed to be able to use that object (and that generally means a client computer, not a server).

    Word has a large object model that allows you to extract information about most aspects of a document. You can find the documentation (it is basically the documentation for Word VBA, so you have to adapt it for C++) at

    https://docs.microsoft.com/en-us/office/vba/api/overview/

    The part specific to Word is at

    https://docs.microsoft.com/en-us/office/vba/api/overview/word

    It might actually be easier and more productive to
     - open Word
     - ensure that you can see the Developer tab (go to File->Options->Customize Ribbon and ensure the Developer tab is checked in the list on the right)
     - start the Visual Basic editor (use the Visual Basic icon at the left of the ribbon)
     - use View->Object Browser and select Word in the dropdown at the top to see the available objects, members etc.
     - use VBA rather than C++ to experiment with what you can and cannot do.

    The other main way to access information in Word documents is to use an API that reads the .docx file directly (i.e. you don't need the Word application to do that). The standard library provided by Microsoft is the Open XML SDK, which is a .NET library, so you would basically need to write managed code to use it. It can read the .docx format but not the old binary .doc format. However, .docx XML can be very complicated, so you would have to experiment to see whether you can reliably extract the information you need.
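    To give a feel for what the .docx XML looks like: a .docx file is a ZIP package whose main body is stored in word/document.xml, where each text run (`w:r`) carries its formatting in `w:rPr` and its text in `w:t`. The following is only a rough sketch, assuming document.xml has already been unzipped and loaded into a string. The names `Run`, `attrOf`, `textOf`, `parseRuns` and `kSampleXml` are made up for illustration. It uses naive string scanning rather than a real XML parser, does not decode entities such as `&amp;`, and ignores formatting inherited from styles, so it only reports what is set directly on a run. It also says nothing about coordinates, because the file stores no page positions; those only exist once the document has been laid out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One text run with the formatting that is set directly on it.
struct Run {
    std::string text;
    std::string font;    // w:rFonts/@w:ascii, empty if not set on the run
    std::string color;   // w:color/@w:val (hex RGB), empty if not set
    double sizePt = 0;   // w:sz/@w:val is in half-points; 0 if not set
};

// Value of attribute `attr` on the first "<tag " element in `xml`, or "".
static std::string attrOf(const std::string& xml, const std::string& tag,
                          const std::string& attr) {
    const std::size_t t = xml.find("<" + tag + " ");
    if (t == std::string::npos) return "";
    const std::size_t end = xml.find('>', t);
    std::size_t a = xml.find(attr + "=\"", t);
    if (a == std::string::npos || a > end) return "";
    a += attr.size() + 2;
    return xml.substr(a, xml.find('"', a) - a);
}

// Concatenated content of all <w:t> elements inside one run.
static std::string textOf(const std::string& run) {
    std::string out;
    std::size_t pos = 0;
    while ((pos = run.find("<w:t", pos)) != std::string::npos) {
        const std::size_t gt = run.find('>', pos);
        if (gt == std::string::npos) break;
        const char next = run[pos + 4];
        // Accept <w:t> and <w:t ...>, skip <w:tab/>, <w:tbl> and <w:t/>.
        if ((next == '>' || next == ' ') && run[gt - 1] != '/') {
            const std::size_t close = run.find("</w:t>", gt);
            if (close == std::string::npos) break;
            out += run.substr(gt + 1, close - gt - 1);
            pos = close;
        } else {
            pos = gt;
        }
    }
    return out;
}

// Splits a document.xml fragment into runs and reads their direct formatting.
std::vector<Run> parseRuns(const std::string& xml) {
    std::vector<Run> runs;
    std::size_t pos = 0;
    while ((pos = xml.find("<w:r", pos)) != std::string::npos) {
        const char next = pos + 4 < xml.size() ? xml[pos + 4] : '\0';
        if (next != '>' && next != ' ') { pos += 4; continue; }  // <w:rPr> etc.
        const std::size_t close = xml.find("</w:r>", pos);
        if (close == std::string::npos) break;
        const std::string body = xml.substr(pos, close - pos);
        Run r;
        r.text = textOf(body);
        r.font = attrOf(body, "w:rFonts", "w:ascii");
        r.color = attrOf(body, "w:color", "w:val");
        const std::string sz = attrOf(body, "w:sz", "w:val");
        if (!sz.empty()) r.sizePt = std::stod(sz) / 2.0;
        runs.push_back(r);
        pos = close + 6;
    }
    return runs;
}

// Hand-written sample fragment (not taken from a real document).
const std::string kSampleXml =
    "<w:p><w:r><w:rPr><w:rFonts w:ascii=\"Calibri\" w:hAnsi=\"Calibri\"/>"
    "<w:color w:val=\"FF0000\"/><w:sz w:val=\"24\"/><w:szCs w:val=\"24\"/>"
    "</w:rPr><w:t xml:space=\"preserve\">Hello </w:t></w:r>"
    "<w:r><w:t>world</w:t></w:r></w:p>";
```

    Calling `parseRuns(kSampleXml)` yields two runs: the first with its font, color and size filled in, the second with text only, because its formatting would come from the paragraph or document defaults. Resolving those defaults is a large part of what makes the format complicated in practice.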

    If the XML proves too hard to work with directly, you may be better off finding another library that makes it easier to access all the information you need.

    If you are trying to convert .docx to PDF, you can do it via the Word object model by opening the .docx and saving it as PDF. But if you cannot use the object model, you might be better off finding some open source code that does the exact conversion you need and re-using that.
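    As a rough illustration of the object-model route, here is a minimal late-bound COM sketch. It assumes Windows with desktop Word installed, and linking against ole32 and oleaut32. `pdfPathFor`, `invoke` and `convertWithWord` are hypothetical helper names. `Documents.Open`, `Document.SaveAs2`, `Document.Close` and `Application.Quit` are members of the Word object model, and 17 is the value of `wdFormatPDF`. Error handling is deliberately thin, and the Windows-only part is guarded so the file still compiles elsewhere.

```cpp
#include <cstddef>
#include <string>

// Portable helper: derive the output path by swapping the extension for ".pdf".
std::wstring pdfPathFor(const std::wstring& docPath) {
    const std::size_t dot = docPath.find_last_of(L'.');
    const std::size_t sep = docPath.find_last_of(L"\\/");
    const bool hasExt = dot != std::wstring::npos &&
                        (sep == std::wstring::npos || dot > sep);
    return (hasExt ? docPath.substr(0, dot) : docPath) + L".pdf";
}

#ifdef _WIN32
#include <windows.h>
#include <oleauto.h>

// Late-bound call of member `name` on `obj`.
// `args` must be in REVERSE order, as IDispatch::Invoke expects.
static HRESULT invoke(IDispatch* obj, WORD kind, const wchar_t* name,
                      VARIANT* result, VARIANT* args, UINT argc) {
    DISPID id;
    LPOLESTR n = const_cast<LPOLESTR>(name);
    HRESULT hr = obj->GetIDsOfNames(IID_NULL, &n, 1, LOCALE_USER_DEFAULT, &id);
    if (FAILED(hr)) return hr;
    DISPPARAMS dp = { args, nullptr, argc, 0 };
    return obj->Invoke(id, IID_NULL, LOCALE_USER_DEFAULT, kind, &dp,
                       result, nullptr, nullptr);
}

// Opens `docPath` (use a full path) in Word and saves a PDF next to it.
bool convertWithWord(const std::wstring& docPath) {
    if (FAILED(CoInitializeEx(nullptr, COINIT_APARTMENTTHREADED))) return false;
    bool ok = false;
    CLSID clsid;
    IDispatch* app = nullptr;
    if (SUCCEEDED(CLSIDFromProgID(L"Word.Application", &clsid)) &&
        SUCCEEDED(CoCreateInstance(clsid, nullptr, CLSCTX_LOCAL_SERVER,
                                   IID_IDispatch, (void**)&app))) {
        VARIANT docs, doc, path;
        VariantInit(&docs); VariantInit(&doc); VariantInit(&path);
        if (SUCCEEDED(invoke(app, DISPATCH_PROPERTYGET, L"Documents",
                             &docs, nullptr, 0))) {
            path.vt = VT_BSTR;
            path.bstrVal = SysAllocString(docPath.c_str());
            if (SUCCEEDED(invoke(docs.pdispVal, DISPATCH_METHOD, L"Open",
                                 &doc, &path, 1))) {
                VARIANT a[2];                 // reversed: FileFormat, FileName
                VariantInit(&a[0]); VariantInit(&a[1]);
                a[0].vt = VT_I4;   a[0].lVal = 17;   // wdFormatPDF
                a[1].vt = VT_BSTR;
                a[1].bstrVal = SysAllocString(pdfPathFor(docPath).c_str());
                ok = SUCCEEDED(invoke(doc.pdispVal, DISPATCH_METHOD, L"SaveAs2",
                                      nullptr, a, 2));
                VariantClear(&a[1]);
                invoke(doc.pdispVal, DISPATCH_METHOD, L"Close", nullptr, nullptr, 0);
            }
        }
        VariantClear(&path); VariantClear(&doc); VariantClear(&docs);
        invoke(app, DISPATCH_METHOD, L"Quit", nullptr, nullptr, 0);
        app->Release();
    }
    CoUninitialize();
    return ok;
}
#endif
```

    A call such as `convertWithWord(L"C:\\docs\\report.docx")` would be expected to produce C:\docs\report.pdf. Late binding through IDispatch is used here only because it needs no type-library import; importing the Word type library with `#import` is another option and gives typed wrappers instead.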

    Peter Jamieson

    Wednesday, August 21, 2019 11:41 AM