Would Open XML SDK be the right choice for this scenario?

  • Question

    I’m looking for some help in choosing the right architecture, tools and technology to meet this need.

    In an engineering project, we have a list of requirements (the so-called Compliance Matrix), which is basically a list (Excel, SQL database, SharePoint list, it doesn’t matter) with a unique ID and a description, e.g.

    ReqID   ReqDescription
    1       Req1Desc…
    2       Req2Desc…

    When we are writing the technical documentation of the project (dozens of docx files for analysis, design, test, procedures, etc.) we should indicate which ReqID each paragraph satisfies. Let’s say, for example, that at the end of the paragraph I insert a “placeholder” like [%Req234%].

    At the end of document writing, there should be “something” that scans the docx files, searches for all placeholders and fills the Compliance Matrix with the details of the doc (basically filename and chapter).

    E.g.

    ReqID   ReqDescription   Satisfied by
    234     Req234Desc…      foo.docx, chap. 4.1.3

    My questions are:

    - Would Open XML SDK be the right choice to develop the “something”? Or should I look elsewhere (VSTO? Macro? PIA? VBA?). Remember that I have to fill something external to the document (a db, a SharePoint list or a new Word/Excel file), so the solution should not be “sandboxed” to the doc I’m writing; it must be able to access external resources (with the right credentials).

    - What is the best form for the placeholder? Normal text with strange characters (e.g. [%  %]), bookmarks, field codes, something else?

    - What is the best deployment scenario (document level, template, add-in), considering that we already have a lot of finished documents in which we only have to insert the ReqIDs?

    Thanks in advance

    Sandro

    Tuesday, March 19, 2013 2:00 PM

All replies

  • Hi Sandro

    I think Open XML might be the "right" tool for this. From the sound of it, the "something" doesn't need to interact with users in the Word application UI? (In other words, you're not looking to extend Word's interface in order to interact with the user; your main interest in this question is "mining" the finished document content.)

    An Open XML solution could be a Console App, Windows Form or anything else (including non-Microsoft programming languages) able to work with Zip-file packages and XML. Anything the programming environment of your choice can use to communicate with external "others" (db, sharepoint, whatever) should integrate just fine with the Open XML part.
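To illustrate the "Zip-file package plus XML" point, here is a minimal sketch in Python using only the standard library. It assumes the main document part sits at word/document.xml (the usual location, though strictly the package relationships decide this) and scans for the [%Req234%] style of placeholder from the question. The function names and the in-memory sample package are hypothetical and exist only so the sketch can run on its own.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PLACEHOLDER = re.compile(r"\[%Req(\d+)%\]")  # placeholder form taken from the question


def find_placeholders(docx_file):
    """Return the requirement IDs found in the main document part."""
    with zipfile.ZipFile(docx_file) as package:
        # Assumption: the main part is at the conventional path.
        root = ET.fromstring(package.read("word/document.xml"))
    req_ids = []
    for paragraph in root.iter(W + "p"):
        # Word may split one placeholder across several runs, so join the
        # text of the whole paragraph before matching.
        text = "".join(t.text or "" for t in paragraph.iter(W + "t"))
        req_ids.extend(int(number) for number in PLACEHOLDER.findall(text))
    return req_ids


def make_sample_docx():
    """Build a tiny in-memory package, just enough to exercise the scan."""
    body = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        '<w:p><w:r><w:t>Lorem ipsum [%Req</w:t></w:r>'
        '<w:r><w:t>234%]</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


print(find_placeholders(make_sample_docx()))  # prints [234]
```

The sample deliberately splits the placeholder over two runs, since plain-text markers are prone to this after editing; that fragility is one practical argument for a structured marker instead.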

    The Open XML SDK is a DLL that "wraps" the Packaging and XML used to work with an Office file in a more "programmer friendly" way and is generally used with VB.NET or C#. While you still need to understand how a Word document is "packaged", you don't need to know all the details about the underlying XML.

    Generally, it's recommended to use content controls as "markers" in a document. There are other possibilities, such as bookmarks or "strange characters", as you mention. But content controls would be simplest. If you want to get really fancy, you could provide a UI for inserting these which would link them to a Custom XML Part - that's an XML file inside the *.docx Zip package. The XML can be anything you want, meaning it could conform to a schema that makes it simpler to transfer the data into the database. That would make the processing side for the document more efficient, but you can also work without it.
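As a rough sketch of what reading such markers could look like, the Python below pulls the tag, title and text of each content control (a w:sdt element) out of the document XML. Storing the ReqID in the control's tag is an assumption made for illustration; the thread does not settle whether tag, title or content is the best place. The sample XML and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical fragment: one inline content control tagged with a ReqID.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body><w:p>'
    '<w:r><w:t>Lorem ipsum </w:t></w:r>'
    '<w:sdt><w:sdtPr><w:alias w:val="Requirement"/>'
    '<w:tag w:val="Req234"/></w:sdtPr>'
    '<w:sdtContent><w:r><w:t>Req 234</w:t></w:r></w:sdtContent></w:sdt>'
    '</w:p></w:body></w:document>'
)


def content_control_tags(document_xml):
    """Return (tag, title, text) for every content control that has a tag."""
    root = ET.fromstring(document_xml)
    found = []
    for sdt in root.iter(W + "sdt"):
        tag = sdt.find(W + "sdtPr/" + W + "tag")
        if tag is None:
            continue  # untagged controls are not requirement markers here
        alias = sdt.find(W + "sdtPr/" + W + "alias")  # the control's title
        title = alias.get(W + "val") if alias is not None else None
        text = "".join(t.text or "" for t in sdt.iter(W + "t"))
        found.append((tag.get(W + "val"), title, text))
    return found


print(content_control_tags(SAMPLE))
# prints [('Req234', 'Requirement', 'Req 234')]
```

Unlike a text placeholder, the tag is a single attribute value, so it cannot be split across runs by later edits to the paragraph.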

    In order to provide a UI for the user - for inserting the content controls, or providing lists of ReqIDs - you can use either VSTO or VBA. I tend to think VSTO, though, so you can put the tools in a TaskPane. This could be a VSTO template (meaning the users would create new documents from it) or an add-in (meaning the users could use the tools with any document).


    Cindy Meister, VSTO/Word MVP, my blog

    Tuesday, March 19, 2013 4:38 PM
    Moderator
  • Cindy,

    first of all thanks a lot for your exhaustive answer.

    You're right, the extraction phase doesn't require any user interaction, but I was also thinking of giving the user a Custom Task Pane for inserting the Content Controls (where do you advise putting my ID: name, tag, content?)

    Now the decision point would be this:

    Which of the two, the Open XML SDK or the Word object model (VSTO), can tell me which heading section a Content Control belongs to (or at least the page number where it is located)?

    Eg.

    4. Title

    Lorem ipsum

    4.1 Subtitle

    Lorem ipsum

    [Req234]

    5. Title

    ...

    I have to give as much detail as possible about where the Content Control is located (not only the filename but also "Chap 4.1", or Page 23), so that the reviewer can jump immediately to the point where the ReqID is.

    Thanks again

    Sandro

    Wednesday, March 20, 2013 9:16 AM
  • Hi Sandro

    Both Open XML and the Word object model can give you the heading information concerning where to find the ReqId. In order to get a page number the object model is the better bet.
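One way the heading lookup could work on the Open XML side is sketched below in Python: walk the document in order, remember the most recent heading paragraph, and pair each tagged content control with it. This assumes headings use the default English style IDs ("Heading1", "Heading2", ...), which localized or custom templates may not, and that the ReqID is stored in the control's tag. It reports the heading text only; automatic numbers such as "4.1" are computed by Word from the numbering definitions and are not stored in the paragraph text. The sample mirrors the outline from the question and all names are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    '<w:r><w:t>Title</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Lorem ipsum</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>'
    '<w:r><w:t>Subtitle</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Lorem ipsum </w:t></w:r>'
    '<w:sdt><w:sdtPr><w:tag w:val="Req234"/></w:sdtPr>'
    '<w:sdtContent><w:r><w:t>Req234</w:t></w:r></w:sdtContent></w:sdt></w:p>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    '<w:r><w:t>Title</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def locate_requirements(document_xml):
    """Pair each tagged content control with the heading it falls under."""
    root = ET.fromstring(document_xml)
    current_heading = None
    hits = []
    # iter() yields elements in document order, so block-level and inline
    # content controls are both seen after the heading that precedes them.
    for element in root.iter():
        if element.tag == W + "p":
            style = element.find(W + "pPr/" + W + "pStyle")
            if style is not None and style.get(W + "val", "").startswith("Heading"):
                current_heading = "".join(
                    t.text or "" for t in element.iter(W + "t")
                )
        elif element.tag == W + "sdt":
            tag = element.find(W + "sdtPr/" + W + "tag")
            if tag is not None:
                hits.append((tag.get(W + "val"), current_heading))
    return hits


print(locate_requirements(SAMPLE))  # prints [('Req234', 'Subtitle')]
```

Page numbers are not available this way because pagination is a layout result, which is why the object model is the better bet for them.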

    If you want to enable someone to jump directly to these, it might make sense to associate bookmarks with the content controls so you can (hyper)link to them in the list?

    Edit: there's no reason your application couldn't consist of multiple parts (Open XML and VSTO) if you come to the conclusion that you need both.


    Cindy Meister, VSTO/Word MVP, my blog


    Wednesday, March 20, 2013 10:57 AM
    Moderator
  • Thanks again.

    I think Open XML makes more sense if one day I port the code of the "parser" to an event or a workflow in SharePoint (where both the docs and the matrix are stored).

    The bookmark is a great idea, I didn't know you can build a direct link (in what form? like html? #bookmarkName at the end of url?)

    Thanks again

    Have a nice day

    Sandro

    Wednesday, March 20, 2013 1:37 PM
  • Hi Sandro

    <<The bookmark is a great idea, I didn't know you can build a direct link (in what form? like html? #bookmarkName at the end of url?)>>

    The best thing would be to test it yourself :-) Create a "small" document in Word and set a bookmark (Insert/Bookmark, give it a name, Add).

    Move elsewhere in the document and Insert/Hyperlink. As category, choose "Places in this document". Select the bookmark.

    Press Alt+F9 in order to see the field codes. That will show you how Word handles a hyperlink to a bookmark.

    Additional info: Note that a cross reference can also be a hyperlink (Word creates a bookmark for cross references). That's why "Places in this document" also lists Headings. This is also the way links in a TOC function.
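For the compliance matrix side, a hedged sketch in Python of how link targets might be collected: it lists the bookmarks in the document XML (w:bookmarkStart elements) and builds a "filename#bookmarkName" string for each. That target form is an assumption to confirm with the Alt+F9 test described above; inside a single document the field code is expected to look like HYPERLINK \l "Req234", which Word normally stores as a w:hyperlink element with a w:anchor attribute. Bookmark names starting with an underscore (for example _GoBack, _Toc..., _Ref...) are ones Word creates itself, so they are skipped. The sample XML and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:bookmarkStart w:id="0" w:name="Req234"/>'
    '<w:r><w:t>Lorem ipsum</w:t></w:r><w:bookmarkEnd w:id="0"/></w:p>'
    '<w:p><w:hyperlink w:anchor="Req234">'
    '<w:r><w:t>jump</w:t></w:r></w:hyperlink></w:p>'
    '<w:p><w:bookmarkStart w:id="1" w:name="_GoBack"/>'
    '<w:bookmarkEnd w:id="1"/></w:p>'
    '</w:body></w:document>'
)


def bookmark_links(document_xml, filename):
    """Return one link target per user-defined bookmark in the document."""
    root = ET.fromstring(document_xml)
    links = []
    for start in root.iter(W + "bookmarkStart"):
        name = start.get(W + "name")
        if name and not name.startswith("_"):  # skip Word's own bookmarks
            # Assumed target form; verify against Word's own field codes.
            links.append(f"{filename}#{name}")
    return links


print(bookmark_links(SAMPLE, "foo.docx"))  # prints ['foo.docx#Req234']
```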


    Cindy Meister, VSTO/Word MVP, my blog

    Wednesday, March 27, 2013 4:46 PM
    Moderator