Determining location of style changes

  • Question

  • I'm writing an add-in that will take an open Word document and write out an HTML-like file by walking through the document model. To do this I need to break the document into spans of identical formatting attributes so I can write out something like <p>Test <span format="font-weight:bold">bold</span> text</p>, but I haven't found a good way to do that. I've tried walking the document character-by-character and comparing the previous range's format with the one I'm on now. That works, but is far too slow to be practical.

    Is there a better way of finding out which runs of text have identical formatting?

    Tuesday, August 2, 2011 7:17 PM
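
    The grouping step described in the question can be sketched as follows. This is a minimal illustration, not a Word interop solution: it assumes the per-character formatting has already been read into plain (character, format) pairs, and the names group_runs, runs_to_html, and tag are hypothetical. It does not make the interop calls themselves any faster.

```python
from html import escape
from itertools import groupby


def group_runs(chars):
    """Collapse consecutive characters that share a format into (text, format) runs.

    `chars` is an iterable of (character, format) pairs. The format is assumed
    to be a plain string such as "font-weight:bold", or "" for no formatting.
    """
    return [
        ("".join(ch for ch, _ in group), fmt)
        for fmt, group in groupby(chars, key=lambda pair: pair[1])
    ]


def runs_to_html(runs):
    """Write the runs as one paragraph, wrapping formatted runs in a <span>."""
    parts = []
    for text, fmt in runs:
        if fmt:
            parts.append('<span format="%s">%s</span>' % (escape(fmt), escape(text)))
        else:
            parts.append(escape(text))
    return "<p>" + "".join(parts) + "</p>"


def tag(text, fmt=""):
    """Helper for the demo: attach the same format to every character of text."""
    return [(ch, fmt) for ch in text]


# Hypothetical input standing in for what a character-by-character walk yields.
chars = tag("Test ") + tag("bold", "font-weight:bold") + tag(" text")
runs = group_runs(chars)
print(runs_to_html(runs))
```

    With the sample input above, this prints the same paragraph used as the example in the question.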

Answers

  • >>I'm targeting the 97-2003 .doc format using the COM interop API.


    Hi buzaan

    It would have been better if you were targeting the newer file formats, as you could then have used Open XML. You could, possibly, parse the old binary file formats. Their specifications are now publicly available and there are MSDN forums to support them.

    But why not export the Word file to HTML (the full "round-trip" file format) or RTF? Then you could parse that without needing to work with the (slow) interop APIs.


    Cindy Meister, VSTO/Word MVP
    • Marked as answer by buzaan Monday, August 8, 2011 10:33 PM
    Monday, August 8, 2011 9:22 AM
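
    As a rough sketch of the "export, then parse" approach suggested above, the following uses Python's standard html.parser to collect runs of identical formatting from an HTML fragment. The sample markup is a simplified assumption; the HTML that Word actually exports is likely far more verbose and would need extra handling. The names RunCollector, FORMAT_TAGS, and collect_runs are hypothetical.

```python
from html.parser import HTMLParser


class RunCollector(HTMLParser):
    """Collect (text, format) runs from an HTML fragment."""

    # Assumption: only these tags and <span style="..."> count as formatting.
    FORMAT_TAGS = {
        "b": "font-weight:bold",
        "i": "font-style:italic",
        "u": "text-decoration:underline",
    }

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # open formatting tags as (tag, style) pairs
        self.runs = []   # collected (text, format) runs

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
        elif tag in self.FORMAT_TAGS:
            style = self.FORMAT_TAGS[tag]
        else:
            return  # structural tags such as <p> are ignored in this sketch
        self.stack.append((tag, style))

    def handle_endtag(self, tag):
        if tag != "span" and tag not in self.FORMAT_TAGS:
            return
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        fmt = ";".join(style for _, style in self.stack if style)
        if self.runs and self.runs[-1][1] == fmt:
            # Same formatting as the previous run: extend it.
            self.runs[-1] = (self.runs[-1][0] + data, fmt)
        else:
            self.runs.append((data, fmt))


def collect_runs(html_text):
    collector = RunCollector()
    collector.feed(html_text)
    collector.close()
    return collector.runs


sample = '<p>Test <b>bold</b> text <span style="color:red">and <i>red italic</i></span></p>'
for text, fmt in collect_runs(sample):
    print(repr(text), fmt or "(none)")
```

    Because the parser works on the exported file, no per-character interop calls are needed. The trade-off is that the run boundaries are whatever the exporter chose to emit.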

All replies

  • Which version of Word (document file format) are we talking about? 97-2003 *.doc or 2007/2010 *.docx?
    Cindy Meister, VSTO/Word MVP
    Wednesday, August 3, 2011 11:40 AM
  • I'm targeting the 97-2003 .doc format using the COM interop API.

    Friday, August 5, 2011 7:40 PM
  • Dear Buzaan,

    >>That works, but is far too slow to be practical.

    >>Is there a better way of finding out which runs of text have identical formatting?

    I think that is the expected behavior. To improve performance, you could use a StringBuilder to compare the contents of the two ranges.

    Another option is Open XML, but comparing the formatting properties applied to each paragraph is also very slow that way.

    Hope this helps.

    Regards,


    Be happy.
    Monday, August 8, 2011 8:37 AM
  • Thanks for your feedback, everyone. Unfortunately, many people using the add-in are on older versions of Word, but I'll look into using Open XML in the future. I already have a working (more or less) solution using HTML export, but was hoping that working with the document object directly would be possible, as that would be more convenient. It seems that sticking with transforming the HTML is best for the time being.
    Monday, August 8, 2011 10:32 PM