Getting a string of entire word document

  • Question

  • Hi,

    Using the interop, I'm trying to compare a Word document with itself at two different stages, to check whether anything has changed.

    My plan was to compare the content as a string, but I'm having some trouble with this.

    I want the entire document, including headers, footers and formatting, so I tried using Document.Content.Get_XML(). It appears to be everything, but it is never the same, even if I have made no changes.

    When I say "have made no changes", it is not completely true. I do delete and rewrite certain tables, but the data is the EXACT same.

    Does my check fail because I rewrite the tables?

    Are there any other ways I could/should do this?

    It has to be something I can do to the document in memory. I don't want to save anything in order to compare. The whole exercise is to determine whether or not I need to save the document.


    Nicolai Søndergaard - LM Wind Power A/S
    Friday, September 16, 2011 8:53 AM

Answers

  • Hi Nicolai

    <<I want the entire document, including headers, footers and formatting, so I tried using Document.Content.Get_XML(). It appears to be everything, but it is never the same, even if I have made no changes.

    When I say "have made no changes", it is not completely true. I do delete and rewrite certain tables, but the data is the EXACT same.>>

    The XML (get_XML) property (method) is going to return a lot more than text. Depending on your settings in Word, it will include some GUID values that allow its internal "Track Changes" and "Compare" functionality to determine whether something has been deleted/added, even though it appears to be "the same".

    If you truly want to compare only the text, one possibility would be to copy the content of each document to the clipboard and use PasteSpecial to paste it as pure text.

    The other way would be to continue to use the XML, but before you make the comparison use a transformation to strip out all (or most of) the XML tags, leaving only the text.


    Cindy Meister, VSTO/Word MVP
    • Marked as answer by Nicoolai Monday, September 19, 2011 11:28 AM
    Saturday, September 17, 2011 7:31 AM
    Moderator
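
    The second suggestion above (keep the XML, but reduce it to its text before comparing) could be sketched roughly as follows. This is a minimal illustration, not the poster's code: the class and method names are hypothetical, and it assumes the string returned by get_XML() is in the Word 2003 WordprocessingML format, where text runs are `w:t` elements inside `w:p` paragraphs. Note that it discards formatting, so it only detects changes to the text.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative only.
public static class WordXmlText
{
    // Assumption: the XML uses the Word 2003 WordprocessingML namespace.
    private static readonly XNamespace W =
        "http://schemas.microsoft.com/office/word/2003/wordml";

    // Returns the text of every paragraph, one paragraph per line,
    // ignoring all other markup (formatting, revision IDs, and so on).
    public static string ExtractText(string wordXml)
    {
        XDocument doc = XDocument.Parse(wordXml);

        var paragraphs = doc.Descendants(W + "p")
            .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));

        return string.Join("\n", paragraphs);
    }
}
```

    Two snapshots can then be compared with an ordinary string comparison of the two ExtractText results.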

All replies

  • I'm not sure what software you are using to do this, but there is a VBA command (ActiveDocument.Saved) that provides a boolean value of False if changes have been made to a document since it was last saved.
    Kind Regards, Rich ... http://greatcirclelearning.com
    Friday, September 16, 2011 1:18 PM
  • <<there is a VBA command (ActiveDocument.Saved)>>

    Yes, but if I delete something and write the EXACT same text again, it will report the document as not saved. Fair enough, since it isn't, but I'm looking to determine whether any content has actually changed.
    Nicolai Søndergaard - LM Wind Power A/S
    Friday, September 16, 2011 2:42 PM
  • <<If you truly want to compare only the text, one possibility would be to copy the content of each document to the clipboard and use PasteSpecial to paste it as pure text.>>
    Is it possible to copy all the content, including the headers/footers? Whenever I have played with the content range, they are not included.

    I could of course grab the header and footer ranges afterwards, but it would be a lot easier if I could grab it all at once.


    Nicolai Søndergaard - LM Wind Power A/S
    Saturday, September 17, 2011 7:55 AM
  • Hi Nicolai

    You can test by opening such a document, Ctrl+A (SelectAll), Ctrl+C (copy), open a new document, then Paste Special as text.

    Normally, Ctrl+A will pick up the headers and footers.


    Cindy Meister, VSTO/Word MVP
    Saturday, September 17, 2011 8:24 AM
    Moderator
  • <<Normally, Ctrl+A will pick up the headers and footers.>>

    Ctrl+A does pick up headers and footers, but I can't seem to replicate this in code.

    If I try to copy WordDoc.Content (which I expect to be the ctrl+a range), I only get the text content, and not headers.


    Nicolai Søndergaard - LM Wind Power A/S
    Monday, September 19, 2011 6:39 AM
  • It looks like all the document tracking in the XML extract is stored in attributes called wsp:rsidR and wsp:rsidRDefault, so I'm going to try to strip them out of the XML, and then compare the XML again.
    Nicolai Søndergaard - LM Wind Power A/S
    Monday, September 19, 2011 6:49 AM
  • Hi Nicolai

    <<If I try to copy WordDoc.Content (which I expect to be the ctrl+a range), I only get the text content, and not headers.>>

    WordDoc.Select() should do it.


    Cindy Meister, VSTO/Word MVP
    Monday, September 19, 2011 7:35 AM
    Moderator
  • Hello,

    I have some code that removes the revision IDs, or at least that is what it is intended to do:

    // Requires: using System.Text.RegularExpressions;
    private static string GetStateOfWordOpenXML(string wordOpenXML)
    {
        if (wordOpenXML == null)
        {
            return null;
        }

        // Remove the revision-ID attributes (w:rsidR, w:rsidRDefault, w:rsidP and
        // the other w:rsid* variants), together with the whitespace before them.
        string result = Regex.Replace(wordOpenXML, @"\s+w:rsid\w*=""[^""]*""", "");

        // Remove the <w:rsids>...</w:rsids> block that lists all revision IDs.
        result = Regex.Replace(result, @"<w:rsids>.*?</w:rsids>", "", RegexOptions.Singleline);

        return result;
    }


    Hope this helps,

    Silviu.

     


    http://www.rosoftlab.net/
    Monday, September 19, 2011 10:48 AM
  • Yeah, I ended up comparing the XML, after stripping all the ID marks out of it first.

    It was a nice lesson in LINQ to XML, so it was all good :)


    Nicolai Søndergaard - LM Wind Power A/S
    Monday, September 19, 2011 11:28 AM
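
    The poster's final code was not shared, so the following is only a sketch of how the same idea might look with LINQ to XML. The class and method names are hypothetical. It assumes that every revision marker is either an attribute whose local name starts with "rsid" (for example wsp:rsidR or wsp:rsidRDefault) or an element named "rsids", whatever the namespace prefix; other volatile data, if any, would need similar treatment.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative only.
public static class WordXmlComparer
{
    // Parses the XML and removes revision-ID attributes and rsid list elements.
    public static XDocument StripRevisionIds(string wordXml)
    {
        XDocument doc = XDocument.Parse(wordXml);

        // Attributes such as wsp:rsidR, wsp:rsidRDefault, w:rsidP, ...
        doc.Descendants()
           .Attributes()
           .Where(a => a.Name.LocalName.StartsWith("rsid", StringComparison.Ordinal))
           .Remove();

        // The element that lists every revision ID used in the document.
        doc.Descendants()
           .Where(e => e.Name.LocalName == "rsids")
           .Remove();

        return doc;
    }

    // True when both snapshots are identical once revision IDs are ignored.
    public static bool HasSameContent(string beforeXml, string afterXml)
    {
        return XNode.DeepEquals(StripRevisionIds(beforeXml), StripRevisionIds(afterXml));
    }
}
```

    Calling HasSameContent with the get_XML() string captured before and after the table rewrite would then indicate whether a save is needed, under the assumptions above.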