How can I use the Open XML Format SDK to retrieve the whole content of a Word, Excel, or PowerPoint document?

  • Question

  • Hello,
    I'm very interested in this. I think it is possible to retrieve the whole content of a Word document, an Excel workbook, or a PowerPoint presentation, but I cannot find out how to do it. Can you please give me some advice? Thank you so much.

    Su, Le
    Friday, March 12, 2010 1:55 AM

All replies

  • Hi Asusu,

    Thanks for your question.

    To investigate your scenario and find a good solution, I first need to clarify what you mean by "the whole content". Do you mean the text of a Word document, for example? If not, could you explain what you are after? I'm also not quite sure how to define the whole content of an Excel or PowerPoint file.

    Thanks,

    Lu
    Friday, March 12, 2010 7:35 AM
  • Hi,

    Thank you for your consideration.

    I mean the text of a Word document. How can I retrieve the text of a Word document along with its formatting?

    Friday, March 12, 2010 8:57 AM
  • Hi Asusu,

    Thanks for your description.

    To retrieve the text of a Word document, you need to be clear about the file format. For example, the text content usually sits under the Paragraph\Run\Text elements. To retrieve it, you can use the following code, where wordDoc is an open WordprocessingDocument:

    var paragraphsInDoc = wordDoc.MainDocumentPart.Document.Descendants<Paragraph>();

    This returns all the Paragraph elements under the ‘Document’ element, including the ones that are not direct children of it. You can then get the text content of each paragraph through Paragraph.InnerText, or retrieve the Text elements and read each one through its Text.Text property.

    Note that this covers plain text only. If the document contains tables or images, you may need to handle some more complex scenarios.

    Hope this helps. If you have any questions, please let me know.

    Thanks,

    Lu
    • Marked as answer by Asusu Friday, March 12, 2010 9:52 AM
    Friday, March 12, 2010 9:36 AM
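
The second option mentioned in the reply above (reading the Text elements rather than Paragraph.InnerText) can be sketched as follows. This is a minimal sketch, not code from the thread: the class name DocxText and its two methods are hypothetical names chosen for illustration, and it assumes the Open XML SDK assembly (DocumentFormat.OpenXml) is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper class, introduced only for this illustration.
public static class DocxText
{
    // Returns the string held by every Text (w:t) element below root, in document order.
    public static IEnumerable<string> GetTextRuns(OpenXmlElement root)
    {
        return root.Descendants<Text>().Select(t => t.Text);
    }

    // Opens a .docx file read-only and prints each Text element on its own line.
    public static void PrintTextRuns(string path)
    {
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(path, false))
        {
            foreach (string s in GetTextRuns(wordDoc.MainDocumentPart.Document.Body))
            {
                Console.WriteLine(s);
            }
        }
    }
}
```

Calling DocxText.PrintTextRuns with the path of a .docx file prints one line per Text element. Because Descendants walks the whole tree, Text elements inside table cells are returned as well.
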
  • I'll try this and reply with the result as soon as possible.
    Friday, March 12, 2010 9:53 AM
  • Hi,

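    That works, thank you so much.

    Here is my code:

    using System;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    // Open the document read-only (second argument: isEditable = false).
    using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(@"D:\Assets\sample.docx", false))
    {
        var ps = wordDoc.MainDocumentPart.Document.Descendants<Paragraph>();
        // Print the concatenated text of each paragraph.
        foreach (Paragraph p in ps)
        {
            string someText = p.InnerText;
            string result = string.Format("{0} ", someText);
            Console.WriteLine(result);
        }
        Console.ReadLine();
    }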

    Saturday, March 13, 2010 2:12 AM
  • Hi

    Won't this work?

    wordDoc.MainDocumentPart.Document.InnerText

    Saturday, April 17, 2010 3:33 PM
  • Hi Ramakrishna Pillai,

    Thanks for your question.

    Using InnerText to get the content of the document is usually not recommended. InnerText returns the concatenated values of the node and all its children, and sometimes that is not the content you see in the document. For example, if you insert a field into a Word document, the XML content of a Run will look like this:

          <w:r>
            <w:instrText xml:space="preserve"> DATE  \@ "MMMM d, yyyy"  \* MERGEFORMAT </w:instrText>
          </w:r>

    InnerText returns this instruction text, whereas Word displays the actual date in its place. So you'd better retrieve all the Text elements instead of using InnerText.

    Hope this helps. If you have any questions, please let me know.

    Thanks,

    Lu

    Thursday, April 22, 2010 5:47 AM
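
The difference described in the reply above can be reproduced in memory, without a file. The following is a minimal sketch under stated assumptions: the class FieldTextDemo and its methods are hypothetical names, the field is written in the begin/instruction/separate/result/end form, and the result string "March 12, 2010" is an invented stand-in for whatever Word cached when it last updated the field. FieldCode is the SDK class for the w:instrText element shown in the XML.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical demo class, introduced only for this illustration.
public static class FieldTextDemo
{
    // Builds a paragraph reading "Today is " followed by a DATE field.
    // The cached field result below is an illustrative value, not real data.
    public static Paragraph BuildParagraphWithDateField()
    {
        return new Paragraph(
            new Run(new Text("Today is ") { Space = SpaceProcessingModeValues.Preserve }),
            new Run(new FieldChar { FieldCharType = FieldCharValues.Begin }),
            new Run(new FieldCode(" DATE \\@ \"MMMM d, yyyy\" ") { Space = SpaceProcessingModeValues.Preserve }),
            new Run(new FieldChar { FieldCharType = FieldCharValues.Separate }),
            new Run(new Text("March 12, 2010")),
            new Run(new FieldChar { FieldCharType = FieldCharValues.End }));
    }

    // Concatenates only the Text (w:t) elements, so field instructions are skipped.
    public static string VisibleText(OpenXmlElement element)
    {
        return string.Concat(element.Descendants<Text>().Select(t => t.Text));
    }
}
```

For the paragraph built here, InnerText includes the DATE instruction string, while VisibleText returns only the ordinary text plus the cached field result. Note that the cached result is whatever was stored when the field was last updated, so it may be stale compared with what Word shows after recalculating.
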
  • Thanks for the reply Lu.

    Yes, we ran into this issue recently when the Word document my code was reading contained a merge field. When I used wordDoc.MainDocumentPart.Document.InnerText, the result included unwanted text such as MERGEFIELD.

    So I switched to looping through the list of paragraphs and reading their content.

    Cheers,
    Rama

    Tuesday, July 13, 2010 3:38 PM
  • Dear Rama,

    Could you post the code that loops through the list of paragraphs and reads the content?

    Thanks.

    Bob

    Wednesday, December 15, 2010 10:04 AM
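
Rama's code was not posted in the thread. One possible way to loop through the paragraphs while leaving out field instructions is sketched below, combining the paragraph loop from the earlier reply with the advice to read Text elements. The class ParagraphReader and its methods are hypothetical names used for illustration, and the sketch assumes the Open XML SDK assembly is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper class, introduced only for this illustration.
public static class ParagraphReader
{
    // Returns one string per Paragraph below root, built only from Text (w:t) elements.
    public static List<string> ReadParagraphs(OpenXmlElement root)
    {
        var lines = new List<string>();
        foreach (Paragraph p in root.Descendants<Paragraph>())
        {
            // Field instructions (w:instrText) are not Text elements, so they are left out.
            lines.Add(string.Concat(p.Descendants<Text>().Select(t => t.Text)));
        }
        return lines;
    }

    // Convenience overload: opens the file read-only and reads the document body.
    public static List<string> ReadParagraphs(string path)
    {
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(path, false))
        {
            return ReadParagraphs(wordDoc.MainDocumentPart.Document.Body);
        }
    }
}
```

Paragraphs inside table cells are included because Descendants searches the whole tree. One caveat of this simple approach: a paragraph nested inside another paragraph (for example, text box content) would have its text reported twice, once for the outer paragraph and once on its own.
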