Extract "sections" by heading in DOCX

  • Question

  • Is it possible to extract headings from a DOCX file? As far as I know, OOXML does not have the concept of a "heading", merely styles named "heading 1", "heading 2", etc.

     

    The reason I need to do this is my use-case:

    I need to take a DOCX file uploaded from the web (perhaps generated using Microsoft Word), split it into "sections by heading", i.e. one chunk for each "heading 1" together with the text in that "section", and convert each chunk to HTML with as high conversion fidelity as possible. So an author might have written a paper in a DOCX file with three headings (sections), and when it is uploaded, I need to convert that DOCX file into three separate HTML files.

    You guys have any idea of how I should approach this?


    /Jesper www.idippedut.dk
    Monday, November 22, 2010 11:42 AM

Answers

  • Hello Jesper,

    Your question can be read in two ways. Is the heading in the body of the text, as described in this KB article:

    See How to number chapters, appendixes, and pages in documents that contain both chapter and appendix headings in Word
    http://support.microsoft.com/kb/290953

    or is the heading a document header, which is defined once in the section properties and, unless a later section overrides it, applies to the whole document no matter how many sections there are?

    In the first case, the following XML

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14">
     <w:body>
      <w:p w:rsidR="00463B2A" w:rsidRDefault="00463B2A" w:rsidP="00C21052" />
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:pPr>
        <w:pStyle w:val="Heading1" />
       </w:pPr>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:r>
        <w:t>This is the text of the section called Chapter 1.</w:t>
       </w:r>
       <w:bookmarkStart w:id="0" w:name="_GoBack" />
       <w:bookmarkEnd w:id="0" />
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052">
       <w:r>
        <w:br w:type="page" />
       </w:r>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:r>
        <w:lastRenderedPageBreak />
        <w:t>Page two of chapter 1.</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:pPr>
        <w:sectPr w:rsidR="00C21052">
         <w:pgSz w:w="12240" w:h="15840" />
         <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0" />
         <w:cols w:space="720" />
         <w:docGrid w:linePitch="360" />
        </w:sectPr>
       </w:pPr>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:pPr>
        <w:pStyle w:val="Heading1" />
       </w:pPr>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052" />
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:r>
        <w:t>This is the section called chapter 2.</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052" />
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:pPr>
        <w:sectPr w:rsidR="00C21052">
         <w:pgSz w:w="12240" w:h="15840" />
         <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0" />
         <w:cols w:space="720" />
         <w:docGrid w:linePitch="360" />
        </w:sectPr>
       </w:pPr>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:pPr>
        <w:pStyle w:val="Heading1" />
       </w:pPr>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052" />
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:r>
        <w:t xml:space="preserve">Page one of the </w:t>
       </w:r>
       <w:proofErr w:type="gramStart" />
       <w:r>
        <w:t>section</w:t>
       </w:r>
       <w:proofErr w:type="gramEnd" />
       <w:r>
        <w:t xml:space="preserve"> called Chapter 3.</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052">
       <w:r>
        <w:br w:type="page" />
       </w:r>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRPr="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:r>
        <w:lastRenderedPageBreak />
        <w:t>Page two of chapter 3 – i.e. a section called chapter 3</w:t>
       </w:r>
      </w:p>
      <w:sectPr w:rsidR="00C21052" w:rsidRPr="00C21052">
       <w:pgSz w:w="12240" w:h="15840" />
       <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0" />
       <w:cols w:space="720" />
       <w:docGrid w:linePitch="360" />
      </w:sectPr>
     </w:body>
    </w:document>
    
    shows that the paragraph (<w:p>) and paragraph style (<w:pStyle>) tags
    <w:p w:rsidR="00463B2A" w:rsidRDefault="00463B2A" w:rsidP="00C21052" />
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:pPr>
        <w:pStyle w:val="Heading1" />
       </w:pPr>
      </w:p>
    
    declare the heading. The content of the sequence
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:r>
        <w:t>This is the text of the section called Chapter 1.</w:t>
       </w:r>
       <w:bookmarkStart w:id="0" w:name="_GoBack" />
       <w:bookmarkEnd w:id="0" />
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052">
       <w:r>
        <w:br w:type="page" />
       </w:r>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:r>
        <w:lastRenderedPageBreak />
        <w:t>Page two of chapter 1.</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="00C21052" w:rsidRDefault="00C21052" w:rsidP="00C21052">
       <w:pPr>
        <w:sectPr w:rsidR="00C21052">
         <w:pgSz w:w="12240" w:h="15840" />
         <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0" />
         <w:cols w:space="720" />
         <w:docGrid w:linePitch="360" />
        </w:sectPr>
       </w:pPr>
      </w:p>
    
    shows page breaks (<w:br w:type="page" />) and a section break (<w:sectPr>); between them, the text of the section is whatever sits between the <w:t> and </w:t> tags.
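
    The heading test described above can be sketched in Python with the standard library. This is only a rough illustration, not a definitive implementation: the function names and the sample string are made up for the example, it assumes the heading style ID is "Heading1" as in the XML above (the ID may differ in other documents, so treat it as a parameter), and it drops any paragraphs that come before the first heading.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_style(p):
    """Return the w:pStyle value of a w:p element, or None if it has none."""
    style = p.find("w:pPr/w:pStyle", NS)
    return style.get(f"{{{W}}}val") if style is not None else None


def paragraph_text(p):
    """Concatenate the text of every w:t descendant of a paragraph."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


def split_by_heading(document_xml, heading_style="Heading1"):
    """Group body paragraphs into chunks, starting a new chunk at each heading."""
    body = ET.fromstring(document_xml).find("w:body", NS)
    chunks = []
    for p in body.findall("w:p", NS):
        if paragraph_style(p) == heading_style:
            chunks.append({"heading": paragraph_text(p), "paragraphs": []})
        elif chunks:  # paragraphs before the first heading are skipped
            text = paragraph_text(p)
            if text:
                chunks[-1]["paragraphs"].append(text)
    return chunks


# Hypothetical, heavily trimmed document used only for this demo.
SAMPLE = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
 <w:body>
  <w:p><w:pPr><w:pStyle w:val="Heading1" /></w:pPr><w:r><w:t>Chapter 1</w:t></w:r></w:p>
  <w:p><w:r><w:t>Text of chapter 1.</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="Heading1" /></w:pPr><w:r><w:t>Chapter 2</w:t></w:r></w:p>
  <w:p><w:r><w:t xml:space="preserve">Text of </w:t></w:r><w:r><w:t>chapter 2.</w:t></w:r></w:p>
 </w:body>
</w:document>"""

for chunk in split_by_heading(SAMPLE):
    print(chunk["heading"], chunk["paragraphs"])
```

    Note that the text of one paragraph can be spread over several runs (<w:r>), which is why the sketch joins every <w:t> it finds inside a paragraph.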
    If you need the contents of a section when there is no style delineation but the document does have a header/footer,
    note that the content of a section sits above that section's <w:sectPr> tags (any header reference is inside them),
    as in this XML
    <w:p w:rsidR="00A759B3" w:rsidRDefault="00A759B3">
       <w:pPr>
        <w:sectPr w:rsidR="00A759B3">
         <w:pgSz w:w="12240" w:h="15840" />
         <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0" />
         <w:cols w:space="720" />
         <w:docGrid w:linePitch="360" />
        </w:sectPr>
       </w:pPr>
      </w:p>
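
    Splitting on section breaks instead of styles could look roughly like the sketch below. It assumes that a paragraph whose <w:pPr> holds a <w:sectPr> is the last paragraph of its section, and that the final section is closed by the <w:sectPr> directly under <w:body>, as in the listings here. The function name and sample string are invented for the example.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_by_section(document_xml):
    """Return a list of sections, each a list of non-empty paragraph texts."""
    body = ET.fromstring(document_xml).find(W + "body")
    sections, current = [], []
    for p in body.findall(W + "p"):
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text:
            current.append(text)
        # A sectPr inside the paragraph properties ends the current section.
        if p.find(f"{W}pPr/{W}sectPr") is not None:
            sections.append(current)
            current = []
    # Whatever remains belongs to the last section (body-level sectPr).
    sections.append(current)
    return sections


# Hypothetical, heavily trimmed document used only for this demo.
SECTION_SAMPLE = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
 <w:body>
  <w:p><w:r><w:t>Section 1 text</w:t></w:r></w:p>
  <w:p><w:pPr><w:sectPr /></w:pPr></w:p>
  <w:p><w:r><w:t>Section 2 text</w:t></w:r></w:p>
  <w:sectPr />
 </w:body>
</w:document>"""

print(split_by_section(SECTION_SAMPLE))
```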
    
    The document header is referenced (<w:headerReference>) inside the section properties at the end of the document's first section, as in the following snippet
    <w:p w:rsidR="00A759B3" w:rsidRDefault="00A759B3">
       <w:pPr>
        <w:sectPr w:rsidR="00A759B3">
         <w:headerReference w:type="default" r:id="rId8" />
         <w:pgSz w:w="12240" w:h="15840" />
         <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0" />
         <w:cols w:space="720" />
         <w:docGrid w:linePitch="360" />
        </w:sectPr>
       </w:pPr>
      </w:p>
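
    To see which sections carry their own header reference, something like the following sketch may help. It only collects the w:type and r:id attributes of each <w:headerReference>; resolving an ID such as rId8 to the actual header part would go through the package relationships (word/_rels/document.xml.rels) and is not shown. The function name and sample string are assumptions for the example.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"


def header_references(document_xml):
    """Return, for each w:sectPr in document order, its (type, r:id) pairs."""
    root = ET.fromstring(document_xml)
    result = []
    for sect in root.iter(W + "sectPr"):
        refs = [
            (ref.get(W + "type"), ref.get(R + "id"))
            for ref in sect.findall(W + "headerReference")
        ]
        result.append(refs)
    return result


# Hypothetical, heavily trimmed document used only for this demo.
HEADER_SAMPLE = """<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
 <w:body>
  <w:p><w:pPr><w:sectPr>
   <w:headerReference w:type="default" r:id="rId8" />
  </w:sectPr></w:pPr></w:p>
  <w:p><w:r><w:t>Second section</w:t></w:r></w:p>
  <w:sectPr />
 </w:body>
</w:document>"""

print(header_references(HEADER_SAMPLE))
```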
    
    The complete document is in the following XML:
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>

    <w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14">
     <w:body>
      <w:p w:rsidR="000621CA" w:rsidRDefault="00A759B3">
       <w:r>
        <w:t xml:space="preserve">One </w:t>
       </w:r>
       <w:r w:rsidR="000621CA">
        <w:t>Headings and Sections research</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="00A759B3" w:rsidRDefault="00A759B3" w:rsidP="00A759B3">
       <w:r>
        <w:t>This is paragraph text for Paragraph 1.</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="00A759B3" w:rsidRDefault="00A759B3" w:rsidP="00A759B3">
       <w:r>
        <w:t>Here is p</w:t>
       </w:r>
       <w:r>
        <w:t>aragraph text</w:t>
       </w:r>
       <w:r>
        <w:t xml:space="preserve"> for Paragraph 2</w:t>
       </w:r>
       <w:r>
        <w:t>.</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="00A759B3" w:rsidRDefault="00A759B3" />
      <w:p w:rsidR="000621CA" w:rsidRDefault="000621CA" />
      <w:p w:rsidR="000621CA" w:rsidRDefault="000621CA">
       <w:r>
        <w:br w:type="page" />
       </w:r>
      </w:p>
      <w:p w:rsidR="000621CA" w:rsidRDefault="000621CA">
       <w:r>
        <w:lastRenderedPageBreak />
        <w:t xml:space="preserve">This is page 2 of Section 1 </w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="000621CA" w:rsidRDefault="00A759B3">
       <w:r>
        <w:t>This is text for paragraph 1.</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="00A759B3" w:rsidRDefault="00A759B3" w:rsidP="00A759B3">
       <w:r>
        <w:t>Here</w:t>
       </w:r>
       <w:r>
        <w:t xml:space="preserve"> is text for paragraph </w:t>
       </w:r>
       <w:r>
        <w:t>2</w:t>
       </w:r>
       <w:r>
        <w:t>.</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="00A759B3" w:rsidRDefault="00A759B3">
       <w:pPr>
        <w:sectPr w:rsidR="00A759B3">
         <w:headerReference w:type="default" r:id="rId8" />
         <w:pgSz w:w="12240" w:h="15840" />
         <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0" />
         <w:cols w:space="720" />
         <w:docGrid w:linePitch="360" />
        </w:sectPr>
       </w:pPr>
      </w:p>
      <w:p w:rsidR="000621CA" w:rsidRDefault="000621CA">
       <w:r>
        <w:lastRenderedPageBreak />
        <w:t>Page 1 of Section 2</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="000621CA" w:rsidRDefault="00A759B3">
       <w:r>
        <w:t>This is the brief text for paragraph 1.</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="00A759B3" w:rsidRDefault="00A759B3" w:rsidP="00A759B3">
       <w:r>
        <w:t>More</w:t>
       </w:r>
       <w:r>
        <w:t xml:space="preserve"> brief text for paragraph </w:t>
       </w:r>
       <w:r>
        <w:t>2 is here</w:t>
       </w:r>
       <w:r>
        <w:t>.</w:t>
       </w:r>
      </w:p>
      <w:p w:rsidR="00A759B3" w:rsidRDefault="00A759B3">
       <w:pPr>
        <w:sectPr w:rsidR="00A759B3">
         <w:pgSz w:w="12240" w:h="15840" />
         <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0" />
         <w:cols w:space="720" />
         <w:docGrid w:linePitch="360" />
        </w:sectPr>
       </w:pPr>
      </w:p>
      <w:p w:rsidR="000621CA" w:rsidRDefault="000621CA">
       <w:proofErr w:type="gramStart" />
       <w:r>
        <w:lastRenderedPageBreak />
        <w:t>Section 3, Page 1</w:t>
       </w:r>
       <w:r w:rsidR="004304BA">
        <w:t>.</w:t>
       </w:r>
       <w:proofErr w:type="gramEnd" />
      </w:p>
      <w:p w:rsidR="004304BA" w:rsidRDefault="00A759B3">
       <w:r>
        <w:t>This is the end of the text in document one.</w:t>
       </w:r>
       <w:bookmarkStart w:id="0" w:name="_GoBack" />
       <w:bookmarkEnd w:id="0" />
      </w:p>
      <w:p w:rsidR="004304BA" w:rsidRDefault="004304BA" />
      <w:p w:rsidR="004304BA" w:rsidRDefault="004304BA" />
      <w:p w:rsidR="000621CA" w:rsidRDefault="000621CA" />
      <w:sectPr w:rsidR="000621CA">
       <w:pgSz w:w="12240" w:h="15840" />
       <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0" />
       <w:cols w:space="720" />
       <w:docGrid w:linePitch="360" />
      </w:sectPr>
     </w:body>
    </w:document>

    Once you know how to parse the XML, you can write each section as an individual stream saved to an HTML file.
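
    As a rough end-to-end sketch of that last step: a DOCX file is a ZIP package, and Word normally stores the main part as word/document.xml (an assumption here; strictly it should be located through the package relationships). The code below reads that part, starts a new chunk at every "Heading1" paragraph and writes one HTML file per chunk. All function and file names are made up for the example, and only plain text in <h1>/<p> is emitted, so this is nowhere near a high-fidelity conversion.

```python
import html
import io
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_document_xml(docx_file):
    """Return the bytes of word/document.xml from a DOCX (ZIP) package."""
    with zipfile.ZipFile(docx_file) as package:
        return package.read("word/document.xml")


def chunks_to_html(document_xml, heading_style="Heading1"):
    """Return one HTML fragment per heading-delimited chunk."""
    body = ET.fromstring(document_xml).find(W + "body")
    pages = []
    for p in body.findall(W + "p"):
        style = p.find(f"{W}pPr/{W}pStyle")
        text = html.escape("".join(t.text or "" for t in p.iter(W + "t")))
        if style is not None and style.get(W + "val") == heading_style:
            pages.append([f"<h1>{text}</h1>"])  # a heading starts a new chunk
        elif pages and text:
            pages[-1].append(f"<p>{text}</p>")
    return ["\n".join(lines) for lines in pages]


def write_html_files(docx_file, out_dir):
    """Write section1.html, section2.html, ... and return their paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    pages = chunks_to_html(read_document_xml(docx_file))
    for number, page in enumerate(pages, start=1):
        path = out / f"section{number}.html"
        path.write_text(page, encoding="utf-8")
        paths.append(path)
    return paths


def make_sample_docx():
    """Build a minimal in-memory package for the demo (not a complete DOCX)."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1" /></w:pPr>'
        "<w:r><w:t>Chapter 1</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>First &amp; only paragraph.</w:t></w:r></w:p>"
        '<w:p><w:pPr><w:pStyle w:val="Heading1" /></w:pPr>'
        "<w:r><w:t>Chapter 2</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


for page in chunks_to_html(read_document_xml(make_sample_docx())):
    print(page)
    print("---")
```

    For a real upload you would call write_html_files("paper.docx", "out") with your own paths; both names are placeholders.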


    Chris Jensen
    • Proposed as answer by cjatms Tuesday, November 30, 2010 7:10 PM
    • Marked as answer by Jesper Lund Stocholm Tuesday, October 20, 2015 7:03 AM
    Tuesday, November 30, 2010 7:09 PM

All replies

  • Hello Jesper,

    Thanks for posting. We are researching this issue, and it might take some time before we get back to you. Have a nice day.

    Best regards,
    Bessie Zhao - MSFT
    MSDN Subscriber Support in Forum
    Tuesday, November 23, 2010 6:16 AM