How to read index numbers (bullet numbers) of a paragraph

  • Question

  • What: Want to read the bullet number of the paragraph in a docx document.

    Why: We maintain different versions of a docx document and need to show the differences between two versions by paragraph number.

    Problem detail: 

    I am trying hard to find a way to read the index numbers of the paragraphs in the docx file. I have checked each of the XML files in the compressed package, but I cannot make sense of them. There is no pattern I can turn into logic.

    Please note that the bullet numbers are created using the style of the MS Word docx.

    I need to do this in Java, and I have tried both of the APIs below:

    1. org.apache.poi.xwpf
    2. org.docx4j.openpackaging

    I would really appreciate it if anyone has any clue here.

    Thanks in advance.

    Lester


    Tuesday, March 13, 2018 6:17 AM

All replies

  • I don't know the full details, or whether there is anything in the Java libraries you are using that works this stuff out for you, but below I describe approximately how it all fits together (for paragraph numbering).

    It's actually complicated enough that you might be better off finding some 3rd-party library that can render a document with all the expected numbering in a format that lets you get at the numbering more easily, but I do not have any suggestions on what that might be.

    Let's say we are talking about paragraphs in the main body of the document (typically the document part called document.xml, although it does not have to be).

    Each paragraph in the body may
     a. not be numbered
     b. be numbered using direct number formatting
     c. be numbered as a consequence of having a paragraph style that has a numbering scheme defined.

    The paragraph numbers themselves are not stored in document.xml. You have to work them out by looking up the numbering scheme for each paragraph, calculating the next number in that scheme, then formatting the number.

    Because of (c), a paragraph may not have any elements or attributes that tell you directly whether or not the paragraph is numbered. So in the case of a paragraph with no numbering attributes, you have to look up the paragraph's style information, determine whether it has an associated numbering scheme, then look up that scheme.

    So let's say you see this XML within the <w:document> element in document.xml:

    <w:p><w:pPr><w:pStyle w:val="mynumpara"/></w:pPr><w:r><w:t xml:space="preserve">my text</w:t></w:r></w:p>

    First, you need to find the part where the styles are defined. Typically, the document.xml Part is in a folder structure like this:

    /word/document.xml

    You first need the .rels part for document.xml, which is

    /word/_rels/document.xml.rels

    In there, you should find a Relationship element something like this...

    <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>

    You need the element where Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"

    Retrieve the name of the target and work out its path. In this case it will be:

    /word/styles.xml
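
    The relationship lookup above can be sketched in plain Java with the JDK's DOM parser. This is only an illustration: the class RelsLookup and the method findTarget are names I made up, and it assumes you have already read the .rels part into a String (for example with java.util.zip). It does not normalise targets such as "../x.xml" and it ignores external targets.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI or docx4j.
class RelsLookup {
    static final String PKG_RELS =
        "http://schemas.openxmlformats.org/package/2006/relationships";
    static final String STYLES_TYPE =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles";
    static final String NUMBERING_TYPE =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering";

    /**
     * Returns the package path of the first Relationship with the given Type,
     * or null if there is none. sourceFolder is the folder of the source part,
     * e.g. "/word/" for /word/document.xml.
     */
    static String findTarget(String relsXml, String sourceFolder, String type) {
        Document doc;
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(relsXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse .rels XML", e);
        }
        NodeList rels = doc.getElementsByTagNameNS(PKG_RELS, "Relationship");
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            if (type.equals(rel.getAttribute("Type"))) {
                String target = rel.getAttribute("Target");
                // Targets starting with "/" are already package-absolute.
                return target.startsWith("/") ? target : sourceFolder + target;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String rels = "<Relationships xmlns=\"" + PKG_RELS + "\">"
            + "<Relationship Id=\"rId2\" Type=\"" + STYLES_TYPE + "\" Target=\"styles.xml\"/>"
            + "</Relationships>";
        System.out.println(findTarget(rels, "/word/", STYLES_TYPE)); // /word/styles.xml
    }
}
```

    The same method should work for the numbering part later on if you pass NUMBERING_TYPE instead.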

    Open that part, and look for a w:style element where the w:styleId attribute is "mynumpara", e.g.

    <w:style w:type="paragraph" w:customStyle="1" w:styleId="mynumpara"><w:name w:val="mynumpara"/><w:basedOn w:val="Normal"/><w:qFormat/><w:pPr><w:numPr><w:numId w:val="6"/></w:numPr><w:ind w:left="714" w:hanging="357"/></w:pPr></w:style>

    The <w:numId> element specifies the numbering scheme associated with the style - in this case it is scheme "6"
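
    A rough Java sketch of that style lookup follows. The names StyleNumbering and numIdForStyle are invented for the example. One thing it adds that is not described above: if the style itself has no w:numId, it follows the w:basedOn chain, on the assumption that numbering is inherited from the parent style like other paragraph properties.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI or docx4j.
class StyleNumbering {
    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // First descendant w:<localName> of parent, or null.
    static Element first(Element parent, String localName) {
        NodeList list = parent.getElementsByTagNameNS(W, localName);
        return list.getLength() == 0 ? null : (Element) list.item(0);
    }

    static Element findStyle(Document styles, String styleId) {
        NodeList list = styles.getElementsByTagNameNS(W, "style");
        for (int i = 0; i < list.getLength(); i++) {
            Element style = (Element) list.item(i);
            if (styleId.equals(style.getAttributeNS(W, "styleId"))) {
                return style;
            }
        }
        return null;
    }

    /** Returns the w:numId value for the style, or null if it has no numbering. */
    static String numIdForStyle(Document styles, String styleId) {
        Set<String> seen = new HashSet<>(); // guards against basedOn cycles
        while (styleId != null && seen.add(styleId)) {
            Element style = findStyle(styles, styleId);
            if (style == null) {
                return null;
            }
            Element numId = first(style, "numId");
            if (numId != null) {
                return numId.getAttributeNS(W, "val");
            }
            // Assumption: fall back to the parent style named by w:basedOn.
            Element basedOn = first(style, "basedOn");
            styleId = basedOn == null ? null : basedOn.getAttributeNS(W, "val");
        }
        return null;
    }

    public static void main(String[] args) {
        Document styles = parse("<w:styles xmlns:w=\"" + W + "\">"
            + "<w:style w:type=\"paragraph\" w:styleId=\"mynumpara\"><w:pPr><w:numPr>"
            + "<w:numId w:val=\"6\"/></w:numPr></w:pPr></w:style></w:styles>");
        System.out.println(numIdForStyle(styles, "mynumpara")); // 6
    }
}
```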

    Now you need to find the Part that contains the numbering scheme info. Again, look in the relationships part for document.xml, find the Relationship with Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering", and get the name of the numbering part from its Target attribute (typically you would have Target="numbering.xml").

    Within that, you need to look for the <w:num> element whose w:numId attribute value matches the value of the w:val attribute of the w:numId element in the style.

    In w:style, we had <w:numId w:val="6"/>

    so we need a <w:num> like

    <w:num w:numId="6">

    That element should contain an element like this:

    <w:abstractNumId w:val="1"/>

    So to get our numbering info., we actually need to look for the <w:abstractNum> element (in the same numbering.xml part) which has an attribute w:abstractNumId="1".

    This element will typically contain definitions for 9 levels of numbering, even if the numbering scheme actually only uses a single level. The numbering for level 1 is in a <w:lvl> element with attribute w:ilvl="0" (note that 'human' level 1 is represented by 'machine' level 0 in this case).

    However, our <w:num w:numId="6"> element may also contain override information that overrides the info. contained in that <w:abstractNum> element. If so, it will be in a set of <w:lvlOverride> elements, e.g. <w:lvlOverride w:ilvl="0">. Use the appropriate values from the w:lvlOverride elements if present, and from the <w:abstractNum> element if not.
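
    Putting the last few steps together (w:num, then w:abstractNum, then the level, then any override), a minimal Java sketch could look like this. LevelResolver, Level and resolve are names made up for the example. It assumes that a w:lvlOverride carries either a w:startOverride or a complete replacement w:lvl, and the fallback values used when w:start or w:numFmt is missing are my assumptions and should be checked against the spec.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI or docx4j.
class LevelResolver {
    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Effective settings for one level of one numbering instance. */
    static final class Level {
        final int start;
        final String numFmt;
        final String lvlText;
        Level(int start, String numFmt, String lvlText) {
            this.start = start;
            this.numFmt = numFmt;
            this.lvlText = lvlText;
        }
    }

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // First descendant w:<name> of scope whose w:<attr> equals value, or null.
    static Element find(Element scope, String name, String attr, String value) {
        if (scope == null || value == null) return null;
        NodeList list = scope.getElementsByTagNameNS(W, name);
        for (int i = 0; i < list.getLength(); i++) {
            Element e = (Element) list.item(i);
            if (value.equals(e.getAttributeNS(W, attr))) return e;
        }
        return null;
    }

    // w:val of the first descendant w:<name> of scope, or fallback.
    static String val(Element scope, String name, String fallback) {
        if (scope == null) return fallback;
        NodeList list = scope.getElementsByTagNameNS(W, name);
        return list.getLength() == 0
            ? fallback : ((Element) list.item(0)).getAttributeNS(W, "val");
    }

    /** Returns the effective level definition, or null if it cannot be found. */
    static Level resolve(Document numbering, String numId, int ilvl) {
        Element root = numbering.getDocumentElement();
        String level = Integer.toString(ilvl);
        Element num = find(root, "num", "numId", numId);
        if (num == null) return null;
        Element abs = find(root, "abstractNum", "abstractNumId",
                           val(num, "abstractNumId", null));
        Element lvl = find(abs, "lvl", "ilvl", level);
        Element override = find(num, "lvlOverride", "ilvl", level);
        Element replacement = find(override, "lvl", "ilvl", level);
        if (replacement != null) lvl = replacement; // override replaces the whole level
        if (lvl == null) return null;
        // Fallbacks "0" and "decimal" are assumptions for missing elements.
        String start = val(override, "startOverride", val(lvl, "start", "0"));
        return new Level(Integer.parseInt(start),
                         val(lvl, "numFmt", "decimal"), val(lvl, "lvlText", ""));
    }
}
```

    For example, resolve(numbering, "6", 0) would give you the settings for level 1 of scheme "6", with any override already applied.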

    Within a <w:lvl> or <w:lvlOverride> element, the XML may specify:
     a. a starting number for the level, e.g. <w:start w:val="2"/> (in a <w:lvlOverride> this appears either as <w:startOverride w:val="2"/> or inside a complete replacement <w:lvl>)
     b. a number format for the current level's number, e.g. <w:numFmt w:val="lowerRoman"/> for the format i, ii, iii etc.
     c. a format string for the sequence number as a whole (<w:lvlText>), specifying numbering info. from other levels in the same scheme, and "trim" characters such as ".", "(", "Chapter " and so on.

    So for example, if you see <w:lvlText w:val="%1(%2)"/> , the level 1 (ilvl 0) numFmt is "decimal", its current value is 3, and the level 2 numFmt is "lowerLetter" and current value is 4, the full sequence number would be 3(d)
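
    To make the substitution concrete, here is a small Java sketch that expands a w:lvlText pattern. LevelTextFormatter and its methods are invented names, only a handful of w:numFmt values are handled (anything else falls back to decimal), and the behaviour after "z" (aa, bb, cc ...) is what I believe Word does rather than something I have checked against the spec.

```java
import java.util.Locale;

// Hypothetical helper, not part of POI or docx4j.
class LevelTextFormatter {

    /** Formats one counter value according to a w:numFmt value. */
    static String formatNumber(int n, String numFmt) {
        if (n < 1 || numFmt == null) return Integer.toString(n);
        switch (numFmt) {
            case "lowerLetter": return letters(n);
            case "upperLetter": return letters(n).toUpperCase(Locale.ROOT);
            case "lowerRoman":  return roman(n).toLowerCase(Locale.ROOT);
            case "upperRoman":  return roman(n);
            default:            return Integer.toString(n); // decimal and unknown formats
        }
    }

    // a..z, then aa, bb, ... (assumed Word behaviour).
    static String letters(int n) {
        char c = (char) ('a' + (n - 1) % 26);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= (n - 1) / 26; i++) sb.append(c);
        return sb.toString();
    }

    static String roman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    /**
     * Expands %1..%9 in lvlText. numFmts[i] and counters[i] belong to ilvl i,
     * so "%1" refers to index 0.
     */
    static String format(String lvlText, String[] numFmts, int[] counters) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lvlText.length(); i++) {
            char c = lvlText.charAt(i);
            boolean placeholder = c == '%' && i + 1 < lvlText.length()
                && lvlText.charAt(i + 1) >= '1' && lvlText.charAt(i + 1) <= '9';
            if (!placeholder) {
                out.append(c);
                continue;
            }
            int ilvl = lvlText.charAt(i + 1) - '1';
            if (ilvl < counters.length) {
                out.append(formatNumber(counters[ilvl], numFmts[ilvl]));
            }
            i++; // skip the digit after '%'
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] fmts = {"decimal", "lowerLetter"};
        System.out.println(format("%1(%2)", fmts, new int[] {3, 4})); // 3(d)
    }
}
```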

    In a multilevel list, the XML for a level (let's say level 4) may also contain an element like this:

    <w:lvlRestart w:val="1"/>

    This means "restart level 4 numbering when the level 1 number changes". If that element is not there, multilevel numbering for level n+1 restarts when the level n number changes.
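
    The counting and restart rule can be sketched in Java as below. ListCounters is a made-up class, and it encodes my reading of w:lvlRestart: with no element, a deeper level restarts whenever any shallower level is used; with w:val="r", it restarts only when level r or a shallower one is used; with w:val="0", it never restarts. I use -1 in the array to mean "element absent". Levels that are skipped are simply given their start value, which is also an assumption.

```java
import java.util.Arrays;

// Hypothetical helper: one instance per numbering scheme (w:numId).
class ListCounters {
    private final int[] start;       // w:start per ilvl
    private final int[] lvlRestart;  // w:lvlRestart per ilvl, -1 if absent
    private final int[] current;
    private final boolean[] active;  // false = level will begin again at start

    ListCounters(int[] start, int[] lvlRestart) {
        this.start = start.clone();
        this.lvlRestart = lvlRestart.clone();
        this.current = new int[start.length];
        this.active = new boolean[start.length];
    }

    /** Call once per numbered paragraph, in document order. Returns counters for ilvl 0..ilvl. */
    int[] next(int ilvl) {
        // Shallower levels that were never used show their start value (assumption).
        for (int k = 0; k < ilvl; k++) {
            if (!active[k]) {
                current[k] = start[k];
                active[k] = true;
            }
        }
        if (active[ilvl]) {
            current[ilvl]++;
        } else {
            current[ilvl] = start[ilvl];
            active[ilvl] = true;
        }
        // Decide which deeper levels must restart the next time they are used.
        for (int k = ilvl + 1; k < current.length; k++) {
            if (restarts(k, ilvl)) active[k] = false;
        }
        return Arrays.copyOf(current, ilvl + 1);
    }

    private boolean restarts(int deeperLevel, int usedLevel) {
        int r = lvlRestart[deeperLevel];
        if (r < 0) return true;     // element absent: default behaviour
        return usedLevel <= r - 1;  // r is one-based; r == 0 never restarts
    }

    public static void main(String[] args) {
        ListCounters c = new ListCounters(new int[] {1, 1}, new int[] {-1, -1});
        System.out.println(Arrays.toString(c.next(0))); // [1]
        System.out.println(Arrays.toString(c.next(1))); // [1, 1]
        System.out.println(Arrays.toString(c.next(1))); // [1, 2]
        System.out.println(Arrays.toString(c.next(0))); // [2]
        System.out.println(Arrays.toString(c.next(1))); // [2, 1]
    }
}
```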

    I leave it to you to find out what other attributes and features could affect the numbering. (In addition to anything defined in the XML, some settings in the application itself can affect what Word displays - there is an option to do with whether the "local" digits characters or Hindi-Arabic numbering is used. Since it is set in the application, it is impossible to tell from the .docx alone what numbering any specific user will actually see).

    So that's what (roughly) happens if the numbering comes from the style. Numbering applied directly to the paragraph works in much the same way except that the 

    <w:numPr><w:numId w:val="something"/>

    is defined in the <w:pPr> (paragraph properties) element for the paragraph, not the style.   In that case you should be able to skip looking up the style. The numbering can be defined in the paragraph even in cases where the style has associated numbering. If for example you specify that numbering should restart at a certain paragraph within a list, Word will (AFAIK) use a w:numId reference in the paragraph to point to a different numbering scheme from the one specified in the style.
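
    The "paragraph first, then style" order can be sketched in Java like this. ParagraphNumbering and effectiveNumId are invented names, and styleNumIds is a hypothetical map from style ID to w:numId value that you would build from the styles part beforehand. The sketch also treats a direct <w:numId w:val="0"/> as "numbering switched off for this paragraph", which is how I understand the format.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI or docx4j.
class ParagraphNumbering {
    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Direct child w:<localName> of parent, or null.
    static Element child(Element parent, String localName) {
        if (parent == null) return null;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** Returns the numbering scheme ID for a w:p element, or null if it is not numbered. */
    static String effectiveNumId(Element p, Map<String, String> styleNumIds) {
        Element pPr = child(p, "pPr");
        Element numId = child(child(pPr, "numPr"), "numId");
        String id;
        if (numId != null) {
            id = numId.getAttributeNS(W, "val");           // direct formatting wins
        } else {
            Element pStyle = child(pPr, "pStyle");
            id = pStyle == null ? null : styleNumIds.get(pStyle.getAttributeNS(W, "val"));
        }
        return "0".equals(id) ? null : id;                  // 0 = numbering removed
    }

    public static void main(String[] args) {
        Map<String, String> styleNumIds = new HashMap<>();
        styleNumIds.put("mynumpara", "6");
        Element p = parseRoot("<w:p xmlns:w=\"" + W + "\"><w:pPr>"
            + "<w:pStyle w:val=\"mynumpara\"/></w:pPr></w:p>");
        System.out.println(effectiveNumId(p, styleNumIds)); // 6
    }
}
```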

    Finally, you can have automatic paragraph numbering in a number of different document parts (the main body, typically in document.xml, headers, footers, floating objects such as text boxes and so on), but the method for working out how to number those parts is the same.

    Peter Jamieson

    Tuesday, March 13, 2018 3:56 PM
  • Hello Lester_Japang,

    Has your original issue been resolved? If it has, I would suggest you mark the helpful reply as answer or provide your solution and mark as answer to close this thread. If not, please feel free to let us know your current issue.

    Best Regards,

    Terry


    Thursday, March 15, 2018 7:20 AM