WordProcessingML and page breaks

  • Question

  • hi.

     

    i'm taking a word document as the input and need to output an xml file of that document.

    i'm writing a .net application (C#) for that purpose and with the use of the word application i can extract the xml of the document (WordProcessingML) which is great but not enough...

     

    i need to add some "custom" tags to that xml file, one of those tags should be the "page" tag.

    i.e.:

    <nitm:page>
     <w:p>  
      <w:pPr>
      ....
    </nitm:page>

     

     

    the problem is that i have no idea how to figure where one page ends and another starts.

    i searched for the answer and the only thing that i understood from that (and please correct me if i'm wrong) is that a new "wx:sect" will be added when the author used "Insert ==> Break ==> Page Break" in the document.

    that isn't good enough since the page will break if the text overflows the current one...

     

    one solution that i can think of is to "travel" the word document (dynamically) and each time the application reaches a new page it will look up the location in the xml file and add a "page" tag.

    this solution should do the trick (and again, please correct me if i'm wrong) but i don't like it one bit!  it's ugly and clumsy and i'm looking for a more elegant solution...

     

    any ideas?

    thanks, nitzan.

    Tuesday, October 2, 2007 4:28 PM

Answers

  • Let's see: docx is Open XML; doc is not.

    WordProcessingML doesn't necessarily involve Open XML.

    You can search this forum for information on "how to convert".

    You can convert the files in batch, through automation, or manually.

    Everything I have told you from the beginning assumed that you were working with Open XML.

    Evidently that is why the tags are absent.

    Wednesday, October 31, 2007 3:50 PM

All replies

  • If you are looking for a page break, they are defined as paragraphs of this type:

     

    <w:p w:rsidR="00422402" w:rsidRDefault="00422402">
    <w:r>
    <w:br w:type="page"/><!--It is -->
    </w:r>
    </w:p>

     

    Tuesday, October 2, 2007 8:12 PM
  • thanks, i didn't notice that...

    but it looks like it's a result of the "insert => break => page" action of word. as i said, i need to know where each and every page of the document ends (and the one after it begins), not just page breaks..

    for example, let's say i start a new document and write something long: one page won't be enough, so the text will "overflow" to a second page and so on..   i want/need to know where each and every page ends.

     

    thanks.

    Sunday, October 7, 2007 9:36 AM
  • Mmmmm, I'll look through some documents to see whether Word caches this information somewhere, but it depends heavily on fonts, font sizes, page size, margins, etc.

    I can't look at it right now but I will; tell me if you find something else.

    Sunday, October 7, 2007 12:59 PM
  • yes, i'm aware that what i'm asking for is relative to a lot of page and font properties, but the xml files i'm working with are xml of documents that won't be changed in the future, so for me everything is set and defined and since this is the case all of the page/font properties are absolute.

     

    i searched a lot for an answer for this problem and could not find a thing..

    thanks a lot for your help, if you do find something i will be thrilled to learn about it since it will save me a lot of dirty work.

     

    thanks again, nitzan.

    Sunday, October 7, 2007 1:11 PM
  • Back here ;) thinking ....

    If I get the page breaks (something like XPath = "//w:br"), I also get the paragraphs that contain them, because each paragraph is an ancestor of its page break. It can be done with XPath or some other XML navigation.

    Does getting the paragraph change anything for you?

    Monday, October 8, 2007 11:01 AM
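A minimal sketch of the "//w:br" idea above, written in Python with the standard library rather than the C# the original poster uses (the element names carry over unchanged). It assumes the Open XML namespace URI for the `w` prefix; Word 2003 WordprocessingML uses a different URI (`http://schemas.microsoft.com/office/word/2003/wordml`). The function name is made up for illustration. Note that it finds only breaks the author inserted by hand, not the ones Word computes while laying out the text.

```python
import xml.etree.ElementTree as ET

# Assumption: Open XML (Word 2007) namespace for the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs_with_manual_breaks(root):
    """Return the w:p elements that contain a <w:br w:type="page"/>."""
    hits = []
    for p in root.iter("{%s}p" % W):
        for br in p.iterfind(".//w:br", NS):
            # A w:br without w:type="page" is an ordinary line break.
            if br.get("{%s}type" % W) == "page":
                hits.append(p)
                break
    return hits
```

Calling `paragraphs_with_manual_breaks(ET.parse(path).getroot())` on a document part returns the matching paragraphs in document order.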
  • Maybe <w:lastRenderedPageBreak/> is what you are looking for.
     
    Per the Ecma Open XML spec:
     
    2.3.3.13 lastRenderedPageBreak (Position of Last Calculated Page Break)
    This element specifies that this position delimited the end of a page when this document was last saved by an
    application which paginates its content.
    [Guidance: This element shall be used by applications to specify the locations of page breaks within a document
    when it is saved as WordprocessingML, in order to allow other applications (e.g. assistive software) to utilize this
    information when reading the document. end guidance]
    [Example: Consider a run which consists of the text This is the end of the page, where the word end
    was the last word on a page. If the application saving this file had paginated this content, that information may
    be saved with the file as follows:
    <w:r>
    <w:t>This is the end</w:t>
    <w:lastRenderedPageBreak/>
    <w:t xml:space="preserve"> of the page</w:t>
    </w:r>
    The lastRenderedPageBreak element indicates that there was a page break resulting from pagination of this
    content, which occurred between the word end and the word of. end example]
    Friday, October 12, 2007 2:58 AM
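To make the spec example above concrete, here is a small Python sketch that walks a document in order and starts a new page of text each time it meets `w:lastRenderedPageBreak`. It assumes the Open XML namespace for `w`, and the function name is hypothetical. The result reflects only what the saving application cached, so it is as accurate as that cache and no more.

```python
import xml.etree.ElementTree as ET

# Assumption: Open XML (Word 2007) namespace for the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_T = "{%s}t" % W
W_LRPB = "{%s}lastRenderedPageBreak" % W


def text_by_rendered_page(root):
    """Split the text of w:t elements at each cached page-break marker."""
    pages = [[]]
    for el in root.iter():  # document order
        if el.tag == W_LRPB:
            pages.append([])  # the marker ends the current page
        elif el.tag == W_T and el.text:
            pages[-1].append(el.text)
    return ["".join(parts) for parts in pages]
```

Fed the run from the spec example, the sketch yields two strings: the text before the marker and the text after it.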
  • this sounds like what i'm looking for, the only problem is that i searched the xml document for this tag and it does not appear there even once...

    maybe there's a special way to save the document as an xml file with those page break tags?

     

    thanks, nitzan

    Monday, October 22, 2007 12:40 PM
  • I'm only guessing: perhaps older Office versions don't generate this cache.

    Or is it that documents saved with Word 2007 leave out the cache tag?

    And, again, did you try to locate the last paragraph before the page break mark?

     

    Saturday, October 27, 2007 2:13 PM
  • the documents i'm using are generated with word 2003, i might need to make the application support word 2007 documents but that's the future, right now it should support 2003 documents only.

     

    i located the last paragraphs before each page break in the xml file but there's nothing in there that might suggest a page break

    Saturday, October 27, 2007 2:34 PM
  • Do you need an XPath expression to get the paragraphs with a page break?

     

    Sunday, October 28, 2007 8:09 PM
  • thanks, but the xpath is not the problem here, it's just that there's nothing to look for with the xpath, there are no page break paragraphs.

     

    Monday, October 29, 2007 12:55 AM
  • Did you look for [//w:br]?

     

    Wednesday, October 31, 2007 2:01 PM
  • hi.

     

    yes i did look for that but i found only 6 of those in a document that has 36 pages..  that's not what i'm looking for.. i need to know where each page stops.

     

    thanks!

     

    Wednesday, October 31, 2007 2:28 PM
  • I understand you now: you are looking for ALL page breaks, not only those inserted by the user but also those computed by Microsoft Word (they move if you change the font, make text bold, and so on).

    I did this:

    1) I saved a .doc file as .docx using Word 2007

    2) I looked for the <w:lastRenderedPageBreak/> tag suggested earlier

    3) It was exactly in the places you are looking for

     

    Why would it not be there in your .docx files?

    Is it the way you convert them?

    Wednesday, October 31, 2007 2:46 PM
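Step 2 above can also be checked without opening Word. A .docx file is a ZIP package whose main part is `word/document.xml`, so a short Python sketch can open it and count the markers. The function name is hypothetical, and the sketch assumes the main part lives at the usual `word/document.xml` path.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: Open XML (Word 2007) namespace for the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_rendered_breaks(docx):
    """Count w:lastRenderedPageBreak elements in a .docx.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    return sum(1 for _ in root.iter("{%s}lastRenderedPageBreak" % W))
```

A result of zero for a multi-page document suggests the file was saved by something that did not paginate it, which matches the symptom described earlier in the thread.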
  • as i said before, i'm using word 2003 and not 2007, so i don't have those tags..

    is it safe to assume that there's no easy solution to my problem if i use WordProcessingML generated from word 2003?

     

    thanks!

     

    Wednesday, October 31, 2007 2:50 PM
  • Please read my previous post. I used WordProcessingML generated from word 2003 and I found the tags.

    Why? Why don't you?

    At some point you, obviously, convert from doc to docx. I suspect that is where the issue lies.

    Wednesday, October 31, 2007 3:15 PM
  • why do you assume that i convert from doc to docx?  as far as i know docx is the format of word 2007, and since i'm not working with word 2007 why would, or better yet, how could i convert the doc to docx?

    Wednesday, October 31, 2007 3:35 PM
  • Let's see: docx is Open XML; doc is not.

    WordProcessingML doesn't necessarily involve Open XML.

    You can search this forum for information on "how to convert".

    You can convert the files in batch, through automation, or manually.

    Everything I have told you from the beginning assumed that you were working with Open XML.

    Evidently that is why the tags are absent.

    Wednesday, October 31, 2007 3:50 PM
  • oh, i'm sorry about that, i didn't understand all of this before..

     

    i found this article:

    http://blogs.msdn.com/dmahugh/archive/2007/02/09/converting-office-documents-to-open-xml.aspx

    and i'll probably use this (OFC) to convert to open xml.

    thanks a lot!

     

    a few messages back you offered to help me with an xpath expression, can i ask you to have a look at another thread i started:

    http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=2302200&SiteID=1

    i can't figure out the exact xpath that i need, and the thread didn't get a lot of attention...

     

    thanks again for your help!

    nitzan.

    Sunday, November 4, 2007 5:16 PM
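Once the documents are in Open XML, the original goal of wrapping each page in a custom element could look roughly like the Python sketch below. Everything named `nitm` here is a placeholder: the namespace URI, the root element name and the function name are invented for illustration. The sketch also simplifies in two ways: a paragraph that contains a marker is moved whole onto the new page even if it straddles the break, and a paragraph spanning more than two pages still starts only one new page.

```python
import xml.etree.ElementTree as ET

# Assumption: Open XML (Word 2007) namespace for the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Hypothetical namespace for the custom "nitm" elements.
NITM = "urn:example:nitm"
LRPB_PATH = ".//{%s}lastRenderedPageBreak" % W


def wrap_pages(body):
    """Group the children of w:body under nitm:page elements."""
    out = ET.Element("{%s}document" % NITM)
    page = ET.SubElement(out, "{%s}page" % NITM)
    for child in list(body):
        has_marker = child.find(LRPB_PATH) is not None
        # Start a new page at a marker, but never leave an empty first page.
        if has_marker and len(page):
            page = ET.SubElement(out, "{%s}page" % NITM)
        page.append(child)
    return out
```

Splitting a paragraph at the exact marker position is possible but needs the run and paragraph properties duplicated on both halves, which is why the sketch stops at paragraph granularity.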
  • I'll read your posts and try to help :)

     

    Monday, November 5, 2007 12:37 PM
  • I have a document that was saved in the 2007 format by Word 2007. When I look for occurrences of w:lastRenderedPageBreak in the document, I see that while the positions of some of the tags reflect the actual page breaks in the document, others are off by 2-3 lines. Is this a known issue? If so, when will it be fixed? Where can I file a bug for this?
    Thanks,
    -Aneesh
    Saturday, November 1, 2008 6:42 AM
  • I have a document where app.xml says that there are 6 pages, while document.xml has only 2 lastRenderedPageBreak elements, making 3 pages in total. Is that a bug in the Word 2007 editor? Where can I file a bug on this?
    Thanks,
    -Pawan.
    Saturday, November 8, 2008 6:52 AM
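The mismatch described above can be checked mechanically. The Python sketch below reads the `Pages` value from `docProps/app.xml` and compares it with the number of pages implied by the markers in `word/document.xml`. The function name is hypothetical and the part paths are the usual ones, assumed here. Both numbers are values written at save time, so a difference shows only that the two disagree, not which one is right.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: Open XML (Word 2007) namespace for the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Namespace of the extended properties part (docProps/app.xml).
EP = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def page_counts(docx):
    """Return (pages declared in app.xml, pages implied by markers)."""
    with zipfile.ZipFile(docx) as package:
        body = ET.fromstring(package.read("word/document.xml"))
        props = ET.fromstring(package.read("docProps/app.xml"))
    markers = sum(1 for _ in body.iter("{%s}lastRenderedPageBreak" % W))
    declared = props.findtext("{%s}Pages" % EP)
    # N markers separate N + 1 pages; Pages may be missing from app.xml.
    return (int(declared) if declared is not None else None, markers + 1)
```

For the document described in the post, this would return the pair 6 and 3.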
  • w:lastRenderedPageBreak doesn't work when we debug the application. I think it should be declared officially that there is no way to find page breaks at run time (maybe because the document may shrink or grow while we are debugging the XML).
    Wednesday, May 19, 2010 7:06 AM