What's the fastest way to read a .docx file line-by-line in C# using OpenXML

  • Question

  • User-1453200658 posted

    Hi,

    I need to read this Word (.DOC and .DOCX) file line by line using OpenXML.

    In this code I have set a regexp to check whether a line starts with whitespace, a letter, a digit, the bullet character (•), the - character or the ► character.

    // Requires: using System.Linq; using System.Text.RegularExpressions;
    //           using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Wordprocessing;
    protected void Page_Load(object sender, EventArgs e)
    {
        if (!this.IsPostBack)
        {
            string file = @"C:\Users\Downloads\qst.docx";

            // Open read-only: the document is only read here.
            using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(file, false))
            {
                Body body = wordDoc.MainDocumentPart.Document.Body;
                string contents = "";

                // Keep paragraphs whose text starts with whitespace, a letter, a digit, •, - or ►.
                var reg = new Regex(@"^[\s\p{L}\d\•\-\►]");

                foreach (Paragraph co in body.Descendants<Paragraph>().Where(p => reg.IsMatch(p.InnerText)))
                {
                    // Append every paragraph that has paragraph properties (list items included).
                    // InnerText holds only the typed text, not auto-generated bullets or numbers.
                    if (co.ParagraphProperties != null)
                    {
                        contents += co.InnerText + "<br />";
                    }
                    else
                    {
                        // Do other checking.
                    }
                }

                Response.Write(contents);
            }
        }
    }
    

    With this code the output in the browser is wrong, because the bulleted and numbered lists of the Word file are not displayed as lists...

    Section 1
    
    - Para 1.1
    Content 1.1
    test 2
    test 3
    
    •Gaio Giulio Cesare
    •Quinto Orazio Flacco
    •Marco Porcio Catone
    
    
    Section 2
    
    - Para 2.1
    Content 2.1
    test 4
    test 5
    
    - Gaio Giulio Cesare
    - Quinto Orazio Flacco
    - Marco Porcio Catone
    
    ► Marco Porcio Catone
    ► Quinto Orazio Flacco
    ► Gaio Giulio Cesare
    
    Thursday, April 1, 2021 10:39 AM

Answers

  • User753101303 posted

    Seems we can just forget about the regexp, which is irrelevant for now. To me it's exactly like trying to extract text from https://www.w3schools.com/html/html_lists.asp, i.e. it would show
    Item
    Item
    Item
    Item
    as the bullet character is not part of the markup but is rendered because the ul tag defines a bullet list. A search gives me https://docs.microsoft.com/en-us/previous-versions/office/developer/office-2010/ee922775(v=office.14)?redirectedfrom=MSDN which explains something similar, i.e.:

    "When implementing a conversion of Open XML word processing documents to HTML, one of the interesting issues is accurately converting numbered and bulleted lists. You must write specific code to process them, because they affect the text that the document contains, but that text is not directly in the markup. If you are accurately extracting the text of the document, you must process some elements and attributes to assemble the correct text"

    https://stackoverflow.com/questions/1940911/openxml-2-sdk-word-document-create-bulleted-list-programmatically seems to explain how to create a bulleted list with the OpenXML SDK.

    So you should likely follow the reverse process, i.e. check whether the paragraph uses NumberingProperties and which kind of numbering/symbol is used, then add that explicitly to the resulting HTML markup (and/or use the matching HTML styles).
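
    That reverse process could be sketched as follows, assuming the Open XML SDK (DocumentFormat.OpenXml) is referenced. DocxListSketch, ListTagFor and ToHtml are hypothetical names, not part of the SDK. The sketch follows w:numPr to the numbering definitions to choose between ul and ol. It ignores numbering inherited from paragraph styles, level overrides and nesting, so treat it as a starting point rather than a full converter:

```csharp
using System.Linq;
using System.Net;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper class for illustration only.
public static class DocxListSketch
{
    // Returns "ul" for bulleted paragraphs, "ol" for other numbered paragraphs,
    // and null when the paragraph has no direct numbering properties.
    static string ListTagFor(Paragraph p, Numbering numbering)
    {
        NumberingProperties numPr = p.ParagraphProperties?.NumberingProperties;
        int? numId = numPr?.NumberingId?.Val?.Value;
        if (numId == null || numId == 0) return null; // numId 0 switches numbering off
        int level = numPr.NumberingLevelReference?.Val?.Value ?? 0;

        // Follow w:num -> w:abstractNum -> w:lvl -> w:numFmt.
        int? abstractId = numbering?.Elements<NumberingInstance>()
            .FirstOrDefault(n => n.NumberID?.Value == numId)
            ?.AbstractNumId?.Val?.Value;
        if (abstractId == null) return "ol"; // definition not found: assume numbered

        Level lvl = numbering.Elements<AbstractNum>()
            .FirstOrDefault(a => a.AbstractNumberId?.Value == abstractId)
            ?.Elements<Level>()
            .FirstOrDefault(l => l.LevelIndex?.Value == level);
        NumberFormatValues? format = lvl?.NumberingFormat?.Val?.Value;
        return format == NumberFormatValues.Bullet ? "ul" : "ol";
    }

    // Converts the paragraphs of a .docx file to flat HTML,
    // wrapping consecutive list paragraphs in <ul> or <ol>.
    public static string ToHtml(string path)
    {
        var html = new StringBuilder();
        string openTag = null; // list tag currently open, if any

        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            MainDocumentPart main = doc.MainDocumentPart;
            Numbering numbering = main.NumberingDefinitionsPart?.Numbering;

            foreach (Paragraph p in main.Document.Body.Descendants<Paragraph>())
            {
                string tag = ListTagFor(p, numbering);
                if (tag != openTag)
                {
                    if (openTag != null) html.Append("</" + openTag + ">");
                    if (tag != null) html.Append("<" + tag + ">");
                    openTag = tag;
                }
                string text = WebUtility.HtmlEncode(p.InnerText);
                html.Append(tag != null ? "<li>" + text + "</li>" : text + "<br />");
            }
            if (openTag != null) html.Append("</" + openTag + ">");
        }
        return html.ToString();
    }
}
```

    In the page from the question, Response.Write(DocxListSketch.ToHtml(file)) would then let the browser draw the bullets and numbers itself, while hand-typed markers such as • or ► stay in the text as they are.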

    Not easy. As pointed out by mgebhard, you may find better support in a dedicated OpenXML forum.

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Thursday, April 1, 2021 8:35 PM

All replies

  • User753101303 posted

    Hi,

    It seems you assumed those characters are really part of the actual string. More likely, as in HTML, they are generated from the paragraph properties that cause the paragraph to be formatted as a bullet list.

    I'm not familiar with the Open XML SDK, but if you start by enumerating all paragraphs and stopping on a known InnerText, you should then be able to inspect the paragraph with the debugger to understand what causes this paragraph to be a bulleted list.
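
    A minimal sketch of that inspection, assuming the Open XML SDK (DocumentFormat.OpenXml) is referenced. ParagraphDump and Describe are hypothetical names, not part of the SDK. It lists each paragraph's text next to its raw paragraph-properties XML, so a w:numPr element stands out on paragraphs that Word formats as list items:

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical diagnostic helper for illustration only.
public static class ParagraphDump
{
    // One entry per paragraph: the text, then the paragraph-properties XML or "(none)".
    public static List<string> Describe(string path)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            return doc.MainDocumentPart.Document.Body
                .Descendants<Paragraph>()
                .Select(p => p.InnerText + " | pPr: " + (p.ParagraphProperties?.OuterXml ?? "(none)"))
                .ToList(); // materialize before the document is disposed
        }
    }
}
```

    Paragraphs whose bullet was typed by hand should show the character in the text part and no w:numPr in the properties part.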

    Thursday, April 1, 2021 11:28 AM
  • User-1453200658 posted

    Hi, thanks for the reply...

    In the VS2019 debugger the returned string is:

    Section 1<br /> <br />- Para 1.1 <br />Content 1.1<br />test 2<br />test 3<br />•Gaio Giulio Cesare<br />•Quinto Orazio Flacco<br />•Marco Porcio Catone<br /> <br /> <br />Section 2<br /> <br />- Para 2.1<br />Content 2.1<br />test 4<br />test 5<br />- Gaio Giulio Cesare<br />- Quinto Orazio Flacco<br />- Marco Porcio Catone<br />► Marco Porcio Catone<br />► Quinto Orazio Flacco<br />► Gaio Giulio Cesare<br />

    These bulleted and numbered lists of the Word file are displayed without the number and without the • character:

    1.	Content 1.1
    2.	test 2
    3.	test 3
    
    •	Content 2.1
    •	test 4
    •	test 5
    

    Thursday, April 1, 2021 11:51 AM
  • User753101303 posted

    Ah, got it. So it seems this document includes both hand-written bullets and actual bulleted lists? You could use https://tech.trailmax.info/2014/04/open-xml-sdk-tool-to-analyse-documents-and-generated-c-code/ to understand the structure of the document.

    For "true" bulleted lists you'll likely have to use the HTML ol or ul tag to render them as HTML bulleted lists.

    For simple documents that should be OK, but if you are trying to write a DOCX to HTML converter you will likely find existing libraries, especially if you also want to handle images, tables etc. rather than just basic DOCX documents...

    Edit: or maybe add explicit bullet characters if you prefer? For now it doesn't seem related to your regexp.

    Thursday, April 1, 2021 12:54 PM
  • User-1453200658 posted

    Yes, the document includes both hand-written bullets and actual bulleted lists...

    I don't need to handle images, tables...

    But if the regular expression already allows for digits and the bullet character, why are they not displayed correctly?

    Thursday, April 1, 2021 2:48 PM
  • User753101303 posted

    I would debug this by skipping the Where clause for now and looking at InnerText and the IsMatch result in my loop. Could it be that the paragraph starts with a line break or something similar?

    I'm not sure about the purpose of this regexp, as it seems for now that you want to export the whole document. If it is meant to filter out unwanted content, I would perhaps explicitly exclude what I don't want rather than trying to match only what I want, especially as it seems you have missing cases for now.
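
    A small sketch of that check, using the pattern from the question. RegexCheck and Explain are hypothetical names. It needs no Word document, only the strings that InnerText would return:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class RegexCheck
{
    // Same pattern as in the question: the first character must be
    // whitespace, a letter, a digit, '•', '-' or '►'.
    private static readonly Regex Reg = new Regex(@"^[\s\p{L}\d\•\-\►]");

    // Returns "[text] => True" or "[text] => False" so each paragraph can be checked by eye.
    public static string Explain(string innerText)
    {
        return "[" + innerText + "] => " + Reg.IsMatch(innerText);
    }
}
```

    Explain("Content 1.1") returns "[Content 1.1] => True": a real list item passes the filter through \p{L}, because its InnerText starts with the letter C. The pattern can only keep or drop a paragraph; it cannot add a bullet or number that is not in the text. An empty string returns "[] => False".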

    Thursday, April 1, 2021 4:32 PM
  • User-1453200658 posted

    Thanks for the suggestion.

    I have tried without Regex using

                    foreach (Paragraph co in wordDoc.MainDocumentPart.Document.Body.Descendants<Paragraph>())
                    {
                       contents += co.InnerText + "<br />";
                    }
    
                    Response.Write(contents);

    The VS2019 debugger returns this string (no changes...):

    Section 1<br /> <br />- Para 1.1 <br />Content 1.1<br />test 2<br />test 3<br />•Gaio Giulio Cesare<br />•Quinto Orazio Flacco<br />•Marco Porcio Catone<br /> <br /> <br />Section 2<br /> <br />- Para 2.1<br />Content 2.1<br />test 4<br />test 5<br />- Gaio Giulio Cesare<br />- Quinto Orazio Flacco<br />- Marco Porcio Catone<br />► Marco Porcio Catone<br />► Quinto Orazio Flacco<br />► Gaio Giulio Cesare<br />

    Thursday, April 1, 2021 5:45 PM
  • User475983607 posted

    You'll find better support if you ask this question on an OpenXML support forum. This is an ASP.NET support forum.

    https://social.msdn.microsoft.com/Forums/office/en-US/home?forum=oxmlsdk

    Thursday, April 1, 2021 7:30 PM