What's the fastest way to read a .docx file line-by-line in C#?

  • Question

  • User-2004582644 posted

    Hello there,

    After upload I want to read a .docx file line by line.

    My file.docx is divided into chapters, and each chapter into paragraphs.

    The structure of file.docx:

    Chapter 1 - Events

    • alert or disservices
    • significant activities

    Chapter 2 – Safety

    • near miss
    • security checks

    Chapter 3 – Training

    • environment
    • upkeep

    I need to read the .docx file line by line and, depending on the chapter, insert the content of each paragraph into the corresponding database table.

    e.g.

    Chapter 1 - Events
    - alert or disservices
    Lorem ipsum dolor sit amet, consectetur adipiscing elit ….
    …. ….
    …. ….
    - significant activities
    Phasellus dui nunc, rutrum vitae dictum eleifend, ullamcorper hendrerit sem ….
    …. ….
    …. ….

    must be inserted into the events table:

    -- ----------------------------
    -- Table structure for events
    -- ----------------------------
    DROP TABLE IF EXISTS `events`;
    CREATE TABLE `events` (
      `sID` int(11) NOT NULL AUTO_INCREMENT,
      `alert_or_disservices` longtext,
      `significant_activities` longtext,
      PRIMARY KEY (`sID`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

    I would like to know how to do this as efficiently as possible in C# / .NET.

    Please can you help me?

    Thanks in advance for any help or suggestion

    Monday, September 14, 2020 4:43 PM

Answers

  • User-939850651 posted

    Hi Chevy Marl Sunderland,

    If you need to read a Word document line by line, have you tried using Microsoft.Office.Interop.Word? You can read the document with it, process the content (split it into paragraphs, pick out the table names, and so on), and finally insert the data into the database.

    Something like this:

    protected void Page_Load(object sender, EventArgs e)
    {
        // Starts a Word instance on the server, so Word must be installed there.
        var word = new Microsoft.Office.Interop.Word.Application();
        Microsoft.Office.Interop.Word.Document docs = null;
        try
        {
            object miss = System.Reflection.Missing.Value;
            object path = @"YourFilepath\file.docx";
            object readOnly = true;
            docs = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss,
                        ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss);

            var totaltext = new System.Text.StringBuilder();   // the whole document
            // Word collections are 1-based, so iterate from 1 to Count inclusive.
            for (int i = 1; i <= docs.Paragraphs.Count; i++)
            {
                // Determine the beginning of an entire paragraph and intercept the table name
                // Get the column name
                // ......
                totaltext.Append(docs.Paragraphs[i].Range.Text);
            }
            Response.Write(totaltext.ToString());
        }
        finally
        {
            // Always release Word, even if reading fails; otherwise the Word process can stay running.
            if (docs != null) docs.Close();
            word.Quit();
        }
    }

    For your requirement, it may be more suitable to collect the paragraphs in a List<string> rather than in one concatenated string.
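
    A minimal sketch of that idea, assuming the paragraph texts are already in a list and that the document looks like the example in the question: chapter headings start with "Chapter " and section headings start with "-" or "•". The ChapterParser class and its Group method are hypothetical names used only for illustration. If the bullets are real Word list formatting, the bullet character may not be part of the paragraph text, and the section headings would then have to be recognized some other way (for example, by comparing against the known column names).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: groups paragraph texts as chapter -> section -> body text.
public static class ChapterParser
{
    public static Dictionary<string, Dictionary<string, string>> Group(IEnumerable<string> paragraphs)
    {
        var result = new Dictionary<string, Dictionary<string, string>>();
        var body = new StringBuilder();
        string chapter = null;
        string section = null;

        // Stores the text collected so far under the current chapter and section.
        void Flush()
        {
            if (chapter != null && section != null)
                result[chapter][section] = body.ToString().Trim();
            body.Clear();
        }

        foreach (var raw in paragraphs)
        {
            var line = (raw ?? string.Empty).Trim();
            if (line.Length == 0) continue;

            if (line.StartsWith("Chapter ", StringComparison.OrdinalIgnoreCase))
            {
                Flush();
                // "Chapter 1 - Events" -> "Events"; the separator may be '-' or '–'.
                int sep = line.IndexOfAny(new[] { '-', '–' });
                chapter = sep >= 0 ? line.Substring(sep + 1).Trim() : line;
                section = null;
                if (!result.ContainsKey(chapter))
                    result[chapter] = new Dictionary<string, string>();
            }
            else if (chapter != null && (line[0] == '-' || line[0] == '•'))
            {
                Flush();
                // "- alert or disservices" -> "alert or disservices"
                section = line.TrimStart('-', '•', ' ');
            }
            else if (section != null)
            {
                body.AppendLine(line);
            }
        }

        Flush();
        return result;
    }
}
```

    With the structure from the question, ChapterParser.Group(paragraphs)["Events"]["alert or disservices"] would hold the text that belongs in the alert_or_disservices column. A body paragraph that itself starts with "-" would be mistaken for a section heading, so the detection rule needs adapting to the real document.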

    Hope this can help you.

    Best regards,

    Xudong Peng

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Tuesday, September 15, 2020 8:51 AM

All replies

  • User475983607 posted

    You need a docx API. OpenXML is a good one. You'll need to read the documentation to learn how to use the API.

    Monday, September 14, 2020 4:56 PM
  • User753101303 posted

    Hi,

    For OpenXML see the documentation at https://docs.microsoft.com/en-us/office/open-xml/word-processing 
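
    A minimal sketch of reading the paragraphs with the Open XML SDK, assuming the DocumentFormat.OpenXml NuGet package is referenced by the project. The DocxReader class and ReadParagraphs method are hypothetical names. Note that a .docx file stores paragraphs, not lines: line breaks are decided by Word at layout time, so "line by line" in practice means "paragraph by paragraph".

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper built on the Open XML SDK (DocumentFormat.OpenXml package).
public static class DocxReader
{
    // Returns the plain text of every paragraph in document order.
    public static List<string> ReadParagraphs(string path)
    {
        // false = open read-only; Word does not need to be installed.
        using (var doc = WordprocessingDocument.Open(path, false))
        {
            var body = doc.MainDocumentPart.Document.Body;

            // Descendants also returns paragraphs nested inside tables.
            return body.Descendants<Paragraph>()
                       .Select(p => p.InnerText)
                       .ToList();
        }
    }
}
```

    The resulting list can then be inspected paragraph by paragraph to decide which chapter and column each piece of text belongs to.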

    I would avoid Microsoft.Office.Interop.Word as much as I can. It drives Microsoft Word itself, which would need to be installed on the web server, and it was intended to give end users automation capabilities rather than to serve as a real programming library.

    Tuesday, September 15, 2020 9:30 AM
  • User-2004582644 posted

    Thank you all for the help and suggestions.

    The hosting server has Microsoft Office installed, but not the Open XML SDK.

    I have asked for the Open XML SDK to be installed, but I don't know how long that will take.

    I tried Xudong Peng's suggestion and the return is

    but this part is not clear to me.

    Could you explain it in more detail? Thank you.

    //Determine the beginning of an entire paragraph and intercept the table name
    //Get the column name

    Tuesday, September 15, 2020 2:28 PM
  • User753101303 posted

    That seems to be just a suggestion, and what it means depends on the kind of content you have in those paragraphs. If you have Word tables, you may have to inspect them and extract the relevant data. Or do you just want to process bulleted lists?

    For Microsoft Word, consider https://support.microsoft.com/en-us/help/257757/considerations-for-server-side-automation-of-office . Also, if Word shows any dialog it will stop (depending on what you are doing, you may have to disable warnings and so on). Also make sure you will be able to install Word on the real web server (if you use a hosting service, a dedicated server is likely required).

    OpenXML is just a reference to a DLL that ships, as usual, with all your other referenced DLLs.
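
    If adding that reference is not possible for some reason, here is a rough sketch that uses only the .NET base class library. It relies on the fact that a .docx file is a ZIP package whose main text is normally stored in word/document.xml. The DocxZipReader class is a hypothetical name, and the sketch assumes the usual (transitional) WordprocessingML namespace and .NET Framework 4.5 or later with a reference to System.IO.Compression.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: reads paragraph text straight from the .docx ZIP package.
public static class DocxZipReader
{
    // Assumption: the document uses the transitional WordprocessingML namespace.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static List<string> ReadParagraphs(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            var entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found in package.");

            using (var xml = entry.Open())
            {
                var doc = XDocument.Load(xml);

                // Each w:p is a paragraph; its text is spread over one or more w:t runs.
                return doc.Descendants(W + "p")
                          .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                          .ToList();
            }
        }
    }
}
```

    For an uploaded file, the stream could come from the upload itself (for example, File.OpenRead on the saved path). The Open XML SDK remains the more robust choice, since it handles package details that this sketch ignores.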

    Tuesday, September 15, 2020 2:54 PM
  • User475983607 posted

    Office interop is not recommended by Microsoft for use in a web server application. Office interop actually opens an instance of Word on the server for each request. This can cause the web application to become unresponsive. You really should use OpenXML as recommended.

    Tuesday, September 15, 2020 2:55 PM
  • User-2004582644 posted

    Okay, thanks

    but do you have any example of OpenXML for my case?

    Tuesday, September 15, 2020 3:06 PM
  • User475983607 posted

    but do you have any example of OpenXML for my case?

    No, but OpenXML has a support forum: https://social.msdn.microsoft.com/Forums/office/en-US/home?forum=oxmlsdk

    Tuesday, September 15, 2020 3:36 PM