Can we use multithreading to convert Microsoft Word documents to HTML in C#?

  • Question

  • I have a Windows Service that polls the database for uploaded documents of type doc, docx, pdf and rtf, converts them to HTML, and saves them to the local file system. The documents are fetched from the database and queued in memory, then picked up from the shared queue by multiple threads for processing.

    The problem I am facing is that processing becomes slower over time. Conversion is fast for the first few days, say 2 seconds for a 50 KB document, and slow a few days later, say 20 seconds for the same document. All I can see is conversion speed declining steadily as the days go by. I couldn't nail down what is causing this. Even restarting the Windows Service does not help.

    The code snippet is as follows:

    // SendMessage adds the document message to the queue.
    [MethodImpl(MethodImplOptions.Synchronized)]
    public void SendMessage(ServiceMessage Message)
    {
        if (synchQ != null)
            synchQ.Enqueue(Message);
    }

    // Start instantiates the worker threads for processing.
    [MethodImpl(MethodImplOptions.Synchronized)]
    public void Start()
    {
        workerThreads = new Thread[ApplicationContext.Instance.NumberOfThreads];
        // Worker threads are created based on the config settings.
        for (int i = 0; i < ApplicationContext.Instance.NumberOfThreads; i++)
        {
            ThreadStart threadStart = new ThreadStart(MainLoop);
            workerThreads[i] = new Thread(threadStart);
            workerThreads[i].Name = "Thread_" + i;
            workerThreads[i].Start();
        }
    }

    private void MainLoop()
    {
        for (; ; )
        {
            if (q != null)
            {
                // If there are no messages to pass on, just fire with a null.
                if (q.Count == 0)
                {
                    Heartbeat.DynamicInvoke(new object[] { null });
                }
                else
                {
                    // Get the message from the queue.
                    ServiceMessage msg = null;
                    // locker is a private static readonly variable.
                    lock (locker)
                    {
                        if (synchQ != null)
                        {
                            if (q.Count > 0)
                                msg = (ServiceMessage)synchQ.Dequeue();
                        }
                    }
                    if (msg != null)
                    {
                        Heartbeat.DynamicInvoke(new object[] { msg.Args });
                    }
                }
            }
        }
    }


    // The DynamicInvoke delegate invokes this private method to convert the Word document to HTML.
    // Third party converter which converts Doc,Docx to HTML. Microsoft office is required
    // to be installed for the converter to work. The converter reads the document
    // from the local file system and generates the output to the local file system.
    UseOffice wordDocConverter = new UseOffice();
    UseOffice.eDirection eDirection = UseOffice.eDirection.DOC_to_HTML;
    string htmlFileName = string.Empty;
    string fileName = Path.GetFileName(fullPath);
    // Check "docx" first, because a name that contains "docx" also contains "doc".
    if (fileName.Contains("docx"))
        eDirection = UseOffice.eDirection.DOCX_to_HTML;
    else if (fileName.Contains("doc"))
        eDirection = UseOffice.eDirection.DOC_to_HTML;

    htmlFileName = Path.GetFileNameWithoutExtension(fullPath) + ".html";
    outputFilePath = Path.Combine(fileOutputPath, htmlFileName);
    try
    {
        // InitWord return values:
        // 0 - Loaded successfully
        // 1 - Can't load MS Word® library in memory
        int isWordInitialized = wordDocConverter.InitWord();
        if (isWordInitialized == 1)
            return null;
        else
            // This is where the problem is. Converting a document to HTML takes longer as the days go by.
            errorCode = wordDocConverter.ConvertFile(fullPath, outputFilePath, eDirection);
    }
    catch (Exception)
    {
        // Rethrow with "throw;" so the original stack trace is preserved.
        throw;
    }
    finally
    {
        wordDocConverter.CloseWord();
    }

    The load is roughly 2000 Word documents per day.

    So my question is: can we use multithreading to convert Microsoft Word documents to HTML? Is it failing to scale because of multithreading?

    Your answers are highly appreciated.

    Saturday, May 14, 2011 10:58 PM

Answers

  • Hi PSiva

    <<//Third party converter which converts Doc,Docx to HTML. Microsoft office is required to be installed for the converter to work. The converter reads the document //from the local file system and generates the output to the local file system.>>

    This is probably the issue, right here. This converter is probably a "black box" for you, but I'm assuming it's starting the Office application, opening documents, closing them, etc. Office was designed as an end-user tool and not for use in a server environment. With time, if the applications are never completely released, a lot of "junk" piles up in the form of temporary files, scratch files, etc. Without knowing exactly how that tool works, it's difficult to know exactly what the cause is. But the bottom line is that you probably aren't going to be able to solve this with your scenario and that particular tool.

    If you were to shut down Windows on the machine where Office is being used, then start it up again, you'd probably see speeds go back up.
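    One way to test the guess above, that Word instances and scratch files are piling up, is to count them from time to time and compare the counts with the conversion times. The following is only a rough diagnostic sketch: the class and method names are made up for illustration, "WINWORD" is the usual process name for Word, and the "~*" file pattern for Word scratch files is an assumption that may need adjusting. Note that a Windows Service usually runs under a different account than the logged-in user, so its temp folder may differ from yours.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration; not part of the converter library.
public static class OfficeLeftovers
{
    // Counts running processes with the given name (without ".exe").
    public static int CountProcesses(string processName)
    {
        Process[] processes = Process.GetProcessesByName(processName);
        try
        {
            return processes.Length;
        }
        finally
        {
            // Release the handles held by the Process objects.
            foreach (Process p in processes)
                p.Dispose();
        }
    }

    // Counts files directly under the folder that match the search pattern.
    // Returns 0 if the folder does not exist.
    public static int CountFiles(string folder, string pattern)
    {
        if (!Directory.Exists(folder))
            return 0;
        return Directory.EnumerateFiles(folder, pattern).Count();
    }

    public static void Main()
    {
        // Assumption: leftover Word instances show up as WINWORD processes,
        // and Word scratch files in the temp folder start with "~".
        Console.WriteLine("WINWORD processes: " + CountProcesses("WINWORD"));
        Console.WriteLine("Temp files matching ~*: "
            + CountFiles(Path.GetTempPath(), "~*"));
    }
}
```

    If either number keeps growing from day to day while conversions get slower, that would support the idea that the converter is not releasing Word completely.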

    And no, multi-threading won't solve this.


    Cindy Meister, VSTO/Word MVP
    Thursday, May 19, 2011 1:28 PM
    Moderator

All replies

  • Sure, you can do it. However, I don’t think using multithreading will speed up the processing. You can try moving all the documents into one folder, iterating over each file in the folder, and converting them to HTML one by one.
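    The folder approach could look roughly like the sketch below. It walks one folder on a single thread and times each conversion, which also makes it easier to see when things start slowing down. The class name, the ConvertAll method and the convert delegate are hypothetical; the delegate stands in for whatever call actually does the conversion, and the demo in Main only writes a placeholder file.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

// Hypothetical helper for illustration.
public static class SequentialConverter
{
    // Converts every file in inputFolder that matches pattern, one at a time.
    // "convert" is a stand-in for the real converter call and receives
    // (inputPath, outputPath). Returns the elapsed time per input file.
    public static Dictionary<string, TimeSpan> ConvertAll(
        string inputFolder, string outputFolder, string pattern,
        Action<string, string> convert)
    {
        Directory.CreateDirectory(outputFolder);
        var timings = new Dictionary<string, TimeSpan>();
        foreach (string inputPath in Directory.GetFiles(inputFolder, pattern))
        {
            string outputPath = Path.Combine(
                outputFolder,
                Path.GetFileNameWithoutExtension(inputPath) + ".html");
            Stopwatch watch = Stopwatch.StartNew();
            convert(inputPath, outputPath);
            watch.Stop();
            timings[inputPath] = watch.Elapsed;
        }
        return timings;
    }

    public static void Main()
    {
        // Demo with throwaway folders and a fake converter that only writes
        // a placeholder file; replace the lambda with the real conversion.
        string input = Path.Combine(Path.GetTempPath(), "convert-demo-in");
        string output = Path.Combine(Path.GetTempPath(), "convert-demo-out");
        Directory.CreateDirectory(input);
        File.WriteAllText(Path.Combine(input, "sample.docx"), "placeholder");

        Dictionary<string, TimeSpan> timings = ConvertAll(
            input, output, "*.docx",
            (src, dst) => File.WriteAllText(dst, "<html></html>"));
        foreach (KeyValuePair<string, TimeSpan> entry in timings)
            Console.WriteLine(entry.Key + ": "
                + entry.Value.TotalMilliseconds + " ms");
    }
}
```

    Logging these timings every day would show whether the slowdown also happens without any worker threads involved.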


    Apple
    Thursday, May 19, 2011 8:46 AM