none
SpreadsheetDocument Dispose Performance RRS feed

  • Question

  • Open XML SDK Gurus:

    I'm working on an ASP.NET solution to generate an Open XML Spreadsheet document based on the server.  For this development effort, I'm using the Open XML SDK 2.  I generated the "template" C# code for the spreadsheet using the reflector command built into the OpenXmlSdkTool that ships with the SDK. 

    The code works (a valid spreadsheet file is generated by the code), but the issue that I am currently trying to resolve is one of performance.  I found that the time that it takes the code to generate a spreadsheet document is around 8 seconds for a spreadsheet containing 9 columns and 20,000+ rows.  When dissecting the overall time, I found that only approx 2.4 seconds is required for the actual generation of content for the worksheet, but around 5.5.seconds is consumed when calling the Dispose method on the SpreadsheetDocument object.   

    Anyone else experience this slow performance when disposing of SpreadsheetDocument objects? 
    Is there any way to improve the performance of this process? 
    Am I perhaps doing something wrong, or should I be considering an alternate implementation?

    Sample code is shown below; note that in the CreateParts method, I have similar timing code to that shown below.  The CreateParts method consistently executes around 2.4 seconds (2390 milliseconds).  The CreatePackage method, however, takes around 7.8 seconds (7827 milliseconds) which results in a net time of 5.4 seconds (7.8 - 2.4 = 5.4).  I've rewritten the code to create and dispose the SpreadDocument object without using the "using" statement and found that the 5.4 seconds stems from the call to Dispose().

    Thanks in advance.

    D

            // Creates a SpreadsheetDocument.
    
            private void CreatePackage()
    
            {
    
                DateTime startTime = DateTime.Now;
    
    
    
                using(SpreadsheetDocument package = SpreadsheetDocument.Create(stream, SpreadsheetDocumentType.Workbook))
    
                {
    
                    CreateParts(package);
    
                }
    
                
    
                DateTime stopTime = DateTime.Now;
    
                System.IO.StreamWriter writer = new System.IO.StreamWriter("c:\\temp\\xlsxperfsum.log");
    
                TimeSpan ts = stopTime - startTime;
    
                writer.WriteLine("CreatePackage(): " + ts.TotalMilliseconds);
    
                writer.Close();
    
            }
    
    
    • Edited by D Nickels Friday, March 12, 2010 3:08 PM
    Thursday, March 11, 2010 4:27 PM

Answers

  • Hi, D Nickels,

    Thank you very much for your detail description.

    I have re-checked my sample code, and get the same running result as you do. I have found several things, which may be of help.

    First of all, when Dispose is called, the Open XML Spreadsheet document in the memory are writing out to the disk to save , which is the most time-comsuming. Thus, I think time spending on this method is reasonable.

    Secondly, in the last reply of mine, I ran faster, because  I donot specify the DataType of the cell. In this case, some infomation are missed.

    As far as I am concerned, I donot find any ways to improve the performance.

    Thanks,
    Raymond 
    • Marked as answer by D Nickels Tuesday, March 16, 2010 1:26 PM
    Tuesday, March 16, 2010 9:54 AM

All replies

  • Hi D Nickels,

    Thanks for your question.

    To clarify your scenario, could you please describe "Dispose" in detail? Do you mean the generated sample code by the Tool takes too long time? If so, would you please show the sample code generated by the Tool to help find out what's the problem? Or do you mean it takes too long time on the SpreadsheetDocument.Create() method?

    Thanks,

    Lu
    Friday, March 12, 2010 6:17 AM
  • Hi Lu,

    Maybe this will clarify:

    The original code generated by the OpenXmlSdkTool looked like below:

            // Creates a SpreadsheetDocument.
            private void CreatePackage()
            {
                using(SpreadsheetDocument package = SpreadsheetDocument.Create(stream, SpreadsheetDocumentType.Workbook))
                {
                    CreateParts(package);
                }
            }
    This method was taking on average 8 seconds to complete.  To isolate which exact call was taking so long, I split the code in the method into the following:

            // Creates a SpreadsheetDocument.
            private void CreatePackage()
            {
                //using(SpreadsheetDocument package = SpreadsheetDocument.Create(stream, SpreadsheetDocumentType.Workbook))
                //{
                //    CreateParts(package);
                //}
    
                DateTime packageCreateStart = DateTime.Now;
                SpreadsheetDocument package = SpreadsheetDocument.Create(stream, SpreadsheetDocumentType.Workbook);
                DateTime packageCreateStop = DateTime.Now;
    
                DateTime partsCreateStart = DateTime.Now;
                CreateParts(package);
                DateTime partsCreateStop = DateTime.Now;
    
                DateTime packageDisposeStart = DateTime.Now;
                package.Dispose();
                DateTime packageDisposeStop = DateTime.Now;
    
                TimeSpan packageCreate = packageCreateStop - packageCreateStart;
                TimeSpan partsCreate = partsCreateStop - partsCreateStart;
                TimeSpan packageDispose = packageDisposeStop - packageDisposeStart;
    
                System.IO.StreamWriter writer = new System.IO.StreamWriter("c:\\temp\\perflog.log");
                writer.WriteLine("packageCreate: " + packageCreate.TotalMilliseconds);
                writer.WriteLine("partsCreate: " + partsCreate.TotalMilliseconds);
                writer.WriteLine("packageDispose: " + packageDispose.TotalMilliseconds);
                writer.Close();
            }
    The log file created by a call to the method contains the following results:

    packageCreate: 15.6246
    partsCreate: 1906.2012
    packageDispose: 5218.6164

    As you can see, the call to "package.Dispose()" is taking by far the bulk of the time (5.2 seconds).  Is there something else that I should be doing to make the code more performant?  Is this performance perhaps an issue with the CTP version of the Open XML SDK 2 that is planned to be addressed?

    Thanks for taking the time to review.
    D
    Friday, March 12, 2010 3:07 PM
  • Hi D Nickels,

    Thanks for your detailed information.

    Firstly, it is needed to point out that when calling package.Dispose() method, it will save the package to disk and deal with garbage collection. It is reasonable such IO related operations will take much more time than memory based operations. And garbage collection also needs to take some time.

    To help investigate into your scenario, could you please show us the content of your spreadsheet or how you generate it in CreateParts(package)? The performance may be related to the data type as well.

    Hope this helps. If you have any question, please let me know.

    Thanks,

    Lu

    Monday, March 15, 2010 9:37 AM
  • hi D Nickels,

    Thanks for your question.

    I have tried to generate a a spreadsheet having 9 columns and 20,000 rows, all the value of the cells are all "1".
    The running results are as following:
    packageCreate: 39.0039 ms
    partsCreate: 1361.1361 ms
    packageDispose: 1526.1526 ms

    IF setting the value of the cells to the random value, the running results are:
    packageCreate: 38.0038 ms
    partsCreate: 1445.1445 ms
    packageDispose: 2358.2358 ms

    The are some time consuming on dispose of the string value, however, the difference between PartsCreate and packageDispose are not that much as you mentioned.

    Agree with Lu, would you please tell some more info: 
    1. what are the contents in your spreadsheet;
    2. what version of SDK are you using?

    B.T.W, I am using the RTM version of SDK, it is available at http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=c6e744e5-36e9-45f5-8d8c-331df206e0d0
    Monday, March 15, 2010 11:40 AM
  • Raymond,

    Prior to this morning, I was using the December CTP version of the Open XML SDK 2.  As of this morning, I have switched to the RTM version.  However, the performance issue still remains.

    To answer your question, the contents of the spreadsheet are as follows:

    Column 1 = Number
    Column 2 = String
    Column 3 = String
    Column 4 = Number
    Column 5 = Number
    Column 6 = Number
    Column 7 = Number
    Column 8 = Number
    Column 9 = String

    The code for the CreateParts method (asked for by Lu Zhang) is copied below; this code is essentially the code produced by the OpenXmlSdkTool:

            // Adds child parts and generates content of the specified part.
            private void CreateParts(SpreadsheetDocument document)
            {
                ExtendedFilePropertiesPart extendedFilePropertiesPart = document.AddNewPart<ExtendedFilePropertiesPart>("rId5");
                GenerateExtendedFilePropertiesPartContent(extendedFilePropertiesPart);
    
                WorkbookPart workbookPart = document.AddWorkbookPart();
                GenerateWorkbookPartContent(workbookPart);
    
                WorkbookStylesPart workbookStylesPart = workbookPart.AddNewPart<WorkbookStylesPart>("rId3");
                GenerateWorkbookStylesPartContent(workbookStylesPart);
    
                ThemePart themePart = workbookPart.AddNewPart<ThemePart>("rId2");
                GenerateThemePartContent(themePart);
    
                WorksheetPart worksheetPart = workbookPart.AddNewPart<WorksheetPart>("rId1");
                GenerateWorksheetPartContent(worksheetPart);
    
                SharedStringTablePart sharedStringTablePart = workbookPart.AddNewPart<SharedStringTablePart>("rId4");
                GenerateSharedStringTablePartContent(sharedStringTablePart);
    
                SetPackageProperties(document);
            }

    The GenerateWorksheetPartContent contains the logic of interest.  This method generates the spreadsheet data based on an XML string passed in to the class (via the class' constructor).  For performance, parsing of the XML string is done using XmlTextReader.  An abbreviated version of the method is copied below:

            private void GenerateWorksheetPartContent(WorksheetPart worksheetPart)
            {
                //create worksheet objects
                Worksheet worksheet = new Worksheet();
                worksheetPart.Worksheet = worksheet;
    
                //open query results XML using XmlTextReader
                //(XmlTextReader used for iteration for performance)
                TextReader stringReader = new StringReader(xml);
                XmlTextReader reader = new XmlTextReader(stringReader);


    //POPULATE columnDataTypes ARRAY WITH DATATYPE; NOT SHOWN HERE FOR SAKE OF BREVITY


    //iterate through query results SheetData sheetData = new SheetData(); uint rowIndex = 1; string lastCell = ""; while (reader.Read()) { //process /RECORDS/RECORD elements if (reader.NodeType == XmlNodeType.Element && reader.Name.Equals("RECORD")) { //create new row rowIndex++; Row row = new Row(); row.RowIndex = rowIndex; //process all child elements (columns) of RECORD while (reader.Read() && !(reader.NodeType == XmlNodeType.EndElement && reader.Name.Equals("RECORD"))) { if (reader.NodeType == XmlNodeType.Element) { lastCell = AddCellToRow(row, reader.ReadString(), columnDataTypes[row.ChildElements.Count], 0); } } sheetData.Append(row); } } //REMAINING LOGIC REMOVED FOR SAKE OF BREVITY }


    The GenerateWorksheetPartContent method calls another method named AddCellToRow for adding a cell.  This method takes the parameters Row, cell value, cell data type and cell style index as parameters.  In the above code for GenerateWorksheetPartContent , you can see that the cell data types come from an array.  Note that for cells containing "string" values, I'm using the CellValues.InlineString enum value, not the CellValues.SharedString enum value.  When I initially tried using SharedString, the performance of the GenerateWorksheetPartContent was very poor - in excess of 5 minutes to complete.  Switching to using InlineStrings decreased the performance significantly.  The time that it currently takes the GenerateWorksheetPartContent method to complete is < 2 seconds.  I know that the file size is much bigger, but if we have to choose between file size and time to generate the file, larger file sizes are the better choice for this particular implementation.  The code for the AddCellToRow method is copied below:

            //returns the cell ID of the newly added cell
            private string AddCellToRow(Row row, string value, CellValues type, uint styleIndex)
            {
                Cell cell = new Cell();
                
                //calculate column ID
                string columnId = "";
                //set initial dividend value to 1-based index of new cell
                //(i.e. cell #1 => index 1, cell #2 => index 2, etc)
                int dividend = (row.ChildElements.Count + 1);
                int modulo;
                while (dividend > 0)
                {
                    modulo = (dividend - 1) % 26;
                    columnId = Convert.ToChar(65 + modulo).ToString() + columnId;
                    dividend = (int)((dividend - modulo) / 26);
                }
                cell.CellReference = string.Format("{0}{1}", columnId, row.RowIndex);
    
                if (styleIndex > 0) cell.StyleIndex = styleIndex;
    
                //set cell data type
                cell.DataType = type;
    
                //handle cell value based on data type            
                switch (cell.DataType.Value)
                {
                    case CellValues.InlineString:
                        InlineString iString = new InlineString();
                        Text t = new Text();
                        t.Text = value;
                        iString.Append(t);
                        cell.Append(iString);
                        break;
                    case CellValues.SharedString:
                        //for string values, string must first be added to shared string table
                        CellValue sStringVal = new CellValue();
                        int sstIndex = AddItemToSharedStringTable(value);
                        sStringVal.Text = Convert.ToString(sstIndex);
                        cell.Append(sStringVal);
                        break;
                    case CellValues.Number:
                        //convert number strings to number and then back to text 
                        //(raw string values with commas cause excel to treat numbers as strings)
                        CellValue nValue = new CellValue();
                        double dblValue = Convert.ToDouble(value);
                        nValue.Text = Convert.ToString(dblValue);
                        cell.Append(nValue);
                        break;
                    case CellValues.Date:
                        CellValue dValue = new CellValue();
                        dValue.Text = value;
                        cell.Append(dValue);
                        break;
                    default:
                        CellValue cellValue = new CellValue();
                        cellValue.Text = value;
                        cell.Append(cellValue);
                        break;
                }            
    
                row.Append(cell);
    
                return cell.CellReference.Value;
            }
    Lastly, I'll note that the underlying stream for the SpreadsheetDocument is a memory stream (this logic is being executed in the context of an ASP.NET application and we don't want to write temp files to disk).  I don't know if the fact that a MemoryStream is being used is significant or not.

    D
    Monday, March 15, 2010 3:59 PM
  • Hi, D Nickels,

    Thank you very much for your detail description.

    I have re-checked my sample code, and get the same running result as you do. I have found several things, which may be of help.

    First of all, when Dispose is called, the Open XML Spreadsheet document in the memory are writing out to the disk to save , which is the most time-comsuming. Thus, I think time spending on this method is reasonable.

    Secondly, in the last reply of mine, I ran faster, because  I donot specify the DataType of the cell. In this case, some infomation are missed.

    As far as I am concerned, I donot find any ways to improve the performance.

    Thanks,
    Raymond 
    • Marked as answer by D Nickels Tuesday, March 16, 2010 1:26 PM
    Tuesday, March 16, 2010 9:54 AM
  • Thanks Raymond.  Not the answer that I was hoping to hear, but I'm glad that a second person was able to confirm the result that I've been getting.  I appreciate the time that you've taken in reviewing this issue.

    D
    Tuesday, March 16, 2010 1:26 PM
  • I'm also working with a spreadsheet created from scratch in ASP.NET, and have experienced the same performance issues.  I want to deliver the spreadsheet without touching the SDK touching the disk at any intermediate step; is this possible?
    Tuesday, March 30, 2010 4:28 PM
  • Hi mceaton,

    Thanks for your question.

    One thing I need to be clear about your scenario is that how do you deliver the spreadsheet? Could you give us a detailed description about it? This may help find a solution to your problem.

    BTW, if you want to avoid SDK touching the disk at any intermediate step, you could do anything to the package and dispose it finally. There will not be any IO operation when you process the document before disposing it.

    Thanks,

    Lu

    Wednesday, March 31, 2010 8:45 AM
  • Lu,

    Can you clarify what you mean by the statements:

    "...if you want to avoid SDK touching the disk at any intermediate step, you could do anything to the package and dispose it finally. There will not be any IO operation when you process the document before disposing it."

    As an FYI to anyone else who happens to stumble across this post, I opened a Connect ticket for the dispose performance asking that it be reviewed to see if there is anything that can be done to boost performance when a MemoryStream is used as the backing stream for the package.  The issue can be found at https://connect.microsoft.com/site589/feedback/details/542244/spreadsheetdocument-dispose-performance?wa=wsignin1.0.

    Thanks,

    D

    Thursday, April 1, 2010 1:10 PM
  • Hi D Nickels,

    By the statement in my last post, I mean that since the dispose process takes a lot of time, you could call it after all the memory operations are done instead of writing to the disk in the middle of the process. It is needed to dispose finally, right? But you could avoid calling it in the middle of the process.

    BTW, could you describe your requirement in detail? I just think you need to avoid the time consumed on IO operations and maybe there is some understanding.

    Hope this clarifies. If you have any question, please let me know.

    Thanks,

    Lu

    Tuesday, April 6, 2010 2:34 AM