none
How to validate large Office Open XML xlsx documents? RRS feed

  • Question

  • I want to validate xlsx documents using OpenXMLValidator and the code from the MS examples: 

     try {
        OpenXmlValidator validator = new OpenXmlValidator();
        int count = 0;
        foreach (
            ValidationErrorInfo error in validator.Validate(
                SpreadsheetDocument.Open(xlsxFile, true))
         )
         {
             count++;
             Console.WriteLine("Error " + count);
             Console.WriteLine("Description: " + error.Description);
             Console.WriteLine("Path: " + error.Path.XPath);
             Console.WriteLine("Part: " + error.Part.Uri);
             Console.WriteLine("-------------------------------------------");
          }
              Console.ReadKey();
    } catch (Exception ex) {
        Console.WriteLine(ex.Message);
    } finally {
    
    }
    

    Everything works fine for the rather small documents.

    Both my application and Open XML Productivity Tool validate these documents fine.

    But the real problem is that I have files with more than 300000 rows (about 150 megabytes sized). They have errors so I want to validate them. I can`t do it neither using Productivity Tool (cause it exits with error) nor my application (throws OutOfMemoryException).

    What advice can you give me?


    Friday, October 14, 2011 5:16 AM

Answers

  • Thank you, Chris.

    After many experiments with the data and rows in the spreadsheet document, I`ve found the root of the problem. That were the large string values in the cells.

    • Proposed as answer by cjatmsModerator Monday, October 24, 2011 1:22 PM
    • Marked as answer by Bruce Song Friday, October 28, 2011 8:46 AM
    Monday, October 24, 2011 8:47 AM

All replies

  • Hi mynameismasel,

    Thanks for posting in the MSDN Forum.

    In my opinion, the application must read all of the content into memory before validating it. I think the large size of you document lead your application to crash. I will consult senior engineers for it to see whether there exists good work round. It will take you some time for response thanks for your understanding and supporting.

    Have a good day,

    Tom

     


    Tom Xu [MSFT]
    MSDN Community Support | Feedback to us
    Get or Request Code Sample from Microsoft
    Please remember to mark the replies as answers if they help and unmark them if they provide no help.

    Monday, October 17, 2011 9:53 AM
    Moderator
  • Hi mynameismasel,

    Thanks for posting in the MSDN Forum.

    In my opinion, the application must read all of the content into memory before validating it. I think the large size of you document lead your application to crash. I will consult senior engineers for it to see whether there exists good work round. It will take you some time for response thanks for your understanding and supporting.

    Have a good day,

    Tom

     


    Tom Xu [MSFT]
    MSDN Community Support | Feedback to us
    Get or Request Code Sample from Microsoft
    Please remember to mark the replies as answers if they help and unmark them if they provide no help.


    Tom, thank you for the answer.

    But MS Excel itself performs validation rather quickly and I get a message about unreadable content instantly and without huge memory consumption. I thought I could do the same with Open XML SDK.

    Really I don`t care about the validation speed. I need only the validation itself.


    Monday, October 17, 2011 10:25 AM
  •  Hi mynameismasel,

    Excel has no issue with matching the contents against the schema – which is what the validator is doing. That’s inherent in Excel (or any other Office application that builds OpenOfficeXML files.) Possibly, when Excel finds its first error it stops and throws the Unreadable Content error.

     However, there really is an issue of working with too-large worksheets using the SDK.

    See the June, 2010 blog

    -Brian Jones & Zeyad Rajabi: Office Solutions
    http://blogs.msdn.com/b/brian_jones/rss.aspx

    In that blog he specifically addresses the problem with the too-large worksheet.

    You might consider the idea of working with the number of errors discussed in the following MSDN page:
    OpenXmlValidator.MaxNumberOfErrors Property (DocumentFormat ...
    http://msdn.microsoft.com/en-us/library/documentformat.openxml.validation.openxmlvalidator.maxnumberoferrors.aspx

    Since you get the unreadable content message please refer to this Forum thread:
    Create Excel file always results in Unreadable Content error on open
    http://social.msdn.microsoft.com/Forums/en/oxmlsdk/thread/cd4a5fac-9ebe-4530-851b-1fb5db0c7299

    As a side item of interest see the blog:
    Finding Open XML Errors with Open XML SDK Validation - Brian Jones ...
    http://blogs.msdn.com/b/brian_jones/archive/2009/09/24/finding-open-xml-errors-with-open-xml-sdk-validation.aspx

    Another side item is the discussion in the MSDN library starting with the tree of content at:
    XML Document Validation - MSDN – Explore Desktop, Web, Cloud ...
    http://msdn.microsoft.com/en-us/library/ms255815.aspx

    Please let us know whether this helps resolve your issue.

    Regards,
    Chris Jensen
    Senior Technical Support Lead

    Monday, October 17, 2011 8:23 PM
    Moderator
  •  Hi mynameismasel,

    ....

    Please let us know whether this helps resolve your issue.

    Regards,
    Chris Jensen
    Senior Technical Support Lead


    No, this doesn`t help.

    I had read the articles you`ve recommended just before posting the first message in the topic.

    Thursday, October 20, 2011 7:17 AM
  •  

    Hi mynameismasel,

    A different approach to identifying the source of the error you see is to see if you get an error if you validate a worksheet with only 200000 rows. If that works then you may need to consider the possibility that the cause of the problem is lack of available memory.

    Both Tom_Xu and Brian Jones say the application must read all of the content into memory before validating it. Brian Jones describes a workaround.

    You say “I had read the articles you’ve recommended just before posting the first message…” What was the result of using Brian Jones’ SAX approach?

    As an alternative to using the Validator tool you could try to find the problem within Visual Studio. Try this:

    Unzip the .xlsx file and open the worksheet (e.g. Sheet1.xml) as a file in Visual Studio 2010. Does the file open as xml in the Visual Studio edit window? To see the file as nodes between tags click the VS Edit button, scroll to the ‘Advance” bar and click the ‘Format Document’ bar. Do you see any indication of problems at the left of the <Sheet name>.xml window? The XML between the tags is validated against the schema listed as the attributes in the <Worksheet namespace element.

    If Visual Studio balks at opening the file it may give an explanation in the error dialog. That could explain the problem.

    Please share the results of your troubleshooting at this point.

    Regards,
    Chris Jensen
    Senior Technical Support Lead

    Thursday, October 20, 2011 3:10 PM
    Moderator
  •  

    Hi mynameismasel,

    A different approach to identifying the source of the error you see is to see if you get an error if you validate a worksheet with only 200000 rows. If that works then you may need to consider the possibility that the cause of the problem is lack of available memory.

    Well I created the spreadsheet with the same number of rows and columns with the fake data. It can`t be validated doo (the same OutOfMemoryException). I`m sure that the problem is lack of available memory and I`m also sure that there should be some workaround because MS Excel validates and repair file rather fast and without large memory consumption. 

     You say “I had read the articles you’ve recommended just before posting the first message…” What was the result of using Brian Jones’ SAX approach?

    Well, there was a promblem there. I created the topic "Could I validate SpreadSheetML parts separately using Open XML SDK?" http://social.msdn.microsoft.com/Forums/en-US/oxmlsdk/thread/189f4361-e0ea-413e-83fc-ccf5a354254c

    And in the last post I asked about partial validation of spreadsheet. But I have no answers yet.

    If Visual Studio balks at opening the file it may give an explanation in the error dialog. That could explain the problem.

    Visual Studio shows the familiar OutOfMemoryException message.

    

    Thank you for your attention.

    Sergey Maslov.

    Friday, October 21, 2011 5:14 AM
  • Hello Sergey,

    So far the posts in this thread haven’t talked about
    1. The Windows platform you’re using – 32-bit or 64 bit
    2. The version of Excel you use to open the workbook – which you report gives you a message about unreadable content. Excel 2003 limits rows to 65,536, while Excel 2007 and 2010 have a worksheet size of 1,048,576 rows by 16,384 columns.  Your 300,000 rows are not a consideration. In this case the message about unreadable content does not result from the size of your workbook.  

    Excel manages memory internally and in its code can create and control the size and location of the heap. So it can get the workbook in memory regardless of the size. Windows will page to virtual memory in order to fit the file regardless of the available free memory. 32-bit windows will let your program address 2 GB of memory (real and/or virtual) while 64 – bit Windows changes that.

    Please see the following blog:
    Understanding 64-Bit Office - Microsoft Office 2010 Engineering ...
    http://blogs.technet.com/b/office2010/archive/2010/02/23/understanding-64-bit-office.aspx

    You mentioned “my program”. If you’re building the 300,000 row worksheet programmatically rather than populating it from a database then the reason for the invalid XML is inherent in your code. You can verify this by using your code to build a worksheet with only 1,000 rows. Does that one fail to validate?

    Beyond this we’ve pretty well covered the bases. If you need to work with a Microsoft Support Engineer then because of the complexity of your needs your questions fall into the paid support category which requires a more in-depth level of support.  Please visit the below link to see the various paid support options that are available to better meet your needs. http://support.microsoft.com/default.aspx?id=fh;en-us;offerprophone

    Regards,
    Chris Jensen
    Senior Technical Support Lead

    Friday, October 21, 2011 3:14 PM
    Moderator
  • Thank you, Chris.

    After many experiments with the data and rows in the spreadsheet document, I`ve found the root of the problem. That were the large string values in the cells.

    • Proposed as answer by cjatmsModerator Monday, October 24, 2011 1:22 PM
    • Marked as answer by Bruce Song Friday, October 28, 2011 8:46 AM
    Monday, October 24, 2011 8:47 AM