locked
Azure Data Lake processing XLSX in USQL RRS feed

  • Question

  • All,

    Is there any out of the box method for handling XLSX files in Azure Data Lake and ADF?  Assuming from what I have seen that this isnt possible.  

    If not possible Out of Box, is it supported/possible to create a custom extractor in USQL/C# to process an XLSX file stored in ADLS?  Assuming this is possible, because what can't I do in .NET, but would like confirmation, even an example if someone has this.

    Thanks!


    • Edited by dcb99 Friday, April 8, 2016 3:13 PM
    Tuesday, April 5, 2016 7:53 PM

Answers

  • Hi... you may want to post this in the Azure Data Lake Forum.

    There is no out of the box method for the following reason (and the same reasons hold for writing a custom extractor which is doable, but probably not the most performant approach):

    XSLX is a format that compresses Office Excel XML formatted files into an archive. The XML formats themselves are not necessarily designed for data processing but for preserving the Office formatting and making it efficient to work with from within Excel. Providing an efficient built-in extractor is not really feasible since the extractor would have to decompress the XSLX file (which is a single process, non-parallelizable step), then extract possibly multiple files into the local working dir (again not parallelizable, since the current scale out framework needs to see the data in the scale out storage locations) and then you need to process the XML which again is not parallelizable given the structural nature of XML as used by Excel.

    Unless you need access to all the formatting and formulas etc, I think you would be much better off at transforming the XSLX file first into a TSV or CSV format file (or set of files) and operate on them instead.


    Michael Rys

    Monday, June 6, 2016 6:10 PM

All replies

  • Hi... you may want to post this in the Azure Data Lake Forum.

    There is no out of the box method for the following reason (and the same reasons hold for writing a custom extractor which is doable, but probably not the most performant approach):

    XSLX is a format that compresses Office Excel XML formatted files into an archive. The XML formats themselves are not necessarily designed for data processing but for preserving the Office formatting and making it efficient to work with from within Excel. Providing an efficient built-in extractor is not really feasible since the extractor would have to decompress the XSLX file (which is a single process, non-parallelizable step), then extract possibly multiple files into the local working dir (again not parallelizable, since the current scale out framework needs to see the data in the scale out storage locations) and then you need to process the XML which again is not parallelizable given the structural nature of XML as used by Excel.

    Unless you need access to all the formatting and formulas etc, I think you would be much better off at transforming the XSLX file first into a TSV or CSV format file (or set of files) and operate on them instead.


    Michael Rys

    Monday, June 6, 2016 6:10 PM
  • Hi,

    Did you got any solution for this? If yes, can you please help me in resolving the issue.

    Thanks!!

    Wednesday, November 1, 2017 8:41 AM