Handling large number of zip files

  • Question

  • Requirement:
    We have a large number (around 500,000) of small zip files (file size 200 MB) stored in Azure Blob Storage. We need to extract the XML file from each zip file and process it on an HDInsight Spark cluster, ignoring the metadata files. From the HDInsight Spark cluster we need to store the results in Azure SQL Database.

    - Which is the better approach for extracting the zip files?
      - Extract the XML file with Data Factory and store it in some intermediate storage, e.g. as an XML file, in a SQL database, or as a flat file
      - Use Spark to extract the XML from the zip file and do the further processing in one go
    - We are using blob storage to store the dependency jar files and refer to them via the classpath. Is it better to create a fat jar?
    Thursday, September 21, 2017 4:05 AM
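
To make the second option more concrete, here is a minimal sketch of the per-archive extraction step in Python. It assumes that the XML payload can be recognised by a `.xml` extension, that the metadata files do not use that extension, and that the XML is UTF-8 encoded; none of this is confirmed by the question. The names `extract_xml_entries` and `build_sample_zip` are hypothetical helpers invented for illustration.

```python
import io
import zipfile


def extract_xml_entries(zip_bytes):
    """Return (entry_name, xml_text) pairs for every .xml entry in a zip archive.

    Assumption: metadata files do not end in .xml and the XML is UTF-8 encoded.
    """
    results = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Skip directories and anything that is not an XML file.
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            results.append((info.filename, archive.read(info).decode("utf-8")))
    return results


def build_sample_zip():
    """Build a tiny in-memory archive with one XML file and one metadata file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data.xml", "<root><item>1</item></root>")
        archive.writestr("metadata.txt", "ignored")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, xml_text in extract_xml_entries(build_sample_zip()):
        print(name, xml_text)
```

On a Spark cluster, a function like this could plausibly be applied to the output of PySpark's `SparkContext.binaryFiles`, which yields (path, content) pairs, for example via `flatMap(lambda kv: extract_xml_entries(kv[1]))`. Note that this reads each archive fully into memory inside a single task, so executor memory would need to be sized for the 200 MB files.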