First Modern Data Warehouse - need general advice, what is the best format for the data going into Data Lake

  • Question

  • Hi, I am moving data from a DB2 source database into a staging Data Lake resource.  It will later be transformed and moved into an Azure SQL DB or SQL Data Warehouse.

    When I create the pipeline to load the data into Data Lake (no transformations at this point), what is the best format for the data, especially given that it can be saved in so many formats?  Would the best be CSV? JSON? Parquet?  I am really unfamiliar with these formats.

    The data will go through a second pipeline to be loaded into Azure SQL DB.  This is where transforms might be done, so the format may matter here: do any of the formats work better than others?

    The second pipeline may call SQL procs or use Databricks to transform the data.

    With our end goal in mind and the process we will be implementing, what format do others recommend we use in the Data Lake?

    Monday, September 23, 2019 3:42 PM

All replies

  • I think you should go with the CSV/TSV file format, the reason being that these are native file formats and are supported across all the different products.

    Thanks Himanshu

    Monday, September 23, 2019 5:31 PM
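
    To illustrate the CSV suggestion above, here is a minimal Python sketch. The sample rows and the column names (order_id, amount, note) are hypothetical and are not taken from the DB2 source. It shows that CSV is easy to write and read with standard tooling, but every value comes back as a string, so the second pipeline would need to re-apply column types before loading into Azure SQL DB.

```python
import csv
import io

# Hypothetical sample rows standing in for a DB2 table extract (assumption).
rows = [
    {"order_id": 1001, "amount": 25.50, "note": "standard, ground"},
    {"order_id": 1002, "amount": 7.00, "note": ""},
]


def to_csv_text(records):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv_text(text):
    """Parse CSV text back into a list of dicts; all values are strings."""
    return list(csv.DictReader(io.StringIO(text)))


csv_text = to_csv_text(rows)
round_tripped = from_csv_text(csv_text)

# CSV carries no type information, so types must be re-applied downstream.
typed = [
    {
        "order_id": int(r["order_id"]),
        "amount": float(r["amount"]),
        "note": r["note"],
    }
    for r in round_tripped
]
```

    Formats such as Parquet store the schema alongside the data, which avoids this re-typing step. Whether that matters depends on how the second pipeline is built.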