Error writing 600MB json file from Blob to Azure SQL

    Question

  • Setup

    I have the following Azure Resources deployed and working:
    (NB: they are all in West Europe data center)

    Azure Storage account v2, LRS, Hot tier
    A 600MB json file in a Blob container
    The file contains 286,000+ rows of data, over 44 columns

    Azure SQL Database v12 in an Elastic Pool
    The Elastic Pool has 1200 DTUs and a max size of 100GB
    A table in a staging schema that has no FKs, no indexes, and is truncated before each run

    Azure Data Factory v2
    A pipeline containing a Copy Activity with the json file as the Source, and the SQL table as the Sink
    The Copy Activity is configured with 16 DIUs and auto parallelism
    The write batch size is 10,000 (a sketch of these settings is shown just below)
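
    For reference, the relevant part of the Copy Activity definition looks roughly like the snippet below.  I've written it as a Python dict that mirrors the pipeline JSON; the activity and dataset names are placeholders, and the property names are what I understand the ADF v2 copy activity schema to use, so treat it as a sketch rather than the exact definition.

    import json

    # Rough sketch of the Copy Activity settings described above, expressed as a
    # Python dict mirroring the ADF v2 pipeline JSON.  Activity and dataset names
    # are placeholders; property names are assumed from the copy activity schema.
    copy_activity = {
        "name": "CopyBlobJsonToStaging",    # placeholder activity name
        "type": "Copy",
        "inputs": [{"referenceName": "BlobJsonDataset", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "SqlStagingDataset", "type": "DatasetReference"}],
        "typeProperties": {
            "source": {"type": "BlobSource"},
            "sink": {
                "type": "AzureSqlSink",
                "writeBatchSize": 10000     # the 10,000 batch size
            },
            "dataIntegrationUnits": 16      # the 16 DIUs
            # parallelCopies is left unset so the service decides (auto parallelism)
        }
    }

    print(json.dumps(copy_activity, indent=2))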

    Error

    The pipeline ran for 14 hours and then failed with the generic message:
    {
        "errorCode": "500",
        "message": "InternalServerError executing pipeline",
        "failureType": "SystemError"
    }

    The SQL Database it is writing to is showing AVG EDTU of 0.024, which looks like it is barely ticking over
    Before failing, the throughput visible in the Data Factory activity runs was 3.092 KB/s, which sounds very low
    The database is showing no errors

    Question 1:
    How can I get a more detailed error message than "Internal server error"?  This is all the info I can find about why it failed :-(
    600MB doesn't sound big enough to have capsized Data Factory.

    Where can I find/set up more detailed logging in Data Factory?
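
    (For anyone looking into this: one programmatic way to inspect the run is to query the activity runs for the pipeline run via the management SDK.  A minimal sketch follows, assuming the azure-mgmt-datafactory and azure-identity Python packages, a service principal with read access to the factory, and placeholder values throughout.)

    from datetime import datetime, timedelta

    from azure.identity import ClientSecretCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import RunFilterParameters

    # All of these values are placeholders for illustration only.
    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    FACTORY_NAME = "<data-factory-name>"
    PIPELINE_RUN_ID = "<pipeline-run-id>"   # taken from the Monitor / activity runs view

    credential = ClientSecretCredential(
        tenant_id="<tenant-id>",
        client_id="<client-id>",
        client_secret="<client-secret>",
    )
    client = DataFactoryManagementClient(credential, SUBSCRIPTION_ID)

    # Query the activity runs belonging to the failed pipeline run and print any
    # error payloads, which usually carry more detail than the top-level message.
    filters = RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(days=2),
        last_updated_before=datetime.utcnow(),
    )
    activity_runs = client.activity_runs.query_by_pipeline_run(
        RESOURCE_GROUP, FACTORY_NAME, PIPELINE_RUN_ID, filters
    )
    for run in activity_runs.value:
        print(run.activity_name, run.status)
        print(run.error)   # detailed error object, if the service recorded one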

    Question 2:
    Why is this pipeline running so slowly?  Am I being naive to expect Data Factory to process 200,000 rows in much less time than 14 hours?  When the Blob source and the destination database are in the same data center, surely the data throughput should be blazing fast.

    In addition to changing the DIUs and/or batch size, what other options are available to speed up this performance?

    (Pre-empt: switching to Azure Data Warehouse is not an option on this project)

    I have read through all the entries that come up for a keyword search of "performance" in this forum and none of them match the exact scenario described above.

    Any help much appreciated.

    Wednesday, April 3, 2019 8:40 PM

All replies

  • Does it even work with a smaller source file?  Like 100 records?
    Wednesday, April 3, 2019 9:06 PM
  • Yup.

    A master pipeline calls a dozen or so child pipelines; each child pipeline is identical to the one that fails, but with much smaller files.  The next smallest is 120MB; there are also an 80MB, an 18MB and an 11MB.  And they all work fine.

    I have also tested all pipelines with much smaller recordsets (1000 or so test data rows) and all succeeded.

    Thursday, April 4, 2019 7:28 AM
  • On further testing, it seems that 200MB is the largest json file Data Factory can handle.

    The files I am working with are larger than that.

    Friday, April 5, 2019 7:02 PM
  • Hello Johnnev, and thank you for your question.  I am preparing to repro your issue.  Could you please let me know whether you have any nested JSON objects?
    Saturday, April 6, 2019 12:29 AM
  • Hi Martin, 

    Yes, the json is highly structured.  FYI it's output from a product called Debezium, which is a CDC tool for MySQL databases.  Debezium offers very few options for how the json gets exported, so we have to work with what is there.

    Here is a sample of 2 records of output (though obviously the real data has many hundreds of thousands of records):

    {
    	"root": [
    		{
    			"before": null,
    			"after": {
    				"db_name.topic_name.Value": {
    					"id": 1,
    					"hr_id": "9876",
    					"description": {
    						"string": "Bank Transfer"
    					},
    					"unit_type": null,
    					"pay_element_type": 1,
    					"basis": 0,
    					"enabled_for_gcc": {
    						"int": 1
    					},
    					"created_at": {
    						"long": 1473042357000
    					},
    					"updated_at": {
    						"long": 1475704482000
    					},
    					"last_updated_by": {
    						"string": "sysfunction"
    					}
    				}
    			},
    			"source": {
    				"version": {
    					"string": "0.9.0.Beta2"
    				},
    				"connector": {
    					"string": "mysql"
    				},
    				"name": "db_name",
    				"server_id": 0,
    				"ts_sec": 0,
    				"gtid": null,
    				"file": "mysql-bin.007769",
    				"pos": 6417283,
    				"row": 0,
    				"snapshot": {
    					"boolean": true
    				},
    				"thread": null,
    				"db": {
    					"string": "topic_name"
    				},
    				"table": {
    					"string": "payelements"
    				},
    				"query": null
    			},
    			"op": "c",
    			"ts_ms": {
    				"long": 1550490567106
    			}
    		},
    		{
    			"before": null,
    			"after": {
    				"db_name.topic_name.Value": {
    					"id": 2,
    					"hr_id": "8765",
    					"description": {
    						"string": "Gross Pay"
    					},
    					"unit_type": null,
    					"pay_element_type": 0,
    					"basis": 0,
    					"enabled_for_gcc": {
    						"int": 1
    					},
    					"created_at": {
    						"long": 1473042413000
    					},
    					"updated_at": {
    						"long": 1475704501000
    					},
    					"last_updated_by": {
    						"string": "sysfunction"
    					}
    				}
    			},
    			"source": {
    				"version": {
    					"string": "0.9.0.Beta2"
    				},
    				"connector": {
    					"string": "mysql"
    				},
    				"name": "db_name",
    				"server_id": 0,
    				"ts_sec": 0,
    				"gtid": null,
    				"file": "mysql-bin.007769",
    				"pos": 6417283,
    				"row": 0,
    				"snapshot": {
    					"boolean": true
    				},
    				"thread": null,
    				"db": {
    					"string": "topic_name"
    				},
    				"table": {
    					"string": "payelements"
    				},
    				"query": null
    			},
    			"op": "c",
    			"ts_ms": {
    				"long": 1550490567106
    			}
    		}
    	]
    }

    I've obfuscated the values, but hopefully you get an idea of what the json is like.

    Thanks, John

    Monday, April 8, 2019 5:50 PM
  • I believe what you should try is an intermediate step where you flatten the JSON.  I generated a large file with a similar number of rows, but 200MB, and using the lowest database DTU it ingested in under 10 minutes.  Would you like a support ticket?
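
    As an illustration of that intermediate step, here is a minimal flattening sketch in Python.  It assumes the file has the structure of the sample you posted (a "root" array, an "after" object under a single wrapper key such as "db_name.topic_name.Value", and Debezium/Avro-style type wrappers like {"string": ...} and {"long": ...}), that the whole document fits in memory, and placeholder file names.

    import json

    def unwrap(value):
        # Debezium/Avro unions arrive as {"string": "x"}, {"long": 123}, etc.;
        # return the inner value, passing plain values and nulls through unchanged.
        if isinstance(value, dict) and len(value) == 1:
            return next(iter(value.values()))
        return value

    def flatten(path_in, path_out):
        with open(path_in, encoding="utf-8") as f:
            doc = json.load(f)
        with open(path_out, "w", encoding="utf-8") as out:
            for record in doc["root"]:
                after = record.get("after") or {}
                # Single wrapper key, e.g. "db_name.topic_name.Value" in the sample.
                for fields in after.values():
                    row = {name: unwrap(value) for name, value in fields.items()}
                    out.write(json.dumps(row) + "\n")   # one flat JSON object per line

    flatten("debezium_export.json", "flattened.jsonl")   # placeholder file names

    Writing flat, line-delimited records should give the Copy Activity a far simpler source to map onto the staging table than one deeply nested 600MB document.  For a file that size a streaming parser (ijson, for example) would avoid loading everything into memory, but the shape of the transformation is the same.
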
    Thursday, April 11, 2019 8:48 PM
  • Is this step possible with a Data Factory Activity?  Or would this need to be done before Data Factory opens the file?

    Come back XML + XSLT.... all is forgiven!

    Monday, April 15, 2019 3:05 PM