OutOfMemoryError Copying from OnPrem SQLServer to Sink dataset in parquet

  • Question

  • I have a Self-Hosted IR installed on my on-prem SQL Server machine.

    I have set the environment variable JAVA_OPTS to "-Xms1024m -Xmx1536m" (without the quotes, of course). The JRE is 64-bit.


    I am copying a complete table (~3 million rows) from SQL Server to a Parquet file on ADLS Gen2 Blob Storage (hierarchical namespace, hot tier). The behavior is the same on the cool tier as well.

    I am using a very simple Copy activity in my Data Factory to write to this sink. However, the job fails with an OutOfMemoryError.

    Can you please advise how I can fix the issue?

    What is the approach to sizing this? I have a larger table, and if this one fails, then I am sure I will have an issue with larger datasets.

    The forum does not yet allow me to post an image. However, I notice the following:

    SQLServer: DataRead: 1.005 GB; RowsRead: 2,118,757; PeakConns: 1

    Azure Data Lake Storage Gen2: DataWritten: 83.886 MB; FilesWritten: 0; RowsWritten: 2,118,757; PeakConns: 1; Throughput: 8.461 MB/s

    In the end, no Parquet file was actually written.

    Smaller tables have copied over without issues.

     


    • Edited by rraj26 Friday, July 12, 2019 10:38 PM Added more info.
    Friday, July 12, 2019 10:19 PM

Answers

  • Thanks to the MSFT Support Team, this issue was resolved...

    The suggestion was:



    On the server running the Self-Hosted IR (my on-prem SQL Server machine)...

    I MUST set the environment variable _JAVA_OPTIONS to "-Xms1024m -Xmx4096m" (not JAVA_OPTS, as I had).

    Set a maximum heap value large enough for the process to complete on the on-prem server. I set it to 4 GB, but it can be adjusted as needed.


    Raj R


    • Marked as answer by rraj26 Wednesday, July 31, 2019 4:51 PM
    • Edited by rraj26 Wednesday, July 31, 2019 4:53 PM
    Wednesday, July 31, 2019 4:51 PM

All replies

  • Hi there,

    Could you try setting a smaller writeBatchSize property? That might mitigate the OOM issue.

    Hope this helps.

    Monday, July 15, 2019 5:37 AM
    Owner
  • Thanks for your response, Chirag...

    Where do I set this?

    My pipeline with the CopyData activity has no such option...

    (Note that I have removed other activities AND other mappings (about 15 fields) from the pipeline to make this example smaller.)

    {
        "name": "CopyDataFromOnPremise",
        "properties": {
            "activities": [
                {
                    "name": "Copy Test Result Data",
                    "type": "Copy",
                    "dependsOn": [
                        {
                            "activity": "Copy Staging Data",
                            "dependencyConditions": [
                                "Succeeded"
                            ]
                        },
                        {
                            "activity": "Copy Agent Data",
                            "dependencyConditions": [
                                "Succeeded"
                            ]
                        }
                    ],
                    "policy": {
                        "timeout": "7.00:00:00",
                        "retry": 0,
                        "retryIntervalInSeconds": 30,
                        "secureOutput": false,
                        "secureInput": false
                    },
                    "userProperties": [],
                    "typeProperties": {
                        "source": {
                            "type": "SqlServerSource",
                            "sqlReaderQuery": "select *, 'MCCM' SourceSystem from test_result"
                        },
                        "sink": {
                            "type": "ParquetSink",
                            "storeSettings": {
                                "type": "AzureBlobFSWriteSetting"
                            }
                        },
                        "enableStaging": false,
                        "translator": {
                            "type": "TabularTranslator",
                            "mappings": [
                                {
                                    "source": {
                                        "name": "pt_id",
                                        "type": "String"
                                    },
                                    "sink": {
                                        "name": "pt_id",
                                        "type": "String"
                                    }
                                },
                                {
                                    "source": {
                                        "name": "SourceSystem",
                                        "type": "String"
                                    },
                                    "sink": {
                                        "name": "SourceSystem",
                                        "type": "String"
                                    }
                                }
                            ]
                        }
                    },
                    "inputs": [
                        {
                            "referenceName": "MCCM_DS",
                            "type": "DatasetReference"
                        }
                    ],
                    "outputs": [
                        {
                            "referenceName": "MCCMtestresult",
                            "type": "DatasetReference"
                        }
                    ]
                }
            ],
            "folder": {
                "name": "ChildPipelines"
            },
            "annotations": []
        }
    }

    The sink dataset is configured as follows...

    {
        "name": "MCCMtestresult",
        "properties": {
            "linkedServiceName": {
                "referenceName": "IDF_RAWDATA_STORE",
                "type": "LinkedServiceReference"
            },
            "folder": {
                "name": "RAWDATA"
            },
            "annotations": [],
            "type": "Parquet",
            "typeProperties": {
                "location": {
                    "type": "AzureBlobFSLocation",
                    "fileName": "mccm_testresult",
                    "folderPath": "transactions",
                    "fileSystem": "rawdata"
                },
                "compressionCodec": "snappy"
            },
            "schema": []
        }
    }

    - Raj


    Raj R


    • Edited by rraj26 Monday, July 15, 2019 2:18 PM
    Monday, July 15, 2019 2:14 PM
  • Hi there,

    Sorry, I missed that your sink is ADLS Gen2. The "writeBatchSize" property can only be set on SQL-based copy sinks, so it does not apply to a Parquet sink.
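    For reference, on a SQL sink it would sit inside the copy activity's sink block, roughly like this (a minimal illustrative sketch; the SqlServerSink type and the batch size value here are example choices, not taken from your pipeline):

    "sink": {
        "type": "SqlServerSink",
        "writeBatchSize": 1000
    }

    writeBatchSize is the number of rows buffered per batch before they are written to the SQL sink, so a smaller value means less data held in memory at a time, at the cost of some throughput.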

    For a deeper investigation and immediate assistance on this issue, please file a support ticket if you have a support plan. Otherwise, please send an email to AzCommunity@Microsoft.com with the details below, so that we can create a one-time free support ticket for you to work closely on this matter.

    Thread URL: https://social.msdn.microsoft.com/Forums/en-US/0396a33b-1f36-4870-9462-30fef7988b10/outofmemoryerror-copying-from-onprem-sqlserver-to-sink-dataset-in-parquet?forum=AzureDataFactory
    Subscription ID:  <your subscription id>

    Please let me know once you have done so.


    Tuesday, July 16, 2019 6:41 AM
    Owner
  • Chirag:

    Thank you for your time. I have sent the email as requested with the details. 

    Looking forward to your solution.

    - Raj


    Raj R

    Tuesday, July 16, 2019 4:08 PM
  • Awesome, Raj. You should hear from a Support Engineer very soon; they will help you resolve your issue.

    I would recommend sharing your findings once the issue is resolved. This will help other community members reading this thread.

    Wednesday, July 17, 2019 7:18 AM
    Owner
  • Thanks to the MSFT Support Team, this issue was resolved...

    The suggestion was:



    On the server running the Self-Hosted IR (my on-prem SQL Server machine)...

    I MUST set the environment variable _JAVA_OPTIONS to "-Xms1024m -Xmx4096m" (not JAVA_OPTS, as I had).

    Set a maximum heap value large enough for the process to complete on the on-prem server. I set it to 4 GB, but it can be adjusted as needed.


    Raj R


    • Marked as answer by rraj26 Wednesday, July 31, 2019 4:51 PM
    • Edited by rraj26 Wednesday, July 31, 2019 4:53 PM
    Wednesday, July 31, 2019 4:51 PM
  • Glad to hear that the problem was resolved. Thanks for sharing your findings.
    Thursday, August 1, 2019 5:55 AM
    Owner