Bulk Executor for Cosmos DB Mongo API Integrated to ADF

  • Question

  • Hi, 

    We are currently processing large volumes of CSV data, roughly 500M records / 300 GB combined, from Data Lake to Cosmos DB Mongo API. The data arrives in file chunks of ~8 GB each.

    We further split each 8 GB file into 40 mini-batches and use a ForEach activity to run 40 parallel copy activities with Cosmos DB Mongo API as the sink (sketched below). The collection is provisioned with 50,000 RU/s throughput.
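    Roughly, the fan-out part of the pipeline looks like the following fragment (trimmed for brevity; dataset references are omitted and the parameter and activity names are placeholders, not our actual definitions):

        {
            "name": "ForEachMiniBatch",
            "type": "ForEach",
            "typeProperties": {
                "isSequential": false,
                "batchCount": 40,
                "items": { "value": "@pipeline().parameters.miniBatchFiles", "type": "Expression" },
                "activities": [
                    {
                        "name": "CopyMiniBatchToCosmosMongo",
                        "type": "Copy",
                        "typeProperties": {
                            "source": { "type": "DelimitedTextSource" },
                            "sink": { "type": "CosmosDbMongoDbApiSink", "writeBehavior": "insert" }
                        }
                    }
                ]
            }
        }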

    Currently it takes an average of 14 hours to write 11M documents of ~10 KB each.

    We also wanted to test the ingestion performance of the Cosmos DB SQL API. We developed a custom C# utility using Microsoft's bulk executor library to perform the bulk loads, and observed the SQL API ingesting roughly 4 times faster than Mongo. We believe this is due to the bulk executor capability available for the SQL API.
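    For reference, the core of our utility is roughly the following (a trimmed sketch assuming the Microsoft.Azure.CosmosDB.BulkExecutor NuGet package; the endpoint, key, and CSV-parsing helper are placeholders rather than our actual code):

        using System;
        using System.Collections.Generic;
        using System.Threading.Tasks;
        using Microsoft.Azure.CosmosDB.BulkExecutor;
        using Microsoft.Azure.CosmosDB.BulkExecutor.BulkImport;
        using Microsoft.Azure.Documents;
        using Microsoft.Azure.Documents.Client;

        class BulkLoader
        {
            static async Task Main()
            {
                // Placeholder endpoint and key.
                var client = new DocumentClient(
                    new Uri("https://<account>.documents.azure.com:443/"), "<primary-key>");

                // Target collection (SQL API).
                DocumentCollection collection = await client.ReadDocumentCollectionAsync(
                    UriFactory.CreateDocumentCollectionUri("<database>", "<collection>"));

                IBulkExecutor bulkExecutor = new BulkExecutor(client, collection);
                await bulkExecutor.InitializeAsync();

                // One mini-batch of documents parsed from a CSV chunk (placeholder helper).
                IEnumerable<object> documents = LoadDocumentsFromCsvChunk();

                BulkImportResponse response = await bulkExecutor.BulkImportAsync(
                    documents: documents,
                    enableUpsert: true,
                    disableAutomaticIdGeneration: true);

                Console.WriteLine($"Imported {response.NumberOfDocumentsImported} docs, " +
                                  $"{response.TotalRequestUnitsConsumed} RUs, {response.TotalTimeTaken}.");
            }

            // Placeholder: in the real utility this streams one ~150 MB CSV chunk into documents.
            static IEnumerable<object> LoadDocumentsFromCsvChunk() => new List<object>();
        }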

    Our questions are:

    1. When will the bulk executor be available and integrated into ADF for the Cosmos DB Mongo API?

    2. Are there any other suggestions to improve ADF-to-Cosmos Mongo performance, beyond what Microsoft has already documented? We have tried all of those.

    Friday, September 27, 2019 2:24 PM

Answers

All replies

  • Hello Shanmuk Aluri and thank you for your in-depth research.  I will share your findings and questions internally.
    Friday, September 27, 2019 5:35 PM
  • Hi Martin, 

    Just checking if there is any response from the internal team.

    Monday, September 30, 2019 3:31 PM
  • Hello Shanmuk Aluri.  I did get some responses.  The team is very interested.  If I understand their responses correctly,  the Cosmos DB Mongo API sink is already using Bulk executor, and the team would like to take a look at your work to tune the performance.  Would you be comfortable sharing the activity ID which copies data to Cosmos DB Mongo API?

    If you do not feel comfortable sharing it in this forum, you can send an email to AzCommunity@microsoft.com, with "dfa2a1cc-4db6-4578-8a46-f15289ff00bf" (the ID of this thread) in the subject.

    In any case, please let me know what you decide.

    Sincerely,
    Martin Jaffer, Azure CXP Community Engineer

    Monday, September 30, 2019 6:25 PM
  • I also got a response from the Cosmos DB team (the previous post concerned the Data Factory team's response). This answers question #1.

    ... The latest published BulkExecutor supports the MongoDB api.    The customer can also upload their csv files to Blob Storage and use Azure DMS to migrate their data to CosmosDB MongoDB api accounts.

    By DMS they are most likely referring to Database Migration Services.  That has nothing to do with Data Factory.

    Monday, September 30, 2019 6:56 PM
  • Hi, 

    I appreciate your time on this. 

    Where can I find official Microsoft documentation stating that the bulk executor is available for the Cosmos DB Mongo API?

    The latest documentation I found says it is available only for the SQL API and Gremlin API:

    https://docs.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview

    Also, we are not planning to use DMS, since this is not just a one-time load; we want a solution we can productionize and run on a schedule. Hence we are using ADF.

    Tuesday, October 1, 2019 1:09 PM
  • Hi,

    Thank you for your time on this. 

    Below are the child pipeline run IDs:

    19 hrs with Cosmos DB at 50K RU/s - 40 parallel copy activities - 40 files of 150 MB / 275K records each

    9a71fd9a-9e67-4942-9b6d-a90756870bdf

    5.5 hrs with Cosmos DB at 200K RU/s - 40 parallel copy activities - 40 files of 150 MB / 275K records each

    f3895191-e960-4a92-902c-994f277619d4

    Tuesday, October 1, 2019 1:37 PM
  • A support ticket has been raised with Microsoft for this issue:

    119100224002840. 

    Awaiting resolution.

    Friday, October 4, 2019 3:33 PM
  • Thank you for providing the ticket ID.
    Monday, October 7, 2019 4:31 PM
  • From the ticket, the suggested resolution is the following two copy activity settings (sketched below):

    1. Set writeBatchSize to 5000.
    2. Set maxMemoryLimit to 524288000
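
    If I am reading the ticket correctly, both settings would go on the copy activity's Cosmos DB Mongo API sink, roughly as below (where exactly maxMemoryLimit is accepted is my assumption from the ticket wording, not confirmed against the connector documentation):

        "sink": {
            "type": "CosmosDbMongoDbApiSink",
            "writeBatchSize": 5000,
            "maxMemoryLimit": 524288000
        }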
    Thursday, October 17, 2019 9:17 PM