Azure Data Factory Mapping Data Flow Performance

  • Question

  • I have multiple Data Flow activities in my pipeline(s). Every time I run the pipeline via a trigger (manual or scheduled), it takes 6~7 minutes to warm up a cluster for each Data Flow activity. I would expect the cluster to be warmed up only once and then shared across all Data Flow activities, but this doesn't seem to be the case.

    I have tried setting the RunOn property for the Data Flow activity to autoResolveIntegrationRuntime as well as to a custom IR, but the same issue is observed.

    • Why is a cluster warm-up required for each Data Flow activity even when I specify a custom IR to run the data flows on?
    • Is there a way to reuse the same warmed-up cluster for multiple Data Flow activities in a pipeline?
    • When will the cluster warm-up performance be fixed?

    Monday, September 9, 2019 5:23 PM

Answers

  • Hi Andy,

    Thank you for your inquiry. Here is some information on each of your questions.

    1. Why is a cluster warm-up required for each Data Flow activity even when I specify a custom IR to run the data flows on?
      ADF Mapping Data Flows run as activities within Azure Data Factory pipelines on scaled-out Azure Databricks job clusters running Spark. Each Data Flow activity is a job, and each job gets its own cluster (1 job == 1 cluster), so every activity incurs its own cluster start-up.

    2. Is there a way to reuse the same warmed-up cluster for multiple Data Flow activities in a pipeline?
      The ADF engineering team is currently working on a feature that lets users set a TTL (Time To Live) in the Azure IR under Data Flow settings. The TTL keeps a cluster alive so that subsequent Data Flow activities do not incur start-up time. The TTL setting is currently greyed out because the feature is still in development.


    3. When will the cluster warm-up performance be fixed?
      The current ETA for this feature to reach production is the end of September. (This ETA is tentative and may change based on other priorities.)

    If you would like to try this feature before it is released into production, the ADF engineering team will be happy to enable it as an experimental feature for your subscription. We just need your Azure Subscription ID.

    If you are interested, please send the details below to AzCommunity[at]Microsoft[dot]com

    Email subject: <Azure Data Factory: Azure Data Factory Mapping Data Flow Performance>
    Thread URL: <https://social.msdn.microsoft.com/Forums/en-US/fb9d9e98-821e-4f00-90ce-13a76e3415e6/azure-data-factory-mapping-data-flow-performance?forum=AzureDataFactory>
    Subscription ID:  <your subscription id>


    Once you have sent the email, please let us know here.

    Hope the above info helps.


    Thank you

    If a post helps to resolve your issue, please click "Mark as Answer" and/or "Vote as helpful" on that post. By marking a post as Answered and/or Helpful, you help others find the answer faster.

    Tuesday, September 10, 2019 4:01 AM
    Moderator

All replies

  • Thanks KranthiPakala-MSFT

    I have emailed the subscription id as requested. I will update again after playing with the feature.

    Wednesday, September 11, 2019 3:25 PM
  • Thanks for sharing the details, coder_andy. Will get back to you once we have an update from ADF team.

    Thank you


    Thursday, September 12, 2019 12:44 AM
    Moderator
  • Did you get the "Time to Live" option enabled after sending the request with your subscription details?

    I am facing the exact same problem and need a fix for it.

    Thursday, September 12, 2019 10:49 AM
  • Hi,

    I got the "Time To Live" feature enabled for the Data Factory IR, and these are my observations:

    • The first Data Flow activity in the pipeline still has a warm-up time of 6~7 minutes
    • The remaining Data Flow activities have a warm-up time of 3 minutes each, although they are configured to RunOn the same custom IR with TTL = 1 hour

    This is a significant improvement, but it doesn't explain why a 3-minute warm-up is needed again while the cluster's TTL is still active.

    Any comments?

    I will wait for this to be in GA and test it again.

    Friday, September 13, 2019 4:43 PM
  • Can you send an email here with your Azure Subscription ID (GUID) requesting early access to the warm pool feature in Data Flows?  TY!

    adfdataflowext [at] microsoft [dot] com

    Friday, September 13, 2019 10:19 PM
  • mkro, is this separate from the TTL feature?

    I sent the email with the details as requested. Awaiting activation.

    Monday, September 16, 2019 2:02 PM
  • Hi coder_andy

    No, it is the same thing: early access to the TTL feature, which your subscription already has.

    Thank you for sharing your feedback on the TTL feature. Please send any further feedback or suggestions about it to adfdataflowext [at] microsoft [dot] com.



    Thank you


    Monday, September 16, 2019 10:29 PM
    Moderator
  • I'd like an answer to this too. The data flows are great once they're actually running, but with 5 flows in a pipeline and an idle TTL of 10 minutes, the pipeline takes 16 minutes to run 25s of activity. Without the TTL it's 31 minutes so that's an improvement, but all in all it's way too much overhead.

    Thursday, October 17, 2019 10:34 PM
  • Hi Chris S Buchanan,

    Thank you for your query.

    Parallel flow: If the Data Flow activities (let's assume 5) run in parallel on a warmed cluster with TTL, the pipeline takes ~4 minutes, because the 2-3 minute VM spin-up happens for all activities at the same time.

    Sequential flow: If the Data Flow activities (let's assume 5) run in series on a warmed cluster with TTL, each activity takes ~2-3 minutes to acquire its VMs. That is why you are seeing 16 minutes for 5 activity executions: 5 * 3 = 15 minutes of acquisition, plus the time to execute the jobs, comes to ~16 minutes.
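
    The arithmetic above can be sketched as a rough estimate. This is only an illustration, not an ADF API: the function names are hypothetical, the 3-minute acquisition time and 25 seconds of total work are the figures quoted in this thread, and treating the whole 25 seconds as the longest parallel job is a pessimistic assumption.

```python
def sequential_minutes(n_activities, acquire_min, total_run_seconds):
    """Activities run one after another; each pays its own VM acquisition wait."""
    return n_activities * acquire_min + total_run_seconds / 60


def parallel_minutes(acquire_min, longest_run_seconds):
    """Activities start together; the acquisition wait is paid once, in parallel."""
    return acquire_min + longest_run_seconds / 60


# Figures quoted in this thread: 5 activities, ~3 min to acquire VMs,
# ~25 s of actual data flow work in total.
seq = sequential_minutes(5, 3, 25)
# Assumption: the longest single job is at most the 25 s total.
par = parallel_minutes(3, 25)

print(f"sequential: ~{seq:.1f} min")
print(f"parallel:   ~{par:.1f} min")
```

    With these inputs the sequential estimate lands close to the 16 minutes reported, and the parallel estimate is a little under the ~4 minutes quoted, so running independent Data Flow activities in parallel is where most of the saving comes from.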

    Hope this info helps.  


    Thank you


    Saturday, October 19, 2019 12:38 AM
    Moderator
  • Hi Chris S Buchanan,

    Following up to see whether the above information was helpful. If you have any further questions, please let us know.


    Thank you


    Monday, October 21, 2019 4:39 PM
    Moderator
  • Hi Mkro,

    I would like to request the warm pool feature as well, and I emailed adfdataflowext@microsoft.com. However, the email bounced. Could you please confirm whether the address is correct?

    Thank you.

    Regards,

    Puneet Lakhanpal

    Thursday, October 31, 2019 3:55 AM
  • The feature is already released. 

    Just create a new Azure Integration Runtime. 

    In the TTL property under Data Flow settings, choose a number higher than 0.

    That will generate a warm pool whenever you use that IR with your data flow activities. 
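
    For anyone who defines the IR as JSON rather than through the UI, here is a minimal sketch of what such a definition might look like, built as a Python dictionary. The field names (computeProperties, dataFlowProperties, timeToLive, and so on) reflect my understanding of the integration runtime JSON and should be checked against the current reference; the IR name, core count, and 10-minute TTL are placeholder assumptions, and the helper function is hypothetical.

```python
import json


def azure_ir_with_ttl(name, ttl_minutes, core_count=8, compute_type="General"):
    """Build an Azure IR definition with a Data Flow TTL (hypothetical helper)."""
    if ttl_minutes <= 0:
        # A TTL of 0 means no cluster is kept alive between activities.
        raise ValueError("ttl_minutes must be greater than 0 to keep a warm pool")
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    # Assumed field names; verify against the current IR schema.
                    "dataFlowProperties": {
                        "computeType": compute_type,
                        "coreCount": core_count,
                        "timeToLive": ttl_minutes,
                    },
                }
            },
        },
    }


# "DataFlowIRWithTTL" is a placeholder name, not one used in this thread.
ir = azure_ir_with_ttl("DataFlowIRWithTTL", ttl_minutes=10)
print(json.dumps(ir, indent=2))
```

    Each Data Flow activity then needs to run on this IR rather than on the default auto-resolve IR for the TTL to apply.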


    • Edited by mkro Thursday, October 31, 2019 8:16 AM
    Thursday, October 31, 2019 8:16 AM