One big compute cluster or several small clusters for hyperparameter tuning of neural networks

  • Question

  • Hi, when running several neural network training jobs with different hyperparameters: which is better, setting up a GPU machine for each hyperparameter configuration, or setting up a cluster of x machines for x hyperparameter configurations?

    In general: how does the compute cluster in Azure ML work? How are jobs placed onto one cluster? The scaling from min nodes to max nodes must somehow be planned and controlled, and for GPUs this is mostly memory-limited, depending on the number of network parameters and the batch size.

    Regards, Stefan

    Friday, November 15, 2019 9:32 AM

All replies

  • Hi Stefan,

    Azure Machine Learning allows you to automate hyperparameter exploration efficiently, reducing time and resources. You specify the range of hyperparameter values and a maximum number of training runs. The system then automatically launches multiple simultaneous runs with different parameter configurations and finds the configuration that yields the best performance, measured by the metric you choose. Poorly performing training runs are automatically terminated early, reducing wasted compute; those resources are used instead to explore other hyperparameter configurations.
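    As a rough sketch of what that looks like with the Azure ML SDK (v1): you define a sampling space over your hyperparameters and an early-termination policy. The specific parameter names (`learning_rate`, `batch_size`) and values below are illustrative, not from your training script.

```python
from azureml.train.hyperdrive import (
    RandomParameterSampling, BanditPolicy, uniform, choice
)

# Search space: one run per sampled configuration.
param_sampling = RandomParameterSampling({
    "learning_rate": uniform(1e-4, 1e-1),   # continuous range
    "batch_size": choice(16, 32, 64),       # discrete choices
})

# Early termination: cancel runs whose primary metric falls more than
# 10% behind the best run, checked at every interval after run 5.
early_termination = BanditPolicy(
    slack_factor=0.1, evaluation_interval=1, delay_evaluation=5
)
```

    The early-termination policy is what frees up nodes for other configurations mid-experiment.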

    The estimator describes the training script you run, the resources per job (single- or multi-GPU), and the compute target to use. Since concurrency for the hyperparameter tuning experiment is gated on the resources available, ensure that the compute target specified in the estimator has sufficient resources for your desired concurrency.
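    A minimal sketch of tying these together, again with the SDK v1 API. The cluster name `gpu-cluster`, the script `train.py`, and the metric `validation_accuracy` are placeholders for whatever your setup actually uses; note how `max_concurrent_runs` should not exceed the cluster's max node count.

```python
from azureml.train.estimator import Estimator
from azureml.train.hyperdrive import (
    HyperDriveConfig, PrimaryMetricGoal, RandomParameterSampling, uniform
)

# The estimator: what to run and where. One node per run here;
# a multi-GPU VM size gives each run more than one GPU.
estimator = Estimator(
    source_directory="./train",
    entry_script="train.py",          # your training script (placeholder)
    compute_target="gpu-cluster",     # existing AmlCompute cluster (placeholder)
    pip_packages=["torch"],
)

hyperdrive_config = HyperDriveConfig(
    estimator=estimator,
    hyperparameter_sampling=RandomParameterSampling(
        {"learning_rate": uniform(1e-4, 1e-1)}
    ),
    primary_metric_name="validation_accuracy",  # logged by train.py
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,   # gated by nodes available on the cluster
)
```

    With `max_concurrent_runs=4` on a 4-node cluster, four configurations train in parallel and the remaining runs queue until a node frees up.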

    The compute cluster autoscales depending on the capacity required by the queued jobs, and scales down to zero nodes when it isn't in use. Dedicated VMs are created to run your jobs as needed. Please take a look at this documentation for more details about how each compute type can be configured and used.
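    The min/max node bounds you asked about are set when the cluster is provisioned; scaling between them is then handled by the service based on queued jobs. A sketch (the cluster name, VM size, and node counts are example values):

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()  # assumes a local config.json for your workspace

# min_nodes=0 lets the cluster scale to zero (no cost) when idle;
# max_nodes caps how many concurrent runs can execute at once.
config = AmlCompute.provisioning_configuration(
    vm_size="Standard_NC6",              # single-GPU VM size (example)
    min_nodes=0,
    max_nodes=4,
    idle_seconds_before_scaledown=1800,  # wait 30 min before releasing a node
)

cluster = ComputeTarget.create(ws, "gpu-cluster", config)
cluster.wait_for_completion(show_output=True)
```

    GPU memory per run is then determined by the VM size you pick, so choose it against your largest planned network and batch size.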


    Monday, November 18, 2019 9:02 AM