Batch AI cluster creation problem

    Question

  • I get an error when trying to create a Batch AI cluster with the following code:

    from azureml.core.compute import ComputeTarget, BatchAiCompute
    from azureml.core.compute_target import ComputeTargetException
    from azureml.core import Workspace
    ws = Workspace.from_config()

    # choose a name for your cluster
    batchai_cluster_name = "traincluster"

    try:
        # look for the existing cluster by name
        compute_target = ComputeTarget(workspace=ws, name=batchai_cluster_name)
        if isinstance(compute_target, BatchAiCompute):
            print('found compute target {}, just use it.'.format(batchai_cluster_name))
        else:
            print('{} exists but it is not a Batch AI cluster. Please choose a different name.'.format(batchai_cluster_name))
    except ComputeTargetException:
        print('creating a new compute target...')
        compute_config = BatchAiCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", # small CPU-based VM
                                                                    #vm_priority='lowpriority', # optional
                                                                    autoscale_enabled=True,
                                                                    cluster_min_nodes=0, 
                                                                    cluster_max_nodes=4)

        # create the cluster
        compute_target = ComputeTarget.create(ws, batchai_cluster_name, compute_config)

        # can poll for a minimum number of nodes and for a specific timeout. 
        # if no min node count is provided it uses the scale settings for the cluster
        compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

        # Use the 'status' property to get a detailed status for the current cluster. 
        print(compute_target.status.serialize())

    Found the config file in: C:\Users\Ahmett\Downloads\config.json
    creating a new compute target...
    Creating
    BatchAI wait for completion finished
    Terminal state of "Failed" has been reached
    Provisioning errors: [{'error': {'code': 'InternalServerError', 'statusCode': 500, 'message': 'An internal server error occurred. Please try again. If the problem persists, contact support', 'details': [{'code': 'An internal server error occurred. Please try again. If the problem persists, contact support', 'message': "AADSTS700016: Application with identifier 'a61ca899-6ce1-4db7-b013-54014c505a4c' was not found in the directory '38069120-f32e-4acc-b7a2-9e9e1f8d16f2'. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You may have sent your authentication request to the wrong tenant\r\nTrace ID:["my TraceID"]
    • Edited by yamacgul Friday, November 30, 2018 8:00 AM
    Friday, November 30, 2018 7:56 AM

All replies

  • Hi,

    Sorry for your experience. Can you please send your issue to AzureBatchAITrainingPreview@service.microsoft.com for more help? We have engineers actively helping.

    Regards,

    Yutong


    Monday, December 3, 2018 4:19 AM
    Owner
  • Hi,

    To manage the compute resources created by Machine Learning Services in the Workspace, we add an Azure Active Directory application (AAD application) to the tenant and add a Service Principal with Contributor access to the Subscription.

    It seems that the AAD application 'a61ca899-6ce1-4db7-b013-54014c505a4c' was removed from the tenant, and that is causing the issue.

    We will fix the issue on our side (by adding a new AAD application and Service Principal), but can you verify that the AAD application was in fact removed from the tenant?
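
    One way to start checking this is to pull the application ID and tenant ID out of the error text, so you know exactly which application to look for in which directory. The following is only a minimal sketch: the function name `parse_missing_app_error` and the regular expression are illustrative assumptions, not part of the Azure ML SDK, and the pattern matches only the AADSTS700016 wording quoted in the question.

```python
import re

# Error text copied from the provisioning error in the question.
ERROR_MESSAGE = (
    "AADSTS700016: Application with identifier "
    "'a61ca899-6ce1-4db7-b013-54014c505a4c' was not found in the directory "
    "'38069120-f32e-4acc-b7a2-9e9e1f8d16f2'."
)

# Matches a GUID such as a61ca899-6ce1-4db7-b013-54014c505a4c.
GUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"

# Assumed message layout, based only on the wording seen in this thread.
PATTERN = re.compile(
    r"(?P<code>AADSTS\d+): Application with identifier '(?P<app_id>" + GUID + r")' "
    r"was not found in the directory '(?P<tenant_id>" + GUID + r")'"
)


def parse_missing_app_error(message):
    """Return a dict with code, app_id and tenant_id, or None if no match."""
    match = PATTERN.search(message)
    return match.groupdict() if match else None


info = parse_missing_app_error(ERROR_MESSAGE)
print(info)
```

    With the two IDs in hand, you could then look for the service principal in that tenant, for example with `az ad sp show --id <app_id>` while signed in to the tenant in question. Treat that as a suggestion; the exact check depends on your permissions in the directory.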

    The service seems to have been working until 11/8.

    Thanks,

      Daniel

    Monday, December 3, 2018 10:28 PM
  • The AAD app and Service Principal should regenerate the next time you use the service.
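
    If you want to script the retry rather than rerun the notebook by hand, something like the sketch below could work. It is only an illustration: `provision_with_retry` and `create_fn` are hypothetical names, and `create_fn` is assumed to wrap the cluster creation code from the question and return the final provisioning state as a string (the question's output shows the failing state as "Failed"). The demo at the bottom uses fake states instead of calling Azure.

```python
import time


def provision_with_retry(create_fn, attempts=3, delay_seconds=60, sleep=time.sleep):
    """Call create_fn until it returns a state other than "Failed".

    Returns a tuple (last_state, attempts_used).
    """
    state = None
    for attempt in range(1, attempts + 1):
        state = create_fn()
        if state != "Failed":
            return state, attempt
        if attempt < attempts:
            # Wait before the next try; no wait after the final attempt.
            sleep(delay_seconds)
    return state, attempts


# Demo with simulated states (no Azure calls): first try fails, second succeeds.
fake_states = iter(["Failed", "Succeeded"])
result = provision_with_retry(lambda: next(fake_states), sleep=lambda seconds: None)
print(result)
```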

    Thanks,

      Daniel

    Tuesday, December 4, 2018 12:09 AM