custom Docker image build failed, conda not found

  • Question

  • I'm trying to run an experiment on my GPU compute target. I submit the estimator below, and the run fails with a "Run failed. Image build failed" error. The error log is included below; "conda: not found" looks like the issue.

    from azureml.train.estimator import Estimator

    script_params = {
        '--data-folder': ds.as_mount()
    }

    est = Estimator(source_directory='.',
                    compute_target=compute_target,
                    script_params=script_params,
                    entry_script='test.py',
                    custom_docker_image='tensorflow/tensorflow:1.12.0-gpu-py3',
                    pip_packages=['keras', 'numpy'],
                    use_gpu=True)

    -------------------------

    error log:

    2019/10/31 20:16:35 Downloading source code...
    2019/10/31 20:16:37 Finished downloading source code
    2019/10/31 20:16:40 Creating Docker network: acb_default_network, driver: 'bridge'
    2019/10/31 20:16:42 Successfully set up Docker network: acb_default_network
    2019/10/31 20:16:42 Setting up Docker configuration...
    2019/10/31 20:16:42 Successfully set up Docker configuration
    2019/10/31 20:16:42 Logging in to registry: foodmlwsfaster2491836064.azurecr.io
    2019/10/31 20:16:44 Successfully logged into foodmlwsfaster2491836064.azurecr.io
    2019/10/31 20:16:44 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
    2019/10/31 20:16:44 Scanning for dependencies...
    2019/10/31 20:16:44 Successfully scanned dependencies
    2019/10/31 20:16:44 Launching container with name: acb_step_0
    Sending build context to Docker daemon  59.39kB


    Step 1/14 : FROM tensorflow/tensorflow:1.12.0-gpu-py3@sha256:84f0820e151b129c63ac15c6d9c1c5336a834070dca22a271c7de091d490a17f
    sha256:84f0820e151b129c63ac15c6d9c1c5336a834070dca22a271c7de091d490a17f: Pulling from tensorflow/tensorflow
    [ ... per-layer pull progress omitted; all 17 layers downloaded and extracted ... ]
    Digest: sha256:84f0820e151b129c63ac15c6d9c1c5336a834070dca22a271c7de091d490a17f
    Status: Downloaded newer image for tensorflow/tensorflow:1.12.0-gpu-py3@sha256:84f0820e151b129c63ac15c6d9c1c5336a834070dca22a271c7de091d490a17f
     ---> 413b9533f92a
    Step 2/14 : USER root
     ---> Running in abffe211e7a7
    Removing intermediate container abffe211e7a7
     ---> 7010ebf72961
    Step 3/14 : RUN mkdir -p $HOME/.cache
     ---> Running in 941da7a1a380
    Removing intermediate container 941da7a1a380
     ---> aa7d13d1851a
    Step 4/14 : WORKDIR /
     ---> Running in 411eabab64d4
    Removing intermediate container 411eabab64d4
     ---> 5d8b8b7c3066
    Step 5/14 : COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/
     ---> 29b3f77c11af
    Step 6/14 : RUN if dpkg --compare-versions `conda --version | grep -oE '[^ ]+$'` lt 4.4.11; then conda install conda==4.4.11; fi
     ---> Running in b7001a082671
    /bin/sh: 1: conda: not found
    dpkg: error: --compare-versions takes three arguments: <version> <relation> <version>

    Type dpkg --help for help about installing and deinstalling packages [*];
    Use 'apt' or 'aptitude' for user-friendly package management;
    Type dpkg -Dhelp for a list of dpkg debug flag values;
    Type dpkg --force-help for a list of forcing options;
    Type dpkg-deb --help for help about manipulating *.deb files;

    Options marked [*] produce a lot of output - pipe it through 'less' or 'more' !
    Removing intermediate container b7001a082671
     ---> f4aa2a6d589d
    Step 7/14 : COPY azureml-environment-setup/mutated_conda_dependencies.yml azureml-environment-setup/mutated_conda_dependencies.yml
     ---> afa1dc0accd2
    Step 8/14 : RUN ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_e1ed510e22efc0d217a9550a38dbf97c -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig
     ---> Running in 6eba33d5cc83
    /bin/sh: 1: conda: not found
    The command '/bin/sh -c ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_e1ed510e22efc0d217a9550a38dbf97c -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig' returned a non-zero code: 127
    2019/10/31 20:18:36 Container failed during run: acb_step_0. No retries remaining.
    failed to run step ID: acb_step_0: exit status 127

    Run ID: cd1h failed after 2m3s. Error: failed during run, err: exit status 1

    Thursday, October 31, 2019 8:54 PM

All replies

  • Hi,
    Could you please run the code below and share the output so we can check the run configuration?
    est.run_config
    Also, if possible, please share a link to the sample you are following.
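
    In particular, the docker and python sections of the run configuration are the interesting parts. A quick way to print just those fields, assuming the usual RunConfiguration attribute names in your SDK version ('est' is the Estimator from your snippet):

    print(est.run_config.environment.docker.base_image)
    print(est.run_config.environment.python.user_managed_dependencies)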

    Thanks
    Sunday, November 3, 2019 4:03 AM
    Moderator
  • Hi @Ram-msft,

    Thanks! The documentation I'm following is:

    docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-tensorflow
    docs.microsoft.com/bs-latn-ba/azure/machine-learning/service/how-to-train-ml-models

    Output from est.run_config is:

    {
        "script": "mbnet_finetuning/test_bottlenecktensors_training.py",
        "arguments": [
            "--data-folder",
            "$AZUREML_DATAREFERENCE_workspacefilestore"
        ],
        "target": "gpu",
        "framework": "Python",
        "communicator": "None",
        "maxRunDurationSeconds": null,
        "nodeCount": 1,
        "environment": {
            "name": null,
            "version": null,
            "environmentVariables": {
                "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
            },
            "python": {
                "userManagedDependencies": false,
                "interpreterPath": "python",
                "condaDependenciesFile": null,
                "baseCondaEnvironment": null,
                "condaDependencies": {
                    "name": "project_environment",
                    "dependencies": [
                        "python=3.6.2",
                        {
                            "pip": [
                                "azureml-defaults",
                                "keras",
                                "numpy"
                            ]
                        }
                    ],
                    "channels": [
                        "conda-forge"
                    ]
                }
            },
            "docker": {
                "enabled": true,
                "baseImage": "tensorflow/tensorflow:1.12.0-gpu-py3",
                "baseDockerfile": null,
                "sharedVolumes": true,
                "gpuSupport": true,
                "shmSize": "2g",
                "arguments": [],
                "baseImageRegistry": {
                    "address": null,
                    "username": null,
                    "password": null
                }
            },
            "spark": {
                "repositories": [],
                "packages": [],
                "precachePackages": false
            },
            "databricks": {
                "mavenLibraries": [],
                "pypiLibraries": [],
                "rcranLibraries": [],
                "jarLibraries": [],
                "eggLibraries": []
            },
            "inferencingStackVersion": null
        },
        "history": {
            "outputCollection": true,
            "snapshotProject": true,
            "directoriesToWatch": [
                "logs"
            ]
        },
        "spark": {
            "configuration": {
                "spark.app.name": "Azure ML Experiment",
                "spark.yarn.maxAppAttempts": 1
            }
        },
        "hdi": {
            "yarnDeployMode": "cluster"
        },
        "tensorflow": {
            "workerCount": 1,
            "parameterServerCount": 1
        },
        "mpi": {
            "processCountPerNode": 1
        },
        "dataReferences": {
            "workspacefilestore": {
                "dataStoreName": "workspacefilestore",
                "pathOnDataStore": null,
                "mode": "mount",
                "overwrite": false,
                "pathOnCompute": null
            }
        },
        "sourceDirectoryDataStore": null,
        "amlcompute": {
            "vmSize": null,
            "vmPriority": null,
            "retainCluster": false,
            "name": null,
            "clusterMaxNodeCount": 1
        }
    }


    Monday, November 4, 2019 9:47 PM
  • Hi,

    It looks like the tensorflow image doesn't include conda. When extra packages are requested (for example via pip_packages), Azure ML uses conda to create a new environment and install them inside the image, which is why the build fails here.

    If you want TensorFlow, you can use the TensorFlow estimator:

    https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-tensorflow#create-a-tensorflow-estimator
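
    Alternatively, if you need to keep the custom tensorflow image, one option is to mark the environment as user-managed so Azure ML skips the conda-based environment build that is failing. This is only a sketch: with user_managed=True the pip_packages list is no longer installed for you, so keras, numpy, and azureml-defaults would all have to be present in the image (or installed by you) already.

    from azureml.train.estimator import Estimator

    # Sketch only: assumes the image already contains every package the script
    # needs (keras, numpy, azureml-defaults). Azure ML will not run conda here.
    est = Estimator(source_directory='.',
                    compute_target=compute_target,
                    script_params=script_params,
                    entry_script='test.py',
                    custom_docker_image='tensorflow/tensorflow:1.12.0-gpu-py3',
                    user_managed=True,   # skip the conda environment build
                    use_gpu=True)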

    Thanks 


    Tuesday, November 5, 2019 6:07 AM
    Moderator
  • Thanks. I had previously tried the TensorFlow estimator, but it didn't work either. I ran it again; below is the estimator I used. The experiment failed with "Run failed. AzureML compute job failed. Failed starting container...". I've attached the relevant parts of the error log.

    -------------

    from azureml.train.dnn import TensorFlow

    script_params = {
        '--data-folder': ds.as_mount()
    }

    est = TensorFlow(source_directory='.',
                     compute_target=compute_target,
                     script_params=script_params,
                     entry_script='mbnet_finetuning/test_bottlenecktensors_training.py',
                     pip_packages=['keras', 'numpy'],
                     use_gpu=True)

    ---------------

    Errors:

    error: None of TensorFlow, PyTorch, or MXNet plugins were built" 

    docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=17365 /mnt/docker/overlay2/60d1c3d76508bd126b3db600de343896e617237a086fdf622da23f20a07eb506/merged]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: driver >= 410\\\\n\\\"\"": unknown.
    2019-11-05T21:35:23Z Job environment preparation failed on 10.0.0.4. Output: 
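
    The key line seems to be "unsatisfied condition: driver >= 410": the image the TensorFlow estimator builds by default is CUDA 10 based and needs NVIDIA driver 410 or newer, while the node apparently has an older driver. A possible workaround, only a sketch and assuming this SDK version accepts framework_version='1.12' (TensorFlow 1.12 is built against CUDA 9, which has a lower driver requirement), would be to pin the framework version:

    from azureml.train.dnn import TensorFlow

    # Sketch only: '1.12' is assumed to be an accepted framework_version; it should
    # select an older, CUDA 9 based image that avoids the driver >= 410 check.
    est = TensorFlow(source_directory='.',
                     compute_target=compute_target,
                     script_params=script_params,
                     entry_script='mbnet_finetuning/test_bottlenecktensors_training.py',
                     pip_packages=['keras', 'numpy'],
                     framework_version='1.12',
                     use_gpu=True)

    The other option would presumably be to recreate the compute cluster so the nodes come up with a newer NVIDIA driver.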

    Tuesday, November 5, 2019 10:00 PM