Ubuntu Docker image does not have access to GPUs and CUDA is not available

  • Question

  • Hi,

    I started using the PyTorch Estimator to train an image classification network. No matter which dedicated compute instance I used (4x K80, 4x P40, or 1x V100), torch.cuda.is_available() in the entry script always returned False during the run of the experiment.

    The odd thing is that the same command returns True when run from the notebook server, since the CUDA driver is installed on the dedicated compute VM itself.

    Digging further, I saw that the compute instance itself is a Windows machine with the CUDA drivers installed, but when the entry script runs through the PyTorch estimator class, it runs inside a Docker image based on Ubuntu 18.04 LTS with no CUDA drivers installed. Hence nvidia-smi fails and torch.cuda.is_available() returns False.
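
    For reference, here is roughly the check I run at the top of the entry script (a simplified sketch, not the full training script; the print labels match the output quoted further down):

    import subprocess

    import torch

    # Show which OS the script actually runs on (the container, not the host VM).
    subprocess.run(["lsb_release", "-a"], check=False)

    # Show what PyTorch can see.
    print(torch.cuda.device_count())
    print("cudaPresent:", torch.cuda.is_available())
    print("device:", "cuda" if torch.cuda.is_available() else "cpu")
    print("torchVersion:", torch.__version__)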

    I'm not sure whether this is because GPU support is broken for AzureML, as posted by another user on this forum, or because a CPU base Docker image is spun up even when the use_gpu parameter is True.

    I'm a bit confused about what to do to make CUDA work for my training. If it can't be made to work, I don't think the service is of much use to me, since training on CPU alone isn't practical. Any quick help is appreciated.

    The following is the code I used with a dedicated compute target (a STANDARD_ND24RS VM with 4 Tesla P40 GPUs):

    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException
    from azureml.core.conda_dependencies import CondaDependencies
    from azureml.core.runconfig import DEFAULT_GPU_IMAGE, RunConfiguration
    from azureml.train.dnn import PyTorch
    from azureml.widgets import RunDetails

    cluster_name = "testpytorch1"

    try:
        compute_target = ComputeTarget(workspace=ws, name=cluster_name)
        print('Found existing compute target')
    except ComputeTargetException:
        print('Creating a new compute target...')
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_ND24RS',
                                                               max_nodes=4)
        compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
        compute_target.wait_for_completion(show_output=True, min_node_count=None,
                                           timeout_in_minutes=60)
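
    As a sanity check (not part of the original run), the cluster's VM size and node state can be confirmed after provisioning:

    # Confirm the cluster actually came up with the expected GPU SKU.
    status = compute_target.get_status()
    print(status.serialize())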

    # Create a new runconfig object
    run_amlcompute = RunConfiguration()
    
    # Use the GPU compute target created above.
    run_amlcompute.target = compute_target
    
    # Enable Docker
    run_amlcompute.environment.docker.enabled = True
    
    # Set Docker base image to the default GPU-based image
    run_amlcompute.environment.docker.base_image = DEFAULT_GPU_IMAGE
    
    # Use conda_dependencies.yml to create a conda environment in the Docker image for execution
    run_amlcompute.environment.python.user_managed_dependencies = False
    
    # Specify CondaDependencies obj, add necessary packages
    run_amlcompute.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['pillow==5.4.1'])
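
    For completeness: a RunConfiguration like this would normally be handed to a ScriptRunConfig at submission time (a sketch under that assumption; below I instead submit through the PyTorch estimator, which never receives this object):

    from azureml.core import ScriptRunConfig

    # Sketch: submitting the entry script with the RunConfiguration above,
    # instead of going through the PyTorch estimator.
    src = ScriptRunConfig(source_directory=project_folder,
                          script='pytorch_train.py',
                          run_config=run_amlcompute)
    # run = experiment.submit(src)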


    script_params = {
        '--num_epochs': 50,
        '--output_dir': './outputs'
    }
    
    estimator = PyTorch(source_directory=project_folder, 
                        script_params=script_params,
                        compute_target=compute_target,
                        entry_script='pytorch_train.py',
                        use_gpu=True,
                        use_docker=True,
                        pip_packages=['pillow==5.4.1'])


    run = experiment.submit(estimator)
    run.wait_for_completion(show_output=True)
    RunDetails(run).show()

    The output I get while debugging in the entry script, when I print the result of torch.cuda.is_available(), is as follows:

    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 18.04.3 LTS
    Release:        18.04
    Codename:       bionic
    0
    cudaPresent: False
    device: cpu
    torchVersion: 1.4.0

    The thing I don't understand is that although I'm using the default GPU base image, which points to an IntelMPI/CUDA image built on Ubuntu 16.04, the OS check in the entry script reports Ubuntu 18.04. Something is definitely going wrong here.
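
    For what it's worth, which image the estimator actually resolves can be inspected before submitting (a sketch, assuming the estimator exposes a run_config attribute):

    # Check the environment the estimator will actually submit with.
    env = estimator.run_config.environment
    print(env.docker.enabled, env.docker.base_image)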


    Wednesday, May 13, 2020 10:15 PM

All replies

  • Hello Krishna,

    We recently had a similar issue reported by another user.

    Could you please try defining the required Docker arguments for GPU access ("--gpus", "all") in your run configuration and check whether the GPU is enabled? I think adding this to your run configuration could help when you run your experiment.

    DOCKER_ARGUMENTS = ["--gpus", "all"]
    run_amlcompute.environment.docker.arguments = DOCKER_ARGUMENTS
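
    If you submit through the PyTorch estimator rather than a standalone RunConfiguration, the same arguments would need to go on the estimator's own run configuration (a sketch, assuming the estimator exposes run_config):

    # Sketch: the equivalent setting on the estimator's run configuration.
    estimator.run_config.environment.docker.arguments = ["--gpus", "all"]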

    -Rohit


    Thursday, May 14, 2020 12:51 PM