Hi,
I started using the PyTorch Estimator to train an image classification network. I found that no matter which dedicated compute instance I used (4x K80, 4x P40, or 1x V100), torch.cuda.is_available() in the entry script always returned False during the run of the experiment.
The funny thing is that the same command returns True when run from the notebook server, since the CUDA driver is installed on the dedicated compute VM itself.
Based on further digging, I saw that the compute instance itself is a Windows machine with the CUDA drivers installed, but when the entry script is run through the PyTorch estimator class, the run happens inside a Docker image running Ubuntu 18.04 LTS with no CUDA drivers installed. As a result, both nvidia-smi and torch.cuda.is_available() fail.
I'm not sure whether this is because GPU support is broken for AzureML, as stated by another user on this forum, or because a CPU base Docker image is spun up even when the use_gpu parameter is True.
I'm a bit confused about what I need to do to make CUDA work for my training. If it can't be made to work, I don't think the service is of much use to me, since running the training on CPU only isn't an option. Any quick help is appreciated.
The following is the code I used with a dedicated compute target (a VM with 4 Tesla K80 GPUs):
# Imports needed for the snippet (azureml-sdk)
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration, DEFAULT_GPU_IMAGE
from azureml.core.conda_dependencies import CondaDependencies
from azureml.train.dnn import PyTorch
from azureml.widgets import RunDetails

cluster_name = "testpytorch1"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_ND24RS',
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=60)

# Create a new runconfig object
run_amlcompute = RunConfiguration()

# Point the run configuration at the GPU compute target created above
run_amlcompute.target = compute_target

# Enable Docker
run_amlcompute.environment.docker.enabled = True

# Set the Docker base image to the default GPU image
run_amlcompute.environment.docker.base_image = DEFAULT_GPU_IMAGE

# Let the service build the conda environment inside the Docker image
run_amlcompute.environment.python.user_managed_dependencies = False

# Specify the CondaDependencies object and add the necessary packages
run_amlcompute.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['pillow==5.4.1'])

script_params = {
    '--num_epochs': 50,
    '--output_dir': './outputs'
}

estimator = PyTorch(source_directory=project_folder,
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='pytorch_train.py',
                    use_gpu=True,
                    use_docker=True,
                    pip_packages=['pillow==5.4.1'])

run = experiment.submit(estimator)
run.wait_for_completion(show_output=True)
RunDetails(run).show()
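Since I suspect the wrong base image might be getting picked despite use_gpu=True, one way I thought of to check is to dump the run definition after submitting. This is only a sketch; the exact nesting and key names returned by run.get_details() may differ between SDK versions, so I read them defensively:

# Sketch: inspect what the submitted run actually uses.
# The structure of get_details() may vary with the SDK version, hence the .get() defaults.
details = run.get_details()
run_definition = details.get('runDefinition', {})
docker_settings = run_definition.get('environment', {}).get('docker', {})

print('target:', run_definition.get('target'))
print('docker settings:', docker_settings)   # should show the base image actually used for the run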
The output I get from the debugging code in the entry script, where I print the OS details and the result of torch.cuda.is_available(), is as follows:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.3 LTS
Release: 18.04
Codename: bionic
0
cudaPresent: False
device: cpu
torchVersion: 1.4.0
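For reference, the debug lines in pytorch_train.py that produce this output are roughly the following (a minimal sketch, not the full training script):

# Minimal sketch of the debug checks in the entry script (the real script also trains the model)
import os
import torch

os.system('lsb_release -a')                          # prints the OS details of the container
print(torch.cuda.device_count())                     # prints 0 when no GPU is visible
print('cudaPresent:', torch.cuda.is_available())
print('device:', 'cuda' if torch.cuda.is_available() else 'cpu')
print('torchVersion:', torch.__version__)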
The thing I don't understand is that although I'm using the default GPU base image, which points to an Intel MPI Ubuntu 16.04 Docker image, the OS check in the entry script reports Ubuntu 18.04. There's definitely something going wrong here.