...
- Provision a new Ubuntu-based cluster using one of the options described in this guide.
Note: on the EUMETSAT side of the EWC, only Ubuntu images are supported for GPU workloads.
- If you are provisioning the cluster manually, make sure to:
- select one of the vm.A6000.X flavours for the worker nodes.
Note: the GPU nodes need at least 60 GB of disk space to deploy the GPU operator and related workloads. If you use a small flavour (e.g. vm.A6000.1), which comes with a smaller disk, specify a larger custom disk size manually.
- Change the default Image Name to a GPU-enabled image (see the list of available images).
- If you provision a cluster using a template, manually add a new Machine Deployment with GPU-enabled worker nodes afterwards, providing the same values as in Step 2 above.
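Whichever provisioning option you choose, once the cluster is reachable you can sanity-check the worker nodes with kubectl (a quick sketch, assuming your kubeconfig points at the new cluster; the ~60 GB threshold matches the disk-space note above):

```shell
# List each node's ephemeral storage capacity;
# GPU worker nodes should report at least ~60 GB
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.capacity.ephemeral-storage
```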
- Once the GPU nodes are provisioned, deploy the NVIDIA GPU Operator - Time Slicing application from the Application Catalogue.
- You can verify the status of the GPU operator by checking that all the pods in its namespace have been deployed successfully and are either Running or Completed:
```
> kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-tpwr4                                       2/2     Running     0             107s
gpu-operator-745ccb5b94-dzxvk                                     1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-master-6fpj76g   1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-worker-6hk95     1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-worker-jb2v8     1/1     Running     0             3m18s
nvidia-container-toolkit-daemonset-7gsz7                          1/1     Running     2 (86s ago)   111s
nvidia-cuda-validator-pqt4b                                       0/1     Completed   0             46s
nvidia-dcgm-exporter-hmxx8                                        1/1     Running     0             108s
nvidia-device-plugin-daemonset-2kxfq                              2/2     Running     0             110s
nvidia-device-plugin-validator-ss74n                              0/1     Completed   0             29s
nvidia-operator-validator-6tglx                                   1/1     Running     0             111s
```
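With the operator healthy, each GPU node should advertise `nvidia.com/gpu` as an allocatable resource; with time slicing enabled, the advertised count is a multiple of the number of physical GPUs. A quick way to check (a sketch, assuming kubectl access to the cluster):

```shell
# Show how many nvidia.com/gpu resources each node advertises
# (dots inside the resource name are escaped with a backslash)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```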
- Deploy a test workload to verify the GPU access
```
> cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: vector-add
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

> kubectl logs pod/vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
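After the test passes, you can remove the test pod and, optionally, inspect how the GPU resources are exposed on a worker node (the node name below is a placeholder to fill in from `kubectl get nodes`):

```shell
# Remove the completed test pod
kubectl delete pod vector-add

# Inspect the GPU-related capacity and allocations on a specific worker node
kubectl describe node <gpu-node-name> | grep -A 5 'nvidia.com/gpu'
```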