To run GPU workloads on the EWC Kubernetes Service, two prerequisites must be met:
- A cluster with GPU-enabled worker nodes (one of the vm.A6000.X flavours) running an Ubuntu GPU image, which comes with the NVIDIA driver installed on the base OS
- The GPU Operator installed from the Application Catalogue.
- Why the operator rather than just the device plugin? → The GPU Operator handles device discovery, validation, container toolkit installation, and other setup steps that would otherwise have to be performed manually before the device plugin could be used, and then installs the device plugin as well.
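As a quick sanity check before installing the operator, you can confirm that the driver shipped with the GPU image is present on a worker node. This is only a sketch: it assumes you have SSH access to the node, and the node address is a placeholder.

```shell
# <gpu-node-ip> is a placeholder; adjust the user name to match your image
# (Ubuntu images typically use "ubuntu"). nvidia-smi should list the A6000 GPU
# and the installed driver version.
ssh ubuntu@<gpu-node-ip> nvidia-smi
```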
Getting Started
- Provision a new Ubuntu-based cluster using one of the options described in this guide.
Note: on the EUMETSAT side of the EWC, only Ubuntu images are supported for GPU workloads.
- If you are provisioning the cluster manually, make sure to:
- select one of the vm.A6000.X flavours for the worker nodes.
Note: the GPU nodes need at least 60 GB of disk space to deploy the GPU Operator and related workloads. If you use a small flavour (e.g. vm.A6000.1), which comes with a smaller disk, specify a larger custom disk size manually.
- Change the default Image Name to a GPU-enabled image (see list of available images).
- If you provision a cluster using a template, manually add a new Machine Deployment with GPU-enabled worker nodes afterwards, providing the same flavour, image, and disk size values as in the manual provisioning steps above.
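Once the GPU worker nodes are provisioned, you can check that they have joined the cluster before moving on:

```shell
# List all nodes; the new GPU workers should appear with STATUS "Ready".
kubectl get nodes -o wide
```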
- Once the GPU nodes are provisioned, deploy the NVIDIA GPU Operator - Time Slicing application from the Application Catalogue.
- You can verify the status of the GPU Operator by checking that all the pods in the gpu-operator namespace have been deployed successfully:
```
> kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-tpwr4                                       2/2     Running     0             107s
gpu-operator-745ccb5b94-dzxvk                                     1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-master-6fpj76g   1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-worker-6hk95     1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-worker-jb2v8     1/1     Running     0             3m18s
nvidia-container-toolkit-daemonset-7gsz7                          1/1     Running     2 (86s ago)   111s
nvidia-cuda-validator-pqt4b                                       0/1     Completed   0             46s
nvidia-dcgm-exporter-hmxx8                                        1/1     Running     0             108s
nvidia-device-plugin-daemonset-2kxfq                              2/2     Running     0             110s
nvidia-device-plugin-validator-ss74n                              0/1     Completed   0             29s
nvidia-operator-validator-6tglx                                   1/1     Running     0             111s
```
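You can also confirm that the device plugin has advertised the GPUs to the scheduler. A minimal sketch (note that the dots in the nvidia.com/gpu resource name must be escaped inside the custom-columns expression):

```shell
# Show the allocatable GPU count per node; non-GPU nodes show <none>.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```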
- Deploy a test workload and check the pod's logs to confirm that GPU access works as expected:
```
> cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: vector-add
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```
```
> kubectl logs pod/vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
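Once the test passes, the pod can be removed so it does not hold on to the GPU resource request:

```shell
# Delete the completed test pod.
kubectl delete pod vector-add
```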