...
- Provision a new Ubuntu-based cluster using one of the options described in this guide.
Note: on the EUMETSAT side of the EWC, only Ubuntu images are supported for GPU workloads.
- If you are provisioning the cluster manually, make sure to:
- select one of the vm.A6000.X flavours for the worker nodes.
Note: the GPU nodes need at least 60 GB of disk space to deploy the GPU operator and related workloads. If you use a small flavour (e.g. vm.A6000.1), which comes with a smaller disk, specify a larger custom disk size manually.
- Change the default Image Name to a GPU-enabled image (see the list of available images).
- If you provision a cluster using a template, manually add a new Machine Deployment with GPU-enabled worker nodes afterwards, providing the same values as in Step 2 above.
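Whichever provisioning option you choose, once the cluster is reachable you can sanity-check the worker nodes with kubectl (a quick sketch, assuming your kubeconfig points at the new cluster; the ~60 GB threshold matches the disk-space note above):

```shell
# List each node's ephemeral storage capacity;
# GPU worker nodes should report at least ~60 GB
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.capacity.ephemeral-storage
```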
- Once the GPU nodes are provisioned, deploy the NVIDIA GPU Operator - Time Slicing application from the Application Catalogue.
- You can verify the status of the GPU operator by checking that all the pods in its namespace have been deployed successfully and are either Running or Completed:
```
> kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-tpwr4                                       2/2     Running     0             107s
gpu-operator-745ccb5b94-dzxvk                                     1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-master-6fpj76g   1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-worker-6hk95     1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-worker-jb2v8     1/1     Running     0             3m18s
nvidia-container-toolkit-daemonset-7gsz7                          1/1     Running     2 (86s ago)   111s
nvidia-cuda-validator-pqt4b                                       0/1     Completed   0             46s
nvidia-dcgm-exporter-hmxx8                                        1/1     Running     0             108s
nvidia-device-plugin-daemonset-2kxfq                              2/2     Running     0             110s
nvidia-device-plugin-validator-ss74n                              0/1     Completed   0             29s
nvidia-operator-validator-6tglx                                   1/1     Running     0             111s
```
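With the operator healthy, each GPU node should advertise `nvidia.com/gpu` as an allocatable resource; with time slicing enabled, the advertised count is a multiple of the number of physical GPUs. A quick way to check (a sketch, assuming kubectl access to the cluster):

```shell
# Show how many nvidia.com/gpu resources each node advertises
# (dots inside the resource name are escaped with a backslash)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```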
- Deploy a test workload to verify the GPU access
```
> cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: vector-add
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

> kubectl logs pod/vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
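After the test passes, you can remove the test pod and, optionally, inspect how the GPU resources are exposed on a worker node (the node name below is a placeholder to fill in from `kubectl get nodes`):

```shell
# Remove the completed test pod
kubectl delete pod vector-add

# Inspect the GPU-related capacity and allocations on a specific worker node
kubectl describe node <gpu-node-name> | grep -A 5 'nvidia.com/gpu'
```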