To run GPU workloads on the EWC Kubernetes Service, two prerequisites must be met:
Check that the GPU Operator pods are all in the `Running` or `Completed` state:

```shell
$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-tpwr4                                       2/2     Running     0             107s
gpu-operator-745ccb5b94-dzxvk                                     1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-master-6fpj76g   1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-worker-6hk95     1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-worker-jb2v8     1/1     Running     0             3m18s
nvidia-container-toolkit-daemonset-7gsz7                          1/1     Running     2 (86s ago)   111s
nvidia-cuda-validator-pqt4b                                       0/1     Completed   0             46s
nvidia-dcgm-exporter-hmxx8                                        1/1     Running     0             108s
nvidia-device-plugin-daemonset-2kxfq                              2/2     Running     0             110s
nvidia-device-plugin-validator-ss74n                              0/1     Completed   0             29s
nvidia-operator-validator-6tglx                                   1/1     Running     0             111s
```
To verify that GPU scheduling works, deploy a sample CUDA vector-add pod that requests one GPU:

```shell
$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: vector-add
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```
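The manifest requests a GPU through the `nvidia.com/gpu` extended resource, which the NVIDIA device plugin advertises on GPU nodes. GPUs are requested only in the `limits` section and only in whole units. As a sketch, a pod that needs more than one GPU would simply raise the limit (the value `2` here is illustrative):

```yaml
    resources:
      limits:
        nvidia.com/gpu: 2   # illustrative: request two whole GPUs; fractional values are not allowed
```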
Once the pod has completed, its logs should show the test passing:

```shell
$ kubectl logs pod/vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
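The launch geometry in the log follows from the problem size: with 50000 elements and 256 threads per block, the sample needs ceil(50000 / 256) = 196 blocks. This can be checked with a quick shell calculation (variable names here are our own):

```shell
# Recompute the kernel launch geometry reported in the log:
# ceil(elements / threads_per_block) via integer arithmetic.
elements=50000
threads_per_block=256
blocks=$(( (elements + threads_per_block - 1) / threads_per_block ))
echo "blocks=$blocks"   # prints blocks=196
```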