To run GPU workloads on the EWC Kubernetes Service, two prerequisites must be met:

  • A cluster with GPU-enabled worker nodes (one of the vm.A6000.X flavours) running an Ubuntu GPU image, which comes with the NVIDIA driver installed on the base OS
  • The GPU Operator installed from the Application Catalogue
    • Why the operator rather than just the device plugin? → The GPU Operator handles device discovery, validation, container toolkit installation, and other setup steps that would otherwise have to be done manually before using the device plugin, and then installs the device plugin as well.
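To confirm that the base image really ships the driver, one quick check (assuming you have SSH access to a worker node; the user and address below are placeholders, not values from this guide) is to run nvidia-smi on the node:

```shell
# Placeholder user/host: replace with your GPU worker node's address.
# nvidia-smi is installed together with the NVIDIA driver on the Ubuntu GPU image,
# so a successful run confirms the driver is present on the base OS.
ssh ubuntu@<gpu-node-address> nvidia-smi
```

If the command prints the driver version and lists the A6000 device, the node is ready for the GPU Operator.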

Getting Started

  1. Provision a new Ubuntu-based cluster using one of the options described in this guide.
    Note: on the EUMETSAT side of the EWC, only Ubuntu images are supported for GPU workloads.

  2. If you are provisioning the cluster manually, make sure to:
    1. Select one of the vm.A6000.X flavours for the worker nodes.
      Note: GPU nodes need at least 60 GB of disk space to deploy the GPU Operator and related workloads. If you use a small flavour (e.g. vm.A6000.1) that comes with a smaller disk, specify a larger custom disk size manually.
    2. Change the default Image Name to a GPU-enabled image (see the list of available images).



  3. If you provision a cluster using a template, manually add a new Machine Deployment with GPU-enabled worker nodes afterwards.


  4. Provide the same values as in Step 2 above (flavour, disk size, and image name).


  5. Once the GPU nodes are provisioned, deploy the NVIDIA GPU Operator - Time Slicing application from the Application Catalogue.


  6. Verify the status of the GPU Operator by checking that all pods in its namespace have been deployed successfully:
    > kubectl get pods -n gpu-operator
    NAME                                                              READY   STATUS      RESTARTS      AGE
    gpu-feature-discovery-tpwr4                                       2/2     Running     0             107s
    gpu-operator-745ccb5b94-dzxvk                                     1/1     Running     0             3m19s
    gpu-operator-gpu-operator-node-feature-discovery-master-6fpj76g   1/1     Running     0             3m19s
    gpu-operator-gpu-operator-node-feature-discovery-worker-6hk95     1/1     Running     0             3m19s
    gpu-operator-gpu-operator-node-feature-discovery-worker-jb2v8     1/1     Running     0             3m18s
    nvidia-container-toolkit-daemonset-7gsz7                          1/1     Running     2 (86s ago)   111s
    nvidia-cuda-validator-pqt4b                                       0/1     Completed   0             46s
    nvidia-dcgm-exporter-hmxx8                                        1/1     Running     0             108s
    nvidia-device-plugin-daemonset-2kxfq                              2/2     Running     0             110s
    nvidia-device-plugin-validator-ss74n                              0/1     Completed   0             29s
    nvidia-operator-validator-6tglx                                   1/1     Running     0             111s
    


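As an optional extra check (not part of the original steps), you can verify that the GPU nodes now advertise the nvidia.com/gpu resource; with time slicing enabled, the allocatable count may be higher than the number of physical GPUs:

```shell
# Print each node name together with its allocatable nvidia.com/gpu count.
# The dot in the resource name must be escaped inside the jsonpath expression.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```

A non-empty count on the GPU worker nodes means the device plugin is registered and pods can request the resource.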
  7. Deploy a test workload and check the pod logs to confirm that GPU access works as expected:
    > cat << EOF | kubectl create -f -
     apiVersion: v1
     kind: Pod
     metadata:
       name: vector-add
     spec:
       restartPolicy: OnFailure
       containers:
       - name: vector-add
         image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
         resources:
           limits:
              nvidia.com/gpu: 1
    EOF
    

    > kubectl logs pod/vector-add
    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
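Once the test passes, the completed pod can be cleaned up:

```shell
# Remove the test pod created in the previous step.
kubectl delete pod vector-add
```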