Introduction
The EUMETSAT infrastructure provides NVIDIA RTX A6000 GPU cards. To use a GPU, you need to provision an instance with one of the following flavors:
Flavor name | vCPU | RAM | vGPU Type | vGPU RAM | SSD storage (GB) |
---|---|---|---|---|---|
vm.a6000.1 | 2 | 14 GB | RTXA6000-6C | 6 GB | 40 |
vm.a6000.2 | 4 | 28 GB | RTXA6000-12C | 12 GB | 80 |
vm.a6000.4 | 8 | 56 GB | RTXA6000-24C | 24 GB | 160 |
vm.a6000.8 | 16 | 112 GB | RTXA6000-48C | 48 GB | 320 |
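If you want to pick a flavor from a script, the table above can be encoded as plain data. The following is only a sketch: `FLAVORS` and `smallest_flavor` are hypothetical names introduced for illustration (they are not part of any EUMETSAT tooling), and the numbers are copied from the table.

```python
# Flavor data copied from the table above:
# (name, vCPU, RAM in GB, vGPU RAM in GB, SSD storage in GB), smallest first.
FLAVORS = [
    ("vm.a6000.1", 2, 14, 6, 40),
    ("vm.a6000.2", 4, 28, 12, 80),
    ("vm.a6000.4", 8, 56, 24, 160),
    ("vm.a6000.8", 16, 112, 48, 320),
]


def smallest_flavor(min_vgpu_ram_gb, min_ram_gb=0):
    """Return the name of the smallest flavor meeting both minimums, or None."""
    for name, _vcpu, ram_gb, vgpu_ram_gb, _ssd_gb in FLAVORS:
        if vgpu_ram_gb >= min_vgpu_ram_gb and ram_gb >= min_ram_gb:
            return name
    return None


if __name__ == "__main__":
    # A model needing about 10 GB of GPU memory does not fit the 6 GB flavor.
    print(smallest_flavor(10))
```

Because the list is ordered from smallest to largest, the first match is also the cheapest option in terms of resources.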
Provision
To use the GPUs:
- Provision a new CentOS or Ubuntu instance.
- Select a layout ending with eumetsat-gpu and one of the plans listed above. Apart from that, configure your instance as preferred and continue the deployment process.
- Once the VM is deployed, you can verify the GPU, for example, by running the nvidia-smi program from the command line (see below for confirming library installations and drivers).
Usage
...
Essential commands
You can see GPU information using nvidia-smi
Code Block |
---|
language | bash |
---|
title | Checking the GPU drivers |
---|
|
# Log in to your instance and run the command below
$ nvidia-smi
# Check that the output you received shows the NVIDIA-SMI, driver and CUDA versions. You can also see the GPU hardware (e.g., RTXA6000-6C) and the GPU memory
Mon Feb 5 13:01:43 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTXA6000-6C  On   | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P8    N/A /  N/A |    512MiB /  5976MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+ |
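If you need the same information in a script, nvidia-smi can also print selected fields as CSV through its `--query-gpu` and `--format` options. The sketch below is illustrative: `parse_gpu_query` and `query_gpus` are hypothetical helper names, and the sample row is made up from the values shown on this page rather than captured from a real run.

```python
import csv
import io
import subprocess


def parse_gpu_query(csv_text):
    """Parse 'name, driver_version, memory.total' CSV rows into dicts."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        {"name": name, "driver_version": driver, "memory_total": memory}
        for name, driver, memory in (row for row in rows if row)
    ]


def query_gpus():
    """Run nvidia-smi and return one dict per GPU (requires the NVIDIA driver)."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_query(output)


if __name__ == "__main__":
    # Illustrative sample row (assumption), so the parser can be tried without a GPU.
    sample = "NVIDIA RTXA6000-6C, 470.223.02, 5976 MiB\n"
    print(parse_gpu_query(sample))
```

On a GPU instance, call `query_gpus()` instead of parsing the sample string.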
Code Block |
---|
language | bash |
---|
title | Adding NVIDIA tools to path |
---|
|
# NVIDIA tools are available in /usr/local/cuda-<VERSION>/bin/ (replace <VERSION> with the CUDA version installed on your instance). You can add them to PATH as follows:
$ export PATH=$PATH:/usr/local/cuda-<VERSION>/bin/ |
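The same PATH adjustment can be made from inside a Python process, for example before calling nvcc through `subprocess`. This is a minimal sketch: `append_to_path` is a hypothetical helper, and `/usr/local/cuda/bin` is an assumed directory that you should replace with the CUDA bin directory on your instance.

```python
import os
import shutil


def append_to_path(path_value, new_dir):
    """Return path_value with new_dir appended, unless it is already present."""
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    normalized = new_dir.rstrip("/")
    if normalized not in (entry.rstrip("/") for entry in entries):
        entries.append(new_dir)
    return os.pathsep.join(entries)


if __name__ == "__main__":
    # Assumed directory; replace it with the CUDA bin directory on your VM.
    cuda_bin = "/usr/local/cuda/bin"
    os.environ["PATH"] = append_to_path(os.environ.get("PATH", ""), cuda_bin)
    # shutil.which returns the full path of nvcc, or None if it is not found.
    print(shutil.which("nvcc"))
```

Changing `os.environ` only affects the current Python process and the processes it starts; it does not modify your shell configuration.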
Libraries
Installing Libraries
You can install a variety of libraries using different methods. Below is a basic tutorial showing how to install libraries such as TensorFlow, Keras and PyTorch with the conda package manager. The CUDA version is currently 11.4; it needs to match the drivers and thus cannot be changed. TensorFlow library compatibility is listed at: https://www.tensorflow.org/install/source#gpu. We have tested that TensorFlow versions above 2.6.1 work.
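When checking an installed TensorFlow version against the tested range mentioned above, compare the version components numerically rather than as strings, since "2.13.1" sorts before "2.6.1" as text. The following sketch assumes plain dotted release numbers (no suffixes such as "rc1"); `version_tuple` and `is_in_tested_range` are hypothetical helper names.

```python
# From the text above: TensorFlow versions above this one were tested.
TESTED_ABOVE = "2.6.1"


def version_tuple(version):
    """Convert a dotted release string such as '2.13.1' into (2, 13, 1)."""
    return tuple(int(part) for part in version.split("."))


def is_in_tested_range(version):
    """Return True if version is strictly newer than TESTED_ABOVE."""
    return version_tuple(version) > version_tuple(TESTED_ABOVE)


if __name__ == "__main__":
    for candidate in ("2.5.0", "2.6.1", "2.13.1"):
        print(candidate, is_in_tested_range(candidate))
```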
Using conda
Code Block |
---|
language | bash |
---|
title | Conda installation |
---|
|
# change shell to bash for the installation
$ bash
# update default packages
$ sudo apt-get update
# it's possible to get update key and dirmngr errors while updating; the commands below supply a workaround. After running the workaround, run update & upgrade again.
$ sudo apt install dirmngr
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <YOUR-KEY-LIKE-AA16FCBCA621E701>
# install miniforge (or any conda manager)
$ wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
# make it executable
$ chmod +x Miniforge3-Linux-x86_64.sh
# run and install the executable
$ ./Miniforge3-Linux-x86_64.sh
# when the installer asks "Do you wish the installer to initialize Miniforge3 by running conda init? [yes|no]", answer yes
# restart the shell so that the conda initialization takes effect
$ exit
$ bash |
Code Block |
---|
language | bash |
---|
title | Library installations |
---|
|
# create a conda environment called ML with Python 3.8
$ conda create -n ML python=3.8
# activate the environment
$ conda activate ML
# install packages; note that installing tensorflow-gpu and keras also installs a number of extra libraries such as the CUDA toolkit, cuDNN (CUDA Deep Neural Network library), NumPy, SciPy and Pillow
(ML) $ conda install tensorflow-gpu keras
# (OPTIONAL) cudatoolkit is installed automatically while installing keras and tensorflow-gpu, but if you need a specific (or the latest) version, run the command below.
(ML) $ conda install -c anaconda cudatoolkit
# (OPTIONAL) install PyTorch with GPU support; PyTorch might need CUDA 11.8
(ML) $ conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia |
Installation Confirmations
Here we run a few initial commands for the different libraries and drivers to confirm that the libraries are integrated with the GPU.
Code Block |
---|
language | bash |
---|
title | Check python version |
---|
|
(ML) $ python3 --version
Python 3.8.18 |
Code Block |
---|
language | bash |
---|
title | Check NVIDIA Cuda compiler driver |
---|
|
(ML) $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0 |
Code Block |
---|
language | bash |
---|
title | Check TensorFlow installations in Conda environment |
---|
|
(ML) $ conda list | grep tensorflow
tensorflow 2.13.1 cuda118py38h409af0c_1 conda-forge
tensorflow-base 2.13.1 cuda118py38h52ca5c6_1 conda-forge
tensorflow-estimator 2.13.1 cuda118py38ha2f8a09_1 conda-forge
tensorflow-gpu 2.13.1 cuda118py38h0240f8b_1 conda-forge |
Code Block |
---|
language | bash |
---|
title | Check keras installations in Conda environment |
---|
|
(ML) $ conda list | grep keras
keras 2.13.1 pyhd8ed1ab_0 conda-forge |
Code Block |
---|
language | bash |
---|
title | Enter python interpreter |
---|
|
(ML) $ python
|
Code Block |
---|
language | py |
---|
title | Check TensorFlow installation |
---|
|
>>> import tensorflow as tf
>>> tf.test.is_built_with_cuda()
True
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> print(tf.__version__)
2.13.1
|
Code Block |
---|
language | py |
---|
title | Check PyTorch and CUDA installation |
---|
|
>>> import torch
>>> print(torch.__version__)
2.2.0
>>> print(torch.cuda.is_available())
True
>>> print(torch.version.cuda)
11.8
>>> if torch.cuda.is_available():
...     # Create a tensor and move it to the GPU
...     x = torch.tensor([1.0, 2.0]).cuda()
...     print(x)  # Print the tensor to verify it's on the GPU
... else:
...     print("CUDA is not available. Check your PyTorch installation.")
...
tensor([1., 2.], device='cuda:0')
|
Using Docker
If you want to use GPUs in Docker, you need to take a few extra steps after creating the VM.
Code Block |
---|
language | bash |
---|
title | Install docker on Ubuntu |
---|
|
$ sudo apt install -y docker.io
$ sudo usermod -aG docker $USER |
Code Block |
---|
language | bash |
---|
title | Install docker on CentOS |
---|
|
$ sudo yum-config-manager \
--add-repo \
https://download.docker.com/linux/centos/docker-ce.repo
$ sudo yum install docker-ce docker-ce-cli containerd.io
$ sudo systemctl --now enable docker
$ sudo usermod -aG docker $USER |
To let Docker use the GPU, you need to install the NVIDIA Container Toolkit. You can follow the instructions on NVIDIA's website or, in short, run the following:
Code Block |
---|
language | bash |
---|
title | Install necessary packages for GPU support in Docker and restart docker on Ubuntu |
---|
|
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker |
Code Block |
---|
language | bash |
---|
title | Install necessary packages and restart docker on CentOS |
---|
|
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
$ sudo yum clean expire-cache && sudo yum install -y nvidia-docker2
$ sudo systemctl restart docker |
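The `distribution=$(. /etc/os-release;echo $ID$VERSION_ID)` line in the commands above sources `/etc/os-release` and concatenates its `ID` and `VERSION_ID` fields (for example `ubuntu20.04`), which then selects the matching repository file. The sketch below reproduces that step in Python so you can see what value is produced; `distribution_id` and `read_local_distribution` are hypothetical helper names.

```python
def distribution_id(os_release_text):
    """Return ID + VERSION_ID, e.g. 'ubuntu20.04', from os-release content."""
    fields = {}
    for line in os_release_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be quoted, e.g. VERSION_ID="20.04".
        fields[key] = value.strip().strip('"').strip("'")
    return fields.get("ID", "") + fields.get("VERSION_ID", "")


def read_local_distribution(path="/etc/os-release"):
    """Read the os-release file of the current machine (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return distribution_id(handle.read())


if __name__ == "__main__":
    sample = 'NAME="Ubuntu"\nID=ubuntu\nVERSION_ID="20.04"\n'
    print(distribution_id(sample))
```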
Test the install with:
Code Block |
---|
title | nvidia-smi in docker test |
---|
|
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
Wed Feb 28 13:20:24 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTXA6000-6C  On   | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P8    N/A /  N/A |    512MiB /  5976MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
|
Then run something useful, such as a GPU-compatible notebook. For example:
Code Block |
---|
language | bash |
---|
title | Run TensorFlow Jupyter notebooks |
---|
|
$ sudo docker run --gpus all --env NVIDIA_DISABLE_REQUIRE=1 -it --rm -v $(realpath ~/notebooks):/tf/notebooks -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter |
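If you start the notebook container from a script, it can help to build the argument list explicitly so that the notebook directory and host port are easy to change. This is only a sketch: `notebook_command` is a hypothetical helper, and its defaults mirror the docker run command above.

```python
import os
import shlex


def notebook_command(notebook_dir="~/notebooks", port=8888,
                     image="tensorflow/tensorflow:latest-gpu-jupyter"):
    """Build the docker run argument list for a given notebook directory."""
    host_dir = os.path.realpath(os.path.expanduser(notebook_dir))
    return [
        "docker", "run", "--gpus", "all",
        "--env", "NVIDIA_DISABLE_REQUIRE=1",
        "-it", "--rm",
        "-v", f"{host_dir}:/tf/notebooks",  # mount notebooks into the container
        "-p", f"{port}:8888",               # host port -> Jupyter port in the container
        image,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(notebook_command()))
```

To run it, pass the list to `subprocess.run`, prefixed with `sudo` if your user is not in the docker group.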