The Atos HPCF features 26 special GPIL nodes with GPUs for experimentation and testing of GPU-enabled applications and modules, as well as Machine Learning and AI workloads, for authorised users. Present in only one of the complexes (AC), each node is equipped with 4 NVIDIA A100 40GB cards. They can be used in batch through the special "ng" QoS in the SLURM Batch System. Interactive jobs are also possible with ecinteractive -g.
Note |
---|
Since the number of GPUs is limited, be mindful of your usage and do not leave your jobs or sessions on GPU nodes idle. Cancel your jobs when you are done so that someone else can make use of the resources. |
...
Submitting a batch job
You will need to use the dedicated ng queue on AC for this purpose.
You can then run a batch job asking for GPUs by adding the following SBATCH directives:
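As a minimal sketch (the GPU count, walltime and output path are illustrative; --qos=ng is the key directive named above):

```shell
#!/bin/bash
#SBATCH --qos=ng              # dedicated GPU QoS on the AC complex
#SBATCH --gpus=1              # number of A100 GPUs requested (up to 4 per node)
#SBATCH --time=01:00:00       # illustrative walltime
#SBATCH --output=gpu_job.out  # illustrative output file

module load cuda
nvidia-smi                    # confirm which GPU was allocated to the job
```

Submit it with `sbatch` as usual; the job will only run on the GPU-enabled GPIL nodes.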
...
Tip |
---|
Some users may be able to run a Jupyter session on an Atos HPCF GPU-enabled node through the ECMWF JupyterHub service. Alternatively, you can run a JupyterLab on a node with a GPU with ecinteractive. More details on JupyterLab with ecinteractive can be found here. |
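For example (assuming the -j option of your ecinteractive version starts JupyterLab, as described in the ecinteractive documentation):

```shell
# Request an interactive session on a GPU node and start JupyterLab in it.
# -g asks for a GPU; -j starts a JupyterLab server (check ecinteractive -h).
ecinteractive -g -j
```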
Warning |
---|
Leaving your interactive sessions idle prevents other users from making use of the resources reserved for your job, and in particular the GPU. So please:
|
Software stack
Most AI/ML tools and libraries are Python-based, so in most cases you can use one of the following methods:
...
No Format |
---|
module load python3/new cuda |
Custom Python Virtual environments
...
Then you can activate it when you need it with:
No Format |
---|
module load cuda
source $PERM/venvs/myvenv/bin/activate |
And then install any packages you need.
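For instance (the package names here are purely illustrative):

```shell
# Install packages into the active virtual environment; pip will pick up
# the CUDA libraries from the loaded cuda module where needed.
pip install torch torchvision
```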
...
No Format |
---|
module load conda/new
conda create -n mymlenv pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
conda activate mymlenv |
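As a quick sanity check that the environment can actually see the GPU (this must be run on a GPU node, e.g. inside an ng job or ecinteractive -g session):

```python
# Verify that PyTorch detects the A100 allocated to the job
import torch

print(torch.cuda.is_available())       # True on a GPU node with a working driver
print(torch.cuda.device_count())       # number of GPUs visible to the job
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA A100-SXM4-40GB"
```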
Containerised applications
You can run your ML/AI workflows using containers with Apptainer. Since these containers tend to be quite large, we recommend the following:
- Pull the container to a local image file first.
- Ensure you do this in a job requesting enough memory. The default 8GB may not be enough to generate the image files for such big containers.
- Store the container image on a Lustre filesystem such as HPCPERM or SCRATCH.
For example, to run an application which requires TensorFlow using the official TensorFlow docker container, first pull the container from Docker Hub:
No Format |
---|
$ cd $HPCPERM
$ module load apptainer
$ apptainer pull docker://tensorflow/tensorflow:latest-gpu
...
$ ls tensorflow_latest-gpu.sif
tensorflow_latest-gpu.sif |
Note that pulling the container may take a while. Once you have the container image, you can run it whenever you need it. Here is an example of a very simple computation with the container:
No Format |
---|
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2023-03-13 08:48:07.128096: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:24.016454: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:29.676864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38249 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:03:00.0, compute capability: 8.0
tf.Tensor(251.08997, shape=(), dtype=float32) |
To run your own script, you could also do:
No Format |
---|
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 your_tensorflow_script.py |
Monitoring your GPU usage
...
No Format |
---|
$ nvidia-smi
Wed Mar  8 14:39:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   62C    P0   351W / 400W |  39963MiB / 40960MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    181525      C   python                          39960MiB |
+-----------------------------------------------------------------------------+ |
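For lightweight, scriptable monitoring, nvidia-smi can also emit machine-readable output using its standard query options:

```shell
# Print utilisation and memory for each GPU in CSV form, refreshing every 2 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
           --format=csv -l 2
```

This is handy for logging GPU usage alongside your job output without parsing the full table above.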
...