
The Atos HPCF features 26 special GPIL nodes with GPUs for experimentation and testing of GPU-enabled applications and modules, as well as Machine Learning and AI workloads, for authorised users. Present in only one of the complexes (AC), each node is equipped with 4 NVIDIA A100 40GB cards. They can be used in batch through the special "ng" QoS in the SLURM Batch System. Interactive jobs are also possible with ecinteractive -g.

Note
titleLimited availability

Since the number of GPUs is limited, be mindful of your usage and do not leave your jobs or sessions on GPU nodes idle. Cancel your jobs when you are done so that someone else can make use of the resources.

...

Table of Contents

Submitting a batch job

You will need to use the dedicated ng queue on AC for this purpose:

Excerpt Include
UDOC:HPC2020: Batch system
nopaneltrue

You can then run a batch job asking for GPUs by adding the following SBATCH directives:

...
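As an illustration, a minimal job script requesting a single GPU on the ng QoS might look like the sketch below; the job name, GPU count and wallclock limit are assumptions to adapt to your workload:

No Format
#!/bin/bash
#SBATCH --job-name=gpu-test        # illustrative job name
#SBATCH --qos=ng                   # GPU-enabled QoS on AC, as described above
#SBATCH --gpus=1                   # assumption: a single GPU is enough
#SBATCH --time=01:00:00            # illustrative wallclock limit

# check which GPU(s) the job was given
nvidia-smi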

Tip
titleGPU-powered Jupyter lab

Some users may be able to run a Jupyter session on an Atos HPCF GPU-enabled node through the ECMWF JupyterHub service.

Alternatively, you can run a Jupyter Lab on a node with a GPU with:

No Format
ecinteractive -g -j

More details on JupyterLab with ecinteractive can be found here.


Warning
titleUsage etiquette

Leaving your interactive sessions idle prevents other users from making use of the resources reserved for your job, and in particular the GPU. So please:

  • Use batch jobs whenever possible to run anything that can be done unattended.
  • Limit your interactive sessions to the minimum required time to accomplish your task, and kill them once you are finished to leave room for the next user. If you need interactive access again later, you can start a new session.

Software stack

Most AI/ML tools and libraries are Python based, so in most cases you can use one of the following methods:

...

No Format
module load python3/new cuda
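As a quick sanity check that the loaded stack can actually see the GPU (assuming the python3 module ships a CUDA-enabled PyTorch; substitute the equivalent call for your framework of choice):

No Format
# prints True if a CUDA device is visible to PyTorch
python3 -c "import torch; print(torch.cuda.is_available())"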

Custom Python Virtual environments

...
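In case you have not created the environment yet, the creation step might look like the following sketch; the $PERM/venvs/myvenv location matches the activation command below, while the module names are assumptions:

No Format
# load a recent Python and CUDA before creating the environment
module load python3 cuda
# create the virtual environment under PERM so it persists across sessions
python3 -m venv $PERM/venvs/myvenv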

Then you can activate it when you need it with:

No Format
module load cuda
source $PERM/venvs/myvenv/bin/activate

And then install any packages you need.
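For example, to put a GPU-enabled PyTorch stack into the environment (the package list is illustrative; install whatever your workflow needs):

No Format
# make sure pip itself is up to date first
pip install --upgrade pip
# illustrative ML packages; adjust to your needs
pip install torch torchvision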

...

No Format
module load conda/new
conda create -n mymlenv pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
conda activate mymlenv
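Once the environment is active, a quick way to confirm that the conda-provided PyTorch can drive the A100 (assuming the solver picked a CUDA-enabled build):

No Format
# prints the name of the first visible GPU, e.g. an A100
python3 -c "import torch; print(torch.cuda.get_device_name(0))"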

Containerised applications

You can run your ML/AI workflows using containers with Apptainer. Since these containers tend to be quite large, we recommend the following:

  • Pull the container to a local image file first. 
  • Ensure you do this on a job requesting enough memory. The default 8GB may not be enough to generate the image files for such big containers. 
  • Store the container image on a Lustre filesystem such as HPCPERM or SCRATCH.

For example, to run an application which requires TensorFlow using the official TensorFlow docker container, first pull the container from Docker Hub:

No Format
$ cd $HPCPERM
$ module load apptainer
$ apptainer pull docker://tensorflow/tensorflow:latest-gpu
...
$ ls tensorflow_latest-gpu.sif 
tensorflow_latest-gpu.sif

Note that pulling the container may take a while. Once you have the container image, you can run it whenever you need it. Here is an example of a very simple computation with the container:

No Format
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2023-03-13 08:48:07.128096: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA     
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:24.016454: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA     
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:29.676864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38249 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:03:00.0, compute capability: 8.0
tf.Tensor(251.08997, shape=(), dtype=float32)

To run your own script, you could also do:

No Format
apptainer run --nv ./tensorflow_latest-gpu.sif python3 your_tensorflow_script.py
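Putting the pieces together, a batch job running the containerised script on a GPU node might look like this sketch; the resource requests and paths are assumptions to adapt:

No Format
#!/bin/bash
#SBATCH --qos=ng                   # GPU-enabled QoS on AC
#SBATCH --gpus=1                   # assumption: a single GPU is enough
#SBATCH --time=02:00:00            # illustrative wallclock limit

module load apptainer
cd $HPCPERM                        # where the image was pulled earlier
# --nv exposes the host GPU driver inside the container
apptainer run --nv ./tensorflow_latest-gpu.sif python3 your_tensorflow_script.py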

Monitoring your GPU usage

...

No Format
$ nvidia-smi
Wed Mar  8 14:39:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   62C    P0   351W / 400W |  39963MiB / 40960MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    181525      C   python                          39960MiB |
+-----------------------------------------------------------------------------+
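
If you want to keep an eye on the GPU for the lifetime of a job, nvidia-smi can also poll periodically and print a compact report; the query fields and 5-second interval below are just one possible choice:

No Format
# print utilisation, memory and temperature every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu --format=csv -l 5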

...