
The Atos HPCF features 26 special GPIL nodes with GPUs for experimentation and testing of GPU-enabled applications and modules, as well as Machine Learning and AI workloads, for authorised users. Present in only one of the complexes (AC), each node is equipped with 4 NVIDIA A100 40GB cards. They can be used in batch through the special "ng" QoS in the SLURM Batch System. Interactive jobs are also possible with ecinteractive -g.

Note
titleLimited availability

Since the number of GPUs is limited, be mindful of your usage and do not leave your jobs or sessions on GPU nodes idle. Cancel your jobs when you are done so that someone else can make use of the resources.

...

Table of Contents

Submitting a batch job

You will need to use the dedicated ng queue on AC for this purpose:

Excerpt Include
UDOC:HPC2020: Batch system
nopaneltrue

You can then run a batch job asking for GPUs by adding the following SBATCH directives:

...
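As an illustration, a minimal job script requesting a single GPU on the ng QoS might look like the sketch below; the job name, GPU count and wallclock limit are assumptions to adapt to your workload:

No Format
#!/bin/bash
#SBATCH --job-name=gpu-test        # illustrative job name
#SBATCH --qos=ng                   # GPU-enabled QoS on AC, as described above
#SBATCH --gpus=1                   # assumption: a single GPU is enough
#SBATCH --time=01:00:00            # illustrative wallclock limit

# check which GPU(s) the job was given
nvidia-smi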

Tip
titleGPU-powered Jupyter lab

Some users may be able to run a Jupyter session on an Atos HPCF GPU-enabled node through the ECMWF JupyterHub service.

Alternatively, you can run a Jupyter Lab on a node with a GPU with:

No Format
ecinteractive -g -j

More details on JupyterLab with ecinteractive can be found here.


Warning
titleUsage etiquette

Leaving your interactive sessions idle prevents other users from making use of the resources reserved for your job, and in particular the GPU. So please:

  • Use batch jobs whenever possible to run anything that can be done unattended.
  • Limit your interactive sessions to the minimum required time to accomplish your task, and kill them once you are finished to leave room for the next user. If you need interactive access again later, you can start a new session.

Software stack

Most AI/ML tools and libraries are Python based, so in most cases you can use one of the following methods:

...

No Format
module load python3/new cuda
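As a quick sanity check that the loaded stack can actually see the GPU (assuming the python3 module ships a CUDA-enabled PyTorch; substitute the equivalent call for your framework of choice):

No Format
# prints True if a CUDA device is visible to PyTorch
python3 -c "import torch; print(torch.cuda.is_available())"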

Custom Python Virtual environments

...
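In case you have not created the environment yet, the creation step might look like the following sketch; the $PERM/venvs/myvenv location matches the activation command below, while the module names are assumptions:

No Format
# load a recent Python and CUDA before creating the environment
module load python3 cuda
# create the virtual environment under PERM so it persists across sessions
python3 -m venv $PERM/venvs/myvenv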

Then you can activate it when you need it with:

No Format
module load cuda
source $PERM/venvs/myvenv/bin/activate

And then install any packages you need.
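For example, to put a GPU-enabled PyTorch stack into the environment (the package list is illustrative; install whatever your workflow needs):

No Format
# make sure pip itself is up to date first
pip install --upgrade pip
# illustrative ML packages; adjust to your needs
pip install torch torchvision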

...

No Format
module load conda/new
conda create -n mymlenv pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
conda activate mymlenv
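Once the environment is active, a quick way to confirm that the conda-provided PyTorch can drive the A100 (assuming the solver picked a CUDA-enabled build):

No Format
# prints the name of the first visible GPU, e.g. an A100
python3 -c "import torch; print(torch.cuda.get_device_name(0))"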

Containerised applications

You can run your ML/AI workflows using containers with Apptainer. Since these containers tend to be quite large, we recommend the following:

  • Pull the container to a local image file first. 
  • Ensure you do this on a job requesting enough memory. The default 8GB may not be enough to generate the image files for such big containers. 
  • Store the container image on a Lustre filesystem such as HPCPERM or SCRATCH.

For example, to run an application which requires TensorFlow using the official TensorFlow docker container, first pull the container from Docker Hub:

No Format
$ cd $HPCPERM
$ module load apptainer
$ apptainer pull docker://tensorflow/tensorflow:latest-gpu
...
$ ls tensorflow_latest-gpu.sif 
tensorflow_latest-gpu.sif

Note that pulling the container may take a while. Once you have the container image, you can run it whenever you need it. Here is an example of a very simple computation with the container:

No Format
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2023-03-13 08:48:07.128096: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA     
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:24.016454: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA     
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:29.676864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38249 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:03:00.0, compute capability: 8.0
tf.Tensor(251.08997, shape=(), dtype=float32)

To run your own script, you could also do:

No Format
apptainer run --nv ./tensorflow_latest-gpu.sif python3 your_tensorflow_script.py
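Putting the pieces together, a batch job running the containerised script on a GPU node might look like this sketch; the resource requests and paths are assumptions to adapt:

No Format
#!/bin/bash
#SBATCH --qos=ng                   # GPU-enabled QoS on AC
#SBATCH --gpus=1                   # assumption: a single GPU is enough
#SBATCH --time=02:00:00            # illustrative wallclock limit

module load apptainer
cd $HPCPERM                        # where the image was pulled earlier
# --nv exposes the host GPU driver inside the container
apptainer run --nv ./tensorflow_latest-gpu.sif python3 your_tensorflow_script.py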

Monitoring your GPU usage

...

No Format
$ nvidia-smi
Wed Mar  8 14:39:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   62C    P0   351W / 400W |  39963MiB / 40960MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    181525      C   python                          39960MiB |
+-----------------------------------------------------------------------------+
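
If you want to keep an eye on the GPU for the lifetime of a job, nvidia-smi can also poll periodically and print a compact report; the query fields and 5-second interval below are just one possible choice:

No Format
# print utilisation, memory and temperature every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu --format=csv -l 5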

...