Warning

This is an experimental service; certain features may be added in the future as the system matures. Production or operational use of this service is discouraged.
The Atos HPCF features 26 special GPIL nodes with GPUs for experimentation and testing of GPU-enabled applications and modules, as well as Machine Learning and AI workloads. Present in only one of the complexes (AC), each node is equipped with 4 NVIDIA A100 40GB cards. They can be used in batch through the special "ng" QoS in the SLURM Batch System. Interactive jobs are also possible with ecinteractive -g.
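As a sketch, a batch job using the "ng" QoS might look like the script below. The QoS name comes from the text above; the job name, GPU count, memory, and wall time are illustrative assumptions to adapt to your own workload.

```shell
#!/bin/bash
# Hypothetical SLURM batch script for a GPU-enabled GPIL node.
# Only --qos=ng is taken from the documentation; the remaining
# directives are placeholder values.
#SBATCH --job-name=gpu-test
#SBATCH --qos=ng
#SBATCH --gpus=1
#SBATCH --mem=16G
#SBATCH --time=01:00:00

module load cuda
nvidia-smi
```

Submit it with sbatch as usual; SLURM will place the job on one of the GPU nodes in the AC complex.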
...
Tip

Some users may be able to run a Jupyter session on an Atos HPCF GPU-enabled node through the ECMWF JupyterHub service. Alternatively, you can run a JupyterLab on a node with a GPU with:

More details on JupyterLab with ecinteractive can be found here.
...
No Format |
---|
module load cuda
source $PERM/venvs/myvenv/bin/activate |
And then install any packages you need.
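For instance, to install PyTorch into the activated virtual environment, something like the following may work. The package name and CUDA-specific wheel index are illustrative assumptions; check the PyTorch installation instructions for the combination matching the cuda module you loaded.

```shell
# Inside the activated virtual environment; versions are assumptions
pip install --upgrade pip
pip install torch --index-url https://download.pytorch.org/whl/cu118
```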
...
No Format |
---|
module load conda/new
conda create -n mymlenv pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
conda activate mymlenv |
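Once the environment is active on a GPU node (for example inside an ecinteractive -g session), a quick sanity check that PyTorch can see the cards is a one-liner like this (a sketch, assuming the mymlenv environment created above):

```shell
# On a GPU node this should report True and the number of visible A100s;
# on a login node it will report False and 0.
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```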
Containerised applications
You can run your ML/AI workflows using containers with Apptainer. Since these containers tend to be quite large, we recommend the following:
- Pull the container to a local image file first.
- Ensure you do this on a job requesting enough memory. The default 8GB may not be enough to generate the image files for such big containers.
- Store the container image on a Lustre filesystem such as HPCPERM or SCRATCH
For example, to run an application which requires TensorFlow using the official TensorFlow docker container, first pull the container from Docker Hub:
No Format |
---|
$ cd $HPCPERM
$ module load apptainer
$ apptainer pull docker://tensorflow/tensorflow:latest-gpu
...
$ ls tensorflow_latest-gpu.sif
tensorflow_latest-gpu.sif |
Note that pulling the container may take a while. Once you have the container image, you can run it whenever you need it. Here is an example of a very simple computation with the container:
No Format |
---|
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2023-03-13 08:48:07.128096: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:29.676864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38249 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:03:00.0, compute capability: 8.0
tf.Tensor(251.08997, shape=(), dtype=float32) |
To run your own script, you could also do:
No Format |
---|
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 your_tensorflow_script.py |
Monitoring your GPU usage
...
No Format |
---|
$ nvidia-smi
Wed Mar  8 14:39:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   62C    P0   351W / 400W |  39963MiB / 40960MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    181525      C   python                          39960MiB |
+-----------------------------------------------------------------------------+ |
...