The Atos HPCF features 26 special GPIL nodes with GPUs for experimentation and testing of GPU-enabled applications and modules, as well as Machine Learning and AI workloads. These nodes are present in only one of the complexes (AC), and each is equipped with 4 NVIDIA A100 40GB cards. They can be used in batch through the special "ng" QoS in the SLURM Batch System. Interactive jobs are also possible with ecinteractive -g.

Limited availability

Since the number of GPUs is limited, be mindful of your usage and do not leave your jobs or sessions on GPU nodes idle. Cancel your jobs when you are done so that others can make use of the resources.

GPU exclusive use

The requested GPUs will be reserved exclusively for your job, and only those GPUs will be visible within it.
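For illustration, a job submitted with --gpus=2 would see only those two devices. A minimal check from inside such a job might look like this:

nvidia-smi -L    # inside a job submitted with --gpus=2, this lists only the 2 allocated GPUs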

Submitting a batch job

You will need to use the dedicated ng queue on AC for this purpose:


QoS name | Type | Suitable for...                                   | Shared nodes | Maximum jobs per user | Default / Max Wall Clock Limit                | Default / Max CPUs | Default / Max Memory per node
ng       | GPU  | serial and small parallel jobs. It is the default | Yes          | -                     | average runtime + standard deviation / 2 days | 1 / -              | 8 GB / 500 GB

You can then run a batch job adding the following SBATCH directives:

#SBATCH --qos=ng
#SBATCH --gpus=1

You may request more than one GPU in the same job if your workload requires it. All the rest of the SLURM options may be used as well to configure your job to fit your needs.
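Putting it together, a minimal GPU batch job could look like the sketch below; the resource values are illustrative and train_model.py is a hypothetical script:

#!/bin/bash
#SBATCH --qos=ng
#SBATCH --gpus=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=06:00:00
#SBATCH --output=gpu_job.%j.out

# Load a recent Python and the CUDA runtime (see "Software stack" below)
module load python3/new cuda

python3 train_model.py

Submit it with sbatch as usual; the job will run on the GPU nodes on AC.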

Running on AC

You can submit the job from any Atos HPCF complex, but note that it will be automatically redirected to AC. If you are logged into a different complex, you will not be able to query the state of the job or cancel it with the standard squeue and scancel commands. If you wish to do so, you will need to either (see the example after the list):

  • ssh ac-login and run squeue or scancel there.
  • Use ecsqueue or ecscancel from any complex.
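For example, assuming 12345678 is your job ID (a placeholder) and that the ec* wrappers accept the usual squeue/scancel arguments, you could do:

# Option 1: log into the AC complex and use the standard commands
ssh ac-login
squeue -j 12345678
scancel 12345678

# Option 2: use the cross-complex wrappers from any complex
ecsqueue -j 12345678
ecscancel 12345678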

Working interactively

You may also open an interactive session on one of the GPU nodes with ecinteractive using the -g option, which will allocate one GPU for your interactive job. All the other ecinteractive options still apply for requesting additional resources such as CPUs, memory or TMPDIR space.

Only one interactive session on a GPU node may be active at any time. This means that if you rerun ecinteractive -g from a different terminal, you will be attached to the same session, using the same resources.
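For example, to request a GPU together with more CPUs, memory and a longer time limit in an interactive session (the values are illustrative; check ecinteractive -h for the exact option names):

ecinteractive -g -c 8 -m 32G -t 12:00:00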

GPU-powered Jupyter lab

Some users may be able to run a Jupyter session on an Atos HPCF GPU-enabled node through the ECMWF JupyterHub service.

Alternatively, you can run a JupyterLab session on a GPU node with:

ecinteractive -g -j

More details on JupyterLab with ecinteractive can be found here.

Usage etiquette

Leaving your interactive sessions idle prevents other users from making use of the resources reserved for your job, and in particular the GPU. So please:

  • Use batch jobs whenever possible to run anything that can be done unattended.
  • Limit your interactive sessions to the minimum required time to accomplish your task, and kill them once you are finished to leave room for the next user. If you need interactive access again later, you can start a new session.

Software stack

Most AI/ML tools and libraries are Python-based, so in most cases you can use one of the following methods:

Readily available tools

A number of standard Data Science and AI/ML Python packages such as TensorFlow or PyTorch are available out of the box, as part of the standard Python 3 offering via modules. For best results, use the newest version of the Python3 module:

module load python3/new cuda
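As a quick sanity check that the bundled frameworks can see the GPU, you can run the following on a GPU node, i.e. inside an ng job or an ecinteractive -g session:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
python3 -c "import torch; print(torch.cuda.is_available())"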

Custom Python Virtual environments

If you need to customise your Python environment, you may create a virtual environment based on the installations provided. This may be useful if you need to use a newer version of a specific Python package, but still want to benefit from the rest of the managed Python environment:

module load python3/new
mkdir -p $PERM/venvs
cd $PERM/venvs
python3 -m venv --system-site-packages myvenv

Then you can activate it when you need it with:

module load cuda
source $PERM/venvs/myvenv/bin/activate

And then install any packages you need.
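For example, inside the activated environment:

python3 -m pip install --upgrade scikit-learn   # example package; install whatever you need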

You can also create a completely standalone environment from scratch, by removing the --system-site-packages option above.

Conda-based stack

You may create your own conda environments with all the AI/ML tools you need.

For example, to create a conda environment with PyTorch:

module load conda/new
conda create -n mymlenv pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
conda activate mymlenv
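You can then verify from within a GPU job or interactive session that PyTorch in the new environment sees the allocated GPU:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"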

Containerised applications

You can run your ML/AI workflows using containers with Apptainer. Since these containers tend to be quite large, we recommend the following:

  • Pull the container to a local image file first. 
  • Ensure you do this in a job requesting enough memory. The default 8 GB may not be enough to generate the image files for such big containers.
  • Store the container image on a Lustre filesystem such as HPCPERM or SCRATCH.

For example, to run an application which requires TensorFlow using the official TensorFlow docker container, first pull the container from Docker Hub:

$ cd $HPCPERM
$ module load apptainer
$ apptainer pull docker://tensorflow/tensorflow:latest-gpu
...
$ ls tensorflow_latest-gpu.sif 
tensorflow_latest-gpu.sif

Note that pulling the container may take a while. Once you have the container image, you can run it whenever you need it. Here is an example of a very simple computation with the container:

$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2023-03-13 08:48:07.128096: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA     
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:24.016454: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA     
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:29.676864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38249 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:03:00.0, compute capability: 8.0
tf.Tensor(251.08997, shape=(), dtype=float32)

To run your own script, you could also do:

apptainer run --nv ./tensorflow_latest-gpu.sif python3 your_tensorflow_script.py
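As a sketch, the same container can also be used from a batch job; the paths and resource values below are illustrative:

#!/bin/bash
#SBATCH --qos=ng
#SBATCH --gpus=1
#SBATCH --mem=32G
#SBATCH --time=12:00:00

module load apptainer

# Run the previously pulled image with GPU support (--nv)
apptainer run --nv $HPCPERM/tensorflow_latest-gpu.sif python3 your_tensorflow_script.py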

Monitoring your GPU usage

You may use the following commands to monitor the usage of the GPUs you have access to. If you want to do it interactively, you may open a new shell on the node running your job and run the corresponding monitoring tool. You can get the name of the node running your job with squeue.
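For example (the job ID is a placeholder):

squeue -j 12345678        # the NODELIST column shows the node running your job
ssh <node_name>           # open a shell there and run nvidia-smi or nvtop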

If running an ecinteractive job, just call ecinteractive from another terminal to get a shell on the relevant node.

nvidia-smi

nvidia-smi provides monitoring and management capabilities for the GPUs from the command line and will give you instantaneous information about your GPUs.

$ nvidia-smi
Wed Mar  8 14:39:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   62C    P0   351W / 400W |  39963MiB / 40960MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    181525      C   python                          39960MiB |
+-----------------------------------------------------------------------------+

This command has a number of advanced options. If you want to log the usage of the GPUs by your processes in a batch job, you could use the following strategy:

nvidia-smi pmon -o DT -d 5 --filename gpu_usage.log &
monitor_pid=$!

# ... your GPU workload goes here ...

kill $monitor_pid

In this example, nvidia-smi will log to gpu_usage.log the processes using the GPU and their resource usage every 5 seconds, adding the date and time to each line for better tracking.

See man nvidia-smi for more information.

nvtop

Nvtop stands for Neat Videocard TOP, an htop-like task monitor for GPUs. It can handle multiple GPUs and print information about them in a familiar, htop-like way. It is useful if you want to interactively monitor GPU usage and watch its evolution live. To make the command available, run:

module load nvtop

See man nvtop for all the options.
