The Atos HPCF features 26 special GPIL nodes with GPUs for experimentation and testing of GPU-enabled applications and modules, as well as Machine Learning and AI workloads. Present in only one of the complexes (AC), each node is equipped with 4 NVIDIA A100 40GB cards. They can be used in batch through the special "ng" QoS in the SLURM Batch System. Interactive jobs are also possible with ecinteractive -g.
> **Note**
>
> Since the number of GPUs is limited, be mindful of your usage and do not leave your jobs or sessions on GPU nodes idle. Cancel your jobs when you are done so that someone else can make use of the resources.
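For reference, idle or finished-with jobs can be cancelled with the standard SLURM commands:

```
# list your queued and running jobs
squeue -u $USER

# cancel a job you no longer need so the GPUs are freed
scancel <jobid>
```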
...
You will need to use the dedicated queues on AC for this purpose:
...
You may request more than one GPU in the same job if your workload requires it. All other SLURM options may also be used to configure your job to fit your needs.
...
**Running on AC**
...
...
...
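As an illustrative sketch only (the authoritative directives are in the example above; the --gpus syntax in particular is an assumption and may differ on the system), a minimal batch script using the ng QoS could look like this:

```
#!/bin/bash
#SBATCH --qos=ng            # dedicated GPU QoS on AC
#SBATCH --gpus=1            # number of A100 cards, up to 4 per node (syntax assumed)
#SBATCH --time=01:00:00
#SBATCH --output=gpu-job.out

module load cuda
nvidia-smi                  # confirm the allocated GPU is visible to the job
```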
Working interactively
You may also open an interactive session on one of the GPU nodes with ecinteractive using the -gp ag option, which will allocate one GPU for your interactive job on this cluster. All the other options still apply when it comes to requesting other resources such as CPUs, memory or TMPDIR space.
Only one interactive session on a GPU node may be active at any point. That means that if you rerun ecinteractive -gp ag from a different terminal, you will be attached to the same session using the same resources.
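For example, a hedged sketch of such a session (check ecinteractive -h for the exact spelling of the CPU and memory flags):

```
# one GPU on ag, plus 4 CPUs and 16 GB of memory for the interactive job
ecinteractive -gp ag -c 4 -m 16G
```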
...
```
module load python3/new cuda
```
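If you want to check that the CUDA toolkit is on your path after loading the modules (assuming the cuda module provides the nvcc compiler):

```
nvcc --version    # report the CUDA toolkit version made available by the cuda module
```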
Custom Python Virtual environments
...
You can also create a completely standalone environment from scratch by removing the --system-site-packages option above.
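As a minimal sketch of both variants, assuming the python3 module is loaded as shown above (the target paths are illustrative):

```
# environment that inherits the packages from the loaded python3 module
python3 -m venv --system-site-packages $PERM/venvs/mymlenv

# or a fully standalone environment instead
python3 -m venv $PERM/venvs/mycleanenv

source $PERM/venvs/mymlenv/bin/activate
pip install --upgrade pip
```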
Conda-based stack with Tykky
You may create your own containerised conda environment with all the AI/ML tools you need.
For example, to create a conda environment with PyTorch:
```
module load conda/new
conda create -n mymlenv pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
conda activate mymlenv
```
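Once the environment is activated on a GPU node, you can quickly verify that PyTorch sees the GPUs:

```
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```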
Containerised applications
You can run your ML/AI workflows using containers with Apptainer. Since these containers tend to be quite large, we recommend the following:
- Pull the container to a local image file first.
- Ensure you do this in a job requesting enough memory. The default 8 GB may not be enough to generate the image files for such big containers.
- Store the container image on a Lustre filesystem such as HPCPERM or SCRATCH.
For example, to run an application which requires TensorFlow using the official TensorFlow docker container, first pull the container from Docker Hub:
```
$ cd $HPCPERM
$ module load apptainer
$ apptainer pull docker://tensorflow/tensorflow:latest-gpu
...
$ ls tensorflow_latest-gpu.sif
tensorflow_latest-gpu.sif
```
Note that pulling the container may take a while. Once you have the container image, you can run it whenever you need it. Here is an example of a very simple computation with the container:
```
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2023-03-13 08:48:07.128096: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:24.016454: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:29.676864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38249 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:03:00.0, compute capability: 8.0
tf.Tensor(251.08997, shape=(), dtype=float32)
```
To run your own script, you could also do:
...
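As an illustrative sketch of such an invocation (myscript.py is a hypothetical script name; Apptainer binds your current directory into the container by default):

```
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 myscript.py
```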
Follow the instructions on HPC2020: Containerised software installations with Tykky.
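As a rough sketch of that approach, assuming Tykky's conda-containerize tool and a hypothetical env.yml environment definition (the module name and paths are assumptions; the linked page is authoritative):

```
module load tykky                                        # module name assumed
conda-containerize new --prefix $HPCPERM/mymlenv env.yml # build the containerised environment
export PATH="$HPCPERM/mymlenv/bin:$PATH"                 # make its tools available
```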
Monitoring your GPU usage
...
Nvtop stands for Neat Videocard TOP, an htop-like task monitor for GPUs. It can handle multiple GPUs and prints information about them in a way familiar to htop users. It is useful if you want to monitor GPU usage interactively and watch its evolution live. To make the command available, run:
```
module load nvtop
```
See man nvtop for all the options.
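For a quick, non-interactive snapshot of GPU activity, the standard NVIDIA tool should also be available on the GPU nodes:

```
nvidia-smi         # one-off snapshot of GPU utilisation and memory
nvidia-smi -l 5    # refresh the report every 5 seconds
```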