Warning

This is an experimental service; certain features may be added in the future as the system matures. Production or operational use of this service is discouraged.
The Atos HPCF features 26 special GPIL nodes with GPUs for experimentation and testing of GPU-enabled applications and modules, as well as Machine Learning and AI workloads. Present in only one of the complexes (AC), each node is equipped with 4 NVIDIA A100 40GB cards. They can be used in batch through the special "ng" QoS in the SLURM Batch System. Interactive jobs are also possible with ecinteractive -g.
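As a sketch, a batch job using the "ng" QoS might look like the script below. The QoS name comes from the text above; the job name, GPU count, memory, and wall time are illustrative assumptions to adapt to your own workload.

```shell
#!/bin/bash
# Hypothetical SLURM batch script for a GPU-enabled GPIL node.
# Only --qos=ng is taken from the documentation; the remaining
# directives are placeholder values.
#SBATCH --job-name=gpu-test
#SBATCH --qos=ng
#SBATCH --gpus=1
#SBATCH --mem=16G
#SBATCH --time=01:00:00

module load cuda
nvidia-smi
```

Submit it with sbatch as usual; SLURM will place the job on one of the GPU nodes in the AC complex.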
...
Tip

Some users may be able to run a Jupyter session on an Atos HPCF GPU-enabled node through the ECMWF JupyterHub service. Alternatively, you can run a JupyterLab on a node with a GPU with:

More details on JupyterLab with ecinteractive can be found here.
...
No Format |
---|
module load cuda
source $PERM/venvs/myvenv/bin/activate |
And then install any packages you need.
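For instance, to install PyTorch into the activated virtual environment, something like the following may work. The package name and CUDA-specific wheel index are illustrative assumptions; check the PyTorch installation instructions for the combination matching the cuda module you loaded.

```shell
# Inside the activated virtual environment; versions are assumptions
pip install --upgrade pip
pip install torch --index-url https://download.pytorch.org/whl/cu118
```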
...
No Format |
---|
module load conda/new
conda create -n mymlenv pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
conda activate mymlenv |
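Once the environment is active on a GPU node (for example inside an ecinteractive -g session), a quick sanity check that PyTorch can see the cards is a one-liner like this (a sketch, assuming the mymlenv environment created above):

```shell
# On a GPU node this should report True and the number of visible A100s;
# on a login node it will report False and 0.
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```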
Containerised applications
You can run your ML/AI workflows using containers with Apptainer. Since these containers tend to be quite large, we recommend the following:
- Pull the container to a local image file first.
- Ensure you do this on a job requesting enough memory. The default 8GB may not be enough to generate the image files for such big containers.
- Store the container image on a Lustre filesystem such as HPCPERM or SCRATCH
For example, to run an application which requires TensorFlow using the official TensorFlow docker container, first pull the container from Docker Hub:
No Format |
---|
$ cd $HPCPERM
$ module load apptainer
$ apptainer pull docker://tensorflow/tensorflow:latest-gpu
...
$ ls tensorflow_latest-gpu.sif
tensorflow_latest-gpu.sif |
Note that pulling the container may take a while. Once you have the container image, you can run it whenever you need it. Here is an example of a very simple computation with the container:
No Format |
---|
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2023-03-13 08:48:07.128096: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:29.676864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38249 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:03:00.0, compute capability: 8.0
tf.Tensor(251.08997, shape=(), dtype=float32) |
To run your own script, you could also do:
No Format |
---|
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 your_tensorflow_script.py |
Monitoring your GPU usage
...
No Format |
---|
$ nvidia-smi
Wed Mar  8 14:39:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   62C    P0   351W / 400W |  39963MiB / 40960MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    181525      C   python                          39960MiB |
+-----------------------------------------------------------------------------+ |
...