
...

No Format
module load conda/new
conda create -n mymlenv pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
conda activate mymlenv

Containerised applications

You can run your ML/AI workflows using containers with Apptainer. Since these containers tend to be quite large, we recommend the following:

  • Pull the container to a local image file first.
  • Do this in a job that requests enough memory. The default 8 GB may not be enough to build the image file for such large containers.
  • Store the container image on a Lustre filesystem such as HPCPERM or SCRATCH.
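
The storage recommendation above can be sketched as a small shell helper. This is only an illustration: the function name choose_image_dir is hypothetical, and it assumes that HPCPERM and SCRATCH are available as environment variables pointing at the corresponding filesystems.

```shell
# Hypothetical helper: print a Lustre-backed directory for container images.
# Assumes HPCPERM and SCRATCH are environment variables; prefers HPCPERM.
choose_image_dir() {
    if [ -n "${HPCPERM:-}" ]; then
        printf '%s\n' "$HPCPERM"
    elif [ -n "${SCRATCH:-}" ]; then
        printf '%s\n' "$SCRATCH"
    else
        echo "Neither HPCPERM nor SCRATCH is set" >&2
        return 1
    fi
}
```

It could be used as cd "$(choose_image_dir)" before pulling an image, so that the image file does not end up in a home directory.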

For example, to run an application that requires TensorFlow using the official TensorFlow Docker container, first pull the container from Docker Hub:

No Format
$ cd $HPCPERM
$ module load apptainer
$ apptainer pull docker://tensorflow/tensorflow:latest-gpu
...
$ ls tensorflow_latest-gpu.sif 
tensorflow_latest-gpu.sif
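
Because the pull is slow and memory-hungry, it may be worth skipping it when the image file is already present. The following is a minimal sketch of that idea; the function name pull_if_missing is hypothetical, and it relies on the "apptainer pull <output file> <URI>" form of the command.

```shell
# Hypothetical helper: pull a container image only if the .sif file is absent.
#   $1: local image file, e.g. tensorflow_latest-gpu.sif
#   $2: source URI, e.g. docker://tensorflow/tensorflow:latest-gpu
pull_if_missing() {
    local image="$1"
    local uri="$2"
    if [ -f "$image" ]; then
        echo "Using existing image: $image"
    else
        # Requires the apptainer module to be loaded beforehand.
        apptainer pull "$image" "$uri"
    fi
}
```

For instance, pull_if_missing tensorflow_latest-gpu.sif docker://tensorflow/tensorflow:latest-gpu would reuse the file listed above instead of rebuilding it.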

Note that pulling the container may take a while. Once you have the container image, you can run it whenever you need it. Here is an example of a very simple computation with the container:

No Format
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2023-03-13 08:48:07.128096: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA     
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:24.016454: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA     
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 08:48:29.676864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38249 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:03:00.0, compute capability: 8.0
tf.Tensor(251.08997, shape=(), dtype=float32)

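To run your own script, pass it to the container in the same way, keeping the --nv option so that the GPU is available inside the container:

No Format
$ apptainer run --nv ./tensorflow_latest-gpu.sif python3 your_tensorflow_script.py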
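
If you launch scripts this way often, a small wrapper can catch a missing image or script before the container starts. This is a sketch under assumptions rather than a recommended tool: the function name run_script_in_container is hypothetical, and it assumes the script should run with python3 and with GPU support enabled through --nv.

```shell
# Hypothetical wrapper: run a Python script inside a container image with GPU support.
#   $1: container image (.sif file)
#   $2: Python script to run
#   remaining arguments are passed on to the script
run_script_in_container() {
    local image="$1"
    local script="$2"
    if [ ! -f "$image" ]; then
        echo "Container image not found: $image" >&2
        return 1
    fi
    if [ ! -f "$script" ]; then
        echo "Script not found: $script" >&2
        return 1
    fi
    shift 2
    apptainer run --nv "$image" python3 "$script" "$@"
}
```

A call such as run_script_in_container ./tensorflow_latest-gpu.sif your_tensorflow_script.py would then fail early with a clear message if either file is missing.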

Monitoring your GPU usage

...

No Format
$ nvidia-smi
Wed Mar  8 14:39:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   62C    P0   351W / 400W |  39963MiB / 40960MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    181525      C   python                          39960MiB |
+-----------------------------------------------------------------------------+
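
For scripted monitoring, nvidia-smi can also print selected fields as CSV, which is easier to parse than the table. The sketch below assumes the query "nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits", which prints one line per GPU; the function name summarise_gpu_line is hypothetical.

```shell
# Hypothetical helper: summarise one CSV line of the form "util, used, total"
# (utilisation in %, memory in MiB) as produced by the query described above.
summarise_gpu_line() {
    local line util used total
    # Remove spaces, then split the three fields on commas.
    line=$(printf '%s' "$1" | tr -d ' ')
    IFS=, read -r util used total <<EOF
$line
EOF
    echo "GPU utilisation: ${util}%, memory: ${used}/${total} MiB ($(( used * 100 / total ))%)"
}
```

Each line of the query output can be passed to the function, for example in a "while read -r line" loop. With the values 93, 39963, 40960 it reports 93% utilisation and 97% memory use (integer division).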

...