Since the number of GPUs is limited, be mindful of your usage and do not leave your jobs or sessions on GPU nodes idle. Cancel your jobs when you are done so that someone else can make use of the resources.
The GPUs you request are reserved exclusively for your job, and only those GPUs are visible within it.
You will need to use the dedicated queues on AG for this purpose. You can then run a batch job by adding the following SBATCH directives:

    #SBATCH --qos=ng
    #SBATCH --gpus=1

You may request more than one GPU in the same job if your workload requires it. All the rest of the SLURM options may be used as well to configure your job to fit your needs.
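As an illustration, the directives above can be combined into a complete job script. The following is a minimal sketch rather than a recommended configuration: the file name gpu_job.sh, the job name, the time limit and the output file pattern are placeholders chosen for this example, and the commented-out workload line is hypothetical.

```shell
# Write a small example job script to gpu_job.sh (placeholder file name).
cat > gpu_job.sh <<'EOF'
#!/bin/bash
# Illustrative job name
#SBATCH --job-name=gpu-test
# Dedicated GPU queue and number of GPUs, as described above
#SBATCH --qos=ng
#SBATCH --gpus=1
# Illustrative wall-clock limit and output file (%j expands to the job ID)
#SBATCH --time=00:10:00
#SBATCH --output=gpu-test.%j.out

module load python3 cuda

# List the GPUs visible inside the job; only the reserved ones should appear.
nvidia-smi -L

# Your GPU workload goes here, for example (hypothetical script name):
# python3 train.py
EOF

# Submit the job with: sbatch gpu_job.sh
```

Adjust `--gpus`, the time limit and any other SLURM options to match what your workload actually needs.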
You may also open an interactive session on one of the GPU nodes with ecinteractive using the -p ag option, which will allocate one GPU for your interactive job on this cluster. All the other options still apply when it comes to requesting other resources such as CPUs, memory or TMPDIR space.
Only one interactive session on a GPU node may be active at any point. That means that if you rerun ecinteractive -p ag from a different terminal, you will be attached to the same session using the same resources.
Some users may be able to run a Jupyter session on an Atos HPCF GPU-enabled node through the ECMWF JupyterHub service. Alternatively, you can also run a Jupyter Lab on a node with a GPU with:
More details on JupyterLab with ecinteractive can be found here.
Leaving your interactive sessions idle prevents other users from making use of the resources reserved for your job, in particular the GPU. So please cancel your interactive session as soon as you have finished working with it.
Most AI/ML tools and libraries are Python based, so in most cases you can use one of the following methods:
A number of standard Data Science and AI/ML Python packages such as TensorFlow or PyTorch are available out of the box, as part of the standard Python 3 offering via modules. For best results, use the newest version of the Python3 module:

    module load python3 cuda

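To confirm that the loaded environment can actually see the GPU reserved for your job, a quick check along the following lines may help. This is only a sketch: the function name check_torch_gpu is made up for this example, and it assumes PyTorch is the library you intend to use. It reports a missing PyTorch installation instead of failing.

```shell
# Report which GPUs, if any, PyTorch can see from the current python3.
check_torch_gpu() {
    python3 - <<'EOF'
try:
    import torch
except ImportError:
    print("PyTorch is not installed in this Python environment")
else:
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            print(f"PyTorch sees GPU {i}: {torch.cuda.get_device_name(i)}")
    else:
        print("PyTorch sees no GPU")
EOF
}

# Usage, from within a job or session running on a GPU node:
#   module load python3 cuda
#   check_torch_gpu
```

If no GPU is reported, check that you are on a GPU node and that the cuda module is loaded.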
If you need to customise your Python environment, you may create a virtual environment based on the installations provided. This may be useful if you need to use a newer version of a specific Python package, but still want to benefit from the rest of the managed Python environment:

    module load python3/new
    mkdir -p $PERM/venvs
    cd $PERM/venvs
    python3 -m venv --system-site-packages myvenv

Then you can activate it when you need it with:

    module load cuda
    source $PERM/venvs/myvenv/bin/activate

And then install any packages you need.
You can also create a completely standalone environment from scratch, by removing the --system-site-packages option above.
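Before installing packages, it can be worth confirming that python3 really runs from the virtual environment. The sketch below relies only on the standard sys.prefix and sys.base_prefix attributes, which differ when a venv is active. The function name venv_status and the package name in the usage comments are placeholders for this example.

```shell
# Print the active virtual environment, or say that none is active.
venv_status() {
    python3 - <<'EOF'
import sys
inside = sys.prefix != sys.base_prefix
print("virtual environment:", sys.prefix if inside else "not active")
EOF
}

# Usage, after activating the environment:
#   source $PERM/venvs/myvenv/bin/activate
#   venv_status
#   python3 -m pip install --upgrade some_package   # placeholder package name
#   python3 -m pip list --local                     # packages installed in the venv itself
```

With --system-site-packages, `pip list --local` leaves out the packages inherited from the managed installation, so it shows only what you added yourself.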
You may create a containerised conda environment with all the AI/ML tools you need. Follow the instructions on HPC2020: Containerised software installations with Tykky.
You may use the following commands to monitor the usage of the GPUs you have access to. If you want to do it interactively, you may open a new shell on the node running your job and run the corresponding monitoring tool. You can get the name of the node running your job with squeue.
If running an ecinteractive job, just call ecinteractive from another terminal to get a shell on the relevant node.
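For batch jobs, the node name can be read from the squeue output. A small sketch using standard squeue options follows; the function name my_job_nodes is made up for this example, and the output format is just one possible choice.

```shell
# List job ID, job name and allocated node(s) for your running jobs.
my_job_nodes() {
    squeue --user="${USER:-$(id -un)}" --states=RUNNING --noheader \
           --format="%i %j %N"
}

# Usage:
#   my_job_nodes
```

Each output line contains the job ID, the job name and the node list (`%N`) of one running job.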
nvidia-smi provides monitoring and management capabilities for the GPUs from the command line and will give you instantaneous information about your GPUs.

    $ nvidia-smi
    Wed Mar 8 14:39:45 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
    | N/A   62C    P0   351W / 400W |  39963MiB / 40960MiB |     93%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A    181525      C   python                          39960MiB |
    +-----------------------------------------------------------------------------+

This command has a number of advanced options. If you want to log the usage of the GPUs by your processes in a batch job, you could use the following strategy:

    # Start logging per-process GPU usage in the background
    nvidia-smi pmon -o DT -d 5 --filename gpu_usage.log &
    monitor_pid=$!

    # Run your GPU workload here

    # Stop the background monitor
    kill $monitor_pid

In this example, nvidia-smi logs the processes using the GPU and their resource usage into gpu_usage.log every 5 seconds, adding the date and time to each line for better tracking.
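The same strategy can be wrapped in a small helper so that the monitor is also stopped when the workload fails and the workload's exit status is preserved. This is a sketch under those assumptions, not a prescribed method. The function name run_with_gpu_log and the workload in the usage comment are hypothetical.

```shell
# Run a command while nvidia-smi logs per-process GPU usage to a file.
# Usage: run_with_gpu_log LOGFILE COMMAND [ARGS...]
run_with_gpu_log() {
    local logfile=$1
    shift

    # Start the monitor in the background, sampling every 5 seconds.
    nvidia-smi pmon -o DT -d 5 --filename "$logfile" &
    local monitor_pid=$!

    # Run the workload and remember its exit status, even on failure.
    local rc=0
    "$@" || rc=$?

    # Stop the monitor and reap it; ignore errors if it has already exited.
    kill "$monitor_pid" 2>/dev/null || true
    wait "$monitor_pid" 2>/dev/null || true

    return "$rc"
}

# Usage inside a batch job (hypothetical workload):
#   run_with_gpu_log gpu_usage.log python3 train.py
```

Returning the workload's own exit status means SLURM still reports the job as failed when the workload fails.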
See man nvidia-smi for more information.
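As one example of those options, nvidia-smi can also report per-GPU figures in CSV form, which is easier to post-process than the default table. A minimal sketch; the function name gpu_snapshot and the selection of fields are choices made for this example.

```shell
# Print one CSV line per visible GPU: timestamp, index, utilisation and memory use.
gpu_snapshot() {
    nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,memory.total \
               --format=csv,noheader
}

# Usage:
#   gpu_snapshot
```

Adding `-l 5` to the nvidia-smi command repeats the query every 5 seconds, and `nvidia-smi --help-query-gpu` lists the fields that can be requested.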
Nvtop stands for Neat Videocard TOP, an (h)top-like task monitor for GPUs. It can handle multiple GPUs and print information about them in a way familiar to htop users. It is useful if you want to monitor GPU usage interactively and watch it evolve live. See man nvtop for all the options.