You may create your own environment to run Anemoi on the AG cluster.
Creating an environment
Some notes before you create your venv:
- Make sure to module load python3/3.12 and to create the venv with python3, not python. Otherwise you will end up with a venv built against the default Python 3.9.21, which is too old.
- Environments you've made on AC will not work on AG, because the system architecture is different (x86 on AC versus aarch64 on AG).
- It is recommended to create your environments on the PERM filesystem, because PyTorch environments contain lots of small files, which can very quickly exhaust your file (inode) quota on other filesystems.
- When installing PyTorch on an aarch64 system like AG, you must add '--index-url https://download.pytorch.org/whl/cuXXX' (where XXX is a CUDA version, e.g. 128). If this flag is omitted, a CUDA-enabled PyTorch will not be installed.
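The cuXXX suffix in that index URL is just the CUDA version with the dot removed. A tiny illustrative sketch of the mapping (the helper function is made up, not part of pip or Anemoi):

```python
# Hypothetical helper: build the PyTorch wheel index URL for a given CUDA version.
def torch_index_url(cuda_version: str) -> str:
    # "12.8" -> "cu128" -> "https://download.pytorch.org/whl/cu128"
    tag = "cu" + cuda_version.replace(".", "")
    return f"https://download.pytorch.org/whl/{tag}"

print(torch_index_url("12.8"))  # https://download.pytorch.org/whl/cu128
```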
# create and enter your venv
module load python3/3.12.9-01
python3 -m venv $PERM/venvs/ag-anemoi-env
source $PERM/venvs/ag-anemoi-env/bin/activate

# install dependencies
pip install torch==2.7.0 torchvision triton --index-url https://download.pytorch.org/whl/cu128
pip install pytorch-lightning==2.5.4  # a bug in lightning 2.5.5 causes seg faults when running multi-node, so pin to 2.5.4 for now

# install anemoi
git clone git@github.com:ecmwf/anemoi-core.git
pip install -e anemoi-core/graphs
pip install -e anemoi-core/models
pip install -e anemoi-core/training
# or to install fixed versions:
# pip install anemoi-graphs==0.7.1 anemoi-models==0.9.7 anemoi-training==0.6.7
Optional dependencies
Below are some optional dependencies. Ordinarily these libraries must be compiled from source, which is time-consuming and error-prone. On Atos, pre-built wheels are available at the paths shown below.
# Optional dependencies

# torch-cluster - speeds up graph creation
pip install /perm/naco/wheelhouse/aarch64/torch-cluster/torch_cluster-1.6.3-cp312-cp312-linux_aarch64.whl

# flash-attention - optimised attention which supports sliding window; enables the transformer processor
/perm/naco/scripts/get-flash-attn -v 2.7.4.post1

# If you are using these optional dependencies, you must add this to your slurm script,
# because these libraries were built against this compiler.
# Otherwise you will get runtime errors like "torch-cluster not found".
export LD_LIBRARY_PATH=/usr/local/apps/gcc/15.1.0/lib64/:$LD_LIBRARY_PATH
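To confirm the optional extensions are importable from your venv before submitting a job, a small check like the following can help (an illustrative sketch; optional_deps_status is a made-up helper, not an Anemoi function):

```python
import importlib.util

def optional_deps_status(mods=("torch_cluster", "flash_attn")):
    # Map each module name to whether the current interpreter can import it.
    return {m: importlib.util.find_spec(m) is not None for m in mods}

print(optional_deps_status())
```

Note that this only checks importability; the "torch-cluster not found" runtime errors mentioned above can still occur at load time if LD_LIBRARY_PATH is not set in the job script.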
Known issues
numcodecs build error
Numcodecs is a dependency of anemoi-datasets, and is used to decompress our zarr input data. On aarch64 systems it has to be built from source, and the build sometimes fails on AG.
In this case, you can install a prebuilt numcodecs wheel:
pip install /perm/naco/wheelhouse/aarch64/numcodecs/0.15.1/numcodecs-0.15.1-cp312-cp312-linux_aarch64.whl
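Note the cp312 and linux_aarch64 tags in the wheel filename: pip will only install a wheel whose tags match your interpreter and platform, which is another reason the venv must be built with Python 3.12 on AG. An illustrative sketch of that matching (the helper is hypothetical, not how pip is implemented):

```python
# Hypothetical sketch: wheel filenames end in "-<python tag>-<abi tag>-<platform tag>.whl".
def wheel_matches(filename: str, py_tag: str, plat_tag: str) -> bool:
    parts = filename[:-len(".whl")].split("-")
    return parts[-3] == py_tag and parts[-1] == plat_tag

print(wheel_matches(
    "numcodecs-0.15.1-cp312-cp312-linux_aarch64.whl",
    "cp312", "linux_aarch64"))  # True
```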
Running on AG
Below is an example slurm script for AG.
#!/bin/bash --login
#SBATCH --job-name=anemoi
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=72
#SBATCH --gres=gpu:4
#SBATCH --hint=nomultithread
#SBATCH --time=48:00:00   # max job length
#SBATCH -o %x-%j.out

# activate your env
source $PERM/venvs/ag-anemoi-env/bin/activate

srun anemoi-training train --config-name=default
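With these settings, srun launches one task per GPU, so the training world size is nodes times tasks-per-node. A quick arithmetic sketch:

```python
# World size implied by the SLURM header above: one task (rank) per GPU.
nodes = 1
ntasks_per_node = 4      # matches --ntasks-per-node and --gres=gpu:4
world_size = nodes * ntasks_per_node
print(world_size)  # 4 ranks in total, one per GPU
```

To scale out, increase --nodes and keep --ntasks-per-node equal to the number of GPUs per node.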
Known issues
Network time-out / seg-fault when running over multiple GPUs
This could be related to your PyTorch Lightning version. There is a bug in pytorch-lightning 2.5.5 which causes processes to crash when running across multiple GPUs via data parallelism.
You can check your pytorch lightning version with:
pip show pytorch-lightning
If it is v2.5.5, you can change to a safe version with:
pip install --upgrade pytorch-lightning==2.5.4
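If you want to guard against the buggy release from within a script, a small check on the installed version string works (an illustrative helper, not part of Anemoi or Lightning):

```python
from importlib.metadata import version, PackageNotFoundError

def lightning_version_ok(v: str) -> bool:
    # 2.5.5 has a known bug causing crashes in multi-GPU runs; 2.5.4 is safe.
    return v != "2.5.5"

try:
    v = version("pytorch-lightning")
    print(v, "OK" if lightning_version_ok(v) else "known-bad, pin 2.5.4")
except PackageNotFoundError:
    print("pytorch-lightning not installed")
```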