You may create your own environment to run Anemoi on the AG cluster. 

Creating an environment

Run the following to create and activate your venv, then install the dependencies:

#create and enter your venv
module load python3/3.12.9-01
python3 -m venv $PERM/venvs/ag-anemoi-env
source $PERM/venvs/ag-anemoi-env/bin/activate

#install dependencies
pip install torch==2.7.0 torchvision triton --index-url https://download.pytorch.org/whl/cu128
pip install pytorch-lightning==2.5.4 # lightning 2.5.5 has a bug that causes seg faults when running multi-node, so pin 2.5.4 for now

#Install anemoi
git clone git@github.com:ecmwf/anemoi-core.git
pip install -e anemoi-core/graphs 
pip install -e anemoi-core/models
pip install -e anemoi-core/training  
# or, to install pinned releases instead of the editable checkouts:
#   pip install anemoi-graphs==0.7.1 anemoi-models==0.9.7 anemoi-training==0.6.7
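
After installing, a quick import check confirms the environment is usable. This is a sketch, assuming the venv created above is still active; each failing import is reported rather than aborting the shell:

```shell
# Sanity check: try importing each package and report OK or MISSING
for mod in torch pytorch_lightning anemoi.graphs anemoi.models anemoi.training; do
    python3 -c "import $mod" 2>/dev/null \
        && echo "$mod OK" \
        || echo "$mod MISSING"
done
```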

Optional dependencies

Below are some optional dependencies. Ordinarily these libraries must be compiled from source, which is time-consuming and error-prone. On Atos, pre-built wheels are available under /perm/naco/wheelhouse.

 
#Optional dependencies
# torch-cluster - speed up graph creation
pip install /perm/naco/wheelhouse/aarch64/torch-cluster/torch_cluster-1.6.3-cp312-cp312-linux_aarch64.whl
# flash-attention - optimised attention with sliding-window support; enables the transformer processor
/perm/naco/scripts/get-flash-attn -v 2.7.4.post1

# If you use these optional dependencies, you must add the following to your slurm script,
# because these libraries were built against this compiler.
# Otherwise you will get runtime errors such as "torch-cluster not found".
export LD_LIBRARY_PATH=/usr/local/apps/gcc/15.1.0/lib64/:$LD_LIBRARY_PATH 
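
A quick way to confirm the export took effect is to print the first entry of the search path, which should be the GCC runtime directory:

```shell
export LD_LIBRARY_PATH=/usr/local/apps/gcc/15.1.0/lib64/:$LD_LIBRARY_PATH
# the GCC runtime directory should now be first in the search path
echo "$LD_LIBRARY_PATH" | cut -d: -f1
```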


Known issues

numcodecs build error

Numcodecs is a dependency of anemoi-datasets, which is used to decompress our zarr input data. On aarch64 systems it has to be built from source, and this build sometimes fails on AG.

In this case, you can install a prebuilt numcodecs wheel:

pip install /perm/naco/wheelhouse/aarch64/numcodecs/0.15.1/numcodecs-0.15.1-cp312-cp312-linux_aarch64.whl
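
To verify the workaround, a small check (a sketch) reports whether numcodecs now imports, without aborting on failure:

```shell
python3 - <<'PY'
# report whether numcodecs is importable and, if so, its version
try:
    import numcodecs
    print("numcodecs", numcodecs.__version__)
except ImportError:
    print("numcodecs not installed")
PY
```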

Running on AG

Below is an example slurm script for AG.

#!/bin/bash --login
#SBATCH --job-name=anemoi
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=72
#SBATCH --gres=gpu:4
#SBATCH --hint=nomultithread
#SBATCH --time=48:00:00 #max job length
#SBATCH -o %x-%j.out

#activate your env
source $PERM/venvs/ag-anemoi-env/bin/activate

srun anemoi-training train --config-name=default

Known issues

Network time-out / seg-fault when running over multiple GPUs

This could be related to your pytorch-lightning version. There is a bug in pytorch-lightning 2.5.5 that causes processes to crash when training across multiple GPUs with data parallelism.

You can check your pytorch lightning version with:

pip show pytorch-lightning

If it is v2.5.5, you can change to a safe version with:

pip install --upgrade pytorch-lightning==2.5.4
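
If you want this guarded automatically, for example at the top of a job script, here is a hedged sketch that downgrades only when the known-bad version is installed:

```shell
# downgrade only if the buggy lightning version is present; otherwise do nothing
if pip show pytorch-lightning 2>/dev/null | grep -q '^Version: 2.5.5$'; then
    pip install --upgrade pytorch-lightning==2.5.4
fi
```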