You may create your own environment to run Anemoi on the AG cluster.
Some notes before you create your venv are included as comments in the commands below:
```bash
# create and enter your venv
module load python3/3.12.11
python3 -m venv $PERM/venvs/ag-anemoi-env
source $PERM/venvs/ag-anemoi-env/bin/activate

# install dependencies
# for newer torch versions (2.8 and above) change 'cu128' to 'cu129'
pip install torch==2.7.0 torchvision triton --index-url https://download.pytorch.org/whl/cu128
# a bug in lightning 2.5.5 causes seg faults when running multi-node, so pin to 2.5.4 for now
pip install pytorch-lightning==2.5.4

# install anemoi
git clone git@github.com:ecmwf/anemoi-core.git
pip install -e anemoi-core/graphs
pip install -e anemoi-core/models
pip install -e anemoi-core/training
# or, to install fixed versions:
# pip install anemoi-graphs==0.7.1 anemoi-models==0.9.7 anemoi-training==0.6.7
```
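Once the installs finish, it is worth a quick sanity check before submitting any jobs. A minimal sketch, assuming the venv is still active and you are on a node with GPUs:

```bash
# confirm the torch build and that CUDA devices are visible (expect something like "2.7.0+cu128 True")
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# confirm the editable anemoi installs import cleanly
python -c "import anemoi.graphs, anemoi.models, anemoi.training; print('anemoi OK')"
```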
Below are some optional dependencies. Ordinarily these libraries must be compiled from source, which is time-consuming and error-prone, but pre-built wheels for Atos are available here.
```bash
# OPTIONAL dependencies

# torch-cluster - speeds up graph creation
# '--no-build-isolation' uses the existing torch in your venv to build torch-cluster; without this flag the build will fail
pip install --no-build-isolation torch-cluster

# flash-attention - optimised attention which supports sliding windows, only needed if running the transformer processor
git clone git@github.com:cathalobrien/get-flash-attn.git
./get-flash-attn/get-flash-attn -v 2.7.4.post1

# If you are using these optional dependencies, you must add the line below to your slurm script,
# because these libraries were built against this compiler.
# Otherwise you will get runtime errors like "torch-cluster not found".
export LD_LIBRARY_PATH=/usr/local/apps/gcc/15.1.0/lib64/:$LD_LIBRARY_PATH
```
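Since problems with these wheels only surface at runtime, you can check the imports up front. A minimal sketch, assuming the venv is active and the LD_LIBRARY_PATH line above has been exported in your current shell:

```bash
export LD_LIBRARY_PATH=/usr/local/apps/gcc/15.1.0/lib64/:$LD_LIBRARY_PATH
python -c "import torch_cluster; print('torch-cluster OK')"   # module name is torch_cluster
python -c "import flash_attn; print('flash-attn OK')"         # module name is flash_attn
```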
Below is an example Slurm script for AG.
```bash
#!/bin/bash --login
#SBATCH --job-name=anemoi
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=72
#SBATCH --gres=gpu:4
#SBATCH --hint=nomultithread
#SBATCH --time=48:00:00 # max job length
#SBATCH -o %x-%j.out

# activate your env
source $PERM/venvs/ag-anemoi-env/bin/activate

srun anemoi-training train --config-name=default
```
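To submit and monitor the job (a sketch; the script filename is hypothetical, and the log filename follows from the `-o %x-%j.out` pattern above):

```bash
sbatch anemoi-train.sh        # hypothetical name for the script above
squeue -u $USER               # check the job's state in the queue
tail -f anemoi-<jobid>.out    # follow training output; %x = job name, %j = job id
```

For multi-node runs, increase --nodes in the script; remember the PyTorch Lightning 2.5.4 pin from the install step, since 2.5.5 causes seg faults when running multi-node.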
Numcodecs is a dependency of anemoi-datasets and is used to decompress our Zarr input data. On aarch64 systems, older versions have to be built from source, and this build sometimes fails on AG. If the build fails, install a newer version of numcodecs or a pre-built numcodecs wheel.
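For example, either of the following should work (a sketch; the wheel directory path is hypothetical):

```bash
# newer numcodecs releases ship aarch64 wheels, avoiding the source build entirely
pip install --upgrade numcodecs
# or install from a directory of pre-built wheels
pip install numcodecs --find-links /path/to/prebuilt/wheels
```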
If your run crashes when training across multiple GPUs, this could be related to your PyTorch Lightning version. There is a bug in PyTorch Lightning 2.5.5 which causes processes to crash when running across multiple GPUs via data parallelism.
You can check your PyTorch Lightning version with:
```bash
pip show pytorch-lightning
```
If it is v2.5.5, you can switch to a known-good version with:
```bash
pip install --upgrade pytorch-lightning==2.5.4
```
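To keep a later pip install (for example, re-installing anemoi-training) from pulling 2.5.5 back in, you can pass a constraints file; a sketch:

```bash
echo "pytorch-lightning==2.5.4" > constraints.txt
pip install -e anemoi-core/training -c constraints.txt
```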