You may create your own environment to run Anemoi on the AG cluster.
Creating an environment
Some notes before you create your venv:
- Make sure to module load python3/3.12 and to create the venv with python3, not python. Otherwise you will end up with a venv built against the default Python 3.9.21, which is too old.
- Environments you've made on AC will not work on AG, because the system architecture is different (x86 on AC versus aarch64 on AG).
- It is recommended to create your environments on the PERM filesystem, because PyTorch environments contain lots of small files, which can very quickly exhaust your file (inode) quota on other filesystems.
- When installing PyTorch on an aarch64 system like AG, you must add '--index-url https://download.pytorch.org/whl/cuXXX' (where XXX is a CUDA version, e.g. 128). If this flag is omitted, a CUDA-enabled PyTorch will not be installed.
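The cuXXX suffix in that index URL is just the CUDA version with the dot removed. A tiny illustrative sketch of the mapping (the helper function is made up, not part of pip or Anemoi):

```python
# Hypothetical helper: build the PyTorch wheel index URL for a given CUDA version.
def torch_index_url(cuda_version: str) -> str:
    # "12.8" -> "cu128" -> "https://download.pytorch.org/whl/cu128"
    tag = "cu" + cuda_version.replace(".", "")
    return f"https://download.pytorch.org/whl/{tag}"

print(torch_index_url("12.8"))  # https://download.pytorch.org/whl/cu128
```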
# create and enter your venv
module load python3/3.12.9-01
python3 -m venv $PERM/venvs/ag-anemoi-env
source $PERM/venvs/ag-anemoi-env/bin/activate

# install dependencies
pip install torch==2.7.0 torchvision triton --index-url https://download.pytorch.org/whl/cu128
pip install pytorch-lightning==2.5.4  # a bug in lightning 2.5.5 causes seg faults when running multi-node, so pin to 2.5.4 for now

# install anemoi
git clone git@github.com:ecmwf/anemoi-core.git
pip install -e anemoi-core/graphs
pip install -e anemoi-core/models
pip install -e anemoi-core/training
# or to install fixed versions:
# pip install anemoi-graphs==0.7.1 anemoi-models==0.9.7 anemoi-training==0.6.7
Optional dependencies
Below are some optional dependencies. Ordinarily these libraries must be compiled from source, which is time-consuming and error-prone. On Atos, pre-built wheels are available at the paths shown below.
# Optional dependencies

# torch-cluster - speeds up graph creation
pip install /perm/naco/wheelhouse/aarch64/torch-cluster/torch_cluster-1.6.3-cp312-cp312-linux_aarch64.whl

# flash-attention - optimised attention which supports sliding window; enables the transformer processor
/perm/naco/scripts/get-flash-attn -v 2.7.4.post1

# If you are using these optional dependencies, you must add this to your slurm script,
# because these libraries were built against this compiler.
# Otherwise you will get runtime errors like "torch-cluster not found".
export LD_LIBRARY_PATH=/usr/local/apps/gcc/15.1.0/lib64/:$LD_LIBRARY_PATH
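To confirm the optional extensions are importable from your venv before submitting a job, a small check like the following can help (an illustrative sketch; optional_deps_status is a made-up helper, not an Anemoi function):

```python
import importlib.util

def optional_deps_status(mods=("torch_cluster", "flash_attn")):
    # Map each module name to whether the current interpreter can import it.
    return {m: importlib.util.find_spec(m) is not None for m in mods}

print(optional_deps_status())
```

Note that this only checks importability; the "torch-cluster not found" runtime errors mentioned above can still occur at load time if LD_LIBRARY_PATH is not set in the job script.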
Known issues
numcodecs build error
Numcodecs is a dependency of anemoi-datasets, and is used to decompress our zarr input data. On aarch64 systems it has to be built from source, and the build sometimes fails on AG.
In this case, you can install a prebuilt numcodecs wheel:
pip install /perm/naco/wheelhouse/aarch64/numcodecs/0.15.1/numcodecs-0.15.1-cp312-cp312-linux_aarch64.whl
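Note the cp312 and linux_aarch64 tags in the wheel filename: pip will only install a wheel whose tags match your interpreter and platform, which is another reason the venv must be built with Python 3.12 on AG. An illustrative sketch of that matching (the helper is hypothetical, not how pip is implemented):

```python
# Hypothetical sketch: wheel filenames end in "-<python tag>-<abi tag>-<platform tag>.whl".
def wheel_matches(filename: str, py_tag: str, plat_tag: str) -> bool:
    parts = filename[:-len(".whl")].split("-")
    return parts[-3] == py_tag and parts[-1] == plat_tag

print(wheel_matches(
    "numcodecs-0.15.1-cp312-cp312-linux_aarch64.whl",
    "cp312", "linux_aarch64"))  # True
```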
Running on AG
Below is an example slurm script for AG.
#!/bin/bash --login
#SBATCH --job-name=anemoi
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=72
#SBATCH --gres=gpu:4
#SBATCH --hint=nomultithread
#SBATCH --time=48:00:00   # max job length
#SBATCH -o %x-%j.out

# activate your env
source $PERM/venvs/ag-anemoi-env/bin/activate

srun anemoi-training train --config-name=default
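With these settings, srun launches one task per GPU, so the training world size is nodes times tasks-per-node. A quick arithmetic sketch:

```python
# World size implied by the SLURM header above: one task (rank) per GPU.
nodes = 1
ntasks_per_node = 4      # matches --ntasks-per-node and --gres=gpu:4
world_size = nodes * ntasks_per_node
print(world_size)  # 4 ranks in total, one per GPU
```

To scale out, increase --nodes and keep --ntasks-per-node equal to the number of GPUs per node.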
Known issues
Network time-out / seg-fault when running over multiple GPUs
This could be related to your PyTorch Lightning version. There is a bug in pytorch-lightning 2.5.5 which causes processes to crash when running across multiple GPUs via data parallelism.
You can check your pytorch lightning version with:
pip show pytorch-lightning
If it is v2.5.5, you can change to a safe version with:
pip install --upgrade pytorch-lightning==2.5.4
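If you want to guard against the buggy release from within a script, a small check on the installed version string works (an illustrative helper, not part of Anemoi or Lightning):

```python
from importlib.metadata import version, PackageNotFoundError

def lightning_version_ok(v: str) -> bool:
    # 2.5.5 has a known bug causing crashes in multi-GPU runs; 2.5.4 is safe.
    return v != "2.5.5"

try:
    v = version("pytorch-lightning")
    print(v, "OK" if lightning_version_ok(v) else "known-bad, pin 2.5.4")
except PackageNotFoundError:
    print("pytorch-lightning not installed")
```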