...
| Code Block |
|---|
|
#create and enter your venv
module load python3/3.12.9-0111
python3 -m venv $PERM/venvs/ag-anemoi-env
source $PERM/venvs/ag-anemoi-env/bin/activate
#install dependencies
#for torch 2.8 and newer, change the index URL to 'cu129'
pip install torch==2.7.0 torchvision triton --index-url https://download.pytorch.org/whl/cu128
pip install pytorch-lightning==2.5.4 #a bug in lightning 2.5.5 causes seg-faults in multi-node runs, so pin to 2.5.4 for now
#Install anemoi
git clone git@github.com:ecmwf/anemoi-core.git
pip install -e anemoi-core/graphs
pip install -e anemoi-core/models
pip install -e anemoi-core/training
# or to install fixed versions
# pip install anemoi-graphs==0.7.1 anemoi-models==0.9.7 anemoi-training==0.6.7
|
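Once the environment is activated, a quick sanity check (a sketch, not part of the official guide) confirms that the venv's interpreter is the one actually in use:

```shell
# the active interpreter should live inside the venv you just created;
# with the venv active, this prints the venv directory rather than the system prefix
python3 -c "import sys; print(sys.prefix)"
```

With the venv active the printed prefix should be `$PERM/venvs/ag-anemoi-env`; you can also run `python3 -c "import torch; print(torch.cuda.is_available())"` on a GPU node to confirm the CUDA build of torch loads.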
Optional dependencies
Below are some optional dependencies. Ordinarily these libraries must be compiled from source, which is time-consuming and error-prone. On Atos, pre-built wheels are available under /perm/naco/wheelhouse.
| Code Block |
|---|
|
#Optional dependencies
# torch-cluster - speeds up graph creation
pip install /perm/naco/wheelhouse/aarch64/torch-cluster/torch_cluster-1.6.3-cp312-cp312-linux_aarch64.whl
# or build from source; '--no-build-isolation' builds against the torch already in your venv, without this flag the build will fail
# pip install --no-build-isolation torch-cluster
# flash-attention - optimised attention which supports sliding windows, only needed if running the transformer processor
git clone git@github.com:cathalobrien/get-flash-attn.git
./get-flash-attn/get-flash-attn -v 2.7.4.post1
# If you are using these optional dependencies, you must add the line below to your slurm script, because these libraries were built against this compiler
# Otherwise you will get runtime errors like "torch-cluster not found".
export LD_LIBRARY_PATH=/usr/local/apps/gcc/15.1.0/lib64/:$LD_LIBRARY_PATH |
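To confirm the export took effect before submitting a job, you can inspect the search order of the library path (a minimal sketch, using the GCC path from the export above):

```shell
# prepend the GCC 15.1.0 runtime and confirm it is searched first
export LD_LIBRARY_PATH=/usr/local/apps/gcc/15.1.0/lib64/:$LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | head -n 1
# prints: /usr/local/apps/gcc/15.1.0/lib64/
```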
Known issues
numcodecs build error
Numcodecs is a dependency of anemoi-datasets, used to decompress our Zarr input data. On aarch64 systems, older versions must be built from source, and this build sometimes fails on AG.
In that case, install a newer version of numcodecs or a prebuilt wheel:
| Code Block |
|---|
|
pip install /perm/naco/wheelhouse/aarch64/numcodecs/0.15.1/numcodecs-0.15.1-cp312-cp312-linux_aarch64.whl |
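The wheel filename encodes the Python and platform tags (`cp312`, `linux_aarch64`). If pip rejects the wheel with "not a supported wheel on this platform", you can check what tags your interpreter expects (a hedged sketch, not from the original guide):

```shell
# print the platform and Python version that pip matches wheel tags against;
# these must agree with the tags in the wheel filename
python3 -c "import sysconfig; print(sysconfig.get_platform()); print(sysconfig.get_python_version())"
```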
Running on AG
Below is an example slurm script for AG.
| Code Block |
|---|
| language | bash |
|---|
| title | train.slurm.sh |
|---|
|
#!/bin/bash --login
#SBATCH --job-name=anemoi
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=72
#SBATCH --gres=gpu:4
#SBATCH --hint=nomultithread
#SBATCH --time=48:00:00 #max job length
#SBATCH -o %x-%j.out
#activate your env
source $PERM/venvs/ag-anemoi-env/bin/activate
srun anemoi-training train --config-name=default |
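The script above requests one srun task per GPU (4 tasks, `gpu:4`), which is the usual one-rank-per-GPU layout for anemoi-training, with 72 CPUs for each rank's data loading. A quick arithmetic check of the per-node request (values taken from the script above):

```shell
# one rank per GPU: 4 tasks x 72 CPUs each = 288 CPUs requested per node
NTASKS_PER_NODE=4
CPUS_PER_TASK=72
echo "CPUs per node: $((NTASKS_PER_NODE * CPUS_PER_TASK))"
# prints: CPUs per node: 288
```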
Known issues
Network time-out / seg-fault when running over multiple GPUs
...