...

Code Block
languagebash
#create and enter your venv
module load python3/3.12.9-0111
python3 -m venv $PERM/venvs/ag-anemoi-env
source $PERM/venvs/ag-anemoi-env/bin/activate
 
#install dependencies
#for torch versions 2.8 and above, change the index URL to 'cu129'
pip install torch==2.7.0 torchvision triton --index-url https://download.pytorch.org/whl/cu128
pip install pytorch-lightning==2.5.4 # a bug in lightning 2.5.5 causes seg faults when running multi-node, so pin to 2.5.4 for now
 
#Install anemoi
git clone git@github.com:ecmwf/anemoi-core.git
pip install -e anemoi-core/graphs 
pip install -e anemoi-core/models
pip install -e anemoi-core/training  
# or to install fixed versions
#  pip install anemoi-graphs==0.7.1  anemoi-models==0.9.7 anemoi-training==0.6.7    
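Once the install completes, it is worth a quick sanity check from inside the venv. The snippet below is a minimal check, assuming you run it on a node with visible GPUs; it only uses the packages installed above.

Code Block
languagebash
# quick sanity check of the install (run on a GPU node)
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python3 -c "import anemoi.graphs, anemoi.models, anemoi.training; print('anemoi OK')"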

Optional dependencies

Below are some optional dependencies. Ordinarily these libraries must be compiled from source, which is time-consuming and error-prone. On Atos, pre-built wheels are available and are used below.

Code Block
languagebash
#Optional dependencies
# torch-cluster - speed up graph creation
# '--no-build-isolation' uses existing torch in venv to build torch cluster, without this flag building will fail
pip install --no-build-isolation /perm/naco/wheelhouse/aarch64/torch-cluster/torch_cluster-1.6.3-cp312-cp312-linux_aarch64.whl
# flash-attention - optimised attention which supports sliding window, only needed if running the transformer processor
git clone git@github.com:cathalobrien/get-flash-attn.git
./get-flash-attn/get-flash-attn -v 2.7.4.post1

# If you are using these optional dependencies, you must add the following to your slurm script, because these libraries were built against this compiler
# Otherwise you will get runtime errors like "torch-cluster not found".
export LD_LIBRARY_PATH=/usr/local/apps/gcc/15.1.0/lib64/:$LD_LIBRARY_PATH 
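To confirm the optional packages built correctly, a simple import check from inside the venv (with LD_LIBRARY_PATH set as above) is usually enough; torch_cluster and flash_attn are the standard import names for these libraries.

Code Block
languagebash
# check the optional dependencies are importable
export LD_LIBRARY_PATH=/usr/local/apps/gcc/15.1.0/lib64/:$LD_LIBRARY_PATH
python3 -c "import torch_cluster; print('torch-cluster OK')"
python3 -c "import flash_attn; print('flash-attn OK')"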


Running on AG

Below is an example slurm script for AG.

Code Block
languagebash
titletrain.slurm.sh
#!/bin/bash --login
#SBATCH --job-name=anemoi
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=72
#SBATCH --gres=gpu:4
#SBATCH --hint=nomultithread
#SBATCH --time=48:00:00 #max job length
#SBATCH -o %x-%j.out

#activate your env
source $PERM/venvs/ag-anemoi-env/bin/activate

srun anemoi-training train --config-name=default
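The script can then be submitted with sbatch as usual; the file name below assumes you saved the script under the title shown above (train.slurm.sh).

Code Block
languagebash
# submit the training job and check its status
sbatch train.slurm.sh
squeue -u $USER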

Known issues

numcodecs build error

Numcodecs is a dependency of anemoi-datasets and is used to decompress our Zarr input data. For older versions of numcodecs on aarch64 systems it must be built from source, and the build sometimes fails on AG. If this happens, install a newer version of numcodecs or a prebuilt numcodecs wheel.
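For example, either upgrade numcodecs or install the prebuilt wheel from the Atos wheelhouse used for the optional dependencies above:

Code Block
languagebash
# option 1: upgrade to a newer numcodecs, which ships aarch64 wheels
pip install -U numcodecs
# option 2: install the prebuilt wheel from the wheelhouse
pip install /perm/naco/wheelhouse/aarch64/numcodecs/0.15.1/numcodecs-0.15.1-cp312-cp312-linux_aarch64.whl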

Network time-out / seg-fault when running over multiple GPUs

...