You may create your own environment to run Anemoi on the AG cluster.
Some notes before you create your venv:
#create and enter your venv
module load python3/3.12.9-01
python3 -m venv $PERM/venvs/ag-anemoi-env
source $PERM/venvs/ag-anemoi-env/bin/activate

#install dependencies
pip install torch==2.7.0 torchvision triton --index-url https://download.pytorch.org/whl/cu128
pip install pytorch-lightning==2.5.4 #a bug in lightning 2.5.5 causes seg faults when running multi-node, so pin to 2.5.4 for now

#Install anemoi
git clone git@github.com:ecmwf/anemoi-core.git
pip install -e anemoi-core/graphs
pip install -e anemoi-core/models
pip install -e anemoi-core/training
# or to install fixed versions:
# pip install anemoi-graphs==0.7.1 anemoi-models==0.9.7 anemoi-training==0.6.7

#Optional dependencies
# torch-cluster - speeds up graph creation
pip install /perm/naco/wheelhouse/aarch64/torch-cluster/torch_cluster-1.6.3-cp312-cp312-linux_aarch64.whl
# flash-attention - optimised attention which supports sliding window; enables the transformer processor
/perm/naco/scripts/get-flash-attn -v 2.7.4.post1

# If you use these optional dependencies, you must add the following line to your slurm script, because these libraries were built against this compiler.
# Otherwise you will get runtime errors like "torch-cluster not found".
export LD_LIBRARY_PATH=/usr/local/apps/gcc/15.1.0/lib64/:$LD_LIBRARY_PATH
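After installing, a quick sanity check (not part of the recipe above, just a suggestion) is to confirm that the CUDA build of PyTorch was picked up. Run this from a node with a GPU allocated:

#activate the venv created above, then query torch from python
source $PERM/venvs/ag-anemoi-env/bin/activate
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"

Note that on a login node without GPUs, cuda.is_available() will report False even if the install is correct.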
Numcodecs is a dependency of anemoi-datasets, which is used to decompress our zarr input data. On aarch64 systems it has to be built from source, and the build sometimes fails on AG.
In this case, you can install a prebuilt numcodecs wheel:
pip install /perm/naco/wheelhouse/aarch64/numcodecs/0.15.1/numcodecs-0.15.1-cp312-cp312-linux_aarch64.whl
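To confirm that the prebuilt wheel was installed rather than a source build being attempted, a minimal check is to import it and print the version:

python -c "import numcodecs; print(numcodecs.__version__)"   # should print 0.15.1 for the wheel above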
Below is an example slurm script for AG.
#!/bin/bash --login
#SBATCH --job-name=anemoi
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=72
#SBATCH --gres=gpu:4
#SBATCH --hint=nomultithread
#SBATCH --time=48:00:00 #max job length
#SBATCH -o %x-%j.out

#activate your env
source $PERM/venvs/ag-anemoi-env/bin/activate

srun anemoi-training train --config-name=default
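Save the script under any name (anemoi-train.sh below is just a placeholder) and submit it with sbatch; the output goes to the %x-%j.out file set in the header:

sbatch anemoi-train.sh
squeue -u $USER                # check that the job is queued/running
tail -f anemoi-<jobid>.out     # follow training output once the job starts (replace <jobid> with the real job ID)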
This could be related to your pytorch lightning version. There is a bug with pytorch lightning 2.5.5 which causes processes to crash when running across multiple GPUs via data parallelism.
You can check your pytorch lightning version with:
pip show pytorch-lightning
If it is v2.5.5, you can downgrade to a known-good version with:
pip install --upgrade pytorch-lightning==2.5.4