This is the user guide for the Atos Sequana XH3000 HPCF installed in ECMWF's data centre in Bologna, known as the AG cluster. This platform provides GPU capabilities for ML/AI workloads and will be open to users of the HPCF service.

Below you will find some basic information on the different parts of the system. Please click on the headers or links to get all the details for the given topic.

AG: How to connect

From outside ECMWF, you may use Teleport through our gateway in Bologna. For all the details of this connection method please see the Teleport documentation, where you will find how to best configure your SSH settings. Once the setup of the Teleport client is complete on your end, you will be able to connect with:

AG: System overview

The AG cluster is based on the Eviden BullSequana XH3000 architecture and consists of 30 accelerated compute nodes, each configured with four Grace Hopper GH200 Superchips.

  • Each node has 4 Grace Hopper superchips, a total of 480 GB of RAM and 384 GB of HBM3. Note that the GPU memory is also seen as main memory, so you will see a total of ~864 GB of RAM.
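The memory figures above can be checked with a little shell arithmetic. This is only a sketch using the numbers quoted in this section (30 nodes, 4 superchips per node, 480 GB of RAM and 384 GB of HBM3 per node); the variable names are illustrative and not part of any ECMWF tool.

```shell
#!/bin/sh
# Figures quoted in this section; variable names are illustrative.
nodes=30
chips_per_node=4
ram_gb_per_node=480
hbm_gb_per_node=384

# Memory visible on a node: CPU RAM plus GPU HBM3, seen as one pool.
visible_gb_per_node=$(( ram_gb_per_node + hbm_gb_per_node ))

# HBM3 share of each superchip and cluster-wide superchip count.
hbm_gb_per_chip=$(( hbm_gb_per_node / chips_per_node ))
total_chips=$(( nodes * chips_per_node ))

echo "Visible memory per node: ${visible_gb_per_node} GB"
echo "HBM3 per superchip:      ${hbm_gb_per_chip} GB"
echo "Superchips in cluster:   ${total_chips}"
```

This gives 864 GB of visible memory per node, 96 GB of HBM3 per superchip and 120 superchips across the cluster.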

AG: Shells

You will find a familiar environment, similar to other ECMWF platforms. Bash and Ksh are available as login shells, with Bash being the recommended option.

Note that CSH is not available. If you are still using it, please move to a supported shell.

Changing your shell

If you wish to change your default shell, please let us know via the ECMWF Support Portal and we will implement that change for you.

AG: Filesystems

The filesystems available (HOME, PERM, HPCPERM and SCRATCH) are the same as on other ECMWF platforms such as HPCF, ECS and VDI, so no data transfers are needed from those platforms.

If you wish to transfer data to or from other sources, see AG: File transfers for more information.

File System | Technology | Features | Quota

AG: File transfers

For transfers to ECMWF, we recommend using rsync, which transfers the files over an SSH connection. For that, you will need to have Teleport configured with the appropriate settings in your SSH config file.

Any file transfer tool that supports SSH and the ProxyJump feature should work, such as the command line tools sftp or scp. Alternatively, you may also use the Linux Virtual Desktop and its folder sharing capabilities to copy local files to your ECMWF HOME or PERM.
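As an illustration of the rsync approach, the sketch below builds a transfer command and prints it for review. It assumes your SSH config already contains a working host entry that goes through Teleport; the host alias `my-ecmwf-host` and both paths are hypothetical placeholders, not real ECMWF names.

```shell
#!/bin/sh
# Hypothetical placeholders: take the real names from your own SSH
# configuration and the Teleport documentation.
remote_host="my-ecmwf-host"   # assumed Host alias defined in ~/.ssh/config
src="./results/"              # local directory (trailing slash: copy contents)
dest="transfer/results/"      # destination, relative to the remote HOME

# -a preserves permissions and times, -v is verbose, -z compresses in transit.
# rsync runs over SSH, so ProxyJump settings in ~/.ssh/config are honoured.
rsync_cmd="rsync -avz ${src} ${remote_host}:${dest}"

# Print the command for review; run it yourself once the names are correct.
echo "$rsync_cmd"
```

Adding `-n` (dry run) to the rsync options is a cheap way to see what would be copied before transferring anything.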

AG: Software stack

See AG: The Lmod Module system for a complete picture

A number of software packages, libraries, compilers and utilities are made available through the Lmod module system.

If you want to use a specific software package, please check whether it is already provided as a module. If a package or utility is not provided, you may install it yourself in your account or, alternatively, report it as a "Problem on computing" through the ECMWF Support Portal.
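Checking for a package typically looks like the sketch below, which uses the standard Lmod subcommands `avail`, `load` and `list`. The package name `netcdf4` is a hypothetical example and may not match a module name on AG.

```shell
#!/bin/sh
# "netcdf4" is a hypothetical package name used only for illustration.
pkg="netcdf4"

if command -v module >/dev/null 2>&1; then
    module avail "$pkg"                  # list modules matching the name
    module load "$pkg" && module list    # load it, then show what is loaded
    module_cmd_found=yes
else
    # Outside the cluster environment the module command does not exist.
    echo "The module command is not available in this shell." >&2
    module_cmd_found=no
fi
```

If `module avail` shows nothing for the name you expect, `module spider` searches the full module tree, including modules that are not currently loadable.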

AG: Batch system

Slurm is the batch system available. Any script can be submitted as a job with no changes, but you might want to see Writing SLURM jobs to customise it.

To submit a script as a serial job with default options enter the command:

sbatch yourscript.sh
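If you do want to customise a job, options are given as `#SBATCH` directives at the top of the script. The sketch below writes a minimal job script and submits it only where `sbatch` is present. The job name, output pattern and time limit are illustrative values, and no site-specific options (such as queues or GPU requests) are assumed here.

```shell
#!/bin/sh
# Write a minimal job script. The directive values are illustrative only.
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=hello.%j.out
#SBATCH --time=00:05:00

echo "Running on $(hostname)"
EOF

# Submit only where Slurm is available.
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello_job.sh
else
    echo "sbatch not found; job script written to hello_job.sh"
fi
```

In the output file name, `%j` is replaced by Slurm with the job ID, so repeated submissions do not overwrite each other's output.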

AG: Cron service

If you need to run a task automatically at regular intervals, you may use our cron service, available on the hosts hpc-cron for HPC users and ecs-cron for those with no access to the HPCF.

Use hpc-cron or ecs-cron

Do not run your crontabs on any host other than "hpc-cron" or "ecs-cron": crontabs installed elsewhere may disappear at any point after a reboot or maintenance session. The only guaranteed nodes are hpc-cron and ecs-cron.
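For reference, a crontab entry has five time fields followed by the command. The sketch below writes one hypothetical entry to a file for review; the script path `$HOME/bin/daily_task.sh` and the schedule are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical entry: run $HOME/bin/daily_task.sh every day at 06:30.
# Fields: minute hour day-of-month month day-of-week command
cron_entry='30 6 * * * $HOME/bin/daily_task.sh'

# Save the entry to a file so it can be reviewed before installing.
printf '%s\n' "$cron_entry" > my.crontab
cat my.crontab

# On hpc-cron or ecs-cron, "crontab -l" shows your current table and
# "crontab my.crontab" would install this file, REPLACING any existing table.
```

Because installing a file replaces the whole table, it is safer to start from the output of `crontab -l` when you already have entries.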

AG: GPU usage for AI and Machine Learning

Limited availability

Since the number of GPUs is limited, be mindful of your usage and do not leave your jobs or sessions idle on GPU nodes. Cancel your jobs when you are done so that someone else can make use of the resources.
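One way to check for leftover jobs and cancel them is with the standard Slurm commands `squeue` and `scancel`, as in the sketch below. The `job_id` variable is a hypothetical placeholder, left empty here so that nothing is cancelled by accident.

```shell
#!/bin/sh
# "job_id" is a hypothetical placeholder; copy the real ID from squeue.
job_id=""
me="${USER:-$(id -un)}"

if command -v squeue >/dev/null 2>&1; then
    squeue -u "$me"              # list your own pending and running jobs
    if [ -n "$job_id" ]; then
        scancel "$job_id"        # cancel the job you no longer need
    fi
    slurm_found=yes
else
    echo "Slurm commands not found on this host." >&2
    slurm_found=no
fi
```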

GPU exclusive use

AG: Missing features and known issues

If you find any problem, or any feature missing that you think should be present, and it is not listed here, please let us know by reporting it as a "Problem on computing" through the ECMWF Support Portal, mentioning "AG" in the summary.

AG is not an operational platform yet, and many features or elements may be added gradually as the complete setup is finalised. Here is a list of the known limitations, missing features and issues.