This is the User guide for the Atos Sequana XH2000 HPCF, installed in ECMWF's data centre in Bologna. This platform provides both the HPCF (AA, AB, AC, AD complexes) and ECGATE services (ECS), which in the past had been on separate platforms.
Introductory tutorial
If you are new to the Atos HPCF and ECS services, you may also be interested in following the Atos HPCF and ECS Introduction Tutorial to learn the basic aspects from a practical perspective.
Below you will find some basic information on the different parts of the system. Please click on the headers or links to get all the details for the given topic.
HPC2020: How to connect
From outside ECMWF, you may connect via Teleport through our gateway in Bologna, jump.ecmwf.int. Direct access through the ECACCESS service is not available.
$> tsh login --proxy=jump.ecmwf.int
$> ssh -J user@jump.ecmwf.int user@hpc-login
# or for users with no formal access to the HPC service:
$> ssh -J user@jump.ecmwf.int user@ecs-login
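If you connect regularly, it may be convenient to describe the jump host once in your ssh client configuration instead of typing the -J option every time. A minimal sketch of an ~/.ssh/config entry, assuming your ECMWF user ID is "user" (replace it with your own):

Host jump.ecmwf.int
    User user
Host hpc-login ecs-login
    User user
    ProxyJump jump.ecmwf.int

With such an entry in place, ssh hpc-login (or ssh ecs-login) is enough to reach the login nodes after a successful tsh login.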
Atos HPCF: System overview
The Atos HPCF consists of four virtually identical complexes: AA, AB, AC and AD. In total, this HPCF features 8128 nodes:
- 7680 compute nodes, for parallel jobs
- 448 GPIL (General Purpose and Interactive Login) nodes, which are designed to absorb the interactive and post-processing work previously done on older platforms such as the Cray HPCF, ECGATE and the Linux Clusters.
HPC2020: Shells
You will find a familiar environment, similar to other ECMWF platforms. Bash and Ksh are available as login shells, with Bash being the recommended option.
Note that CSH is not available. If you are still using it, please move to a supported shell.
Changing your shell
HPC2020: Filesystems
The filesystems available are HOME, PERM, HPCPERM and SCRATCH. They are completely isolated from those on other ECMWF platforms in Reading such as ECGATE or the Cray HPCF.
Filesystems from those platforms are not cross-mounted either, so if you need to use data from another ECMWF platform such as ECGATE or the Cray HPCF, you will need to transfer it first using scp or rsync. See HPC2020: File transfers for more information.
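As an illustrative sketch, each of these filesystems is usually reachable through an environment variable of the same name in your shell (an assumption you can check in your own session):

$> echo $HOME $PERM $HPCPERM $SCRATCH    # print the paths of the main filesystems
$> cd $SCRATCH                           # run your jobs and keep large temporary files here
$> cp $PERM/mydata.grib .                # mydata.grib is a hypothetical file kept on PERM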
HPC2020: File transfers
For transfers to ECMWF, we recommend using rsync, which will transfer the files over an ssh connection. For that, you will need to have Teleport configured with the appropriate settings in your ssh config file.
Any file transfer tool that supports SSH and the ProxyJump feature should work, such as the command-line tools sftp or scp. Alternatively, you may also use the Linux Virtual Desktop and its folder-sharing capabilities to copy local files to your ECMWF HOME or PERM.
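As a sketch, a transfer from your local machine through the Teleport gateway could look like the following; the user ID, directories and file names are placeholders:

$> rsync -av -e "ssh -J user@jump.ecmwf.int" ./mydata/ user@hpc-login:mydata/
$> scp -o ProxyJump=user@jump.ecmwf.int results.tar.gz user@hpc-login:

Both commands copy into your HOME on the Atos HPCF; adjust the remote path if you want the data in PERM or SCRATCH instead.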
HPC2020: Software stack
See HPC2020: The Lmod Module system for the complete picture.
A wide range of software packages, libraries, compilers and utilities is made available through the HPC2020: The Lmod Module system.
If you want to use a specific software package, please check if it is already provided in modules. If a package or utility is not provided, you may install it yourself in your account or, alternatively, report it as a "Problem on computing" through the ECMWF Support Portal.
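A quick sketch of the typical Lmod workflow; the module names used here are only examples and the exact list on the system may differ:

$> module avail                # list the modules currently available
$> module spider netcdf        # search the whole module tree for a package
$> module load prgenv/gnu      # load a module, here an example programming environment
$> module list                 # show what is currently loaded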
HPC2020: Batch system
QoS name | Type | Suitable for... | Shared nodes | Maximum jobs per user | Default / Max Wall Clock Limit | Default / Max CPUs | Default / Max Memory per node |
---|---|---|---|---|---|---|---|
ng | GPU | serial and small parallel jobs. It is the default | Yes | - | average runtime + standard deviation / 2 days | 1 / - | 8 GB / 500 GB |
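Jobs are submitted with the standard SLURM tools. A minimal sketch of a job script follows; the resources are illustrative and, if no QoS is given, the default one applies:

#!/bin/bash
#SBATCH --job-name=test-job
#SBATCH --output=test-job-%j.out
#SBATCH --time=00:10:00
#SBATCH --mem=8G

echo "Running on $(hostname)"

Submit it with sbatch test-job.sh and check its state with squeue -u $USER.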
HPC2020: Cron service
If you need to run a certain task automatically at regular intervals, you may use our cron service, available on the hosts hpc-cron for HPC users and ecs-cron for those with no access to the HPCF.
Use hpc-cron or ecs-cron
Do not set up your crontabs on any host other than "hpc-cron" or "ecs-cron": crontabs created elsewhere may disappear at any point after a reboot or maintenance session. The only guaranteed nodes are hpc-cron and ecs-cron.
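Crontabs are managed with the usual crontab command once you are logged in on the dedicated node. A minimal sketch; the script path is a placeholder:

$> ssh hpc-cron              # or ecs-cron if you have no access to the HPCF
$> crontab -l                # list your current crontab on that node
$> crontab -e                # edit it, e.g. to add a line like the following:
15 3 * * * $HOME/bin/housekeeping.sh >> $HOME/housekeeping.log 2>&1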
HPC2020: Using ecFlow
If you wish to use ecFlow to run your workloads, ECMWF will provide you with a ready-to-go ecFlow server running on an independent Virtual Machine outside the HPCF. These servers take care of the orchestration of your workflow, while all the tasks in your suites are actually submitted and run on the HPCF. With each machine dedicated to a single ecFlow server, there are no restrictions on CPU time and no possibility of interference from other users.
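Once your dedicated server is running, the standard way to talk to it is through the ECF_HOST and ECF_PORT variables used by the ecFlow client. A minimal sketch, with a hypothetical server name and port that ECMWF would communicate to you:

$> export ECF_HOST=my-ecflow-server      # hypothetical hostname of your dedicated server
$> export ECF_PORT=3141                  # hypothetical port number
$> ecflow_client --ping                  # check that the server is reachable
$> ecflow_client --load=mysuite.def      # load a suite definition onto the server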
HPC2020: Accounting
To ensure that computing resources are distributed equitably and to discourage irresponsible use, users' jobs on the Atos HPCF are charged for the resources they use against a project account. Each project account is allocated a number of System Billing Units (SBUs) at the beginning of each accounting year (1 January to 31 December). You can monitor your usage in the HPC SBU accounting portal.
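The project account to charge is chosen at submission time with the standard SLURM account option. A minimal sketch, using a hypothetical project account "myproject":

$> sbatch --account=myproject job.sh     # charge this job to the myproject account

or equivalently inside the job script itself:

#SBATCH --account=myproject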
HPC2020: ECaccess
ECaccess in Bologna no longer offers interactive login access. Users should instead use Teleport (Teleport SSH Access) or VDI (How to connect - Linux Virtual Desktop VDI) to access the Atos systems in Bologna.
The ECACCESS web toolkit services, such as job submission (including Time-Critical Option 1 jobs, see below), file transfers and ectrans, have been set up on the Atos HPCF with the ECACCESS gateway boaccess.ecmwf.int. Previously installed remote (at your site) ECaccess Toolkits should be able to interact with this new gateway; however, we recommend installing the latest version, available from Releases - Web Toolkit. To make existing remote ECaccess Toolkits work with the ECaccess gateways in Bologna, you will need to define the following two environment variables, e.g. to talk to the ECMWF ECaccess gateway in Bologna:
HPC2020: Time Critical option 1 activities
The ECaccess software includes a service to launch user jobs according to the dissemination schedule (Dissemination schedule) of ECMWF's real-time data and products. This service is also known as the TC-1 service, or TC-1 jobs. For more information on TC-1, see Simple time-critical jobs.
End of computing services in Reading
HPC2020: Time Critical Option 2 setup
Under the Framework for time-critical applications, Member States can run ecFlow suites monitored by ECMWF. Known as Option 2 within that framework, these suites enjoy a special technical setup to maximise robustness and high availability, similar to ECMWF's own operational production. When moving from a standard user account to a time-critical one (typically starting with a "z" followed by two or three characters), there are a number of things you must be aware of.
HPC2020: GPU usage for AI and Machine Learning
The Atos HPCF features 26 special GPIL nodes with GPUs for experimentation and testing of GPU-enabled applications and modules, as well as Machine Learning and AI workloads. Present in only one of the complexes (AC), each node is equipped with 4 NVIDIA A100 40GB cards. They can be used in batch through the special "ng" QoS in the SLURM Batch System. Interactive jobs are also possible with ecinteractive -g.
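As a sketch, a GPU batch job could request the "ng" QoS and a single GPU as follows; the --gpus request and the time limit are illustrative:

#!/bin/bash
#SBATCH --qos=ng
#SBATCH --gpus=1
#SBATCH --time=01:00:00

nvidia-smi     # report the GPU(s) allocated to the job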
HPC2020: Missing features and known issues
If you find any problem or any feature missing that you think should be present, and it is not listed here, please let us know by reporting it as a "Problem on computing" through the ECMWF Support Portal, mentioning "Atos" in the summary.
The Atos HPCF is not an operational platform yet, and many features or elements may be added gradually as the complete setup is finalised. Here is a list of the known limitations, missing features and issues.
HPC2020: FAQs
Here are the most common pitfalls users face when working on our Atos HPCF.
News Feed
2024-11-06 Change of default versions of ECMWF software packages and Python - November 2024
When?
The changes will take place on Wednesday 06 November 2024 09:00 UTC
Do I need to do anything?
2024-06-11 Update of Operating System to RHEL 8.8 on AC complex
We are in the process of updating the Operating System on all complexes of our Atos HPCF, from RedHat RHEL 8.6 to RHEL 8.8.
The default Member-State user complex AC will be updated on:
11 June 2024 from 08:00 UTC
2024-05-15 Change of default versions of ECMWF and third-party software packages
When?
The changes took place on Wednesday 15 May 2024 09:00 UTC
Do I need to do anything?
2024-04-03 Introducing the new ECMWF JupyterHub service
2024-03-13 System session on Wednesday 13 March affecting work on the ECMWF Atos HPC
ECMWF scheduled a network system session to be held on Wednesday 13 March 2024 which impacted work on the Atos HPC. The session lasted for 4.5 hours from 12:00 UTC to 16:30 UTC.
During the session, there was NO login access to hpc-login and NO user batch jobs submitted from hpc-login or via hpc-batch could run. Batch jobs submitted before the session which were not expected to complete before the sessions started were queued until after the session finished. Time-Critical Option 1 and 2 workloads were not affected and continued to run during the session.
2023-11-22 Change of default versions of ECMWF software packages
When?
The changes will take place on Wednesday 22 November 2023 09:00 UTC
Do I need to do anything?
2023-05-31 Change of default versions of ECMWF and third-party software packages
When?
The changes will take place on Wednesday 31 May 2023 09:00 UTC
Do I need to do anything?
2023-03-27 Scratch automatic purge enabled
From 27 March 2023, the automatic purge of unused files in SCRATCH is enforced. Any files that have not been accessed at any time in the previous 30 days will be automatically deleted. This purge will be conducted regularly, in order to keep the usage of this filesystem within optimal parameters.
SCRATCH is designed to hold large temporary files and to act as the main storage and working filesystem for your jobs' and experiments' input and output files, but not to keep data in the long term.
2 Comments
Eduardo Damasio da Costa
Thanks a lot Xavier. It is pertinent and well written.
Alan Geer
Could we have something about the debuggers available on ATOS? Thanks, and apologies if I missed this somewhere.