
...

Examples of task include files that enable communication between a batch job and ecFlow servers are available from the ECMWF git repository.

Trapping of errors

It is crucial that ecFlow knows when a task has failed so that it can accurately report the state of all the tasks in your suites. This is why you need to make sure error trapping is done properly. Trapping is typically set up in one of your ecFlow headers; an example is available in the ECMWF git repository.

Note: Migrating from older suites

If you are migrating your suite from a previous ECMWF platform, it is quite likely that your headers will need some tweaking for the trapping to work well on our Atos HPCF. This is a minimal example of a header configured with error trapping, which you can use as is or as inspiration for modifying your existing headers. The main points are:

  • Make sure you have at least a set -e in your header so that any non-zero return code triggers a failure straight away.
  • DO NOT trap signal 15 (SIGTERM) in your ecFlow header, even if it sounds counterintuitive. You should trap at least signal 0, but for robustness we advise trapping all other signals except SIGTERM.
  • Make sure you do not have a "wait" command before your ecflow_client --abort in your trap function.
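
The points above can be sketched as a small header fragment. This is only an illustration, not the header from the ECMWF git repository: the function name ERROR, the variable TRAPPED_SIGNALS and the exact list of signal numbers are assumptions made for this example. A real header would also export the ECF_* variables that ecflow_client needs to reach the server.

```shell
# Sketch of an ecFlow header fragment with error trapping.
# ERROR and TRAPPED_SIGNALS are illustrative names, not taken from the ECMWF header.

set -e  # any non-zero return code triggers a failure straight away

# Signal 0 (script exit) plus a selection of other signals.
# Signal 15 (SIGTERM) is deliberately left out of this list.
TRAPPED_SIGNALS="0 1 2 3 4 5 6 7 8 10 12 13"

ERROR() {
    set +e                    # do not fail again inside the handler
    trap - $TRAPPED_SIGNALS   # clear the traps so the handler is not re-entered
    # Note: no "wait" before the abort call.
    ecflow_client --abort     # tell the ecFlow server that the task has failed
    exit 1
}

# Word splitting of the list is intentional here.
trap ERROR $TRAPPED_SIGNALS
```

With a fragment like this, the matching tail of the job would need to clear the trap on signal 0 (for example with trap - 0) before reporting completion, otherwise a normal exit would also be reported as an abort.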

Job management

Info: SSH key authentication

SSH is used for communication between the ecFlow server VM and the HPC nodes. Therefore, you need to generate SSH keys and add the public key to ~/.ssh/authorized_keys on the same system. For detailed instructions on how to generate an SSH key pair, see the HPC2020: How to connect page.

...