If you wish to use ecFlow to run your workloads, ECMWF will provide you with a ready-to-go ecFlow server running on an independent virtual machine outside the HPCF. These servers take care of the orchestration of your workflow, while all the tasks in your suites are actually submitted and run on the HPCF. With each machine dedicated to a single ecFlow server, there are no restrictions on CPU time and no possibility of interference with other users.

Housekeeping

Please avoid running ecFlow servers yourself on HPCF nodes. If you still run one, please get in touch with us through the ECMWF support portal to discuss your options.

We may also remove any servers which are inactive for over 6 months. You will need to request a new one if you wish to use it again afterwards.

Getting started

If you don't have a server yet, please raise an issue through the ECMWF support portal requesting one.

We will create one for you and give you the server details, so you can connect straight away from both the ecFlow client on the command line and the ecFlow UI.

For example, if you are given the server ecflow-gen-$USER-001, you can check that it is up and running and see some statistics from the HPCF with:

$ module load ecflow
$ export ECF_HOST=ecflow-gen-$USER-001
$ ecflow_client --ping
$ ecflow_client --stats

To configure it in your ecFlow UI, go to Servers > Manage Servers and click on Add Server.

You can then fill in the form with the host name you were given and port 3141.

You don't need to SSH into the server unless there is a problem. If the server dies for some reason, it should be restarted automatically, but if it is not, you may restart it manually with:

ssh $ECF_HOST sudo systemctl restart ecflow-server


Security

We recommend creating a white list authorisation file called <host>.3141.ecf.lists under your ~/ecflow_server directory. For example, you may allow full access only to your own user and leave the rest read-only with:

4.4.14
# Full read/write access for your own user ID (replace "user" with yours)
user
# Everyone else gets read-only access
-*

If you create or update your white list, remember to issue the following command so the server picks it up:

ecflow_client --reloadwsfile

Preparing your suites and tasks

Both the HPCF and the ecFlow server will recognise you as the same user and use the same HOME and PERM filesystems. For simplicity, in most cases the easiest solution is to keep your suite's ECF_HOME, ECF_FILES and ECF_INCLUDE on HOME or PERM, instead of SCRATCH or HPCPERM.
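
For illustration, the corresponding suite variables could look like the sketch below; the paths are placeholders, not a prescribed layout:

Suite filesystem variables in your suite.def
edit ECF_HOME    /home/<your_user_id>/ecflow/my_suite
edit ECF_FILES   /home/<your_user_id>/ecflow/my_suite/files
edit ECF_INCLUDE /home/<your_user_id>/ecflow/my_suite/include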

Where to store the job output?

For the job standard output and error, we recommend using HOME in most cases. We discourage using PERM, as it is known to cause random job failures.

If you don't want to use HOME, you may use HPCPERM or SCRATCH for ECF_OUT as well. However, bear in mind that in those cases you may need to start and maintain a log server on the HPCF to be able to see the job output from your ecFlow UI.

Remember that all tasks will need to be submitted as jobs through the batch system, so you should avoid running tasks locally on the node where the server runs. Make sure that your task header contains the necessary SBATCH directives to run the job. As a minimum:

head.h snippet
#!/bin/bash
#SBATCH --job-name=%ECF_JOB%
#SBATCH --qos=%QUEUE%
#SBATCH --output=%ECF_JOBOUT%
#SBATCH --error=%ECF_JOBOUT%

You may need to add more directives for parallel jobs to define the resources needed. See HPC2020: Batch system for more examples and potential options you may wish to include.
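
For example, a parallel task might add resource directives along these lines (the values here are purely illustrative):

Parallel directives in head.h
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=4
#SBATCH --hint=nomultithread
#SBATCH --time=01:00:00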

Examples of task include files enabling communication between a batch job and the ecFlow server are available from the ECMWF git repository.

Logservers

If you decide to store the job standard output and error on a filesystem only mounted on the HPCF (such as SCRATCH or HPCPERM), your ecFlow UI running outside the HPCF (such as on your VDI) will not be able to access the output of those jobs out of the box. In that case you would need to start a log server on the Atos HPCF so your client can access those outputs. The log server must run on the hpc-log node, and if you need a crontab to make sure it is running, you should place it on hpc-cron.
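
As a sketch, the suite variables involved could look like the following. ECF_LOGHOST and ECF_LOGPORT are the variable names ecFlow UI conventionally looks for; the output path and port number below are placeholders:

Log server variables in your suite.def
edit ECF_OUT     /ec/res4/hpcperm/<your_user_id>/ecflow_output
edit ECF_LOGHOST hpc-log
edit ECF_LOGPORT 9316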

Trapping of errors

It is crucial that ecFlow knows when a task has failed so it can accurately report the state of all the tasks in your suites. This is why you need to make sure error trapping is done properly. This is typically done in one of your ecFlow headers, for which you have an example in the ECMWF git repository.

Migrating from other platforms

If you are migrating your suite from a previous ECMWF platform, it is quite likely that your headers will need some tweaking for the trapping to work well on our Atos HPCF. This is a minimal example of a header configured with error trapping, which you can use as is or as inspiration to modify your existing headers. The main points, illustrated in the sketch after this list, are:

  • Make sure you have at least a set -e in your header so any non-zero return code triggers a failure straight away.
  • DO NOT trap signal 15 (SIGTERM) in your ecFlow header, even if it sounds counterintuitive. You should trap at least signal 0 (EXIT), but for robustness we advise trapping all the other signals except SIGTERM.
  • Make sure you do not have a "wait" command before the ecflow_client --abort call in your trap function.
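
A minimal sketch along these lines is shown below. It is only an illustration of the points above, not the exact header from the git repository, and the set of ecFlow variables and trapped signals may need adapting to your suite:

Error trapping sketch for head.h
#!/bin/bash
set -e   # any non-zero return code aborts the job immediately
set -u   # treat undefined variables as errors

# ecFlow variables needed by ecflow_client to identify this task
export ECF_NAME=%ECF_NAME%    # path of the task in the suite
export ECF_PASS=%ECF_PASS%    # job password generated by the server
export ECF_TRYNO=%ECF_TRYNO%  # current try number
export ECF_HOST=%ECF_HOST%    # ecFlow server host
export ECF_PORT=%ECF_PORT%    # ecFlow server port

# Tell the server the job has started
ecflow_client --init=$$

# On any error or trapped signal, tell the server the task has aborted.
# Note: no "wait" before the --abort call, and signal 15 (SIGTERM) is NOT trapped.
ERROR() {
  set +e            # avoid recursive failures inside the trap
  trap - 0          # remove the exit trap
  ecflow_client --abort=trap
  exit 1
}
trap ERROR 0                              # trap exit (signal 0)
trap ERROR 1 2 3 4 5 6 7 8 10 12 13 14    # trap other signals, but not 15 (SIGTERM)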

Job management

SSH key authentication

SSH is used for communication between the ecFlow server VM and the HPCF nodes. Therefore, you need to generate an SSH key pair and add the public key to ~/.ssh/authorized_keys on the same system. For detailed instructions on how to generate the SSH key pair, please see the HPC2020: How to connect page.
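
As a quick sketch, assuming an ed25519 key and the default file locations (see the linked page for the recommended procedure):

ssh-keygen -t ed25519
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys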

ecFlow delegates job management tasks such as submitting, killing or checking the status of jobs to external applications. For your convenience, you may use troika, a tool that takes care of those tasks. To use it, just make sure you have the following variables defined at the suite level:

Job management variables in your suite.def
edit QUEUE nf
edit SCHOST hpc
edit ECF_JOB_CMD troika submit -o %ECF_JOBOUT% %SCHOST% %ECF_JOB%
edit ECF_KILL_CMD troika kill %SCHOST% %ECF_JOB%
edit ECF_STATUS_CMD troika monitor %SCHOST% %ECF_JOB%

Of course, you may change QUEUE to np if you are running bigger parallel jobs.
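
For example, for a suite running larger parallel tasks:

edit QUEUE np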

To use a custom troika executable or a personal configuration file, the ecFlow variables should be defined like this:

Job management variables in your suite.def
edit QUEUE nf
edit SCHOST hpc
edit TROIKA /path/to/bin/troika
edit TROIKA_CONFIG /path/to/troika.yml
edit ECF_JOB_CMD %TROIKA% -c %TROIKA_CONFIG% submit -o %ECF_JOBOUT% %SCHOST% %ECF_JOB%
edit ECF_KILL_CMD %TROIKA% -c %TROIKA_CONFIG% kill %SCHOST% %ECF_JOB%
edit ECF_STATUS_CMD %TROIKA% -c %TROIKA_CONFIG% monitor %SCHOST% %ECF_JOB%

For convenience, the default location of troika and its configuration file are defined in the server variables TROIKA and TROIKA_CONFIG.

You are of course free to use any other solution for the submission, kill and status monitoring of your jobs. If you write your own, please note that:

  • You will need to use ssh "<complex>-batch" to run the relevant Slurm commands on the appropriate complex (see the sketch after this list).
  • If the job has not started when the kill is issued, your ecFlow server will not be notified that the job has been aborted. You would need to set it to aborted manually, or alternatively use the -b or -f options so that scancel sends the signal once the job has started.

    scancel --signal=TERM -b ${SLURM_JOB_ID}
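
As a rough illustration only (troika is normally the simpler option, and these commands are assumptions you would need to adapt and test), a hand-rolled setup relying on the job name set in the header could look like this:

Custom job management variables in your suite.def
edit SCHOST hpc
edit ECF_JOB_CMD ssh %SCHOST%-batch sbatch %ECF_JOB% > %ECF_JOBOUT% 2>&1
edit ECF_KILL_CMD ssh %SCHOST%-batch scancel --signal=TERM -b -n %ECF_JOB%
edit ECF_STATUS_CMD ssh %SCHOST%-batch squeue -n %ECF_JOB%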

Connecting to the ecFlow server 

Through VDI

ecFlow UI is installed on your VDI, and you can use it to connect to your ecFlow server straight away.

If you are on a Reading-based VDI, you may need to configure a log server to unlock all the remote output visualisation features.

Through a graphical VNC session

You may spin up a graphical VNC session on the HPCF with ecinteractive from your VDI. Once in the VNC session, run the following from a terminal within it:

module load ecflow
ecflow_ui

Through an SSH tunnel 

You may alternatively use the native ecflow_ui client on your End User Device, but an additional step is required to ensure connectivity between both ends. You will need to create an SSH tunnel, forwarding the port where the ecFlow server is running.

  1. Authenticate via Teleport on your End User device
  2. Create the SSH tunnel with:

    ssh -N -L3141:localhost:3141 -J jump.ecmwf.int,hpc-login <ecflow_host>

    where the first '3141' is the local port. For example, if the server is started on the host ecflow-gen-user-001:

    ssh -N -L3141:localhost:3141 -J jump.ecmwf.int,hpc-login ecflow-gen-user-001
  3. Open ecflow_ui on your End User Device and configure the new server, using "localhost" as the host and the ecFlow (local) port used above.

As the local port, you may use any other free port if that particular one is in use. Note also that you will need to start one SSH tunnel for each ecFlow server you want to monitor, using a different local port number for each.
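
For example, to monitor a hypothetical second server ecflow-gen-user-002 through local port 3142:

    ssh -N -L3142:localhost:3141 -J jump.ecmwf.int,hpc-login ecflow-gen-user-002

You would then add it in ecflow_ui with host "localhost" and port 3142.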

X11 forwarding

This should be your last resort when running outside ECMWF, since the experience of running heavy graphical applications through X11 forwarding tends to be poor.

You may also run ecflow_ui remotely on the Atos HPCF, and use X11 forwarding to display on your screen:

ssh -X hpc-login
module load ecflow
ecflow_ui

In this case, when adding the server, remember that it needs to be configured with the real name of the host running the ecFlow server, e.g. ecflow-gen-$USER-001.

