Versions Compared

Comment: Using ser

If you wish to use ecFlow to run your workloads, ECMWF will provide you with a ready-to-go ecFlow server running on an independent Virtual Machine outside the HPCF. That server takes care of orchestrating your workflow, while all tasks in your suites are submitted to and run on the HPCF. With each machine dedicated to one ecFlow server, there are no restrictions on CPU time and no possibility of interference from other users.

Both the HPCF and the ecFlow server will recognise you as the same user and use the same HOME and PERM filesystems, so we advise you to keep your suite's ECF_HOME, ECF_FILES, ECF_INCLUDE and ECF_OUT on those instead of SCRATCH or HPCPERM.

Info

Please avoid running the ecFlow servers yourself on HPCF nodes. If you still have one, please get in touch with us through the ECMWF support portal to discuss your options.

Getting started

If you don't have a server yet, please raise an issue through the ECMWF support portal requesting one.

We will create one for you and give you the server details, so you can connect straightaway from both your ecFlow client on the command line as well as the ecFlow UI.

For example, if you are given the server ecflow-gen-$USER-001, you can check it's up and running and see some stats from the HPCF with:

No Format
$ module load ecflow
$ export ECF_HOST=ecflow-gen-$USER-001
$ ecflow_client --ping
$ ecflow_client --stats
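If you want to run this check from a script, the ping can be wrapped in a small function. This is only a sketch: the function name check_ecflow_server is hypothetical, and it assumes that module load ecflow has already made ecflow_client available and that the server listens on port 3141 as described on this page.

```shell
# Hypothetical helper: returns 0 if the ecFlow server answers a ping.
# Assumes ecflow_client is already on the PATH (e.g. after `module load ecflow`).
check_ecflow_server() {
  (
    # Run in a subshell so ECF_HOST / ECF_PORT do not leak into the caller.
    export ECF_HOST="$1"
    export ECF_PORT="${2:-3141}"   # 3141 is the port quoted on this page
    ecflow_client --ping >/dev/null 2>&1
  )
}
```

It could then be used as: check_ecflow_server "ecflow-gen-$USER-001" && echo "server is up"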

To configure it in your ecFlow UI, go to Servers > Manage Servers and click on Add Server.

You can then fill in the form with the Host given and Port 3141.

Info

You don't need to SSH into the server unless there is a problem. If the server dies for some reason, it should be restarted automatically. If it is not, you may restart it manually with:

No Format
ssh $ECF_HOST sudo systemctl restart ecflow-server



Tip
titleSecurity

We recommend that you create a White List authorisation file called <host>.3141.ecf.lists under your ~/ecflow_server directory. For example, you may allow full access only to your own user and leave everyone else read-only with:

No Format
4.4.14
user
-*

If you create or update your white list, remember to issue the following command so the server picks it up:

No Format
ecflow_client --reloadwsfile
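The white list file can also be written from a script. The following is a minimal sketch, not an official tool: the function name write_whitelist and its third argument (target directory) are hypothetical, and the version line 4.4.14 is simply copied from the example above.

```shell
# Hypothetical helper: write a white list granting full access to one user
# and read-only access to everybody else.
# Usage: write_whitelist <host> <user> [directory]
write_whitelist() {
  host="$1"
  user="$2"
  dir="${3:-$HOME/ecflow_server}"        # default location named on this page
  file="$dir/$host.3141.ecf.lists"
  mkdir -p "$dir"
  # Line 1: version string from the example; line 2: full-access user;
  # line 3: "-*" leaves all other users read-only.
  printf '%s\n' "4.4.14" "$user" "-*" > "$file"
  printf '%s\n' "$file"                  # print the path that was written
}
```

After calling, for example, write_whitelist "$ECF_HOST" "$USER", the reload command shown above is still needed for the server to pick up the change.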

While the availability of virtual infrastructure to run ecFlow servers remains limited, you may start your ecFlow servers in the interim HPCF dedicated node to be able to run your suites. 

At a later stage, those ecFlow servers will need to be moved to dedicated Virtual Machines outside the HPCF, where practically no local tasks will be able to run. All ecFlow tasks will need to be submitted to one of the HPCF complexes through the corresponding Batch system.

Please do keep that in mind when migrating or designing your solution.


Starting the ecFlow server

The server needs to be started using the usual procedure on one of the AA login nodes, not through an interactive job.

No Format
module load ecflow troika
ecflow_start.sh <options>

...


Preparing your suites and tasks

...

Examples of task include files enabling communication between a batch job and ecFlow servers are available from the ECMWF git repository.

Job management

Info
titleSSH key authentication

SSH is used for communication between the ecFlow server VM and the HPC nodes. Therefore, you need to generate SSH keys and add the public key to ~/.ssh/authorized_keys on the same system. For detailed instructions on how to generate an SSH key pair, please see the HPC2020: How to connect page.

ecFlow delegates job management tasks such as submitting, killing or monitoring the status of jobs to external applications. For your convenience, you may use troika, a tool that will take care of those tasks. To use it, just make sure you have the following variables defined at the suite level:

...

Of course, you may change QUEUE to np if you are running bigger parallel jobs, or SCHOST to run on complexes other than aa.

By default, scancel does not send signals other than SIGKILL to the batch step. Consequently, you should use the "-b" or "-f" option to send a signal that the ecFlow job is designed to trap before it notifies the ecFlow server that the job was killed:

No Format
scancel --signal=TERM -b ${SLURM_JOB_ID}

In this example, SIGTERM (15) is sent, but other signals can be used as well. This option can also be used with Troika to kill jobs from ecflow_ui. It can be specified in a configuration file such as this one:

Bitbucket file: troika.yml (repository ecflow_include, project USS, branch refs/heads/master)

To use a custom troika executable or a personal configuration file with Troika, the ecFlow variables should be defined like this:

Troika uses SSH for communication between HPC nodes. Therefore, you need to generate SSH keys and add the public key to ~/.ssh/authorized_keys on the same system. For detailed instructions on how to generate an SSH key pair, please see the HPC2020: How to connect page.
Code Block
languagebash
titleJob management variables in your suite.def
edit QUEUE nf
edit SCHOST aa
edit TROIKA /path/to/bin/troika
edit TROIKA_CONFIG /path/to/troika.yml
edit ECF_JOB_CMD %TROIKA% -c %TROIKA_CONFIG% submit -o %ECF_JOBOUT% %SCHOST% %ECF_JOB%
edit ECF_KILL_CMD %TROIKA% -c %TROIKA_CONFIG% kill %SCHOST% %ECF_JOB%
edit ECF_STATUS_CMD %TROIKA% -c %TROIKA_CONFIG% monitor %SCHOST% %ECF_JOB%


Tip

For convenience, the default locations of troika and its configuration file are defined as the server variables TROIKA and TROIKA_CONFIG.
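To get a feel for what the job management commands look like once ecFlow has substituted the %VAR% placeholders, the expansion can be previewed with a few lines of shell. This is purely an illustration: expand_ecf_vars is a hypothetical helper, not part of ecFlow or troika, and the values passed to it in the usage note are made up.

```shell
# Hypothetical helper: replace %NAME% placeholders in a command template.
# Usage: expand_ecf_vars "<template>" NAME=value [NAME=value ...]
# Limitation: values must not contain "|" or "&", which are special to sed here.
expand_ecf_vars() {
  out="$1"
  shift
  for kv in "$@"; do
    name="${kv%%=*}"     # text before the first "="
    value="${kv#*=}"     # text after the first "="
    out=$(printf '%s' "$out" | sed "s|%${name}%|${value}|g")
  done
  printf '%s\n' "$out"
}
```

For instance, expand_ecf_vars '%TROIKA% -c %TROIKA_CONFIG% kill %SCHOST% %ECF_JOB%' TROIKA=/path/to/bin/troika TROIKA_CONFIG=/path/to/troika.yml SCHOST=aa ECF_JOB=/tmp/t1.job prints the kill command with every placeholder filled in.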

You are of course free to use any other solution for submitting, killing and monitoring your jobs. If you write your own, please note that:

  • You will need to use ssh "<complex>-batch" to run the relevant slurm commands on the appropriate complex.
  • By default, scancel does not send signals other than SIGKILL to the batch step. Consequently, if you wish to trap a manual job kill, you should use the "-b" or "-f" option to send a signal that the ecFlow job is designed to trap before it notifies the ecFlow server that the job was killed:

    No Format
    scancel --signal=TERM -b ${SLURM_JOB_ID}
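Putting the two points above together, a home-grown kill command might be assembled as follows. This is a sketch under assumptions: slurm_kill_cmd is a hypothetical name, and the function only prints the command line so that it can be inspected before being used, for example in ECF_KILL_CMD.

```shell
# Hypothetical helper: build the command that kills a Slurm job on a given
# complex by going through its "<complex>-batch" host.
# Usage: slurm_kill_cmd <complex> <job_id> [signal]
slurm_kill_cmd() {
  complex="$1"
  job_id="$2"
  signal="${3:-TERM}"    # default to SIGTERM so the job can trap it
  # -b delivers the signal to the batch step, as described above
  printf 'ssh %s-batch scancel --signal=%s -b %s\n' "$complex" "$signal" "$job_id"
}
```

For example, slurm_kill_cmd aa 12345 prints: ssh aa-batch scancel --signal=TERM -b 12345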


Connecting to the ecFlow server 

Due to the current limitation in network connectivity to arbitrary ports between our Reading and Bologna Data Centres, it is not possible to connect to that ecflow server in AA from your usual ecflow_ui in Reading.

...

Through VDI

The ecFlow UI is installed in your VDI, and you can use it to connect to your ecFlow server straight away.

Note

If you are on a Reading-based VDI, you may need to configure a logserver to unlock all the remote output visualisation features.

Through a graphical VNC session

You may spin up a graphical VNC session on the HPCF with ecinteractive on your VDI. Once in the VNC session, you can then do the following from a terminal within that VNC session:

...

You may alternatively use the native ecflow_ui client in your End User Device or VDI, but an additional step is required to ensure connectivity between both ends. You will need to create an SSH tunnel, forwarding the port where the ecflow server is running. 

  1. Authenticate via Teleport on your End User Device.
  2. Create the SSH tunnel with:

    No Format
    ssh -N -L3141:localhost:3141 -J jump.ecmwf.int,aa-login <ecflow_host>

    For example, if the server is started on the host ecflow-gen-user-001:

    No Format
    ssh -N -L3141:localhost:3141 -J jump.ecmwf.int,aa-login ecflow-gen-user-001

  3. Open ecflow_ui on your End User Device or VDI and configure the new server, using "localhost" as the host and the ecFlow port used above.
Tip

As the local port, you may use any other free port if that particular one is in use.
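As an illustration of using a different local port, the tunnel command can be generated by a small function. This is a sketch only: ecflow_tunnel_cmd is a hypothetical name, the remote port is assumed to be 3141 as stated on this page, and the function prints the command rather than running it.

```shell
# Hypothetical helper: print the ssh tunnel command for a given ecFlow host,
# optionally forwarding from a different local port.
# Usage: ecflow_tunnel_cmd <ecflow_host> [local_port]
ecflow_tunnel_cmd() {
  ecflow_host="$1"
  local_port="${2:-3141}"   # local end of the tunnel; remote end stays 3141
  printf 'ssh -N -L%s:localhost:3141 -J jump.ecmwf.int,aa-login %s\n' \
    "$local_port" "$ecflow_host"
}
```

For example, ecflow_tunnel_cmd ecflow-gen-user-001 13141 prints a command forwarding local port 13141. In that case ecflow_ui would be configured with "localhost" as the host and 13141 as the port.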

...

Note

This should be your last resort when running outside ECMWF, since the experience of running heavy graphical applications through X11 forwarding tends to be poor.

...

In this case, when adding the server, remember that it needs to be configured with the real name of the host running the ecFlow server, e.g. ecflow-gen-$USER-001.