
If you wish to use ecFlow to run your workloads, ECMWF will provide you with a ready-to-go ecFlow server running on an independent virtual machine outside the HPCF. Those servers take care of the orchestration of your workflow, while all the tasks in your suites are actually submitted and run on the HPCF. With each machine dedicated to one ecFlow server, there are no restrictions on CPU time and no possibility of interference with other users.

Both the HPCF and the ecFlow server will recognise you as the same user and use the same HOME and PERM filesystems, so we advise you to keep your suite's ECF_HOME, ECF_FILES, ECF_INCLUDE and ECF_OUT on those instead of SCRATCH or HPCPERM.

Info
titleHousekeeping

Please avoid running the ecFlow servers yourself on HPCF nodes. If you still have one, please get in touch with us through the ECMWF support portal to discuss your options.

We may also remove any servers which are inactive for over 6 months. You will need to request a new one if you wish to use it again afterwards.


Show If
groupifs


Info
titleprepIFS and ecFlow

If you are running prepIFS experiments, they will not require you to set up a personal ecFlow server. They will appear on preconfigured, dedicated servers for IFS experiments. Please refer to Migrating from Reading to Bologna for IFS users for more information.


Getting started

If you don't have a server yet, please raise an issue through the ECMWF support portal requesting one.

...

Preparing your suites and tasks

Both the HPCF and the ecFlow server will recognise you as the same user and use the same HOME and PERM filesystems. In most cases the simplest solution is to keep your suite's ECF_HOME, ECF_FILES and ECF_INCLUDE on HOME or PERM, instead of SCRATCH or HPCPERM.

Tip
titleWhere to store the job output?

For the job standard output and error, we recommend using HOME in most cases. We discourage using PERM, as it is known to cause random job failures.

If you don't want to use HOME, you may use HPCPERM or SCRATCH for ECF_OUT as well. However, bear in mind that in those cases you may need to start and maintain a log server on the HPCF to be able to see the job output from your ecFlow UI.
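As an illustration, assuming plain def-file syntax and purely hypothetical paths (replace <user> and the directories with your own), such a setup could look like:

```
# Hypothetical suite.def fragment: scripts and includes on HOME,
# job output on HPCPERM (which may then require a log server on the HPCF)
edit ECF_HOME    /home/<user>/ecflow_suites
edit ECF_FILES   /home/<user>/ecflow_suites/files
edit ECF_INCLUDE /home/<user>/ecflow_suites/include
edit ECF_OUT     /hpcperm/<user>/ecflow_output
```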

Remember that all tasks will need to be submitted as jobs through the batch system, so you should avoid running tasks locally on the node where the server runs. Make sure that your task header contains the necessary SBATCH directives to run the job. As a minimum:

...
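For illustration, a minimal set of directives in a task header might look like the sketch below. The queue names and limits are assumptions to adjust for your suite; the %VAR% placeholders are substituted by ecFlow when the job is generated.

```shell
#!/bin/bash
# Hypothetical minimal task header - values are examples only
#SBATCH --job-name=%TASK%       # task name, substituted by ecFlow
#SBATCH --qos=nf                # fractional queue; np for bigger parallel jobs
#SBATCH --time=00:30:00         # wall-clock limit for the task
#SBATCH --output=%ECF_JOBOUT%   # write the output where ecFlow expects it
```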

Examples of task include files enabling communication between a batch job and ecFlow servers are available from the ECMWF git repository.

Logservers

If you decide to store the jobs' standard output and error on a filesystem only mounted on the HPCF (such as SCRATCH or HPCPERM), your ecFlow UI running outside the HPCF (such as on your VDI) will not be able to access the output of those jobs out of the box. In that case you need to start a log server on the Atos HPCF so your client can access those outputs. The log server must run on the hpc-log node, and if you need a crontab to make sure it is running, you should place it on hpc-cron.
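The ecFlow UI locates such a log server through the ECF_LOGHOST and ECF_LOGPORT suite variables. A sketch in def-file syntax (the port number is an arbitrary example and must match the one your log server listens on):

```
# Hypothetical fragment: point the ecFlow UI at a log server on hpc-log
edit ECF_LOGHOST hpc-log   # node where the log server runs
edit ECF_LOGPORT 35000     # example port - must match your log server
```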

Trapping of errors

It is crucial that ecFlow knows when a task has failed so it can accurately report the state of all the tasks in your suites. This is why you need to make sure error trapping is done properly. It is typically done in one of your ecFlow headers, for which you have an example in the ECMWF git repository.

Note
titleMigrating from other platforms

If you are migrating your suite from a previous ECMWF platform, it is quite likely that your headers will need some tweaking for the trapping to work well on our Atos HPCF. This is a minimal example of a header configured with error trapping, which you can use as is or as inspiration to modify your existing headers. The main points are:

  • Make sure you have at least a set -e in your header so any non-zero return code triggers a failure straight away.
  • DO NOT trap signal 15 (SIGTERM) in your ecFlow header, even if it sounds counterintuitive. You should trap at least signal 0 (EXIT), and for robustness we advise trapping all other signals except SIGTERM.
  • Make sure you do not have a "wait" command before the ecflow_client --abort in your trap function.
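The points above can be sketched as follows. This is a standalone illustration, not ECMWF's official header: ecflow_client is stubbed out here so the fragment can run outside a real job, and the exact signal list is an assumption to adapt.

```shell
#!/bin/bash
# Error-trapping sketch for an ecFlow task header.
# In a real job, remove this stub: ecflow_client is the real client on PATH.
ecflow_client() { echo "[stub] ecflow_client $*"; }

set -e    # any non-zero return code triggers a failure straight away

ERROR() {
  set +e                  # stop -e from firing again inside the trap
  trap - 0                # clear the EXIT trap before leaving
  ecflow_client --abort   # report the failure to the server (no "wait" first!)
  exit 1
}

# Trap EXIT (0) and common signals, but NOT 15 (SIGTERM):
# ecFlow itself sends SIGTERM when cancelling a job.
trap ERROR 0 1 2 3 4 5 6 7 8 10 11 12 13 24

ecflow_client --init "$$"   # task registers itself as active
echo "task body runs here"  # real work goes here; any failure triggers ERROR
trap - 0                    # normal completion: clear the EXIT trap first
ecflow_client --complete
```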

Job management

Info
titleSSH key authentication

SSH is used for communication between the ecFlow server VM and the HPCF nodes. Therefore, you need to generate SSH keys and add the public key to ~/.ssh/authorized_keys on the same system. For detailed instructions on how to generate an SSH key pair, please see the HPC2020: How to connect page.
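For illustration only, the two steps are sketched below. A scratch directory is used so the example is side-effect free; in practice the files live under ~/.ssh, and the key type shown is just one common choice.

```shell
# Sketch: create a key pair and authorise it for the same user
keydir=$(mktemp -d)                                        # stand-in for ~/.ssh
ssh-keygen -t ed25519 -N "" -q -f "$keydir/id_ed25519"     # no passphrase, quiet
cat "$keydir/id_ed25519.pub" >> "$keydir/authorized_keys"  # normally ~/.ssh/authorized_keys
chmod 600 "$keydir/authorized_keys"                        # keep permissions strict
```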

...

Of course, you may change the queue to np if you are running bigger parallel jobs, or SCHOST to eventually run on complexes other than aa. This can be specified in a configuration file such as this one:

Bitbucket file
repoSlugecflow_include
branchIdrefs/heads/master
projectKeyUSS
filepathtroika.yml
progLangyml
collapsibletrue
applicationLinka675ea11-b2c4-336c-bfb6-077e786ef5b2
 

To use a custom troika executable or personal configuration file with Troika, ecFlow variables should be defined like this:

...
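As a sketch, this could look like the fragment below. The TROIKA and TROIKA_CONFIG variable names follow ECMWF's default setup; the paths are hypothetical placeholders to replace with your own.

```
# Hypothetical def-file fragment: personal troika executable and configuration
edit TROIKA        /home/<user>/opt/troika/bin/troika
edit TROIKA_CONFIG /home/<user>/opt/troika/etc/troika.yml
```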

  • You will need to use ssh "<complex>-batch" to run the relevant Slurm commands on the appropriate complex.
  • If the job has not started when the kill is issued, your ecFlow server will not be notified that the job has been aborted. You will need to set it to aborted manually, or alternatively use the -b or -f options so that scancel sends the signal once the job has started. By default, scancel does not send signals other than SIGKILL to the batch step. Consequently, if you wish to trap a manual job kill, you should use the "-b" or "-f" option to send a signal the ecFlow job is designed to trap before notifying the ecFlow server that the job was killed:

    No Format
    scancel --signal=TERM -b ${SLURM_JOB_ID}


...