...

Examples of task include files enabling communication between a batch job and the ecFlow server are available from the ECMWF git repository.
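As a rough illustration of what such include files set up (a minimal sketch in bash; the %VAR% placeholders are substituted by ecFlow when the job is generated, and the real includes in the repository are more complete):

    No Format
    # Communication variables the server injects into each job
    export ECF_HOST=%ECF_HOST%    # host running the ecFlow server
    export ECF_PORT=%ECF_PORT%    # port the server listens on
    export ECF_NAME=%ECF_NAME%    # path of this task within the suite
    export ECF_PASS=%ECF_PASS%    # one-time password for this job
    export ECF_TRYNO=%ECF_TRYNO%  # current try number of the task

    ecflow_client --init=$$       # tell the server the job has started
    # ... task work goes here ...
    ecflow_client --complete      # tell the server the task finished OK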

Logservers

If you decide to store the jobs' standard output and error on a filesystem only mounted on the HPCF (such as SCRATCH or HPCPERM), an ecFlow UI running outside the HPCF, such as on your VDI, will not be able to access the output of those jobs out of the box. In that case you will need to start a log server on the Atos HPCF so your client can access those outputs. The log server must run on the hpc-log node, and if you need a crontab to make sure it is running, you should place it on hpc-cron:

  1. Create a file 
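Purely as an illustration of the crontab side, an entry along these lines on hpc-cron would periodically (re)start the log server on hpc-log; the wrapper script name and schedule are hypothetical placeholders:

    No Format
    # Hypothetical crontab entry for hpc-cron: every 15 minutes, ssh to
    # hpc-log and (re)start the log server if it is not already running
    */15 * * * * ssh hpc-log $HOME/bin/start_logserver.sh >/dev/null 2>&1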


Trapping of errors

It is crucial that ecFlow knows when a task has failed so it can accurately report the state of all the tasks in your suites. This is why you need to make sure error trapping is done properly. It is typically done in one of your ecFlow headers, for which an example is available in the ECMWF git repository.
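A minimal sketch of such a header, assuming bash and simplified from the repository example, could look like this:

    No Format
    set -eu                       # abort on any error or unset variable

    ERROR() {
      set +e                      # avoid recursive trapping
      ecflow_client --abort       # report the failure to the server
      trap 0                      # clear the exit trap
      exit 1
    }

    # Trap abnormal exits and the usual termination signals
    trap ERROR 0 1 2 3 4 5 6 7 8 10 12 13 15

A matching tail include would then call ecflow_client --complete and clear the trap with "trap 0", so that a normal exit is not reported as an abort.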

...

  • You will need to use ssh "<complex>-batch" to run the relevant Slurm commands on the appropriate complex (see the combined example after this list).
  • If the job has not started when the kill is issued, your ecFlow server will not be notified that the job has been aborted. You would need to set it to aborted manually, or alternatively use the -b or -f options so that scancel sends the signal once the job has started. By default scancel does not send signals other than SIGKILL to the batch step. Consequently, if you wish to trap a manual job kill you should use the "-b" or "-f" option to send a signal that the ecFlow job is designed to trap, so that it can notify the ecFlow server that the job was killed:

    No Format
    scancel --signal=TERM -b ${SLURM_JOB_ID}
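For instance, assuming a job with ID 123456 that was submitted to the ab complex (both values hypothetical), the kill issued from outside that complex would combine the two points above:

    No Format
    # Hypothetical values: job 123456 running on the "ab" complex.
    # SIGTERM is delivered to the batch step, where the ecFlow error
    # trap catches it and reports the abort to the server.
    ssh ab-batch scancel --signal=TERM -b 123456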


...