Technical guidelines for setting up time-critical work

Basic information

...

Option 2

As this work will be monitored by ECMWF staff (User Services during the development phase; the ECMWF 24x7 Shift Staff once your work is fully implemented), the only practical option is to implement time-critical work as a suite under ecFlow, ECMWF's monitoring and scheduling software package. The suite must be developed according to the technical guidelines provided in this document. General documentation, training course material, etc. on ecFlow can be found at ecFlow home. ECMWF staff will not provide on-call support, but the ECMWF Shift Staff can contact the relevant Member State suite support person if this is clearly requested in the suite 'manual' pages.

...

Note

The ecFlow environment is not set up by default for users on the Atos HPCF and ECGATE systems. Users have to load the ecFlow environment with:

$ module load ecflow

 

ECMWF will create a "ready-to-go" ecFlow server running on an independent Virtual Machine outside the HPCF. See also Using ecFlow for further information about using ecFlow on the Atos HPCF and ECGATE systems.

...

  1. The suite should easily run in a different configuration. It is therefore vital to allow for easy changes of configuration. Possible changes could include:
    1. Running on a different HPCF system.
    2. Running the main task on fewer or more CPUs, with fewer or more threads (if relevant).
    3. Using a different file system.
    4. Using a different data set, e.g. ECMWF e-suite or own e-suite.
    5. Using a different "model" version.
    6. Using a different ecFlow server.
    7. Using a different User ID and different queues, e.g. for testing and development purposes.

      The worst that could happen is that you lose everything and need to restart from scratch. Although this is very unlikely, you should keep safe copies of your libraries, executables and other constant data files.

      Tip

      To achieve flexibility in the configuration of your suite, we recommend that you have one core suite and define ecFlow variables for all the configuration changes you want to cater for. See the variable definitions in the suite definition file ~usx/time_critical/sample_suite.def.
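As a minimal sketch of this idea, the shell function below writes a small suite definition in which every configurable setting is an ecFlow variable (an `edit` line). The suite layout, the variable names (SCHOST, QUEUE, NPES, THREADS, DATA_ROOT, MODEL_VERSION) and their default values are hypothetical placeholders and are not taken from ~usx/time_critical/sample_suite.def; only the `suite`, `edit`, `family`, `task` and `trigger` keywords are standard ecFlow definition syntax.

```shell
# Sketch: keep all configuration in ecFlow variables ("edit" lines) so that
# one core suite can run in different configurations.
# All names and default values below are hypothetical placeholders.
write_suite_def() {
    # $1: output file. Each setting can be overridden from the environment.
    cat > "$1" <<EOF
suite ${SUITE_NAME:-my_tc_suite}
  # Configuration: change these values, not the task scripts.
  edit SCHOST "${SCHOST:-hpc}"
  edit QUEUE "${QUEUE:-my_queue}"
  edit NPES "${NPES:-128}"
  edit THREADS "${THREADS:-1}"
  edit DATA_ROOT "${DATA_ROOT:-/path/to/data}"
  edit MODEL_VERSION "${MODEL_VERSION:-v1}"
  family main
    task get_data
    task run_model
      trigger get_data == complete
  endfamily
endsuite
EOF
}
```

For example, `SCHOST=hpc2 NPES=256 write_suite_def test_suite.def` would produce a test configuration without touching the task scripts, which refer to the values as %SCHOST%, %NPES% and so on.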



  2. It is also important to document clearly the procedures for any configuration changes that may need to be carried out by, for example, ECMWF's 24x7 Shift Staff.

  3. All tasks that are part of the critical path, i.e. that will produce the final "products" to be used by you, have to run in the safest environment:

    1. All time-critical tasks should run on the HPCF system.
    2. Your time-critical tasks should not use the Data Handling System (DHS), including ECFS and the MARS archive. The data should be available online on the HPCF, either in a private file system or accessed from the Fields Data Base (FDB) with MARS. If some data must be stored in MARS or ECFS, keep the archive tasks independent, so that no time-critical task depends on them. See the sample ecFlow definition in ~usx/time_critical/sample_suite.def.
    3. Do not use cross-mounted file systems. Always use file systems local to the HPCF.
    4. To exchange data between remote systems, we recommend the use of rsync.

  4. The suite manual ('man') pages should include specific and clear instructions for ECMWF Shift Staff. An example man page is available from ~usx/time_critical/suite/man_page. Manual pages should include the following information:
    1. A description of the task.
    2. The dependencies on other tasks.
    3. What to do in case of failure.
    4. Whom to contact in case of failure, how and when.

      Tip

      If you require an email to be sent to contacts whenever a suite task aborts, this can be included in the task ERROR function so that the email is sent automatically.
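The following is a hedged sketch of such an ERROR function, modelled on the trap-based job header commonly used in ecFlow task scripts. SUITE_CONTACT is a hypothetical variable holding the contact address, ECF_NAME and ECF_TRYNO are assumed to be exported by your job header, and a `mail` command accepting `-s` is assumed to be available on the host.

```shell
# Sketch: e-mail the suite contacts from the task ERROR function.
# SUITE_CONTACT is a hypothetical variable; ECF_NAME and ECF_TRYNO are
# assumed to be exported by the job header.

notify_abort() {
    # Send a short message naming the aborted task.
    # Do nothing if no contact address has been set.
    local contact=${SUITE_CONTACT:-}
    [ -n "$contact" ] || return 0
    printf 'Task %s aborted on %s (try %s).\n' \
        "${ECF_NAME:-unknown}" "$(hostname)" "${ECF_TRYNO:-?}" |
        mail -s "ecFlow task aborted: ${ECF_NAME:-unknown}" "$contact" || true
}

ERROR() {
    set +e                        # keep going even if a command below fails
    wait                          # wait for background processes to finish
    notify_abort                  # e-mail the contacts
    ecflow_client --abort=trap    # report the abort to the ecFlow server
    trap 0                        # remove the exit trap so ERROR is not re-entered
    exit 0
}

# Typical wiring in the job header (commented out so that sourcing this
# sketch has no side effects):
# trap ERROR 0
# trap '{ echo "Killed by a signal"; ERROR; }' 1 2 3 15
```

Keeping the notification in its own function means a failure to send the email cannot stop the abort from being reported to the server.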



  5. The ecFlow functionality of "late tasks" is useful to draw the ECMWF Shift Staff's attention to possible problems in the running of your suite.

    Tip

    Use the "late tasks" functionality sparingly. Set it for a few key tasks only, with appropriately selected warning thresholds. If the functionality is used too frequently, or if an alarm is triggered every day, it is likely that no one will pay attention to it.
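As an illustration only, the function below writes a definition fragment that sets the ecFlow `late` attribute on a single key task. The suite and task names and the three thresholds are made-up values; choose thresholds that match the normal timing of your own suite.

```shell
# Sketch: a "late" attribute on one key task. Names and thresholds are
# hypothetical examples, not recommended values.
write_late_example() {
    # $1: output file for the definition fragment.
    cat > "$1" <<'EOF'
suite my_tc_suite
  family main
    task run_model
      # Flag the task as late if it stays submitted for more than 15 minutes,
      # has not become active by 07:00, or has not completed within 2 hours
      # of becoming active.
      late -s +00:15 -a 07:00 -c +02:00
  endfamily
endsuite
EOF
}
```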

     

  6. The suite should be self-cleaning. Disk management should be very strict and is your responsibility. All data no longer needed should be removed. The ecFlow jobs and job output files, if kept, should be archived (in ECFS) and then removed from local disks.
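A clean-up task could be built around a small helper like the one sketched below. The function name and the retention period are hypothetical, GNU `find` is assumed, and any archiving to ECFS would have to happen before this step.

```shell
# Sketch: remove old files from a working directory. The function name and
# the retention policy are hypothetical; GNU find is assumed.
clean_old_files() {
    # $1: directory to clean, $2: age threshold in days.
    local dir=$1 days=$2
    # Refuse to run on an empty, missing or root path.
    if [ -z "$dir" ] || [ "$dir" = "/" ] || [ ! -d "$dir" ]; then
        echo "clean_old_files: refusing to clean '$dir'" >&2
        return 1
    fi
    # Delete regular files older than $days days (as counted by find -mtime),
    # then prune directories that have become empty.
    find "$dir" -type f -mtime +"$days" -delete
    find "$dir" -mindepth 1 -type d -empty -delete
}
```

A call such as `clean_old_files "$WORK_DIR" 5` at the end of each cycle would keep roughly the last five days of output, where WORK_DIR stands for whatever working directory your suite uses.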

  7. Your suite definition will loop over many dates, e.g. to cover one year. Depending on the relation between your suite and the operational activity at ECMWF, you will trigger (start) your suite in one of the following ways:
    1. If your suite depends on the ECMWF operational suite, you will set up a time-critical job under ECaccess (see option 1) which will simply set a first dummy task in your suite to complete. Alternatively, you could resume the suite, which would be reset to "suspended" after completing a cycle. See the sample job in ~usx/time_critical/suite/trigger_suite.cmd.
    2. If your suite has no dependencies on the ECMWF operational activity, we suggest that you define a time in your suite definition file at which the first task in your suite starts.
    3. If your suite has no dependencies on the ECMWF operational activity, but has dependencies on external events, we suggest that you also define a time when to start the first task in your suite, and that you check for your external dependency in this first task.
    4. The cycling from day to day will usually happen by defining a time when the last task in the suite will run. This last task should run well before the next run starts. Setting this time allows you to watch the previous run of the suite until the last task has run. See the sample suite definition in ~usx/time_critical/sample_suite.def.
      Note that if one task of your suite remains in aborted status, this will NOT prevent the last task from running at the given time, but your suite will not be able to cycle through to the next run, e.g. for the next day. Several options are available to overcome this problem. If the task that failed is not in the critical path, you can give instructions to the ECMWF Shift Staff to set the aborted task to complete. Another option is to build an administrative task that checks before each run that all tasks are set to complete and, where they are not, forces your suite to cycle through to the next run.
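For a suite that depends on the ECMWF operational suite, the two triggering options described in the first case above (setting a dummy task to complete, or resuming a suspended suite) could be sketched as follows. This is not the content of ~usx/time_critical/suite/trigger_suite.cmd; the function name and node paths are hypothetical, and it assumes that `ecflow_client` is available (after `module load ecflow`) and that ECF_HOST and ECF_PORT point at your ecFlow server.

```shell
# Sketch of a trigger job body. Assumes ecflow_client is on the PATH and that
# ECF_HOST and ECF_PORT identify your ecFlow server. The function name and
# node paths are hypothetical.
trigger_suite() {
    # $1: "complete" to set a dummy first task to complete,
    #     "resume"   to resume a suite that was left suspended.
    # $2: path of the node to act on.
    local mode=$1 node=$2
    case "$mode" in
        complete) ecflow_client --force=complete "$node" ;;
        resume)   ecflow_client --resume="$node" ;;
        *)
            echo "trigger_suite: unknown mode '$mode'" >&2
            return 2
            ;;
    esac
}
```

The job would then call, for example, `trigger_suite complete /my_tc_suite/start` or `trigger_suite resume /my_tc_suite`, depending on how the suite is set up to wait between cycles.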

...