...

  1. The suite should be able to run easily in a different configuration; it is therefore vital to allow for simple changes of configuration. Possible changes include:
    1. Running on a different HPCF system.
    2. Running the main task on fewer or more CPUs, with fewer or more threads (if relevant).
    3. Using a different file system.
    4. Using a different data set, e.g. ECMWF e-suite or own e-suite.
    5. Using a different "model" version.
    6. Using a different ecFlow server (as long as only ecgate is available to you, this is not relevant).
    7. Using a different UID and different queues, e.g. for testing and development purposes.

      The worst that could happen is that you lose everything and need to restart from scratch. Although this is very unlikely, you should keep safe copies of your libraries, executables and other constant data files. To achieve flexibility in the configuration of your suite, we recommend that you have one core suite and define ecFlow variables for all the changes of configuration you want to cater for. See the variable definitions in the suite definition file ~usx/time_critical/sample_suite.def; an illustrative configuration sketch is also given after this list.

  2. It is also important to document clearly the procedures for any changes of configuration that may need to be carried out by, for example, the operators at ECMWF.

  3. All tasks that are part of the critical path, i.e. those that produce the final "products" you will use, have to run in the safest environment:

    1. If possible, your time-critical tasks should run on the HPCF system. If this is impossible and your task runs on ecgate, be aware that this may block your time-critical activity, as currently there is no backup for this system.
    2. Your time-critical tasks should not use the Data Handling System (DHS), which includes ECFS and MARS. The data should be available online on the HPCF, either in a private file system or in the MARS Fields Data Base (FDB). If some data must be stored in MARS or ECFS, do not make time-critical tasks dependent on these archive tasks; keep them independent. See the sample ecFlow definition in ~usx/time_critical/sample_suite.def and the archiving sketch after this list.
    3. Do not use cross-mounted file systems. Always use local file systems.
    4. To exchange data between remote systems, we recommend the use of rsync.

  4. The manual pages should include specific and clear instructions for the operators at ECMWF. An example man page is available from ~usx/time_critical/suite/man_page. Man pages should include the following information:
    1. A description of the task.
    2. The dependencies on other tasks.
    3. What to do in case of failure.
    4. Whom to contact in case of failure, how and when.

  5. The ecFlow "late task" functionality is useful to draw the attention of the ECMWF operators to possible problems in the running of your suite. Set it up for a few key tasks only, with appropriately selected warning thresholds; see the late-attribute sketch after this list. If the functionality is used too widely or if an alarm is triggered every day, it is likely that no one will pay attention to it.

  6. The suite should be self-cleaning: disk management is your responsibility and should be very strict. All data no longer needed should be removed. If the ecFlow job and job output files are to be kept, they should be stored (in ECFS) and then removed from the local disks.

  7. Your suite definition will loop over many dates, e.g. to cover one year. Depending on the relation between your suite and the operational activity at ECMWF, you will trigger (start) your suite in one of the following ways:
    1. If your suite depends on the ECMWF operational suite, you will set up a time-critical job under ECaccess (see option 1) which will simply set a first dummy task in your suite to complete. Alternatively, you could resume the suite, which would be reset to "suspended" after completing a cycle. See sample job in ~usx/time_critical/suite/trigger_suite.cmd.
    2. If your suite has no dependencies on the ECMWF operational activity, we suggest that you define a time in your suite definition file at which the first task in your suite should start.
    3. If your suite has no dependencies on the ECMWF operational activity, but depends on external events, we suggest that you also define a time at which the first task in your suite should start, and that you check for your external dependency in this first task.
    4. Cycling from day to day is usually achieved by defining a time at which the last task in the suite will run. This last task should run sufficiently far in advance of the start of the next run. Setting this time allows you to watch the previous run of the suite until its last task has run. See the sample suite definition in ~usx/time_critical/sample_suite.def and the cycling sketch after this list.
      Note that if one task of your suite remains in aborted status, this will NOT prevent the last task from running at the given time, but your suite will not be able to cycle through to the next run, e.g. for the next day. Different options are available to overcome this problem. If the task that failed is not in the critical path, you can give instructions to the operators to set the aborted task to complete. Another option is to build an administrative task that, before each run, checks that all tasks are set to complete and, if necessary, forces them to complete, so that your suite cycles through to the next run.
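
To illustrate how ecFlow variables can hold all the configurable elements listed in item 1, here is a minimal, hypothetical fragment of a suite definition. It is only a sketch: the variable names (SCHOST, QUEUE, TC_USER, STHOST, DATA_STREAM, MODEL_CYCLE) and their values are invented for illustration and are not taken from ~usx/time_critical/sample_suite.def.

    # Hypothetical configuration block of a time-critical suite definition.
    # Switching HPCF system, queue, file system, data set or model version
    # only requires editing these ecFlow variables.
    suite tc_sample
      edit SCHOST      cca
      edit QUEUE       ns
      edit TC_USER     zid
      edit STHOST      /sc1/tcwork/zid
      edit DATA_STREAM oper
      edit MODEL_CYCLE 47r3
      family main
        # the job scripts pick these up as %SCHOST%, %QUEUE%, %STHOST%, ...
        task get_data
        task run_model
          trigger get_data == complete
      endfamily
    endsuite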
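
The following fragment sketches how archiving can be kept off the critical path, as recommended in item 3: the product generation chain never waits for MARS or ECFS tasks, which are triggered by it but do not gate it. The family and task names are assumptions made for this example only.

    family critical
      # time-critical chain: it does not depend on any archiving task
      task run_model
      task make_products
        trigger run_model == complete
    endfamily
    family archive
      # triggered by the critical chain but never blocking it
      trigger critical/make_products == complete
      task save_to_ecfs
      task archive_to_mars
    endfamily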
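
As an illustration of the "late task" functionality in item 5, a late attribute such as the one below could be attached to a single key task. The thresholds are arbitrary examples and should be tuned so that an alarm remains the exception.

    # flag the task as late if it stays submitted for more than 15 minutes,
    # or if it has not completed within 2 hours of becoming active
    task make_products
      trigger run_model == complete
      late -s +00:15 -c +02:00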
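
Finally, the cycling described in item 7 can be sketched as follows: a repeat attribute loops over the dates, a first dummy task is held suspended until it is resumed or set complete by the external trigger job, and a timed last task closes each cycle so that the suite requeues for the next date. This is only an illustration under those assumptions; it does not reproduce ~usx/time_critical/sample_suite.def, and all names and times are invented.

    suite tc_cycle
      # loop over the dates to be covered, e.g. one year
      repeat date YMD 20240101 20241231
      family main
        # dummy first task, held suspended; the ECaccess time-critical job
        # (cf. trigger_suite.cmd) resumes it or forces it to complete
        task start_dummy
          defstatus suspended
        task run_model
          trigger start_dummy == complete
        task make_products
          trigger run_model == complete
        # timed last task: runs well before the next cycle is due; once it
        # and all other tasks are complete, the repeat increments and the
        # suite requeues for the next date
        task cycle_end
          trigger make_products == complete
          time 23:00
      endfamily
    endsuite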

...

We welcome any comments on this document and the framework for time-critical applications. In particular, please let us know if any of the above general-purpose scripts does not fit your requirements; we will then try to incorporate the changes needed.

 EcFlow access protection