...

Within this Framework, Member State users can use the Atos HPCF and ECGATE (ECS) services (HPCF resources are needed to use the HPCF/HPC service). In general, users should minimise the number of systems they use. The HPC service is recommended for all computationally intensive time-critical Option 2 and 3 activity (e.g. running a numerical model). The ECGATE (ECS) service should be used only for work that is not excessively computationally intensive, such as post-processing output graphically before it is transferred to the user's Member State. Member State time-critical work may also need to use additional systems outside ECMWF after the processing at ECMWF has been performed, for example to run other models using data produced by the work at ECMWF. It is not the purpose of this document to provide guidelines on how to run work which does not make use of ECMWF computing systems.

...

Technical guidelines for setting up time-critical work

Basic information

...

Option 2

As this work will be monitored by ECMWF staff (User Services during the development phase; the ECMWF 24x7 Shift Staff once your work is fully implemented), the only practical option is to implement time-critical work as a suite under ecFlow, ECMWF's monitoring and scheduling software package. The suite must be developed according to the technical guidelines provided in this document. General documentation, training course material, etc., on ecFlow can be found at ecFlow home. No on-call support will be provided by ECMWF staff, but the ECMWF Shift Staff can contact the relevant Member State suite support person if this is clearly requested in the suite 'manual' pages.

...

Note

The ecFlow environment is not set up by default for users on the Atos HPCF and ECGATE systems. Users will have to load the ecFlow environment using:

$ module load ecflow

 

ECMWF will create a "ready-to-go" ecFlow server running on an independent Virtual Machine outside the HPCF. See also Using ecFlow for further information about using ecFlow on the Atos HPCF and ECGATE systems.
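As a sketch, an ecFlow client session is pointed at such a server through the standard ECF_HOST and ECF_PORT environment variables; the host name and port below are placeholders, not real server details, which ECMWF provides when the server is created:

```shell
#!/bin/bash
# Placeholder values: the real host name and port come from ECMWF.
export ECF_HOST=ecflow-gen-myuser-001   # hypothetical VM hostname
export ECF_PORT=3141                    # hypothetical port

# With the ecflow module loaded, the server can then be checked with:
# ecflow_client --ping
echo "ecFlow client will talk to $ECF_HOST:$ECF_PORT"
```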

...

  1. The suite should be able to run easily in a different configuration, so it is vital to allow for easy changes of configuration. Possible changes could include:
    1. Running on a different HPCF system.
    2. Running the main task on fewer or more CPUs, with fewer or more threads (if relevant).
    3. Using a different file system.
    4. Using a different data set, e.g. ECMWF e-suite or own e-suite.
    5. Using a different "model" version.
    6. Using a different ecFlow server.
    7. Using a different User ID and different queues, e.g. for testing and development purposes.

      The worst that could happen is that you lose everything and need to restart from scratch. Although this is very unlikely, you should keep safe copies of your libraries, executables and other constant data files.

      Tip

      To achieve flexibility in the configuration of your suite, we recommend that you have one core suite and define ecFlow variables for all those changes of configuration you want to cater for. See variable definitions in suite definition file ~usx/time_critical/sample_suite.def.
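      For illustration, a fragment of a suite definition that factors the changes listed above into ecFlow variables; the variable names and values here are hypothetical, not those of sample_suite.def:

```
suite ms_time_critical
  edit SCHOST   "hpc"        # (a) HPCF complex to run on
  edit TASKS    "128"        # (b) CPUs/threads for the main task
  edit STHOST   "/ec/ws1"    # (c) file system
  edit DATASET  "od"         # (d) data set, e.g. operations or e-suite
  edit VERSION  "cy48r1"     # (e) "model" version
  edit ECF_HOST "ecflow-vm"  # (f) ecFlow server
  edit USER     "zdl"        # (g) User ID
  edit QUEUE    "nf"         # (g) batch queue
```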



  2. It is also important to document clearly the procedures for any changes to the configuration, if these may need to be carried out by, for example, ECMWF's 24x7 Shift Staff.

  3. All tasks that are part of the critical path, i.e. that will produce the final "products" to be used by you, have to run in the safest environment:

    1. All time-critical tasks should run on the HPCF system.
    2. Your time-critical tasks should not use the Data Handling System (DHS), including ECFS and MARS. The data should be available online, on the HPCF, either in a private file system or accessed from the Fields Data Base (FDB) with MARS. If some data must be stored in MARS or ECFS, do not make time-critical tasks dependent on these archive tasks, but keep them independent. See the sample ecFlow definition in ~usx/time_critical/sample_suite.def.
    3. Do not use cross-mounted file systems. Always use file systems local to the HPC.
    4. To exchange data between remote systems, we recommend the use of rsync.
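      As a sketch, archiving can be kept off the critical path by placing it in a separate family that is triggered by, but never triggers, the time-critical tasks; task and family names here are illustrative, not those of sample_suite.def:

```
family critical
  task get_input
  task run_model
    trigger get_input == complete
  task disseminate
    trigger run_model == complete
endfamily
family archive                  # failures here do not block the critical path
  task save_to_ecfs
    trigger ../critical/run_model == complete
endfamily
```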

  4. The suite manual ('man') pages should include specific and clear instructions for the ECMWF Shift Staff. An example man page is available from ~usx/time_critical/suite/man_page. Manual pages should include the following information:
    1. A description of the task.
    2. The dependencies on other tasks.
    3. What to do in case of failure.
    4. Whom to contact in case of failure, how and when.

      Tip

      If you require an email to be sent to contacts whenever a suite task aborts, this can be included in the task ERROR function so that the email is sent automatically.
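      A minimal sketch of such an ERROR handler, as it might appear in an ecFlow job include file; the contact address is a placeholder, and the real mail and ecflow_client calls are left as comments so the example is self-contained:

```shell
#!/bin/bash
MAILTO="suite.support@example.org"   # placeholder contact address

ERROR() {
  echo "ALERT to $MAILTO: task ${1:-unknown} aborted"
  # mail -s "ecFlow task $1 aborted" "$MAILTO" < /dev/null
  # ecflow_client --abort             # report the abort to the ecFlow server
  exit 1
}

# A job would normally install the handler with:  trap 'ERROR <task_name>' ERR
# Here it is invoked in a subshell to show the effect:
( ERROR demo_task ) || true
```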



  5. The ecFlow functionality of "late tasks" is useful to draw the ECMWF Shift Staff's attention to possible problems in the running of your suite.

    Tip

    Use the "late tasks" functionality sparingly.  Set it for a few key tasks only, with appropriately selected warning thresholds. If the functionality is used too frequently or if an alarm is triggered every day, it is likely that no one will pay attention to it.
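    A sketch of a late attribute on a single key task; the thresholds are illustrative and should be tuned to your suite's normal timings:

```
task run_model
  late -s +00:15 -c +02:00   # flag the task if not started within 15 minutes
                             # of submission, or not complete within 2 hours
```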

     

  6. The suite should be self-cleaning. Disk management should be very strict and is your responsibility. All data no longer needed should be removed. The ecFlow jobs and job output files, if kept, should be stored (in ECFS), then removed from local disks.
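      A self-cleaning step might look like the following sketch; the paths and retention period are illustrative, and the ECFS archiving command is shown only as a comment since it is specific to ECMWF systems:

```shell
#!/bin/bash
SUITE_WORK=${SUITE_WORK:-/tmp/tc_suite_demo}   # illustrative work directory
KEEP_DAYS=2                                    # illustrative retention period

mkdir -p "$SUITE_WORK"
# Archive job output to ECFS before deleting it (ECMWF-specific, not run here):
# ecp "$SUITE_WORK"/output/*.out ec:/myuser/tc_suite/logs/

# Remove anything older than the retention period:
find "$SUITE_WORK" -type f -mtime +"$KEEP_DAYS" -delete
```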

  7. Your suite definition will loop over many dates, e.g. to cover one year. Depending on the relation between your suite and the operational activity at ECMWF, you will trigger (start) your suite in one of the following ways:
    1. If your suite depends on the ECMWF operational suite, you will set up a time-critical job under ECaccess (see Option 1) which will simply set a first dummy task in your suite to complete. Alternatively, you could resume the suite, which would be reset to "suspended" after completing a cycle. See the sample job in ~usx/time_critical/suite/trigger_suite.cmd.
    2. If your suite has no dependencies on the ECMWF operational activity, we suggest you define in your suite definition file a time at which to start the first task in your suite.
    3. If your suite has no dependencies on the ECMWF operational activity, but has dependencies on external events, we suggest that you also define a time at which to start the first task in your suite, and that you check for your external dependency in this first task.
    4. The cycling from day to day will usually happen by defining a time at which the last task in the suite will run. This last task should run sufficiently long before the next run starts. Setting up this time will allow you to watch the previous run of the suite up until the last task has run. See the sample suite definition in ~usx/time_critical/sample_suite.def.
      Note that if one task of your suite remains in aborted status, this will NOT prevent the last task from running at the given time, but your suite will not be able to cycle through to the next run, e.g. for the next day. Different options are available to overcome this problem. If the task that failed is not in the critical path, you can give instructions to the ECMWF Shift Staff to set the aborted task to complete. Another option is to build an administrative task that checks before each run that all tasks are set to complete and, if needed, forces your suite to cycle through to the next run.
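      The date loop and the time-triggered last task can be sketched in the suite definition as follows; the dates, times, and names are illustrative, not those of sample_suite.def:

```
suite ms_time_critical
  repeat date YMD 20240101 20241231   # loop over one year of dates
  family main
    # ... time-critical tasks for date YMD ...
  endfamily
  task cycle                          # last task: moves the suite to the next date
    trigger main == complete
    time 05:30                        # runs well before the next run starts
endsuite
```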

...

  1. Your work requires input data which is produced by one of the ECMWF models. In such a case it is possible to set up a specific dissemination stream which will send the required data to the HPCF. ECPDS allows for "local" dissemination to a specific User ID (the User ID used to run the time-critical work) so that only this recipient User ID can see the data; it is otherwise similar to the standard dissemination to remote sites. This "local" dissemination option is the recommended option. The recipient User ID is responsible for the regular clean-up of the received data.

    If produced by ECMWF, your required data will also be available in the FDB and will remain online for a limited amount of time (which varies depending on the model). You can access these data using the usual "mars" command. If your suite requires access to data which may no longer be contained in the FDB, then your suite needs to access these data before they are removed from the FDB and temporarily store them in one of your disk storage areas.

    Warning

    Under no circumstances should any of your time-critical suite tasks depend on data only available from the Data Handling System (MARS archive or ECFS). Beware that using a keyword value of "ALL" in any mars request will automatically redirect it to the MARS archive (DHS). Note also that we recommend you do not use abbreviations for a MARS verb, parameter or value in your mars requests: if too short, these abbreviations may become ambiguous when a new verb, parameter or value name is added to the mars language.
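    For illustration, a mars request written with unabbreviated verbs, parameters and values; the field selection and target name are illustrative, and the exact keyword values should be checked against the MARS documentation:

```
retrieve,
  class    = od,
  stream   = oper,
  expver   = 0001,
  type     = forecast,
  levtype  = surface,
  param    = 2t,
  date     = -1,
  time     = 12,
  step     = 24,
  target   = "fc_2t.grib"
```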


  2. Your work requires input data which is available at ECMWF but not produced by an ECMWF model, for example observations normally available on the GTS (e.g. if you are interested in running data assimilation work at ECMWF). In such a case you can obtain the required observations from /ec/vol/msbackup/ on the Atos HPCF, where they are stored by a regular extraction task which runs as part of the ECMWF operational suite. For any other data you may need for your time-critical activity which is available at ECMWF, please ask your User Services technical contact.

  3. Your work requires input data which is neither produced by any of the ECMWF models nor available at ECMWF. You will then be responsible for setting up the required "acquisition" tasks and for establishing their level of time criticality. For example, your suite may need some additional observations which improve the quality of your assimilation, but your work can also run without them in case there is a delay or problem in their arrival at ECMWF. Please see the section "Data transfers" for advice on how to transfer incoming data.

...

Note

Note that, by default, ectrans transfers are asynchronous: the successful completion of the ectrans command does not mean your file has been transferred successfully. You may want to use the option "-put" to request synchronous transfers.
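As a sketch, a synchronous transfer might be requested as follows; the association name is a placeholder, and the exact option names should be checked against the ectrans documentation:

```
ectrans -remote my_association -source forecast_products.grib -put
```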

...

Incoming data - sending data

...

to ECMWF

We recommend using ectrans (with option -get) to upload data from a remote site to ECMWF. Other options, including the use of ECPDS in acquisition mode, may be considered in specific situations; please discuss your requirements with your User Services technical contact.

...

The User IDs authorised to run Option 2 work have access to all Atos HPCF complexes, and you are advised to implement your suite so that it can run on any of the clusters in case a specific cluster is unavailable for an extended period of time.

...