Introduction


This document describes the service that allows users to automatically submit jobs to be run when certain points in the daily ECMWF operational forecast suites have been reached. The main purpose is to ensure that certain data is available before, for example, a MARS request is submitted. This facility runs in the ECaccess environment. It is available either through the ECaccess web interface or through the ECaccess Web Toolkit, which is available on the Atos HPC or can be installed locally. This service is monitored by the operators at ECMWF.

Tip

Before submitting your job to run in the ECaccess environment, you need to set up SSH key-based authentication within ECMWF as explained in HPC2020: How to connect.

Enhanced ECaccess batch service

In 2007, the existing batch service under ECaccess was extended to provide a new facility allowing registered users to run jobs when ECMWF’s operational activity has reached certain points.

Events

A database of events, also known as notifications, has been added to ECaccess. Such events can be added, deleted or modified by individual users. An event has a name and a description. In the context of this service for registered users, the events defined in ECaccess correspond to the points in the operational suite when some data or products are available. For example, we have defined an event called ‘an12h00’ with the description ‘At this stage, the analysis cycle for 12:00UTC is complete.’ On its own, such an event has no link with the ECMWF operational activity, apart perhaps from its name and description.

From the user viewpoint, when submitting a batch job through ECaccess, the user can subscribe this job (repeatedly) to the events available to him or her. From the ECMWF operational viewpoint, we send notifications to these predefined events. When ECaccess receives a notification for an event, it releases the user jobs which have subscribed to the event and submits them to the Slurm batch service on the Atos HPC.

Finally, shortly after a notification to an event has been issued and the jobs subscribing to the event have been submitted, a ‘sweeper’ daemon within ECaccess prepares a new version of the users’ jobs subscribing to the event, ready for submission at the next notification of the event.

User interface

The jobs to be attached to the ECMWF operational suite will have to be submitted through ECaccess. Batch job submission is available from the ECaccess web interface or through the ECaccess Web Toolkit. We will first look at the Web interface, then at the Web Toolkit.

Web interface

When logged in to the ECaccess web interface, e.g. at https://boaccess.ecmwf.int/ or on your local gateway, you can submit a new job from the left margin.

...

On this page you will see all your jobs submitted through ECaccess, both those subscribed to events of the operational suite and other jobs. You will also see the jobs scheduled to run later, as well as those which have already run for the previous notification of an event. Please note that the name of the job is also shown, when available. You can delete jobs from this page. See Changes in job or suppression of jobs, below, for more details.

ECaccess Web Toolkit

The same functionality as described above for the web interface is available through the ECaccess Web Toolkit. These command-line tools are available on the Atos HPC systems at ECMWF. They may also be available on your local systems. Please refer to the ECaccess documentation for further details on the ECaccess Web Toolkit - The full featured client.

...

uid@ac6-200{uid}:1 --> ecaccess-event-list
1247       ai_lwda_00           At this stage, analysis input observations are archived in MARS.
1249       ai_lwda_12           At this stage, analysis input observations are archived in MARS.
1248       ai_oper_00           At this stage, analysis input observations are archived in MARS.
1250       ai_oper_12           At this stage, analysis input observations are archived in MARS.
167        an00h000             At this stage, the analysis at 00UTC is complete.
201        an06h000             At this stage, the deterministic analysis at 06UTC is complete.
168        an12h000             At this stage, the analysis at 12UTC is complete.
202        an18h000             At this stage, the deterministic analysis at 18UTC is complete.
2724       bc_00                At this stage, the boundary condition forecast at 00UTC is complete.
2725       bc_06                At this stage, the boundary condition forecast at 06UTC is complete.
2726       bc_12                At this stage, the boundary condition forecast at 12UTC is complete.
2727       bc_18                At this stage, the boundary condition forecast at 18UTC is complete.
...

uid@ac6-200{uid}:763 --> ecaccess-event-list an00h000
     Event-id: 167
         Name: an00h000
       Public: yes
        Owner: emos
      Comment: At this stage, the analysis at 00UTC is complete.

...

Note the status "STDBY" for the job. This job will remain in standby mode up until the ECMWF operational activity has produced the analysis for the 00Z run of the HRES forecast. Note that the jobs submitted through ECaccess in standby mode will only be visible through ECaccess. They will not be visible using the usual Slurm batch service commands on the Atos HPC.

Notes on using the service

Job status

One advantage of the new service for time-critical job submissions under ECaccess is that the ECMWF operators monitor the jobs you submit via this system. You can also request that your jobs be rerun automatically on failure. However, ECaccess can only show the correct status of your job, or rerun it, if the Slurm batch service on the Atos HPC has correctly notified it of the job's exact status. It is therefore your responsibility to correctly notify the batch service about errors occurring in your job. By default, an error in your job will not be reported to the batch service; the execution of your job will continue and it will finish as if it had completed successfully. One way to stop the execution of your job as soon as there is an error is to use the "set -e" command in ksh or bash scripts. With this command, your job will stop and exit abnormally as soon as an error occurs. If you want finer control over the errors, you can include some specific tests in your jobs and, for those important tests, exit with a non-zero return code, e.g.:

...

Please note that you can submit your job to ECaccess without setting up what is suggested above. Your jobs will run normally but, without this job control, the ECMWF operators will not notice any errors with your jobs and ECaccess will fail to resubmit your jobs, even if you requested some retries.
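
The job control described above can be illustrated with a minimal sketch. It is not an official ECMWF template: the job body, the file names and the "important test" (the input file must not be empty) are hypothetical and chosen only for illustration. The sketch writes a small job script that uses "set -e" and an explicit test, then runs it three times to show the return code that the batch service would receive in each case.

```shell
#!/bin/bash
# Sketch only: the job body and all file names below are hypothetical.

# Write a small stand-in job script to a temporary file.
job=$(mktemp)
cat > "$job" <<'EOF'
#!/bin/bash
set -e                          # stop at the first command that fails

target="$1"
cat "$target" > /dev/null       # fails if the file does not exist; set -e ends the job here

# Explicit test for an important condition: the input must not be empty.
if [ ! -s "$target" ]; then
    echo "ERROR: $target is empty" >&2
    exit 1                      # non-zero return code, so the failure is visible
fi
echo "processing $target"
EOF

data=$(mktemp)                  # created empty by mktemp
missing="$data.does-not-exist"  # a path that does not exist

# Case 1: missing input, the job is stopped by set -e.
status_missing=0
bash "$job" "$missing" > /dev/null 2>&1 || status_missing=$?

# Case 2: empty input, the job is stopped by the explicit test.
status_empty=0
bash "$job" "$data" > /dev/null 2>&1 || status_empty=$?

# Case 3: valid input, the job completes normally.
echo "some data" > "$data"
status_ok=0
bash "$job" "$data" > /dev/null 2>&1 || status_ok=$?

rm -f "$job" "$data"
echo "missing: $status_missing, empty: $status_empty, valid: $status_ok"
```

In the first two cases the job ends with a non-zero return code, which is what allows the failure to be reported and a retry to be triggered; only the third case ends with return code 0.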

Monitoring by the operators

ECMWF operators have a specific interface to monitor user jobs submitted through this system. This allows them to identify various problems, e.g. a general problem with one system at ECMWF or a failure to send a notification to an event, leaving all user jobs waiting to be run. When such problems occur, the operators will try to take corrective action. In case several jobs have failed, apparently linked to a general problem, our operators will be able to restart these jobs, after the problem has been fixed. Our operators will also usually have access to the user's job and job output files, as well as to some instructions you may have provided via the 'man' page, when submitting the job through ECaccess. If no instructions are given, our operators will normally ignore any specific failure of your jobs.

...

will retry your job on failure 3 times with 15 minutes (900 seconds) between retries. This will sometimes allow the job to complete successfully if the initial failure was caused by a temporary issue.

Environment variables

Before the job is submitted, ECaccess sets the following environment variables and passes them to your job, so they can be used within it:

  • MSJ_BASETIME  - forecast base time, e.g. "00" or "12"
  • MSJ_STEP      - forecast time step, e.g. "144"
  • MSJ_YEAR      - year of the run, e.g. "2023"
  • MSJ_MONTH     - month of the run, e.g. "02"
  • MSJ_DAY       - day of the run, e.g. "22"
  • MSJ_EXPVER    - version number of data archived in MARS (if relevant), e.g. "0001"
  • MSJ_EVENT     - event name, e.g. "fc00h144"
  • MSJ_IFS_CYCLE - IFS cycle used, e.g. "49r1"
  • MSJ_MEMBERS   - number of ensemble members, where applicable

These environment variables will help you to access the operational data, e.g. to build the correct date to extract the data from MARS.
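
As a minimal sketch of how a job might use these variables, the fragment below assembles a date in YYYYMMDD form and a few request keywords from them. The fallback values are an assumption made only so that the sketch can be run outside ECaccess; within a real job the variables are already set. The variable names mars_date and request_fragment are hypothetical, and the fragment is deliberately incomplete.

```shell
#!/bin/bash
# Fallback values (taken from the examples in the list above) are used only
# when the sketch runs outside ECaccess, where the MSJ_* variables are unset.
: "${MSJ_YEAR:=2023}" "${MSJ_MONTH:=02}" "${MSJ_DAY:=22}"
: "${MSJ_BASETIME:=00}" "${MSJ_STEP:=144}" "${MSJ_EXPVER:=0001}"

# Assemble the date of the run in YYYYMMDD form from its three components.
mars_date="${MSJ_YEAR}${MSJ_MONTH}${MSJ_DAY}"

# Illustrative fragment only: a real request needs further keywords
# (class, stream, type, param, target, ...) chosen for the data wanted.
request_fragment="date=${mars_date},
time=${MSJ_BASETIME},
step=${MSJ_STEP},
expver=${MSJ_EXPVER}"

echo "$request_fragment"
```

Taking the date from the variables, rather than from the clock at the time the job starts, keeps the request tied to the run that triggered the event, even if the job is delayed or rerun later.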

Changes in job or suppression of jobs

If you have to make changes to any of your ECaccess Time Critical jobs, you will have to cancel the existing job in standby mode and submit the new version of the job. The job name shown with "ecaccess-job-list" or through the web interface should help you to identify the correct job to delete. Similarly, to remove a job from the system, you will have to remove the job in standby (STDBY) mode.

Job examples

An example of a time-critical batch job for submission on the Atos HPC, which creates various types of ENS meteogram plots, is provided by realtime_metgram.sh.

Help and Support


For initial help on implementing your jobs in the system or for problems with the "operational" runs of your jobs, please create a Software or Computing ticket in the ECMWF Support Portal, providing the Unix user name used and the event your job is subscribed to.

...