
1  General information

1.1  Service options available

Option 2: Member State ecFlow suites monitored by ECMWF:

  • Suitable for more complex applications comprising several tasks with interdependencies between them.
  • The suites will be developed according to the technical guidelines described in this document.
  • To be requested by the TAC representative of the relevant Member State.
  • Monitored by ECMWF.

Option 3: Member State ecFlow suites managed by ECMWF:

  • Further enhancement of Option 2.
  • Requires an ecFlow suite, which has usually been developed under option 2 of the framework.
  • Application developed, tested and maintained by the Member State.
  • It must be possible to test the application using ECMWF pre-operational (e-suite) data.
  • Member State suite handed over to ECMWF.
  • Member State responsible for the migration of the application, e.g. when supercomputer changes.
  • Monitored by ECMWF.
  • ECMWF will provide first-level on-call support, while second-level support would be provided by the Member State.
  • To be requested by the TAC representative of the relevant Member State.

1.2  General characteristics of Member State time-critical work

The main characteristics of Member State time-critical work are:

  1. The work needs to be executed reliably, according to a predetermined and agreed schedule.
  2. It runs regularly, in most cases on a daily basis, but could be executed on a weekly, monthly or ad hoc basis.
  3. It must have an official owner who is responsible for its development and maintenance.

1.3  Systems that can be used at ECMWF to execute time-critical work

Within this Framework, Member State users can use the general purpose server ecgate and the High Performance Computing Facility (HPCF), provided they have access to HPCF resources. In general, users should minimise the number of systems they use. For example, they should use ecgate only for post-processing of data that is not excessively computationally intensive. Similarly, they should use the HPCF only if they need to run computationally intensive work (e.g. a numerical model) and do not need to post-process their output graphically before it is transferred to their Member State. Member State time-critical work may also need to use additional systems outside ECMWF after some processing at ECMWF, for example to run other models using data produced by their work at ECMWF. It is not the purpose of this document to provide guidelines on how to run work which does not make use of ECMWF computing systems.

1.4  How to request this service and information required

Every registered user of ECMWF computing systems is allowed to run work using option 1 of this Framework and no formal request is required. Note that access to our real-time operational data is restricted. Users interested in running this kind of work should refer to the document entitled "Simple time-critical jobs - ECaccess". See http://software.ecmwf.int/wiki/display/USS/Simple+time-critical+jobs. To run work using option 2 or option 3, you will need to submit an official request to the Director of Forecasting Department at ECMWF, signed by the TAC representative of your Member State. Before submitting a request, we advise you to discuss the time-critical work you intend to run at ECMWF with your User Support contact point. Your official request will need to provide the following information:

  1. Description of the main tasks.
  2. The systems needed to run these tasks.
  3. Technical characteristics of the main tasks running on HPCF: number of processors required, memory needed, CPU/elapsed time needed, size of the input and output files, software/library dependencies, system billing units (SBU) needed (if applicable).
  4. A detailed description of the data flow, in particular describing which data are required before processing can start. This description must state any dependency on data that are not produced by any of the ECMWF models and include from where the data can be obtained and their availability time.
  5. A proposed time schedule for the main tasks, stating in particular when it is desirable to have the result of this work available 'to the customers'.

ECMWF will consider your request and reply officially within three months, taking into account, in particular, the resources required for the implementation of your request.

2  Technical guidelines for setting up time-critical work

2.1  Basic information

2.1.1  Option 2

As this work will be monitored by ECMWF staff (User Support during the development phase; the operators, once your work is fully implemented), the only practical option is to implement time-critical work as a suite under ecFlow, ECMWF's monitoring and scheduling software package. Given that SMS will gradually be phased out, we ask new developers of Option 2 activities to use ecFlow; see Bologna - New Data Centre. We will therefore refer only to ecFlow in the remainder of this document. The suite must be developed according to the technical guidelines provided in this document. General documentation, training course material, etc. on ecFlow can be found at ecflow home. No on-call support will be provided by ECMWF staff, but the ECMWF operators can contact the relevant Member State suite support person if this is clearly requested in the suite man pages.

2.1.2  Option 3

In this case, the Member State ecFlow suite will be managed by ECMWF. The suite will usually be based either on a suite previously developed in the framework of option 2 or on a similar suite already used to run ECMWF operational work. The suite will be run using the ECMWF operational userid and will be managed by staff in the Production Section at ECMWF. The suite will generally be developed following similar guidelines to option 2. The main technical differences are that option 3 work will have higher batch scheduling priority than option 2 work, and that the ECPDS system (ECMWF Product Dissemination System) will normally be used to transfer the products of option 3 work. With option 3, your time-critical work will also benefit from first-level on-call support from the ECMWF Production Section staff.

2.2  Before implementing your ecFlow suite

You are advised to discuss the requirements of your work with User Support before you start any implementation. You should test a first rough implementation under your normal Member State userid, using the file systems normally available to you, standard batch job classes/queues, etc., following the technical guidelines given in this document. You should proceed with the final implementation only after your official request has been agreed by ECMWF.

2.3  UID used to run the work

A specific UID will be created to run a particular suite under option 2. This UID will be set up as an "application identifier": such UIDs start with a "z", followed by two or three characters. No password will be assigned and access to the UID will be allowed using a strong authentication token (ActivIdentity token). A responsible person will be nominated for every "application identifier" UID. A limited number of other registered users can also be authorised to access this UID, and a mechanism to allow such access under strict control will be available. The person associated with the UID and the other authorised users are responsible for all changes made to the files owned by the UID. The UID will be registered with a specific "policy" ("timecrit") which allows access to restricted batch classes and restricted file systems.

2.4  General ecFlow suite guidelines

As mentioned earlier, option 2 work must be implemented, unless otherwise previously agreed, by developing an ecFlow suite. The ecFlow environment is not set up by default for users on ecgate or on the HPC systems. Users will have to load the ecFlow environment with a module:

module load ecflow

2.4.1  Port number and ecFlow server

The ecFlow port number for the suite has the format "1000+<UID>", where <UID> is the numeric UID of the userid used to run the work. The script to start the ecFlow server is available on ecgate and is called ecflow_start.sh. A second ecFlow server can be started for backup or development purposes. This second ecFlow server will be started with the '-b' option and will use the port number "500+<UID>". The syntax of the ecflow_start.sh command is:

Usage: /usr/local/apps/ecflow/4.0.6/bin/ecflow_start.sh [-b] [-d ecf_home directory] [-f] [-v] [-h] [-p <num>]
           -b start ECF for backup server or e-suite
           -d <dir> specify the ECF_HOME directory - default /home/us/usl/ecflow_server
           -f forces the ECF to be restarted
           -v verbose mode
           -h print this help page
           -p <num> specify server port number (ECF_PORT number) - default 1000+<UID> - 500+<UID> for backup server
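The default port numbers can thus be derived directly from the numeric UID. A minimal sketch (the UID value below is illustrative; in practice you would use the output of `id -u`):

```shell
# Derive the default ecFlow port numbers from the numeric UID (sketch).
# uid=1234 is illustrative; in practice: uid=$(id -u)
uid=1234
main_port=$((1000 + uid))    # default server port
backup_port=$((500 + uid))   # port used with the -b (backup) option
echo "main: $main_port  backup: $backup_port"
```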

Note that the port number allocation convention does not guarantee that the two numbers associated with your UID are free: a port number may already be used by another user for ecFlow, or by another application. If 'your' default port number is not free, you will have to start the ecFlow server with a port number of your own choice, using the option '-p'. Valid port numbers are between 1024 and 65535; we advise you to choose a high number to reduce the risk of a clash. The ecFlow server will run on ecgate and can be started at system boot time. Please ask User Support at ECMWF if you want us to start your ecFlow server at boot time. A cron job which regularly checks the presence of the ecFlow server process should also be implemented. The above script ecflow_start.sh can also be used to run this check under cron, e.g.:

5,20,35,50 * * * *  $HOME/cronrun.ksh ecflow_start.sh 1> $HOME/ecFlow_start.out 2>&1

with the script $HOME/cronrun.ksh containing:

#!/bin/ksh
export PATH=/usr/local/bin:$PATH
. ~/.profile
. ~/.kshrc
module load ecflow
"$@"

Depending on your activity with ecFlow, the ecFlow log file (~/ecflow_server/ecgb.*.log) will grow steadily. We recommend that you install either a cron job or an administration task in your suite to clean these ecFlow log files. This can be achieved with the ecflow_client command, e.g. ecflow_client --port=<port> --log=clear.
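For example, the log cleaning can be run from cron via the same cronrun.ksh wrapper shown above. This is only a sketch: the schedule, the output file name and the <port> placeholder must be adapted to your own setup.

```
# crontab entry (sketch): clear the ecFlow server log once a week.
# Replace <port> with your actual ecFlow server port number.
0 1 * * 0  $HOME/cronrun.ksh ecflow_client --port=<port> --log=clear 1> $HOME/log_clear.out 2>&1
```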

2.4.2  Access to the job output files

We recommend the use of the simple log server (a Perl script) to access the output files of jobs running on the HPCF. This log server requires another port number, which will have the format "35000+<UID>", where <UID> is the numeric UID of the userid used to run the work. The log server will run on the HPCF and can be started after system boot. The script /usr/local/bin/start_logserver should be used to start the log server on the HPCF. The syntax of the command start_logserver is:

Usage: /usr/local/bin/start_logserver [-d <dir>] [-m <map>] [-h]
            -d <dir> specify the directory name where files will be served from - default is $HOME
            -m <map> give mapping between local directory and directory where ecFlow server runs - default is <dir>:<dir>
            -h print this help page

The mapping can consist of a succession of mappings. Each individual mapping will first give the directory name on the ecFlow server, followed by the directory name on the HPC system, like in the following example:

-m <dir_ecgate>:<dir1_hpc>:<dir_ecgate>:<dir2_hpc>

We recommend that you implement a cron job or define an administration task in your suite to check the presence of the log server process. The above script /usr/local/bin/start_logserver can be used for this purpose. Note that the job output files of jobs running on the HPC are kept on a local spool, which is not visible from the interactive nodes (cca and ccb). In order to see the output files of running jobs, you will therefore need to start the log server on cca-log and ccb-log. See below for more details.
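Such a cron check can be sketched as follows. This assumes that start_logserver can safely be re-run when the log server is already up (please verify this on your system); the schedule, directory and mapping values are illustrative.

```
# crontab entry (sketch): (re)start the log server if it is not running.
# Directory and mapping values are illustrative; adapt to your suite.
10,40 * * * *  /usr/local/bin/start_logserver -d $HOME -m $HOME:$HOME 1> $HOME/logserver_check.out 2>&1
```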

2.4.3  Managing ecFlow tasks

ecFlow will manage your jobs. Three main actions on ecFlow tasks are required: one to submit, one to check and one to kill a task. These three actions are defined through the ecFlow variables ECF_JOB_CMD, ECF_STATUS_CMD and ECF_KILL_CMD, respectively. You can use any script to take these actions on your tasks. We recommend that you use the commands provided by ECMWF with the schedule module, which is available on ecgate. To activate the module, run:

module load schedule

The command 'schedule' can then be used to submit, check or kill a task:

Usage: /usr/local/apps/schedule/1.4/bin/schedule <user> <host> [<requestid>] <jobfile> <joboutput> [kill|status]
 Command used to schedule some tasks to sms or ecflow
        <user>:         %USER%
        <host>:         %REMOTE_HOST%, %SCHOST%, %WSHOST%
        <requestid>:    %ECF_RID% or %SMSRID% (only needed for [kill|status])
        <jobfile>:      %ECF_JOB% or %SMSJOB%
        <joboutput>:    %ECF_JOBOUT% or %SMSJOBOUT%
 By default /usr/local/apps/schedule/1.4/bin/schedule will submit a task.
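In the suite definition, the three ecFlow variables can then point at this command. The fragment below is a sketch derived from the usage text above; the exact argument order should be checked against the schedule man page before use.

```
# ecFlow suite definition fragment (sketch)
edit ECF_JOB_CMD    "schedule %USER% %WSHOST% %ECF_JOB% %ECF_JOBOUT%"
edit ECF_KILL_CMD   "schedule %USER% %WSHOST% %ECF_RID% %ECF_JOB% %ECF_JOBOUT% kill"
edit ECF_STATUS_CMD "schedule %USER% %WSHOST% %ECF_RID% %ECF_JOB% %ECF_JOBOUT% status"
```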