
Under the Framework for time-critical applications, Member States can run ecFlow suites monitored by ECMWF. Known as Option 2 within that framework, these users enjoy a special technical setup that maximises robustness and high availability, similar to ECMWF's own operational production. When moving from a standard user account to a time-critical one (typically starting with a "z" followed by two or three characters), there are a number of things you must be aware of:

Special filesystems

WS1 availability

WS1 is not available yet. Meanwhile, please use ws2 only.

Time-critical Option 2 users, or zids, have a special set of filesystems, different from those of regular users. They are served from different storage servers in different computing halls and are not kept in sync automatically. It is the user's responsibility to ensure the required files and directory structures are present on both sides and to synchronise them if and when needed (a sketch of how this could be done is given further below). On each Storage Host, zids will have:

File System: HOME
  Suitable for: permanent files, e.g. profile, utilities, sources, libraries, etc.
  Technology: Lustre
  Features:
    • No backup
    • No snapshots
    • No automatic deletion
    • Unthrottled I/O bandwidth
  Quota: 100 GB

File System: TCWORK
  Suitable for: permanent large files. Main storage for your jobs' and experiments' input and output files.
  Technology: Lustre
  Features:
    • No backup
    • No snapshots
    • No automatic deletion
    • Unthrottled I/O bandwidth
  Quota: 50 TB

File System: SCRATCHDIR
  Suitable for: big temporary data for an individual session or job; not as fast as TMPDIR but with higher capacity. Files are accessible from the whole cluster.
  Technology: Lustre
  Features:
    • Deleted at the end of the session or job
    • Created per session/job
  Quota: part of the TCWORK quota

File System: TMPDIR
  Suitable for: fast temporary data for an individual session or job, small files only. Local to every node.
  Technology: SSD on shared nodes (*f QoSs); RAM on exclusive compute nodes (*p QoSs)
  Features:
    • Deleted at the end of the session or job
    • Created per session/job
  Quota: on SSD, 3 GB per session/job by default, customisable up to 40 GB with --gres=ssdtmp:<size>G; in RAM, no limit (maximum memory of the node)

All of these filesystems can be referenced through the corresponding environment variables ($HOME, $TCWORK, $SCRATCHDIR, $TMPDIR), which are defined automatically for each session or job. Note that there is no PERM or SCRATCH, and the corresponding environment variables will not be defined.

HOME, TCWORK and SCRATCHDIR are all based on the Lustre parallel filesystem for maximum reliability. They will not be accessible from outside the HPCF, including from VDI instances or the VMs running the ecFlow servers.
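As an illustration of the manual synchronisation mentioned above, one possible approach once both storage sets are available is to mirror a directory from one side to the other with rsync. This is only a sketch: it assumes both /ec/ws1 and /ec/ws2 paths are visible from the node where you run it and follow the pattern shown in the login message below; the "utils" subdirectory is purely illustrative.

# Sketch: mirror a HOME subdirectory from the ws2 storage set onto ws1.
# Replace <zid> with your time-critical user ID; adjust direction and paths to your own layout.
rsync -av --delete /ec/ws2/tc/<zid>/home/utils/ /ec/ws1/tc/<zid>/home/utils/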

The storage server to use is controlled by the environment variable STHOST, which may take the values "ws1" or "ws2". This variable needs to be defined when logging in, and also for all the jobs that run in batch. If you log in interactively without passing the environment variable, you will be prompted to choose the desired STHOST:

WARNING: ws1 is not currently available.
1) ws1
2) ws2
Please select the desired timecrit storage set for $STHOST: 2

##### # #    # ######  ####  #####  # #####
  #   # ##  ## #      #    # #    # #   #
  #   # # ## # #####  #      #    # #   #
  #   # #    # #      #      #####  #   #
  #   # #    # #      #    # #   #  #   #
  #   # #    # ######  ####  #    # #   #


#    #  ####  ###### #####     ###### #      #    #
#    # #      #      #    #        #  #      #    #
#    #  ####  #####  #    #       #   #      #    #
#    #      # #      #####       #    #      #    #
#    # #    # #      #   #      #     #      #    #
 ####   ####  ###### #    #    ###### ######  ####

[ECMWF-INFO -ecprofile] /usr/bin/ksh93 INTERACTIVE on aa6-100 at 20220207_152402.512, PID: 53964, JOBID: N/A                                                                                                       
[ECMWF-INFO -ecprofile] $HOME=/ec/ws2/tc/zlu/home=/lus/h2tcws01/tc/zlu/home
[ECMWF-INFO -ecprofile] $TCWORK=/ec/ws2/tc/zlu/tcwork=/lus/h2tcws01/tc/zlu/tcwork
[ECMWF-INFO -ecprofile] $SCRATCHDIR=/ec/ws2/tc/zlu/scratchdir/4/aa6-100.53964.20220207_152402.512
[ECMWF-INFO -ecprofile] $TMPDIR=/etc/ecmwf/ssd/ssd1/tmpdirs/zlu.53964.20220207_152402.512

You can avoid that prompt by passing the environment variable with the desired value:

STHOST=ws2 ssh -o SendEnv=STHOST aa-login
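If you always work with the same storage set, you can set this up once on your end, for example by exporting STHOST in your local shell profile and asking ssh to forward it for the relevant host. This is just a sketch; the host alias matches the command above, so adapt it to however you connect:

# In the shell startup file on the machine you connect from (e.g. ~/.bashrc):
export STHOST=ws2

# In ~/.ssh/config on the same machine:
Host aa-login
    SendEnv STHOST

With this in place, a plain "ssh aa-login" will forward the variable automatically.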

Batch Jobs

When submitting batch jobs, you must make sure you pass the desired STHOST with the corresponding SBATCH export directive. For example, to select ws2:

#SBATCH --export=STHOST=ws2
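In context, a minimal job script using this directive might look like the following sketch (the job name and the commands in the body are only placeholders; add your usual directives for QoS, wall time, etc.):

#!/bin/bash
#SBATCH --job-name=tc_example
#SBATCH --export=STHOST=ws2

# Placeholder body: the time-critical filesystems resolve on the selected storage set.
echo "STHOST is $STHOST"
echo "TCWORK is $TCWORK"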

sbatch command line option

Like any other SBATCH directive, you may alternatively pass the export option on the sbatch command line:

sbatch --export=STHOST=ws2 job.sh

KSH jobs special requirement

If you are submitting a ksh job, make sure you include this line right after the SBATCH directives header:

source /etc/profile
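Putting the pieces together, a ksh job skeleton would then start like this (a sketch only, showing just the directives discussed here; the body is a placeholder):

#!/bin/ksh
#SBATCH --export=STHOST=ws2
source /etc/profile

# Placeholder body: your actual workload goes here.
echo "Running with STHOST=$STHOST, HOME is $HOME"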


Ecflow settings

While the availability of virtual infrastructure to run ecFlow servers remains limited, you may start your ecFlow servers on the interim dedicated HPCF node in order to run your suites, as detailed in HPC2020: Using ecFlow. However, keep in mind the new model described below when migrating or designing your solution.

At a later stage, those ecFlow servers will need to be moved to dedicated Virtual Machines outside the HPCF. Unlike the ones for standard users, these will be completely independent of the HPCF, with a local user account and no shared HOME or any other filesystem. You will need to transfer all the required suite files, such as task scripts and headers, onto the ecFlow server and keep them in sync.
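As a purely hypothetical sketch of what keeping those files in sync might look like once the VMs are available (the hostname and directories are placeholders, not actual service names):

# Hypothetical: push task scripts and headers from the HPCF to the ecFlow VM.
rsync -av $HOME/my_suite/include/ <ecflow-vm>:/home/<local-user>/my_suite/include/
rsync -av $HOME/my_suite/tasks/   <ecflow-vm>:/home/<local-user>/my_suite/tasks/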

Depending on where you run your ecFlowUI, you may also need to run a log server on the HPCF in order to inspect the job outputs.

For job submission and management, those ecFlow VMs will come with "troika", the tool used in production at ECMWF to manage the submission, kill and status query of operational jobs. This tool is currently being finalised for the Atos HPCF and will be made available together with the ecFlow VM service.
