With 4 identical Atos complexes (also known as clusters) installed in our Data Centre in Bologna (see Atos HPCF: System overview), we are now able to provide a more reliable computing service at ECMWF, including for batch work. For example, during a system session on one complex, we will submit batch jobs to a different complex. This enhanced batch service may, however, require the use of some ECMWF-customised SLURM commands.

Submitting a job: sbatch

The default PATH includes /usr/local/bin, which contains an ECMWF local version of sbatch that can submit batch jobs to a different complex. For example, before a system session on one complex, say AA, we may decide to submit HPCF batch jobs to another complex, say AB. This will happen transparently for all our users.

sbatch

If you use the standard SLURM sbatch command in /usr/bin, you will not benefit from cross-complex job submission. For example, under cron, PATH contains only /usr/bin by default, so your jobs will only be submitted to the complex your cron entry is on.

All SLURM sbatch options are available with the ECMWF customised sbatch command.

Job IDs are unique across all complexes, so there is no risk of duplicates.
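
For scripts started from cron, one way to pick up the ECMWF sbatch is to put /usr/local/bin at the front of PATH before submitting. The following is a minimal sketch only; the function name ensure_local_bin_first is hypothetical and not part of any ECMWF tooling.

```shell
#!/bin/sh
# Hypothetical helper: make sure /usr/local/bin is the first PATH entry,
# so that a plain "sbatch" resolves to /usr/local/bin/sbatch (the ECMWF
# version) rather than /usr/bin/sbatch.
ensure_local_bin_first() {
    case "${PATH}" in
        /usr/local/bin|/usr/local/bin:*) ;;            # already first, nothing to do
        *) PATH="/usr/local/bin${PATH:+:${PATH}}" ;;   # prepend; a later duplicate is harmless
    esac
    export PATH
}

ensure_local_bin_first

# Report which sbatch the shell would run now (prints nothing useful
# on a machine where sbatch is not installed).
command -v sbatch || echo "sbatch not found on PATH" >&2
```

Calling /usr/local/bin/sbatch by its full path in the cron script should have the same effect without touching PATH.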

Monitoring a job: ecsqueue

The default SLURM command 'squeue' lists jobs on the current complex only. To list the jobs running on another complex, or on all complexes, use the 'ecsqueue' command.

$ ecsqueue --help
usage: ecsqueue [-u USER] [-h] [-o FORMAT] [-O FORMAT] [-q QOS] [-j JOBID]
                [-M CLUSTERS]
$ ecsqueue -u $USER
# will show all the jobs running for you on the 4 Atos complexes.

ecsqueue

ecsqueue is located in /usr/local/bin. You may need to adapt your PATH.

Only a limited set of SLURM squeue options is available with ecsqueue.
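
As an illustration, the -u and -M options shown in the usage line above could be wrapped in a small helper that also checks that ecsqueue is on PATH. This is a sketch under assumptions: the function name list_my_jobs is hypothetical, and the exact spelling of cluster names accepted by -M (for example aa or AA) is not stated here and should be checked on the system.

```shell
#!/bin/sh
# Hypothetical helper: list your own jobs on all complexes, or only on
# the cluster(s) given as the first argument.
list_my_jobs() {
    if ! command -v ecsqueue >/dev/null 2>&1; then
        echo "ecsqueue not found; add /usr/local/bin to PATH" >&2
        return 1
    fi
    user="${USER:-$(id -un)}"
    if [ -n "${1:-}" ]; then
        # Assumption: $1 is a cluster name, or a comma separated list of names.
        ecsqueue -u "$user" -M "$1"
    else
        ecsqueue -u "$user"
    fi
}

# Example calls (cluster names are placeholders):
#   list_my_jobs
#   list_my_jobs aa,ab
```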

Deleting a job: ecscancel

The default SLURM command 'scancel' deletes a job on the current complex only. To delete a job running on another complex, use the ecscancel command:


$ ecscancel --help
usage: ecscancel [-h] [-u USER] [-t STATE] [-f] [-b] [-i] [-q QOS]
                 [-n JOBNAME] [-s SIGNAL] [-M CLUSTERS]
                 [jobid [jobid ...]]

positional arguments:
  jobid                 list of jobids

optional arguments:
  -h, --help            show this help message and exit
  -u USER, --user USER  scancel for particular user
  -t STATE, --state STATE
                        scancel for particular state
  -f, --full            scancel full
  -b, --batch           scancel batch step
  -i, --interactive     scancel interactive
  -q QOS, --qos QOS     scancel qos
  -n JOBNAME, --jobname JOBNAME
                        scancel jobname
  -s SIGNAL, --signal SIGNAL
                        scancel with a signal
  -M CLUSTERS, --clusters CLUSTERS
                        scancel for particular cluster, or comma separated
                        list of clusters
$ ecscancel <jobid>
# will cancel job <jobid> on one of the four complexes.

ecscancel

ecscancel is located in /usr/local/bin. You may need to adapt your PATH.

Only a limited set of SLURM scancel options is available with ecscancel.
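
Because job IDs are unique across complexes, an ID captured at submission time can later be passed to ecsqueue -j or ecscancel, whichever complex the job ended up on. The sketch below uses the standard SLURM option sbatch --parsable, which prints "jobid" or "jobid;cluster". It assumes that the ECMWF sbatch passes this output through unchanged; the function names and my_job.sh are hypothetical.

```shell
#!/bin/sh
# Extract the job ID from "sbatch --parsable" output, which has the
# form "jobid" or "jobid;cluster".
job_id_from_parsable() {
    printf '%s\n' "${1%%;*}"
}

# Hypothetical helper: submit a job script and print only its job ID.
# Assumption: the ECMWF sbatch returns --parsable output unchanged.
submit_job() {
    out=$(sbatch --parsable "$@") || return 1
    job_id_from_parsable "$out"
}

# Example workflow (not executed here):
#   jobid=$(submit_job my_job.sh)
#   ecsqueue -j "$jobid"     # check the job, wherever it runs
#   ecscancel "$jobid"       # cancel it, wherever it runs
```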