AG: Batch system

Slurm is the batch system available. Any script can be submitted as a job with no changes, but you might want to see Writing SLURM jobs to customise it.

To submit a script as a serial job with default options enter the command:

sbatch yourscript.sh

You may query the queues to see the jobs currently running or pending with:

squeue

And cancel a job with:

scancel <jobid>

The "scancel" command should be executed on a login node on the same cluster as the job.

See the Slurm documentation for more details on the different commands available to submit, query or cancel jobs.
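The three commands above can be chained once the job ID is known. The following is a minimal sketch, not a site-provided tool: the helper name submit_job is hypothetical, and it relies on the --parsable option of sbatch, which prints only the job ID (optionally followed by ";" and the cluster name).

```shell
#!/bin/bash
# Submit a script and print only the numeric job ID.
# With --parsable, sbatch prints "jobid" or "jobid;cluster" instead of a sentence.
submit_job() {
    out=$(sbatch --parsable "$1") || return 1
    # Drop the optional ";cluster" suffix.
    echo "${out%%;*}"
}

# Example usage (yourscript.sh is the placeholder name used above):
#   jobid=$(submit_job yourscript.sh)
#   squeue -j "$jobid"     # query only this job
#   scancel "$jobid"       # cancel it if no longer needed
```

Keeping the job ID in a variable avoids copying it by hand from the sbatch message.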

QoS available

These are the different QoS (or queues) available for standard users on the four complexes:

QoS name	Type	Suitable for...	Shared nodes	Maximum jobs per user	Maximum nodes per user	Default / Max Wall Clock Limit	Default / Max CPUs	Default memory per cpu
ng	GPU	serial and small parallel jobs with GPU. It is the default	Yes	-	4	average runtime + standard deviation / 2 days	1 / -	2900 MB
dg	GPU	short debug jobs requiring GPU	Yes	1	2	average runtime + standard deviation / 30 min	1 / -	2900 MB
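A job can be directed to one of the QoS in the table with the --qos option of sbatch. As a minimal sketch, the helper below (submit_debug is a hypothetical name, and the 10-minute limit is only an illustrative value) sends a script to the dg QoS with an explicit wall clock limit:

```shell
#!/bin/bash
# Submit a script to the dg debug QoS with a 10-minute wall clock limit,
# which stays within the 30 min maximum listed for dg in the table.
submit_debug() {
    sbatch --qos=dg --time=00:10:00 "$@"
}

# Example usage:
#   submit_debug yourscript.sh
```

Setting --time explicitly means the job does not depend on the default wall clock limit.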

Limits are not set in stone

Different limits on the different QoSs may be introduced or changed as the system evolves.

Checking QoS setup

If you want to get all the details of a particular QoS on the system, you may run, for example:

sacctmgr list qos names=ng
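The full output of that command is wide. As a sketch, assuming the installed Slurm version accepts the format field names used here (they are taken from the sacctmgr manual, not from this site), a few limits can be selected in a script-friendly form. The helper name qos_limits is hypothetical.

```shell
#!/bin/bash
# Print selected limits of one QoS as pipe-separated fields.
# -n drops the header line; -P selects parsable output delimited by "|".
qos_limits() {
    sacctmgr -n -P list qos names="$1" \
        format=Name,MaxWall,MaxJobsPerUser,MaxTRESPerUser
}

# Example usage:
#   qos_limits ng
#   qos_limits dg | cut -d'|' -f2    # only the maximum wall clock limit
```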

Submitting jobs remotely

If you are submitting jobs from a different platform via ssh, please use the dedicated ag-batch node instead of the *-login equivalent:

ssh ag-batch "sbatch myjob.sh"

AG: Writing SLURM jobs

Any shell script can be submitted as a Slurm job with no modifications. In that case, sensible default values are applied to the job. However, you can configure the script to fit your needs through job directives. In Slurm, these are just special comments in your script, usually at the top just after the shebang line, with the form "#SBATCH --option=value".
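Slurm reads directives from comment lines that start with #SBATCH, so the script remains a valid shell script. The following is a minimal sketch of a job script, with illustrative values: the job name, time limit and output file name are assumptions, while the QoS name and memory value come from the QoS table.

```shell
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --qos=ng
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2900M
#SBATCH --output=my_job-%j.out

# SLURM_JOB_ID is set by Slurm inside a running job; fall back to a
# placeholder so the script can also be tried outside the batch system.
job_id="${SLURM_JOB_ID:-unknown}"
echo "Running on $(uname -n) as job ${job_id}"
```

In the output file name, %j is replaced by the job ID. Options given on the sbatch command line take precedence over the directives in the script.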

AG: Batch jobs not starting - reasons

There may be a number of reasons why a submitted job does not start running. When that happens, it is a good idea to use squeue and pay attention to the STATE and NODELIST(REASON) columns:

$> squeue -j 64243399 
    JOBID       NAME  USER   QOS    STATE       TIME TIME_LIMIT NODES      FEATURES NODELIST(REASON)
 64243399     my_job  user    nf  PENDING       0:00   03:00:00     1        (null) (Priority)
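The same two columns can be extracted in a script. This is a minimal sketch using the squeue output format codes %T (state) and %r (reason). The helper names are hypothetical and the list of reasons is deliberately short, not exhaustive.

```shell
#!/bin/bash
# Print "STATE REASON" for one job; -h drops the header line.
job_reason() {
    squeue -h -j "$1" -o "%T %r"
}

# Translate a few common Slurm reason codes into a short hint.
explain_reason() {
    case "$1" in
        Priority)
            echo "other jobs with higher priority are ahead in the queue" ;;
        Resources)
            echo "the job is waiting for nodes or CPUs to become free" ;;
        QOSMaxJobsPerUserLimit)
            echo "the per-user job limit of the QoS has been reached" ;;
        *)
            echo "see the Slurm documentation for reason '$1'" ;;
    esac
}

# Example usage with the job ID shown above:
#   out=$(job_reason 64243399)
#   explain_reason "${out#* }"
```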
