The batch scheduler on the Cray XC is PBSpro. The basic functionality is similar to LoadLeveler:
- The user includes a number of directives at the start of the job script which provide information about which queue the job should run in, the location of the output and error files, the resources the jobs needs, etc.
- A number of command line tools are provided for submitting, viewing and managing the job.
Main command line tools
The table summarises the main PBSpro user commands and their LoadLeveler equivalents.
User commands | LoadLeveler | PBSpro |
---|---|---|
Job submission | llsubmit <job_script> | qsub [<pbs_options>] <job_script> |
Job cancellation | llcancel <job_id> | qdel <job_id> |
Job status | llq [-u <uid>] [-j <job_id>] | qstat [-u <uid>] [<job_id>] |
Queue information | llclass [-l <queue> ] | qstat -Q [-f] [<queue>] |
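For example, a typical submit / monitor / cancel sequence might look like the following sketch (the script name and job ID are illustrative):

```
% qsub job_pbs.cmd          # submit the job script
1234567.sdb                 # PBSpro returns the job ID
% qstat -u $USER            # list the status of your jobs
% qdel 1234567              # cancel the job
```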
Cancelling a prepIFS job running on the Cray
To cancel a job running on the Cray which has been submitted from prepIFS, the simplest approach is to use the XCDP "Special -> Kill".
For ECMWF research users, it is also possible to log on to the cluster where the job is running and cancel the job directly with the rdx_qdel <jobid> command.
Access to the job output and error files while the job is running
With PBSpro, the job output and error files are not generally available to the user while the job is executing. Instead, any output written to the output or error files is stored in a spool directory and copied to the output and error files specified by the PBSpro directives only when the job ends.
So that users are able to access the job output and error files while the job is running, ECMWF has provided the qcat command.
```
usage: qcat [-h] [-o | -e] [-f] JID

Access the job output and error files while the job is running

positional arguments:
  JID           The job ID

optional arguments:
  -h, --help    show this help message and exit
  -o, --output  Get the stdout of the job. This is the default
  -e, --error   Get the stderr
  -f, --follow  Get the live output as it goes
```
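For example, to follow the output of a running job as it is produced (the job ID shown is illustrative):

```
% qcat -f 1234567
```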
Batch queues on the ECMWF Cray XC systems
As on the IBM systems, a number of batch queues have been defined. The basic queue names remain the same as on the IBM. The specifications of the main queues are given in the table.
User queue name | Suitable for | Target nodes | Number of processors (min/max) | Shared / not shared | Processors per node | Memory limit | Wall-clock limit |
---|---|---|---|---|---|---|---|
ns | serial | PPN | 1/1 | shared | 72 | 100 GB | 48 hours |
nf | fractional | PPN | 2/36 | shared | 72 | 100 GB | 48 hours |
np | parallel | MOM+CN | 1/72 | not shared | 72 | 120 GB | 48 hours |
Reminder: fractional jobs
As on the IBM systems, ECMWF defines a fractional job to be a job that uses more than one processor but less than half of the resources available on one full node. On the Cray, this means a job requesting:
- 36 or fewer processors with hyperthreading and less than 60 GBytes of memory or
- 18 or fewer processors without hyperthreading and less than 60 GBytes of memory.
Queues for time-critical option 2 work
The corresponding queues for time-critical option 2 jobs, ts, tf and tp, have also been defined.
Migrating LoadLeveler jobs to PBSpro
To help with the migration of batch jobs from LoadLeveler to PBSpro, ECMWF has provided the ll2pbs command:
```
usage: ll2pbs [-h] [-q] [-s] [-i INSCRIPT] [-o OUTSCRIPT] [-f]

Job translator from LoadLeveler to PBSPro

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           Do not produce warning or error messages on stderr
  -s, --strict          Strict Translation. Fail if a directive is not
                        supported. Default is to skip the unsupported directive
  -i INSCRIPT, --inscript INSCRIPT
                        Input script. By default reads stdin
  -o OUTSCRIPT, --outscript OUTSCRIPT
                        Output translated script. By default writes to stdout
  -f, --force           Overwrite the output file if it exists
```
This command can be used for the initial translation of simple scripts from LoadLeveler to PBSpro. It returns warnings for any directives that are not recognised or cannot be converted:
```
% ll2pbs -i job_ll.cmd -o job_pbs.cmd
WARNING: directive comment not supported, skipping...
WARNING: directive cpu_limit not supported, skipping...
WARNING: No variables allowed in the output file definition in PBS. Reverting to default values...
WARNING: No variables allowed in the error file definition in PBS. Reverting to default values...
```
The LoadLeveler file used as the input to this command and the PBSpro output file produced are:
IBM LoadLeveler job - job_ll.cmd | Cray PBSpro job - job_pbs.cmd | Notes |
---|---|---|
#!/bin/ksh | #!/bin/ksh | No equivalent of job_type in PBS. No equivalent of comment in PBS. Variables are not accepted in the job output and error file names, so the defaults will be used. ConsumableMemory => EC_memory_per_task. ConsumableCpus => EC_threads_per_task. No equivalent of cpu_limit in PBS. No equivalent of initialdir - replaced with a cd command in the script. No equivalent of the queue job-step end mark in PBS. |
The ll2pbs command is provided as an aid only. The resulting PBS job should be checked carefully before use. In particular:
- not all LoadLeveler directives have an equivalent in PBSpro
- variables cannot be used to specify the job output and error files in PBSpro directives
- it is not possible to specify a CPU time limit
- there is no concept of 'soft' and 'hard' limits for the Wall-clock time - effectively, only a 'hard' Wall-clock time limit can be specified
- there is no knowledge of 'fractional' jobs submitted to queue nf
- no changes are made to any job geometry keywords
- no changes are made to the script - in particular, compilation commands are not converted
PBSpro also provides the nqs2pbs command. This utility converts an existing NQS job script to work with PBSpro and NQS: the existing script is copied and PBS directives are inserted before each NQS directive in the original script. See 'man nqs2pbs' for more details.
PBSpro job header keywords
Put all your PBS directives at the top of the script file, above any commands. Any directive after an executable line in the script is ignored. Note that you can pass PBS directives, including the ECMWF custom PBS ones, as options to the 'qsub' command.
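A minimal sketch of the two styles (queue, job name and resource values are illustrative):

```
#!/bin/ksh
#PBS -q nf                      # directives go at the top of the script
#PBS -N my_job
#PBS -l walltime=00:10:00
#PBS -l EC_total_tasks=4

# ...commands start here; any #PBS line after this point is ignored
```

Alternatively, the same directives can be given on the command line, for example `qsub -q nf -l walltime=00:10:00 -l EC_total_tasks=4 job.cmd`.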
Keyword | LoadLeveler | PBSpro | Notes on PBSpro option |
---|---|---|---|
Prefix | #@ | #PBS | |
Queue | class=<queue> | -q <queue> | |
Job Name | job_name=<job_name> | -N <job_name> | <job_name> can be a maximum of 15 characters. If not specified, the default Job name is that of the script submitted. |
Shell | shell=/usr/bin/ksh | -S /usr/bin/ksh | |
Wall-clock limit | wall_clock_limit=<hh:mm:ss,hh:mm:ss> | -l walltime=<hh:mm:ss> | There is no concept of a soft wall-clock limit |
CPU-time limit | cpu_limit=<hh:mm:ss> | no equivalent | |
Initial working directory | initialdir=<path> | no equivalent | Use a cd command in the script instead |
Job output | output=<output_file> | -o <output_file> | Job output and job error can be joined with the -j oe option. Output and error are only written to the files specified with -o and -e when the job ends |
Job error | error=<error_file> | -e <error_file> | |
Email notification | notification=<event> | -m <event> | <event> is 'n' (none) or any combination of 'b' (begin), 'e' (end) and 'a' (abort). The default is 'a' |
Email user | notify_user=<email> | -M <email> | |
Set environment variables | environment = <ENV1>=<value1> | -v <ENV1>=<value1>, <ENV2>=<value2> | |
Copy environment | environment = COPY_ALL | -V | Use with caution! |
Jobstep end mark | queue | no equivalent | |
Writing Job output and Job error to the same file
If the Job output and Job error file names specified with the -o and -e options point to the same file, then the Job error will overwrite the Job output at the end of the job unless the -j oe option is specified.
In general, to get the Job output and Job error written to the same file it is better to specify only the Job output file and use the -j oe option.
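A sketch of the recommended directives (the output file name is illustrative):

```
#PBS -o myjob.out    # job output file
#PBS -j oe           # write the job error to the same file as the job output
```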
ECMWF custom PBSpro directives
In addition to the standard PBSpro directives, ECMWF has defined a number of custom directives to help the user define the geometry of the job. A full list of all ECMWF custom directives can be found at ECMWF PBSpro. Some of the more commonly used options together with their LoadLeveler equivalents are listed in the table:
Keyword | LoadLeveler | PBSpro | Notes on PBSpro option |
---|---|---|---|
Number of nodes | node = <nodes> | -l EC_nodes=<nodes> | |
Total number of MPI tasks | total_tasks = <tasks> | -l EC_total_tasks=<tasks> | |
Number of parallel threads per MPI task | parallel_threads = <threads> | -l EC_threads_per_task=<threads> | |
Number of MPI tasks per node | tasks_per_node = <tasks> | -l EC_tasks_per_node=<tasks> |
Consumable Memory per MPI task | resources = ConsumableMemory(<memory>) | -l EC_memory_per_task=<memory> | |
Use hyperthreading / SMT | ec_smt = yes / no | -l EC_hyperthreads=2 / 1 | |
Job name | job_name = <job_name> | -l EC_job_name=<job_name> | Can be longer than 15 characters |
Billing account | account_no = <account> | -l EC_billing_account=<account> |
On Cray systems, the term processing element (PE) is more often used to describe the equivalent of an MPI task in LoadLeveler.
Associated with each of the ECMWF custom directives is an environment variable of the same name. These environment variables can be used in job scripts and, in particular, to specify the options to the aprun command.
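For example, a parallel launch line built from these variables might look like the following sketch (the executable name is illustrative; the flags shown are the standard aprun options -n, -N, -d and -j):

```
aprun -n $EC_total_tasks -N $EC_tasks_per_node \
      -d $EC_threads_per_task -j $EC_hyperthreads ./my_prog
```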
PBSpro provides a selection statement via "#PBS -l select=" which is used on other systems to specify the node requirements and job geometry. This statement is quite complex to use and does not cover all the requirements for advanced job scheduling used by ECMWF. ECMWF has, therefore, disabled the 'select' statement by default and asks users to use the ECMWF PBSpro directives instead.
ECMWF believes that its customised PBSpro directives cover the majority of user requirements. If you are unable to set the job geometry that you require then please contact the Service Desk.
PBSpro Job examples
Serial job example
- Serial jobs should be submitted to the ns queue.
- There is no need to specify any further job geometry requirements.
- For jobs requiring more than the default 1.25 GBytes of memory per task, the memory requirement should be set with the EC_memory_per_task directive (see the sketch after this list).
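A minimal sketch of a serial job script, assuming the usual $SCRATCH working directory; the job name, time limit, memory value and program name are illustrative:

```
#!/bin/ksh
#PBS -q ns                          # serial queue
#PBS -N serial_example
#PBS -j oe                          # join job output and error
#PBS -l walltime=00:30:00
#PBS -l EC_memory_per_task=2GB      # only needed if more than the default 1.25 GB

cd $SCRATCH                         # PBS has no initialdir equivalent, so cd explicitly
./my_serial_prog                    # hypothetical serial executable
```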
Pure OpenMP job example
- Pure OpenMP jobs can be submitted to either:
  - queue np if EC_threads_per_task > 18 x EC_hyperthreads
  - queue nf if EC_threads_per_task <= 18 x EC_hyperthreads
- Set EC_total_tasks=1 to specify that the job does not use MPI.
- Set EC_threads_per_task to the number of OpenMP threads required. This will be the value needed for the OMP_NUM_THREADS environment variable.
- Choose whether or not to use hyperthreading by setting EC_hyperthreads.
Restriction on the setting of EC_threads_per_task
EC_threads_per_task <= 36 x EC_hyperthreads
If the job is submitted to the np queue then the executable must be run with the aprun command.
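A sketch of this case (the thread count, time limit and program name are illustrative):

```
#!/bin/ksh
#PBS -q np
#PBS -N omp_np_example
#PBS -l walltime=01:00:00
#PBS -l EC_total_tasks=1            # no MPI
#PBS -l EC_threads_per_task=48      # > 18 x EC_hyperthreads, hence queue np
#PBS -l EC_hyperthreads=2

export OMP_NUM_THREADS=$EC_threads_per_task
# in np the executable must be launched through aprun:
# -d sets the number of threads, -j the hyperthreading level
aprun -n 1 -d $OMP_NUM_THREADS -j $EC_hyperthreads ./my_openmp_prog
```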
For pure OpenMP fractional jobs running in the nf queue, the executable can be called directly:
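A sketch of this fractional case, again with illustrative values and a hypothetical program name:

```
#!/bin/ksh
#PBS -q nf
#PBS -N omp_nf_example
#PBS -l walltime=01:00:00
#PBS -l EC_total_tasks=1            # no MPI
#PBS -l EC_threads_per_task=8       # <= 18 x EC_hyperthreads, hence queue nf
#PBS -l EC_hyperthreads=1

export OMP_NUM_THREADS=$EC_threads_per_task
./my_openmp_prog                    # run directly, no aprun in nf
```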
Pure MPI job example
- Pure MPI jobs can be submitted to either:
  - queue np if EC_total_tasks > 18 x EC_hyperthreads
  - queue nf if EC_total_tasks <= 18 x EC_hyperthreads
- Set EC_total_tasks to the total number of MPI tasks to be used.
- Optionally set EC_threads_per_task=1 to specify that no OpenMP threads will be used.
- Optionally choose the amount of memory per MPI task with the EC_memory_per_task directive.
- Choose whether or not to use hyperthreading by setting EC_hyperthreads.
- The executable must be run with the aprun command, as in the sketch below.
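A sketch of a pure MPI job in the np queue (the task counts, time limit and executable name are illustrative):

```
#!/bin/ksh
#PBS -q np
#PBS -N mpi_np_example
#PBS -l walltime=01:00:00
#PBS -l EC_total_tasks=144          # total number of MPI tasks
#PBS -l EC_tasks_per_node=36        # tasks placed on each node
#PBS -l EC_threads_per_task=1       # no OpenMP threads
#PBS -l EC_hyperthreads=1

# launch one PE per MPI task with aprun
aprun -n $EC_total_tasks -N $EC_tasks_per_node -j $EC_hyperthreads ./my_mpi_prog
```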
Hybrid MPI-OpenMP job example
- Hybrid MPI-OpenMP jobs can be submitted to either:
  - queue np if EC_tasks_per_node x EC_threads_per_task > 18 x EC_hyperthreads
  - queue nf if EC_tasks_per_node x EC_threads_per_task <= 18 x EC_hyperthreads
- Set EC_total_tasks to the total number of MPI tasks required.
- Set EC_threads_per_task to the number of OpenMP threads required. This will be the value needed for the OMP_NUM_THREADS environment variable.
- Optionally choose the amount of memory per MPI task with the EC_memory_per_task directive.
- Choose whether or not to use hyperthreading by setting EC_hyperthreads.
- The executable must be run with the aprun command, as in the sketch below.
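A sketch of a hybrid job in the np queue; the geometry (48 tasks, 6 threads per task, 12 tasks per node with hyperthreading) and the executable name are illustrative:

```
#!/bin/ksh
#PBS -q np
#PBS -N hybrid_example
#PBS -l walltime=01:00:00
#PBS -l EC_total_tasks=48           # MPI tasks in total
#PBS -l EC_threads_per_task=6       # OpenMP threads per MPI task
#PBS -l EC_tasks_per_node=12        # 12 tasks x 6 threads = 72 logical CPUs per node
#PBS -l EC_hyperthreads=2

export OMP_NUM_THREADS=$EC_threads_per_task
# -n: total PEs, -N: PEs per node, -d: threads per PE, -j: hyperthreading level
aprun -n $EC_total_tasks -N $EC_tasks_per_node \
      -d $EC_threads_per_task -j $EC_hyperthreads ./my_hybrid_prog
```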
Fractional job example
A fractional job is a job that uses fewer than half of the total resources on a node.
- Fractional jobs must be submitted to the nf queue.
- The job must request:
  - less than 60 GBytes of memory and
  - 18 or fewer (logical) CPUs with EC_hyperthreads=1 or
  - 36 or fewer (logical) CPUs with EC_hyperthreads=2.
- Optionally set the EC_memory_per_task directive to the amount of memory per MPI task required if this is greater than the default.
- The executable must be run with the mpiexec command.
Scripts for fractional jobs that run an MPI executable must load the cray-snplauncher module to add the mpiexec executable to the $PATH. Pure OpenMP fractional jobs should set OMP_NUM_THREADS and then run the executable directly.
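A sketch of a fractional MPI job (the task count, memory value and executable name are illustrative):

```
#!/bin/ksh
#PBS -q nf
#PBS -N fractional_example
#PBS -l walltime=01:00:00
#PBS -l EC_total_tasks=8            # small MPI job, within the fractional limits
#PBS -l EC_threads_per_task=1
#PBS -l EC_hyperthreads=1
#PBS -l EC_memory_per_task=3GB      # only if more than the default is needed

module load cray-snplauncher        # provides mpiexec
mpiexec -n $EC_total_tasks ./my_mpi_prog
```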
Multiple program multiple data (MPMD) job example
Multiple program multiple data (MPMD) jobs run either several different executables, each possibly requiring a different job geometry, or several instances of the same executable, possibly accessing different data.
- MPMD jobs should be submitted to either the np or the nf queue, depending on the job geometry requested.
- The executable must be run with the aprun command, as in the sketch below.
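A sketch of an MPMD job in the np queue running two hypothetical executables in a single MPI job (the task counts and program names are illustrative):

```
#!/bin/ksh
#PBS -q np
#PBS -N mpmd_example
#PBS -l walltime=01:00:00
#PBS -l EC_total_tasks=72
#PBS -l EC_hyperthreads=1

# aprun separates the executable groups with ':';
# the per-group task counts must add up to EC_total_tasks
aprun -n 24 ./ocean_model : -n 48 ./atmosphere_model
```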
Further reading
- EC_ Job Directives Summary
- ECMWF Cray PBSpro setup concepts
- EC_ Job Directives Quick Start
- PBSpro at ECMWF - Cray XC30 Workshop February 2014
10 Comments
Dominique Lucas
You can indeed run MPMD programs in queue nf with mpiexec, e.g. from the man page, even tailored to us:
"To run an MPMD application consisting of the programs ocean on 4
processes and air on 8 processes, enter a command line like this
example.
% mpiexec -n 4 ocean : -n 8 air"
Axel Bonet
After running into the following messages ...
thanks to Christian, mpiexec could be used with:
module load cray-snplauncher
Glenn Carver
There's a typo in the hybrid mpi-openmp example:
export OMP_NUM_THREADS=$EC_tasks_per_thread
should be
export OMP_NUM_THREADS=$EC_threads_per_task
Dominique Lucas
Thanks, Glenn. I've corrected this.
Ernesto Barrera
I think it should be stressed that ECMWF custom PBSpro directives can be set at qsub level as well (which increases versatility of scripting):
Dominique Lucas
I've added this in the document.
Ernesto Barrera
Dominique Lucas
Thanks, Ernesto. You're right. I've changed it to '1', which is the default, therefore optional.
Fredrik Jansson
Is it possible to submit a job with a program which spawns its own worker programs/threads ? This seems similar to the Multiple program multiple data (MPMD) job example above, but I'd rather not specify the different binaries in the job (e.g. the aprun command line), but instead have the first program spawn the rest.
I'm using the amuse framework (http://amusecode.org/) to couple different parallel codes, one of which is a modified OpenIFS. Currently, the master program starts and manages to spawn the first worker. The worker then fails when it tries to initialize MPI. I have previously run the same setup using OpenMPI elsewhere.
The line which fails in #10 is this: CALL MPI_INIT_THREAD(IREQUIRED,IPROVIDED,IERROR)
If there is a better place for this question, please let me know.
Fredrik Jansson
When running an MPMD job, I have the problem that whether the job finishes normally or one task encounters an error and calls mpi_abort(), the job still keeps running until the wall-clock time limit.
Is there a way to stop the whole job in these cases, to not waste resources?
The programs I'm running are one Python coupler task using mpi4py and multiple worker tasks written in Fortran. I use the MPMD scheme to avoid wasting one full node on running the Python coupler.