
The batch scheduler on the Cray XC is PBSpro. The basic functionality is similar to LoadLeveler:

  • The user includes a number of directives at the start of the job script which provide information about which queue the job should run in, the location of the output and error files, the resources the jobs needs, etc.
  • A number of command line tools are provided for submitting, viewing and managing the job.
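
For example, a minimal job script and its submission could look like the sketch below (the job name, queue and limit are purely illustrative; the individual directives are described later on this page):

Minimal PBSpro job sketch
#!/bin/ksh
#PBS -N hello                 # job name
#PBS -q ns                    # queue (here the serial queue)
#PBS -l walltime=00:05:00     # wall-clock limit

echo "Hello from PBSpro"

The script is then submitted with "qsub hello.cmd".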


Main command line tools

The table summarises the main PBSpro user commands and their LoadLeveler equivalents.

User commands     | LoadLeveler                  | PBSpro
Job submission    | llsubmit <job_script>        | qsub [<pbs_options>] <job_script>
Job cancellation  | llcancel <job_id>            | qdel <job_id>
Job status        | llq [-u <uid>] [-j <job_id>] | qstat [-u <uid>] [<job_id>]
Queue information | llclass [-l <queue>]         | qstat -Q [-f] [<queue>]
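
For example, a typical sequence to submit a job, check its status and, if necessary, cancel it would be (the script name and job ID are placeholders):

Basic job management
% qsub myjob.cmd              # submit the job; qsub prints the job ID
% qstat -u $USER              # list your jobs and their states
% qdel <job_id>               # cancel the job using the ID printed by qsub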

Cancelling a prepIFS job running on the Cray

To cancel a job running on the Cray which has been submitted from prepIFS, the simplest approach is to use the XCDP "Special -> Kill".

For ECMWF research users, it is also possible to log on to the cluster where the job is running and cancel the job directly with the rdx_qdel <jobid> command.

Access to the job output and error files while the job is running

With PBSpro, the job output and error files are not generally available to the user while the job is executing.  Instead, any output written to the output or error files is stored in a spool directory and copied to the output and error files specified by the PBSpro directives only when the job ends.

So that users are able to access the job output and error files while the job is running, ECMWF has provided the qcat command.

qcat usage
usage: qcat [-h] [-o | -e] [-f] JID
Access the job output and error files while the job is running
positional arguments:
  JID           The job ID
optional arguments:
  -h, --help    show this help message and exit
  -o, --output  Get the stdout of the job. This is the default
  -e, --error   Get the stderr
  -f, --follow  Get the live output as it goes
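
For example, to look at the output of a running job (the job ID is a placeholder for the ID returned by qsub):

qcat example
% qcat <job_id>               # show the job's stdout so far (the default)
% qcat -f <job_id>            # follow the stdout as the job produces it
% qcat -e <job_id>            # show the job's stderr instead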

Batch queues on the ECMWF Cray XC systems

As on the IBM systems, a number of batch queues have been defined. The basic queue names remain the same as on the IBM. The specifications of the main queues are given in the table.

User queue name | Suitable for | Target nodes | Number of processors (min/max) | Shared / not shared | Processors per node available for user jobs | Memory limit | Wall-clock limit
ns              | serial       | PPN          | 1/1                            | shared              | 72                                           | 100 GB       | 48 hours
nf              | fractional   | PPN          | 2/36                           | shared              | 72                                           | 100 GB       | 48 hours
np              | parallel     | MOM+CN       | 1/72                           | not shared          | 72                                           | 120 GB       | 48 hours

Reminder: fractional jobs

As on the IBM systems, ECMWF defines a fractional job to be a job that uses more than one processor but less than half of the resources available on one full node.  On the Cray, this means a job requesting:

  • 36 or fewer processors with hyperthreading and less than 60 GBytes of memory, or
  • 18 or fewer processors without hyperthreading and less than 60 GBytes of memory.

Queues for time-critical option 2 work

The corresponding queues for time-critical option 2 jobs, ts, tp and tf, have also been defined.
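
For example, to run one of the parallel job examples given later on this page as time-critical option 2 work, only the queue directive would change (a sketch based on the queue correspondence described above):

Time-critical queue selection
#PBS -q tp                    # time-critical equivalent of the np queue; other directives unchanged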

Migrating LoadLeveler jobs to PBSpro

To help with the migration of batch jobs from LoadLeveler to PBSpro, ECMWF has provided the ll2pbs command:

ll2pbs usage
usage: ll2pbs [-h] [-q] [-s] [-i INSCRIPT] [-o OUTSCRIPT] [-f]
Job translator from LoadLeveler to PBSPro
optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           Do not produce warning or error messages on stderr
  -s, --strict          Strict Translation. Fail if a directive is not
                        supported. Default is to skip the unsupported
                        directive
  -i INSCRIPT, --inscript INSCRIPT
                        Input script. By default reads stdin
  -o OUTSCRIPT, --outscript OUTSCRIPT
                        Output translated script. By default writes to stdout
  -f, --force           Overwrite the output file if it exists

This command can be used for the initial translation of simple scripts from LoadLeveler to PBSpro.  It returns warnings for any directives that are not recognised or cannot be converted:

ll2pbs example
% ll2pbs -i job_ll.cmd -o job_pbs.cmd
WARNING: directive comment not supported, skipping...
WARNING: directive cpu_limit not supported, skipping...
WARNING: No variables allowed in the output file definition in PBS. Reverting to default values...
WARNING: No variables allowed in the error file definition in PBS. Reverting to default values...

The LoadLeveler file used as the input to this command and the PBSpro output file produced are:

IBM LoadLeveler job - job_ll.cmd
#!/bin/ksh
#@ shell            = /usr/bin/ksh
#@ class            = ns
#@ job_type         = serial
#@ job_name         = pi
#@ comment          = User Support example pi (serial)
#@ output           = $(job_name).$(schedd_host).$(jobid).out
#@ error            = $(job_name).$(schedd_host).$(jobid).out
#@ notification     = error
#@ resources        = ConsumableCpus(1) ConsumableMemory(100mb)
#@ cpu_limit        = 00:16:00
#@ initialdir       = /scratch/ms/co/uid
#@ wall_clock_limit = 00:18:00
#@ queue

Cray PBSpro job - job_pbs.cmd
#!/bin/ksh
#PBS -S /usr/bin/ksh
#PBS -q ns
#PBS -N pi
#PBS -m a
#PBS -l EC_memory_per_task=100mb
#PBS -l EC_threads_per_task=1
#PBS -l walltime=00:18:00

cd /scratch/ms/co/uid

Notes
  • job_type: no equivalent in PBS.
  • comment: no equivalent in PBS.
  • output / error: variables are not accepted in the Job output and Job error file names, so the defaults will be used.
  • ConsumableMemory => EC_memory_per_task.
  • ConsumableCpus => EC_threads_per_task.
  • cpu_limit: no equivalent in PBS.
  • initialdir: no equivalent - replaced with a "cd /scratch/ms/co/uid" at the start of the script.
  • queue: no equivalent in PBS.

The ll2pbs command is provided as an aid only.  The resulting PBS job should be checked carefully before use.  In particular:

  • not all LoadLeveler directives have an equivalent in PBSpro
  • variables cannot be used to specify the job output and error files in PBSpro directives
  • it is not possible to specify a CPU time limit
  • there is no concept of 'soft' and 'hard' limits for the Wall-clock time - effectively, only a 'hard' Wall-clock time limit can be specified
  • there is no knowledge of 'fractional' jobs submitted to queue nf
  • no changes are made to any job geometry keywords
  • no changes are made to the script - in particular, compilation commands are not converted

PBSpro also provides the nqs2pbs command.  This utility converts an existing NQS job script to work with both PBSpro and NQS.  The existing script is copied and PBS directives are inserted prior to each NQS directive in the original script.  See 'man nqs2pbs' for more details.


PBSpro job header keywords


Put all your PBS directives at the top of the script file, above any commands. Any directive after an executable line in the script is ignored. Note that you can pass PBS directives, including the ECMWF custom PBS ones, as options to the 'qsub' command.
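
For example, the following submission (the script name and value are illustrative):

Directives on the qsub command line
% qsub -q np -l EC_total_tasks=4 myscript.sh

is equivalent to submitting a myscript.sh that contains:

#PBS -q np
#PBS -l EC_total_tasks=4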

Keyword                   | LoadLeveler                          | PBSpro
Prefix                    | #@                                   | #PBS
Queue                     | class=<queue>                        | -q <queue>
Job name                  | job_name=<job_name>                  | -N <job_name>
Shell                     | shell=/usr/bin/ksh                   | -S /usr/bin/ksh
Wall-clock limit          | wall_clock_limit=<hh:mm:ss,hh:mm:ss> | -l walltime=<hh:mm:ss>
CPU-time limit            | job_cpu_limit=<hh:mm:ss>             | no equivalent
Initial working directory | initialdir=<path>                    | no equivalent
Job output                | output=<output_file>                 | -o <output_file>
Job error                 | error=<error_file>                   | -e <error_file>
Email notification        | notification=<event>                 | -m <event>
Email user                | notify_user=<email>                  | -M <email>
Set environment variables | environment = <ENV1>=<value1>        | -v <ENV1>=<value1>,<ENV2>=<value2>
Copy environment          | environment = COPY_ALL               | -V
Jobstep end mark          | queue                                | no equivalent

Notes on the PBSpro options:

  • Job name: <job_name> can be a maximum of 15 characters; for longer job names see "-l EC_job_name" below.  If not specified, the default job name is that of the submitted script.
  • Wall-clock limit: there is no concept of a soft wall-clock limit.
  • Initial working directory: the ll2pbs command prepends the initialdir <path> to the Job output and Job error files and adds a "cd <path>" at the start of the job script.
  • Job output and Job error can be joined with the -j <join> option, where <join> can be one of:
    • oe - join output and error and write to output
    • eo - join error and output and write to error
    • n - do not join output and error
    Output and error are only written to the files specified with -o and -e at the end of the job.
  • Email notification: <event> can be any combination of
    • a - send email when the job aborts
    • b - send email when the job begins
    • e - send email when the job ends
    or
    • n - never send email
    The default is "-m a".
  • Copy environment: use -V with caution!

Writing Job output and Job error to the same file

If the Job output and Job error file names specified with the -o and -e options point to the same file then the Job error will overwrite the Job output at the end of the job unless the -j oe option is specified.

In general, to get the Job output and Job error written to the same file it is better to specify only the Job output file and use the -j oe option.
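
For example, a sketch of the directives for a job that collects both streams in a single file (the file name is illustrative):

Joining Job output and Job error
#PBS -N myjob
#PBS -o myjob.out             # write the Job output here...
#PBS -j oe                    # ...and join the Job error to the same file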

ECMWF custom PBSpro directives

In addition to the standard PBSpro directives, ECMWF has defined a number of custom directives to help the user define the geometry of the job. A full list of all ECMWF custom directives can be found at ECMWF PBSpro. Some of the more commonly used options together with their LoadLeveler equivalents are listed in the table:

Keyword                                 | LoadLeveler                            | PBSpro
Number of nodes                         | node = <nodes>                         | -l EC_nodes=<nodes>
Total number of MPI tasks               | total_tasks = <tasks>                  | -l EC_total_tasks=<tasks>
Number of parallel threads per MPI task | parallel_threads = <threads>           | -l EC_threads_per_task=<threads>
Number of MPI tasks per node            | tasks_per_node = <tasks>               | -l EC_tasks_per_node=<tasks>
Consumable memory per MPI task          | resources = ConsumableMemory(<memory>) | -l EC_memory_per_task=<memory>
Use hyperthreading / SMT                | ec_smt = yes / no                      | -l EC_hyperthreads=2 / 1
Job name                                | job_name = <job_name>                  | -l EC_job_name=<job_name>
Billing account                         | account_no = <account>                 | -l EC_billing_account=<account>

Unlike the standard -N option, the EC_job_name value can be longer than 15 characters.


On Cray systems, the term processing element (PE) is more often used to describe the equivalent of an MPI task in LoadLeveler.

Associated with each of the ECMWF custom directives is an environment variable of the same name.  These environment variables can be used in job scripts, in particular to specify the options to the aprun command.
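
For example, the requested geometry can be passed straight through to aprun; this is the pattern used in the job examples below (the executable name is a placeholder):

Using the geometry environment variables
aprun -N $EC_tasks_per_node -n $EC_total_tasks -d $EC_threads_per_task -j $EC_hyperthreads ./my_exe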

PBSpro provides a selection statement via "#PBS -l select=", which is used on other systems to specify the node requirements and job geometry.  This statement is quite complex to use and does not cover all the requirements for the advanced job scheduling used by ECMWF.  ECMWF has therefore disabled the 'select' statement by default and asks users to use the ECMWF PBSpro directives instead.

ECMWF believes that its customised PBSpro directives cover the majority of user requirements.  If you are unable to set the job geometry that you require then please contact the Service Desk.

PBSpro Job examples

Serial job example

  • Serial jobs should be submitted to the ns queue.
  • There is no need to specify any further job geometry requirements.
  • For jobs requiring more than the default 1.25 GBytes of memory per task, the memory requirement should be set with the EC_memory_per_task directive.

Serial job example
#PBS -N Hello_serial
#PBS -q ns

./Hello_serial

Pure OpenMP job example

  • Pure OpenMP jobs can be submitted to either:
    • queue np if EC_threads_per_task > 18 x EC_hyperthreads
    • queue nf if EC_threads_per_task <=  18 x EC_hyperthreads
  • Set EC_total_tasks=1 to specify that the job does not use MPI
  • Set EC_threads_per_task to the number of OpenMP threads required.  This will be the value needed for the OMP_NUM_THREADS environment variable.
  • Choose whether or not to use hyperthreading by setting EC_hyperthreads.

Restriction on the setting of EC_threads_per_task

EC_threads_per_task <= 36 x EC_hyperthreads

If the job is submitted to the np queue then the executable must be run with the aprun command.

Pure OpenMP job example
#PBS -N HelloOpenMP
#PBS -q np
#PBS -l EC_total_tasks=1
#PBS -l EC_threads_per_task=72
#PBS -l EC_hyperthreads=2

export OMP_NUM_THREADS=$EC_threads_per_task
aprun -N $EC_tasks_per_node -n $EC_total_tasks -d $OMP_NUM_THREADS -j $EC_hyperthreads ./HelloOpenMP

For a pure OpenMP fractional job running in the nf queue, the executable can be called directly:

Pure OpenMP fractional job example
#PBS -N HelloOpenMP
#PBS -q nf
#PBS -l EC_total_tasks=1
#PBS -l EC_threads_per_task=8
#PBS -l EC_hyperthreads=2

export OMP_NUM_THREADS=$EC_threads_per_task
./HelloOpenMP

Pure MPI job example

  • Pure MPI jobs can be submitted to either:
    • queue np if EC_total_tasks > 18 x EC_hyperthreads
    • queue nf if EC_total_tasks <= 18 x EC_hyperthreads
  • Set EC_total_tasks to the total number of MPI tasks to be used.
  • Optionally set EC_threads_per_task=1 to specify that no OpenMP threads will be used.
  • Optionally choose the amount of memory per MPI task with the EC_memory_per_task directive.
  • Choose whether or not to use hyperthreading by setting EC_hyperthreads.
  • The executable must be run with the aprun command

Pure MPI job example
#PBS -N HelloMPI
#PBS -q np
#PBS -l EC_total_tasks=72
#PBS -l EC_hyperthreads=2

aprun -N $EC_tasks_per_node -n $EC_total_tasks -j $EC_hyperthreads ./HelloMPI

Hybrid MPI-OpenMP job example

  • Hybrid MPI-OpenMP jobs can be submitted to either:
    • queue np if EC_tasks_per_node x EC_threads_per_task > 18 x EC_hyperthreads
    • queue nf if EC_tasks_per_node x EC_threads_per_task <= 18 x EC_hyperthreads
  • Set EC_total_tasks to the total number of MPI tasks required.
  • Set EC_threads_per_task to the number of OpenMP threads required. This will be the value needed for the OMP_NUM_THREADS environment variable.
  • Optionally choose the amount of memory per MPI task with the EC_memory_per_task directive.
  • Choose whether or not to use hyperthreading by setting EC_hyperthreads.
  • The executable must be run with the aprun command

Hybrid MPI-OpenMP job example
#PBS -N HelloMPI_OpenMP
#PBS -q np
#PBS -l EC_total_tasks=72
#PBS -l EC_threads_per_task=2
#PBS -l EC_hyperthreads=2

export OMP_NUM_THREADS=$EC_threads_per_task
aprun -N $EC_tasks_per_node -n $EC_total_tasks -d $OMP_NUM_THREADS -j $EC_hyperthreads ./HelloMPI_OpenMP

Fractional job example

A fractional job is a job that uses fewer than half of the total resources on a node. 

  • Fractional jobs must be submitted to the nf queue
  • The job must request:
    • less than 60 GBytes of memory and
    • 18 or fewer (logical) CPUs with EC_hyperthreads=1 or
    • 36 or fewer (logical) CPUs with EC_hyperthreads=2.
  • Optionally set the EC_memory_per_task directive to the amount of memory per MPI task required if this is greater than the default.  
  • The executable must be run with the mpiexec command.

Scripts for fractional jobs that run an MPI executable must load the cray-snplauncher module to add the mpiexec executable to the $PATH.  Pure OpenMP fractional jobs should set OMP_NUM_THREADS and then run the executable directly.

Fractional job example
#PBS -N HelloMPI_frac
#PBS -q nf
#PBS -l EC_total_tasks=6
#PBS -l EC_hyperthreads=1

# Load the cray-snplauncher module to add the mpiexec command to $PATH
module load cray-snplauncher

mpiexec -n $EC_total_tasks ./HelloMPI_frac


Multiple program multiple data (MPMD) job example

Multiple program multiple data (MPMD) jobs run either different executables, each possibly requiring a different job geometry, or a number of instances of the same executable, possibly accessing different data.

  • MPMD jobs should be submitted to either the np or nf queue, depending on the job geometry requested
  • The executable must be run with the aprun command.

Multiple program multiple data (MPMD) job example
#PBS -N HelloMPMD
#PBS -q np
#PBS -l EC_total_tasks=12:6
#PBS -l EC_threads_per_task=1:1
#PBS -l EC_hyperthreads=1

# Get the job geometry information into array variables for ease of use
IFS=':'
set -A tasks_per_node $EC_tasks_per_node
set -A total_tasks $EC_total_tasks

# Run model_atm.exe with 12 MPI tasks and model_ocean.exe with 6 MPI tasks
aprun -n ${total_tasks[0]} -N ${tasks_per_node[0]} ./model_atm.exe : \
      -n ${total_tasks[1]} -N ${tasks_per_node[1]} ./model_ocean.exe

What if I need to use only a small number of CPUs?

MPMD jobs needing to use less than half of one node can be submitted to the nf queue.  In this case, the jobs should be launched with mpiexec. Remember also to load the cray-snplauncher module to add the mpiexec executable to the $PATH.
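
For example, a sketch of a small MPMD job in the nf queue (the executable names and task counts are illustrative):

Small MPMD job in the nf queue
#PBS -N SmallMPMD
#PBS -q nf
#PBS -l EC_total_tasks=12
#PBS -l EC_hyperthreads=1

# Load the cray-snplauncher module to add the mpiexec command to $PATH
module load cray-snplauncher

# Run ocean with 4 MPI tasks and air with 8 MPI tasks
mpiexec -n 4 ./ocean : -n 8 ./air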

How can I run multiple binaries on the same node on an MPMD run? (EC-EARTH case)

By default, aprun allocates at least one node for each different binary of an MPMD run. This can be inefficient if one or more binaries need only one or a few CPUs, as each would still occupy a full node, wasting the remaining resources. This is the classic problem for coupled models like EC-EARTH, where the coupler alone would take a full node in a standard MPMD run on the Cray.

Fortunately, there is a way to use the resources more efficiently. Instead of running aprun -n 1 exe1 : -n 1 exe2 ..., the idea is to create a wrapper.sh script that tells aprun what to execute depending on the MPI rank. You also have to export PMI_NO_FORK=1 before launching aprun.

Here is an example of an MPMD run with four different executables, where the two single-rank executables (exec1 and exec2) run together on the first node and the rest run on the remaining nodes. In total 3 nodes are used (tasks per node: 2-24-24), which amounts to 50 tasks (1 (exec1) + 1 (exec2) + 28 (exec3) + 20 (exec4)):

wrapper.sh
#!/bin/ksh

# Put here the paths to the binaries of the MPMD run
binaries=(exec1 exec2 exec3 exec4)

# Define the number of tasks each one of them will use
ntasks=(1 1 28 20)

##########################
export OMP_NUM_THREADS=1

# EC_FARM_ID gives the MPI rank of this copy of the wrapper (see the text above).
# Walk through the task counts until this rank falls into the range belonging
# to one of the binaries, then exec that binary (exec never returns).
sum=0
i=0
for t in ${ntasks[@]}
do
    sum=$((sum + t))
    if [ $EC_FARM_ID -lt $sum ]; then
        exec ${binaries[$i]}
    fi
    i=$((i+1))
done

 And in the job script:

Packed MPMD job example
#PBS -N PackedMPMD
#PBS -q np
#PBS -l EC_nodes=3

export PMI_NO_FORK=1
aprun -n 2 -N 2 -j 1 ./wrapper.sh : -n 48 -N 24 -j 1 ./wrapper.sh

You have to be very careful that the number of tasks given to each executable in wrapper.sh is consistent with the -n and -N values passed to aprun. The ":" separator in aprun's MPMD syntax lets you keep the groups of tasks on separate nodes. If you want to fill all the nodes you can use a single aprun command like so:

aprun -n 72 -N 24 -j 1 ./wrapper.sh 

But then the wrapper would need to be adapted to reflect that the total number of tasks goes up to 72, for example by setting ntasks=(1 1 40 30).
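
For example, assuming the two larger executables are simply scaled up to take the extra tasks, the arrays at the top of wrapper.sh would become:

Adapted wrapper.sh arrays for 72 tasks
# The task counts must now sum to 72 (1 + 1 + 40 + 30)
binaries=(exec1 exec2 exec3 exec4)
ntasks=(1 1 40 30)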


Further reading

  1. EC_ Job Directives Summary
  2. ECMWF Cray PBSpro setup concepts
  3. EC_ Job Directives Quick Start
  4. PBSpro at ECMWF - Cray XC30 Workshop February 2014



10 Comments

  1. You can indeed run MPMD programs in queue nf with mpiexec, e.g. from the man page, even tailored to us:

    "To run an MPMD application consisting of the programs ocean on 4
    processes and air on 8 processes, enter a command line like this
    example.

    % mpiexec -n 4 ocean : -n 8 air"

  2. after running into the following messages ...

    > mpiexec:  unexpected error - no non-PBS mpiexec in PBS_O_PATH
    > mpirun: not found [No such file or directory]
    > aprun: This node is configured to run without ALPS services

    thanks to Christian, mpiexec could be used with:

    module load cray-snplauncher

     

  3. There's a typo in the hybrid mpi-openmp example:

    export OMP_NUM_THREADS=$EC_tasks_per_thread
    should be:
    export OMP_NUM_THREADS=$EC_threads_per_task
    1. Thanks, Glenn. I've corrected this.

  4. I think it should be stressed that ECMWF custom PBSpro directives can be set at qsub level as well (which increases versatility of scripting):

    A submission like : 
    qsub -l EC_total_tasks=4 myscript.sh
    works just as a "#PBS -l EC_total_tasks=4" directive in myscript.sh
    1. I've added this in the document.

  5. I don't think that EC_threads_per_task=0 can be optionally set to specify that no OpenMP threads will be used as stated above.
    1. Thanks, Ernesto. You're right. I've changed it to '1', which is the default, therefore optional.

  6. Is it possible to submit a job with a program which spawns its own worker programs/threads ? This seems similar to the Multiple program multiple data (MPMD) job example above, but I'd rather not specify the different binaries in the job (e.g. the aprun command line), but instead have the first program spawn the rest.

    I'm using the amuse framework (http://amusecode.org/) to couple different parallel codes, one of which is a modified OpenIFS. Currently, the master program starts and manages to spawn the first worker. The worker then fails when it tries to initialize MPI. I have previously run the same setup using Open MPI elsewhere.

    Thu Jan 26 13:16:41 2017: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
    Thu Jan 26 13:16:41 2017: [unset]:_pmi_alps_get_appLayout:pmi_alps_get_apid returned with error: Bad file descriptor
    [Thu Jan 26 13:16:41 2017] [c0-1c0s6n1] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
    MPIR_Init_thread(525):
    MPID_Init(225).......: channel initialization failed
    MPID_Init(598).......:  PMI2 init failed: 1
     
    Program received signal SIGABRT: Process abort signal.
     
    Backtrace for this error:
    #0  0x2AAAABA1AB27
    ... cut ...
    #9  0x2AAAB235BAD7
    #10  0x80B497 in __mpl_init_mod_MOD_mpl_init at mpl_init_mod.F90:153
    #11  0x72DA76 in dr_hook_util_ at dr_hook_util.F90:56
    #12  0x6DDFA2 in __yomhook_MOD_dr_hook_default at yomhook.F90:46
    #13  0x41EA86 in __ifslib_MOD_static_init at ifslib.F90:131 (discriminator 1)
    #14  0x41431E in __openifs_interface_MOD_initialize_code at interface.f90:25
    #15  0x4091A9 in handle_call at worker_code.f90:1288
    #16  0x40F580 in run_loop_sockets at worker_code.f90:581
    #17  0x409798 in amuse_worker_program at worker_code.f90:102

     

    The line which fails in #10 is this: CALL MPI_INIT_THREAD(IREQUIRED,IPROVIDED,IERROR) 

     

    If there is a better place for this question, please let me know.

     

  7. When running an MPMD job, I have the problem that, whether the job finishes normally or one task encounters an error and calls mpi_abort(), the job still keeps running until the wall-clock time limit.

    Is there a way to stop the whole job in these cases, to not waste resources?

    The programs I'm running are one Python coupler task using mpi4py and multiple worker tasks written in Fortran. I use the MPMD scheme to avoid wasting one full node for running the Python coupler.