...

  1. Create a directory for this tutorial so all the exercises and outputs are contained inside:

    No Format
    mkdir ~/batch_tutorial
    cd ~/batch_tutorial


  2. Create and submit a job called simplest.sh with just default settings that runs the command hostname. Can you find the output and inspect it? Where did your job run?

    Expand
    titleSolution

    Using your favourite editor, create a file called simplest.sh with the following content

    Code Block
    languagebash
    titlesimplest.sh
    #!/bin/bash
    hostname

    You can submit it with sbatch:

    No Format
    sbatch simplest.sh

    The job should run shortly. When it finishes, a new file called slurm-<jobid>.out should appear in the same directory. You can check the output with:

    No Format
    $ cat $(ls -r1 slurm-*.out | head -n1)
    ab6-202.bullx
    [ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
    [ECMWF-INFO -ecepilog] This is the ECMWF job Epilogue
    [ECMWF-INFO -ecepilog] +++ Please report issues using the Support portal +++
    [ECMWF-INFO -ecepilog] +++ https://support.ecmwf.int                     +++
    [ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
    [ECMWF-INFO -ecepilog] Run at 2023-10-25T11:31:53 on ecs
    [ECMWF-INFO -ecepilog] JobName                   : simplest.sh
    [ECMWF-INFO -ecepilog] JobID                     : 64273363
    [ECMWF-INFO -ecepilog] Submit                    : 2023-10-25T11:31:36
    [ECMWF-INFO -ecepilog] Start                     : 2023-10-25T11:31:51
    [ECMWF-INFO -ecepilog] End                       : 2023-10-25T11:31:53
    [ECMWF-INFO -ecepilog] QueuedTime                : 15.0
    [ECMWF-INFO -ecepilog] ElapsedRaw                : 2
    [ECMWF-INFO -ecepilog] ExitCode                  : 0:0
    [ECMWF-INFO -ecepilog] DerivedExitCode           : 0:0
    [ECMWF-INFO -ecepilog] State                     : COMPLETED
    [ECMWF-INFO -ecepilog] Account                   : myaccount
    [ECMWF-INFO -ecepilog] QOS                       : ef
    [ECMWF-INFO -ecepilog] User                      : user
    [ECMWF-INFO -ecepilog] StdOut                    : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out
    [ECMWF-INFO -ecepilog] StdErr                    : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out
    [ECMWF-INFO -ecepilog] NNodes                    : 1
    [ECMWF-INFO -ecepilog] NCPUS                     : 2
    [ECMWF-INFO -ecepilog] SBU                       : 0.011
    [ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------

    You can then see that the script has run on a different node than the one you are on.

    If you repeat the operation, you may get your job to run on a different node every time, whichever happens to be free at the time.
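If you want to script the steps above, it can help to capture the job ID instead of reading it off the screen. The following is a minimal sketch, assuming sbatch prints its usual "Submitted batch job <jobid>" message; the helper name extract_jobid is made up for illustration and is not part of Slurm.

```shell
# Hypothetical helper: pull the numeric job ID out of the
# "Submitted batch job <jobid>" message read from standard input.
# Lines that do not match that message are ignored.
extract_jobid() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Typical use on a system with Slurm (not executed here):
#   jobid=$(sbatch simplest.sh | extract_jobid)
#   cat "slurm-${jobid}.out"
```

sbatch also has a --parsable option that prints the job ID on its own (followed by a semicolon and the cluster name where applicable), which may be simpler if no other text is written to standard output on your system.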


  3. Configure your simplest.sh job to direct the output to simplest-<jobid>.out, the error to simplest-<jobid>.err both in the same directory, and the job name to just "simplest". Note you will need to use a special placeholder for the -<jobid>.

    Expand
    titleSolution

    Using your favourite editor, open the simplest.sh job script and add the relevant #SBATCH directives:

    Code Block
    languagebash
    titlesimplest.sh
    #!/bin/bash
    #SBATCH --job-name=simplest
    #SBATCH --output=simplest-%j.out
    #SBATCH --error=simplest-%j.err
    hostname

    You can submit it again with:

    No Format
    sbatch simplest.sh

    After a few moments, you should see the new files appear in your directory (job id will be different than the one displayed here):

    No Format
    $ ls simplest-*.*
    simplest-64274497.err  simplest-64274497.out

    You can check that the job name was also changed in the end of job report:

    No Format
    $ grep -i jobname $(ls -r1 simplest-*.err | head -n1)
    [ECMWF-INFO -ecepilog] JobName                   : simplest



...

  1. Create a new job script sleepy.sh with the contents below:

    Code Block
    languagebash
    titlesleepy.sh
    #!/bin/bash
    sleep 120


  2. Submit sleepy.sh to the batch system and check its status. Once it is running, cancel it and inspect the output.


    Expand
    titleSolution

    Using your favourite editor, create the sleepy.sh job script with the contents above. Then you can submit it with:

    No Format
    sbatch sleepy.sh

    You can then check the state of your job with squeue:

    No Format
    squeue -j <jobid>

    if you use the <jobid> of the job you just submitted, or just:

    No Format
    squeue --me

    to list all your jobs.

    To cancel your job, just run scancel:

    No Format
    scancel <jobid>

    If you inspect the output file from your last job, you will see a message like the following:

    No Format
    slurmstepd: error: *** JOB 64281137 ON ab6-202 CANCELLED AT 2023-10-25T15:40:51 ***



  3. Can you get information about the jobs you have run so far today, including those that have finished already?

    Expand
    titleSolution

    When jobs finish, they will not appear in the squeue output any longer. You can then check the Accounting Database with sacct:

    No Format
    sacct

    With no arguments, this command will show you the list of all jobs run by you on this day. 

    In the output you may see 3 or more entries per job, such as:

    No Format
    JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList 
    ------------ ---------------- --------- ---------- -------- ---------- -------- -------------------- 
    ...
    64281137            sleepy.sh        ef CANCELLED+      0:0   00:00:16        1              ab6-202 
    64281137.ba+            batch            CANCELLED     0:15   00:00:17        1              ab6-202 
    64281137.ex+           extern            COMPLETED      0:0   00:00:16        1              ab6-202 

    The first one corresponds to the job itself. The second one (always named batch) corresponds to the actual job script and the third (named extern) corresponds to the external step used to generate the end of job information. You may have more lines if your job contains more steps, which typically correspond to srun parallel executions.

    If you want to list just the entry for the job itself, you can do:

    No Format
    sacct -X



  4. Can you get information about all the jobs you ran today that were cancelled?

    Expand
    titleSolution

    You can filter jobs by state with the -s option. But if you run it naively:

    No Format
    sacct -X -s CANCELLED

    You will get no output. That is because when using state you must also specify the start and end times of your query period. You can then do something like:

    No Format
    sacct -X -s CANCELLED -S $(date +%Y-%m-%d) -E $(date +%Y-%m-%dT%H:%M:%S)



  5. The default information shown on the screen when querying past jobs is limited. Can you extract the submit, start, and end times of your cancelled jobs today? What about their output and error path? Hint: use the corresponding man page for all the options.

    Expand
    titleSolution

    You can use the following command to see all the possible output fields you can query for:

    No Format
    sacct -e

    While there are dedicated fields for the job submit, start and end times, there is none for the output and error paths. However, the AdminComment field is used to carry that information. Since it is a long field, you may want to append a length to the field name (for example AdminComment%150) to avoid truncation:

    No Format
    sacct -X -s CANCELLED -S $(date +%Y-%m-%d) -E $(date +%Y-%m-%dT%H:%M:%S) -o jobid,jobname,state,submit,start,end,AdminComment%150

    or you can also ask for a parsable output:

    No Format
    sacct -X -s CANCELLED -S $(date +%Y-%m-%d) -E $(date +%Y-%m-%dT%H:%M:%S) -o jobid,jobname,state,submit,start,end,AdminComment -p
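The parsable output is convenient for further processing because the fields are separated by "|" and the first line carries the field names. Below is a small sketch of how it could be post-processed with awk; the helper name summarise_jobs is hypothetical, and it assumes the header uses the names JobID, Submit and End.

```shell
# Hypothetical helper: read "sacct -p" style output (header line first,
# fields separated by "|") and print "jobid: submit -> end" for each job.
summarise_jobs() {
    awk -F'|' '
        # Map each header name to its column number, then skip the header.
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        { printf "%s: %s -> %s\n", $(col["JobID"]), $(col["Submit"]), $(col["End"]) }
    '
}

# Typical use on a system with Slurm (not executed here):
#   sacct -X -S $(date +%Y-%m-%d) -o jobid,jobname,state,submit,start,end -p | summarise_jobs
```

Looking columns up by name rather than by position means the helper keeps working if you change the order of the fields passed to -o.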



Common pitfalls

We will now attempt to troubleshoot some common issues.

  1. Create a new job script broken1.sh with the contents below. Try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?:

    Code Block
    languagebash
    titlebroken1.sh
    #SBATCH --job-name = broken 1
    #SBATCH --output = broken1-%J.out
    #SBATCH --error = broken1-%J.out
    #SBATCH --qos = express
    #SBATCH --time = 00:05:00 
    
    # This is the job
    echo "I was broken!"
    sleep 30


    Expand
    titleSolution

    The job above has the following problems:

    • There is no shebang at the beginning of the script.
    • There should be no spaces around the = sign in the directives
    • There should be no space in the job name
    • QoS "express" does not exist

    Here is an amended version:

    Code Block
    languagebash
    titlebroken1_fixed.sh
    #!/bin/bash
    #SBATCH --job-name=broken1
    #SBATCH --output=broken1-%J.out
    #SBATCH --error=broken1-%J.out
    #SBATCH --time=00:05:00
    
    # This is the job
    echo "I was broken!"
    sleep 30

    Note that the QoS line was removed, but you may also use the following if running on ECS:

    No Format
    #SBATCH --qos=ef

    or alternatively, if on Atos HPCF:

    No Format
    #SBATCH --qos=nf

    Check that the job actually ran and generated the expected output:

    No Format
    $ grep -v ECMWF-INFO  $(ls -1 broken1-*.out | head -n1)
    I was broken!
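Two of the problems above (the missing shebang and the spaces around the = sign) can be spotted before submitting. The following is a rough sketch of such a check; check_job_script is a made-up helper, not a Slurm tool, and it only looks for these two mistakes.

```shell
# Hypothetical pre-submission check (not part of Slurm): flags a missing
# shebang and #SBATCH directives with spaces around the "=" sign.
check_job_script() {
    local script=$1
    local status=0
    # The first line of a job script must start with "#!".
    if ! head -n 1 "$script" | grep -q '^#!'; then
        echo "$script: missing shebang on the first line"
        status=1
    fi
    # Print directives such as "#SBATCH --qos = ef" with their line numbers.
    if grep -nE '^#SBATCH[[:space:]]+--[A-Za-z-]+([[:space:]]+=|=[[:space:]])' "$script"; then
        echo "$script: remove the spaces around '=' in the lines above"
        status=1
    fi
    return "$status"
}

# Typical use (not executed here):
#   check_job_script broken1.sh && sbatch broken1.sh
```

A check like this cannot tell whether a QoS exists or whether the requested time is allowed; those errors are still only reported by sbatch itself.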



  2. Create a new job script broken2.sh with the contents below. Try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?:

    Code Block
    languagebash
    titlebroken2.sh
    #!/bin/bash
    #SBATCH --job-name=broken2
    #SBATCH --output=broken2-%J.out
    #SBATCH --error=broken2-%J.out
    #SBATCH --qos=ns
    #SBATCH --time=10-00
    
    # This is the job
    echo "I was broken!"


    Expand
    titleSolution

    The job above has the following problems:

    • QoS "ns" does not exist. Either remove the directive to use the default QoS, or use the corresponding QoS on ECS (ef) or HPCF (nf)
    • The time requested is 10 days, which is longer than the maximum allowed. It was probably meant to be 10 minutes

    Here is an amended version:

    Code Block
    languagebash
    titlebroken2.sh
    #!/bin/bash
    #SBATCH --job-name=broken2
    #SBATCH --output=broken2-%J.out
    #SBATCH --error=broken2-%J.out
    #SBATCH --time=10:00
    
    # This is the job
    echo "I was broken!"

    Again, note that the QoS line was removed, but you may also use the following if running on ECS:

    No Format
    #SBATCH --qos=ef

    or alternatively, if on Atos HPCF:

    No Format
    #SBATCH --qos=nf

    Check that the job actually ran and generated the expected output:

    No Format
    $ grep -v ECMWF-INFO  $(ls -1 broken2-*.out | head -n1)
    I was broken!
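The difference between 10-00 and 10:00 is easy to miss. As an illustration, the sketch below converts a --time value to seconds following the formats listed in the sbatch man page (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds). The helper name slurm_time_to_seconds is hypothetical, and it does not validate its input.

```shell
# Hypothetical helper: convert a Slurm --time value to seconds.
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0; rest = $0
        dash = index($0, "-")
        if (dash) {                       # "days-..." form
            days = substr($0, 1, dash - 1)
            rest = substr($0, dash + 1)
        }
        n = split(rest, p, ":")
        if (dash)        { h = p[1]; m = p[2]; s = p[3] }   # after the dash, hours come first
        else if (n == 1) { h = 0;    m = p[1]; s = 0 }      # minutes
        else if (n == 2) { h = 0;    m = p[1]; s = p[2] }   # minutes:seconds
        else             { h = p[1]; m = p[2]; s = p[3] }   # hours:minutes:seconds
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 10-00   # 10 days
slurm_time_to_seconds 10:00   # 10 minutes
```

The two calls print 864000 and 600 respectively, which makes it clear why the original request was rejected.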



  3. Create a new job script broken3.sh with the contents below. Try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?:

    Code Block
    languagebash
    titlebroken3.sh
    #!/bin/bash
    #SBATCH --job-name=broken3
    #SBATCH --chdir=$SCRATCH
    #SBATCH --output=broken3output/broken3-%J.out
    #SBATCH --error=broken3output/broken3-%J.out
    
    # This is the job
    echo "I was broken!"


    Expand
    titleSolution

    The job above has the following problems:

    • Variables are not expanded in job directives. You must specify your paths explicitly
    • The directory where the output and error files will go must exist beforehand. Otherwise the job will fail but you will not get any hint as to what may have happened to the job. The only hint would be if checking sacct:

      No Format
      $ sacct -X --name=broken3
      JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList 
      ------------ ---------------- --------- ---------- -------- ---------- -------- -------------------- 
      64281800              broken3        ef     FAILED     0:53   00:00:02        1              ad6-201 


    You will need to create the output directory with:

    No Format
    mkdir -p $SCRATCH/broken3output/

    Here is an amended version of the job:

    Code Block
    languagebash
    titlebroken3.sh
    #!/bin/bash
    #SBATCH --job-name=broken3
    #SBATCH --chdir=/scratch/<your_user_id>
    #SBATCH --output=broken3output/broken3-%J.out
    #SBATCH --error=broken3output/broken3-%J.out
    
    # This is the job
    echo "I was broken!"

    Check that the job actually ran and generated the expected output:

    No Format
    $ grep -v ECMWF-INFO  $(ls -1 $SCRATCH/broken3output/broken3-*.out | head -n1)
    I was broken!

    You may clean up the output directory with:

    No Format
    rm -rf $SCRATCH/broken3output
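Since a missing output directory makes the job fail without any message, it may be worth creating it automatically before submitting. The following is a sketch under a few assumptions: make_output_dirs is a made-up helper, relative paths are taken to be relative to the job's working directory (the value given to --chdir), and the directory part of the path contains no filename patterns such as %J.

```shell
# Hypothetical helper: create the directories named in the --output and
# --error directives of a job script. Relative paths are resolved against
# the base directory passed as second argument.
make_output_dirs() {
    local script=$1
    local base=$2
    sed -n -E 's/^#SBATCH[[:space:]]+--(output|error)=//p' "$script" |
    while read -r path; do
        case $path in
            /*) mkdir -p "$(dirname "$path")" ;;          # absolute path
            *)  mkdir -p "$base/$(dirname "$path")" ;;    # relative to the working directory
        esac
    done
}

# Typical use (not executed here):
#   make_output_dirs broken3.sh "$SCRATCH" && sbatch broken3.sh
```

Here the shell expands $SCRATCH on the command line before the helper sees it, which is why the variable can be used in this call even though it cannot be used inside the #SBATCH directives.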