  1. Create a new job script broken1.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?

    Code Block
    languagebash
    titlebroken1.sh
    collapsetrue
    #SBATCH --job-name = broken 1
    #SBATCH --output = broken1-%J.out
    #SBATCH --error = broken1-%J.out
    #SBATCH --qos = express
    #SBATCH --time = 00:05:00 
    
    echo "I was broken!"


    Expand
    titleSolution

    The job above has the following problems:

    • There is no shebang at the beginning of the script.
    • There should be no spaces around the "=" sign in the directives.
    • There should be no space in the job name.
    • QoS "express" does not exist.

    Here is an amended version that follows best practices:

    Code Block
    languagebash
    titlebroken1_fixed.sh
    #!/bin/bash
    #SBATCH --job-name=broken1
    #SBATCH --output=broken1-%J.out
    #SBATCH --error=broken1-%J.out
    #SBATCH --time=00:05:00
    
    echo "I was broken!"

    Note that the QoS line was removed, but you may also use the following if running on ECS:

    No Format
    #SBATCH --qos=ef

    or alternatively, if on the Atos HPCF:

    No Format
    #SBATCH --qos=nf

    Check that the job actually ran and generated the expected output:

    No Format
    $ grep -v ECMWF-INFO  $(ls -1 broken1-*.out | tail -n1)
    I was broken!
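The first two problems listed above can be caught before submission. The following is a minimal sketch of such a check; `check_job_script` is a hypothetical helper written for this exercise, not a Slurm command, and it only looks for a missing shebang and for blanks around the "=" sign in `#SBATCH` lines.

```bash
#!/bin/bash
# Hypothetical helper (not part of Slurm): flag two of the directive mistakes
# discussed above before the script is passed to sbatch.
check_job_script() {
    local script=$1 problems=0

    # The first line of the script must be a shebang such as #!/bin/bash
    if ! head -n 1 "$script" | grep -q '^#!'; then
        echo "missing shebang on first line"
        problems=$((problems + 1))
    fi

    # Directives of the form --option=value must have no blanks around "="
    if grep -q -E '^#SBATCH[[:space:]]+--[A-Za-z-]+([[:space:]]+=|=[[:space:]])' "$script"; then
        echo "spaces around '=' in an #SBATCH directive"
        problems=$((problems + 1))
    fi

    # Return success only when nothing was reported
    [ "$problems" -eq 0 ]
}
```

For example, `check_job_script broken1.sh && sbatch broken1.sh` would submit only when nothing is reported. The check does not know which QoS names exist on a given system, so an invalid QoS is still only caught by sbatch itself.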



  2. Create a new job script broken2.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?

    Code Block
    languagebash
    titlebroken2.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --job-name=broken2
    #SBATCH --output=broken2-%J.out
    #SBATCH --error=broken2-%J.out
    #SBATCH --qos=ns
    #SBATCH --time=10-00
    
    echo "I was broken!"


    Expand
    titleSolution

    The job above has the following problems:

    • QoS "ns" does not exist. Either remove the line to use the default QoS, or use the corresponding QoS on ECS (ef) or HPCF (nf).
    • The time requested is 10 days, which is longer than the maximum allowed. It was probably meant to be 10 minutes.

    Here is an amended version:

    Code Block
    languagebash
    titlebroken2.sh
    #!/bin/bash
    #SBATCH --job-name=broken2
    #SBATCH --output=broken2-%J.out
    #SBATCH --error=broken2-%J.out
    #SBATCH --time=10:00
    
    echo "I was broken!"

    Again, note that the QoS line was removed, but you may also use the following if running on ECS:

    No Format
    #SBATCH --qos=ef

    or alternatively, if on the Atos HPCF:

    No Format
    #SBATCH --qos=nf

    Check that the job actually ran and generated the expected output:

    No Format
    $ grep -v ECMWF-INFO  $(ls -1 broken2-*.out | tail -n1)
    I was broken!
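The time problem comes from how Slurm reads the value: the accepted formats are minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds, so a dash introduces a number of days. The sketch below converts a `--time` value into seconds to make the difference visible. `slurm_time_to_seconds` is a hypothetical helper written for illustration, not a Slurm command.

```bash
#!/bin/bash
# Illustrative helper (not a Slurm command): convert a --time value to seconds.
slurm_time_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0
    local -a f

    if [[ $spec == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        IFS=: read -r hours minutes seconds <<< "${spec#*-}"
    else
        # minutes | minutes:seconds | hours:minutes:seconds
        IFS=: read -r -a f <<< "$spec"
        case ${#f[@]} in
            1) minutes=${f[0]} ;;
            2) minutes=${f[0]}; seconds=${f[1]} ;;
            3) hours=${f[0]}; minutes=${f[1]}; seconds=${f[2]} ;;
            *) echo "unrecognised time format: $spec" >&2; return 1 ;;
        esac
    fi

    # 10# forces base 10 so that values such as "08" are not read as octal;
    # fields that were not given default to 0
    echo $(( 10#${days:-0} * 86400 + 10#${hours:-0} * 3600 \
           + 10#${minutes:-0} * 60 + 10#${seconds:-0} ))
}
```

With this helper, `slurm_time_to_seconds 10-00` prints 864000 (10 days), while `slurm_time_to_seconds 10:00` prints 600 (10 minutes).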



  3. Create a new job script broken3.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?

    Code Block
    languagebash
    titlebroken3.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --job-name=broken3
    #SBATCH --chdir=$SCRATCH
    #SBATCH --output=broken3output/broken3-%J.out
    #SBATCH --error=broken3output/broken3-%J.out
    
    echo "I was broken!"


    Expand
    titleSolution

    The job above has the following problems:

    • Variables are not expanded in job directives. You must specify your paths explicitly.
    • The directory where the output and error files will go must exist beforehand. Otherwise the job will fail without producing any output that explains why. The only hint is the FAILED state reported by sacct:

      No Format
      $ sacct -X --name=broken3
      JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList 
      ------------ ---------------- --------- ---------- -------- ---------- -------- -------------------- 
      64281800              broken3        ef     FAILED     0:53   00:00:02        1              ad6-201 


    You will need to create the output directory with:

    No Format
    mkdir -p $SCRATCH/broken3output/

    Here is an amended version of the job:

    Code Block
    languagebash
    titlebroken3.sh
    #!/bin/bash
    #SBATCH --job-name=broken3
    #SBATCH --chdir=/scratch/<your_user_id>
    #SBATCH --output=broken3output/broken3-%J.out
    #SBATCH --error=broken3output/broken3-%J.out
    
    echo "I was broken!"

    Check that the job actually ran and generated the expected output:

    No Format
    $ grep -v ECMWF-INFO  $(ls -1 $SCRATCH/broken3output/broken3-*.out | tail -n1)
    I was broken!

    You may clean up the output directory with:

    No Format
    rm -rf $SCRATCH/broken3output
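Creating the output directory can also be automated before submission. The following is a minimal sketch, assuming the directives are written as `--output=<path>` and `--error=<path>` and that relative paths are resolved against the job's working directory. `prepare_output_dirs` is a hypothetical helper, not a Slurm command.

```bash
#!/bin/bash
# Hypothetical helper (not part of Slurm): create the directories named in the
# --output and --error directives of a job script.
#   $1: job script
#   $2: working directory the job will run in (what --chdir points to)
prepare_output_dirs() {
    local script=$1 workdir=$2 path dir

    # Print the value after --output= or --error= for each matching #SBATCH line
    sed -n -E 's/^#SBATCH[[:space:]]+--(output|error)=([^[:space:]]+).*/\2/p' "$script" |
    while read -r path; do
        dir=$(dirname "$path")
        case $dir in
            /*) mkdir -p "$dir" ;;            # absolute path: use as is
            *)  mkdir -p "$workdir/$dir" ;;   # relative path: resolve against workdir
        esac
    done
}
```

A possible use is `prepare_output_dirs broken3.sh "$SCRATCH"` followed by `sbatch --chdir="$SCRATCH" broken3.sh`. Unlike in directives, the variable is expanded here because the shell processes the command line before sbatch sees it, and options given on the command line take precedence over those in the script.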



  4. Create a new job script broken4.sh with the contents below and try to submit the job. You should not see the message in the output. What happened? Can you fix the job and keep trying until it runs successfully?

    Code Block
    languagebash
    titlebroken4.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --job-name=broken4
    #SBATCH --output=broken4-%J.out
    
    ls $FOO/bar
    echo "I should not be here"


    Expand
    titleSolution

    The job above has the following problems:

    • The FOO variable is undefined when used. Undefined variables often lead to unexpected failures that are not always easy to spot.
    • Even if FOO were defined as "", the ls command would fail, yet the job would keep running and eventually finish successfully from Slurm's point of view. It should instead have failed and stopped at the first error.

    Here is an amended version of the job following best practices:

    Code Block
    languagebash
    titlebroken4.sh
    #!/bin/bash
    #SBATCH --output=broken4-%J.out
    
    set -x # echo script lines as they are executed
    set -e # stop the shell on first error
    set -u # fail when using an undefined variable
    set -o pipefail # If any command in a pipeline fails, that return code will be used as the return code of the whole pipeline
    
    ls $FOO/bar
    echo "I should not be here"

    With the extra shell options, we get extra information in the output about the commands being executed, and we ensure that the job stops at the first error (non-zero exit code) or when an undefined variable is used.

    Info
    titleBest practices

    Most examples in this tutorial omit the extra shell options for simplicity, but you should always include them in your production jobs.
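The effect of the options can be tried outside Slurm. The sketch below runs the two commands from broken4.sh in a child shell, once without and once with the strict options, and reports the exit status of each run. The names `body`, `run_relaxed` and `run_strict` are made up for this illustration, and FOO is deliberately removed from the environment.

```bash
#!/bin/bash
# The two commands from the job; single quotes keep $FOO unexpanded here so
# that it is evaluated by the child shell.
body='ls $FOO/bar; echo "I should not be here"'

# Run without the extra options: ls fails, echo still runs, exit status is 0
run_relaxed() {
    local status=0
    env -u FOO bash -c "$body" > /dev/null 2>&1 || status=$?
    echo "$status"
}

# Run with the extra options: the shell stops at the undefined variable
run_strict() {
    local status=0
    env -u FOO bash -c "set -euo pipefail; $body" > /dev/null 2>&1 || status=$?
    echo "$status"
}

echo "without options: exit status $(run_relaxed)"
echo "with options:    exit status $(run_strict)"
```

The first run reports 0 even though ls failed, which is what Slurm would record as a successful job. The second run reports a non-zero status, so the job would be marked as failed.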


