...

  1. Download and compile the code in your Atos HPCF or ECS shell session with the following commands:

    No Format
    module load prgenv/gnu hpcx-openmpi
    wget https://git.ecdf.ed.ac.uk/dmckain/xthi/-/raw/master/xthi.c
    mpicc -o xthi -fopenmp xthi.c -lnuma


  2. Run the program interactively to familiarise yourself with the output:

    No Format
    $ ./xthi
    Host=ac6-200  MPI Rank=0  CPU=128  NUMA Node=0  CPU Affinity=0,128

    As you can see, only 1 process and 1 thread are run, and they may run on one of the two virtual cores assigned to the session (both corresponding to the same physical CPU). If you try to run with 4 OpenMP threads, you will see that they effectively fight each other for those same two cores, impacting the performance of your application but not affecting anyone else on the login node:

    No Format
    $ OMP_NUM_THREADS=4 ./xthi
    Host=ac6-200  MPI Rank=0  OMP Thread=0  CPU=128  NUMA Node=0  CPU Affinity=0,128
    Host=ac6-200  MPI Rank=0  OMP Thread=1  CPU=  0  NUMA Node=0  CPU Affinity=0,128
    Host=ac6-200  MPI Rank=0  OMP Thread=2  CPU=128  NUMA Node=0  CPU Affinity=0,128
    Host=ac6-200  MPI Rank=0  OMP Thread=3  CPU=  0  NUMA Node=0  CPU Affinity=0,128
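    Even without xthi, you can inspect the CPU affinity of your current shell session directly from /proc (a quick side check, not part of the original exercise):

    ```shell
    # Show which CPUs the current shell is allowed to run on; on a login node
    # this is typically a short list such as "0,128" (two hyperthread siblings
    # of the same physical core).
    grep Cpus_allowed_list /proc/self/status
    ```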


  3. Create a new job script fractional.sh to run xthi with 2 MPI tasks and 2 OpenMP threads, submit it and check the output to ensure the right number of tasks and threads were spawned. 

    Here is a job template to start with:

    Code Block
    languagebash
    titlefractional.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --output=fractional.out
    # Add here the missing SBATCH directives for the relevant resources
    
    # Add here the line to run xthi
    # Hint: use srun
    


    Expand
    titleSolution

    Using your favourite editor, create a file called fractional.sh with the following content:

    Code Block
    languagebash
    titlefractional.sh
    #!/bin/bash
    #SBATCH --output=fractional.out
    # Add here the missing SBATCH directives for the relevant resources
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=2
    
    # Add here the line to run xthi
    # Hint: use srun
    srun -c $SLURM_CPUS_PER_TASK ./xthi

    You need to request 2 tasks and 2 CPUs per task for the job. Then use srun to spawn the parallel run; it inherits the job geometry requested, except for cpus-per-task, which must be passed to srun explicitly.
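    As a sketch of that inheritance, inside the job script Slurm exposes the requested geometry through environment variables (the values below assume the 2x2 request above; the fallback defaults are purely illustrative for running outside a job):

    ```shell
    # Inside a Slurm job these variables reflect the #SBATCH requests;
    # the :-2 fallbacks are only so the line also works outside a job.
    echo "tasks=${SLURM_NTASKS:-2} cpus_per_task=${SLURM_CPUS_PER_TASK:-2}"
    ```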

    You can submit it with sbatch:

    No Format
    sbatch fractional.sh

    The job should run shortly. When it finishes, a new file called fractional.out should appear in the same directory. You can check the relevant output with:

    No Format
    grep -v ECMWF-INFO fractional.out

    You should see an output similar to:

    No Format
    $ grep -v ECMWF-INFO fractional.out
    Host=ad6-202  MPI Rank=0  OMP Thread=0  CPU=  5  NUMA Node=0  CPU Affinity=5,133
    Host=ad6-202  MPI Rank=0  OMP Thread=1  CPU=133  NUMA Node=0  CPU Affinity=5,133
    Host=ad6-202  MPI Rank=1  OMP Thread=0  CPU=137  NUMA Node=0  CPU Affinity=9,137
    Host=ad6-202  MPI Rank=1  OMP Thread=1  CPU=  9  NUMA Node=0  CPU Affinity=9,137


    Info
    titleSrun automatic cpu binding

    You can see that srun automatically applies some binding of cores to tasks, although perhaps not the best. If you instructed srun to avoid any CPU binding with --cpu-bind=none, you would see something like:

    No Format
    $ grep -v ECMWF-INFO fractional.out
    Host=aa6-203  MPI Rank=0  OMP Thread=0  CPU=136  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=0  OMP Thread=1  CPU=  8  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=0  OMP Thread=2  CPU=  8  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=0  OMP Thread=3  CPU=  4  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=1  OMP Thread=0  CPU=132  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=1  OMP Thread=1  CPU=  4  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=1  OMP Thread=2  CPU=132  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=1  OMP Thread=3  CPU=132  NUMA Node=0  CPU Affinity=4,8,132,136
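    One way to quantify that contention is to count how many distinct CPUs the threads actually landed on (a hypothetical one-liner over fractional.out; the sed pattern assumes the xthi output format shown above):

    ```shell
    # Extract the CPU= field from each xthi line and count occurrences per CPU;
    # in the unbound run above, 8 threads share only 4 distinct CPUs.
    grep -v ECMWF-INFO fractional.out \
      | sed -n 's/.*CPU=[[:space:]]*\([0-9]*\).*/\1/p' \
      | sort -n | uniq -c
    ```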




  4. Can you ensure each one of those processes and OpenMP threads runs on a single physical core, without exploiting the hyperthreading, for optimal performance?

    Expand
    titleSolution

    In order to ensure each thread gets its own core, you can use the environment variable OMP_PLACES=threads.

    Then, to make sure only physical cores are used, you need to add the --hint=nomultithread directive:

    Code Block
    languagebash
    titlefractional.sh
    #!/bin/bash
    #SBATCH --output=fractional.out
    # Add here the missing SBATCH directives for the relevant resources
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=2
    #SBATCH --hint=nomultithread
    
    # Add here the line to run xthi
    # Hint: use srun
    export OMP_PLACES=threads
    srun -c $SLURM_CPUS_PER_TASK ./xthi

    You can submit the modified job with sbatch:

    No Format
    sbatch fractional.sh

    You should see an output similar to the following, where each thread runs on a different physical core, with a CPU number lower than 128:

    No Format
    $ grep -v ECMWF-INFO fractional.out
    Host=ad6-201  MPI Rank=0  OMP Thread=0  CPU=18  NUMA Node=1  CPU Affinity=18
    Host=ad6-201  MPI Rank=0  OMP Thread=1  CPU=20  NUMA Node=1  CPU Affinity=20
    Host=ad6-201  MPI Rank=1  OMP Thread=0  CPU=21  NUMA Node=1  CPU Affinity=21
    Host=ad6-201  MPI Rank=1  OMP Thread=1  CPU=22  NUMA Node=1  CPU Affinity=22
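    If you want to verify this automatically rather than by eye, a hypothetical sanity check could assert that every thread reports a unique CPU number below 128 (the sed pattern assumes the xthi output format shown above):

    ```shell
    # Succeed only if every thread sits on its own CPU and all CPU numbers
    # are below 128, i.e. one physical core each and no hyperthreads.
    grep -v ECMWF-INFO fractional.out \
      | sed -n 's/.*CPU=[[:space:]]*\([0-9]*\).*/\1/p' \
      | awk 'seen[$1]++ {dup=1} $1>=128 {ht=1} END {exit (dup || ht)}' \
      && echo "binding OK: unique physical cores only"
    ```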



Running parallel jobs - HPCF only

Info
titleReference Documentation

HPC2020: Submitting a parallel job

HPC2020: Affinity

So far we have only run serial jobs. You may also want to run small parallel jobs, using multiple threads, multiple processes, or both. Examples of this are OpenMP and MPI programs. We call this kind of small parallel job "fractional", because it runs on a fraction of a node, sharing the node with other users.

If you have followed this tutorial so far, you will have realised that ECS users may run very small parallel jobs on the default ef queue, whereas HPCF users may run slightly bigger jobs (up to half a GPIL node) on the default nf queue.

For these tests we will use David McKain's version of the Cray xthi code to visualise how process and thread placement takes place.