...

  1. Download and compile the code in your Atos HPCF or ECS shell session with the following commands:

    No Format
    module load prgenv/gnu hpcx-openmpi
    wget https://git.ecdf.ed.ac.uk/dmckain/xthi/-/raw/master/xthi.c
    mpicc -o xthi -fopenmp xthi.c -lnuma
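
    If the compilation fails, a quick optional first check is to confirm that the required modules are actually loaded in your session:

    No Format
    module list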


  2. Run the program interactively to familiarise yourself with the output:

    No Format
    $ ./xthi
    Host=ac6-200  MPI Rank=0  CPU=128  NUMA Node=0  CPU Affinity=0,128

    As you can see, only 1 process and 1 thread are run, and they may run on either of the two virtual cores assigned to your session (which correspond to the same physical CPU). If you try to run with 4 OpenMP threads, you will see they effectively fight each other for those same two cores, impacting the performance of your application but not that of anyone else on the login node:

    No Format
    $ OMP_NUM_THREADS=4 ./xthi
    Host=ac6-200  MPI Rank=0  OMP Thread=0  CPU=128  NUMA Node=0  CPU Affinity=0,128
    Host=ac6-200  MPI Rank=0  OMP Thread=1  CPU=  0  NUMA Node=0  CPU Affinity=0,128
    Host=ac6-200  MPI Rank=0  OMP Thread=2  CPU=128  NUMA Node=0  CPU Affinity=0,128
    Host=ac6-200  MPI Rank=0  OMP Thread=3  CPU=  0  NUMA Node=0  CPU Affinity=0,128
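
    If you want to double-check the affinity of your interactive session independently of xthi, the standard Linux taskset utility reports it as well (an optional check; taskset is assumed to be available, as it usually is on Linux systems):

    No Format
    taskset -cp $$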


  3. Create a new job script fractional.sh to run xthi with 2 MPI tasks and 2 OpenMP threads, submit it and check the output to ensure the right number of tasks and threads were spawned. 

    Here is a job template to start with:

    Code Block
    languagebash
    titlefractional.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --output=fractional.out
    # Add here the missing SBATCH directives for the relevant resources
    
    # Add here the line to run xthi
    # Hint: use srun
    


    Expand
    titleSolution

    Using your favourite editor, create a file called fractional.sh with the following content:

    Code Block
    languagebash
    titlefractional.sh
    #!/bin/bash
    #SBATCH --output=fractional.out
    # Add here the missing SBATCH directives for the relevant resources
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=2
    
    # Add here the line to run xthi
    # Hint: use srun
    srun -c $SLURM_CPUS_PER_TASK ./xthi

    You need to request 2 tasks and 2 cpus per task in the job. Then srun spawns the parallel run, inheriting the job geometry requested, except for the cpus per task, which must be passed explicitly to srun.
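
    Note that the script does not set OMP_NUM_THREADS: the OpenMP runtime then typically starts one thread per cpu available in each task's affinity mask, which is what gives the 2 threads per task here. If your own application expects the variable to be set explicitly, a minimal sketch of the addition would be:

    Code Block
    languagebash
    # Optional: pin the OpenMP thread count to the cpus allocated per task
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    srun -c $SLURM_CPUS_PER_TASK ./xthi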

    You can submit it with sbatch:

    No Format
    sbatch fractional.sh

    The job should run shortly. When finished, a new file called fractional.out should appear in the same directory. You can check the relevant output with:

    No Format
    grep -v ECMWF-INFO fractional.out

    You should see an output similar to:

    No Format
    $ grep -v ECMWF-INFO fractional.out
    Host=ad6-202  MPI Rank=0  OMP Thread=0  CPU=  5  NUMA Node=0  CPU Affinity=5,133
    Host=ad6-202  MPI Rank=0  OMP Thread=1  CPU=133  NUMA Node=0  CPU Affinity=5,133
    Host=ad6-202  MPI Rank=1  OMP Thread=0  CPU=137  NUMA Node=0  CPU Affinity=9,137
    Host=ad6-202  MPI Rank=1  OMP Thread=1  CPU=  9  NUMA Node=0  CPU Affinity=9,137


    Info
    titleSrun automatic cpu binding

    You can see that srun automatically ensures a certain binding of cores to tasks. If you were to instruct srun to avoid any cpu binding with --cpu-bind=none, you would see something like:

    No Format
    $ grep -v ECMWF-INFO fractional.out
    Host=aa6-203  MPI Rank=0  OMP Thread=0  CPU=136  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=0  OMP Thread=1  CPU=  8  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=1  OMP Thread=0  CPU=132  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=1  OMP Thread=1  CPU=  4  NUMA Node=0  CPU Affinity=4,8,132,136

    Here all processes and threads could run on any of the cores assigned to the job, potentially hopping from cpu to cpu during the program's execution.
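
    For reference, the srun line producing such an unbound run would just be the one from the solution with that extra option added (shown here as a variant, not something you need for this exercise):

    No Format
    srun -c $SLURM_CPUS_PER_TASK --cpu-bind=none ./xthi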



  4. Can you ensure each one of the OpenMP threads runs on a single physical core, without exploiting the hyperthreading, for optimal performance?

    Expand
    titleSolution

    In order to ensure each thread gets its own core, you can use the environment variable OMP_PLACES=threads.

    Then, to make sure only physical cores are used for best performance, add the --hint=nomultithread directive:

    Code Block
    languagebash
    titlefractional.sh
    #!/bin/bash
    #SBATCH --output=fractional.out
    # Add here the missing SBATCH directives for the relevant resources
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=2
    #SBATCH --hint=nomultithread
    
    # Add here the line to run xthi
    # Hint: use srun
    export OMP_PLACES=threads
    srun -c $SLURM_CPUS_PER_TASK ./xthi

    You can submit the modified job with sbatch:

    No Format
    sbatch fractional.sh

    You should see an output similar to the following, where each thread runs on its own core, with a cpu number lower than 128:

    No Format
    $ grep -v ECMWF-INFO fractional.out
    Host=ad6-201  MPI Rank=0  OMP Thread=0  CPU=18  NUMA Node=1  CPU Affinity=18
    Host=ad6-201  MPI Rank=0  OMP Thread=1  CPU=20  NUMA Node=1  CPU Affinity=20
    Host=ad6-201  MPI Rank=1  OMP Thread=0  CPU=21  NUMA Node=1  CPU Affinity=21
    Host=ad6-201  MPI Rank=1  OMP Thread=1  CPU=22  NUMA Node=1  CPU Affinity=22



...

When running in such a configuration, your job gets exclusive use of the nodes it runs on, so external interference is minimised. It is therefore important that the allocated resources are used efficiently.

Here is a very simplified diagram of the Atos HPCF node:

[Diagram: Atos HPCF AMD Rome simplified architecture]
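
If you want to inspect this topology yourself from a shell, standard Linux tools report the same information; for instance (the exact fields may vary slightly between systems):

No Format
lscpu | grep -E 'Thread|Core|Socket|NUMA'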

So far we have only run serial jobs. You may also want to run small parallel jobs, using multiple threads, multiple processes, or both; MPI and OpenMP programs are examples of this. We call this kind of small parallel job "fractional", because it runs on a fraction of a node, sharing the node with other users.

If you have followed this tutorial so far, you will have realised that ECS users may run very small parallel jobs on the default ef QoS, whereas HPCF users may run slightly bigger ones (up to half a GPIL node) on the default nf QoS.
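
For reference, such a fractional job simply relies on the default QoS, but you could also name it explicitly in the script. A minimal sketch for HPCF (on ECS the equivalent line would use ef instead of nf):

Code Block
languagebash
#!/bin/bash
#SBATCH --qos=nf
#SBATCH --output=fractional.out
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2

srun -c $SLURM_CPUS_PER_TASK ./xthi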

...



  1. If not already on HPCF, open a session on hpc-login.
  2. Create a new job script parallel.sh to run xthi with 32 MPI tasks and 4 OpenMP threads, leaving hyperthreading enabled. Submit it and check the output to ensure the right number of tasks and threads were spawned. Take note of which cpus are used, and how many SBUs you used.

    Here is a job template to start with:

    Code Block
    languagebash
    titleparallel.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --output=parallel-%j.out
    #SBATCH --qos=np
    # Add here the missing SBATCH directives for the relevant resources  
    
    export OMP_PLACES=threads
    srun -c $SLURM_CPUS_PER_TASK ./xthi 


    Expand
    titleSolution

    Using your favourite editor, create a file called parallel.sh with the following content:

    Code Block
    languagebash
    titleparallel.sh
    #!/bin/bash 
    #SBATCH --output=parallel-%j.out
    #SBATCH --qos=np 
    # Add here the missing SBATCH directives for the relevant resources
    #SBATCH --ntasks=32
    #SBATCH --cpus-per-task=4
    
    export OMP_PLACES=threads
    srun -c $SLURM_CPUS_PER_TASK ./xthi

    You need to request 32 tasks and 4 cpus per task in the job. Then srun spawns the parallel run, inheriting the job geometry requested, except for the cpus per task, which must be passed explicitly to srun.

    You can submit it with sbatch:

    No Format
    sbatch parallel.sh

    The job should run shortly. When finished, a new file called parallel-<jobid>.out should appear in the same directory. You can check the relevant output with:

    No Format
    grep -v ECMWF-INFO $(ls -1 parallel-*.out | head -n1)

    You should see an output similar to:

    No Format
    Host=ac2-4046  MPI Rank= 0  OMP Thread=0  CPU=  0  NUMA Node=0  CPU Affinity=  0
    Host=ac2-4046  MPI Rank= 0  OMP Thread=1  CPU=128  NUMA Node=0  CPU Affinity=128
    Host=ac2-4046  MPI Rank= 0  OMP Thread=2  CPU=  1  NUMA Node=0  CPU Affinity=  1
    Host=ac2-4046  MPI Rank= 0  OMP Thread=3  CPU=129  NUMA Node=0  CPU Affinity=129
    Host=ac2-4046  MPI Rank= 1  OMP Thread=0  CPU=  2  NUMA Node=0  CPU Affinity=  2
    Host=ac2-4046  MPI Rank= 1  OMP Thread=1  CPU=130  NUMA Node=0  CPU Affinity=130
    Host=ac2-4046  MPI Rank= 1  OMP Thread=2  CPU=  3  NUMA Node=0  CPU Affinity=  3
    Host=ac2-4046  MPI Rank= 1  OMP Thread=3  CPU=131  NUMA Node=0  CPU Affinity=131
    ...
    Host=ac2-4046  MPI Rank=30  OMP Thread=0  CPU=116  NUMA Node=7  CPU Affinity=116
    Host=ac2-4046  MPI Rank=30  OMP Thread=1  CPU=244  NUMA Node=7  CPU Affinity=244
    Host=ac2-4046  MPI Rank=30  OMP Thread=2  CPU=117  NUMA Node=7  CPU Affinity=117
    Host=ac2-4046  MPI Rank=30  OMP Thread=3  CPU=245  NUMA Node=7  CPU Affinity=245
    Host=ac2-4046  MPI Rank=31  OMP Thread=0  CPU=118  NUMA Node=7  CPU Affinity=118
    Host=ac2-4046  MPI Rank=31  OMP Thread=1  CPU=246  NUMA Node=7  CPU Affinity=246
    Host=ac2-4046  MPI Rank=31  OMP Thread=2  CPU=119  NUMA Node=7  CPU Affinity=119
    Host=ac2-4046  MPI Rank=31  OMP Thread=3  CPU=247  NUMA Node=7  CPU Affinity=247

    Note the following facts:

    • Both the main cores (0-127) and the hyperthreads (128-255) were used.
    • Consecutive threads of a task run on the same physical CPU (0 with 128, 1 with 129, ...).
    • Some physical cpus are entirely unused, since their cpu numbers do not show in the output.
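
    If you want to verify these points without reading the whole listing, a small shell check (assuming the output format shown above) counts the distinct cpus that were actually used; with 32 tasks and 4 cpus per task it should report 128:

    No Format
    grep -v ECMWF-INFO parallel-<jobid>.out | grep -oE 'CPU= *[0-9]+' | sort -u | wc -l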

    In terms of SBUs, this job cost:

    No Format
    $ grep SBU $(ls -1 parallel-*.out | head -n1)
    [ECMWF-INFO -ecepilog] SBU                       : 2.689
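
    If you want to relate that figure to the job's elapsed time and allocation, standard Slurm accounting can be queried once the job has finished (replace <jobid> with the id printed by sbatch):

    No Format
    sacct -j <jobid> --format=JobID,JobName,Elapsed,NNodes,AllocCPUS,State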



  3. Modify the parallel.sh job geometry (number of tasks and threads) so that you fully utilise all the physical cores of the node but none of the hyperthreads, i.e. 0-127.

    Expand
    titleSolution

    Using your favourite editor, modify parallel.sh so that it has the following content:

    Code Block
    languagebash
    titleparallel.sh
    #!/bin/bash 
    #SBATCH --output=parallel-%j.out
    #SBATCH --qos=np 
    # Add here the missing SBATCH directives for the relevant resources
    #SBATCH --ntasks=32
    #SBATCH --cpus-per-task=4
    #SBATCH --hint=nomultithread
    
    export OMP_PLACES=threads
    srun -c $SLURM_CPUS_PER_TASK ./xthi

    You keep the same geometry of 32 tasks and 4 cpus per task, but add the --hint=nomultithread directive so that only one hardware thread per physical core is used: 32 tasks with 4 cpus each then map exactly onto the 128 physical cores of the node. As before, srun inherits the job geometry, except for the cpus per task, which must be passed explicitly.

    You can submit it with sbatch:

    No Format
    sbatch parallel.sh

    The job should run shortly. When finished, a new file called parallel-<jobid>.out should appear in the same directory. You can check the relevant output of this latest run with:

    No Format
    grep -v ECMWF-INFO $(ls -1t parallel-*.out | head -n1)

    You should see an output similar to the following, where only the physical cores 0-127 appear:

    No Format
    Host=ac2-4046  MPI Rank= 0  OMP Thread=0  CPU=  0  NUMA Node=0  CPU Affinity=  0
    Host=ac2-4046  MPI Rank= 0  OMP Thread=1  CPU=  1  NUMA Node=0  CPU Affinity=  1
    Host=ac2-4046  MPI Rank= 0  OMP Thread=2  CPU=  2  NUMA Node=0  CPU Affinity=  2
    Host=ac2-4046  MPI Rank= 0  OMP Thread=3  CPU=  3  NUMA Node=0  CPU Affinity=  3
    Host=ac2-4046  MPI Rank= 1  OMP Thread=0  CPU=  4  NUMA Node=0  CPU Affinity=  4
    Host=ac2-4046  MPI Rank= 1  OMP Thread=1  CPU=  5  NUMA Node=0  CPU Affinity=  5
    ...
    Host=ac2-4046  MPI Rank=31  OMP Thread=0  CPU=124  NUMA Node=7  CPU Affinity=124
    Host=ac2-4046  MPI Rank=31  OMP Thread=1  CPU=125  NUMA Node=7  CPU Affinity=125
    Host=ac2-4046  MPI Rank=31  OMP Thread=2  CPU=126  NUMA Node=7  CPU Affinity=126
    Host=ac2-4046  MPI Rank=31  OMP Thread=3  CPU=127  NUMA Node=7  CPU Affinity=127

    Note the following facts:

    • Only the physical cores (0-127) are used; the hyperthreads (128-255) are left idle.
    • Each thread is pinned to its own physical core, with the threads of a task placed on consecutive cores.
    • All 128 physical cores of the node are now in use, so no allocated resources are wasted.

    In terms of SBUs, the cost is similar to the previous run, since in both cases the job blocks a full node for a comparable amount of time:

    No Format
    $ grep SBU $(ls -1t parallel-*.out | head -n1)
    [ECMWF-INFO -ecepilog] SBU                       : 2.689