...

  1. If not already on HPCF, open a session on hpc-login.
  2. Create a new job script parallel.sh to run xthi with 32 MPI tasks and 4 OpenMP threads, leaving hyperthreading enabled. Submit it and check the output to ensure the right number of tasks and threads were spawned. Take note of which CPUs are used and how many SBUs you spent.

    Here is a job template to start with:

    Code Block
    languagebash
    titleparallel.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --output=parallel-%j.out
    #SBATCH --qos=np 
    # TODO: Add here the missing SBATCH directives for the relevant resources
    
    # Define the number of OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # Ensure proper OpenMP thread CPU pinning
    export OMP_PLACES=threads
    
    # Load xthi tool
    module load xthi 
    
    srun -c $SLURM_CPUS_PER_TASK xthi 


    Expand
    titleSolution

    Using your favourite editor, create a file called parallel.sh with the following content:

    Code Block
    languagebash
    titleparallel.sh
    #!/bin/bash 
    #SBATCH --output=parallel-%j.out
    #SBATCH --qos=np 
    #SBATCH --ntasks=32
    #SBATCH --cpus-per-task=4
    
    # Define the number of OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # Ensure proper OpenMP thread CPU pinning
    export OMP_PLACES=threads
    
    # Load xthi tool
    module load xthi
    
    srun -c $SLURM_CPUS_PER_TASK xthi

    You need to request 32 tasks and 4 CPUs per task in the job. Then srun spawns the parallel run, inheriting the requested job geometry except for cpus-per-task, which must be passed to srun explicitly.
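    If you prefer not to repeat the value on the srun command line, newer Slurm releases also honour the SRUN_CPUS_PER_TASK input environment variable (a hedged alternative, not used in this exercise; check the srun man page of the installed Slurm version before relying on it):

    Code Block
    languagebash
    # Alternative sketch: export the value once instead of passing -c to srun
    export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
    srun xthi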

    You can submit it with sbatch:

    No Format
    sbatch parallel.sh

    The job should run shortly. When finished, a new file called parallel-<jobid>.out should appear in the same directory. You can check the relevant output with:

    No Format
    grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)

    You should see an output similar to:

    No Format
    Host=ac2-4046  MPI Rank= 0  OMP Thread=0  CPU=  0  NUMA Node=0  CPU Affinity=  0
    Host=ac2-4046  MPI Rank= 0  OMP Thread=1  CPU=128  NUMA Node=0  CPU Affinity=128
    Host=ac2-4046  MPI Rank= 0  OMP Thread=2  CPU=  1  NUMA Node=0  CPU Affinity=  1
    Host=ac2-4046  MPI Rank= 0  OMP Thread=3  CPU=129  NUMA Node=0  CPU Affinity=129
    Host=ac2-4046  MPI Rank= 1  OMP Thread=0  CPU=  2  NUMA Node=0  CPU Affinity=  2
    Host=ac2-4046  MPI Rank= 1  OMP Thread=1  CPU=130  NUMA Node=0  CPU Affinity=130
    Host=ac2-4046  MPI Rank= 1  OMP Thread=2  CPU=  3  NUMA Node=0  CPU Affinity=  3
    Host=ac2-4046  MPI Rank= 1  OMP Thread=3  CPU=131  NUMA Node=0  CPU Affinity=131
    ...
    Host=ac2-4046  MPI Rank=30  OMP Thread=0  CPU=116  NUMA Node=7  CPU Affinity=116
    Host=ac2-4046  MPI Rank=30  OMP Thread=1  CPU=244  NUMA Node=7  CPU Affinity=244
    Host=ac2-4046  MPI Rank=30  OMP Thread=2  CPU=117  NUMA Node=7  CPU Affinity=117
    Host=ac2-4046  MPI Rank=30  OMP Thread=3  CPU=245  NUMA Node=7  CPU Affinity=245
    Host=ac2-4046  MPI Rank=31  OMP Thread=0  CPU=118  NUMA Node=7  CPU Affinity=118
    Host=ac2-4046  MPI Rank=31  OMP Thread=1  CPU=246  NUMA Node=7  CPU Affinity=246
    Host=ac2-4046  MPI Rank=31  OMP Thread=2  CPU=119  NUMA Node=7  CPU Affinity=119
    Host=ac2-4046  MPI Rank=31  OMP Thread=3  CPU=247  NUMA Node=7  CPU Affinity=247

    Note the following facts:

    • Both the main cores (0-127) and the hyperthreads (128-255) were used.
    • Consecutive threads of a task land on the same physical core (0 with 128, 1 with 129, ...); you can verify the pairing with the command sketch below.
    • Some physical cores are left entirely unused, since their CPU numbers do not show in the output.
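    If you want to verify which logical CPUs are hyperthread siblings of the same physical core, you can inspect the node topology yourself, for example from an interactive session on a compute node (a sketch using standard Linux tools, not part of the exercise output):

    No Format
    # One line per logical CPU, with its physical core and NUMA node
    lscpu -e=CPU,CORE,NODE
    # Hyperthread siblings of logical CPU 0, e.g. "0,128"
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list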

    In terms of SBUs, this job cost:

    No Format
    $ grep SBU $(ls -1 parallel-*.out | tail -n1)
    [ECMWF-INFO -ecepilog] SBU                       : 6.051



  3. Modify the parallel.sh job geometry (number of tasks, threads and use of hyperthreading) so that you fully utilise all the physical cores, and only those, i.e. CPUs 0-127.

    Expand
    titleSolution

    Without using hyperthreading, an Atos HPCF node has 128 physical cores available. Any combination of tasks and threads that adds up to that figure will fill the node. Examples include 32 tasks x 4 threads, 64 tasks x 2 threads, or 128 single-threaded tasks. For this example, we picked the first one (an alternative-geometry sketch is shown after the full script):

    Code Block
    languagebash
    titleparallel.sh
    #!/bin/bash 
    #SBATCH --output=parallel-%j.out
    #SBATCH --qos=np
    #SBATCH --ntasks=32
    #SBATCH --cpus-per-task=4
    #SBATCH --hint=nomultithread
    
    # Define the number of OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # Ensure proper OpenMP thread CPU pinning
    export OMP_PLACES=threads
    
    # Load xthi tool
    module load xthi
    
    srun -c $SLURM_CPUS_PER_TASK xthi
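    Any of the other geometries mentioned above would fill the node equally well. As an illustration, the 64 tasks x 2 threads variant would only change the resource directives, with the rest of the script (still saved as parallel.sh) unchanged:

    Code Block
    languagebash
    # Alternative geometry sketch: 64 MPI tasks x 2 OpenMP threads = 128 physical cores
    #SBATCH --ntasks=64
    #SBATCH --cpus-per-task=2
    #SBATCH --hint=nomultithread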

    You can submit it with sbatch:

    No Format
    sbatch parallel.sh

    The job should run shortly. When finished, a new file called parallel-<jobid>.out should appear in the same directory. You can check the relevant output with:

    No Format
    grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)

    You should see an output similar to:

    No Format
    Host=ac3-2015  MPI Rank= 0  OMP Thread=0  CPU=  0  NUMA Node=0  CPU Affinity=  0                                                                                                              
    Host=ac3-2015  MPI Rank= 0  OMP Thread=1  CPU=  1  NUMA Node=0  CPU Affinity=  1                                                                                                              
    Host=ac3-2015  MPI Rank= 0  OMP Thread=2  CPU=  2  NUMA Node=0  CPU Affinity=  2                                                                                                              
    Host=ac3-2015  MPI Rank= 0  OMP Thread=3  CPU=  3  NUMA Node=0  CPU Affinity=  3
    Host=ac3-2015  MPI Rank= 1  OMP Thread=0  CPU=  4  NUMA Node=0  CPU Affinity=  4
    Host=ac3-2015  MPI Rank= 1  OMP Thread=1  CPU=  5  NUMA Node=0  CPU Affinity=  5
    Host=ac3-2015  MPI Rank= 1  OMP Thread=2  CPU=  6  NUMA Node=0  CPU Affinity=  6
    Host=ac3-2015  MPI Rank= 1  OMP Thread=3  CPU=  7  NUMA Node=0  CPU Affinity=  7
    ... 
    Host=ac3-2015  MPI Rank=30  OMP Thread=0  CPU=120  NUMA Node=7  CPU Affinity=120
    Host=ac3-2015  MPI Rank=30  OMP Thread=1  CPU=121  NUMA Node=7  CPU Affinity=121
    Host=ac3-2015  MPI Rank=30  OMP Thread=2  CPU=122  NUMA Node=7  CPU Affinity=122
    Host=ac3-2015  MPI Rank=30  OMP Thread=3  CPU=123  NUMA Node=7  CPU Affinity=123
    Host=ac3-2015  MPI Rank=31  OMP Thread=0  CPU=124  NUMA Node=7  CPU Affinity=124
    Host=ac3-2015  MPI Rank=31  OMP Thread=1  CPU=125  NUMA Node=7  CPU Affinity=125
    Host=ac3-2015  MPI Rank=31  OMP Thread=2  CPU=126  NUMA Node=7  CPU Affinity=126
    Host=ac3-2015  MPI Rank=31  OMP Thread=3  CPU=127  NUMA Node=7  CPU Affinity=127

    Note the following facts:

    • Only the main cores (0-127) were used.
    • Each thread gets exactly one CPU pinned to it.
    • All the physical cores are in use.

    In terms of SBUs, this job cost:

    No Format
    $ grep SBU $(ls -1 parallel-*.out | tail -n1)
    [ECMWF-INFO -ecepilog] SBU                       : 5.379



  4. Modify the parallel.sh job geometry so it still runs on the np QoS, but only with 2 tasks and 2 threads. Check the SBU cost. Since the execution is 32 times smaller, did it cost 32 times less than the previous one? Why?

    Expand
    titleSolution

    Let's use the following job:

    Code Block
    languagebash
    titleparallel.sh
    #!/bin/bash 
    #SBATCH --output=parallel-%j.out
    #SBATCH --qos=np 
    # Resource geometry: 2 tasks x 2 threads, physical cores only
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=2
    #SBATCH --hint=nomultithread
    
    # Define the number of OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # Ensure proper OpenMP thread CPU pinning
    export OMP_PLACES=threads
    
    # Load xthi tool
    module load xthi
    
    srun -c $SLURM_CPUS_PER_TASK xthi

    You can submit it with sbatch:

    No Format
    sbatch parallel.sh

    The job should run shortly. When finished, a new file called parallel-<jobid>.out should appear in the same directory. You can check the relevant output with:

    No Format
    grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)

    You should see an output similar to:

    No Format
    Host=ac2-3073  MPI Rank=0  OMP Thread=0  CPU= 0  NUMA Node=0  CPU Affinity= 0
    Host=ac2-3073  MPI Rank=0  OMP Thread=1  CPU= 1  NUMA Node=0  CPU Affinity= 1
    Host=ac2-3073  MPI Rank=1  OMP Thread=0  CPU=16  NUMA Node=1  CPU Affinity=16
    Host=ac2-3073  MPI Rank=1  OMP Thread=1  CPU=17  NUMA Node=1  CPU Affinity=17

    In terms of SBUs, this job cost:

    No Format
    $ grep SBU $(ls -1 parallel-*.out | tail -n1)
    [ECMWF-INFO -ecepilog] SBU                       : 4.034

    This is on a similar scale to the previous run, even though that one was 32 times bigger. The reason is that on the np QoS the allocation is done in full nodes: the SBU cost accounts for the nodes allocated for a given period of time, no matter how they are actually used. Both jobs were charged for one full node, so the difference in cost mainly reflects the difference in elapsed time, not the number of cores used.
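    If you want to confirm that both jobs were allocated a full node for a comparable amount of time, the Slurm accounting database can show the elapsed time and allocation (a sketch; replace <jobid> with your actual job ids):

    No Format
    # Elapsed time, allocated nodes and CPUs for a finished job
    sacct -j <jobid> -o JobID,Elapsed,AllocNodes,AllocCPUS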

    You may compare the cost of your last parallel job with that of your last fractional job, which had the same geometry (2x2):

    No Format
    $ grep -h SBU $(ls -1 parallel-*.out | tail -n1) fractional.out
    [ECMWF-INFO -ecepilog] SBU                       : 4.034
    [ECMWF-INFO -ecepilog] SBU                       : 0.084



...