When running parallel jobs, Slurm will automatically set up a default process affinity. This means that every task spawned by srun (each MPI rank in an MPI execution) will be pinned to a specific core or set of cores within each compute node.
However, the default affinity may not be what you would expect, and depending on the application it could have a significant impact on performance.
Below are some examples of how the affinity is set up by default in the different cases.
For these tests we use David McKain's version of the Cray xthi code to visualise how the process and thread placement takes place.
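The placements shown below come from batch jobs, but you can run the same check from inside any existing allocation; a minimal sketch, assuming the xthi module used in the examples below is available and the allocation has at least 4 tasks:

# Load the xthi module and print where 4 tasks are placed
ml xthi
srun -n 4 xthi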
MPI single-threaded execution
Default setup
This is the simplest case. Slurm will define the affinity at the level of physical cores, but allow each task to use the two hardware threads of its core. In this example, we run a 128-task MPI job with default settings, on a single node:
 
 
  #!/bin/bash
#SBATCH -q np
#SBATCH -n 128
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
ml xthi
srun -c ${SLURM_CPUS_PER_TASK:-1} xthi
  
 
 
 
  Host=ac2-2083  MPI Rank=  0  CPU=128  NUMA Node=0  CPU Affinity=  0,128
Host=ac2-2083  MPI Rank=  1  CPU=  1  NUMA Node=0  CPU Affinity=  1,129
Host=ac2-2083  MPI Rank=  2  CPU=130  NUMA Node=0  CPU Affinity=  2,130
Host=ac2-2083  MPI Rank=  3  CPU=  3  NUMA Node=0  CPU Affinity=  3,131
... 
Host=ac2-2083  MPI Rank=124  CPU=252  NUMA Node=7  CPU Affinity=124,252
Host=ac2-2083  MPI Rank=125  CPU=125  NUMA Node=7  CPU Affinity=125,253
Host=ac2-2083  MPI Rank=126  CPU=254  NUMA Node=7  CPU Affinity=126,254
Host=ac2-2083  MPI Rank=127  CPU=127  NUMA Node=7  CPU Affinity=127,255
  
Disabling multithread use
If you want to restrict each task to just one of the hardware threads, you may use the --hint=nomultithread option:
 
 
  #!/bin/bash
#SBATCH -q np
#SBATCH -n 128
#SBATCH --hint=nomultithread
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
ml xthi
srun -c ${SLURM_CPUS_PER_TASK:-1} xthi
  
 
 
 
  Host=ac1-2083  MPI Rank=  0  CPU=  0  NUMA Node=0  CPU Affinity=  0
Host=ac1-2083  MPI Rank=  1  CPU=  1  NUMA Node=0  CPU Affinity=  1
Host=ac1-2083  MPI Rank=  2  CPU=  2  NUMA Node=0  CPU Affinity=  2
Host=ac1-2083  MPI Rank=  3  CPU=  3  NUMA Node=0  CPU Affinity=  3
 ... 
Host=ac1-2083  MPI Rank=124  CPU=124  NUMA Node=7  CPU Affinity=124
Host=ac1-2083  MPI Rank=125  CPU=125  NUMA Node=7  CPU Affinity=125
Host=ac1-2083  MPI Rank=126  CPU=126  NUMA Node=7  CPU Affinity=126
Host=ac1-2083  MPI Rank=127  CPU=127  NUMA Node=7  CPU Affinity=127
  
Hybrid MPI + OpenMP execution
Default setup - Least ideal
Slurm will allocate a number of CPUs for each task and bind each process to a group of hardware threads matching the number of CPUs requested per task. However, all the threads of a given process may run on any of the CPUs in that group, so they can migrate within it. In the following example we run a 32-task MPI program, with every rank spawning 4 threads:
 
 
  #!/bin/bash
#SBATCH -q np
#SBATCH -n 32
#SBATCH -c 4
ml xthi
srun -c ${SLURM_CPUS_PER_TASK:-1} xthi
  
 
 
 
  Host=ac1-1035  MPI Rank= 0  OMP Thread=0  CPU=  0  NUMA Node=0  CPU Affinity=    0,1,128,129
Host=ac1-1035  MPI Rank= 0  OMP Thread=1  CPU=128  NUMA Node=0  CPU Affinity=    0,1,128,129
Host=ac1-1035  MPI Rank= 0  OMP Thread=2  CPU=  0  NUMA Node=0  CPU Affinity=    0,1,128,129
Host=ac1-1035  MPI Rank= 0  OMP Thread=3  CPU=  0  NUMA Node=0  CPU Affinity=    0,1,128,129
Host=ac1-1035  MPI Rank= 1  OMP Thread=0  CPU=  2  NUMA Node=0  CPU Affinity=    2,3,130,131
Host=ac1-1035  MPI Rank= 1  OMP Thread=1  CPU=  3  NUMA Node=0  CPU Affinity=    2,3,130,131
Host=ac1-1035  MPI Rank= 1  OMP Thread=2  CPU=130  NUMA Node=0  CPU Affinity=    2,3,130,131
Host=ac1-1035  MPI Rank= 1  OMP Thread=3  CPU=131  NUMA Node=0  CPU Affinity=    2,3,130,131
...
Host=ac1-1081  MPI Rank=30  OMP Thread=0  CPU=124  NUMA Node=7  CPU Affinity=124,125,252,253
Host=ac1-1081  MPI Rank=30  OMP Thread=1  CPU=253  NUMA Node=7  CPU Affinity=124,125,252,253
Host=ac1-1081  MPI Rank=30  OMP Thread=2  CPU=252  NUMA Node=7  CPU Affinity=124,125,252,253
Host=ac1-1081  MPI Rank=30  OMP Thread=3  CPU=125  NUMA Node=7  CPU Affinity=124,125,252,253
Host=ac1-1081  MPI Rank=31  OMP Thread=0  CPU=255  NUMA Node=7  CPU Affinity=126,127,254,255
Host=ac1-1081  MPI Rank=31  OMP Thread=1  CPU=126  NUMA Node=7  CPU Affinity=126,127,254,255
Host=ac1-1081  MPI Rank=31  OMP Thread=2  CPU=127  NUMA Node=7  CPU Affinity=126,127,254,255
Host=ac1-1081  MPI Rank=31  OMP Thread=3  CPU=254  NUMA Node=7  CPU Affinity=126,127,254,255 
  
 
Binding OpenMP threads to single hardware threads - Best if hyper-threading desired
If you want to bind every OpenMP thread to its own hardware thread, you may use the OpenMP variable OMP_PLACES set to threads:
 
 
  #!/bin/bash
#SBATCH -q np
#SBATCH -n 32
#SBATCH -c 4
export OMP_PLACES=threads
ml xthi
srun -c ${SLURM_CPUS_PER_TASK:-1} xthi
  
 
 
 
  Host=ac2-1078  MPI Rank= 0  OMP Thread=0  CPU=  0  NUMA Node=0  CPU Affinity=  0
Host=ac2-1078  MPI Rank= 0  OMP Thread=1  CPU=128  NUMA Node=0  CPU Affinity=128
Host=ac2-1078  MPI Rank= 0  OMP Thread=2  CPU=  1  NUMA Node=0  CPU Affinity=  1
Host=ac2-1078  MPI Rank= 0  OMP Thread=3  CPU=129  NUMA Node=0  CPU Affinity=129
Host=ac2-1078  MPI Rank= 1  OMP Thread=0  CPU=  2  NUMA Node=0  CPU Affinity=  2
Host=ac2-1078  MPI Rank= 1  OMP Thread=1  CPU=130  NUMA Node=0  CPU Affinity=130
Host=ac2-1078  MPI Rank= 1  OMP Thread=2  CPU=  3  NUMA Node=0  CPU Affinity=  3
Host=ac2-1078  MPI Rank= 1  OMP Thread=3  CPU=131  NUMA Node=0  CPU Affinity=131
... 
Host=ac2-1078  MPI Rank=30  OMP Thread=0  CPU=116  NUMA Node=7  CPU Affinity=116
Host=ac2-1078  MPI Rank=30  OMP Thread=1  CPU=244  NUMA Node=7  CPU Affinity=244
Host=ac2-1078  MPI Rank=30  OMP Thread=2  CPU=117  NUMA Node=7  CPU Affinity=117
Host=ac2-1078  MPI Rank=30  OMP Thread=3  CPU=245  NUMA Node=7  CPU Affinity=245
Host=ac2-1078  MPI Rank=31  OMP Thread=0  CPU=118  NUMA Node=7  CPU Affinity=118
Host=ac2-1078  MPI Rank=31  OMP Thread=1  CPU=246  NUMA Node=7  CPU Affinity=246
Host=ac2-1078  MPI Rank=31  OMP Thread=2  CPU=119  NUMA Node=7  CPU Affinity=119
Host=ac2-1078  MPI Rank=31  OMP Thread=3  CPU=247  NUMA Node=7  CPU Affinity=247
  
 
Disabling multithread use - Best if no hyper-threading desired
If you want to avoid having two threads sharing the same physical core, and to maximise the use of all the physical cores in the node, you may use the --hint=nomultithread option:
 
 
  #!/bin/bash
#SBATCH -q np
#SBATCH -n 32
#SBATCH -c 4
#SBATCH --hint=nomultithread
export OMP_PLACES=threads
ml xthi
srun -c ${SLURM_CPUS_PER_TASK:-1} xthi
  
 
 
 
  Host=ac2-1078  MPI Rank= 0  OMP Thread=0  CPU=  0  NUMA Node=0  CPU Affinity=  0
Host=ac2-1078  MPI Rank= 0  OMP Thread=1  CPU=  1  NUMA Node=0  CPU Affinity=  1
Host=ac2-1078  MPI Rank= 0  OMP Thread=2  CPU=  2  NUMA Node=0  CPU Affinity=  2
Host=ac2-1078  MPI Rank= 0  OMP Thread=3  CPU=  3  NUMA Node=0  CPU Affinity=  3
Host=ac2-1078  MPI Rank= 1  OMP Thread=0  CPU=  4  NUMA Node=0  CPU Affinity=  4
Host=ac2-1078  MPI Rank= 1  OMP Thread=1  CPU=  5  NUMA Node=0  CPU Affinity=  5
Host=ac2-1078  MPI Rank= 1  OMP Thread=2  CPU=  6  NUMA Node=0  CPU Affinity=  6
Host=ac2-1078  MPI Rank= 1  OMP Thread=3  CPU=  7  NUMA Node=0  CPU Affinity=  7
...
Host=ac2-1078  MPI Rank=30  OMP Thread=0  CPU=120  NUMA Node=7  CPU Affinity=120
Host=ac2-1078  MPI Rank=30  OMP Thread=1  CPU=121  NUMA Node=7  CPU Affinity=121
Host=ac2-1078  MPI Rank=30  OMP Thread=2  CPU=122  NUMA Node=7  CPU Affinity=122
Host=ac2-1078  MPI Rank=30  OMP Thread=3  CPU=123  NUMA Node=7  CPU Affinity=123
Host=ac2-1078  MPI Rank=31  OMP Thread=0  CPU=124  NUMA Node=7  CPU Affinity=124
Host=ac2-1078  MPI Rank=31  OMP Thread=1  CPU=125  NUMA Node=7  CPU Affinity=125
Host=ac2-1078  MPI Rank=31  OMP Thread=2  CPU=126  NUMA Node=7  CPU Affinity=126
Host=ac2-1078  MPI Rank=31  OMP Thread=3  CPU=127  NUMA Node=7  CPU Affinity=127
  
Further customisation
If you wish to further customise how the binding and the task/thread distribution are done, check the man pages for sbatch and srun, or the online documentation.
You may use the --cpu-bind option to fine-tune how the binding is done. All the possible values are described in the official man pages. These are the main highlights:
- Binding can be disabled altogether by passing --cpu-bind=none to srun.
- You may see the actual binding mask applied by passing --cpu-bind=verbose.
- Custom maps or masks can be defined and passed to srun with --cpu-bind=map_cpu and --cpu-bind=mask_cpu, as in the sketch below.
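For example (an illustrative sketch only; the CPU ids in the map are hypothetical, simply picking the first core of the first four NUMA domains of the 128-core nodes shown above):

# Print the binding mask that Slurm actually applies to each task
srun --cpu-bind=verbose -n 128 xthi

# Pin 4 tasks to explicit CPU ids with a custom map; verbose may be
# combined with the binding type as a prefix
srun --cpu-bind=verbose,map_cpu:0,16,32,48 -n 4 xthi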
You can also control how the processes and threads are distributed and bound with the -m or --distribution option of srun. The default is block:cyclic. The first element refers to the distribution of tasks among nodes; block distributes them in such a way that consecutive tasks share a node. The second element controls how the CPUs are distributed within the node; by default, cyclic distributes the CPUs allocated for binding to a given task consecutively from the same socket, and takes the next consecutive socket for the next task, in a round-robin fashion across sockets (see the sketch below).
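As an illustrative sketch (task and CPU counts borrowed from the hybrid examples above; check the srun man page before relying on a particular distribution), requesting block for the second element takes the CPUs for consecutive tasks from the same socket until it is filled:

# Hybrid job as above, but with an explicit distribution: tasks placed
# block-wise across nodes and CPUs taken consecutively within each socket
srun -m block:block -c ${SLURM_CPUS_PER_TASK:-1} xthi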