

Training course time table and course material, September 2015

Wednesday 16 September

08:30 - 09:15 Registration

09:15 - 09:45 Welcome and Opening Remarks - Sami Saarinen, Tim Lanfear, Nils Wedi

09:45 - 10:30 GPU Hardware Overview and Programming Models - Jeremy Appleyard

This presentation gives a quick overview of the present status of GPU computing.

10:30 - 11:00 Coffee/tea break

11:00 - 12:00 Introduction to CUDA Programming and Libraries (video) - Jeremy Appleyard

This presentation explains the basic concepts of CUDA programming through CUDA Fortran (a PGI extension) and how they map to the underlying GPU hardware. It also explains the principles of using the CUDA high-performance libraries.

Video on Pascal (additional slides on new generation GPU, restricted access)
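
For orientation only (this is not part of the course material), a minimal CUDA Fortran kernel and launch, along the lines of NVIDIA's introductory saxpy example, might look like the sketch below; names and sizes are illustrative.

module saxpy_mod
contains
  ! GPU kernel: each CUDA thread updates one array element
  attributes(global) subroutine saxpy(x, y, a)
    implicit none
    real :: x(:), y(:)
    real, value :: a
    integer :: i
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= size(x)) y(i) = y(i) + a * x(i)
  end subroutine saxpy
end module saxpy_mod

program test_saxpy
  use cudafor
  use saxpy_mod
  implicit none
  integer, parameter :: n = 65536
  real :: x(n), y(n)
  real, device :: x_d(n), y_d(n)   ! arrays resident in GPU memory
  x = 1.0; y = 2.0
  x_d = x; y_d = y                 ! host-to-device copies
  call saxpy<<<(n+255)/256, 256>>>(x_d, y_d, 2.0)   ! kernel launch: grid, block
  y = y_d                          ! device-to-host copy
  print *, 'max error =', maxval(abs(y - 4.0))
end program test_saxpy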

12:00 - 12:05 ECMWF GPU test cluster (pptx) (video) - Christian Weihrauch

This presentation describes the configuration of an ECMWF GPU Test Cluster used for training and experimentation.

12:05 - 13:00 Hands-on Session 1 - Jeremy Appleyard

Calling DGEMM & FFT CUDA libraries from Fortran (Hands-on material)
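
As a taster for this session, calling cuBLAS DGEMM on device arrays from CUDA Fortran could look roughly like the following sketch. It assumes PGI's cublas module, which provides a cublasDgemm interface that accepts device arrays; the matrix size and link flags are illustrative.

program dgemm_gpu
  use cudafor
  use cublas     ! PGI module with cuBLAS interfaces (link e.g. with -Mcuda -lcublas)
  implicit none
  integer, parameter :: n = 512
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  real(8), device, allocatable :: a_d(:,:), b_d(:,:), c_d(:,:)

  allocate(a(n,n), b(n,n), c(n,n))
  allocate(a_d(n,n), b_d(n,n), c_d(n,n))
  a = 1.0d0; b = 2.0d0; c = 0.0d0
  a_d = a; b_d = b; c_d = c        ! host-to-device copies

  ! C := 1.0*A*B + 0.0*C computed on the GPU
  call cublasDgemm('N', 'N', n, n, n, 1.0d0, a_d, n, b_d, n, 0.0d0, c_d, n)

  c = c_d                          ! device-to-host copy
  print *, 'c(1,1) =', c(1,1)      ! expect 2*n = 1024
end program dgemm_gpu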

13:00 - 14:00 Lunch break

14:00 - 15:30 Introduction to OpenACC for Fortran Programmers (pptx)  (video) - Michael Wolfe

This presentation gives a fundamental overview of OpenACC directive-based GPU computing. It also explains how OpenACC relates to CUDA and OpenMP.
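
As a flavour of what the directives look like (an illustrative sketch, not the course slides), a simple vector addition in OpenACC Fortran could be written as follows and compiled e.g. with pgfortran -acc -Minfo=accel -ta=tesla:cc35.

program vecadd_acc
  implicit none
  integer, parameter :: n = 100000
  real :: a(n), b(n), c(n)
  integer :: i
  a = 1.0
  b = 2.0
  ! The compiler turns this loop into a GPU kernel and moves a, b and c
  ! between host and device according to the data clauses.
  !$acc parallel loop copyin(a, b) copyout(c)
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  print *, 'c(1) =', c(1)
end program vecadd_acc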

15:30 - 16:00 Coffee/tea break

16:00 - 17:00 Hands-on Session 2 - Jeremy Appleyard

2D Laplace equation solver with Jacobi iteration (Hands-on material)
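
The hands-on code itself is in the material above; purely as an illustration of the pattern, an OpenACC Jacobi sweep for the 2D Laplace equation might be structured like this (array names, sizes and the fixed number of sweeps are arbitrary):

program jacobi_acc
  implicit none
  integer, parameter :: n = 512, m = 512, itmax = 1000
  real :: a(n,m), anew(n,m), err
  integer :: i, j, it

  a = 0.0
  a(:,1) = 1.0            ! fixed boundary condition on one edge

  ! Keep both grids resident on the GPU across all iterations
  !$acc data copy(a) create(anew)
  do it = 1, itmax
     err = 0.0
     !$acc parallel loop reduction(max:err)
     do j = 2, m-1
        !$acc loop reduction(max:err)
        do i = 2, n-1
           anew(i,j) = 0.25 * (a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
           err = max(err, abs(anew(i,j) - a(i,j)))
        end do
     end do
     !$acc parallel loop
     do j = 2, m-1
        !$acc loop
        do i = 2, n-1
           a(i,j) = anew(i,j)
        end do
     end do
  end do
  !$acc end data
  print *, 'max change in last sweep =', err
end program jacobi_acc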

Thursday 17 September

09:00 - 10:30 Advanced GPU Topics 1 (video) - Jeremy Appleyard

This presentation dives deeper into GPU computing. It shows how to profile GPU applications and how to optimize data movement. It also covers interoperability with MPI and how to communicate directly between GPUs. The profiling section shows how to use the nvprof and nvvp tools.
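
For the profiling part, typical command lines on the test cluster might look like the sketch below (the executable name is a placeholder); the resulting profile file can then be opened in the nvvp visual profiler.

# Quick summary of kernel and memory-copy times
srun -n1 --gres=gpu:1 nvprof ./a.out

# Detailed per-launch trace
srun -n1 --gres=gpu:1 nvprof --print-gpu-trace ./a.out

# Write a profile file that can be imported into nvvp
srun -n1 --gres=gpu:1 nvprof -o profile.nvprof ./a.out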

10:30 - 11:00 Coffee/tea break

11:00 - 12:00 Hands-on Session 3 - Michael Wolfe

Using multiple GPUs (with and without MPI) (Hands-on material)

12:00 - 13:00 Debugging GPU programs (with DDT) (pptx) (video) - Michael Wolfe

This presentation describes how to debug CUDA Fortran and OpenACC programs on GPUs using Allinea DDT.
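
A rough recipe (a sketch only; the exact DDT launch method depends on the local Allinea installation): compile with debug information and reduced optimization, then start the program under DDT.

# Build with debug info; -g and -O0 keep the device code easier to follow
pgfortran -g -O0 -acc -ta=tesla:cc35 -Minfo=accel prog.F90 -o prog
# Launch under the DDT GUI
ddt ./prog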

13:00 - 13:45 Lunch break

13:45 - 14:15 Tour of ECMWF Computer Hall - Christian Weihrauch

14:15 - 15:00 GPU experiences at ECMWF

Spectral transform (pptx) (video) - George Mozdzynski

This presentation is the first part of the GPU experiences at ECMWF so far. It shows how to port a relatively small but important part of the IFS spectral model to GPUs.

Cloud scheme (CLOUDSC) (pptx) (video) - Sami Saarinen

This presentation is the second part of the GPU experiences at ECMWF so far. It shows how to insert OpenACC directives semi-automatically into one of the most time-consuming parts of the entire IFS code: CLOUDSC. The process preserves a single source, so the same code without OpenACC directives still runs at the same speed on conventional multicore systems as before. An interesting part of the presentation is how to hide latencies by letting multiple OpenMP threads fire up several kernels simultaneously on one GPU (the so-called time-sharing mode).
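
The idea behind the time-sharing mode can be sketched as follows (an illustrative toy example, not the CLOUDSC code): each OpenMP thread processes its own blocks and launches asynchronous OpenACC kernels, so kernels and transfers issued by different threads can be in flight on the GPU at the same time.

program omp_acc_blocks
  implicit none
  integer, parameter :: nblk = 8, n = 100000
  real    :: a(n, nblk)
  integer :: ib, i
  a = 1.0
  ! Each OpenMP thread handles some blocks and fires its own async GPU kernel
  !$omp parallel do private(ib, i) schedule(static)
  do ib = 1, nblk
     !$acc parallel loop async(ib) copy(a(:,ib))
     do i = 1, n
        a(i, ib) = 2.0 * a(i, ib)
     end do
  end do
  !$omp end parallel do
  !$acc wait          ! wait for all asynchronous queues to finish
  print *, 'a(1,1) =', a(1,1)
end program omp_acc_blocks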

15:00 - 15:30 Coffee/tea break

15:30 - 16:30 Advanced GPU Topics 2 (pptx) (video) - Michael Wolfe

This presentation concludes the training course. It explains more advanced topics of OpenACC and where it is heading in the near future. One interesting fact is that PGI will soon have an OpenACC compiler that produces code for conventional multicore (as well as manycore) systems, so that the same OpenACC code runs both on GPUs and on CPUs.

16:30 - 17:00 Course Wrap-Up and Discussion - Sami Saarinen, All Delegates

Exercises

# Copy and unpack the hands-on material
cp /perm/rd/mps/gpu/HandsOn1.tgz .
tar xvf HandsOn1.tgz
cd HandsOn1a
# Build and run via the makefile target
make run
# Or launch an executable by hand on a node with one GPU
srun -n1 --gres=gpu:1 ./example1_cpu

For timing purposes you will need to create a batch job.

A SLURM batch job example, which you can download here:

#!/bin/bash
# gres_test.job
# Submit as follows:
# ###   sbatch --gres=gpu:1 -n1 gres_test.job
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1

hostname
time srun --gres=gpu:1 -n1 /scratch/systems/sycw/cuda-samples/NVIDIA_CUDA-7.0_Samples/bin/x86_64/linux/release/clock

Launching MPI programs on our cluster with OpenMPI: do NOT use mpirun; use srun instead (the SLURM-based launcher). With mpirun you would end up creating detached tasks that do not communicate with each other.

#  Do NOT use :
mpirun -np <num_tasks> <executable>
# instead DO use
srun -n <num_tasks> <executable>

# For example on a single node, two MPI-tasks
srun -n2 ./a.out
# .. alternatively (and equivalent with previous)
srun -N1 -n2 ./a.out

# Use two GPUs on a single node over MPI
srun -n2 --gres=gpu:2 ./a.out

# Use four GPUs in total over two nodes with four MPI tasks
srun -N2 -n4 --gres=gpu:2 ./a.out

# Enabling OpenMP threading within MPI tasks (e.g. 12 threads per task)
env OMP_NUM_THREADS=12 srun -N2 -n4 ./a.out  # no GPUs
env OMP_NUM_THREADS=12 srun -N2 -n4 --gres=gpu:2 ./a.out  # with GPUs

To initialize the device under OpenACC, use the following code in Fortran:

PROGRAM main
use mpi
#ifdef _OPENACC
use openacc
#endif
...
implicit none
...
integer :: idevtype 
...
#ifdef _OPENACC
! Absorbs all GPU startup time before any OpenACC kernel
! The preference has been to call this BEFORE MPI initialization
idevtype = acc_get_device_type()
CALL acc_init(idevtype) 
#endif
...
CALL MPI_Init( ... )
...
END PROGRAM main

To select an MPI-task-specific GPU device (either 0 or 1 on our system), use the following code in MPI programs, with the help of a nodeinfo subroutine (a sketch of which is given after the program):

PROGRAM main
use mpi
#ifdef _OPENACC
use openacc
#endif
...
implicit none
...
integer :: ierr, icomm, myrank, npes
integer :: numnodes, noderank, nodenpes
character(len=MPI_MAX_PROCESSOR_NAME) :: nodename
integer :: idevtype, numdevs, mygpu

#ifdef _OPENACC
idevtype = acc_get_device_type()
CALL acc_init(idevtype) 
#endif

CALL MPI_Init(ierr)
icomm = MPI_COMM_WORLD
CALL MPI_Comm_rank(icomm, myrank, ierr)
CALL MPI_Comm_size(icomm, npes, ierr)

CALL nodeinfo(icomm, numnodes, nodename, noderank, nodenpes)

#ifdef _OPENACC
numdevs = acc_get_num_devices(idevtype) ! Number of GPU-devices available on the node
mygpu = mod(noderank,numdevs)  ! Suggest using mygpu-GPU on the node
CALL acc_set_device_num(mygpu, idevtype)  ! Assign MPI task to use mygpu 
mygpu = acc_get_device_num(idevtype)      ! Verify the GPU id
#endif

...
CALL MPI_Finalize(ierr)
END PROGRAM main
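
The nodeinfo subroutine used above is not included in this material. A minimal sketch of what it could look like, assuming an MPI-3 library (for MPI_Comm_split_type), is:

SUBROUTINE nodeinfo(icomm, numnodes, nodename, noderank, nodenpes)
  use mpi
  implicit none
  integer, intent(in)  :: icomm                    ! communicator to analyse
  integer, intent(out) :: numnodes                 ! number of nodes in icomm
  character(len=*), intent(out) :: nodename        ! name of this task's node
  integer, intent(out) :: noderank                 ! rank of this task within its node
  integer, intent(out) :: nodenpes                 ! number of tasks on this node
  integer :: nodecomm, namelen, ione, ierr

  CALL MPI_Get_processor_name(nodename, namelen, ierr)

  ! Split icomm into per-node communicators (all tasks sharing a node)
  CALL MPI_Comm_split_type(icomm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, nodecomm, ierr)
  CALL MPI_Comm_rank(nodecomm, noderank, ierr)
  CALL MPI_Comm_size(nodecomm, nodenpes, ierr)

  ! Count the nodes: each node's rank-0 task contributes 1
  ione = merge(1, 0, noderank == 0)
  CALL MPI_Allreduce(ione, numnodes, 1, MPI_INTEGER, MPI_SUM, icomm, ierr)

  CALL MPI_Comm_free(nodecomm, ierr)
END SUBROUTINE nodeinfo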


Go to the training home page

OpenACC tricks

use -g for debugging, -gopt for profiling
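
For example (a sketch; adjust the -ta target to the GPUs in use):

# Debug build: full debug information
pgfortran -g -acc -Minfo=accel -ta=tesla:cc35 file.F90
# Profiling build: line/debug information while keeping optimization
pgfortran -gopt -acc -Minfo=accel -ta=tesla:cc35 file.F90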


Get ptxas output from nvcc when compiling OpenACC code with pgfortran: add the ptxinfo option to the -ta flag, e.g.
pgfortran ... -acc -Minfo=accel -ta=tesla:cc35,ptxinfo


Get the generated CUDA code (almost readable) with:
pgfortran -acc -ta=tesla:keepgpu,nollvm,O0


There is a pgireport command that produces acceleration and other information in a .lst file; usage:
pgireport pgfortran -acc file.F90    # you get file.lst