Objectives

  • Hide memory latency
  • High utilisation of tensor cores (on NVIDIA cards; see the precision-settings sketch after this list)
  • Occupancy >= 50% (the ratio of active warps per SM to the maximum number of warps the SM can support)
  • Shared memory (which shares on-chip storage with the L1 cache) is utilised
  • High GPU utilization
    • Note that this metric can be misleading: memory operations (transpose, reshuffle, etc.) are also counted, so a high GPU utilisation can be reported even when most of the time is spent on memory ops
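
Tensor-core utilisation in particular depends on the precision settings used for matrix multiplications. Below is a minimal, hedged sketch of the standard PyTorch knobs that usually control this; it is purely illustrative and not taken from the AIFS code base.

    import torch

    # Allow TF32 tensor-core paths for float32 matmuls (Ampere and newer GPUs).
    torch.set_float32_matmul_precision("high")

    # Mixed precision: autocast routes eligible ops to bfloat16 tensor cores.
    model = torch.nn.Linear(1024, 1024).cuda()
    x = torch.randn(64, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        y = model(x)

Whether these settings are acceptable for a given training run depends on the numerical tolerance of the model; they are only relevant to the tensor-core objective above.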

Profiling with torch profiler

  1. Run the code with profiler: True set in aifs/config/diagnostics/eval_rollout.yaml (see the torch.profiler sketch after this list)
    1. You might want to modify other settings or the code so that only a single epoch is run; this is typically sufficient to get reliable numbers
  2. Load the resulting trace in TensorBoard; see https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html for details on how to load and analyse the report using TensorBoard + browser (https://pytorch.org/blog/trace-analysis-for-masses/ is useful for large models)
    1. Memory latencies: in the trace/timeline view, check that memory transfers are overlapped with computation and that there are no gaps in the estimated SM efficiency or in the call trace of the streams.
    2. Occupancy: the average is listed on the overview page; details under Views → GPU Kernel
    3. Tensor core utilisation: Views → GPU kernel
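
For reference, the profiler: True flag corresponds roughly to wrapping the training loop in torch.profiler. The following is a self-contained sketch that produces a trace readable by the TensorBoard plugin; the model, schedule and output directory are illustrative and not the actual AIFS integration.

    import torch
    from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

    model = torch.nn.Linear(1024, 1024).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x = torch.randn(64, 1024, device="cuda")

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=tensorboard_trace_handler("./tb_profile"),  # illustrative output dir
        profile_memory=True,
        with_stack=True,
    ) as prof:
        for _ in range(6):
            opt.zero_grad()
            model(x).sum().backward()
            opt.step()
            prof.step()  # advance the wait/warmup/active schedule each iteration

The trace written to ./tb_profile can then be opened with tensorboard --logdir ./tb_profile (requires the torch-tb-profiler plugin), which provides the overview, trace and GPU kernel views mentioned above.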

Low-level profiling with NVIDIA nsys

Instructions for ATOS

  1. Load the NVIDIA modules: ml prgenv/nvidia nvidia/22.11 (the other prgenv/* modules are alternative compiler environments and should not be loaded at the same time)
  2. Create output folder, e.g. nsys_profile
  3. Run nsys: PGI_ACC_CUDA_HEAPSIZE=8GB nsys profile -o nsys_profile/report_rollout1 -f true -t nvtx,cuda,cublas,cublas-verbose,cusparse,cusparse-verbose,cudnn -s none --cuda-memory-usage true --stats true aifs-train
      -> Running a small, single-epoch job is sufficient and usually more useful than profiling a full training run. The -t nvtx option picks up NVTX ranges emitted from the model code (see the sketch after this list).
  4. Copy output folder to your local machine
  5. Install NVIDIA Nsight Systems (requires an NVIDIA developer account for download)
  6. Open the generated .nsys-rep file (e.g. nsys_profile/report_rollout1.nsys-rep) in NVIDIA Nsight Systems
      - For interpretation you might want to contact Michael Lange and his team (IFS performance and portability), who are responsible for the GPU port of the IFS
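
The nsys timeline is much easier to interpret when the model code is annotated with NVTX ranges, which the -t nvtx option above collects. Below is a hedged sketch of the standard PyTorch hooks for this; the range name and model are made up for illustration.

    import torch

    model = torch.nn.Linear(1024, 1024).cuda()
    x = torch.randn(64, 1024, device="cuda")

    # Named region that appears as a bar on the NVTX row of the nsys timeline.
    torch.cuda.nvtx.range_push("forward")
    y = model(x)
    torch.cuda.nvtx.range_pop()

    # Alternatively, emit one NVTX range per autograd op for a whole block.
    with torch.autograd.profiler.emit_nvtx():
        model(x).sum().backward()

emit_nvtx() is only useful when running under an external profiler such as nsys; it adds overhead, so it should not be left enabled in normal training runs.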

