Objectives
- Hide memory latency
- High utilisation of tensor cores (on NVIDIA cards)
- Occupancy >= 50% (occupancy is the ratio of warps active on a streaming multiprocessor to the maximum number of warps it can hold)
- Shared memory (on-chip memory that shares hardware with the L1 cache) is utilised
- High GPU utilisation
- Note that this metric can be misleading: memory operations (transpose, reshuffle, etc.) also count as GPU activity, so utilisation can be reported as high even when most of the time is spent on memory operations rather than compute
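To make the occupancy target more concrete, here is a minimal sketch of a theoretical occupancy estimate from the kernel launch configuration. The default limits (64 resident warps and 32 resident blocks per SM, 32 threads per warp) are assumptions that match e.g. an A100; check the values for your GPU. The sketch ignores register and shared memory limits, so it is an upper bound, not the achieved occupancy that the profiler reports.

```python
import math


def theoretical_occupancy(threads_per_block,
                          max_warps_per_sm=64,    # assumption: e.g. A100
                          max_blocks_per_sm=32,   # assumption: e.g. A100
                          warp_size=32):
    """Upper bound on occupancy from the block size alone.

    Ignores register and shared memory usage, which can lower the
    number of blocks that actually fit on an SM.
    """
    # A block always occupies a whole number of warps.
    warps_per_block = math.ceil(threads_per_block / warp_size)
    # Blocks that fit on one SM, limited by the warp and block limits.
    resident_blocks = min(max_blocks_per_sm,
                          max_warps_per_sm // warps_per_block)
    return resident_blocks * warps_per_block / max_warps_per_sm


if __name__ == "__main__":
    for block_size in (32, 256, 768):
        print(block_size, theoretical_occupancy(block_size))
```

With these assumed limits, very small blocks (32 threads) hit the block limit before the warp limit and only reach the 50% mark, which is one reason why tiny launch configurations tend to show low occupancy.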
Profiling with torch profiler
- Run the code with profiler: True set in aifs/config/diagnostics/eval_rollout.yaml
- You might want to modify other settings or the code so that only a single epoch is run; this is typically sufficient to get reliable numbers
- See https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html for details on how to load and analyse the report using TensorBoard in a browser (https://pytorch.org/blog/trace-analysis-for-masses/ is useful for large models)
- Memory latencies: in the trace/timeline view, check that memory transfers overlap with computation and that there are no gaps in the estimated SM efficiency or in the call trace of the streams
- Occupancy: average is listed on the start page, details on Views → GPU kernel
- Tensor core utilisation: Views → GPU kernel
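The check for gaps in the timeline can also be scripted. The following is a rough sketch that scans a profiler trace in Chrome trace format for intervals in which no GPU kernel is running. It assumes that kernel events are complete events ("ph": "X") with category "kernel" (or "Kernel" in older traces) and timestamps in microseconds; these field values and the demo trace below are assumptions for illustration, so verify them against your own trace file.

```python
def gpu_idle_gaps(trace, min_gap_us=100.0, kernel_cats=("kernel", "Kernel")):
    """Return (gap_start_us, gap_length_us) for idle periods between kernels.

    Kernels from all streams are merged, so a gap is only reported when
    no kernel is running on any stream.
    """
    events = trace.get("traceEvents", [])
    # Collect (start, end) for every GPU kernel event, sorted by start time.
    intervals = sorted(
        (e["ts"], e["ts"] + e["dur"])
        for e in events
        if e.get("ph") == "X" and e.get("cat") in kernel_cats
    )
    gaps = []
    busy_until = None
    for start, end in intervals:
        if busy_until is not None and start - busy_until >= min_gap_us:
            gaps.append((busy_until, start - busy_until))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


# Hypothetical miniature trace, for illustration only.
demo_trace = {
    "traceEvents": [
        {"ph": "X", "cat": "kernel", "name": "gemm", "ts": 0, "dur": 500},
        # Overlaps with the first kernel (e.g. on a second stream).
        {"ph": "X", "cat": "kernel", "name": "transpose", "ts": 400, "dur": 200},
        # CPU-side event, ignored by the gap search.
        {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "ts": 650, "dur": 50},
        {"ph": "X", "cat": "kernel", "name": "gemm", "ts": 1000, "dur": 100},
        # Only 50 us after the previous kernel: below the threshold.
        {"ph": "X", "cat": "kernel", "name": "copy", "ts": 1150, "dur": 10},
    ]
}

if __name__ == "__main__":
    print(gpu_idle_gaps(demo_trace))
```

For a real run, load the trace written by the profiler with json.load (it may be gzip-compressed) and pass the resulting dictionary to gpu_idle_gaps. Long or frequent gaps usually indicate that the GPU is waiting for data or for the CPU.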
Low level profiling with NVIDIA nsys
Instructions for ATOS
- Load nvidia module: ml prgenv/amd prgenv/expert prgenv/gnu prgenv/intel prgenv/intel-llvm prgenv/nvidia prgenv/pgi nvidia/22.11
- Create output folder, e.g. nsys_profile
- Run nsys: %>PGI_ACC_CUDA_HEAPSIZE=8GB nsys profile -o nsys_profile/report_rollout1 -f true -t nvtx,cuda,cublas,cublas-verbose,cusparse,cusparse-verbose,cudnn -s none --cuda-memory-usage true --stats true aifs-train
- Note: profiling a small, single-epoch run is sufficient and likely more useful than profiling a large training run
- Copy the output folder to your local machine
- Install NVIDIA Nsight Systems (requires an NVIDIA developer account for download)
- Open the generated .nsys-rep report (report_rollout1.nsys-rep for the command above) in NVIDIA Nsight Systems
- For interpretation you might want to contact Michael Lange and his team (IFS performance and portability), who are responsible for the GPU port of IFS
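If several rollout configurations are profiled, the nsys call above can be wrapped in a small helper so that the output folder is created and each run gets its own report name. This is only a sketch: the flags are copied from the command above, while the function names and the default arguments are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def build_nsys_command(report_name, output_dir="nsys_profile",
                       target=("aifs-train",)):
    """Assemble the nsys call from the instructions as an argument list."""
    return [
        "nsys", "profile",
        "-o", str(Path(output_dir) / report_name),
        "-f", "true",
        "-t", "nvtx,cuda,cublas,cublas-verbose,cusparse,cusparse-verbose,cudnn",
        "-s", "none",
        "--cuda-memory-usage", "true",
        "--stats", "true",
        *target,
    ]


def build_env(heap_size="8GB"):
    """Copy the current environment and set the heap size used above."""
    env = dict(os.environ)
    env["PGI_ACC_CUDA_HEAPSIZE"] = heap_size
    return env


def run_profile(report_name, output_dir="nsys_profile"):
    """Create the output folder and run nsys (requires the nvidia module)."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_nsys_command(report_name, output_dir),
                   env=build_env(), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe anywhere.
    print(" ".join(build_nsys_command("report_rollout1")))
```

Calling run_profile("report_rollout1") on a GPU node should then be equivalent to the manual command, with the report written to nsys_profile/report_rollout1.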