Objectives

  • Hide memory latency
  • High utilisation of tensor cores (on NVIDIA cards; see the precision-settings sketch after this list)
  • Occupancy >= 50% (the ratio of active warps per SM to the maximum number of warps the SM can support)
  • Shared memory (which shares on-chip storage with the L1 cache) is utilised
  • High GPU utilization
    • Note that this metric can be misleading: memory operations (transpose, reshuffle, etc.) are also counted, so a high GPU utilisation can be reported even when most of the time is spent on memory ops
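
Tensor-core utilisation in particular depends on the precision settings used for matrix multiplications. Below is a minimal, hedged sketch of the standard PyTorch knobs that usually control this; it is purely illustrative and not taken from the AIFS code base.

    import torch

    # Allow TF32 tensor-core paths for float32 matmuls (Ampere and newer GPUs).
    torch.set_float32_matmul_precision("high")

    # Mixed precision: autocast routes eligible ops to bfloat16 tensor cores.
    model = torch.nn.Linear(1024, 1024).cuda()
    x = torch.randn(64, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        y = model(x)

Whether these settings are acceptable for a given training run depends on the numerical tolerance of the model; they are only relevant to the tensor-core objective above.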

Profiling with torch profiler

  1. Run the code with profiler: True set in aifs/config/diagnostics/eval_rollout.yaml (see the torch.profiler sketch after this list)
    1. You might want to modify other settings or the code so that only a single epoch is run; this is typically sufficient to get reliable numbers
  2. Load the resulting trace in TensorBoard; see https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html for details on how to load and analyse the report using TensorBoard + browser (https://pytorch.org/blog/trace-analysis-for-masses/ is useful for large models)
    1. Memory latencies: in the trace/timeline view, check that memory transfers are overlapped with computation and that there are no gaps in the estimated SM efficiency or in the call trace of the streams.
    2. Occupancy: the average is listed on the overview page; details under Views → GPU Kernel
    3. Tensor core utilisation: Views → GPU kernel
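
For reference, the profiler: True flag corresponds roughly to wrapping the training loop in torch.profiler. The following is a self-contained sketch that produces a trace readable by the TensorBoard plugin; the model, schedule and output directory are illustrative and not the actual AIFS integration.

    import torch
    from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

    model = torch.nn.Linear(1024, 1024).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x = torch.randn(64, 1024, device="cuda")

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=tensorboard_trace_handler("./tb_profile"),  # illustrative output dir
        profile_memory=True,
        with_stack=True,
    ) as prof:
        for _ in range(6):
            opt.zero_grad()
            model(x).sum().backward()
            opt.step()
            prof.step()  # advance the wait/warmup/active schedule each iteration

The trace written to ./tb_profile can then be opened with tensorboard --logdir ./tb_profile (requires the torch-tb-profiler plugin), which provides the overview, trace and GPU kernel views mentioned above.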

Low-level profiling with NVIDIA nsys

Instructions for ATOS

  1. Load the NVIDIA modules: ml prgenv/nvidia nvidia/22.11 (the other prgenv/* modules are alternative compiler environments and should not be loaded at the same time)
  2. Create output folder, e.g. nsys_profile
  3. Run nsys: PGI_ACC_CUDA_HEAPSIZE=8GB nsys profile -o nsys_profile/report_rollout1 -f true -t nvtx,cuda,cublas,cublas-verbose,cusparse,cusparse-verbose,cudnn -s none --cuda-memory-usage true --stats true aifs-train
      -> Running a small, single-epoch job is sufficient and usually more useful than profiling a full training run. The -t nvtx option picks up NVTX ranges emitted from the model code (see the sketch after this list).
  4. Copy output folder to your local machine
  5. Install NVIDIA Nsight Systems (requires an NVIDIA developer account for download)
  6. Open the generated .nsys-rep file (e.g. nsys_profile/report_rollout1.nsys-rep) in NVIDIA Nsight Systems
      - For interpretation you might want to contact Michael Lange and his team (IFS performance and portability), who are responsible for the GPU port of the IFS
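
The nsys timeline is much easier to interpret when the model code is annotated with NVTX ranges, which the -t nvtx option above collects. Below is a hedged sketch of the standard PyTorch hooks for this; the range name and model are made up for illustration.

    import torch

    model = torch.nn.Linear(1024, 1024).cuda()
    x = torch.randn(64, 1024, device="cuda")

    # Named region that appears as a bar on the NVTX row of the nsys timeline.
    torch.cuda.nvtx.range_push("forward")
    y = model(x)
    torch.cuda.nvtx.range_pop()

    # Alternatively, emit one NVTX range per autograd op for a whole block.
    with torch.autograd.profiler.emit_nvtx():
        model(x).sum().backward()

emit_nvtx() is only useful when running under an external profiler such as nsys; it adds overhead, so it should not be left enabled in normal training runs.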

