Details
- Improvement
- Resolution: Unresolved
- Major
Description
NVIDIA Nsight Systems is a system-wide profiler. It reads performance counters from CPUs, GPUs, NICs, storage volumes, etc. and places all of the data on a unified timeline. This lets software developers see how their application executes on a server so that they can optimize its performance.
Additional details are at:
- https://developer.nvidia.com/nsight-systems
- https://docs.nvidia.com/nsight-systems/UserGuide/index.html
Counters are read at a high frequency (typically 10 kHz) so that counter values can be correlated with application actions and source code.
As of Feb 2025, Nsight Systems collects performance counters from Lustre and NFS volumes, NVMe disks and NVMe-oF. Support for additional storage protocols is being added.
When reading Lustre performance counters, Nsight Systems users experience two problems:
- Lustre counters are exposed under /sys/kernel/debug/lustre. Accessing this location requires root access, which cluster users don't usually have.
- Lustre counters can be read reliably at 1 kHz, but Nsight Systems would like to read them at 10 kHz.
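Both problems can be checked on a client node with a small script. The following is a minimal sketch, not how Nsight Systems itself collects counters: it tests whether the current user can read the debugfs location named above, and it measures how many back-to-back reads per second a given file sustains, for comparison against the 100 us per-sample budget that 10 kHz implies. The helper names (`can_read`, `period_us`, `measure_read_rate`) are hypothetical, and the file to time is supplied by the user on the command line because this description does not name specific counter files.

```python
import os
import sys
import time

# Location quoted in the description above; nothing beneath it is assumed here.
LUSTRE_DEBUGFS_ROOT = "/sys/kernel/debug/lustre"


def can_read(path: str) -> bool:
    """Return True if the current user may read `path` (False if missing or denied)."""
    return os.access(path, os.R_OK)


def period_us(rate_hz: float) -> float:
    """Time budget per sample, in microseconds, for a given sampling rate."""
    return 1_000_000.0 / rate_hz


def measure_read_rate(path: str, n_reads: int = 1000) -> float:
    """Open and fully read `path` n_reads times; return achieved reads per second."""
    start = time.perf_counter()
    for _ in range(n_reads):
        with open(path, "rb") as f:
            f.read()
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very coarse clocks.
    return n_reads / max(elapsed, 1e-9)


if __name__ == "__main__":
    print(f"10 kHz leaves {period_us(10_000):.0f} us per sample")
    print(f" 1 kHz leaves {period_us(1_000):.0f} us per sample")

    if can_read(LUSTRE_DEBUGFS_ROOT):
        print(f"{LUSTRE_DEBUGFS_ROOT} is readable by this user")
    else:
        print(f"{LUSTRE_DEBUGFS_ROOT} is missing or not readable by this user")

    # Optional: time reads of a counter file chosen by the user.
    if len(sys.argv) > 1:
        rate = measure_read_rate(sys.argv[1])
        print(f"{sys.argv[1]}: about {rate:.0f} reads/s")
```

Run as an unprivileged user, the script is expected to report the debugfs location as not readable on a typical cluster node. The measured rate is only an upper bound for a tight loop in Python, so a native collector may see different numbers.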