How It Works
rocm-trace-lite captures GPU kernel execution data through HSA runtime interception, without any dependency on roctracer or rocprofiler-sdk.
Architecture overview
Application (PyTorch, Triton, HIP, etc.)
│
▼
HIP Runtime
│
▼
HSA Runtime ◄── HSA_TOOLS_LIB=librtl.so
│ │
│ ├── OnLoad(): replace API table entries
│ ├── my_hsa_queue_create(): intercept queue creation
│ ├── queue_intercept_cb(): inject profiling signals
│ └── completion_worker(): read timestamps, write DB
▼
GPU Hardware
Interception mechanism
1. Library loading
When HSA_TOOLS_LIB is set, the ROCm HSA runtime calls OnLoad() during hsa_init().
This gives us the HSA API function table, which we modify:
hsa_queue_create → my_hsa_queue_create (intercept queue creation)
hsa_executable_freeze → my_hsa_executable_freeze (capture kernel symbols)
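The table patching described above can be sketched without the ROCm headers. This is a minimal illustration, not the library's actual code: MockCoreApiTable, patch_table, the fake_* functions and the counters are hypothetical stand-ins, and the real HSA API table has different member types and many more entries.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two API table entries that matter here.
struct MockCoreApiTable {
    int (*hsa_queue_create_fn)(uint32_t size);
    int (*hsa_executable_freeze_fn)(uint64_t executable);
};

// Originals saved at load time so the wrappers can forward to them.
static MockCoreApiTable g_original{};
static int g_queues_seen = 0;
static int g_freezes_seen = 0;

int my_hsa_queue_create(uint32_t size) {
    assert(g_original.hsa_queue_create_fn != nullptr);  // must be patched first
    ++g_queues_seen;                                     // tool-side bookkeeping
    return g_original.hsa_queue_create_fn(size);         // forward to the runtime
}

int my_hsa_executable_freeze(uint64_t executable) {
    assert(g_original.hsa_executable_freeze_fn != nullptr);
    int status = g_original.hsa_executable_freeze_fn(executable);
    if (status == 0) ++g_freezes_seen;  // count only successful freezes
    return status;
}

// Mirrors what OnLoad() is described as doing: save the table, then patch it.
bool patch_table(MockCoreApiTable* table) {
    if (table == nullptr) return false;
    g_original = *table;
    table->hsa_queue_create_fn = my_hsa_queue_create;
    table->hsa_executable_freeze_fn = my_hsa_executable_freeze;
    return true;
}

// Fake runtime entry points, used only to exercise the sketch.
int fake_queue_create(uint32_t) { return 0; }
int fake_executable_freeze(uint64_t) { return 0; }
```

Saving the original pointers before overwriting them is what lets each wrapper add its own behavior and still hand the call to the runtime.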
2. Queue interception
Every hsa_queue_create call is redirected to create an interceptible queue via
hsa_amd_queue_intercept_create. This allows us to register a callback that sees every
AQL packet before it reaches the hardware:
hsa_amd_queue_intercept_create(agent, size, type, ...queue);
hsa_amd_profiling_set_profiler_enabled(*queue, true);
hsa_amd_queue_intercept_register(*queue, queue_intercept_cb, &qi);
3. Signal injection profiling
For each kernel dispatch packet, the intercept callback:
Acquires a profiling signal from a reusable pool
Saves the original completion signal (if any)
Replaces pkt->completion_signal with the profiling signal
Submits the modified packet via writer()
Original packet: [kernel_dispatch | signal=0x0 ]
Modified packet: [kernel_dispatch | signal=prof_42]
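The signal swap in the steps above can be illustrated with simplified types. This is a sketch under stated assumptions: MockSignal, MockDispatchPacket, PendingDispatch and inject_profiling_signal are hypothetical names, and the real AQL dispatch packet carries many more fields.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the HSA types involved.
struct MockSignal { uint64_t handle; };
struct MockDispatchPacket {
    uint64_t kernel_object;
    MockSignal completion_signal;  // handle 0 means "no signal"
};

// What the tool has to remember per dispatch to finish the job later.
struct PendingDispatch {
    uint64_t kernel_object;
    MockSignal profiling_signal;
    MockSignal original_signal;
};

std::vector<PendingDispatch> g_pending;

// Rewrites one packet in place and records the bookkeeping entry.
void inject_profiling_signal(MockDispatchPacket& pkt, MockSignal prof) {
    assert(prof.handle != 0);  // a null profiling signal could never complete
    g_pending.push_back({pkt.kernel_object, prof, pkt.completion_signal});
    pkt.completion_signal = prof;
}
```

The original signal is stored before it is overwritten, so it can still be forwarded once the kernel has completed.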
4. Completion worker
A single background thread processes completed dispatches:
Wait on the profiling signal (100ms timeout for clean shutdown)
Read GPU timestamps via hsa_amd_profiling_get_dispatch_time
Record kernel name, device ID, timestamps to SQLite
Forward original completion signal (if non-null)
Return profiling signal to pool
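One iteration of that loop, after the wait has succeeded, might look like the following. It is a single-threaded sketch with hypothetical types (MockSignal, PendingDispatch, KernelRecord, process_completion): the timestamps are passed in rather than queried from HSA, a vector stands in for the SQLite database, and forwarding is modeled as decrementing the original signal, which is an assumption about how the forward is done.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in; a real tool would use hsa_signal_t and the HSA calls.
struct MockSignal { int64_t value = 1; };

struct PendingDispatch {
    std::string kernel_name;
    int device_id;
    uint64_t start_ns, end_ns;    // stand-in for the dispatch-time query result
    MockSignal* profiling_signal;
    MockSignal* original_signal;  // nullptr when the packet carried no signal
};

struct KernelRecord {
    std::string name;
    int device_id;
    uint64_t start_ns, end_ns;
};

// Handles one dispatch whose profiling signal has already reached zero.
void process_completion(const PendingDispatch& d,
                        std::vector<KernelRecord>& db,
                        std::vector<MockSignal*>& pool) {
    assert(d.profiling_signal != nullptr && d.profiling_signal->value == 0);
    db.push_back({d.kernel_name, d.device_id, d.start_ns, d.end_ns});
    if (d.original_signal != nullptr) {
        d.original_signal->value -= 1;  // assumed forwarding: complete it for the waiter
    }
    d.profiling_signal->value = 1;      // reset before reuse
    pool.push_back(d.profiling_signal);
}
```

Forwarding happens only after the record is written, which keeps the application's own synchronization from racing ahead of the profiler's bookkeeping in this model.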
5. Symbol resolution
Kernel names are captured by intercepting hsa_executable_freeze:
hsa_executable_iterate_symbols(executable, symbol_iterate_cb, nullptr);
// Maps kernel_object handle → kernel name string
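The handle-to-name mapping can be sketched as a small lookup table. KernelNameTable and its placeholder naming for unknown handles are hypothetical; the mutex reflects the assumption that names are added on the thread calling hsa_executable_freeze and read on the completion worker.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

class KernelNameTable {
public:
    // Called once per kernel symbol found while iterating a frozen executable.
    void add(uint64_t kernel_object, const std::string& name) {
        assert(!name.empty());
        std::lock_guard<std::mutex> lock(mu_);
        names_[kernel_object] = name;
    }

    // Called at completion time; unknown handles get a placeholder name.
    std::string lookup(uint64_t kernel_object) const {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = names_.find(kernel_object);
        if (it != names_.end()) return it->second;
        return "unknown_kernel_" + std::to_string(kernel_object);
    }

private:
    mutable std::mutex mu_;
    std::unordered_map<uint64_t, std::string> names_;
};
```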
6. roctx shim
The library exports roctxRangePushA, roctxRangePop, roctxMarkA, roctxRangeStartA,
and roctxRangeStop symbols, allowing applications that use roctx markers to work
without linking libroctx64. Both nested (push/pop) and non-nested (start/stop) ranges
are supported.
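The two range styles need different bookkeeping: a stack for push/pop and an id-keyed map for start/stop. The sketch below is illustrative only. RangeTracker is a hypothetical name, the return-value conventions (depth on push and pop, -1 on an empty pop) are assumptions, and a real shim would likely keep the stack per thread.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct RangeTracker {
    std::vector<std::string> stack;                  // nested push/pop ranges
    std::unordered_map<uint64_t, std::string> open;  // start/stop ranges by id
    uint64_t next_id = 1;

    // Returns the depth of the new range (0 for the outermost one).
    int push(const std::string& msg) {
        stack.push_back(msg);
        return static_cast<int>(stack.size()) - 1;
    }

    // Returns the depth that was closed, or -1 when nothing is open.
    int pop() {
        if (stack.empty()) return -1;
        stack.pop_back();
        return static_cast<int>(stack.size());
    }

    // Opens a non-nested range and returns its id.
    uint64_t start(const std::string& msg) {
        uint64_t id = next_id++;
        assert(open.find(id) == open.end());  // ids are never reused
        open[id] = msg;
        return id;
    }

    // Closes a non-nested range; false if the id is unknown or already closed.
    bool stop(uint64_t id) { return open.erase(id) == 1; }
};
```

Start/stop ranges may overlap push/pop ranges in any order, which is why they cannot share the stack.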
Signal pool design
Creating HSA signals is expensive. The signal pool avoids per-dispatch overhead:
Pre-allocate 64 signals at startup
Grow on demand up to 4096 maximum
Reuse signals after completion (reset to initial value 1)
Destroy excess signals when pool is full
Steady-state: zero hsa_signal_create calls after warmup
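The pool policy above can be modeled with a free list. This is a sketch, not the library's implementation: SignalPool and MockSignal are hypothetical, std::unique_ptr stands in for signal creation and destruction, and the 4096 limit is read here as a cap on the number of idle signals kept in the pool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct MockSignal { int64_t value = 1; };

class SignalPool {
public:
    SignalPool(std::size_t initial, std::size_t max_free) : max_free_(max_free) {
        assert(initial <= max_free);
        for (std::size_t i = 0; i < initial; ++i) free_.push_back(create());
    }

    std::unique_ptr<MockSignal> acquire() {
        if (free_.empty()) return create();  // grow on demand
        std::unique_ptr<MockSignal> s = std::move(free_.back());
        free_.pop_back();
        return s;
    }

    void release(std::unique_ptr<MockSignal> s) {
        s->value = 1;  // reset to the initial value
        if (free_.size() < max_free_) free_.push_back(std::move(s));
        // otherwise s goes out of scope here and the excess signal is destroyed
    }

    std::size_t creations() const { return creations_; }
    std::size_t free_count() const { return free_.size(); }

private:
    std::unique_ptr<MockSignal> create() {
        ++creations_;  // stand-in for one hsa_signal_create call
        return std::unique_ptr<MockSignal>(new MockSignal());
    }

    std::size_t max_free_;
    std::size_t creations_ = 0;
    std::vector<std::unique_ptr<MockSignal>> free_;
};
```

With SignalPool pool(64, 4096), a loop that acquires and releases one signal at a time never raises creations() above 64, which is the steady-state behavior the list describes.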
CUDAGraph / HIP graph handling
Signal injection is incompatible with CUDAGraph replay at two levels:
Batch replay: CUDAGraph replay submits pre-recorded AQL packets via the intercept callback with count > 1. Injecting signals into these packets corrupts the graph’s execution chain (0x1009).
Graph capture: Signals injected during capture get baked into the graph. On replay, these signal handles are stale/recycled, causing GPU memory access faults.
Batch skip (automatic)
The intercept callback detects batch submissions (count > 1) and passes them through unmodified:
if (count > 1) {
writer(in_packets, count); // pass through, no signal injection
return;
}
Graph-replayed kernels are not profiled, but the application runs correctly.
Profiling modes (RTL_MODE)
RTL supports three profiling modes to balance data completeness vs overhead:
| Mode | Behavior | Overhead |
|---|---|---|
| default | Signal injection for all | ~2-4% |
| lite | Like default, but also skip packets with existing | ~0% |
| full | Profile everything including graph replay batches. Requires ROCm 7.13+ with ROCR fix. | ~2-5% |
Select a mode with the RTL_MODE environment variable (for example, RTL_MODE=lite) or with the --mode CLI flag (for example, rtl trace --mode lite).
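Reading the mode from the environment could look like the sketch below. Only the RTL_MODE variable and the three mode names come from this page; the Mode enum, the function names, exact-match parsing, and the fallback to default for missing or unrecognized values are assumptions.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

enum class Mode { Default, Lite, Full };

// Maps a mode string to the enum; missing or unknown values fall back to default.
Mode parse_mode(const char* value) {
    if (value == nullptr) return Mode::Default;
    const std::string s(value);
    if (s == "lite") return Mode::Lite;
    if (s == "full") return Mode::Full;
    return Mode::Default;
}

Mode mode_from_env() { return parse_mode(std::getenv("RTL_MODE")); }

// Only full mode profiles submissions with count > 1.
bool profiles_batches(Mode mode) { return mode == Mode::Full; }

const char* mode_name(Mode mode) {
    switch (mode) {
        case Mode::Default: return "default";
        case Mode::Lite:    return "lite";
        case Mode::Full:    return "full";
    }
    assert(false && "unhandled Mode");
    return "default";
}
```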
Known limitation
The HSA intercept API does not distinguish graph replay from normal multi-packet submissions. Default and lite modes skip all count > 1 submissions. Full mode profiles them but requires the ROCR staging buffer fix to avoid heap overflow (see issue #67).
Why signal injection?
The HIP runtime does not set completion_signal on most kernel dispatch AQL packets (ROCm 7.2+); it synchronizes with barrier packets that carry signals instead. Without signal injection, no kernels would be captured. RTL therefore injects its own profiling signals and forwards the original signals once profiling is done.