# Quick Start

## Basic usage
Profile any GPU workload with a single command:
```shell
rtl trace -o trace.db python3 my_model.py
```
This automatically:

- Injects the profiler library via `HSA_TOOLS_LIB`
- Captures all GPU kernel dispatches with timestamps
- Merges per-process traces (for multi-GPU / distributed workloads)
- Generates a summary, Perfetto JSON, and SQLite database
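Conceptually, the injection step boils down to launching the workload with the right environment. The sketch below is a hypothetical helper that mirrors what `rtl trace` does under the hood per the list above (the library path is a placeholder, and the real CLI also merges traces afterwards):

```python
import os
import subprocess

def run_with_profiler(cmd, output_db, lib_path="/path/to/librtl.so"):
    """Launch a workload with the profiler injected via HSA_TOOLS_LIB.

    Illustrative helper, not part of rocm-trace-lite: it only shows the
    environment-variable injection described above.
    """
    env = dict(os.environ)
    env["HSA_TOOLS_LIB"] = lib_path  # HSA runtime loads this tool library
    env["RTL_OUTPUT"] = output_db    # per-process trace output path
    return subprocess.run(cmd, env=env)

# Roughly equivalent to: rtl trace -o trace.db python3 my_model.py
# run_with_profiler(["python3", "my_model.py"], "trace.db")
```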
## View results

### Terminal summary

```shell
rtl summary trace.db
```
```text
Trace: trace.db
GPU ops: 728

Kernel                                 Calls  Total(us)  Avg(us)     %
======================================================================
Cijk_Ailk_Bljk_HHS_BH_MT128x128x128      240    28252.9    117.7  21.8
ncclDevKernel_Generic                    160    29747.8    185.9  23.0
__amd_rocclr_fillBufferAligned.kd       7900    27929.8      3.5  21.6

GPU Utilization:
  GPU 0: 0.13% (2630 ops, 17.2ms busy)
  GPU 1: 0.11% (2430 ops, 15.0ms busy)
```
### Perfetto timeline

The trace command auto-generates a compressed `.json.gz` file.
Open it in [ui.perfetto.dev](https://ui.perfetto.dev) for interactive timeline visualization.
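You can also peek at the export programmatically. The sketch below assumes Chrome-trace-style JSON, which is either a bare event array or an object with a `"traceEvents"` key; the exact layout of rtl's export may differ:

```python
import gzip
import json

def count_trace_events(path):
    """Count events in a Perfetto/Chrome-trace JSON export (.json.gz).

    Illustrative helper: handles both the bare-array and the
    {"traceEvents": [...]} layouts of the Chrome trace format.
    """
    with gzip.open(path, "rt") as f:
        data = json.load(f)
    events = data["traceEvents"] if isinstance(data, dict) else data
    return len(events)

# count_trace_events("trace.json.gz")
```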
### SQL queries

The trace file is a standard SQLite database. Query it directly:

```shell
# Top 10 kernels by GPU time
sqlite3 trace.db "SELECT * FROM top LIMIT 10;"

# GPU utilization
sqlite3 trace.db "SELECT * FROM busy;"

# All GEMM kernels
sqlite3 trace.db "
SELECT s.string, count(*), sum(o.end - o.start)/1000 AS total_us
FROM rocpd_op o
JOIN rocpd_string s ON o.description_id = s.id
WHERE s.string LIKE '%Cijk%'
GROUP BY s.string
ORDER BY total_us DESC;
"
```
## Multi-GPU / Distributed

rocm-trace-lite automatically handles multi-process workloads (e.g., `torchrun`):

```shell
rtl trace -o trace.db torchrun --nproc_per_node=8 my_model.py
```
Each process writes its own trace file (`trace_<PID>.db`); these are
automatically merged into the final output. GPU IDs are preserved across processes.
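The merge itself is a standard SQLite pattern. This is an illustrative sketch, not rocm-trace-lite's actual code: it copies one table via `ATTACH`, whereas a real merge also has to remap string IDs and handle the other tables.

```python
import sqlite3

def merge_traces(merged_path, per_pid_paths):
    """Sketch of merging per-process trace DBs with SQLite ATTACH.

    Hypothetical helper; table and column names come from the SQL
    examples in this document.
    """
    dst = sqlite3.connect(merged_path)
    dst.execute(
        "CREATE TABLE IF NOT EXISTS rocpd_op ("
        "id INTEGER PRIMARY KEY, description_id INTEGER, "
        "start INTEGER, end INTEGER)"
    )
    for i, path in enumerate(per_pid_paths):
        dst.execute(f"ATTACH DATABASE ? AS src{i}", (path,))
        # Drop the per-process primary keys so rows get fresh IDs.
        dst.execute(
            f"INSERT INTO rocpd_op (description_id, start, end) "
            f"SELECT description_id, start, end FROM src{i}.rocpd_op"
        )
        dst.commit()
        dst.execute(f"DETACH DATABASE src{i}")
    return dst
```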
## Using roctx markers

Applications that use roctx markers are captured automatically:
```python
import ctypes

lib = ctypes.CDLL("librtl.so")

# Nested ranges (push/pop)
lib.roctxRangePushA(b"forward_pass")
# ... GPU work ...
lib.roctxRangePop()

# Non-nested ranges (start/stop)
lib.roctxRangeStartA.restype = ctypes.c_uint64
rid = lib.roctxRangeStartA(b"data_loading")
# ... work ...
lib.roctxRangeStop(rid)

# Instant markers
lib.roctxMarkA(b"checkpoint")
```
These appear as `UserMarker` events in the trace.
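For Python workloads, the push/pop calls above can be wrapped in a small context manager, an illustrative helper rather than part of rocm-trace-lite, so the same code runs as a no-op when the library isn't loaded:

```python
import ctypes
from contextlib import contextmanager

@contextmanager
def roctx_range(name, lib=None):
    """Wrap a code block in a roctxRangePushA/roctxRangePop pair.

    Hypothetical convenience wrapper: loads librtl.so on demand and
    degrades to a no-op if it is unavailable.
    """
    if lib is None:
        try:
            lib = ctypes.CDLL("librtl.so")
        except OSError:
            lib = None  # unprofiled run: markers become no-ops
    if lib is not None:
        lib.roctxRangePushA(name.encode())
    try:
        yield
    finally:
        if lib is not None:
            lib.roctxRangePop()

# with roctx_range("forward_pass"):
#     ...  # GPU work
```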
## CUDAGraph / HIP graph compatibility
CUDAGraph replay submits batch AQL packets that are incompatible with signal injection.
rocm-trace-lite automatically skips batch submissions (count > 1), so graph-replayed
kernels are not profiled but the application runs correctly.
If you still see crashes (e.g., graph capture baking stale signal handles), use lite mode, which skips packets that already have a completion signal:

```shell
rtl trace --mode lite -o trace.db python3 my_cudagraph_model.py
```
Lite mode provides near-zero overhead and is the safest option for CUDAGraph workloads.
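The two skip rules described above can be sketched as follows. The function and parameter names are hypothetical; the real interceptor is compiled code inside librtl.so:

```python
def should_instrument(packet_count, has_completion_signal, mode="default"):
    """Sketch of the dispatch-skip rules for graph-replay safety."""
    if packet_count > 1:
        # Batch AQL submission (graph replay): skip rather than
        # inject signals into packets baked at capture time.
        return False
    if mode == "lite" and has_completion_signal:
        # Lite mode leaves packets that already carry a completion
        # signal untouched, hence its near-zero overhead.
        return False
    return True
```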
## Environment variables

| Variable     | Values                      | Description                                                |
|--------------|-----------------------------|------------------------------------------------------------|
| `RTL_OUTPUT` | path                        | Output trace file (supports …)                             |
| `RTL_MODE`   | `default` / `lite` / `full` | Profiling mode (see below)                                 |
|              | `1`, `2`                    | Packet-level diagnostic logging (1=summary, 2=per-packet)  |
## Profiling modes

| Mode      | GPU timing    | Graph replay | Overhead | Use case                        |
|-----------|---------------|--------------|----------|---------------------------------|
| `default` | Yes           | Skipped      | ~2-4%    | General profiling               |
| `lite`    | Yes (partial) | Skipped      | ~0%      | Production / always-on          |
| `full`    | Yes (all)     | Profiled     | ~2-5%    | Deep analysis (ROCm 7.13+ only) |
## Environment variable mode

For advanced control, set the environment variables directly:

```shell
export HSA_TOOLS_LIB=/path/to/librtl.so
export RTL_OUTPUT=my_trace.db
export RTL_MODE=lite  # optional: lite for ~0% overhead
python3 my_model.py
```