Quick Start

Basic usage

Profile any GPU workload with a single command:

rtl trace -o trace.db python3 my_model.py

This automatically:

  1. Injects the profiler library via HSA_TOOLS_LIB

  2. Captures all GPU kernel dispatches with timestamps

  3. Merges per-process traces (for multi-GPU / distributed workloads)

  4. Generates a summary, Perfetto JSON, and SQLite database

View results

Terminal summary

rtl summary trace.db
Trace: trace.db
  GPU ops:   728

Kernel                                              Calls  Total(us)  Avg(us)      %
====================================================================================
Cijk_Ailk_Bljk_HHS_BH_MT128x128x128                   240    28252.9    117.7   21.8
ncclDevKernel_Generic                                 160    29747.8    185.9   23.0
__amd_rocclr_fillBufferAligned.kd                    7900    27929.8      3.5   21.6

GPU Utilization:
  GPU 0: 0.13% (2630 ops, 17.2ms busy)
  GPU 1: 0.11% (2430 ops, 15.0ms busy)

Perfetto timeline

The trace command auto-generates a compressed .json.gz file. Open it in ui.perfetto.dev for interactive timeline visualization.
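For a quick sanity check of the generated JSON without opening the UI, here is a small Python sketch. It assumes Chrome trace-event layout (either a bare event list or a top-level traceEvents key); rtl's exact layout may differ.

```python
import gzip
import json

def count_trace_events(path):
    """Count events in a gzipped Perfetto JSON trace.

    Assumes Chrome trace-event layout: either a bare list of events or an
    object with a "traceEvents" key. Adjust if rtl's layout differs.
    """
    with gzip.open(path, "rt") as f:
        data = json.load(f)
    events = data["traceEvents"] if isinstance(data, dict) else data
    return len(events)
```

For example, count_trace_events("trace.json.gz"), where the filename follows your -o argument.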

SQL queries

The trace file is a standard SQLite database. Query it directly:

# Top 10 kernels by GPU time
sqlite3 trace.db "SELECT * FROM top LIMIT 10;"

# GPU utilization
sqlite3 trace.db "SELECT * FROM busy;"

# All GEMM kernels
sqlite3 trace.db "
  SELECT s.string, count(*), sum(o.end - o.start)/1000 as total_us
  FROM rocpd_op o
  JOIN rocpd_string s ON o.description_id = s.id
  WHERE s.string LIKE '%Cijk%'
  GROUP BY s.string
  ORDER BY total_us DESC;
"

Multi-GPU / Distributed

rocm-trace-lite automatically handles multi-process workloads (e.g., torchrun):

rtl trace -o trace.db torchrun --nproc_per_node=8 my_model.py

Each process writes to its own trace file (trace_<PID>.db), which are automatically merged into the final output. GPU IDs are preserved across processes.
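To verify that GPU IDs survived the merge, you can query the merged database from Python. This sketch assumes the standard rocpd schema's gpuId column on rocpd_op, which the schema excerpt above does not show; check with `.schema rocpd_op` if your build differs.

```python
import sqlite3

def gpu_ids(trace_path):
    """Return the distinct GPU IDs recorded in a (merged) trace database.

    Assumes rocpd_op has a gpuId column, as in the standard rocpd schema.
    """
    con = sqlite3.connect(trace_path)
    try:
        cur = con.execute("SELECT DISTINCT gpuId FROM rocpd_op ORDER BY gpuId")
        return [row[0] for row in cur]
    finally:
        con.close()
```

For example, gpu_ids("trace.db") after the torchrun command above should list one entry per GPU used.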

Using roctx markers

roctx range and marker calls made by the application are captured automatically. From Python, you can emit them with ctypes:

import ctypes
lib = ctypes.CDLL("librtl.so")

# Nested ranges (push/pop)
lib.roctxRangePushA(b"forward_pass")
# ... GPU work ...
lib.roctxRangePop()

# Non-nested ranges (start/stop)
lib.roctxRangeStartA.restype = ctypes.c_uint64
rid = lib.roctxRangeStartA(b"data_loading")
# ... work ...
lib.roctxRangeStop(rid)

# Instant markers
lib.roctxMarkA(b"checkpoint")

These appear as UserMarker events in the trace.
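In Python, a small context manager keeps push/pop pairs balanced even when the wrapped code raises. This is a hedged sketch: the no-op fallback is a convenience for running outside `rtl trace`, not an rtl feature.

```python
import ctypes
from contextlib import contextmanager

# Load librtl.so if present; otherwise fall back to a no-op stub so the
# instrumented code still runs (unprofiled) outside an `rtl trace` session.
# The stub is illustrative convenience, not part of rocm-trace-lite itself.
try:
    _rtl = ctypes.CDLL("librtl.so")
except OSError:
    class _NoopRoctx:
        def roctxRangePushA(self, name):
            pass
        def roctxRangePop(self):
            pass
    _rtl = _NoopRoctx()

@contextmanager
def roctx_range(name):
    """Wrap a code region in a nested roctx push/pop range.

    The pop runs in a finally block, so ranges stay balanced even when
    the wrapped code raises.
    """
    _rtl.roctxRangePushA(name.encode())
    try:
        yield
    finally:
        _rtl.roctxRangePop()

# Usage:
with roctx_range("forward_pass"):
    pass  # ... GPU work ...
```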

CUDAGraph / HIP graph compatibility

CUDAGraph replay submits batched AQL packets that are incompatible with signal injection. rocm-trace-lite automatically skips batch submissions (packet count > 1), so graph-replayed kernels are not profiled, but the application runs correctly.

If you still see crashes (e.g., graph capture baking stale signal handles into the captured graph), use lite mode, which skips packets that already have a completion signal:

rtl trace --mode lite -o trace.db python3 my_cudagraph_model.py

Lite mode provides near-zero overhead and is the safest option for CUDAGraph workloads.

Environment variables

Variable     Values               Description
===========================================================================
RTL_OUTPUT   path                 Output trace file (supports %p for PID).
                                  Alternative to -o flag. RPD_LITE_OUTPUT
                                  also accepted for backward compatibility.
RTL_MODE     default, lite, full  Profiling mode (see below)
RTL_DEBUG    1, 2                 Packet-level diagnostic logging
                                  (1=summary, 2=per-packet)

Profiling modes

Mode      GPU timing     Graph replay  Overhead  Use case
================================================================================
default   Yes            Skipped       ~2-4%     General profiling
lite      Yes (partial)  Skipped       ~0%       Production / always-on
full      Yes (all)      Profiled      ~2-5%     Deep analysis (ROCm 7.13+ only)

Environment variable mode

For advanced control, set environment variables directly:

export HSA_TOOLS_LIB=/path/to/librtl.so
export RTL_OUTPUT=my_trace.db
export RTL_MODE=lite    # optional: lite for ~0% overhead
python3 my_model.py