Quick Start

Basic usage

Profile any GPU workload with a single command:

rtl trace -o trace.db python3 my_model.py

This automatically:

  1. Injects the profiler library via HSA_TOOLS_LIB

  2. Captures all GPU kernel dispatches with timestamps

  3. Merges per-process traces (for multi-GPU / distributed workloads)

  4. Generates a summary, Perfetto JSON, and SQLite database

View results

Terminal summary

rtl summary trace.db
Trace: trace.db
  GPU ops:   728

Kernel                                              Calls  Total(us)  Avg(us)      %
====================================================================================
Cijk_Ailk_Bljk_HHS_BH_MT128x128x128                   240    28252.9    117.7   21.8
ncclDevKernel_Generic                                 160    29747.8    185.9   23.0
__amd_rocclr_fillBufferAligned.kd                    7900    27929.8      3.5   21.6

GPU Utilization:
  GPU 0: 0.13% (2630 ops, 17.2ms busy)
  GPU 1: 0.11% (2430 ops, 15.0ms busy)

Perfetto timeline

The trace command auto-generates a compressed .json.gz file. Open it in ui.perfetto.dev for interactive timeline visualization.
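For a quick sanity check of the generated JSON without opening the UI, here is a small Python sketch. It assumes Chrome trace-event layout (either a bare event list or a top-level traceEvents key); rtl's exact layout may differ.

```python
import gzip
import json

def count_trace_events(path):
    """Count events in a gzipped Perfetto JSON trace.

    Assumes Chrome trace-event layout: either a bare list of events or an
    object with a "traceEvents" key. Adjust if rtl's layout differs.
    """
    with gzip.open(path, "rt") as f:
        data = json.load(f)
    events = data["traceEvents"] if isinstance(data, dict) else data
    return len(events)
```

For example, count_trace_events("trace.json.gz"), where the filename follows your -o argument.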

SQL queries

The trace file is a standard SQLite database. Query it directly:

# Top 10 kernels by GPU time
sqlite3 trace.db "SELECT * FROM top LIMIT 10;"

# GPU utilization
sqlite3 trace.db "SELECT * FROM busy;"

# All GEMM kernels
sqlite3 trace.db "
  SELECT s.string, count(*), sum(o.end - o.start)/1000 as total_us
  FROM rocpd_op o
  JOIN rocpd_string s ON o.description_id = s.id
  WHERE s.string LIKE '%Cijk%'
  GROUP BY s.string
  ORDER BY total_us DESC;
"

Multi-GPU / Distributed

rocm-trace-lite automatically handles multi-process workloads (e.g., torchrun):

rtl trace -o trace.db torchrun --nproc_per_node=8 my_model.py

Each process writes to its own trace file (trace_<PID>.db), which are automatically merged into the final output. GPU IDs are preserved across processes.
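To verify that GPU IDs survived the merge, you can query the merged database from Python. This sketch assumes the standard rocpd schema's gpuId column on rocpd_op, which the schema excerpt above does not show; check with `.schema rocpd_op` if your build differs.

```python
import sqlite3

def gpu_ids(trace_path):
    """Return the distinct GPU IDs recorded in a (merged) trace database.

    Assumes rocpd_op has a gpuId column, as in the standard rocpd schema.
    """
    con = sqlite3.connect(trace_path)
    try:
        cur = con.execute("SELECT DISTINCT gpuId FROM rocpd_op ORDER BY gpuId")
        return [row[0] for row in cur]
    finally:
        con.close()
```

For example, gpu_ids("trace.db") after the torchrun command above should list one entry per GPU used.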

Using roctx markers

roctx range and marker calls made by the application are captured automatically. From Python, you can emit them with ctypes:

import ctypes
lib = ctypes.CDLL("librtl.so")

# Nested ranges (push/pop)
lib.roctxRangePushA(b"forward_pass")
# ... GPU work ...
lib.roctxRangePop()

# Non-nested ranges (start/stop)
lib.roctxRangeStartA.restype = ctypes.c_uint64
rid = lib.roctxRangeStartA(b"data_loading")
# ... work ...
lib.roctxRangeStop(rid)

# Instant markers
lib.roctxMarkA(b"checkpoint")

These appear as UserMarker events in the trace.
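In Python, a small context manager keeps push/pop pairs balanced even when the wrapped code raises. This is a hedged sketch: the no-op fallback is a convenience for running outside `rtl trace`, not an rtl feature.

```python
import ctypes
from contextlib import contextmanager

# Load librtl.so if present; otherwise fall back to a no-op stub so the
# instrumented code still runs (unprofiled) outside an `rtl trace` session.
# The stub is illustrative convenience, not part of rocm-trace-lite itself.
try:
    _rtl = ctypes.CDLL("librtl.so")
except OSError:
    class _NoopRoctx:
        def roctxRangePushA(self, name):
            pass
        def roctxRangePop(self):
            pass
    _rtl = _NoopRoctx()

@contextmanager
def roctx_range(name):
    """Wrap a code region in a nested roctx push/pop range.

    The pop runs in a finally block, so ranges stay balanced even when
    the wrapped code raises.
    """
    _rtl.roctxRangePushA(name.encode())
    try:
        yield
    finally:
        _rtl.roctxRangePop()

# Usage:
with roctx_range("forward_pass"):
    pass  # ... GPU work ...
```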

CUDAGraph / HIP graph compatibility

CUDAGraph replay submits batched AQL packets that are incompatible with signal injection. rocm-trace-lite automatically skips batch submissions (packet count > 1), so graph-replayed kernels are not profiled, but the application runs correctly.

If you still see crashes (e.g., graph capture baking stale signal handles into the captured graph), use lite mode, which skips packets that already have a completion signal:

rtl trace --mode lite -o trace.db python3 my_cudagraph_model.py

Lite mode provides near-zero overhead and is the safest option for CUDAGraph workloads.

Environment variables

Variable     Values               Description
===========================================================================
RTL_OUTPUT   path                 Output trace file (supports %p for PID).
                                  Alternative to -o flag. RPD_LITE_OUTPUT
                                  also accepted for backward compatibility.
RTL_MODE     default, lite, full  Profiling mode (see below)
RTL_DEBUG    1, 2                 Packet-level diagnostic logging
                                  (1=summary, 2=per-packet)

Profiling modes

Mode      GPU timing     Graph replay  Overhead  Use case
================================================================================
default   Yes            Skipped       ~2-4%     General profiling
lite      Yes (partial)  Skipped       ~0%       Production / always-on
full      Yes (all)      Profiled      ~2-5%     Deep analysis (ROCm 7.13+ only)

Environment variable mode

For advanced control, set environment variables directly:

export HSA_TOOLS_LIB=/path/to/librtl.so
export RTL_OUTPUT=my_trace.db
export RTL_MODE=lite    # optional: lite for ~0% overhead
python3 my_model.py