rocm-trace-lite

Self-contained GPU kernel profiler for ROCm. Zero roctracer/rocprofiler-sdk dependency.

Captures GPU kernel dispatch timestamps using only HSA runtime interception, writing to a standard SQLite .db file. One command to profile, one file to analyze.

rtl trace -o trace.db python3 my_model.py

Lightweight

Single .so library. Only depends on libhsa-runtime64 + libsqlite3. No roctracer, no rocprofiler-sdk.

Multi-GPU Ready

Validated at TP=8 (8-way tensor parallelism) on MI355X. Writes one trace file per process and merges them automatically. Kernel capture is symmetric across all ranks.

Perfetto Integration

Auto-generates compressed Perfetto JSON. Open in ui.perfetto.dev for timeline visualization.
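For scripting against the generated file rather than opening it in the UI, a minimal sketch follows. It assumes the trace uses the Chrome Trace Event JSON layout that ui.perfetto.dev accepts (a top-level list of events, or an object holding them under `traceEvents`). The exact fields rocm-trace-lite emits are not described on this page, so `load_trace_events` and `count_by_name` are hypothetical helpers, not part of the tool.

```python
import gzip
import json
from collections import Counter


def load_trace_events(path):
    """Return the event list from a gzip-compressed trace JSON file.

    Assumption: Chrome Trace Event layout, i.e. either a top-level list
    of events or an object that stores them under "traceEvents".
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    if isinstance(data, dict):
        return data.get("traceEvents", [])
    return data


def count_by_name(events):
    """Count events per "name" field, skipping entries that have none."""
    return Counter(
        e["name"] for e in events if isinstance(e, dict) and "name" in e
    )
```

Calling `count_by_name(load_trace_events("trace.json.gz")).most_common(10)` would then list the ten most frequent event names, provided the layout assumption holds for your file.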

SQLite Output

Outputs standard SQLite .db files. Compatible with RPD ecosystem tools and SQL queries.
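Because the output is a plain SQLite file, it can be queried with any SQLite client. The sketch below uses Python's standard `sqlite3` module and assumes an RPD-style schema: a `rocpd_op` table with `start`, `end` (nanoseconds), and `description_id` columns, and a `rocpd_string` table mapping `id` to `string`. Those table and column names are an assumption based on the stated RPD compatibility, so run `list_tables` (or `.schema` in the `sqlite3` shell) against your own trace to confirm them first.

```python
import sqlite3

# Assumed RPD-style schema (not confirmed by this page):
#   rocpd_op(start, "end", description_id, ...)  timestamps in nanoseconds
#   rocpd_string(id, string)                     kernel names
TOP_KERNELS_SQL = """
SELECT s.string                        AS kernel,
       COUNT(*)                        AS calls,
       SUM(o."end" - o."start") / 1e6  AS total_ms,
       AVG(o."end" - o."start") / 1e3  AS avg_us
FROM rocpd_op AS o
JOIN rocpd_string AS s ON s.id = o.description_id
GROUP BY s.string
ORDER BY total_ms DESC
LIMIT ?
"""


def list_tables(conn):
    """Return the names of all tables and views, to check the real schema."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    )
    return [name for (name,) in rows]


def top_kernels(conn, limit=10):
    """Return (kernel, calls, total_ms, avg_us) rows, longest total first."""
    return conn.execute(TOP_KERNELS_SQL, (limit,)).fetchall()
```

With a real trace this would be driven by `conn = sqlite3.connect("trace.db")` followed by `top_kernels(conn)`. The column names are quoted because `end` is an SQL keyword.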

Quick Example

# Install
pip install rocm-trace-lite

# Trace a workload
rtl trace -o trace.db python3 my_model.py

# View top kernels
rtl summary trace.db

# Open in Perfetto
# trace.json.gz is auto-generated, open at https://ui.perfetto.dev

Sample output (DeepSeek-R1 671B, TP=8, MI355X):

Trace: trace.db (200590 GPU ops)

Kernel                                             Calls  Total(ms)  Avg(us)     %
====================================================================================
ncclDevKernel_Generic_1                             4851    7879.5   1624.5   55.2%
aiter::fmoe_bf16_blockscaleFp8 (novs_silu)         3538    1239.8    350.4    8.7%
aiter::reduce_scatter_cross_device_store<bf16,8>    8906     927.2    104.1    6.5%
ck::kernel_gemm_xdl_cshuffle_v3 (blockscale)      20963     733.9     35.0    5.1%

GPU Utilization:
  GPU 0: 51.2% (25074 ops, 7.3s busy)
  GPU 1: 50.8% (25081 ops, 7.2s busy)
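How `rtl summary` derives the utilization figures is not stated here. One plausible reading is busy time divided by the length of the traced window, where busy time counts overlapping kernels only once. Under that assumption, 7.3 s busy at 51.2% would correspond to a window of roughly 14.3 s. The sketch below illustrates that calculation on plain `(start, end)` intervals; `busy_time` and `utilization` are hypothetical helpers, not the tool's implementation.

```python
def busy_time(intervals):
    """Total time covered by (start, end) intervals, counting overlaps once."""
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged run.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps (or touches) the current run: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy


def utilization(intervals, window_start, window_end):
    """Busy fraction of the window; assumes all intervals lie inside it."""
    span = window_end - window_start
    return busy_time(intervals) / span if span > 0 else 0.0
```

For example, kernels running over `(0, 10)`, `(5, 15)`, and `(20, 30)` are busy for 25 time units, which is 50% of a window spanning 0 to 50.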

Overhead is below 1%, validated on 6 ATOM dashboard models (including DeepSeek-R1, GPT-OSS, Kimi-K2.5, and MiniMax-M2.5). See the tutorial on profiling prefill vs. decode with built-in roctx markers.

Sample Results (MI355X, Apr 2026)

Live benchmark results and kernel traces from the ATOM dashboard validation sweep:

Supported Hardware

Architecture        GPU                    Status
====================================================================
CDNA 3 (gfx942)     MI300A, MI300X         Tested
CDNA 3.5 (gfx950)   MI355X                 Tested (TP=8 validated)
CDNA 4 (gfx1250)    MI450                  Tested (single GPU)
CDNA 2 (gfx90a)     MI210, MI250, MI250X   Expected to work (untested)

Acknowledgments

This project was inspired by and builds upon the work of:

  • Jeff Daily's ROCm Tracer for GPU (RTG), which pioneered the HSA_TOOLS_LIB interception approach for lightweight GPU kernel tracing

  • Michael Wootton's rocmProfileData (RPD), which established the SQLite-based trace format and ecosystem tools that rocm-trace-lite is compatible with