rocm-trace-lite
Self-contained GPU kernel profiler for ROCm. Zero roctracer/rocprofiler-sdk dependency.
Captures GPU kernel dispatch timestamps using only HSA runtime interception, writing to a standard SQLite .db file. One command to profile, one file to analyze.
rtl trace -o trace.db python3 my_model.py
Lightweight
Single .so library. Depends only on libhsa-runtime64 and libsqlite3. No roctracer, no rocprofiler-sdk.
Multi-GPU Ready
Validated TP=8 on MI355X. Per-process trace files with automatic merge. Symmetric kernel capture across all ranks.
Perfetto Integration
Auto-generates compressed Perfetto JSON. Open in ui.perfetto.dev for timeline visualization.
SQLite Output
Outputs standard SQLite .db files. Compatible with RPD ecosystem tools and SQL queries.
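Because the trace is a plain SQLite file, it can also be queried from Python's built-in `sqlite3` module. The sketch below is illustrative only: it assumes an RPD-style layout with a `rocpd_op` table (`start`/`end` timestamps in nanoseconds, `description_id` pointing into `rocpd_string`). Those table and column names are assumptions borrowed from RPD, not confirmed here for rocm-trace-lite, so check the real layout with `.schema` in the `sqlite3` shell first. The helper names `build_demo_db` and `top_kernels` are hypothetical, and the demo database is synthetic so the example runs without a real trace.

```python
import sqlite3

# ASSUMPTION: RPD-style schema. Verify against a real trace with `.schema`.
ASSUMED_SCHEMA = """
CREATE TABLE rocpd_string (id INTEGER PRIMARY KEY, string TEXT);
CREATE TABLE rocpd_op (
    id INTEGER PRIMARY KEY,
    gpuId INTEGER,
    "start" INTEGER,          -- assumed nanoseconds
    "end" INTEGER,            -- assumed nanoseconds
    description_id INTEGER REFERENCES rocpd_string(id)
);
"""

TOP_KERNELS_SQL = """
SELECT s.string                          AS kernel,
       COUNT(*)                          AS calls,
       SUM(o."end" - o."start") / 1e6    AS total_ms,
       AVG(o."end" - o."start") / 1e3    AS avg_us
FROM rocpd_op AS o
JOIN rocpd_string AS s ON s.id = o.description_id
GROUP BY s.string
ORDER BY total_ms DESC
LIMIT ?
"""


def build_demo_db():
    """Create a tiny in-memory database with synthetic rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(ASSUMED_SCHEMA)
    conn.executemany(
        "INSERT INTO rocpd_string (id, string) VALUES (?, ?)",
        [(1, "gemm_kernel"), (2, "nccl_kernel")],
    )
    conn.executemany(
        'INSERT INTO rocpd_op (gpuId, "start", "end", description_id) '
        "VALUES (?, ?, ?, ?)",
        [(0, 0, 2000, 1), (0, 3000, 4000, 1), (0, 0, 10000, 2)],
    )
    conn.commit()
    return conn


def top_kernels(conn, limit=10):
    """Return (kernel, calls, total_ms, avg_us) rows, longest total first."""
    return conn.execute(TOP_KERNELS_SQL, (limit,)).fetchall()


if __name__ == "__main__":
    for kernel, calls, total_ms, avg_us in top_kernels(build_demo_db()):
        print(f"{kernel:20s} {calls:6d} {total_ms:10.4f} {avg_us:10.1f}")
```

To try the same query on a real trace, replace `build_demo_db()` with `sqlite3.connect("trace.db")`, adjusting table and column names if the actual schema differs.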
Quick Example
# Install
pip install rocm-trace-lite
# Trace a workload
rtl trace -o trace.db python3 my_model.py
# View top kernels
rtl summary trace.db
# Open in Perfetto
# trace.json.gz is auto-generated, open at https://ui.perfetto.dev
Sample output (DeepSeek-R1 671B, TP=8, MI355X):
Trace: trace.db (200590 GPU ops)
Kernel Calls Total(ms) Avg(us) %
====================================================================================
ncclDevKernel_Generic_1 4851 7879.5 1624.5 55.2%
aiter::fmoe_bf16_blockscaleFp8 (novs_silu) 3538 1239.8 350.4 8.7%
aiter::reduce_scatter_cross_device_store<bf16,8> 8906 927.2 104.1 6.5%
ck::kernel_gemm_xdl_cshuffle_v3 (blockscale) 20963 733.9 35.0 5.1%
GPU Utilization:
GPU 0: 51.2% (25074 ops, 7.3s busy)
GPU 1: 50.8% (25081 ops, 7.2s busy)
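The utilization figures read naturally as busy time divided by the traced wall-clock window (7.3 s busy at 51.2% would imply a window of roughly 14 s). How `rtl summary` actually computes them is not stated here, so the following is only a sketch of one plausible definition: take the union of kernel intervals on a GPU, so that overlapping kernels are not counted twice, and divide by the window length. The function names `busy_time` and `utilization` are hypothetical.

```python
def busy_time(intervals):
    """Length of the union of (start, end) intervals, in the input's units."""
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close out the previous merged run.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps (or touches) the current run: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy


def utilization(intervals, window_start=None, window_end=None):
    """Busy fraction of the window; defaults to first start .. last end."""
    intervals = list(intervals)
    if not intervals:
        return 0.0
    if window_start is None:
        window_start = min(start for start, _ in intervals)
    if window_end is None:
        window_end = max(end for _, end in intervals)
    window = window_end - window_start
    return busy_time(intervals) / window if window > 0 else 0.0


if __name__ == "__main__":
    # Synthetic intervals: two overlapping kernels, then one isolated kernel.
    ops = [(0, 4), (2, 6), (10, 12)]
    print(f"busy={busy_time(ops)} utilization={utilization(ops, 0, 16):.1%}")
```

Merging intervals first matters when kernels from different queues run concurrently on the same GPU; simply summing durations could report more than 100% busy.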
Overhead below 1% was validated on 6 ATOM dashboard models (including DeepSeek-R1, GPT-OSS, Kimi-K2.5, and MiniMax-M2.5). See the tutorial on profiling prefill vs. decode with built-in roctx markers.
Sample Results (MI355X, Apr 2026)
Live benchmark results and kernel traces from the ATOM dashboard validation sweep:
Supported Hardware
| Architecture | GPU | Status |
|---|---|---|
| CDNA 3 (gfx942) | MI300A, MI300X | Tested |
| CDNA 3.5 (gfx950) | MI355X | Tested (TP=8 validated) |
| CDNA 4 (gfx1250) | MI450 | Tested (single GPU) |
| CDNA 2 (gfx90a) | MI210, MI250, MI250X | Expected to work (untested) |
Acknowledgments
This project was inspired by and builds upon the work of:
Jeff Daily's ROCm Tracer for GPU (RTG), which pioneered the HSA_TOOLS_LIB interception approach for lightweight GPU kernel tracing
Michael Wootton's rocmProfileData (RPD), which established the SQLite-based trace format and ecosystem tools that rocm-trace-lite is compatible with