RTL
rocm-trace-lite

v0.3.3 — GPU Kernel Profiler

Zero-overhead. No roctracer. No rocprofiler-sdk.
Quick Start
pip install rocm_trace_lite-0.3.3-py3-none-linux_x86_64.whl
rtl trace python3 my_model.py
rtl summary trace.db            # kernel breakdown by phase
rtl export trace.db              # → Perfetto JSON for ui.perfetto.dev

Overhead Validation — v0.3.3 (lite mode, MI355X, Apr 2026)

All models measured at conc=1 and conc=64. Symbol audit: 0 unresolved kernel symbols across all traces.

Standard Benchmark (ISL=1024, OSL=1024)

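| Model | TP | Conc | Baseline (tok/s) | RTL lite (tok/s) | Overhead |
|---|---|---|---|---|---|
| DeepSeek-R1-FP8 | 8 | 1 | 97.3 | 96.9 | -0.4% |
| DeepSeek-R1-FP8 | 8 | 64 | 2285.9 | 2285.0 | -0.0% |
| DeepSeek-R1-MXFP4 | 4 | 1 | 130.0 | 129.1 | -0.7% |
| DeepSeek-R1-MXFP4 | 4 | 64 | 2614.6 | 2634.8 | +0.8% |
| GLM-5-FP8 | 8 | 1 | 58.5 | 58.5 | +0.0% |
| GLM-5-FP8 | 8 | 64 | 1925.9 | 1934.9 | +0.5% |
| MiniMax-M2.5 | 2 | 1 | 114.9 | 115.5 | +0.5% |
| MiniMax-M2.5 | 2 | 64 | 2374.8 | 2374.8 | +0.0% |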
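
The Overhead column appears to be the relative change in throughput with RTL attached: a negative value means the traced run was slower, and a positive value means it was faster (within run-to-run noise). A minimal sketch of that calculation, assuming this sign convention. The function name `throughput_delta_pct` is hypothetical and not part of RTL.

```python
def throughput_delta_pct(baseline_tok_s: float, traced_tok_s: float) -> float:
    """Relative throughput change in percent (negative = traced run slower)."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return (traced_tok_s - baseline_tok_s) / baseline_tok_s * 100.0


# Rows taken from the standard benchmark table above.
print(round(throughput_delta_pct(97.3, 96.9), 1))      # DeepSeek-R1-FP8, conc=1
print(round(throughput_delta_pct(2614.6, 2634.8), 1))  # DeepSeek-R1-MXFP4, conc=64
```

Rounded to one decimal place, these two calls give -0.4 and 0.8, matching the corresponding rows of the table.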

Decode-Focused Benchmark (ISL=128, OSL=2048)

Each run uses more prompts (conc × 15, minimum 30) for higher statistical confidence in decode throughput.

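| Model | TP | Conc | Baseline (tok/s) | RTL lite (tok/s) | Overhead |
|---|---|---|---|---|---|
| DeepSeek-R1-FP8 | 8 | 1 | 100.9 | 100.7 | -0.2% |
| DeepSeek-R1-FP8 | 8 | 64 | 3145.1 | 3142.3 | -0.1% |
| DeepSeek-R1-MXFP4 | 4 | 1 | 131.2 | 131.3 | +0.1% |
| DeepSeek-R1-MXFP4 | 4 | 64 | 2900.0 | 2886.9 | -0.5% |
| GLM-5-FP8 | 8 | 1 | 59.0 | 58.9 | -0.2% |
| GLM-5-FP8 | 8 | 64 | 2053.6 | 2057.0 | +0.2% |
| MiniMax-M2.5 | 2 | 1 | 117.0 | 117.0 | +0.0% |
| MiniMax-M2.5 | 2 | 64 | 2643.0 | 2648.0 | +0.2% |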

Top 10 Kernels per Model — Prefill vs Decode

Kernel hotspots broken down by inference phase. Classification: MoE _ps_ suffix = prefill, paged_attention/sampling = decode, NCCL collectives shown separately. Rank 0 trace, conc=64, ISL/OSL=1k/1k. See Hot Trace Viewer for interactive visualization.
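
The name-based classification described above can be sketched as a small lookup. This covers only the name-matching part. How a marker kernel determines the phase of the other kernels around it is not described here, so kernels without a marker are returned as `None`. The function name `phase_marker` and the exact substrings (`"sample"` for sampling kernels, `"nccl"` for collectives) are assumptions for illustration, not RTL's actual implementation.

```python
from typing import Optional


def phase_marker(kernel_name: str) -> Optional[str]:
    """Return 'nccl', 'prefill', 'decode', or None if the name carries no marker."""
    name = kernel_name.lower()
    if "nccl" in name:
        return "nccl"  # collectives are reported separately
    if "moe" in name and "_ps_" in name:
        return "prefill"  # MoE kernel variant with the _ps_ tag
    if "paged_attention" in name or "sample" in name:
        return "decode"  # paged attention and token sampling
    return None  # no marker in this kernel name


for kernel in (
    "aiter::fmoe (novs_silu_1tg_ps_32x256)",
    "aiter::mix_sample_outer_exponential",
    "ck::gemm_xdl_v3 (blockscale_preshuffle)",
):
    print(kernel, "->", phase_marker(kernel))
```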

DeepSeek-R1-0528 FP8
260K ops · 109 kernels
TP=8
Prefill
73.5K ops · 1765.8ms compute (excl. 7809ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | aiter::fmoe (novs_silu_1tg_ps_32x256) | 290 | 2585.7 | 42.5% |
| 2 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 1,691 | 155.1 | 14.8% |
| 3 | __amd_rocclr_copyBuffer | 54,153 | 4.5 | 13.7% |
| 4 | elementwise_kernel (manual_unroll) | 488 | 164.5 | 4.5% |
| 5 | aiter::add_rmsnorm_quant | 610 | 122.7 | 4.2% |
| 6 | aiter::dynamic_per_group_scaled_quant | 1,729 | 28.4 | 2.8% |
| 7 | kn_get_mla_metadata_v1_2 | 415 | 96.8 | 2.3% |
| 8 | aiter::fmha_fwd (hd192_hd128_causal) | 243 | 146.5 | 2.0% |
| 9 | _gemm_a16_w16 (64x128x128) | 416 | 77.4 | 1.8% |
| 10 | hipBLASLt GEMM (MT128x128x256) | 232 | 91.0 | 1.2% |
Decode
186.5K ops · 4628.5ms compute (excl. 71ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | aiter::fmoe (novs_silu_1tg_32x256) | 3,538 | 350.4 | 26.8% |
| 2 | aiter::reduce_scatter<bf16, 8> | 8,906 | 104.1 | 20.0% |
| 3 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 19,272 | 24.5 | 10.2% |
| 4 | _gemm_a16_w16 (64x128x128) | 3,817 | 75.3 | 6.2% |
| 5 | __amd_rocclr_copyBuffer | 51,124 | 5.3 | 5.9% |
| 6 | aiter::fmoe (vs_silu_1tg_32x256) | 812 | 313.7 | 5.5% |
| 7 | kn_get_mla_metadata_v1_2 | 3,750 | 59.0 | 4.8% |
| 8 | aiter::mix_sample_outer_exponential | 3,817 | 35.1 | 2.9% |
| 9 | aiter::fmha_fwd (hd192_hd128_causal) | 4,454 | 24.9 | 2.4% |
| 10 | aiter::dynamic_per_group_scaled_quant | 19,626 | 5.6 | 2.4% |
gpt-oss-120b
95K ops · 113 kernels
TP=1
Prefill
3.0K ops · 718.8ms compute
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | ck_tile::MoeFlatmmKernel (GemmSpatial #1) | 108 | 2311.9 | 34.7% |
| 2 | elementwise_kernel (manual_unroll) | 432 | 488.3 | 29.3% |
| 3 | ck_tile::MoeFlatmmKernel (GemmSpatial #2) | 105 | 785.6 | 11.5% |
| 4 | hipBLASLt GEMM (MT256x256x64) | 108 | 305.0 | 4.6% |
| 5 | hipBLASLt GEMM (MT240x256x64) | 108 | 290.7 | 4.4% |
| 6 | _fused_qk_rope_reshape_and_cache | 72 | 410.0 | 4.1% |
| 7 | __amd_rocclr_copyBuffer | 372 | 48.8 | 2.5% |
| 8 | paged_attention | 72 | 236.0 | 2.4% |
| 9 | vectorized_elementwise (float8_copy) | 213 | 57.2 | 1.7% |
| 10 | vectorized_elementwise (Fill) | 144 | 69.7 | 1.4% |
Decode
92.0K ops · 2014.7ms compute
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | ck_tile::MoeFlatmmKernel (GemmSpatial #1) | 2,052 | 287.1 | 29.2% |
| 2 | ck_tile::MoeFlatmmKernel (GemmSpatial #2) | 2,055 | 122.4 | 12.5% |
| 3 | aiter::mix_sample_outer_exponential | 4,259 | 49.2 | 10.4% |
| 4 | __amd_rocclr_copyBuffer | 36,817 | 5.1 | 9.4% |
| 5 | ck_tile::MoeFlatmmKernel (GemmSpatial #3) | 1,296 | 81.7 | 5.3% |
| 6 | _fused_qk_rope_reshape_and_cache | 2,700 | 32.5 | 4.4% |
| 7 | paged_attention | 2,340 | 31.1 | 3.6% |
| 8 | vectorized_elementwise (float8_copy) | 4,107 | 8.9 | 1.8% |
| 9 | hipBLASLt GEMM (MT160x128x128) | 792 | 39.5 | 1.6% |
| 10 | vectorized_elementwise (Fill) | 128 | 227.3 | 1.4% |
Kimi-K2.5-MXFP4
195K ops · 153 kernels
TP=4
Prefill
17.2K ops · 2039.0ms compute (excl. 11321ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | ck::moe_mxgemm_2lds (BPreshuffle #1) | 447 | 2141.8 | 47.0% |
| 2 | ck::moe_mxgemm_2lds (BPreshuffle #2) | 449 | 502.4 | 11.1% |
| 3 | elementwise_kernel (manual_unroll) | 364 | 354.1 | 6.3% |
| 4 | hipBLASLt GEMM (MT192x240x64) | 244 | 460.4 | 5.5% |
| 5 | hipBLASLt GEMM (MT256x256x64) | 671 | 153.5 | 5.1% |
| 6 | aiter::reduce_scatter<bf16, 4> | 303 | 271.2 | 4.0% |
| 7 | aiter::add_rmsnorm_quant | 609 | 122.6 | 3.7% |
| 8 | aiter::fmha_fwd (hd192_hd128_causal) | 396 | 104.0 | 2.0% |
| 9 | _fused_dynamic_mxfp4_quant_moe_sort | 1,323 | 30.7 | 1.2% |
| 10 | hipBLASLt GEMM (MT128x128x256) | 671 | 55.7 | 1.0% |
Decode
177.9K ops · 5704.4ms compute (excl. 84ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | ck::moe_mxgemm_2lds (BPreshuffle #1) | 4,113 | 271.6 | 19.6% |
| 2 | aiter::reduce_scatter<bf16, 4> | 7,505 | 134.6 | 17.7% |
| 3 | ck::moe_mxgemm_2lds (BPreshuffle #2) | 4,111 | 229.7 | 16.6% |
| 4 | __amd_rocclr_copyBuffer | 52,986 | 5.2 | 4.8% |
| 5 | kn_get_mla_metadata_v1_2 | 4,194 | 61.4 | 4.5% |
| 6 | hipBLASLt GEMM (MT256x16x128) | 2,361 | 92.9 | 3.8% |
| 7 | _fused_dynamic_mxfp4_quant_moe_sort | 7,262 | 28.2 | 3.6% |
| 8 | aiter::allgather_lastdim<bf16, 4> | 4,246 | 44.8 | 3.3% |
| 9 | aiter::mix_sample_outer_exponential | 4,246 | 42.6 | 3.2% |
| 10 | hipBLASLt GEMM (MT224x64x128) | 1,786 | 98.3 | 3.1% |
DeepSeek-R1 MXFP4-MTP
213K ops · 152 kernels
TP=4
Prefill
68.9K ops · 3746.2ms compute (excl. 2973ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | ck::moe_mxgemm_2lds (BPreshuffle #1) | 2,399 | 553.9 | 35.5% |
| 2 | ck::moe_mxgemm_2lds (BPreshuffle #2) | 2,398 | 231.3 | 14.8% |
| 3 | aiter::reduce_scatter<bf16, 4> | 4,244 | 126.5 | 14.3% |
| 4 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 7,091 | 46.3 | 8.8% |
| 5 | aiter::fmha_fwd (hd192_hd128_causal) | 2,347 | 52.6 | 3.3% |
| 6 | _fused_dynamic_mxfp4_quant_moe_sort | 4,031 | 25.1 | 2.7% |
| 7 | elementwise_kernel (manual_unroll) | 360 | 235.0 | 2.3% |
| 8 | aiter::add_rmsnorm_quant | 601 | 117.6 | 1.9% |
| 9 | CatArrayBatchedCopy | 2,346 | 28.8 | 1.8% |
| 10 | ck_tile::FlatmmKernel (GemmSpatial) | 268 | 201.4 | 1.4% |
Decode
143.6K ops · 3488.9ms compute (excl. 218ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | ck::moe_mxgemm_2lds (BPreshuffle #1) | 2,241 | 238.2 | 15.3% |
| 2 | aiter::reduce_scatter<bf16, 4> | 3,930 | 120.6 | 13.6% |
| 3 | ck::moe_mxgemm_2lds (BPreshuffle #2) | 2,242 | 176.0 | 11.3% |
| 4 | kn_get_mla_metadata_v1_2 | 4,196 | 84.3 | 10.1% |
| 5 | __amd_rocclr_copyBuffer | 50,055 | 5.2 | 7.4% |
| 6 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 6,268 | 28.8 | 5.2% |
| 7 | hipBLASLt GEMM (MT256x16x128) | 2,190 | 73.7 | 4.6% |
| 8 | hipBLASLt GEMM (MT192x64x128) | 1,784 | 85.0 | 4.3% |
| 9 | aiter::allgather_lastdim<bf16, 4> | 4,259 | 35.4 | 4.3% |
| 10 | aiter::mix_sample_outer_exponential | 4,259 | 34.4 | 4.2% |
GLM-5-FP8
368K ops · 156 kernels
TP=8
Prefill
261.7K ops · 8132.1ms compute (excl. 11102ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | aiter::reduce_scatter<bf16, 8> | 10,360 | 259.9 | 33.1% |
| 2 | aiter::fmoe (vs_silu_1tg_ps_32x256) | 5,775 | 408.0 | 29.0% |
| 3 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 33,955 | 26.2 | 10.9% |
| 4 | paged_attention | 5,479 | 76.7 | 5.2% |
| 5 | __amd_rocclr_copyBuffer | 72,478 | 4.8 | 4.2% |
| 6 | aiter::dynamic_per_group_scaled_quant | 34,911 | 7.6 | 3.3% |
| 7 | aiter::local_device_load_rmsnorm | 10,357 | 11.4 | 1.5% |
| 8 | ck_tile::QuantGemmKernel | 380 | 281.9 | 1.3% |
| 9 | elementwise_kernel (manual_unroll) | 1,089 | 86.0 | 1.2% |
| 10 | aiter::add_rmsnorm_quant | 760 | 102.6 | 1.0% |
Decode
106.2K ops · 1475.4ms compute (excl. 243ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | __amd_rocclr_copyBuffer | 52,568 | 5.1 | 18.1% |
| 2 | kn_get_mla_metadata_v1_2 | 4,193 | 61.7 | 17.5% |
| 3 | aiter::allgather_lastdim<bf16, 2> | 4,256 | 46.7 | 13.5% |
| 4 | aiter::mix_sample_outer_exponential | 4,256 | 41.2 | 11.9% |
| 5 | hipBLASLt GEMM (MT256x16x128) | 2,325 | 56.8 | 9.0% |
| 6 | hipBLASLt GEMM (MT128x64x128) | 1,764 | 61.8 | 7.4% |
| 7 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 3,422 | 17.3 | 4.0% |
| 8 | aiter::reduce_scatter<bf16, 8> | 248 | 220.3 | 3.7% |
| 9 | _masked_embedding_kernel | 4,262 | 6.1 | 1.8% |
| 10 | vectorized_elementwise (Fill) | 160 | 160.0 | 1.7% |
MiniMax-M2.5
1.87M ops · 76 kernels
TP=2
Prefill
1.9K ops · 146.3ms compute
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | elementwise_kernel (manual_unroll) | 248 | 440.4 | 74.6% |
| 2 | aiter::fmoe (vs_silu_32x384) | 62 | 305.5 | 12.9% |
| 3 | aiter::reduce_scatter<bf16, 2> | 124 | 50.1 | 4.2% |
| 4 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 123 | 18.5 | 1.6% |
| 5 | aiter::fmha_fwd (hd128_causal) | 62 | 28.0 | 1.2% |
| 6 | hipBLASLt GEMM (MT32x16x512) | 62 | 15.8 | 0.7% |
| 7 | aiter::dynamic_per_group_scaled_quant | 185 | 4.6 | 0.6% |
| 8 | aiter::local_device_load_rmsnorm | 124 | 5.9 | 0.5% |
| 9 | __amd_rocclr_copyBuffer | 124 | 4.7 | 0.4% |
| 10 | ck_tile::MoeSortingKernel | 62 | 7.2 | 0.3% |
Decode
1.87M ops · 27920.3ms compute (excl. 460ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
|---|---|---|---|---|
| 1 | triton_red (RMSNorm reduce) | 151,489 | 35.3 | 19.2% |
| 2 | triton_poi (dtype cast) | 76,334 | 61.8 | 16.9% |
| 3 | vectorized_elementwise (Fill) | 1,065,237 | 4.3 | 16.3% |
| 4 | triton_poi (residual+norm) | 75,169 | 43.6 | 11.7% |
| 5 | aiter::fmoe (vs_silu_32x384) | 4,836 | 476.2 | 8.2% |
| 6 | triton_red (RMSNorm reduce #2) | 138,684 | 7.5 | 3.7% |
| 7 | aiter::reduce_scatter<bf16, 2> | 8,556 | 101.8 | 3.1% |
| 8 | triton_poi (residual+norm #2) | 75,084 | 9.5 | 2.6% |
| 9 | aiter::allgather_lastdim<bf16, 2> | 4,260 | 130.7 | 2.0% |
| 10 | aiter::cross_device_reduce_1stage | 62 | 6134.6 | 1.4% |
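
The Compute % column in the hotspot tables is consistent with each kernel's total time (calls × average duration) divided by the phase's compute time, which excludes NCCL. A minimal sketch of that arithmetic, assuming this definition. The helper name `compute_share_pct` is hypothetical.

```python
def compute_share_pct(calls: int, avg_us: float, phase_compute_ms: float) -> float:
    """Share of a phase's compute time spent in one kernel, in percent."""
    if phase_compute_ms <= 0:
        raise ValueError("phase compute time must be positive")
    kernel_total_ms = calls * avg_us / 1000.0  # convert microseconds to milliseconds
    return kernel_total_ms / phase_compute_ms * 100.0


# DeepSeek-R1-0528 FP8 prefill, top kernel: 290 calls at 2585.7 us, 1765.8 ms compute.
print(round(compute_share_pct(290, 2585.7, 1765.8), 1))
```

For this row the result rounds to 42.5, the value shown in the table.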

RTL_MODE Comparison (DeepSeek-R1 TP=8)

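| Mode | Ops/rank | Kernels | GPU timing | Overhead | Perfetto |
|---|---|---|---|---|---|
| lite (default) | 211K | 108 | Yes (partial) | ~0% | trace |
| default | 260K | 109 | Yes | ~2–4% | trace |
| full | 5.49M* | 112* | Yes (all) | ~2–5% | trace* |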

* full mode requires ROCm 7.13+ with ROCR fix 559d48b1.

Version Regression (DeepSeek-R1 TP=8, conc=64)

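| Version | Throughput (tok/s) | Overhead | Perfetto |
|---|---|---|---|
| Baseline | 2804 | | |
| v0.1.0 | 2799 | -0.2% | trace |
| v0.1.1 | 2754 | -1.8% | trace |
| v0.2.0 | 2729 | -2.7% | trace |
| v0.3.0 | 2656 | -5.3% | trace |
| main (v0.3.1) | 2699 | -3.7% | trace |
| v0.3.2 (lite) | 2916 | -0.3% | |
| v0.3.3 (lite) | 2285 | -0.0% | |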

Tutorial — Profiling Prefill vs Decode with roctx Markers

RTL includes a built-in roctx shim. Add markers to your code to analyze GPU kernel hotspots per inference phase.

Step 1: Add roctx markers

import ctypes

import torch

# RTL injects the roctx shim via LD_PRELOAD, so its symbols are visible
# in the process's global namespace.
lib = ctypes.CDLL(None)
roctx_push = lib.roctxRangePushA
roctx_push.argtypes = [ctypes.c_char_p]
roctx_push.restype = ctypes.c_int
roctx_pop = lib.roctxRangePop
roctx_pop.argtypes = []
roctx_pop.restype = ctypes.c_int

# Prefill
roctx_push(b"prefill")
output = model.prefill(input_tokens)
torch.cuda.synchronize()
roctx_pop()

# Decode
roctx_push(b"decode")
for step in range(max_tokens):
    token = model.decode_step()
torch.cuda.synchronize()
roctx_pop()
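
If you add markers in several places, the push/pop pair can be wrapped in a context manager so a range is always closed, even when the wrapped code raises. The following is a sketch, not part of RTL: `roctx_range` and `_load_roctx` are hypothetical helper names, and the fallback to a no-op when the roctx symbols are not loaded (for example, when the script runs without `rtl trace`) is a design choice of this example.

```python
import contextlib
import ctypes


def _load_roctx():
    """Return (push, pop) bound to the roctx symbols, or None if they are absent."""
    try:
        lib = ctypes.CDLL(None)  # look up symbols already loaded into this process
        push, pop = lib.roctxRangePushA, lib.roctxRangePop
    except (OSError, AttributeError, TypeError):
        return None  # roctx not preloaded, or platform without a global namespace
    push.argtypes = [ctypes.c_char_p]
    push.restype = ctypes.c_int
    pop.argtypes = []
    pop.restype = ctypes.c_int
    return push, pop


_ROCTX = _load_roctx()


@contextlib.contextmanager
def roctx_range(name: str):
    """Wrap a block in a roctx range; does nothing when roctx is not loaded."""
    if _ROCTX is None:
        yield
        return
    push, pop = _ROCTX
    push(name.encode("utf-8"))
    try:
        yield
    finally:
        pop()  # always close the range, even if the body raised


with roctx_range("prefill"):
    pass  # run the prefill work and synchronize the GPU here
```

Keep the GPU synchronization inside the `with` block, as in the snippet above, so the kernels launched in that phase finish before the range closes.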

Step 2: Run with RTL

rtl trace -o trace.db python3 my_model.py

That's it. rtl trace auto-sets LD_PRELOAD so roctx markers just work.

Step 3: Analyze

rtl summary trace.db

Or query by roctx region in SQLite:

SELECT s.string AS kernel, COUNT(*) AS calls,
       ROUND(AVG(o.end - o.start) / 1e3, 1) AS avg_us
FROM rocpd_op o JOIN rocpd_string s ON o.description_id = s.id
WHERE o.gpuId >= 0
  AND o.start BETWEEN <prefill_start> AND <prefill_end>
GROUP BY s.string ORDER BY SUM(o.end - o.start) DESC LIMIT 10;
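
The same query can be run from Python with the standard `sqlite3` module, passing the region bounds as parameters instead of editing the SQL by hand. This is a sketch under assumptions: the function name `top_kernels` is hypothetical, timestamps are assumed to be in nanoseconds (the query divides by 1e3 to get microseconds), and you supply the start and end of the roctx region yourself. The `end` column is quoted because it is also an SQL keyword.

```python
import sqlite3

TOP_KERNELS_SQL = """
SELECT s.string AS kernel, COUNT(*) AS calls,
       ROUND(AVG(o."end" - o.start) / 1e3, 1) AS avg_us
FROM rocpd_op o JOIN rocpd_string s ON o.description_id = s.id
WHERE o.gpuId >= 0
  AND o.start BETWEEN ? AND ?
GROUP BY s.string ORDER BY SUM(o."end" - o.start) DESC LIMIT ?;
"""


def top_kernels(conn: sqlite3.Connection, start_ns: int, end_ns: int, limit: int = 10):
    """Return (kernel, calls, avg_us) rows for ops that start inside [start_ns, end_ns]."""
    return conn.execute(TOP_KERNELS_SQL, (start_ns, end_ns, limit)).fetchall()


def print_top_kernels(db_path: str, start_ns: int, end_ns: int) -> None:
    """Open a trace database and print the hottest kernels in the given window."""
    conn = sqlite3.connect(db_path)
    try:
        for kernel, calls, avg_us in top_kernels(conn, start_ns, end_ns):
            print(f"{kernel:<50} calls={calls:<8} avg={avg_us}us")
    finally:
        conn.close()
```

For example, `print_top_kernels("trace.db", prefill_start, prefill_end)` prints the prefill hotspots once you know the region's timestamps.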

Example output

roctx markers:
  prefill: 24.3ms
  decode:  54.9ms

=== prefill (74 ops, 0.7ms GPU time) ===
  Cijk_Ailk_Bljk (MT64x64x256)    calls=12   avg=16.1us  27.7%  # Attention GEMM (large batch)
  Cijk_Ailk_Bljk (MT64x128x128)   calls=12   avg=13.2us  22.7%  # FFN up-project
  Cijk_Ailk_Bljk (MT64x16x512)    calls=13   avg=12.0us  22.5%  # FFN down-project
  vectorized_elementwise (gelu)    calls=12   avg=5.7us    9.9%  # Activation

=== decode (6336 ops, 36.9ms GPU time) ===
  Cijk_Ailk_Bljk (MT256x16x384)   calls=1600 avg=8.0us   34.6%  # Skinny GEMM (batch=1)
  ScaleAlphaVec_PostGSU8_VW1       calls=1600 avg=4.5us   19.3%  # Post-GEMM scale
  Cijk_Ailk_Bljk (MT64x16x128)    calls=768  avg=9.1us   18.9%  # FFN GEMM

How to view Perfetto traces

  1. Download a .json.gz trace file from the tables above
  2. Open ui.perfetto.dev
  3. Ctrl+O → select file — roctx regions appear as labeled spans on the timeline
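
For a quick look without the UI, an exported trace can also be inspected from Python. This sketch assumes the export uses the Chrome Trace Event JSON layout that ui.perfetto.dev accepts (either a top-level `traceEvents` list or a bare list of events, with `ph == "X"` marking complete events that carry a `dur` field). The function names are hypothetical, and the exact fields RTL writes may differ.

```python
import gzip
import json
from collections import Counter


def load_trace_events(path):
    """Load events from a .json or .json.gz trace file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    # The JSON trace format allows an object with "traceEvents" or a bare list.
    return data.get("traceEvents", []) if isinstance(data, dict) else data


def busiest_events(events, top=5):
    """Sum the durations of complete ('X') events by name; return the largest totals."""
    totals = Counter()
    for ev in events:
        if ev.get("ph") == "X":
            totals[ev.get("name", "?")] += ev.get("dur", 0)
    return totals.most_common(top)
```

Calling `busiest_events(load_trace_events("trace.json.gz"))` returns `(name, total_duration)` pairs, with durations in the trace's own time unit.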