Overhead Validation — v0.3.3 (lite mode, MI355X, Apr 2026)
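All models were measured at conc=1 and conc=64. The Overhead column is the relative throughput change from Baseline to RTL lite, so a negative value means the RTL lite run was slower. Symbol audit: 0 unresolved kernel symbols across all traces.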
Standard Benchmark (ISL=1024, OSL=1024)
| Model | TP | Conc | Baseline (tok/s) | RTL lite (tok/s) | Overhead |
| DeepSeek-R1-FP8 | 8 | 1 | 97.3 | 96.9 | -0.4% |
| DeepSeek-R1-FP8 | 8 | 64 | 2285.9 | 2285.0 | -0.0% |
| DeepSeek-R1-MXFP4 | 4 | 1 | 130.0 | 129.1 | -0.7% |
| DeepSeek-R1-MXFP4 | 4 | 64 | 2614.6 | 2634.8 | +0.8% |
| GLM-5-FP8 | 8 | 1 | 58.5 | 58.5 | +0.0% |
| GLM-5-FP8 | 8 | 64 | 1925.9 | 1934.9 | +0.5% |
| MiniMax-M2.5 | 2 | 1 | 114.9 | 115.5 | +0.5% |
| MiniMax-M2.5 | 2 | 64 | 2374.8 | 2374.8 | +0.0% |
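The Overhead values in the table above are consistent with a plain relative change in throughput. A minimal sketch of that arithmetic, assuming this is how the column was derived (the function name `throughput_delta_pct` is illustrative and not part of RTL):

```python
def throughput_delta_pct(baseline_tok_s: float, rtl_tok_s: float) -> float:
    """Relative throughput change in percent; negative means the RTL run was slower."""
    return (rtl_tok_s - baseline_tok_s) / baseline_tok_s * 100.0


# DeepSeek-R1-FP8, TP=8, conc=1 (row values copied from the table above)
print(f"{throughput_delta_pct(97.3, 96.9):+.1f}%")
# DeepSeek-R1-MXFP4, TP=4, conc=64
print(f"{throughput_delta_pct(2614.6, 2634.8):+.1f}%")
```

Deltas of this size, in both directions, are what run-to-run noise would look like, which is why the sign varies from row to row.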
Decode-Focused Benchmark (ISL=128, OSL=2048)
Each run uses more prompts (conc × 15, minimum 30) to give higher statistical confidence in the decode throughput.
| Model | TP | Conc | Baseline (tok/s) | RTL lite (tok/s) | Overhead |
| DeepSeek-R1-FP8 | 8 | 1 | 100.9 | 100.7 | -0.2% |
| DeepSeek-R1-FP8 | 8 | 64 | 3145.1 | 3142.3 | -0.1% |
| DeepSeek-R1-MXFP4 | 4 | 1 | 131.2 | 131.3 | +0.1% |
| DeepSeek-R1-MXFP4 | 4 | 64 | 2900.0 | 2886.9 | -0.5% |
| GLM-5-FP8 | 8 | 1 | 59.0 | 58.9 | -0.2% |
| GLM-5-FP8 | 8 | 64 | 2053.6 | 2057.0 | +0.2% |
| MiniMax-M2.5 | 2 | 1 | 117.0 | 117.0 | +0.0% |
| MiniMax-M2.5 | 2 | 64 | 2643.0 | 2648.0 | +0.2% |
Top 10 Kernels per Model — Prefill vs Decode
Kernel hotspots broken down by inference phase. Classification: MoE kernels tagged _ps_ = prefill, paged_attention and sampling kernels = decode; NCCL collectives are reported separately. Data source: rank 0 trace, conc=64, ISL/OSL=1k/1k. See Hot Trace Viewer for interactive visualization.
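The name-based rule stated above can be sketched as a small classifier. This is only an illustration of that rule: `classify_kernel` is a hypothetical helper, the substring checks are assumptions, and kernels that match none of the patterns are returned as unclassified because this section does not say how RTL assigns them.

```python
def classify_kernel(name: str) -> str:
    """Heuristic phase label derived from the kernel name alone."""
    lowered = name.lower()
    if "nccl" in lowered:
        return "nccl"  # collectives are reported separately
    if "moe" in lowered and "_ps_" in lowered:
        return "prefill"  # MoE kernel carrying the _ps_ tag
    if "paged_attention" in lowered or "sample" in lowered:
        return "decode"  # paged attention and sampling kernels
    return "unclassified"


for kernel in (
    "aiter::fmoe (novs_silu_1tg_ps_32x256)",
    "aiter::mix_sample_outer_exponential",
    "ck::gemm_xdl_v3 (blockscale_preshuffle)",
):
    print(kernel, "->", classify_kernel(kernel))
```

Kernels such as the GEMMs appear in both phases in the tables below, so a name-only rule cannot label them; the roctx markers described in the tutorial are the more direct way to attribute them.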
DeepSeek-R1-0528 FP8
260K ops · 109 kernels
TP=8
Prefill
73.5K ops · 1765.8ms compute (excl. 7809ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | aiter::fmoe (novs_silu_1tg_ps_32x256) | 290 | 2585.7 | 42.5% |
| 2 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 1,691 | 155.1 | 14.8% |
| 3 | __amd_rocclr_copyBuffer | 54,153 | 4.5 | 13.7% |
| 4 | elementwise_kernel (manual_unroll) | 488 | 164.5 | 4.5% |
| 5 | aiter::add_rmsnorm_quant | 610 | 122.7 | 4.2% |
| 6 | aiter::dynamic_per_group_scaled_quant | 1,729 | 28.4 | 2.8% |
| 7 | kn_get_mla_metadata_v1_2 | 415 | 96.8 | 2.3% |
| 8 | aiter::fmha_fwd (hd192_hd128_causal) | 243 | 146.5 | 2.0% |
| 9 | _gemm_a16_w16 (64x128x128) | 416 | 77.4 | 1.8% |
| 10 | hipBLASLt GEMM (MT128x128x256) | 232 | 91.0 | 1.2% |
Decode
186.5K ops · 4628.5ms compute (excl. 71ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | aiter::fmoe (novs_silu_1tg_32x256) | 3,538 | 350.4 | 26.8% |
| 2 | aiter::reduce_scatter<bf16, 8> | 8,906 | 104.1 | 20.0% |
| 3 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 19,272 | 24.5 | 10.2% |
| 4 | _gemm_a16_w16 (64x128x128) | 3,817 | 75.3 | 6.2% |
| 5 | __amd_rocclr_copyBuffer | 51,124 | 5.3 | 5.9% |
| 6 | aiter::fmoe (vs_silu_1tg_32x256) | 812 | 313.7 | 5.5% |
| 7 | kn_get_mla_metadata_v1_2 | 3,750 | 59.0 | 4.8% |
| 8 | aiter::mix_sample_outer_exponential | 3,817 | 35.1 | 2.9% |
| 9 | aiter::fmha_fwd (hd192_hd128_causal) | 4,454 | 24.9 | 2.4% |
| 10 | aiter::dynamic_per_group_scaled_quant | 19,626 | 5.6 | 2.4% |
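The Compute % column can be reproduced, within rounding of the averaged durations, as calls × average duration divided by the phase's compute time. A minimal sketch under that assumption, using two rows from the DeepSeek-R1-0528 FP8 prefill table (1765.8 ms compute); `compute_share_pct` is an illustrative name:

```python
def compute_share_pct(calls: int, avg_us: float, phase_compute_ms: float) -> float:
    """Share of a phase's compute time spent in one kernel, in percent."""
    kernel_total_ms = calls * avg_us / 1000.0  # microseconds -> milliseconds
    return kernel_total_ms / phase_compute_ms * 100.0


# aiter::fmoe (novs_silu_1tg_ps_32x256): 290 calls, 2585.7 us average
print(round(compute_share_pct(290, 2585.7, 1765.8), 1))
# aiter::add_rmsnorm_quant: 610 calls, 122.7 us average
print(round(compute_share_pct(610, 122.7, 1765.8), 1))
```

Rows with very short averages (for example `__amd_rocclr_copyBuffer` at 4.5 μs) can differ from the table by about 0.1 percentage points, because the published average is itself rounded.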
gpt-oss-120b
95K ops · 113 kernels
TP=1
Prefill
3.0K ops · 718.8ms compute
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | ck_tile::MoeFlatmmKernel (GemmSpatial #1) | 108 | 2311.9 | 34.7% |
| 2 | elementwise_kernel (manual_unroll) | 432 | 488.3 | 29.3% |
| 3 | ck_tile::MoeFlatmmKernel (GemmSpatial #2) | 105 | 785.6 | 11.5% |
| 4 | hipBLASLt GEMM (MT256x256x64) | 108 | 305.0 | 4.6% |
| 5 | hipBLASLt GEMM (MT240x256x64) | 108 | 290.7 | 4.4% |
| 6 | _fused_qk_rope_reshape_and_cache | 72 | 410.0 | 4.1% |
| 7 | __amd_rocclr_copyBuffer | 372 | 48.8 | 2.5% |
| 8 | paged_attention | 72 | 236.0 | 2.4% |
| 9 | vectorized_elementwise (float8_copy) | 213 | 57.2 | 1.7% |
| 10 | vectorized_elementwise (Fill) | 144 | 69.7 | 1.4% |
Decode
92.0K ops · 2014.7ms compute
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | ck_tile::MoeFlatmmKernel (GemmSpatial #1) | 2,052 | 287.1 | 29.2% |
| 2 | ck_tile::MoeFlatmmKernel (GemmSpatial #2) | 2,055 | 122.4 | 12.5% |
| 3 | aiter::mix_sample_outer_exponential | 4,259 | 49.2 | 10.4% |
| 4 | __amd_rocclr_copyBuffer | 36,817 | 5.1 | 9.4% |
| 5 | ck_tile::MoeFlatmmKernel (GemmSpatial #3) | 1,296 | 81.7 | 5.3% |
| 6 | _fused_qk_rope_reshape_and_cache | 2,700 | 32.5 | 4.4% |
| 7 | paged_attention | 2,340 | 31.1 | 3.6% |
| 8 | vectorized_elementwise (float8_copy) | 4,107 | 8.9 | 1.8% |
| 9 | hipBLASLt GEMM (MT160x128x128) | 792 | 39.5 | 1.6% |
| 10 | vectorized_elementwise (Fill) | 128 | 227.3 | 1.4% |
Kimi-K2.5-MXFP4
195K ops · 153 kernels
TP=4
Prefill
17.2K ops · 2039.0ms compute (excl. 11321ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | ck::moe_mxgemm_2lds (BPreshuffle #1) | 447 | 2141.8 | 47.0% |
| 2 | ck::moe_mxgemm_2lds (BPreshuffle #2) | 449 | 502.4 | 11.1% |
| 3 | elementwise_kernel (manual_unroll) | 364 | 354.1 | 6.3% |
| 4 | hipBLASLt GEMM (MT192x240x64) | 244 | 460.4 | 5.5% |
| 5 | hipBLASLt GEMM (MT256x256x64) | 671 | 153.5 | 5.1% |
| 6 | aiter::reduce_scatter<bf16, 4> | 303 | 271.2 | 4.0% |
| 7 | aiter::add_rmsnorm_quant | 609 | 122.6 | 3.7% |
| 8 | aiter::fmha_fwd (hd192_hd128_causal) | 396 | 104.0 | 2.0% |
| 9 | _fused_dynamic_mxfp4_quant_moe_sort | 1,323 | 30.7 | 1.2% |
| 10 | hipBLASLt GEMM (MT128x128x256) | 671 | 55.7 | 1.0% |
Decode
177.9K ops · 5704.4ms compute (excl. 84ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | ck::moe_mxgemm_2lds (BPreshuffle #1) | 4,113 | 271.6 | 19.6% |
| 2 | aiter::reduce_scatter<bf16, 4> | 7,505 | 134.6 | 17.7% |
| 3 | ck::moe_mxgemm_2lds (BPreshuffle #2) | 4,111 | 229.7 | 16.6% |
| 4 | __amd_rocclr_copyBuffer | 52,986 | 5.2 | 4.8% |
| 5 | kn_get_mla_metadata_v1_2 | 4,194 | 61.4 | 4.5% |
| 6 | hipBLASLt GEMM (MT256x16x128) | 2,361 | 92.9 | 3.8% |
| 7 | _fused_dynamic_mxfp4_quant_moe_sort | 7,262 | 28.2 | 3.6% |
| 8 | aiter::allgather_lastdim<bf16, 4> | 4,246 | 44.8 | 3.3% |
| 9 | aiter::mix_sample_outer_exponential | 4,246 | 42.6 | 3.2% |
| 10 | hipBLASLt GEMM (MT224x64x128) | 1,786 | 98.3 | 3.1% |
DeepSeek-R1 MXFP4-MTP
213K ops · 152 kernels
TP=4
Prefill
68.9K ops · 3746.2ms compute (excl. 2973ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | ck::moe_mxgemm_2lds (BPreshuffle #1) | 2,399 | 553.9 | 35.5% |
| 2 | ck::moe_mxgemm_2lds (BPreshuffle #2) | 2,398 | 231.3 | 14.8% |
| 3 | aiter::reduce_scatter<bf16, 4> | 4,244 | 126.5 | 14.3% |
| 4 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 7,091 | 46.3 | 8.8% |
| 5 | aiter::fmha_fwd (hd192_hd128_causal) | 2,347 | 52.6 | 3.3% |
| 6 | _fused_dynamic_mxfp4_quant_moe_sort | 4,031 | 25.1 | 2.7% |
| 7 | elementwise_kernel (manual_unroll) | 360 | 235.0 | 2.3% |
| 8 | aiter::add_rmsnorm_quant | 601 | 117.6 | 1.9% |
| 9 | CatArrayBatchedCopy | 2,346 | 28.8 | 1.8% |
| 10 | ck_tile::FlatmmKernel (GemmSpatial) | 268 | 201.4 | 1.4% |
Decode
143.6K ops · 3488.9ms compute (excl. 218ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | ck::moe_mxgemm_2lds (BPreshuffle #1) | 2,241 | 238.2 | 15.3% |
| 2 | aiter::reduce_scatter<bf16, 4> | 3,930 | 120.6 | 13.6% |
| 3 | ck::moe_mxgemm_2lds (BPreshuffle #2) | 2,242 | 176.0 | 11.3% |
| 4 | kn_get_mla_metadata_v1_2 | 4,196 | 84.3 | 10.1% |
| 5 | __amd_rocclr_copyBuffer | 50,055 | 5.2 | 7.4% |
| 6 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 6,268 | 28.8 | 5.2% |
| 7 | hipBLASLt GEMM (MT256x16x128) | 2,190 | 73.7 | 4.6% |
| 8 | hipBLASLt GEMM (MT192x64x128) | 1,784 | 85.0 | 4.3% |
| 9 | aiter::allgather_lastdim<bf16, 4> | 4,259 | 35.4 | 4.3% |
| 10 | aiter::mix_sample_outer_exponential | 4,259 | 34.4 | 4.2% |
GLM-5-FP8
368K ops · 156 kernels
TP=8
Prefill
261.7K ops · 8132.1ms compute (excl. 11102ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | aiter::reduce_scatter<bf16, 8> | 10,360 | 259.9 | 33.1% |
| 2 | aiter::fmoe (vs_silu_1tg_ps_32x256) | 5,775 | 408.0 | 29.0% |
| 3 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 33,955 | 26.2 | 10.9% |
| 4 | paged_attention | 5,479 | 76.7 | 5.2% |
| 5 | __amd_rocclr_copyBuffer | 72,478 | 4.8 | 4.2% |
| 6 | aiter::dynamic_per_group_scaled_quant | 34,911 | 7.6 | 3.3% |
| 7 | aiter::local_device_load_rmsnorm | 10,357 | 11.4 | 1.5% |
| 8 | ck_tile::QuantGemmKernel | 380 | 281.9 | 1.3% |
| 9 | elementwise_kernel (manual_unroll) | 1,089 | 86.0 | 1.2% |
| 10 | aiter::add_rmsnorm_quant | 760 | 102.6 | 1.0% |
Decode
106.2K ops · 1475.4ms compute (excl. 243ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | __amd_rocclr_copyBuffer | 52,568 | 5.1 | 18.1% |
| 2 | kn_get_mla_metadata_v1_2 | 4,193 | 61.7 | 17.5% |
| 3 | aiter::allgather_lastdim<bf16, 2> | 4,256 | 46.7 | 13.5% |
| 4 | aiter::mix_sample_outer_exponential | 4,256 | 41.2 | 11.9% |
| 5 | hipBLASLt GEMM (MT256x16x128) | 2,325 | 56.8 | 9.0% |
| 6 | hipBLASLt GEMM (MT128x64x128) | 1,764 | 61.8 | 7.4% |
| 7 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 3,422 | 17.3 | 4.0% |
| 8 | aiter::reduce_scatter<bf16, 8> | 248 | 220.3 | 3.7% |
| 9 | _masked_embedding_kernel | 4,262 | 6.1 | 1.8% |
| 10 | vectorized_elementwise (Fill) | 160 | 160.0 | 1.7% |
MiniMax-M2.5
1.87M ops · 76 kernels
TP=2
Prefill
1.9K ops · 146.3ms compute
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | elementwise_kernel (manual_unroll) | 248 | 440.4 | 74.6% |
| 2 | aiter::fmoe (vs_silu_32x384) | 62 | 305.5 | 12.9% |
| 3 | aiter::reduce_scatter<bf16, 2> | 124 | 50.1 | 4.2% |
| 4 | ck::gemm_xdl_v3 (blockscale_preshuffle) | 123 | 18.5 | 1.6% |
| 5 | aiter::fmha_fwd (hd128_causal) | 62 | 28.0 | 1.2% |
| 6 | hipBLASLt GEMM (MT32x16x512) | 62 | 15.8 | 0.7% |
| 7 | aiter::dynamic_per_group_scaled_quant | 185 | 4.6 | 0.6% |
| 8 | aiter::local_device_load_rmsnorm | 124 | 5.9 | 0.5% |
| 9 | __amd_rocclr_copyBuffer | 124 | 4.7 | 0.4% |
| 10 | ck_tile::MoeSortingKernel | 62 | 7.2 | 0.3% |
Decode
1.87M ops · 27920.3ms compute (excl. 460ms NCCL)
| # | Kernel | Calls | Avg (μs) | Compute % |
| 1 | triton_red (RMSNorm reduce) | 151,489 | 35.3 | 19.2% |
| 2 | triton_poi (dtype cast) | 76,334 | 61.8 | 16.9% |
| 3 | vectorized_elementwise (Fill) | 1,065,237 | 4.3 | 16.3% |
| 4 | triton_poi (residual+norm) | 75,169 | 43.6 | 11.7% |
| 5 | aiter::fmoe (vs_silu_32x384) | 4,836 | 476.2 | 8.2% |
| 6 | triton_red (RMSNorm reduce #2) | 138,684 | 7.5 | 3.7% |
| 7 | aiter::reduce_scatter<bf16, 2> | 8,556 | 101.8 | 3.1% |
| 8 | triton_poi (residual+norm #2) | 75,084 | 9.5 | 2.6% |
| 9 | aiter::allgather_lastdim<bf16, 2> | 4,260 | 130.7 | 2.0% |
| 10 | aiter::cross_device_reduce_1stage | 62 | 6134.6 | 1.4% |
Tutorial — Profiling Prefill vs Decode with roctx Markers
RTL includes a built-in roctx shim. Add markers to your code to analyze GPU kernel hotspots per inference phase.
Step 1: Add roctx markers
import ctypes
import torch
# model, input_tokens and max_tokens come from your own inference code
lib = ctypes.CDLL(None)  # look up symbols in the current process; RTL injects the roctx shim via LD_PRELOAD
roctx_push = lib.roctxRangePushA
roctx_push.argtypes = [ctypes.c_char_p]
roctx_push.restype = ctypes.c_int
roctx_pop = lib.roctxRangePop
roctx_pop.restype = ctypes.c_int
# Prefill
roctx_push(b"prefill")
output = model.prefill(input_tokens)
torch.cuda.synchronize()  # wait for queued GPU work so it falls inside the range
roctx_pop()
# Decode
roctx_push(b"decode")
for step in range(max_tokens):
    token = model.decode_step()
torch.cuda.synchronize()  # wait for all decode steps before closing the range
roctx_pop()
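Because every push needs a matching pop, it can help to wrap the pair in a context manager so that the range is closed even if the model code raises. The sketch below is one possible wrapper, not part of RTL: `roctx_range` is a hypothetical helper that takes the push and pop callables as arguments, and the demo uses stand-in functions so that it runs without a GPU.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple


@contextmanager
def roctx_range(name: str,
                push: Callable[[bytes], int],
                pop: Callable[[], int]) -> Iterator[None]:
    """Open a named range on entry and always close it on exit."""
    push(name.encode("utf-8"))
    try:
        yield
    finally:
        pop()  # runs even if the body raises


# Stand-ins for lib.roctxRangePushA / lib.roctxRangePop, for demonstration only.
events: List[Tuple[str, bytes]] = []

def fake_push(msg: bytes) -> int:
    events.append(("push", msg))
    return 0

def fake_pop() -> int:
    events.append(("pop", b""))
    return 0

with roctx_range("prefill", fake_push, fake_pop):
    pass  # model.prefill(...) and torch.cuda.synchronize() would go here

print(events)
```

In real use you would pass the ctypes functions from Step 1, for example `with roctx_range("decode", roctx_push, roctx_pop): ...`, and keep the synchronize call inside the `with` body.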
Step 2: Run with RTL
rtl trace -o trace.db python3 my_model.py
That's it. rtl trace auto-sets LD_PRELOAD so roctx markers just work.
Step 3: Analyze
rtl summary trace.db
Or query by roctx region in SQLite:
SELECT s.string AS kernel, COUNT(*) AS calls,
       ROUND(AVG(o.end - o.start) / 1e3, 1) AS avg_us
FROM rocpd_op o JOIN rocpd_string s ON o.description_id = s.id
WHERE o.gpuId >= 0
  AND o.start BETWEEN <prefill_start> AND <prefill_end>
GROUP BY s.string ORDER BY SUM(o.end - o.start) DESC LIMIT 10;
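The same query can be run from Python with the standard `sqlite3` module, passing the region bounds as parameters. This is a sketch under stated assumptions: it uses only the tables and columns that appear in the query above, it quotes `"end"` defensively because END is an SQL keyword, and it builds a tiny in-memory database with made-up rows so that it runs without a real trace. `top_kernels` is an illustrative name.

```python
import sqlite3

QUERY = """
SELECT s.string AS kernel, COUNT(*) AS calls,
       ROUND(AVG(o."end" - o.start) / 1e3, 1) AS avg_us
FROM rocpd_op o JOIN rocpd_string s ON o.description_id = s.id
WHERE o.gpuId >= 0
  AND o.start BETWEEN ? AND ?
GROUP BY s.string ORDER BY SUM(o."end" - o.start) DESC LIMIT 10;
"""


def top_kernels(conn: sqlite3.Connection, start_ns: int, end_ns: int):
    """Top kernels by total GPU time among ops starting inside [start_ns, end_ns]."""
    return conn.execute(QUERY, (start_ns, end_ns)).fetchall()


# Toy database with invented rows, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rocpd_string (id INTEGER PRIMARY KEY, string TEXT);
CREATE TABLE rocpd_op (id INTEGER PRIMARY KEY, gpuId INTEGER,
                       description_id INTEGER, start INTEGER, "end" INTEGER);
INSERT INTO rocpd_string VALUES (1, 'kernelA'), (2, 'kernelB');
INSERT INTO rocpd_op (gpuId, description_id, start, "end") VALUES
  (0, 1, 100, 1100), (0, 1, 200, 2200), (0, 2, 300, 800), (0, 1, 10000, 11000);
""")

rows = top_kernels(conn, 0, 5000)
print(rows)
```

Against a real trace you would replace the in-memory connection with `sqlite3.connect("trace.db")` and pass the start and end timestamps of the roctx region in the same unit that the trace stores them in.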
Example output
roctx markers:
prefill: 24.3ms
decode: 54.9ms
=== prefill (74 ops, 0.7ms GPU time) ===
Cijk_Ailk_Bljk (MT64x64x256) calls=12 avg=16.1us 27.7% # Attention GEMM (large batch)
Cijk_Ailk_Bljk (MT64x128x128) calls=12 avg=13.2us 22.7% # FFN up-project
Cijk_Ailk_Bljk (MT64x16x512) calls=13 avg=12.0us 22.5% # FFN down-project
vectorized_elementwise (gelu) calls=12 avg=5.7us 9.9% # Activation
=== decode (6336 ops, 36.9ms GPU time) ===
Cijk_Ailk_Bljk (MT256x16x384) calls=1600 avg=8.0us 34.6% # Skinny GEMM (batch=1)
ScaleAlphaVec_PostGSU8_VW1 calls=1600 avg=4.5us 19.3% # Post-GEMM scale
Cijk_Ailk_Bljk (MT64x16x128) calls=768 avg=9.1us 18.9% # FFN GEMM
How to view Perfetto traces
- Download a .json.gz trace file from the tables above
- Open ui.perfetto.dev
- Ctrl+O → select file — roctx regions appear as labeled spans on the timeline
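If you want to check which spans a trace contains before opening it in the browser, a downloaded file can also be inspected from Python. The sketch below assumes the .json.gz file holds Chrome Trace Event JSON (either a top-level list or an object with a `traceEvents` list), which is one of the formats Perfetto opens; whether RTL's export uses exactly this layout is an assumption. The helper names are illustrative, and the demo writes its own small file rather than reading a real trace.

```python
import gzip
import json
import os
import tempfile


def load_trace_events(path: str) -> list:
    """Read a gzip-compressed JSON trace and return its list of events."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    return data["traceEvents"] if isinstance(data, dict) else data


def span_names(events: list) -> list:
    """Names of complete ('X') or begin ('B') events, de-duplicated and sorted."""
    return sorted({e["name"] for e in events
                   if e.get("ph") in ("X", "B") and "name" in e})


# Demo with a tiny synthetic trace (invented events, not real RTL output).
sample = {"traceEvents": [
    {"name": "prefill", "ph": "X", "ts": 0, "dur": 24300, "pid": 1, "tid": 1},
    {"name": "decode", "ph": "X", "ts": 24300, "dur": 54900, "pid": 1, "tid": 1},
]}
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as fh:
    json.dump(sample, fh)

names = span_names(load_trace_events(tmp_path))
print(names)
```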