rocm-trace-lite (RTL) — Zero-Overhead GPU Profiler

Overhead Validation — v0.3.7 (lite mode, MI355X, Apr 2026)

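All models were measured at conc=1 and conc=64. Overhead is the throughput change relative to baseline, so negative values mean the RTL run was slower. Symbol audit: 0 unresolved kernel symbols across all traces.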

Standard Benchmark (ISL=1024, OSL=1024)

Model	TP	Conc	Baseline (tok/s)	RTL lite (tok/s)	Overhead
DeepSeek-R1-FP8	8	1	97.3	96.9	-0.4%
DeepSeek-R1-FP8	8	64	2285.9	2285.0	-0.0%
DeepSeek-R1-MXFP4	4	1	130.0	129.1	-0.7%
DeepSeek-R1-MXFP4	4	64	2614.6	2634.8	+0.8%
GLM-5-FP8	8	1	58.5	58.5	+0.0%
GLM-5-FP8	8	64	1925.9	1934.9	+0.5%
MiniMax-M2.5	2	1	114.9	115.5	+0.5%
MiniMax-M2.5	2	64	2374.8	2374.8	+0.0%
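
The Overhead values in the table can be reproduced from the two throughput columns. The sketch below is a minimal check, not RTL code; the function name throughput_delta_pct is made up for illustration.

```python
def throughput_delta_pct(baseline_tok_s: float, profiled_tok_s: float) -> float:
    """Relative throughput change in percent (negative = profiled run was slower).

    Hypothetical helper for checking the table; it is not part of RTL.
    """
    return (profiled_tok_s - baseline_tok_s) / baseline_tok_s * 100.0


# Two rows from the Standard Benchmark table.
fp8_conc1 = throughput_delta_pct(97.3, 96.9)        # DeepSeek-R1-FP8, conc=1
mxfp4_conc64 = throughput_delta_pct(2614.6, 2634.8)  # DeepSeek-R1-MXFP4, conc=64

print(f"{fp8_conc1:+.1f}%  {mxfp4_conc64:+.1f}%")
```

Deltas of a few tenths of a percent in either direction are what run-to-run noise would also produce, which is why some rows show the profiled run slightly ahead of baseline.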

Decode-Focused Benchmark (ISL=128, OSL=2048)

Each run uses more prompts (conc × 15, minimum 30) to give higher statistical confidence in decode throughput.

Model	TP	Conc	Baseline (tok/s)	RTL lite (tok/s)	Overhead
DeepSeek-R1-FP8	8	1	100.9	100.7	-0.2%
DeepSeek-R1-FP8	8	64	3145.1	3142.3	-0.1%
DeepSeek-R1-MXFP4	4	1	131.2	131.3	+0.1%
DeepSeek-R1-MXFP4	4	64	2900.0	2886.9	-0.5%
GLM-5-FP8	8	1	59.0	58.9	-0.2%
GLM-5-FP8	8	64	2053.6	2057.0	+0.2%
MiniMax-M2.5	2	1	117.0	117.0	+0.0%
MiniMax-M2.5	2	64	2643.0	2648.0	+0.2%

Top 10 Kernels per Model — Prefill vs Decode

Kernel hotspots broken down by inference phase. Classification rules: MoE kernels with _ps_ in the name count as prefill, paged_attention and sampling kernels count as decode, and NCCL collectives are shown separately. All tables use the rank 0 trace at conc=64 with ISL/OSL=1k/1k. See Hot Trace Viewer for interactive visualization.

DeepSeek-R1-0528 FP8

260K ops · 109 kernels

TP=8

Prefill

73.5K ops · 1765.8ms compute (excl. 7809ms NCCL)

#	Kernel	Calls	Avg (μs)	Compute %
1	aiter::fmoe (novs_silu_1tg_ps_32x256)	290	2585.7	42.5%
2	ck::gemm_xdl_v3 (blockscale_preshuffle)	1,691	155.1	14.8%
3	__amd_rocclr_copyBuffer	54,153	4.5	13.7%
4	elementwise_kernel (manual_unroll)	488	164.5	4.5%
5	aiter::add_rmsnorm_quant	610	122.7	4.2%
6	aiter::dynamic_per_group_scaled_quant	1,729	28.4	2.8%
7	kn_get_mla_metadata_v1_2	415	96.8	2.3%
8	aiter::fmha_fwd (hd192_hd128_causal)	243	146.5	2.0%
9	_gemm_a16_w16 (64x128x128)	416	77.4	1.8%
10	hipBLASLt GEMM (MT128x128x256)	232	91.0	1.2%
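
The Compute % column appears to be each kernel's total time (calls × average duration) divided by the phase's compute time with NCCL excluded. That reading is inferred from the numbers rather than stated by the source, and the helper name below is hypothetical. A quick check against the first prefill row:

```python
def compute_share_pct(calls: int, avg_us: float, phase_compute_ms: float) -> float:
    """Share of a phase's compute time spent in one kernel, in percent.

    Assumes Compute % = calls * avg / phase compute time (an inference
    from the published tables, not a documented RTL formula).
    """
    kernel_ms = calls * avg_us / 1000.0  # microseconds -> milliseconds
    return kernel_ms / phase_compute_ms * 100.0


# aiter::fmoe in DeepSeek-R1-0528 FP8 prefill: 290 calls, 2585.7 us average,
# 1765.8 ms of prefill compute.
share = compute_share_pct(calls=290, avg_us=2585.7, phase_compute_ms=1765.8)
print(f"{share:.1f}%")  # 42.5%
```

Because the Avg column is rounded to 0.1 μs, recomputed shares for kernels with many calls can differ from the table in the last digit.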

Decode

186.5K ops · 4628.5ms compute (excl. 71ms NCCL)

#	Kernel	Calls	Avg (μs)	Compute %
1	aiter::fmoe (novs_silu_1tg_32x256)	3,538	350.4	26.8%
2	aiter::reduce_scatter<bf16, 8>	8,906	104.1	20.0%
3	ck::gemm_xdl_v3 (blockscale_preshuffle)	19,272	24.5	10.2%
4	_gemm_a16_w16 (64x128x128)	3,817	75.3	6.2%
5	__amd_rocclr_copyBuffer	51,124	5.3	5.9%
6	aiter::fmoe (vs_silu_1tg_32x256)	812	313.7	5.5%
7	kn_get_mla_metadata_v1_2	3,750	59.0	4.8%
8	aiter::mix_sample_outer_exponential	3,817	35.1	2.9%
9	aiter::fmha_fwd (hd192_hd128_causal)	4,454	24.9	2.4%
10	aiter::dynamic_per_group_scaled_quant	19,626	5.6	2.4%

gpt-oss-120b

95K ops · 113 kernels

TP=1

Prefill

3.0K ops · 718.8ms compute

#	Kernel	Calls	Avg (μs)	Compute %
1	ck_tile::MoeFlatmmKernel (GemmSpatial #1)	108	2311.9	34.7%
2	elementwise_kernel (manual_unroll)	432	488.3	29.3%
3	ck_tile::MoeFlatmmKernel (GemmSpatial #2)	105	785.6	11.5%
4	hipBLASLt GEMM (MT256x256x64)	108	305.0	4.6%
5	hipBLASLt GEMM (MT240x256x64)	108	290.7	4.4%
6	_fused_qk_rope_reshape_and_cache	72	410.0	4.1%
7	__amd_rocclr_copyBuffer	372	48.8	2.5%
8	paged_attention	72	236.0	2.4%
9	vectorized_elementwise (float8_copy)	213	57.2	1.7%
10	vectorized_elementwise (Fill)	144	69.7	1.4%

Decode

92.0K ops · 2014.7ms compute

#	Kernel	Calls	Avg (μs)	Compute %
1	ck_tile::MoeFlatmmKernel (GemmSpatial #1)	2,052	287.1	29.2%
2	ck_tile::MoeFlatmmKernel (GemmSpatial #2)	2,055	122.4	12.5%
3	aiter::mix_sample_outer_exponential	4,259	49.2	10.4%
4	__amd_rocclr_copyBuffer	36,817	5.1	9.4%
5	ck_tile::MoeFlatmmKernel (GemmSpatial #3)	1,296	81.7	5.3%
6	_fused_qk_rope_reshape_and_cache	2,700	32.5	4.4%
7	paged_attention	2,340	31.1	3.6%
8	vectorized_elementwise (float8_copy)	4,107	8.9	1.8%
9	hipBLASLt GEMM (MT160x128x128)	792	39.5	1.6%
10	vectorized_elementwise (Fill)	128	227.3	1.4%

Kimi-K2.5-MXFP4

195K ops · 153 kernels

TP=4

Prefill

17.2K ops · 2039.0ms compute (excl. 11321ms NCCL)

#	Kernel	Calls	Avg (μs)	Compute %
1	ck::moe_mxgemm_2lds (BPreshuffle #1)	447	2141.8	47.0%
2	ck::moe_mxgemm_2lds (BPreshuffle #2)	449	502.4	11.1%
3	elementwise_kernel (manual_unroll)	364	354.1	6.3%
4	hipBLASLt GEMM (MT192x240x64)	244	460.4	5.5%
5	hipBLASLt GEMM (MT256x256x64)	671	153.5	5.1%
6	aiter::reduce_scatter<bf16, 4>	303	271.2	4.0%
7	aiter::add_rmsnorm_quant	609	122.6	3.7%
8	aiter::fmha_fwd (hd192_hd128_causal)	396	104.0	2.0%
9	_fused_dynamic_mxfp4_quant_moe_sort	1,323	30.7	1.2%
10	hipBLASLt GEMM (MT128x128x256)	671	55.7	1.0%

Decode

177.9K ops · 5704.4ms compute (excl. 84ms NCCL)

#	Kernel	Calls	Avg (μs)	Compute %
1	ck::moe_mxgemm_2lds (BPreshuffle #1)	4,113	271.6	19.6%
2	aiter::reduce_scatter<bf16, 4>	7,505	134.6	17.7%
3	ck::moe_mxgemm_2lds (BPreshuffle #2)	4,111	229.7	16.6%
4	__amd_rocclr_copyBuffer	52,986	5.2	4.8%
5	kn_get_mla_metadata_v1_2	4,194	61.4	4.5%
6	hipBLASLt GEMM (MT256x16x128)	2,361	92.9	3.8%
7	_fused_dynamic_mxfp4_quant_moe_sort	7,262	28.2	3.6%
8	aiter::allgather_lastdim<bf16, 4>	4,246	44.8	3.3%
9	aiter::mix_sample_outer_exponential	4,246	42.6	3.2%
10	hipBLASLt GEMM (MT224x64x128)	1,786	98.3	3.1%

DeepSeek-R1 MXFP4-MTP

213K ops · 152 kernels

TP=4

Prefill

68.9K ops · 3746.2ms compute (excl. 2973ms NCCL)

#	Kernel	Calls	Avg (μs)	Compute %
1	ck::moe_mxgemm_2lds (BPreshuffle #1)	2,399	553.9	35.5%
2	ck::moe_mxgemm_2lds (BPreshuffle #2)	2,398	231.3	14.8%
3	aiter::reduce_scatter<bf16, 4>	4,244	126.5	14.3%
4	ck::gemm_xdl_v3 (blockscale_preshuffle)	7,091	46.3	8.8%
5	aiter::fmha_fwd (hd192_hd128_causal)	2,347	52.6	3.3%
6	_fused_dynamic_mxfp4_quant_moe_sort	4,031	25.1	2.7%
7	elementwise_kernel (manual_unroll)	360	235.0	2.3%
8	aiter::add_rmsnorm_quant	601	117.6	1.9%
9	CatArrayBatchedCopy	2,346	28.8	1.8%
10	ck_tile::FlatmmKernel (GemmSpatial)	268	201.4	1.4%

Decode

143.6K ops · 3488.9ms compute (excl. 218ms NCCL)

#	Kernel	Calls	Avg (μs)	Compute %
1	ck::moe_mxgemm_2lds (BPreshuffle #1)	2,241	238.2	15.3%
2	aiter::reduce_scatter<bf16, 4>	3,930	120.6	13.6%
3	ck::moe_mxgemm_2lds (BPreshuffle #2)	2,242	176.0	11.3%
4	kn_get_mla_metadata_v1_2	4,196	84.3	10.1%
5	__amd_rocclr_copyBuffer	50,055	5.2	7.4%
6	ck::gemm_xdl_v3 (blockscale_preshuffle)	6,268	28.8	5.2%
7	hipBLASLt GEMM (MT256x16x128)	2,190	73.7	4.6%
8	hipBLASLt GEMM (MT192x64x128)	1,784	85.0	4.3%
9	aiter::allgather_lastdim<bf16, 4>	4,259	35.4	4.3%
10	aiter::mix_sample_outer_exponential	4,259	34.4	4.2%

GLM-5-FP8

368K ops · 156 kernels

TP=8

Prefill

261.7K ops · 8132.1ms compute (excl. 11102ms NCCL)

#	Kernel	Calls	Avg (μs)	Compute %
1	aiter::reduce_scatter<bf16, 8>	10,360	259.9	33.1%
2	aiter::fmoe (vs_silu_1tg_ps_32x256)	5,775	408.0	29.0%
3	ck::gemm_xdl_v3 (blockscale_preshuffle)	33,955	26.2	10.9%
4	paged_attention	5,479	76.7	5.2%
5	__amd_rocclr_copyBuffer	72,478	4.8	4.2%
6	aiter::dynamic_per_group_scaled_quant	34,911	7.6	3.3%
7	aiter::local_device_load_rmsnorm	10,357	11.4	1.5%
8	ck_tile::QuantGemmKernel	380	281.9	1.3%
9	elementwise_kernel (manual_unroll)	1,089	86.0	1.2%
10	aiter::add_rmsnorm_quant	760	102.6	1.0%

Decode

106.2K ops · 1475.4ms compute (excl. 243ms NCCL)

#	Kernel	Calls	Avg (μs)	Compute %
1	__amd_rocclr_copyBuffer	52,568	5.1	18.1%
2	kn_get_mla_metadata_v1_2	4,193	61.7	17.5%
3	aiter::allgather_lastdim<bf16, 2>	4,256	46.7	13.5%
4	aiter::mix_sample_outer_exponential	4,256	41.2	11.9%
5	hipBLASLt GEMM (MT256x16x128)	2,325	56.8	9.0%
6	hipBLASLt GEMM (MT128x64x128)	1,764	61.8	7.4%
7	ck::gemm_xdl_v3 (blockscale_preshuffle)	3,422	17.3	4.0%
8	aiter::reduce_scatter<bf16, 8>	248	220.3	3.7%
9	_masked_embedding_kernel	4,262	6.1	1.8%
10	vectorized_elementwise (Fill)	160	160.0	1.7%

MiniMax-M2.5

1.87M ops · 76 kernels

TP=2

Prefill

1.9K ops · 146.3ms compute

#	Kernel	Calls	Avg (μs)	Compute %
1	elementwise_kernel (manual_unroll)	248	440.4	74.6%
2	aiter::fmoe (vs_silu_32x384)	62	305.5	12.9%
3	aiter::reduce_scatter<bf16, 2>	124	50.1	4.2%
4	ck::gemm_xdl_v3 (blockscale_preshuffle)	123	18.5	1.6%
5	aiter::fmha_fwd (hd128_causal)	62	28.0	1.2%
6	hipBLASLt GEMM (MT32x16x512)	62	15.8	0.7%
7	aiter::dynamic_per_group_scaled_quant	185	4.6	0.6%
8	aiter::local_device_load_rmsnorm	124	5.9	0.5%
9	__amd_rocclr_copyBuffer	124	4.7	0.4%
10	ck_tile::MoeSortingKernel	62	7.2	0.3%

Decode

1.87M ops · 27920.3ms compute (excl. 460ms NCCL)

#	Kernel	Calls	Avg (μs)	Compute %
1	triton_red (RMSNorm reduce)	151,489	35.3	19.2%
2	triton_poi (dtype cast)	76,334	61.8	16.9%
3	vectorized_elementwise (Fill)	1,065,237	4.3	16.3%
4	triton_poi (residual+norm)	75,169	43.6	11.7%
5	aiter::fmoe (vs_silu_32x384)	4,836	476.2	8.2%
6	triton_red (RMSNorm reduce #2)	138,684	7.5	3.7%
7	aiter::reduce_scatter<bf16, 2>	8,556	101.8	3.1%
8	triton_poi (residual+norm #2)	75,084	9.5	2.6%
9	aiter::allgather_lastdim<bf16, 2>	4,260	130.7	2.0%
10	aiter::cross_device_reduce_1stage	62	6134.6	1.4%

RTL_MODE Comparison (DeepSeek-R1 TP=8)

Mode	Ops/rank	Kernels	GPU timing	Overhead	Perfetto
lite (default)	211K	108	Yes (partial)	~0%	trace
default	260K	109	Yes	~2–4%	trace
full	5.49M*	112*	Yes (all)	~2–5%	trace*
hip (new in v0.3.7)	—	—	Yes + HIP API	~5–10%	—

* full mode requires ROCm 7.13+ with ROCR fix 559d48b1. hip mode adds HIP API interception via LD_PRELOAD for CPU–GPU correlation.

HIP API Interception — v0.3.7 (MI355X, Apr 2026)

New in v0.3.7: RTL_MODE=hip captures CPU-side HIP runtime API calls alongside GPU kernel timing. Zero new dependencies — uses dlsym(RTLD_NEXT) interposition, no roctracer, no rocprofiler-sdk, no special libamdhip64.

Intercepted HIP APIs

Function	Category	Captures
hipModuleLaunchKernel	Kernel launch	grid, block, shared mem, stream, timing
hipExtModuleLaunchKernel	Kernel launch (ATOM/Triton)	grid, block, shared mem, timing
hipMemcpy / hipMemcpyAsync	Memory copy	size, kind, timing
hipMalloc / hipFree	Memory allocation	size, timing
hipStreamSynchronize	Stream sync	timing
hipDeviceSynchronize	Device sync	timing
hipGraphLaunch	Graph replay	stream, timing

Validation — PyTorch on MI355X (ROCm 7.2.2)

Workload	HIP APIs	GPU Ops	CUDAGraph	Status
torch.mm 512×512 FP16 × 20	24	22	—	PASS
CUDAGraph capture + 20 replays	56	70	Yes	PASS
64L Transformer, 200 iters (eager)	~28K	~28K	—	PASS

vs roctracer — Profiler Reliability Comparison

Metric	roctracer	rocprofiler-sdk	RTL
Kernel event drop (decode)	54%	0%	0%
hipGraphLaunch overhead	50 μs	316 μs	50 μs
Segfault at exit	No	Yes (masked in CI for 1 month)	No
Startup overhead	None	CI timeout	None
CUDAGraph compatible	Drops events	Crash	Works
External dependencies	roctracer + HSA	rocprofsdk + register	libhsa-runtime64 only

Data from issue #73 and ATOM PR #432. Kernel drop measured with Rohan's regression test on MI355X.

Quick Start

# Capture HIP API + GPU kernels in one trace
rtl trace --mode hip python3 my_model.py

# View results
rtl summary trace.db                # GPU kernel hotspots
sqlite3 trace.db "SELECT * FROM rocpd_api LIMIT 10;"  # HIP API calls

Version Regression (DeepSeek-R1 TP=8, conc=64)

Version	Throughput (tok/s)	Overhead	Perfetto
Baseline	2804	—	—
v0.1.0	2799	-0.2%	trace
v0.1.1	2754	-1.8%	trace
v0.2.0	2729	-2.7%	trace
v0.3.0	2656	-5.3%	trace
main (v0.3.1)	2699	-3.7%	trace
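v0.3.2 (lite)	2916	-0.3%	—
v0.3.7 (lite)	2285	-0.0%	—

The overhead in the lite rows is relative to each run's own baseline rather than the 2804 tok/s row above; the v0.3.7 row corresponds to the Standard Benchmark table (2285.9 → 2285.0 tok/s).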

Tutorial — Profiling Prefill vs Decode with roctx Markers

RTL includes a built-in roctx shim. Add markers to your code to analyze GPU kernel hotspots per inference phase.

Step 1: Add roctx markers

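import ctypes
import torch

# CDLL(None) resolves symbols already loaded into this process;
# RTL injects the roctx shim via LD_PRELOAD.
lib = ctypes.CDLL(None)
roctx_push = lib.roctxRangePushA
roctx_push.argtypes = [ctypes.c_char_p]
roctx_push.restype = ctypes.c_int
roctx_pop = lib.roctxRangePop
roctx_pop.argtypes = []
roctx_pop.restype = ctypes.c_int

# model, input_tokens, and max_tokens are placeholders for your own inference code.

# Prefill
roctx_push(b"prefill")
output = model.prefill(input_tokens)
torch.cuda.synchronize()  # wait for queued GPU work before closing the range
roctx_pop()

# Decode
roctx_push(b"decode")
for step in range(max_tokens):
    token = model.decode_step()
torch.cuda.synchronize()  # wait for queued GPU work before closing the range
roctx_pop()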
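
The push/synchronize/pop pattern above can be wrapped in a context manager so a range is always closed, even if the body raises. This is a sketch, not part of RTL: the names roctx_range and sync are hypothetical, and the no-op fallback when the roctx symbols are missing is an added convenience so the same script also runs outside rtl trace.

```python
import contextlib
import ctypes


def _load_roctx():
    """Return (push, pop) bound to the roctx symbols, or no-op stand-ins."""
    try:
        lib = ctypes.CDLL(None)  # symbols already loaded into this process
        push = lib.roctxRangePushA
        pop = lib.roctxRangePop
    except (OSError, AttributeError, TypeError):
        # roctx is not loaded (for example, not running under rtl trace).
        return (lambda name: 0), (lambda: 0)
    push.argtypes = [ctypes.c_char_p]
    push.restype = ctypes.c_int
    pop.argtypes = []
    pop.restype = ctypes.c_int
    return push, pop


_push, _pop = _load_roctx()


@contextlib.contextmanager
def roctx_range(name, sync=None):
    """Mark a named region; `sync` is an optional callable run before the pop.

    Hypothetical helper: pass sync=torch.cuda.synchronize to mirror the
    explicit synchronize calls in Step 1.
    """
    _push(name.encode())
    try:
        yield
    finally:
        try:
            if sync is not None:
                sync()
        finally:
            _pop()  # always close the range, even if the body or sync raised
```

Usage would look like `with roctx_range("prefill", sync=torch.cuda.synchronize): output = model.prefill(input_tokens)`, where model and input_tokens are the same placeholders as in Step 1.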

Step 2: Run with RTL

rtl trace -o trace.db python3 my_model.py

That's it. rtl trace auto-sets LD_PRELOAD so roctx markers just work.

Step 3: Analyze

rtl summary trace.db

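Or query by roctx region in SQLite, replacing <prefill_start> and <prefill_end> with the region's start and end timestamps (in the same units as rocpd_op.start):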

SELECT s.string AS kernel, COUNT(*) AS calls,
       ROUND(AVG(o.end - o.start) / 1e3, 1) AS avg_us
FROM rocpd_op o JOIN rocpd_string s ON o.description_id = s.id
WHERE o.gpuId >= 0
  AND o.start BETWEEN <prefill_start> AND <prefill_end>
GROUP BY s.string ORDER BY SUM(o.end - o.start) DESC LIMIT 10;
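
The same query can be run from Python with bound parameters instead of hand-edited placeholders. This is a sketch under stated assumptions: top_kernels is a hypothetical helper, the window bounds must be supplied by the caller, and the in-memory tables are a minimal stand-in containing only the columns the query above names, not the full trace schema. The end column is quoted here because END is an SQL keyword.

```python
import sqlite3

QUERY = """
SELECT s.string AS kernel, COUNT(*) AS calls,
       ROUND(AVG(o."end" - o.start) / 1e3, 1) AS avg_us
FROM rocpd_op o JOIN rocpd_string s ON o.description_id = s.id
WHERE o.gpuId >= 0
  AND o.start BETWEEN ? AND ?
GROUP BY s.string ORDER BY SUM(o."end" - o.start) DESC LIMIT ?;
"""


def top_kernels(conn, window_start, window_end, limit=10):
    """Return (kernel, calls, avg_us) rows for ops starting inside the window."""
    return conn.execute(QUERY, (window_start, window_end, limit)).fetchall()


def make_demo_db():
    """Build a tiny in-memory stand-in for trace.db (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE rocpd_string (id INTEGER PRIMARY KEY, string TEXT);
        CREATE TABLE rocpd_op (id INTEGER PRIMARY KEY, gpuId INTEGER,
                               start INTEGER, "end" INTEGER,
                               description_id INTEGER);
    """)
    conn.executemany("INSERT INTO rocpd_string VALUES (?, ?)",
                     [(1, "gemm"), (2, "copy")])
    conn.executemany(
        'INSERT INTO rocpd_op (gpuId, start, "end", description_id) '
        "VALUES (?, ?, ?, ?)",
        [(0, 100, 16100, 1),      # gemm, 16000 ns
         (0, 20000, 36000, 1),    # gemm, 16000 ns
         (0, 40000, 45000, 2),    # copy, 5000 ns
         (0, 90000, 99000, 1)])   # gemm, starts outside the demo window
    return conn


rows = top_kernels(make_demo_db(), 0, 50000)
for kernel, calls, avg_us in rows:
    print(f"{kernel:10s} calls={calls} avg={avg_us}us")
```

Against a real trace, replace make_demo_db() with sqlite3.connect("trace.db") and pass the timestamps of the region you want to inspect.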

Example output

roctx markers:
  prefill: 24.3ms
  decode:  54.9ms

=== prefill (74 ops, 0.7ms GPU time) ===
  Cijk_Ailk_Bljk (MT64x64x256)    calls=12   avg=16.1us  27.7%  # Attention GEMM (large batch)
  Cijk_Ailk_Bljk (MT64x128x128)   calls=12   avg=13.2us  22.7%  # FFN up-project
  Cijk_Ailk_Bljk (MT64x16x512)    calls=13   avg=12.0us  22.5%  # FFN down-project
  vectorized_elementwise (gelu)    calls=12   avg=5.7us    9.9%  # Activation

=== decode (6336 ops, 36.9ms GPU time) ===
  Cijk_Ailk_Bljk (MT256x16x384)   calls=1600 avg=8.0us   34.6%  # Skinny GEMM (batch=1)
  ScaleAlphaVec_PostGSU8_VW1       calls=1600 avg=4.5us   19.3%  # Post-GEMM scale
  Cijk_Ailk_Bljk (MT64x16x128)    calls=768  avg=9.1us   18.9%  # FFN GEMM

How to view Perfetto traces

1. Download a .json.gz trace file from the tables above.
2. Open ui.perfetto.dev.
3. Press Ctrl+O and select the file. roctx regions appear as labeled spans on the timeline.
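
For a quick look without the UI, a downloaded trace can also be summarized from Python. This sketch assumes the .json.gz file is in Chrome Trace Event JSON, one of the formats ui.perfetto.dev opens; whether RTL's export uses exactly this layout is an assumption, and both function names are hypothetical.

```python
import gzip
import json
from collections import Counter


def load_trace_events(path):
    """Read a gzip-compressed JSON trace and return its list of events.

    Assumes Chrome Trace Event JSON: either an object with a "traceEvents"
    list or a bare list of events.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    return data["traceEvents"] if isinstance(data, dict) else data


def busiest_events(events, top=5):
    """Sum the duration of complete ("X") events by name, largest first."""
    totals = Counter()
    for ev in events:
        if ev.get("ph") == "X":  # complete events carry a "dur" field
            totals[ev.get("name", "?")] += ev.get("dur", 0)
    return totals.most_common(top)
```

Calling busiest_events(load_trace_events("trace.json.gz")) would then list the event names with the largest total duration, where trace.json.gz stands for whichever file you downloaded.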