# Changelog

## v0.3.5

### Documentation

- Fix 7 documentation inconsistencies with the codebase (#89): remove fictional `RTL_NO_INJECT`, document all 8 DB tables, add missing roctx functions, fix stale `rpd-lite`/`rpd_lite` references
## v0.3.4

- Dispatch info: record hardware queue ID, workgroup size, and grid dimensions per kernel in the `completionSignal` column of `rocpd_op`
- Gzip Perfetto output: `rtl trace` now produces compressed `.json.gz` files for Perfetto timeline visualization
- Build fix: prefer the repo-root `librtl.so` over a system-installed version during development
- Add ROCR intercept-queue SEGFAULT reproducer (`repro/`)
- Sync `__version__` with `pyproject.toml`
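The gzipped Perfetto output can be sketched as follows. This is a minimal illustration only, assuming a plain list of Chrome trace events stands in for the real converter output; it is not `rtl`'s actual code:

```python
import gzip
import json

def write_perfetto_trace(events, path):
    """Write Chrome/Perfetto JSON trace events as a compressed .json.gz file.

    Perfetto's trace viewer opens .json.gz files directly, so compressing
    saves disk space with no extra decompression step for the user.
    """
    trace = {"traceEvents": events}
    # gzip.open in text mode ("wt") compresses the JSON as it is written
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(trace, f)

events = [{"name": "my_kernel", "ph": "X", "ts": 0, "dur": 42, "pid": 1, "tid": 1}]
write_perfetto_trace(events, "trace.json.gz")
```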
## v0.3.3

- Fix roctx support: `rtl trace` now sets `LD_PRELOAD` automatically, so roctx markers work without manual setup
- Add tutorial: profiling prefill vs. decode with roctx markers
- Add MI355X benchmark results and hot trace viewer as sample pages
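The automatic `LD_PRELOAD` setup for roctx can be sketched like this. This is a hypothetical launcher helper, not the actual `rtl trace` code; the library path, the helper name, and the choice to prepend rather than replace are all assumptions:

```python
import os

def build_child_env(lib_path, base_env=None):
    """Return an environment dict for the traced child process with the
    tracer library prepended to LD_PRELOAD, so its built-in roctx shim
    is loaded before the application resolves roctx symbols."""
    env = dict(base_env if base_env is not None else os.environ)
    existing = env.get("LD_PRELOAD", "")
    # Prepend, preserving any libraries the user already preloads
    env["LD_PRELOAD"] = lib_path + (":" + existing if existing else "")
    return env

env = build_child_env("/opt/rtl/librtl.so", base_env={})
# env["LD_PRELOAD"] == "/opt/rtl/librtl.so"
```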
## v0.3.2

### Bug fixes

- Fix SQLite concurrency crash: `flush()` and `close()` now hold `g_db_mutex`, preventing races with `record_kernel()` batch commits. GLM-5 TP=8 lite mode was crashing with 1788 SQLite errors, leading to a GPU memory fault.
- Default mode changed to lite: lite mode is now the default (`RTL_MODE` unset = lite). Safe for all ROCm versions, including 7.2, which has the ROCR `InterceptQueue::staging_buffer_` heap overflow bug. Use `RTL_MODE=default` for full `count==1` profiling (safe on ROCm 7.13+).
- Removed the redundant `atexit` handler for DB flush (shutdown already handles it).
## v0.3.0

### Profiling modes (`RTL_MODE`)

Three modes:

- `default` (signal injection, skips graph replay)
- `lite` (also skips has-signal packets, ~0% overhead)
- `full` (profiles everything including graph replay, requires ROCm 7.13+)

Set via the `RTL_MODE` env var or the `rtl trace --mode` CLI flag.

- Default mode: profiles all `count==1` kernel dispatches with real GPU timing, skips graph replay batches
- Lite mode: additionally skips NCCL and other kernels with existing completion signals, matching v0.1.1 behavior for near-zero overhead
- Full mode: profiles graph replay batches (`count > 1`). Requires ROCm 7.13+ with the ROCR fix to avoid the `InterceptQueue::staging_buffer_` heap overflow
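The mode selection can be sketched as a small resolver. This is a hypothetical helper, not the library's actual code; in particular, the precedence of the `--mode` flag over the env var and the fallback when `RTL_MODE` is unset (lite became the default in v0.3.2) are assumptions:

```python
import os

VALID_MODES = ("default", "lite", "full")

def resolve_mode(environ=None, cli_mode=None):
    """Resolve the profiling mode: an explicit --mode CLI value wins
    over the RTL_MODE environment variable; an unset variable falls
    back to "lite"."""
    environ = os.environ if environ is None else environ
    mode = cli_mode or environ.get("RTL_MODE", "lite")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown RTL_MODE: {mode!r}")
    return mode

assert resolve_mode(environ={}) == "lite"
assert resolve_mode(environ={"RTL_MODE": "full"}) == "full"
```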
### CUDAGraph / HIP graph compatibility (#67)

- Root cause identified: `hsa_amd_queue_intercept_create` has a heap overflow bug in `InterceptQueue::staging_buffer_` (hardcoded to 256 entries). Fixed upstream in rocm-systems commit 559d48b1.
- Default and lite modes skip batch submissions (`count > 1`) as a workaround
- `RTL_DEBUG=1/2`: packet-level diagnostic logging. Level 1 logs a per-call summary; level 2 adds per-packet details.
### Signal forwarding

- The original `completion_signal` is saved before injection and forwarded via `hsa_signal_subtract_screlease` after timestamp collection. Packets with app-provided signals are now profiled correctly instead of being skipped.
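Signal forwarding can be illustrated with plain counters standing in for HSA signals. This is a conceptual sketch only; the real implementation is native and calls `hsa_signal_subtract_screlease` on an `hsa_signal_t`, and all names below are hypothetical:

```python
collected = []

def record(timestamps):
    # Stand-in for writing op timing to the trace database
    collected.append(timestamps)

class FakeSignal:
    """Stand-in for an hsa_signal_t: a counter the app waits to hit zero."""
    def __init__(self, value):
        self.value = value

def intercept_dispatch(packet, pool):
    # Save the app's original completion signal (if any) and inject a
    # pooled one, so the profiler always gets a completion notification.
    original = packet.get("completion_signal")
    packet["completion_signal"] = pool.pop()
    return original

def on_completion(original, timestamps):
    # After timestamp collection, forward completion to the app's saved
    # signal by decrementing it, mimicking
    # hsa_signal_subtract_screlease(original, 1).
    record(timestamps)
    if original is not None:
        original.value -= 1

app_signal = FakeSignal(1)
packet = {"completion_signal": app_signal}
original = intercept_dispatch(packet, pool=[FakeSignal(1)])
on_completion(original, timestamps=(100, 250))
assert app_signal.value == 0  # the app still observes its kernel completing
```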
### Testing

- Added `TestBatchSkip`, `TestSignalForwarding`, `TestNoInjectMode`, `TestDebugLogging` source-code audit tests
- Added `TestHipGraph` GPU E2E tests (basic, multi-stream, large, stress, batch-skip logging, no-0x1009)
- HIP graph workloads added to the `gpu_workload` binary (hipgraph, hipgraph_ms, hipgraph_large, hipgraph_stress)
### Bug fixes

- Fixed `profiling_set_profiler_enabled` return value not being checked in the `RTL_NO_INJECT` path
- Fixed version mismatch between `pyproject.toml` (0.2.1) and `__init__.py` (0.3.0)
## v0.2.0

### Signal injection profiling

- Breaking: replaced observe-only profiling with signal injection (#31)
- The HIP runtime (ROCm 7.2) does not set `completion_signal` on kernel dispatch packets
- Signal pool (64 pre-allocated, 4096 max) avoids per-dispatch allocation overhead
- No extra HSA queues (avoids TP=8 OOM from the barrier-packet approach)
- Fix: batch dispatches (`count > 1`) are no longer silently dropped
- Added diagnostic counters printed at shutdown for each process
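The signal pool can be modeled as a free list with lazy growth up to a hard cap. A hypothetical Python sketch of the idea, using the 64/4096 sizes from the note above; the real pool holds HSA signals, not placeholder objects:

```python
class SignalPool:
    """Free list of reusable signals: 64 pre-allocated, growing on demand
    up to a cap of 4096, so the dispatch hot path never allocates once
    the pool is warm."""
    PREALLOC, CAP = 64, 4096

    def __init__(self):
        self.free = [self._alloc() for _ in range(self.PREALLOC)]
        self.total = self.PREALLOC

    def _alloc(self):
        return object()  # stand-in for creating a real HSA signal

    def acquire(self):
        if self.free:
            return self.free.pop()      # fast path: reuse, no allocation
        if self.total >= self.CAP:
            raise RuntimeError("signal pool exhausted")
        self.total += 1
        return self._alloc()            # slow path: grow toward the cap

    def release(self, sig):
        self.free.append(sig)           # return the signal for reuse

pool = SignalPool()
sig = pool.acquire()
pool.release(sig)
```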
### Rename and consistency

- Renamed `librpd_lite.so` to `librtl.so`; standardized the CLI on `rtl`
- All stderr messages now use the `rtl:` prefix
- Added preflight diagnostics (ldd-based dependency checks)
- Kernel name demangling for readable trace output
### Documentation

- 5-tool comparison table (vs. RPD, rocprofiler-sdk, roctracer, Triton Proton)
- Wheel installation instructions in the README
- Simplified quick start: `rtl trace` does everything
### Testing

- 314 tests (was 130): multi-thread, multi-stream, HIP graph, multi-GPU, stress
- GPU CI on MI355X (single- and 8-GPU runners)
- Pre-release validation suite with microbenchmarks and E2E tests
- Validated GPT-OSS 120B TP=8 on MI355X (~1M ops, 0 drops)
## v0.1.1

### Multi-process support (#28)

- Per-process trace files via `%p` PID substitution
- Automatic merge of per-process traces into a single output
- GPU ID preservation across merged traces
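The `%p` substitution can be sketched in a few lines. A minimal illustration, not the actual implementation; the helper name and template are hypothetical:

```python
import os

def resolve_trace_path(template):
    """Expand %p to the current PID so each process in a multi-process
    job (e.g. TP=8) writes its own trace file instead of clobbering a
    shared one; the per-process files are merged afterwards."""
    return template.replace("%p", str(os.getpid()))

path = resolve_trace_path("trace_%p.rpd")
# e.g. "trace_12345.rpd" when the PID is 12345
```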
### Packaging

- Python wheel packaging with the `rtl` CLI tool
- `pip install rocm-trace-lite` support
### Testing

- 314 unit/integration tests (CPU + GPU)
- HIP Graph capture/replay safety tests
- Multi-GPU, multi-stream, multi-thread stress tests
- roctx marker integration tests
## v0.1.0

### Initial release

- HSA kernel tracing via `HSA_TOOLS_LIB` interception
- SQLite output in RPD-compatible format
- Perfetto/Chrome trace converter (`rpd2trace.py`)
- Built-in roctx shim (no libroctx64 dependency)
- Single completion worker thread (replaced thread-per-dispatch)
- Zero dependency on roctracer or rocprofiler-sdk