CLI Reference

rocm-trace-lite provides the rtl command-line tool. (rtl-legacy also works as an alias.)

rtl trace

Trace a GPU workload and generate profiling output.

rtl trace [-o OUTPUT] COMMAND [ARGS...]

Options:

Flag

Default

Description

-o, --output

trace.db

Output trace file path

Output files generated:

File

Description

trace.db

SQLite trace database (RPD format)

trace_summary.txt

Text summary of top kernels

trace.json.gz

Compressed Perfetto JSON (open in ui.perfetto.dev)

Examples:

# Basic tracing
rtl trace -o trace.db python3 my_model.py

# Multi-GPU with torchrun
rtl trace -o trace.db torchrun --nproc_per_node=4 train.py

# Trace a shell command
rtl trace -o trace.db -- ./my_hip_app --batch-size 32

rtl summary

Print top kernels and GPU utilization from a trace.

rtl summary [-n LIMIT] INPUT

Options:

Flag

Default

Description

-n, --limit

20

Number of top kernels to show

Example:

rtl summary -n 10 trace.db

rtl convert

Convert an RPD trace to Perfetto JSON or rocprofv3 JSON.

rtl convert [-o OUTPUT] [-f FORMAT] INPUT

Options:

Flag

Default

Description

-o, --output

<input>.json.gz

Output file path

-f, --format

perfetto

Output format: perfetto (Chrome Trace JSON) or rocprofv3 (rocprofiler-sdk-tool JSON for TraceLens)

Examples:

# Perfetto visualization (default)
rtl convert trace.db -o trace.json
# Open trace.json in https://ui.perfetto.dev

# rocprofv3 format for TraceLens analysis
rtl convert trace.db --format rocprofv3 -o trace_results.json
TraceLens_generate_perf_report_rocprof --profile_json_path trace_results.json

rtl info

Show structural information about a trace file.

rtl info INPUT

Example:

rtl info trace.db
Trace: trace.db
  Size: 1.2 MB
  Tables: rocpd_api, rocpd_api_ops, rocpd_copyapi, rocpd_kernelapi, rocpd_metadata, rocpd_monitor, rocpd_op, rocpd_string
  rocpd_op: 728 rows
  rocpd_string: 45 rows
  Duration: 13.247s
  Unique kernels: 5

Environment variables

Variable

Values

Description

RTL_OUTPUT

file path

Output trace file (supports %p for PID). Alternative to -o flag.

RTL_MODE

lite, standard, full, hip

Profiling mode. lite (default): skip has-signal packets (~0% overhead). standard: signal injection for all count==1 dispatches, skip graph replay. full: profile everything including graph replay (requires ROCm 7.13+). hip: HIP API interception via LD_PRELOAD — captures CPU-side HIP call timings alongside GPU kernel execution.

RTL_DEBUG

1

Log per-call summary: intercept call count, device ID, batch skip decisions

RTL_DEBUG

2

Log per-packet details: AQL type, signal handle, kernel object address