Multi-GPU and Distributed Profiling

rocm-trace-lite supports profiling across multiple GPUs and distributed processes out of the box.

How it works

When profiling multi-process workloads (e.g., torchrun, torch.distributed.launch), rocm-trace-lite handles the per-process traces as follows:

Per-process files: Each process writes to its own trace file using PID substitution (trace_%p.db)
Automatic merge: After all processes exit, per-process files are merged into a single output
GPU ID preservation: Each process’s gpuId is preserved in the merged trace

torchrun --nproc_per_node=8 model.py
    ├── Process 0 (PID 1234) → trace_1234.db (GPU 0)
    ├── Process 1 (PID 1235) → trace_1235.db (GPU 1)
    ├── ...
    └── Process 7 (PID 1241) → trace_1241.db (GPU 7)
         ↓ automatic merge
    trace.db (all 8 GPUs combined)
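The naming scheme in the diagram can be illustrated with a minimal Python sketch. This is not rocm-trace-lite's implementation; the helper name expand_trace_path is hypothetical and only shows how a %p placeholder maps to one file per process.

```python
import os
from typing import Optional


def expand_trace_path(template: str, pid: Optional[int] = None) -> str:
    """Replace every %p in the template with a process ID.

    Hypothetical helper for illustration only; it mirrors the
    trace_%p.db naming shown above, not the tool's actual code.
    """
    if pid is None:
        pid = os.getpid()  # default to the calling process
    return template.replace("%p", str(pid))
```

For example, expand_trace_path("trace_%p.db", pid=1234) returns "trace_1234.db", matching the file written by Process 0 in the diagram.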

Usage

# TP=8 inference
rtl trace -o trace.db torchrun --nproc_per_node=8 my_model.py

# Check per-GPU distribution
sqlite3 trace.db "SELECT gpuId, count(*) FROM rocpd_op GROUP BY gpuId;"
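
The same per-GPU check can be scripted, for instance to fail a CI job when a GPU is missing from the merged trace. The sketch below relies only on the rocpd_op table and gpuId column used in the query above. The function names are hypothetical, and it assumes GPU IDs are numbered 0 to N-1 as in the diagram.

```python
import sqlite3
from typing import Dict, List


def kernels_per_gpu(db_path: str) -> Dict[int, int]:
    """Return {gpuId: row count} from the rocpd_op table of a trace file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT gpuId, count(*) FROM rocpd_op GROUP BY gpuId ORDER BY gpuId"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


def missing_gpus(counts: Dict[int, int], expected_gpus: int) -> List[int]:
    """List GPU IDs in 0..expected_gpus-1 with no rows in the trace.

    Assumes GPU IDs start at 0 and are contiguous, which may not hold
    on every system.
    """
    return [gpu for gpu in range(expected_gpus) if gpu not in counts]
```

For the TP=8 run above, missing_gpus(kernels_per_gpu("trace.db"), 8) should return an empty list if every GPU contributed to the merged trace.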

Diagnostic tool

For troubleshooting multi-process profiling, use the diagnostic script:

# Inspect per-process files before merge
python3 tests/diagnose_trace.py trace_*.db

This reports per-file kernel counts, GPU IDs, and flags asymmetry between processes.
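
To make "asymmetry" concrete, here is a minimal sketch of one way to flag it. It is not the logic of tests/diagnose_trace.py. The median comparison and the 5% tolerance are assumptions chosen for illustration, and symmetric counts are only expected when every rank runs the same kernels, as in typical tensor-parallel runs.

```python
from statistics import median
from typing import Dict, List


def flag_asymmetry(counts: Dict[str, int], tolerance: float = 0.05) -> List[str]:
    """Return the files whose kernel count deviates from the median.

    counts maps a per-process trace file name to its kernel count.
    A file is flagged when its relative deviation from the median
    exceeds tolerance (an assumed threshold, not the tool's).
    """
    if not counts:
        return []
    mid = median(counts.values())
    if mid == 0:
        # No meaningful relative deviation; flag any non-empty file.
        return [name for name, n in counts.items() if n != 0]
    return [
        name for name, n in counts.items()
        if abs(n - mid) / mid > tolerance
    ]
```

With the TP=2 counts from the validated configurations (364 and 364), nothing is flagged; a process that captured far fewer kernels than its peers would be.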

Validated configurations

Configuration	GPUs	Kernels captured	Status
TP=1 single process	1	12/12	Validated
TP=2 torchrun + RCCL	2	728 (364+364)	Validated
TP=8 torchrun + RCCL	8	20,648	Validated

Diagnostic counters

Each process prints diagnostic counters at shutdown:

=== rtl diagnostic (PID 336455) ===
  intercept calls:     3380
  signals injected:    2630
  drop (shutdown):     0
  drop (not kernel):   500
  drop (batch skip):   250
  drop (no qi):        0
  drop (sig pool):     0
  drop (ts fail):      0
  drop (ts invalid):   0
  recorded OK:         2630
====================================

Key indicators:

signals injected should equal recorded OK, meaning no profiled packets were dropped.
drop (batch skip) counts packets in batch submissions (count > 1) that were skipped, typically from CUDAGraph replay. This is expected.
drop (not kernel) counts non-kernel AQL packets (barriers, vendor-specific) that were not profiled. This is also expected.
drop (sig pool) > 0 means the signal pool was exhausted; increase SIGNAL_POOL_MAX.
drop (ts fail) > 0 indicates GPU timestamp read failures.
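
These checks can be automated once the shutdown output has been captured as text. The sketch below parses a counter block in the format shown above and applies the key indicators. The function names are hypothetical, and how you capture the output (for example, redirecting the process output to a log file) is left as an assumption.

```python
from typing import Dict, List

SAMPLE = """\
=== rtl diagnostic (PID 336455) ===
  intercept calls:     3380
  signals injected:    2630
  drop (shutdown):     0
  drop (not kernel):   500
  drop (batch skip):   250
  drop (no qi):        0
  drop (sig pool):     0
  drop (ts fail):      0
  drop (ts invalid):   0
  recorded OK:         2630
====================================
"""


def parse_counters(block: str) -> Dict[str, int]:
    """Parse 'name: value' lines into a dict; other lines are ignored."""
    counters: Dict[str, int] = {}
    for line in block.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def check_counters(counters: Dict[str, int]) -> List[str]:
    """Return a list of problems based on the key indicators above."""
    problems = []
    if counters.get("signals injected") != counters.get("recorded OK"):
        problems.append("signals injected does not match recorded OK")
    if counters.get("drop (sig pool)", 0) > 0:
        problems.append("signal pool exhausted; increase SIGNAL_POOL_MAX")
    if counters.get("drop (ts fail)", 0) > 0:
        problems.append("GPU timestamp read failures")
    return problems
```

Calling check_counters(parse_counters(SAMPLE)) returns an empty list for the healthy output shown above. Nonzero drop (batch skip) and drop (not kernel) values are deliberately not reported, since the text describes both as expected.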

CUDAGraph compatibility

When profiling CUDAGraph workloads (e.g., ATOM/vLLM with hipgraph), batch submissions from graph replay are automatically skipped in default and lite modes. The drop (batch skip) counter reflects these skipped packets.

For near-zero overhead with CUDAGraph workloads, use lite mode:

rtl trace --mode lite -o trace.db torchrun --nproc_per_node=8 my_model.py
# or
RTL_MODE=lite rtl trace -o trace.db torchrun --nproc_per_node=8 my_model.py

To profile graph replay kernels (requires ROCm 7.13+ with ROCR fix):

RTL_MODE=full rtl trace -o trace.db torchrun --nproc_per_node=8 my_model.py

For per-call diagnostics, run with RTL_DEBUG=1 and inspect the output.