# Multi-GPU and Distributed Profiling

rocm-trace-lite supports profiling across multiple GPUs and distributed processes out of the box.
## How it works

When profiling multi-process workloads (e.g., `torchrun`, `torch.distributed.launch`):

- **Per-process files:** each process writes to its own trace file using PID substitution (`trace_%p.db`)
- **Automatic merge:** after all processes exit, the per-process files are merged into a single output
- **GPU ID preservation:** each process's `gpuId` is preserved in the merged trace
```text
torchrun --nproc_per_node=8 model.py
├── Process 0 (PID 1234) → trace_1234.db (GPU 0)
├── Process 1 (PID 1235) → trace_1235.db (GPU 1)
├── ...
└── Process 7 (PID 1241) → trace_1241.db (GPU 7)
            ↓ automatic merge
trace.db (all 8 GPUs combined)
```
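The merge step above can be sketched in a few lines of Python. This is an illustrative sketch only, assuming a simplified `rocpd_op` table with a `gpuId` column (the real rocpd schema has more tables and columns, and rocm-trace-lite's own merge logic may differ):

```python
import sqlite3

def merge_traces(per_process_files, merged_path):
    """Copy rocpd_op rows from each per-process trace into one merged file.

    Hypothetical simplified schema: rocpd_op(gpuId, name, start, end).
    gpuId is copied verbatim, so per-GPU attribution survives the merge.
    """
    out = sqlite3.connect(merged_path)
    out.execute(
        "CREATE TABLE IF NOT EXISTS rocpd_op "
        "(gpuId INTEGER, name TEXT, start INTEGER, end INTEGER)")
    for path in per_process_files:
        src = sqlite3.connect(path)
        rows = src.execute(
            "SELECT gpuId, name, start, end FROM rocpd_op").fetchall()
        out.executemany("INSERT INTO rocpd_op VALUES (?, ?, ?, ?)", rows)
        src.close()
    out.commit()
    out.close()
```

Because each process owns its file until exit, the merge needs no cross-process locking; it runs once, after all writers are gone.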
## Usage

```shell
# TP=8 inference
rtl trace -o trace.db torchrun --nproc_per_node=8 my_model.py

# Check per-GPU distribution
sqlite3 trace.db "SELECT gpuId, count(*) FROM rocpd_op GROUP BY gpuId;"
```
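The same distribution check can be automated, e.g., to confirm that every rank contributed kernels to the merged trace. A minimal sketch, assuming the merged file exposes `rocpd_op` with a `gpuId` column as in the sqlite3 one-liner above:

```python
import sqlite3

def check_distribution(trace_path, expected_gpus):
    """Return (per-GPU kernel counts, GPUs missing from the trace).

    expected_gpus: iterable of GPU IDs that should appear, e.g. range(8)
    for an --nproc_per_node=8 run.
    """
    conn = sqlite3.connect(trace_path)
    counts = dict(conn.execute(
        "SELECT gpuId, COUNT(*) FROM rocpd_op GROUP BY gpuId"))
    conn.close()
    missing = sorted(set(expected_gpus) - set(counts))
    return counts, missing
```

An empty `missing` list means all expected GPUs are represented; a non-empty one usually points at a rank that crashed or was never profiled.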
## Diagnostic tool

For troubleshooting multi-process profiling, use the diagnostic script:

```shell
# Inspect per-process files before merge
python3 tests/diagnose_trace.py trace_*.db
```
This reports per-file kernel counts, GPU IDs, and flags asymmetry between processes.
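The bundled `tests/diagnose_trace.py` is not reproduced here, but an equivalent check can be sketched as follows, again assuming a simplified `rocpd_op` table with a `gpuId` column:

```python
import sqlite3

def inspect_traces(paths):
    """Report per-file kernel counts and GPU IDs; flag asymmetry.

    In a symmetric TP run each rank should record roughly the same
    number of kernels (cf. TP=2 capturing 364+364), so a large spread
    between files is worth investigating.
    """
    counts = {}
    for path in paths:
        conn = sqlite3.connect(path)
        kernels = conn.execute("SELECT COUNT(*) FROM rocpd_op").fetchone()[0]
        gpus = [r[0] for r in conn.execute("SELECT DISTINCT gpuId FROM rocpd_op")]
        conn.close()
        counts[path] = kernels
        print(f"{path}: {kernels} kernels on GPUs {gpus}")
    if counts and max(counts.values()) != min(counts.values()):
        print("WARNING: kernel counts differ between processes (asymmetry)")
    return counts
```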
## Validated configurations

| Configuration | GPUs | Kernels captured | Status |
|---|---|---|---|
| TP=1 single process | 1 | 12/12 | Validated |
| TP=2 torchrun + RCCL | 2 | 728 (364+364) | Validated |
| TP=8 torchrun + RCCL | 8 | 20,648 | Validated |
## Diagnostic counters

Each process prints diagnostic counters at shutdown:

```text
=== rtl diagnostic (PID 336455) ===
intercept calls: 3380
signals injected: 2630
drop (shutdown): 0
drop (not kernel): 500
drop (batch skip): 250
drop (no qi): 0
drop (sig pool): 0
drop (ts fail): 0
drop (ts invalid): 0
recorded OK: 2630
===================================
```
Key indicators:

- `signals injected` should match `recorded OK` (no drops)
- `drop (batch skip)` counts packets from batch submissions (`count > 1`) that were skipped, typically from CUDAGraph replay. This is expected.
- `drop (not kernel)` counts non-kernel AQL packets (barriers, vendor-specific) that were not profiled.
- `drop (sig pool) > 0` means signal pool exhaustion (increase `SIGNAL_POOL_MAX`)
- `drop (ts fail) > 0` indicates GPU timestamp read failures
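These checks are easy to automate in CI by parsing the shutdown block. A sketch, assuming the exact `name: value` layout printed above:

```python
import re

def parse_counters(text):
    """Parse 'name: value' lines from an rtl diagnostic block into a dict."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?):\s*(\d+)\s*$", line)
        if m:
            counters[m.group(1).strip()] = int(m.group(2))
    return counters

def healthy(counters):
    """Apply the key indicators: every injected signal recorded,
    no signal-pool exhaustion, no timestamp read failures."""
    return (counters["signals injected"] == counters["recorded OK"]
            and counters["drop (sig pool)"] == 0
            and counters["drop (ts fail)"] == 0)
```

Run against the sample block above, `healthy()` returns `True`: 2630 signals injected, 2630 recorded, and both failure counters at zero.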
## CUDAGraph compatibility

When profiling CUDAGraph workloads (e.g., ATOM/vLLM with hipgraph), batch submissions from graph replay are automatically skipped in the default and lite modes. The `drop (batch skip)` counter reflects these skipped packets.

For near-zero overhead with CUDAGraph workloads, use lite mode:

```shell
rtl trace --mode lite -o trace.db torchrun --nproc_per_node=8 my_model.py
# or
RTL_MODE=lite rtl trace -o trace.db torchrun --nproc_per_node=8 my_model.py
```

To profile graph replay kernels (requires ROCm 7.13+ with the ROCR fix):

```shell
RTL_MODE=full rtl trace -o trace.db torchrun --nproc_per_node=8 my_model.py
```

See `RTL_DEBUG=1` output for per-call diagnostics.