Self-contained GPU kernel profiler for ROCm. Zero roctracer/rocprofiler-sdk dependency.
A streamlined, lightweight GPU kernel profiler. Captures dispatch timestamps using only HSA runtime interception (HSA_TOOLS_LIB), writing to a standard SQLite .db file. No dependency on HIP, roctracer, or rocprofiler-sdk.
| Feature | rocm-trace-lite | RPD | rocprofiler-sdk | roctracer | Triton Proton |
|---|---|---|---|---|---|
| Dependencies | libhsa-runtime64 + libsqlite3 | + libroctracer64 | Full ROCm 6.0+ stack | libroctracer64 | libroctracer64 (AMD) |
| GPU kernel timing | HSA signal injection | roctracer activity | Buffered/callback tracing | Activity records | roctracer / CUPTI |
| HIP API tracing | Yes (RTL_MODE=hip) | Yes | Yes | Yes | — |
| HSA API tracing | — | — | Yes | Yes | — |
| roctx markers | Built-in shim | Via roctracer | Native | Yes (libroctx64) | Indirect |
| HW counters | — | — | Yes (AQLprofile) | — | NVIDIA only |
| Output format | SQLite (.db) + rocprofv3 JSON | SQLite (.rpd) | CSV / JSON / Perfetto / OTF2 | Raw callbacks | JSON / Chrome Trace |
| Perfetto visualization | rtl convert | rpd2tracing.py | Native PFTrace | — | Built-in |
| TraceLens compatible | Yes (--format rocprofv3) | No | Yes (native) | No | No |
| Zero profiler dependency | Yes | No | No | No | No |
| Status | Active | Active | Active (recommended) | Legacy (EoS 2026 Q2) | Active |
Download the latest .whl from GitHub Releases:
# Install the latest release
pip install rocm-trace-lite --find-links https://github.com/sunway513/rocm-trace-lite/releases/latest
# Or download and install manually
wget https://github.com/sunway513/rocm-trace-lite/releases/latest/download/rocm_trace_lite-<version>-py3-none-linux_x86_64.whl
pip install rocm_trace_lite-*.whl

After installation, the rtl CLI command is available. One command does everything (trace, summary, and Perfetto export):
rtl trace python3 my_model.py

# Build (requires ROCm headers)
make -j
# Install system-wide
make install # copies librtl.so to /usr/local/lib, scripts to /usr/local/bin

Requirements:
- ROCm (for HSA headers: hsa/hsa.h, hsa/hsa_api_trace.h)
- SQLite3 development headers (apt install libsqlite3-dev)
- g++ with C++17

rtl trace python3 my_model.py # lite mode (default)
rtl trace --mode standard python3 my_model.py # standard mode (~2-4% overhead)
rtl trace --mode hip python3 my_model.py # hip mode (HIP API + GPU timing)

| Mode | GPU timing | HIP API | Graph replay | Overhead | Use case |
|---|---|---|---|---|---|
| lite | Yes (partial) | No | Skipped | ~0% | Production / always-on (default) |
| standard | Yes | No | Skipped | ~2-4% | General profiling |
| hip | Yes | Yes | Skipped | <1% | CPU+GPU correlation |
| full | Yes (all) | No | Profiled | ~2-5% | Deep analysis (requires ROCm 7.13+) |
Set via CLI (--mode) or env var (RTL_MODE=lite).
lite skips packets that already have a completion signal (e.g., NCCL kernels, barriers), which keeps overhead near zero and is safe on ROCm <= 7.2. It is the default when --mode is not specified. standard profiles all count==1 dispatches, including those that already carry a signal. full profiles everything, including CUDAGraph replay batches, but requires ROCm 7.13+ to avoid a known ROCR heap overflow.
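
The mode rules described above can be summarized as a small decision function. This is only a Python sketch of the stated behavior, not the library's actual code: the function name, the `has_completion_signal` flag, and the `batch_count` parameter (standing in for the "count" of a dispatch) are hypothetical. hip mode is left out because the text does not spell out its packet rules.

```python
# Illustrative only: mirrors the prose description of lite / standard / full.
# All names here are hypothetical and do not come from rocm-trace-lite.
MODES = ("lite", "standard", "full")


def should_profile(mode: str, has_completion_signal: bool, batch_count: int = 1) -> bool:
    """Return True if a dispatch would be timed under the described mode rules."""
    if mode not in MODES:
        raise ValueError(f"unsupported mode for this sketch: {mode!r}")
    if mode == "full":
        # full profiles everything, including graph replay batches.
        return True
    if batch_count != 1:
        # lite and standard skip graph replay batches (count != 1).
        return False
    if mode == "standard":
        # standard profiles count==1 dispatches even if they already carry a signal.
        return True
    # lite skips packets that already have a completion signal.
    return not has_completion_signal
```

For example, `should_profile("lite", has_completion_signal=True)` is False for an NCCL-style packet that already carries a signal, while the same packet is timed in standard mode.
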
Sample output:
rtl: loading (HSA runtime v3)
rtl: lazy init, writing to trace_12345.db
rtl: found 1 GPU agent(s)
rtl: signal pool initialized (64 signals)
rtl: completion worker started
Trace: trace.db (728 GPU ops)
Kernel Calls Total(us) Avg(us) %
================================================================================================
Cijk_Ailk_Bljk_HHS_BH_MT128x128x128 240 28252.9 117.7 21.8
ncclDevKernel_Generic 160 29747.8 185.9 23.0
__amd_rocclr_fillBufferAligned.kd 7900 27929.8 3.5 21.6
GPU Utilization:
GPU 0: 0.13% (2630 ops, 17.2ms busy)
GPU 1: 0.11% (2430 ops, 15.0ms busy)
Output files:
trace.db
trace_summary.txt
trace.json.gz (1.2 MB → open in https://ui.perfetto.dev)
RTL traces can be analyzed with TraceLens for automated performance reports — kernel breakdown, GPU timeline, roofline metrics, and more.
# 1. Collect trace
rtl trace -o trace.db python3 my_model.py
# 2. Convert to rocprofv3 format
rtl convert trace.db --format rocprofv3 -o trace_results.json
# 3. Generate TraceLens report
pip install git+https://github.com/AMD-AGI/TraceLens.git
TraceLens_generate_perf_report_rocprof --profile_json_path trace_results.json --kernel_details

This produces an Excel workbook with GPU timeline breakdown, kernel summary by category (GEMM, Elementwise, Reduction, etc.), and per-dispatch details with grid/block dimensions. Validated on GPT-OSS 120B TP=8 (162K dispatches, 92 unique kernels). See issue #100 for sample output.
- HSA_TOOLS_LIB OnLoad: ROCm HSA runtime calls OnLoad() when the library is loaded, giving us the HSA API table
- Queue intercept: We replace hsa_queue_create to create interceptible queues via hsa_amd_queue_intercept_create, then register a callback on every AQL packet
- Kernel profiling: For each kernel dispatch packet, we insert a profiling signal, wait for completion, then read GPU timestamps via hsa_amd_profiling_get_dispatch_time
- Symbol resolution: We intercept hsa_executable_freeze to enumerate kernel symbols from code objects
- roctx shim: Provides roctxRangePushA / roctxRangePop / roctxMarkA / roctxRangeStartA / roctxRangeStop symbols so applications using roctx markers get captured without linking libroctx64

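
As a small illustration of the timestamp step: the start and end values read from hsa_amd_profiling_get_dispatch_time are tick counts, so a duration needs the clock frequency to become wall time. The Python sketch below assumes the ticks are in the HSA system clock domain, whose frequency HSA exposes as HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY. The function names and the 100 MHz example frequency are hypothetical, and whether rocm-trace-lite converts exactly this way is an assumption.

```python
# Hypothetical helpers: convert HSA tick counts to nanoseconds.
# Assumes ticks are in the HSA system clock domain with frequency freq_hz.
NS_PER_SECOND = 1_000_000_000


def ticks_to_ns(ticks: int, freq_hz: int) -> int:
    """Convert a tick count to nanoseconds using integer arithmetic."""
    if freq_hz <= 0:
        raise ValueError("freq_hz must be positive")
    return ticks * NS_PER_SECOND // freq_hz


def dispatch_duration_ns(start_ticks: int, end_ticks: int, freq_hz: int) -> int:
    """Duration of one dispatch, given its start/end tick stamps."""
    if end_ticks < start_ticks:
        raise ValueError("end_ticks precedes start_ticks")
    return ticks_to_ns(end_ticks - start_ticks, freq_hz)
```

With an assumed 100 MHz clock, each tick is 10 ns, so a dispatch spanning 250 ticks lasts 2500 ns.
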
Standard SQLite .db database. Query with any SQLite tool. Key tables:
- rocpd_op: GPU kernel dispatches with start/end timestamps, gpuId, queueId
- rocpd_string: Deduplicated string table (kernel names, op types)
- rocpd_metadata: Trace metadata (duration, host info)

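
Because the trace is plain SQLite, it can also be queried from Python's standard sqlite3 module. The sketch below builds a tiny in-memory database so that it runs without a GPU. The column layout (description_id pointing into rocpd_string, start/end in nanoseconds) is an assumption modeled on the RPD convention, not a confirmed schema; check the real layout with `.schema` in the sqlite3 shell before relying on it.

```python
import sqlite3

# ASSUMED minimal schema, modeled on the RPD layout; a real trace.db may differ.
SCHEMA = """
CREATE TABLE rocpd_string (id INTEGER PRIMARY KEY, string TEXT UNIQUE);
CREATE TABLE rocpd_op (
    id INTEGER PRIMARY KEY,
    gpuId INTEGER,
    queueId INTEGER,
    description_id INTEGER REFERENCES rocpd_string(id),
    "start" INTEGER,  -- assumed to be nanoseconds
    "end" INTEGER
);
"""

# Aggregate per kernel name: call count and total GPU time in microseconds.
TOP_KERNELS = """
SELECT s.string AS kernel,
       COUNT(*) AS calls,
       SUM(o."end" - o."start") / 1000.0 AS total_us
FROM rocpd_op AS o
JOIN rocpd_string AS s ON s.id = o.description_id
GROUP BY s.string
ORDER BY total_us DESC
LIMIT ?;
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up dispatches."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO rocpd_string (id, string) VALUES (?, ?)",
        [(1, "gemm_kernel"), (2, "fill_kernel")],
    )
    conn.executemany(
        'INSERT INTO rocpd_op (gpuId, queueId, description_id, "start", "end") '
        "VALUES (?, ?, ?, ?, ?)",
        [(0, 0, 1, 1000, 5000), (0, 0, 1, 6000, 9000), (0, 0, 2, 9500, 10000)],
    )
    return conn


def top_kernels(conn: sqlite3.Connection, limit: int = 10) -> list:
    """Return (kernel, calls, total_us) rows sorted by total GPU time."""
    return conn.execute(TOP_KERNELS, (limit,)).fetchall()
```

To try it against a real trace, replace `build_demo_db()` with `sqlite3.connect("trace.db")` and adjust the column names if the actual schema differs.
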
Built-in views:
-- Top kernels by total GPU time
SELECT * FROM top LIMIT 10;
-- GPU utilization per device
SELECT * FROM busy;

314 tests covering unit, E2E, multi-GPU, stress, and release validation.
# CPU-only tests (no GPU required)
make test-cpu
# GPU tests (requires ROCm GPU)
python3 -m pytest tests/ -v --timeout=180
# CI: CPU on every push, GPU on MI355X runners

This project was inspired by and builds upon the work of:
- Jeff Daily's ROCm Tracer for GPU (RTG) — pioneered the HSA_TOOLS_LIB interception approach for lightweight GPU kernel tracing
- Michael Wootton's rocmProfileData (RPD) — established the SQLite-based trace format and ecosystem tools that rocm-trace-lite is compatible with
MIT