rocm-trace-lite

Self-contained GPU kernel profiler for ROCm. Zero roctracer/rocprofiler-sdk dependency.

What it does

A lightweight GPU kernel profiler. It captures dispatch timestamps using only HSA runtime interception (HSA_TOOLS_LIB) and writes them to a standard SQLite .db file. It has no dependency on HIP, roctracer, or rocprofiler-sdk.

Comparison with other ROCm profiling tools

| Feature | rocm-trace-lite | RPD | rocprofiler-sdk | roctracer | Triton Proton |
|---|---|---|---|---|---|
| Dependencies | libhsa-runtime64 + libsqlite3 | + libroctracer64 | Full ROCm 6.0+ stack | libroctracer64 | libroctracer64 (AMD) |
| GPU kernel timing | HSA signal injection | roctracer activity | Buffered/callback tracing | Activity records | roctracer / CUPTI |
| HIP API tracing | Yes (RTL_MODE=hip) | Yes | Yes | Yes | |
| HSA API tracing | | | Yes | Yes | |
| roctx markers | Built-in shim | Via roctracer | Native | Yes (libroctx64) | Indirect |
| HW counters | | | Yes (AQLprofile) | | NVIDIA only |
| Output format | SQLite (.db) + rocprofv3 JSON | SQLite (.rpd) | CSV / JSON / Perfetto / OTF2 | Raw callbacks | JSON / Chrome Trace |
| Perfetto visualization | rtl convert | rpd2tracing.py | Native PFTrace | | Built-in |
| TraceLens compatible | Yes (--format rocprofv3) | No | Yes (native) | No | No |
| Zero profiler dependency | Yes | No | No | No | No |
| Status | Active | Active | Active (recommended) | Legacy (EoS 2026 Q2) | Active |

Installation

From wheel (recommended)

Download the latest .whl from GitHub Releases:

# Install the latest release
pip install rocm-trace-lite --find-links https://github.com/sunway513/rocm-trace-lite/releases/latest

# Or download and install manually
wget https://github.com/sunway513/rocm-trace-lite/releases/latest/download/rocm_trace_lite-<version>-py3-none-linux_x86_64.whl
pip install rocm_trace_lite-*.whl

After installation, the rtl CLI command is available. A single command collects the trace, prints a summary, and exports a Perfetto trace:

rtl trace python3 my_model.py

From source

# Build (requires ROCm headers)
make -j

# Install system-wide
make install    # copies librtl.so to /usr/local/lib, scripts to /usr/local/bin

Requirements:

  • ROCm (for HSA headers: hsa/hsa.h, hsa/hsa_api_trace.h)
  • SQLite3 development headers (apt install libsqlite3-dev)
  • g++ with C++17

Quick start

rtl trace python3 my_model.py                        # lite mode (default)
rtl trace --mode standard python3 my_model.py        # standard mode (~2-4% overhead)
rtl trace --mode hip python3 my_model.py             # hip mode (HIP API + GPU timing)

Profiling modes

| Mode | GPU timing | HIP API | Graph replay | Overhead | Use case |
|---|---|---|---|---|---|
| lite | Yes (partial) | No | Skipped | ~0% | Production / always-on (default) |
| standard | Yes | No | Skipped | ~2-4% | General profiling |
| hip | Yes | Yes | Skipped | <1% | CPU+GPU correlation |
| full | Yes (all) | No | Profiled | ~2-5% | Deep analysis (requires ROCm 7.13+) |

Set via CLI (--mode) or env var (RTL_MODE=lite).

  • lite (the default when --mode is not specified) skips packets that already have a completion signal (e.g., NCCL kernels, barriers). This keeps overhead near zero and is safe on ROCm <= 7.2.
  • standard profiles all count==1 dispatches, including those that already have a signal.
  • full profiles everything, including CUDAGraph replay batches, but requires ROCm 7.13+ to avoid a known ROCR heap overflow.
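When rtl is launched from a Python harness, the mode selection described above can be wrapped in a small helper. The following is a minimal sketch, not part of rocm-trace-lite: build_trace_command and build_trace_env are hypothetical names, and it assumes that --mode and -o can be combined on one rtl trace command line.

```python
import os
import shlex

# Mode names as listed in the profiling modes table.
RTL_MODES = ("lite", "standard", "hip", "full")


def build_trace_command(target, mode="lite", output=None):
    """Return the argv list for `rtl trace` wrapping `target` (a list of args).

    Hypothetical helper: it only assembles flags that appear in this README.
    """
    if mode not in RTL_MODES:
        raise ValueError(f"unknown RTL mode {mode!r}; expected one of {RTL_MODES}")
    cmd = ["rtl", "trace", "--mode", mode]
    if output is not None:
        cmd += ["-o", output]  # assumption: -o may be combined with --mode
    return cmd + list(target)


def build_trace_env(mode, base_env=None):
    """Return a copy of the environment with RTL_MODE set (env var alternative)."""
    if mode not in RTL_MODES:
        raise ValueError(f"unknown RTL mode {mode!r}; expected one of {RTL_MODES}")
    env = dict(os.environ if base_env is None else base_env)
    env["RTL_MODE"] = mode
    return env


if __name__ == "__main__":
    cmd = build_trace_command(["python3", "my_model.py"],
                              mode="standard", output="trace.db")
    print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run; rejecting unknown mode names in the helper avoids launching a long job with a mistyped mode.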

Sample output:

rtl: loading (HSA runtime v3)
rtl: lazy init, writing to trace_12345.db
rtl: found 1 GPU agent(s)
rtl: signal pool initialized (64 signals)
rtl: completion worker started

Trace: trace.db (728 GPU ops)

Kernel                                                       Calls  Total(us)  Avg(us)      %
================================================================================================
Cijk_Ailk_Bljk_HHS_BH_MT128x128x128                           240    28252.9    117.7   21.8
ncclDevKernel_Generic                                          160    29747.8    185.9   23.0
__amd_rocclr_fillBufferAligned.kd                             7900    27929.8      3.5   21.6

GPU Utilization:
  GPU 0: 0.13% (2630 ops, 17.2ms busy)
  GPU 1: 0.11% (2430 ops, 15.0ms busy)

Output files:
  trace.db
  trace_summary.txt
  trace.json.gz (1.2 MB → open in https://ui.perfetto.dev)
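The utilization figures in the sample relate busy time to trace duration: 17.2 ms busy at 0.13% implies a trace of roughly 13 s. The sketch below shows one way such a percentage could be computed from kernel start/end timestamps. It is an illustration only; whether rtl merges overlapping kernels or simply sums their durations is not stated here, and the function names are hypothetical.

```python
def merged_busy_time(intervals):
    """Total time covered by at least one (start, end) interval.

    Overlapping intervals are merged so concurrent kernels are not
    counted twice. The result is in the same unit as the inputs.
    """
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close out the previous merged span.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps (or touches) the current span: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy


def utilization_percent(intervals, wall_time):
    """Busy time as a percentage of wall_time (same unit as the intervals)."""
    if wall_time <= 0:
        raise ValueError("wall_time must be positive")
    return 100.0 * merged_busy_time(intervals) / wall_time


if __name__ == "__main__":
    # Illustrative data: two overlapping kernels and one isolated kernel.
    demo = [(0, 10), (5, 15), (20, 30)]
    print(utilization_percent(demo, wall_time=100))
```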

TraceLens integration

RTL traces can be analyzed with TraceLens for automated performance reports — kernel breakdown, GPU timeline, roofline metrics, and more.

# 1. Collect trace
rtl trace -o trace.db python3 my_model.py

# 2. Convert to rocprofv3 format
rtl convert trace.db --format rocprofv3 -o trace_results.json

# 3. Generate TraceLens report
pip install git+https://github.com/AMD-AGI/TraceLens.git
TraceLens_generate_perf_report_rocprof --profile_json_path trace_results.json --kernel_details

This produces an Excel workbook with GPU timeline breakdown, kernel summary by category (GEMM, Elementwise, Reduction, etc.), and per-dispatch details with grid/block dimensions. Validated on GPT-OSS 120B TP=8 (162K dispatches, 92 unique kernels). See issue #100 for sample output.

How it works

  1. HSA_TOOLS_LIB OnLoad — ROCm HSA runtime calls OnLoad() when the library is loaded, giving us the HSA API table
  2. Queue intercept — We replace hsa_queue_create to create interceptible queues via hsa_amd_queue_intercept_create, then register a callback on every AQL packet
  3. Kernel profiling — For each kernel dispatch packet, we insert a profiling signal, wait for completion, then read GPU timestamps via hsa_amd_profiling_get_dispatch_time
  4. Symbol resolution — We intercept hsa_executable_freeze to enumerate kernel symbols from code objects
  5. roctx shim — Provides roctxRangePushA/roctxRangePop/roctxMarkA/roctxRangeStartA/roctxRangeStop symbols so applications using roctx markers get captured without linking libroctx64
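For step 3, the HSA runtime reports dispatch start and end times as timestamp ticks, and the tick rate is available from hsa_system_get_info with HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY. The sketch below shows the arithmetic for turning such a tick pair into a duration in nanoseconds. It illustrates the unit conversion only and is not taken from the rtl source; the function names and the 100 MHz frequency in the example are illustrative assumptions.

```python
NS_PER_SECOND = 1_000_000_000


def ticks_to_ns(ticks, timestamp_hz):
    """Convert a tick count to nanoseconds using integer arithmetic."""
    if timestamp_hz <= 0:
        raise ValueError("timestamp_hz must be positive")
    return ticks * NS_PER_SECOND // timestamp_hz


def dispatch_duration_ns(start_ticks, end_ticks, timestamp_hz):
    """Duration of one dispatch, given its start/end ticks and the tick rate."""
    if end_ticks < start_ticks:
        raise ValueError("end_ticks precedes start_ticks")
    return ticks_to_ns(end_ticks - start_ticks, timestamp_hz)


if __name__ == "__main__":
    # Illustrative values only: a 100 MHz clock and an 11,770-tick kernel.
    print(dispatch_duration_ns(1_000, 12_770, 100_000_000), "ns")
```

Multiplying before dividing keeps the conversion exact for integer inputs; Python integers do not overflow, so no widening is needed.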

Output format

Standard SQLite .db database. Query with any SQLite tool. Key tables:

  • rocpd_op — GPU kernel dispatches with start/end timestamps, gpuId, queueId
  • rocpd_string — Deduplicated string table (kernel names, op types)
  • rocpd_metadata — Trace metadata (duration, host info)

Built-in views:

-- Top kernels by total GPU time
SELECT * FROM top LIMIT 10;

-- GPU utilization per device
SELECT * FROM busy;
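The same tables can be queried from Python with the standard sqlite3 module. The sketch below aggregates GPU time per kernel from rocpd_op and rocpd_string. The column names (start, "end", description_id, string) and the nanosecond timestamp unit are assumptions based on the RPD-style schema, and the demo database is a hypothetical stand-in, so check the schema of a real trace before relying on it.

```python
import sqlite3

# Assumed RPD-style columns: rocpd_op.start / "end" in nanoseconds and
# rocpd_op.description_id referencing rocpd_string.id.
TOP_KERNELS_SQL = """
SELECT s.string AS kernel,
       COUNT(*) AS calls,
       SUM(o."end" - o.start) / 1000.0 AS total_us
FROM rocpd_op AS o
JOIN rocpd_string AS s ON s.id = o.description_id
GROUP BY s.string
ORDER BY total_us DESC
LIMIT ?
"""


def top_kernels(conn, limit=10):
    """Return (kernel, calls, total_us) rows sorted by total GPU time."""
    return conn.execute(TOP_KERNELS_SQL, (limit,)).fetchall()


def make_demo_db():
    """Build a tiny in-memory database with the assumed schema (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE rocpd_string (id INTEGER PRIMARY KEY, string TEXT);
        CREATE TABLE rocpd_op (id INTEGER PRIMARY KEY, gpuId INTEGER,
                               queueId INTEGER, start INTEGER, "end" INTEGER,
                               description_id INTEGER);
        INSERT INTO rocpd_string VALUES (1, 'gemm_kernel'), (2, 'fill_kernel');
        INSERT INTO rocpd_op VALUES (1, 0, 0, 0, 100000, 1),
                                    (2, 0, 0, 200000, 350000, 1),
                                    (3, 0, 0, 400000, 403000, 2);
    """)
    return conn


if __name__ == "__main__":
    for kernel, calls, total_us in top_kernels(make_demo_db()):
        print(f"{kernel:20s} {calls:5d} {total_us:10.1f}")
```

To run it against a collected trace, replace make_demo_db() with sqlite3.connect("trace.db").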

Tests

314 tests covering unit, E2E, multi-GPU, stress, and release validation.

# CPU-only tests (no GPU required)
make test-cpu

# GPU tests (requires ROCm GPU)
python3 -m pytest tests/ -v --timeout=180

# CI: CPU on every push, GPU on MI355X runners

Acknowledgments

This project was inspired by and builds upon the work of:

  • Jeff Daily's ROCm Tracer for GPU (RTG) — pioneered the HSA_TOOLS_LIB interception approach for lightweight GPU kernel tracing
  • Michael Wootton's rocmProfileData (RPD) — established the SQLite-based trace format and ecosystem tools that rocm-trace-lite is compatible with

License

MIT
