rocm-trace-lite

Self-contained GPU kernel profiler for ROCm. Zero roctracer/rocprofiler-sdk dependency.

What it does

A lightweight GPU kernel profiler. It captures kernel dispatch timestamps using only HSA runtime interception (HSA_TOOLS_LIB) and writes them to a standard SQLite .db file. No dependency on HIP, roctracer, or rocprofiler-sdk.

Comparison with other ROCm profiling tools

Feature	rocm-trace-lite	RPD	rocprofiler-sdk	roctracer	Triton Proton
Dependencies	libhsa-runtime64 + libsqlite3	+ libroctracer64	Full ROCm 6.0+ stack	libroctracer64	libroctracer64 (AMD)
GPU kernel timing	HSA signal injection	roctracer activity	Buffered/callback tracing	Activity records	roctracer / CUPTI
HIP API tracing	Yes (`RTL_MODE=hip`)	Yes	Yes	Yes	—
HSA API tracing	—	—	Yes	Yes	—
roctx markers	Built-in shim	Via roctracer	Native	Yes (libroctx64)	Indirect
HW counters	—	—	Yes (AQLprofile)	—	NVIDIA only
Output format	SQLite (.db) + rocprofv3 JSON	SQLite (.rpd)	CSV / JSON / Perfetto / OTF2	Raw callbacks	JSON / Chrome Trace
Perfetto visualization	`rtl convert`	rpd2tracing.py	Native PFTrace	—	Built-in
TraceLens compatible	Yes (`--format rocprofv3`)	No	Yes (native)	No	No
Zero profiler dependency	Yes	No	No	No	No
Status	Active	Active	Active (recommended)	Legacy (EoS 2026 Q2)	Active

Installation

From wheel (recommended)

Download the latest .whl from GitHub Releases:

# Install the latest release
pip install rocm-trace-lite --find-links https://github.com/sunway513/rocm-trace-lite/releases/latest

# Or download and install manually
wget https://github.com/sunway513/rocm-trace-lite/releases/latest/download/rocm_trace_lite-<version>-py3-none-linux_x86_64.whl
pip install rocm_trace_lite-*.whl

After installation, the rtl CLI command is available. One command does everything — trace, summary, and Perfetto export:

rtl trace python3 my_model.py

From source

# Build (requires ROCm headers)
make -j

# Install system-wide
make install    # copies librtl.so to /usr/local/lib, scripts to /usr/local/bin

Requirements:

ROCm (for HSA headers: hsa/hsa.h, hsa/hsa_api_trace.h)
SQLite3 development headers (apt install libsqlite3-dev)
g++ with C++17

Quick start

rtl trace python3 my_model.py                        # lite mode (default)
rtl trace --mode standard python3 my_model.py        # standard mode (~2-4% overhead)
rtl trace --mode hip python3 my_model.py             # hip mode (HIP API + GPU timing)

Profiling modes

Mode	GPU timing	HIP API	Graph replay	Overhead	Use case
lite	Yes (partial)	No	Skipped	~0%	Production / always-on (default)
standard	Yes	No	Skipped	~2-4%	General profiling
hip	Yes	Yes	Skipped	<1%	CPU+GPU correlation
full	Yes (all)	No	Profiled	~2-5%	Deep analysis (requires ROCm 7.13+)

Set via CLI (--mode) or env var (RTL_MODE=lite).
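
When wrapping the command with rtl trace is inconvenient, the environment route can be scripted. The sketch below builds a child-process environment with RTL_MODE and HSA_TOOLS_LIB set. It assumes librtl.so sits at the `make install` destination and that setting HSA_TOOLS_LIB by hand is equivalent to what rtl trace does. `build_rtl_env` and `run_profiled` are hypothetical helpers, not part of the package.

```python
import os
import subprocess

# Modes listed in the profiling modes table.
VALID_MODES = ("lite", "standard", "hip", "full")


def build_rtl_env(mode="lite", lib_path="/usr/local/lib/librtl.so", base_env=None):
    """Return a copy of the environment with the profiler variables set.

    lib_path is an assumption based on the `make install` destination.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown RTL mode: {mode!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_TOOLS_LIB"] = lib_path  # the HSA runtime loads this library when it initializes
    env["RTL_MODE"] = mode
    return env


def run_profiled(argv, mode="lite"):
    """Run a command such as ["python3", "my_model.py"] with the profiler loaded."""
    return subprocess.run(argv, env=build_rtl_env(mode), check=True)
```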

lite skips packets that already carry a completion signal (e.g., NCCL kernels, barriers). This keeps overhead near zero and is safe on ROCm <= 7.2. It is the default when --mode is not specified. standard profiles all count==1 dispatches, including those that already carry a signal. full profiles everything, including CUDAGraph replay batches, but requires ROCm 7.13+ to avoid a known ROCR heap overflow.
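
The filtering rules for lite, standard, and full can be restated as a small decision function. This is a simplified model of the description here, not the profiler's actual code. It assumes that a packet count greater than 1 corresponds to a graph replay batch. hip is left out because the text does not describe its packet filter.

```python
def should_profile(mode, has_completion_signal, packet_count):
    """Simplified model of which dispatches each mode times.

    has_completion_signal: the packet already carries a completion signal
    packet_count: number of packets submitted together (assumed >1 for graph replay)
    """
    if mode == "lite":
        # lite leaves packets that already have a completion signal untouched
        return packet_count == 1 and not has_completion_signal
    if mode == "standard":
        # standard times every single-packet dispatch, with or without a signal
        return packet_count == 1
    if mode == "full":
        # full times everything, including replay batches
        return True
    raise ValueError(f"mode {mode!r} is not covered by this sketch")
```

Under this model an NCCL kernel with its own completion signal is skipped in lite but timed in standard, and a replay batch is timed only in full.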

Sample output:

rtl: loading (HSA runtime v3)
rtl: lazy init, writing to trace_12345.db
rtl: found 1 GPU agent(s)
rtl: signal pool initialized (64 signals)
rtl: completion worker started

Trace: trace.db (728 GPU ops)

Kernel                                                       Calls  Total(us)  Avg(us)      %
================================================================================================
Cijk_Ailk_Bljk_HHS_BH_MT128x128x128                           240    28252.9    117.7   21.8
ncclDevKernel_Generic                                          160    29747.8    185.9   23.0
__amd_rocclr_fillBufferAligned.kd                             7900    27929.8      3.5   21.6

GPU Utilization:
  GPU 0: 0.13% (2630 ops, 17.2ms busy)
  GPU 1: 0.11% (2430 ops, 15.0ms busy)

Output files:
  trace.db
  trace_summary.txt
  trace.json.gz (1.2 MB → open in https://ui.perfetto.dev)
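
The utilization percentages can be sanity-checked by hand. A minimal sketch, assuming utilization is the summed kernel time per GPU divided by the wall-clock span of the whole trace (whether rocm-trace-lite uses exactly this definition is an assumption, and `gpu_utilization` is a hypothetical helper):

```python
def gpu_utilization(ops):
    """ops: iterable of (gpu_id, start_ns, end_ns). Returns {gpu_id: percent busy}."""
    ops = list(ops)
    if not ops:
        return {}
    # Wall-clock span from the first start to the last end across all GPUs.
    wall = max(end for _, _, end in ops) - min(start for _, start, _ in ops)
    busy = {}
    for gpu_id, start, end in ops:
        busy[gpu_id] = busy.get(gpu_id, 0) + (end - start)
    if wall <= 0:
        return {gpu: 0.0 for gpu in busy}
    return {gpu: 100.0 * total / wall for gpu, total in busy.items()}
```

Under that reading, 17.2 ms busy at 0.13% on GPU 0 implies a trace span of roughly 13 s.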

TraceLens integration

RTL traces can be analyzed with TraceLens for automated performance reports — kernel breakdown, GPU timeline, roofline metrics, and more.

# 1. Collect trace
rtl trace -o trace.db python3 my_model.py

# 2. Convert to rocprofv3 format
rtl convert trace.db --format rocprofv3 -o trace_results.json

# 3. Generate TraceLens report
pip install git+https://github.com/AMD-AGI/TraceLens.git
TraceLens_generate_perf_report_rocprof --profile_json_path trace_results.json --kernel_details

This produces an Excel workbook with GPU timeline breakdown, kernel summary by category (GEMM, Elementwise, Reduction, etc.), and per-dispatch details with grid/block dimensions. Validated on GPT-OSS 120B TP=8 (162K dispatches, 92 unique kernels). See issue #100 for sample output.

How it works

HSA_TOOLS_LIB OnLoad — ROCm HSA runtime calls OnLoad() when the library is loaded, giving us the HSA API table
Queue intercept — We replace hsa_queue_create to create interceptible queues via hsa_amd_queue_intercept_create, then register a callback on every AQL packet
Kernel profiling — For each kernel dispatch packet, we insert a profiling signal, wait for completion, then read GPU timestamps via hsa_amd_profiling_get_dispatch_time
Symbol resolution — We intercept hsa_executable_freeze to enumerate kernel symbols from code objects
roctx shim — Provides roctxRangePushA/roctxRangePop/roctxMarkA/roctxRangeStartA/roctxRangeStop symbols so applications using roctx markers get captured without linking libroctx64
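
For the roctx shim, a Python application could mark regions through ctypes. The sketch below wraps roctxRangePushA/roctxRangePop in a context manager. The signatures follow the standard roctx API. Loading the symbols from /usr/local/lib/librtl.so, and that doing so attaches to the active trace, are assumptions. `load_roctx` and `roctx_range` are hypothetical helpers.

```python
import contextlib
import ctypes


def load_roctx(lib_path="/usr/local/lib/librtl.so"):
    """Load a library exporting the roctx symbols and declare their signatures."""
    lib = ctypes.CDLL(lib_path)  # path is an assumption (make install destination)
    lib.roctxRangePushA.argtypes = [ctypes.c_char_p]
    lib.roctxRangePushA.restype = ctypes.c_int
    lib.roctxRangePop.argtypes = []
    lib.roctxRangePop.restype = ctypes.c_int
    return lib


@contextlib.contextmanager
def roctx_range(lib, name):
    """Wrap a region of code in a roctx push/pop pair."""
    lib.roctxRangePushA(name.encode("utf-8"))
    try:
        yield
    finally:
        lib.roctxRangePop()  # always pop, even if the body raises
```

Usage would look like `with roctx_range(load_roctx(), "forward"): model(x)`, run under rtl trace.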

Output format

Standard SQLite .db database. Query with any SQLite tool. Key tables:

rocpd_op — GPU kernel dispatches with start/end timestamps, gpuId, queueId
rocpd_string — Deduplicated string table (kernel names, op types)
rocpd_metadata — Trace metadata (duration, host info)

Built-in views:

-- Top kernels by total GPU time
SELECT * FROM top LIMIT 10;

-- GPU utilization per device
SELECT * FROM busy;
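
The same views can be read from Python with the standard sqlite3 module. A minimal sketch: `query_view` is a hypothetical helper, and it takes column names from the cursor rather than assuming the view's schema.

```python
import sqlite3


def query_view(db_path, view="top", limit=10):
    """Return (column_names, rows) from one of the built-in views."""
    if view not in ("top", "busy"):
        # Only the two views named in this README are accepted.
        raise ValueError(f"unknown view: {view!r}")
    con = sqlite3.connect(db_path)
    try:
        cur = con.execute(f"SELECT * FROM {view} LIMIT ?", (limit,))
        columns = [d[0] for d in cur.description]
        return columns, cur.fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    cols, rows = query_view("trace.db", "top", 10)
    print(cols)
    for row in rows:
        print(row)
```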

Tests

314 tests covering unit, E2E, multi-GPU, stress, and release validation.

# CPU-only tests (no GPU required)
make test-cpu

# GPU tests (requires ROCm GPU)
python3 -m pytest tests/ -v --timeout=180

# CI: CPU on every push, GPU on MI355X runners

Acknowledgments

This project was inspired by and builds upon the work of:

Jeff Daily's ROCm Tracer for GPU (RTG) — pioneered the HSA_TOOLS_LIB interception approach for lightweight GPU kernel tracing
Michael Wootton's rocmProfileData (RPD) — established the SQLite-based trace format and ecosystem tools that rocm-trace-lite is compatible with

License

MIT
