Summary
RTL currently captures GPU kernel execution timing via HSA queue interception. This RFC proposes adding HIP runtime API tracing through LD_PRELOAD-based function interposition, enabling RTL to capture both CPU-side API calls (hipLaunchKernel, hipMemcpy, hipMalloc, hipStreamSynchronize, hipGraphLaunch) and GPU-side kernel timing in a single trace.
The design uses dlsym(RTLD_NEXT) to forward intercepted calls to the real HIP implementation after recording timestamps and arguments. This approach requires no special ROCm builds, no external profiling libraries, and does not interfere with CUDAGraph replay. Combined with correlation ID linking between API calls and kernel dispatches, this enables full host-device event correlation without depending on roctracer or rocprofiler-sdk.
The HIP interposition layer is compiled into the existing librtl.so and activated via RTL_MODE=hip. A re-entrancy guard (thread-local boolean) prevents recursive recording during HIP runtime initialization.
Motivation
- Users need to correlate CPU-side dispatch overhead with GPU kernel execution time
- roctracer drops 54% of kernel events on ROCm 7.2.x during decode-phase workloads
- rocprofiler-sdk suffers from startup overhead, segfaults, and added overhead on hipGraphLaunch
- RTL's HSA-only mode captures GPU timing but has no visibility into HIP API calls
- Future Kineto integration requires both HIP API and GPU kernel data for full torch.profiler compatibility
Design
Architecture
Application
│
├── LD_PRELOAD=librtl.so (HIP API wrappers)
│     hipLaunchKernel → record_hip_api() → forward via dlsym(RTLD_NEXT)
│     hipMemcpy       → record_hip_api() → forward
│     hipGraphLaunch  → record_hip_api() → forward
│
└── HSA_TOOLS_LIB=librtl.so (GPU kernel interception)
      queue intercept → signal injection → record_kernel()

Both paths write to the same TraceDB:

rocpd_api (HIP API calls) ←─── correlation_id ───→ rocpd_op (GPU kernels)
                               via rocpd_api_ops
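The forwarding step in the diagram can be sketched in isolation. The example below is a minimal illustration, not RTL's code: `resolve_next` and `forwarded_strlen` are hypothetical names, and libc's `strlen` stands in for a HIP entry point so the sketch runs without ROCm installed. It assumes Linux with glibc; older glibc versions may need `-ldl` at link time.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          // RTLD_NEXT is a GNU extension in <dlfcn.h>
#endif
#include <dlfcn.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper: find the next definition of `name` in library search
// order, skipping the object that contains the caller.
template <typename Fn>
Fn resolve_next(const char* name) {
    void* sym = dlsym(RTLD_NEXT, name);
    return reinterpret_cast<Fn>(sym);   // nullptr if no later definition exists
}

// Stand-in for a HIP function: libc's strlen plays the role of the real API.
using strlen_fn = std::size_t (*)(const char*);

std::size_t forwarded_strlen(const char* s) {
    // Resolve once; function-local statics are initialised thread-safely.
    static strlen_fn real = resolve_next<strlen_fn>("strlen");
    // A tracing wrapper would record timestamps and arguments here.
    return real ? real(s) : 0;          // forward to the real implementation
}
```

Because the symbol is looked up at run time, the wrapper library needs no link-time reference to the library that defines it.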
Re-entrancy Guard
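static thread_local bool tls_in_hip_api = false;  // true while this thread is inside a wrapper

extern "C" hipError_t hipMemcpy(void* dst, const void* src, size_t sizeBytes, hipMemcpyKind kind) {
    // Nested call (e.g. during HIP runtime initialization) or tracing disabled: forward untraced.
    if (tls_in_hip_api || !g_hip_api_enabled) return real_hipMemcpy(dst, src, sizeBytes, kind);
    tls_in_hip_api = true;
    // record timestamps and arguments, then forward
    hipError_t ret = real_hipMemcpy(dst, src, sizeBytes, kind);
    tls_in_hip_api = false;
    return ret;
}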
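One way to keep the flag from being left set on an early return or exception is to wrap it in an RAII guard. The sketch below is an assumption about how that could look, not the RFC's implementation: `ReentrancyGuard`, `traced_outer`, `traced_inner`, and the `g_recorded` log are hypothetical, and plain integer functions stand in for HIP calls so the nesting behaviour can be checked without a GPU.

```cpp
#include <cassert>
#include <string>
#include <vector>

static thread_local bool tls_in_hip_api = false;
static bool g_hip_api_enabled = true;
static thread_local std::vector<std::string> g_recorded;  // stand-in for TraceDB

// Sets the thread-local flag for the lifetime of the object.
class ReentrancyGuard {
public:
    ReentrancyGuard() { tls_in_hip_api = true; }
    ~ReentrancyGuard() { tls_in_hip_api = false; }
    ReentrancyGuard(const ReentrancyGuard&) = delete;
    ReentrancyGuard& operator=(const ReentrancyGuard&) = delete;
};

int traced_inner(int x);  // wrapper, defined below

int real_inner(int x) { return x + 1; }
// The "runtime" calls back into a wrapped API, as HIP may do during init.
int real_outer(int x) { return traced_inner(x) * 2; }

int traced_inner(int x) {
    if (tls_in_hip_api || !g_hip_api_enabled) return real_inner(x);
    ReentrancyGuard guard;
    g_recorded.push_back("inner");
    return real_inner(x);
}

int traced_outer(int x) {
    if (tls_in_hip_api || !g_hip_api_enabled) return real_outer(x);
    ReentrancyGuard guard;
    g_recorded.push_back("outer");
    return real_outer(x);  // nested traced_inner() sees the flag and skips recording
}
```

Calling `traced_outer` records a single "outer" event even though it reaches `traced_inner` internally; a direct call to `traced_inner` afterwards is recorded as usual.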
Correlation ID Flow
- HIP wrapper assigns correlation_id = next_correlation_id()
- Records API call with this ID in rocpd_api
- Pushes {queue_handle → correlation_id} into concurrent map
- HSA completion worker pops from map when recording the kernel dispatch
- Writes (api_id, op_id) into rocpd_api_ops join table
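The hand-off between the HIP wrapper and the HSA completion worker can be sketched as a mutex-protected map. This is an illustration under stated assumptions, not the RFC's data structure: `CorrelationMap` is a hypothetical class, it keeps a FIFO per queue handle on the assumption that dispatches on one queue are recorded in submission order, and how the wrapper learns the queue handle is not specified in the source.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>
#include <unordered_map>

static std::atomic<uint64_t> g_next_id{1};

// Monotonic ID generator, safe to call from any thread.
uint64_t next_correlation_id() {
    return g_next_id.fetch_add(1, std::memory_order_relaxed);
}

class CorrelationMap {
public:
    // Called by the HIP wrapper before forwarding a launch.
    void push(uint64_t queue_handle, uint64_t correlation_id) {
        std::lock_guard<std::mutex> lock(mu_);
        pending_[queue_handle].push_back(correlation_id);
    }

    // Called by the HSA completion worker; returns false if nothing is pending.
    bool pop(uint64_t queue_handle, uint64_t& out) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = pending_.find(queue_handle);
        if (it == pending_.end() || it->second.empty()) return false;
        out = it->second.front();   // oldest pending ID for this queue
        it->second.pop_front();
        return true;
    }

private:
    std::mutex mu_;
    std::unordered_map<uint64_t, std::deque<uint64_t>> pending_;
};
```

A failed `pop` corresponds to a kernel dispatch with no matching API record, in which case no row would be written to rocpd_api_ops.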
Functions to Intercept (Phase 1)
| Function | Category | Priority |
|---|---|---|
| hipModuleLaunchKernel | Kernel launch | P0 |
| hipExtModuleLaunchKernel | Kernel launch (ATOM/Triton) | P0 |
| hipMemcpy | Sync memory copy | P0 |
| hipMemcpyAsync | Async memory copy | P0 |
| hipMalloc / hipFree | Memory allocation | P1 |
| hipStreamSynchronize | Stream sync | P1 |
| hipGraphLaunch | Graph replay | P1 |
| hipLaunchKernelGGL | Kernel launch (legacy) | P2 |
RTL_MODE Integration
| Mode | HSA intercept | HIP API intercept | Use case |
|---|---|---|---|
| lite | Yes (skip graph) | No | Production, ~0% overhead |
| standard | Yes (all) | No | GPU-only profiling |
| hip | Yes (all) | Yes | Full CPU+GPU correlation |
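The mode table maps directly onto a small configuration struct. The sketch below is one possible encoding, with hypothetical names (`ModeConfig`, `parse_rtl_mode`); the source does not say what happens when RTL_MODE is unset or unrecognised, so the function reports that case to the caller instead of picking a default.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

struct ModeConfig {
    bool hsa_intercept;      // GPU kernel interception via HSA
    bool skip_graph;         // "lite" skips graph dispatches
    bool hip_api_intercept;  // HIP API wrappers record calls
};

// Returns true and fills `out` for a mode listed in the table; false otherwise.
bool parse_rtl_mode(const char* value, ModeConfig& out) {
    if (value == nullptr) return false;
    if (std::strcmp(value, "lite") == 0)     { out = {true, true,  false}; return true; }
    if (std::strcmp(value, "standard") == 0) { out = {true, false, false}; return true; }
    if (std::strcmp(value, "hip") == 0)      { out = {true, false, true};  return true; }
    return false;
}
```

At library load time this would be called as `parse_rtl_mode(std::getenv("RTL_MODE"), cfg)`, with the result feeding a flag such as `g_hip_api_enabled`.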
Zero New Dependencies
- No link against libamdhip64.so (uses dlsym only)
- No link against roctracer or rocprofiler-sdk
- Only compile-time dependency: HIP headers (hip_runtime_api.h)
- ldd librtl.so output unchanged
Validation Plan
24 tests across 5 test files:
- 5 CPU unit tests (schema, re-entrancy, correlation IDs)
- 6 GPU integration tests (capture, timing, pid/tid, mode gating)
- 4 correlation tests (API↔kernel linking, timing order)
- 5 E2E tests (PyTorch, CUDAGraph, Perfetto output, roctracer parity)
- 2 overhead tests (<10% target for hip mode)
- 2 regression guards (no roctracer/libamdhip64 in ldd)
Files Changed
| File | Action |
|---|---|
| src/hip_api_intercept.cpp | NEW |
| src/hip_api_intercept.h | NEW |
| src/hip_intercept.cpp | DELETE (placeholder) |
| src/hsa_intercept.cpp | MODIFY |
| src/trace_db.h | MODIFY |
| src/trace_db.cpp | MODIFY |
| Makefile | MODIFY |
| rocm_trace_lite/cmd_trace.py | MODIFY |
| 5 new test files | NEW |