Skip to content

RFC-003: HIP Runtime API Interception for Unified CPU+GPU Profiling #93

@sunway513

Description

@sunway513

Summary

RTL currently captures GPU kernel execution timing via HSA queue interception. This RFC proposes adding HIP runtime API tracing through LD_PRELOAD-based function interposition, enabling RTL to capture both CPU-side API calls (hipLaunchKernel, hipMemcpy, hipMalloc, hipStreamSynchronize, hipGraphLaunch) and GPU-side kernel timing in a single trace.

The design uses dlsym(RTLD_NEXT) to forward intercepted calls to the real HIP implementation after recording timestamps and arguments. This approach requires no special ROCm builds, no external profiling libraries, and does not interfere with CUDAGraph replay. Combined with correlation ID linking between API calls and kernel dispatches, this enables full host-device event correlation without depending on roctracer or rocprofiler-sdk.

The HIP interposition layer is compiled into the existing librtl.so and activated via RTL_MODE=hip. A re-entrancy guard (thread-local boolean) prevents recursive recording during HIP runtime initialization.

Motivation

  1. Users need to correlate CPU-side dispatch overhead with GPU kernel execution time
  2. roctracer drops 54% of kernel events on ROCm 7.2.x during decode-phase workloads
  3. rocprofiler-sdk has startup overhead, segfault, and hipGraphLaunch overhead issues
  4. RTL's HSA-only mode captures GPU timing but has no visibility into HIP API calls
  5. Future Kineto integration requires both HIP API and GPU kernel data for full torch.profiler compatibility

Design

Architecture

Application
    │
    ├── LD_PRELOAD=librtl.so (HIP API wrappers)
    │     hipLaunchKernel → record_hip_api() → forward via dlsym(RTLD_NEXT)
    │     hipMemcpy       → record_hip_api() → forward
    │     hipGraphLaunch  → record_hip_api() → forward
    │
    └── HSA_TOOLS_LIB=librtl.so (GPU kernel interception)
          queue intercept → signal injection → record_kernel()
          
Both paths write to the same TraceDB:
    rocpd_api (HIP API calls)  ←─── correlation_id ───→  rocpd_op (GPU kernels)
                                    via rocpd_api_ops

Re-entrancy Guard

static thread_local bool tls_in_hip_api = false;

extern "C" hipError_t hipMemcpy(...) {
    if (tls_in_hip_api || !g_hip_api_enabled) return real_hipMemcpy(...);
    tls_in_hip_api = true;
    // record + forward
    tls_in_hip_api = false;
    return ret;
}

Correlation ID Flow

  1. HIP wrapper assigns correlation_id = next_correlation_id()
  2. Records API call with this ID in rocpd_api
  3. Pushes {queue_handle → correlation_id} into concurrent map
  4. HSA completion worker pops from map when recording the kernel dispatch
  5. Writes (api_id, op_id) into rocpd_api_ops join table

Functions to Intercept (Phase 1)

Function Category Priority
hipModuleLaunchKernel Kernel launch P0
hipExtModuleLaunchKernel Kernel launch (ATOM/Triton) P0
hipMemcpy Sync memory copy P0
hipMemcpyAsync Async memory copy P0
hipMalloc / hipFree Memory allocation P1
hipStreamSynchronize Stream sync P1
hipGraphLaunch Graph replay P1
hipLaunchKernelGGL Kernel launch (legacy) P2

RTL_MODE Integration

Mode HSA intercept HIP API intercept Use case
lite Yes (skip graph) No Production, ~0% overhead
standard Yes (all) No GPU-only profiling
hip Yes (all) Yes Full CPU+GPU correlation

Zero New Dependencies

  • No link against libamdhip64.so (uses dlsym only)
  • No link against roctracer or rocprofiler-sdk
  • Only compile-time dependency: HIP headers (hip_runtime_api.h)
  • ldd librtl.so output unchanged

Validation Plan

24 tests across 5 test files:

  • 5 CPU unit tests (schema, re-entrancy, correlation IDs)
  • 6 GPU integration tests (capture, timing, pid/tid, mode gating)
  • 4 correlation tests (API↔kernel linking, timing order)
  • 5 E2E tests (PyTorch, CUDAGraph, Perfetto output, roctracer parity)
  • 2 overhead tests (<10% target for hip mode)
  • 2 regression guards (no roctracer/libamdhip64 in ldd)

Files Changed

File Action
src/hip_api_intercept.cpp NEW
src/hip_api_intercept.h NEW
src/hip_intercept.cpp DELETE (placeholder)
src/hsa_intercept.cpp MODIFY
src/trace_db.h MODIFY
src/trace_db.cpp MODIFY
Makefile MODIFY
rocm_trace_lite/cmd_trace.py MODIFY
5 new test files NEW

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions