RFC-003: HIP Runtime API Interception for Unified CPU+GPU Profiling

## Summary

RTL currently captures GPU kernel execution timing via HSA queue interception. This RFC proposes adding HIP runtime API tracing through `LD_PRELOAD`-based function interposition, enabling RTL to capture both CPU-side API calls (hipLaunchKernel, hipMemcpy, hipMalloc, hipStreamSynchronize, hipGraphLaunch) and GPU-side kernel timing in a single trace.

The design uses `dlsym(RTLD_NEXT)` to forward intercepted calls to the real HIP implementation after recording timestamps and arguments. This approach requires no special ROCm builds, no external profiling libraries, and does not interfere with CUDAGraph replay. Combined with correlation ID linking between API calls and kernel dispatches, this enables full host-device event correlation without depending on roctracer or rocprofiler-sdk.

The HIP interposition layer is compiled into the existing `librtl.so` and activated via `RTL_MODE=hip`. A re-entrancy guard (thread-local boolean) prevents recursive recording during HIP runtime initialization.

## Motivation

1. Users need to correlate CPU-side dispatch overhead with GPU kernel execution time
2. roctracer drops 54% of kernel events on ROCm 7.2.x during decode-phase workloads
3. rocprofiler-sdk has startup overhead, segfault, and hipGraphLaunch overhead issues
4. RTL's HSA-only mode captures GPU timing but has no visibility into HIP API calls
5. Future Kineto integration requires both HIP API and GPU kernel data for full torch.profiler compatibility

## Design

### Architecture

```
Application
    │
    ├── LD_PRELOAD=librtl.so (HIP API wrappers)
    │     hipLaunchKernel → record_hip_api() → forward via dlsym(RTLD_NEXT)
    │     hipMemcpy       → record_hip_api() → forward
    │     hipGraphLaunch  → record_hip_api() → forward
    │
    └── HSA_TOOLS_LIB=librtl.so (GPU kernel interception)
          queue intercept → signal injection → record_kernel()
          
Both paths write to the same TraceDB:
    rocpd_api (HIP API calls)  ←─── correlation_id ───→  rocpd_op (GPU kernels)
                                    via rocpd_api_ops
```

### Re-entrancy Guard

```cpp
static thread_local bool tls_in_hip_api = false;

extern "C" hipError_t hipMemcpy(...) {
    if (tls_in_hip_api || !g_hip_api_enabled) return real_hipMemcpy(...);
    tls_in_hip_api = true;
    // record + forward
    tls_in_hip_api = false;
    return ret;
}
```

### Correlation ID Flow

1. HIP wrapper assigns `correlation_id = next_correlation_id()`
2. Records API call with this ID in `rocpd_api`
3. Pushes `{queue_handle → correlation_id}` into concurrent map
4. HSA completion worker pops from map when recording the kernel dispatch
5. Writes `(api_id, op_id)` into `rocpd_api_ops` join table

### Functions to Intercept (Phase 1)

| Function | Category | Priority |
|----------|----------|----------|
| hipModuleLaunchKernel | Kernel launch | P0 |
| hipExtModuleLaunchKernel | Kernel launch (ATOM/Triton) | P0 |
| hipMemcpy | Sync memory copy | P0 |
| hipMemcpyAsync | Async memory copy | P0 |
| hipMalloc / hipFree | Memory allocation | P1 |
| hipStreamSynchronize | Stream sync | P1 |
| hipGraphLaunch | Graph replay | P1 |
| hipLaunchKernelGGL | Kernel launch (legacy) | P2 |

### RTL_MODE Integration

| Mode | HSA intercept | HIP API intercept | Use case |
|------|--------------|-------------------|----------|
| lite | Yes (skip graph) | No | Production, ~0% overhead |
| standard | Yes (all) | No | GPU-only profiling |
| hip | Yes (all) | Yes | Full CPU+GPU correlation |

### Zero New Dependencies

- No link against libamdhip64.so (uses dlsym only)
- No link against roctracer or rocprofiler-sdk
- Only compile-time dependency: HIP headers (`hip_runtime_api.h`)
- `ldd librtl.so` output unchanged

## Validation Plan

24 tests across 5 test files:

- 5 CPU unit tests (schema, re-entrancy, correlation IDs)
- 6 GPU integration tests (capture, timing, pid/tid, mode gating)
- 4 correlation tests (API↔kernel linking, timing order)
- 5 E2E tests (PyTorch, CUDAGraph, Perfetto output, roctracer parity)
- 2 overhead tests (<10% target for hip mode)
- 2 regression guards (no roctracer/libamdhip64 in ldd)

## Files Changed

| File | Action |
|------|--------|
| `src/hip_api_intercept.cpp` | NEW |
| `src/hip_api_intercept.h` | NEW |
| `src/hip_intercept.cpp` | DELETE (placeholder) |
| `src/hsa_intercept.cpp` | MODIFY |
| `src/trace_db.h` | MODIFY |
| `src/trace_db.cpp` | MODIFY |
| `Makefile` | MODIFY |
| `rocm_trace_lite/cmd_trace.py` | MODIFY |
| 5 new test files | NEW |

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC-003: HIP Runtime API Interception for Unified CPU+GPU Profiling #93

Summary

Motivation

Design

Architecture

Re-entrancy Guard

Correlation ID Flow

Functions to Intercept (Phase 1)

RTL_MODE Integration

Zero New Dependencies

Validation Plan

Files Changed

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Function	Category	Priority
hipModuleLaunchKernel	Kernel launch	P0
hipExtModuleLaunchKernel	Kernel launch (ATOM/Triton)	P0
hipMemcpy	Sync memory copy	P0
hipMemcpyAsync	Async memory copy	P0
hipMalloc / hipFree	Memory allocation	P1
hipStreamSynchronize	Stream sync	P1
hipGraphLaunch	Graph replay	P1
hipLaunchKernelGGL	Kernel launch (legacy)	P2

Mode	HSA intercept	HIP API intercept	Use case
lite	Yes (skip graph)	No	Production, ~0% overhead
standard	Yes (all)	No	GPU-only profiling
hip	Yes (all)	Yes	Full CPU+GPU correlation

File	Action
`src/hip_api_intercept.cpp`	NEW
`src/hip_api_intercept.h`	NEW
`src/hip_intercept.cpp`	DELETE (placeholder)
`src/hsa_intercept.cpp`	MODIFY
`src/trace_db.h`	MODIFY
`src/trace_db.cpp`	MODIFY
`Makefile`	MODIFY
`rocm_trace_lite/cmd_trace.py`	MODIFY
5 new test files	NEW

RFC-003: HIP Runtime API Interception for Unified CPU+GPU Profiling #93

Description

Summary

Motivation

Design

Architecture

Re-entrancy Guard

Correlation ID Flow

Functions to Intercept (Phase 1)

RTL_MODE Integration

Zero New Dependencies

Validation Plan

Files Changed

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions