RFC-002: HIP-level profiling mode (RTL_MODE=hip) #84
Proposes a new profiling mode that uses the HIP CLR built-in profiler API (hipClrProfiler*) instead of HSA signal injection. Addresses the key limitations documented in issue #79:

- Multi-process crashes (ATOM/vLLM subprocess spawn): no more 0x1009
- CUDAGraph interference: CLR profiler handles graph nodes natively
- Kernel name gaps (Triton JIT): native demangled names from HIP
- No HIP API timeline: CPU start/end timestamps per API call
- No SDMA copy tracking: copies captured with byte counts
- No launch latency measurement: CPU and GPU timestamps per dispatch

Upstream dependency: ROCm/rocm-systems@5dc10a8 (German Andryeyev's "clr: add internal profiler" commit).

Includes:

- Architecture and implementation plan (4 phases)
- Kernel launch latency analysis with clock-domain validation protocol
- Overhead benchmark matrix and e2e validation plan
- Long-run soak test for CLR profiler memory accumulation
- 6 open questions including go/no-go for launch latency feature
🤖 GPT-5.4 Code Review

MINOR ISSUES

Only docs changed, so no code-level dependency/correctness bugs to verify. A few real concerns in the proposed design:

1. Dependency / compatibility risk
2. Correctness
3. Performance
4. Security / robustness
5. Scope / consistency
Overall: direction is fine and does not introduce forbidden …

Model: gpt-5.4
…n safety

Addresses three P1/P2 review findings on the RFC:

P1: Persist API/GPU linkage in TraceDB schema

The original RFC claimed no TraceDB changes were needed, but the existing schema has no persisted correlation_id column on rocpd_op or rocpd_api, and record_hip_api/record_kernel/record_copy silently drop the correlation_id argument. Without persistence, CPU->GPU correlation is lost on write and launch-latency queries / Perfetto arrows cannot be generated as described.

Fix: add a new Phase 2 that introduces an additive correlation_id column to rocpd_op and rocpd_api plus indexes, updates the INSERT prepared statements to bind the parameter, and adds the corresponding SQL join for the converter. Backward compatible (default 0); older traces render via the pre-change code path.

P1: Launch-latency gating test used impossible async ordering

The validation protocol asserted cpu_start < gpu_begin < gpu_end < cpu_end for hipLaunchKernel, but hipLaunchKernel is asynchronous: cpu_end is the launch API return time, and gpu_end normally occurs after it. As written, the gate would reject a correct implementation.

Fix: rewrite Test 1 to use a bounding sync call. Assert causality cpu_start_L < gpu_begin and cpu_end_L < cpu_end_S (sync cannot return before the kernel completes), explicitly note that gpu_end > cpu_end_L is the expected async ordering, and remove the bogus assertion.

P2: Timer-thread drain conflicts with upstream CLR contract

hip_clr_profiler_ext.h documents that the records buffer is profiler-owned, valid only until Reset() or unload, and that callers should process records before issuing further HIP calls. A background timer drain while the app runs races with record insertion and invalidates buffers.

Fix: drop the "drain every 30s from a timer thread" recommendation. Plan of record is shutdown-only drain for v1, measured by the 30-minute vLLM soak test. If soak shows unacceptable memory growth, escalate to a cooperative stop/drain/resume via roctx marker or hipDeviceSynchronize interception, or request a streaming API upstream. Explicitly rule out a background timer thread as unsafe.
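To make the rewritten Test 1 concrete, here is a minimal sketch of the sync-bounded ordering check for one launch/sync pair. The `LaunchSample` type and its field names are hypothetical, and the sketch assumes all timestamps have already been converted into a single clock domain. The `gpu_begin <= gpu_end` check is an extra sanity check and not one of the assertions listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchSample:
    """Timestamps (ns) for one hipLaunchKernel call followed by a sync call.

    Assumption: every value is already expressed in one clock domain.
    """
    cpu_start_l: int  # launch API entry
    cpu_end_l: int    # launch API return (async: the kernel may still be running)
    gpu_begin: int    # kernel start on the GPU
    gpu_end: int      # kernel end on the GPU
    cpu_end_s: int    # return of the bounding sync call


def ordering_violations(s: LaunchSample) -> list:
    """Return the names of violated invariants; an empty list means the sample passes."""
    checks = {
        # Causality: the GPU cannot start before the launch call was entered.
        "cpu_start_L < gpu_begin": s.cpu_start_l < s.gpu_begin,
        # The sync call returns after the launch call returned.
        "cpu_end_L < cpu_end_S": s.cpu_end_l < s.cpu_end_s,
        # Extra sanity check (not part of the RFC assertions).
        "gpu_begin <= gpu_end": s.gpu_begin <= s.gpu_end,
    }
    return [name for name, ok in checks.items() if not ok]
```

Note that a sample with `gpu_end > cpu_end_l` passes: that is the expected asynchronous ordering, so the check deliberately does not compare those two values.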
Thanks for the review. All three findings are valid. Pushed 43eb115 to address them.

P1: Persist API/GPU linkage (resolved)

Verified the finding against …

Fix: added a new Phase 2 to the implementation plan that introduces an additive …

P1: Async launch ordering in gating test (resolved)

You're right: …

Fix: rewrote Test 1 to use a bounding sync call. New assertions: …

Explicitly noted in the RFC that …

P2: Timer-thread drain conflicts with upstream contract (resolved)

Re-read …

A background timer calling …

Fix: dropped the timer-thread plan. Plan of record is now: …

Also bumped the Phase numbering throughout (old Phase 2 → Phase 3, etc.) to make room for the new schema phase.
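For readers following the schema discussion, the sketch below shows why a persisted correlation_id matters: with it, launch latency reduces to a join between rocpd_api and rocpd_op. Only the two table names and the correlation_id column come from this thread. The other columns (`name`, `start_ns`, `end_ns`), the index names, and the definition of latency as GPU start minus API start are assumptions made for illustration.

```python
import sqlite3

# Hypothetical minimal schema. The real TraceDB tables have more columns.
SCHEMA = """
CREATE TABLE rocpd_api (name TEXT, start_ns INTEGER, end_ns INTEGER,
                        correlation_id INTEGER DEFAULT 0);
CREATE TABLE rocpd_op  (name TEXT, start_ns INTEGER, end_ns INTEGER,
                        correlation_id INTEGER DEFAULT 0);
CREATE INDEX idx_api_corr ON rocpd_api(correlation_id);
CREATE INDEX idx_op_corr  ON rocpd_op(correlation_id);
"""

# Rows with correlation_id = 0 are treated as "no linkage recorded".
LATENCY_SQL = """
SELECT op.name, op.start_ns - api.start_ns AS launch_latency_ns
FROM rocpd_op AS op
JOIN rocpd_api AS api ON api.correlation_id = op.correlation_id
WHERE op.correlation_id != 0
ORDER BY op.start_ns
"""


def launch_latencies(conn):
    """Return (kernel_name, latency_ns) for every op linked to an API call."""
    return conn.execute(LATENCY_SQL).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO rocpd_api VALUES ('hipLaunchKernel', 1000, 1010, 7)")
    conn.execute("INSERT INTO rocpd_op VALUES ('my_kernel', 1025, 1500, 7)")
    # A row written without linkage keeps the default id and is skipped.
    conn.execute("INSERT INTO rocpd_op VALUES ('legacy_kernel', 2000, 2100, 0)")
    return launch_latencies(conn)
```

Here `demo()` links the single kernel to its launch call and ignores the row whose correlation_id is still the default 0, which is how older traces would look after an additive migration.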
🤖 GPT-5.4 Code Review

MINOR ISSUES

Real issues

1. Dependency policy violation in docs

The RFC explicitly proposes/mentions: …

Per your review criteria, the hard violation to watch is actual dependency/reference to …

2. Fallback from …
Implements the design from RFC-002 (#84). Adds a new profiling mode that uses the HIP CLR built-in profiler API from ROCm/rocm-systems@5dc10a8 instead of HSA signal injection.

Benefits over HSA modes:

- Multi-process safe (no 0x1009 in ATOM/vLLM subprocess spawn)
- CUDAGraph native (CLR profiler handles graph nodes, no batch skip)
- HIP API CPU timeline (first time RTL can measure launch latency)
- SDMA copy tracking with byte counts
- Demangled kernel names native from HIP runtime (no symbol iteration)

## Changes

### TraceDB schema (src/trace_db.cpp)

Add correlation_id column to rocpd_op and rocpd_api (default 0, backward compatible). Add indexes on correlation_id. Migration via ALTER TABLE ADD COLUMN for pre-existing trace files; ignored errors make it idempotent. Prepared statements now bind correlation_id; record_hip_api / record_kernel / record_copy / record_roctx all persist the parameter that was previously silently dropped.

### HSA interception (src/hsa_intercept.cpp)

Add RtlMode::HIP enum value. Parse "hip" from RTL_MODE env var. When mode is HIP, OnLoad() skips queue intercept setup / signal pool / worker thread, registers shutdown handler, and returns. shutdown() branches early for HIP mode and calls hip_intercept::hip_profiler_drain(). Teardown-order invariant (worker.join() before DB close) preserved for HSA modes. The HIP branch does not explicitly close the DB, relying on the existing lazy_init atexit handler in trace_db.cpp.

### HIP profiler drain (src/hip_intercept.cpp)

Rewrite from placeholder to actual implementation. Declare the HipClrApiRecord / HipClrGpuActivity types locally (upstream header not yet shipped). dlopen libamdhip64.so with RTLD_NOLOAD (so we don't force-load HIP when the app never used it), try multiple .so name variants for ROCm 6 / 7 / symlink cases. dlsym the 5 hipClrProfiler* functions. Drain sequence: Disable() -> GetRecords() -> iterate into trace_db -> Reset(). Graceful fallback when symbols are missing. Maps dispatch/copy/barrier op codes to record_kernel / record_copy.

### Launcher (rocm_trace_lite/cli.py, cmd_trace.py)

Add "hip" to --mode choices. When mode=hip, cmd_trace sets GPU_CLR_PROFILE=/dev/null in subprocess env. This lets the CLR profiler self-activate during hip::init() (librtl.so can't call hipClrProfilerEnable() from its OnLoad because HIP is mid-init at that point). /dev/null suppresses the profiler's own JSON autosave; rtl extracts records via GetRecords() and writes SQLite itself.

## Testing

- make -j: clean build, no new dependencies (dl already linked)
- make test-cpu: 216 passed, 34 skipped, 0 failed
- test_source_guard still green (no roctracer/rocprofiler leaks)
- Existing HSA mode behavior unchanged

GPU validation on MI300X with a ROCm build that includes the CLR profiler patch is tracked separately and gated by the validation protocol from RFC-002 section "Kernel launch latency analysis".

Refs #79, #84
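The additive migration described under the TraceDB schema change can be sketched with Python's sqlite3 module. This is an illustration only: the actual change lives in src/trace_db.cpp, and the index names here are hypothetical. The commit message says errors are ignored; this sketch is slightly stricter and swallows only SQLite's "duplicate column name" error, so an unrelated failure such as a missing table still surfaces.

```python
import sqlite3

TABLES = ("rocpd_op", "rocpd_api")


def migrate_add_correlation_id(conn):
    """Add correlation_id (default 0) to each table; safe to call repeatedly.

    Returns the list of tables that were actually altered on this call.
    """
    altered = []
    for table in TABLES:
        try:
            conn.execute(
                f"ALTER TABLE {table} ADD COLUMN correlation_id INTEGER DEFAULT 0"
            )
            altered.append(table)
        except sqlite3.OperationalError as exc:
            # Already migrated: SQLite reports "duplicate column name: ...".
            if "duplicate column name" not in str(exc):
                raise
        # Hypothetical index name; IF NOT EXISTS keeps this step idempotent too.
        conn.execute(
            f"CREATE INDEX IF NOT EXISTS idx_{table}_correlation_id "
            f"ON {table}(correlation_id)"
        )
    conn.commit()
    return altered
```

Rows that existed before the migration read back with correlation_id 0, which matches the "default 0, backward compatible" behavior stated above.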
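The launcher behavior described above (setting GPU_CLR_PROFILE=/dev/null for the traced subprocess) could look roughly like the following. The function name `build_trace_env` is hypothetical, and exporting RTL_MODE from the launcher is an assumption based on the library parsing that variable; the real cmd_trace.py may set additional variables that are not covered here.

```python
import os


def build_trace_env(mode, base_env=None):
    """Return a copy of the environment for the traced subprocess.

    Assumption: the launcher passes the selected mode through RTL_MODE.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["RTL_MODE"] = mode
    if mode == "hip":
        # Lets the CLR profiler self-activate during HIP init, while
        # /dev/null discards the profiler's own JSON autosave output.
        env["GPU_CLR_PROFILE"] = "/dev/null"
    return env
```

A caller would then pass the result to the child process, for example `subprocess.run(cmd, env=build_trace_env("hip"))`. Copying the environment instead of mutating `os.environ` keeps the launcher's own process unaffected.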
Summary
Proposes a new RTL_MODE=hip profiling mode that uses the HIP CLR built-in profiler API (hipClrProfiler*) from ROCm/rocm-systems@5dc10a8 instead of HSA signal injection. Addresses the limitations documented in #79.

This is an RFC only: no code changes yet. Requesting review on the design before implementation starts.

Key benefits over existing HSA modes

HSA_STATUS_ERROR_INVALID_PACKET_FORMAT (0x1009) in ATOM/vLLM subprocess spawn (the #1 blocker from #79)

What's in the RFC

… GPU_CLR_PROFILE direct use, roctracer)
… benchmarks/overhead_bench.py with hip mode across 5 workloads

Upstream dependency

Needs ROCm/rocm-systems@5dc10a8 to be available. RTL will dlopen the HIP library and probe for the 5 hipClrProfiler* symbols; if any are absent, it falls back gracefully to RTL_MODE=default.

Requesting review
Particularly interested in feedback on:
Closes #79 (on merge of implementation PR, not this RFC).
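The probe-and-fall-back behavior described under "Upstream dependency" can be illustrated with ctypes. This is a sketch under several assumptions. Of the symbol names, only hipClrProfilerEnable is spelled out in this thread; the other three are guessed from the Disable/GetRecords/Reset steps, and the fifth symbol is not named here, so it is omitted. The library file names are also guesses. Unlike the C++ implementation, plain `ctypes.CDLL` does not use RTLD_NOLOAD, so it would load the library if it is installed.

```python
import ctypes

# Assumed symbol spellings (see the note above); not an authoritative list.
REQUIRED_SYMBOLS = (
    "hipClrProfilerEnable",
    "hipClrProfilerDisable",
    "hipClrProfilerGetRecords",
    "hipClrProfilerReset",
)

# Assumed file-name variants for different ROCm releases.
LIB_CANDIDATES = ("libamdhip64.so", "libamdhip64.so.6", "libamdhip64.so.7")


def probe_hip_profiler(load=ctypes.CDLL, names=LIB_CANDIDATES,
                       symbols=REQUIRED_SYMBOLS):
    """Return the first library exposing every required symbol, else None."""
    for name in names:
        try:
            lib = load(name)
        except OSError:
            continue  # this variant is not present; try the next one
        # ctypes raises AttributeError for a missing symbol, so hasattr works.
        if all(hasattr(lib, sym) for sym in symbols):
            return lib
    return None


def select_mode(requested, lib):
    """Fall back to the default mode when hip was requested but is unavailable."""
    if requested == "hip" and lib is None:
        return "default"
    return requested
```

Passing the loader in as a parameter keeps the probe testable on machines without ROCm: a stub loader that raises `OSError` exercises the fallback path.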
Test plan