RFC-002: HIP-level profiling mode (RTL_MODE=hip) by sunway513 · Pull Request #84 · sunway513/rocm-trace-lite

sunway513 · 2026-04-10T18:36:51Z

Summary

Proposes a new RTL_MODE=hip profiling mode that uses the HIP CLR built-in profiler API (hipClrProfiler*) from ROCm/rocm-systems@5dc10a8 instead of HSA signal injection. Addresses the limitations documented in #79.

This is an RFC only — no code changes yet. Requesting review on the design before implementation starts.

Key benefits over existing HSA modes

Multi-process safe — no more HSA_STATUS_ERROR_INVALID_PACKET_FORMAT (0x1009) in ATOM/vLLM subprocess spawn (the Add trace_matmul.py example #1 blocker from Feature Request: HIP-level kernel completion callback for non-invasive profiling #79)
CUDAGraph native — CLR profiler handles graph nodes, no need for batch skip
Demangled kernel names — native from HIP runtime, including Triton JIT kernels
HIP API timeline — CPU start/end timestamps for every HIP runtime call
SDMA copy tracking — memcpy operations with byte counts and GPU timing
Kernel launch latency — first time RTL can independently measure this (CPU + GPU timestamps per dispatch)

What's in the RFC

Architecture and 4-phase implementation plan
File-by-file change list
Alternatives considered (callback API, GPU_CLR_PROFILE direct use, roctracer)
Kernel launch latency analysis — what we can compute, clock-domain caveat, 4-test validation protocol, go/no-go decision point for the feature
Overhead benchmark matrix — extending benchmarks/overhead_bench.py with hip mode across 5 workloads
E2E validation plan — 6 scenarios including vLLM GPT-OSS 120B, ATOM DeepSeek-R1, long-run soak test for CLR profiler memory accumulation
6 open questions, including the go/no-go for launch latency feature

Upstream dependency

Needs ROCm/rocm-systems@5dc10a8 to be available. RTL will dlopen the HIP library and probe for the 5 hipClrProfiler* symbols; if absent, falls back gracefully to RTL_MODE=default.

Requesting review

Particularly interested in feedback on:

Clock-domain alignment validation strategy (section "Kernel launch latency analysis")
Periodic drain strategy for long-running workloads (open question Fix timestamp domain mismatch between host and GPU events #4)
Whether the fallback path is acceptable or we should hard-fail when the API is missing

Closes #79 (on merge of implementation PR, not this RFC).

Test plan

Review RFC structure and content
Confirm upstream CLR profiler API timeline with HIP team
Validate clock-domain alignment assumption with test workload on MI300X
Iterate on open questions before implementation starts

Proposes a new profiling mode that uses the HIP CLR built-in profiler API (hipClrProfiler*) instead of HSA signal injection. Addresses the key limitations documented in issue #79: - Multi-process crashes (ATOM/vLLM subprocess spawn) — no more 0x1009 - CUDAGraph interference — CLR profiler handles graph nodes natively - Kernel name gaps (Triton JIT) — native demangled names from HIP - No HIP API timeline — CPU start/end timestamps per API call - No SDMA copy tracking — copies captured with byte counts - No launch latency measurement — CPU and GPU timestamps per dispatch Upstream dependency: ROCm/rocm-systems@5dc10a8 (German Andryeyev's clr: add internal profiler commit). Includes: - Architecture and implementation plan (4 phases) - Kernel launch latency analysis with clock-domain validation protocol - Overhead benchmark matrix and e2e validation plan - Long-run soak test for CLR profiler memory accumulation - 6 open questions including go/no-go for launch latency feature

github-actions · 2026-04-10T18:37:20Z

🤖 GPT-5.4 Code Review

MINOR ISSUES

Only docs changed, so no code-level dependency/correctness bugs to verify. A few real concerns in the proposed design:

1. Dependency / compatibility risk

Potential dependency-policy violation if hip/amd_detail/hip_clr_profiler_ext.h is included at build time.
The RFC says “optional runtime capability (dlopen + dlsym probe)” and “Makefile no changes,” which is good, but this must remain true in implementation. If the code includes that header unconditionally, RTL will gain a build-time dependency on a non-stable HIP internal header.
Ask: keep it fully runtime-resolved, with local typedefs guarded in RTL code.
libamdhip64.so + internal symbols are not stable ABI.
Not a blocker, but the RFC should be clearer that this is experimental and version-gated. kHipClrApiNames[] especially sounds like an internal data symbol that may disappear or get hidden.

2. Correctness

dlopen("libamdhip64.so", RTLD_NOW | RTLD_NOLOAD) may fail on systems where the loaded soname is versioned.
Real-world HIP loads are often via libamdhip64.so.<ver>. RTLD_NOLOAD with the unversioned name is not guaranteed to find an already loaded versioned object. The fallback behavior is fine, but this may cause false “API not available” on supported systems.
Ask: document a more robust probe strategy.
Shutdown-only drain is risky for long-running jobs.
The RFC itself notes unbounded in-memory accumulation. If implementation follows shutdown-only collection, correctness degrades into OOM / lost profiling on long runs.
Ask: periodic drain should be part of the initial design, not an open question, unless CLR guarantees bounded memory.
hipClrProfilerDisable() “drains in-flight GPU work internally” is assumed, not verified.
If that assumption is wrong, shutdown may either block unexpectedly long or miss tail records. This needs validation before relying on it in OnUnload().

3. Performance

Timer thread / periodic drain could contend with SQLite if writes happen directly from that thread.
Not a blocker, but if added, it should preserve the current DB threading model. SQLite connection ownership/thread mode must stay explicit.
Shutdown-time bulk conversion may create large latency spikes.
For very large traces, converting all CLR records to DB rows only at exit can make teardown expensive and memory-heavy. This reinforces the need for incremental draining.

4. Security / robustness

Do not trust kernel_name / API name pointers blindly.
Since these come from runtime-owned buffers, implementation should null-check and copy defensively before reset/unload. If Reset() invalidates pointers, using them after reset would be a UAF bug.
Unchecked dlsym for data symbols is easy to get wrong.
The RFC proposes resolving kHipClrApiNames and count. Implementation must check both symbol presence and bounds before indexing by api_id.

5. Scope / consistency

The RFC says “4 profiler symbols” but lists 5 APIs including hipClrProfilerWriteJson().
Minor doc inconsistency; likely intentional since JSON export is “bonus,” but should be clarified.

Overall: direction is fine and does not introduce forbidden roctracer/rocprofiler-sdk/libroctx64 dependencies in this diff, but the implementation should be careful around runtime symbol probing, bounded memory, and pointer lifetime.

Model: gpt-5.4

…n safety Addresses three P1/P2 review findings on the RFC: P1 — Persist API/GPU linkage in TraceDB schema The original RFC claimed no TraceDB changes were needed, but the existing schema has no persisted correlation_id column on rocpd_op or rocpd_api, and record_hip_api/record_kernel/record_copy silently drop the correlation_id argument. Without persistence, CPU->GPU correlation is lost on write and launch-latency queries / Perfetto arrows cannot be generated as described. Fix: add a new Phase 2 that introduces an additive correlation_id column to rocpd_op and rocpd_api plus indexes, updates the INSERT prepared statements to bind the parameter, and adds the corresponding SQL join for the converter. Backward compatible (default 0); older traces render via the pre-change code path. P1 — Launch-latency gating test used impossible async ordering The validation protocol asserted cpu_start < gpu_begin < gpu_end < cpu_end for hipLaunchKernel, but hipLaunchKernel is asynchronous: cpu_end is the launch API return time, and gpu_end normally occurs after it. As written, the gate would reject a correct implementation. Fix: rewrite Test 1 to use a bounding sync call. Assert causality cpu_start_L < gpu_begin and cpu_end_L < cpu_end_S (sync cannot return before the kernel completes), explicitly note that gpu_end > cpu_end_L is the expected async ordering, and remove the bogus assertion. P2 — Timer-thread drain conflicts with upstream CLR contract hip_clr_profiler_ext.h documents that the records buffer is profiler-owned, valid only until Reset() or unload, and that callers should process records before issuing further HIP calls. A background timer drain while the app runs races with record insertion and invalidates buffers. Fix: drop the "drain every 30s from a timer thread" recommendation. Plan of record is shutdown-only drain for v1, measured by the 30-minute vLLM soak test. If soak shows unacceptable memory growth, escalate to a cooperative stop/drain/resume via roctx marker or hipDeviceSynchronize interception, or request a streaming API upstream. Explicitly rule out a background timer thread as unsafe.

sunway513 · 2026-04-10T18:51:38Z

Thanks for the review. All three findings are valid. Pushed 43eb115 to address them.

P1 — Persist API/GPU linkage (resolved)

Verified the finding against src/trace_db.cpp:

rocpd_op schema (line 82-92) and rocpd_api (line 93-101) have no correlation_id column
record_hip_api (line 281), record_kernel (line 307), record_copy (line 331) all accept a correlation_id parameter but never bind it in the INSERT statements
rocpd_api_ops junction table is declared but unused

Fix: added a new Phase 2 to the implementation plan that introduces an additive correlation_id INTEGER DEFAULT 0 column on both rocpd_op and rocpd_api, plus indexes, and updates the writers to bind the parameter. Backward compatible — existing readers see 0 and fall back. The converter's Perfetto flow events are driven by a SQL join on correlation_id. File-changes table updated to reflect that trace_db.cpp is a required change, not "no changes".

P1 — Async launch ordering in gating test (resolved)

You're right — hipLaunchKernel is async, so cpu_end_L is the launch-API return time and gpu_end > cpu_end_L is the correct ordering for any working implementation. The original Test 1 would have rejected a perfectly healthy profiler.

Fix: rewrote Test 1 to use a bounding sync call. New assertions:

cpu_start_L < gpu_begin (causality)
cpu_end_L < cpu_end_S (the sync cannot return before the kernel completes)
gpu_end < cpu_end_S (the real upper-bound guarantee)

Explicitly noted in the RFC that gpu_end > cpu_end_L is expected async ordering and must NOT be asserted. Added an implementation note about pairing the launch and sync records by thread-local interleaving order in the HipClrApiRecord stream.

P2 — Timer-thread drain conflicts with upstream contract (resolved)

Re-read hip_clr_profiler_ext.h in ROCm/rocm-systems@5dc10a8:

The returned pointer is owned by the profiler. It remains valid until the next call to hipClrProfilerReset() or library unload. [...] callers should process records before issuing further HIP calls when profiling is active.

A background timer calling GetRecords / Reset while the app runs races with record insertion and invalidates the export buffer. The original recommendation was unsafe.

Fix: dropped the timer-thread plan. Plan of record is now:

v1 (default): shutdown-only drain. Simple, safe, correct.
Soak test as the gate: the existing 30-minute vLLM run in the validation plan measures actual RSS growth. If it stays under budget, v1 ships as-is.
v2 (only if soak fails): cooperative stop/drain/resume triggered by roctx marker OR hipDeviceSynchronize interception (device is known-idle at that moment), OR upstream streaming API request. Explicitly rules out the timer-thread approach as unsafe.

Also bumped the Phase numbering throughout (old Phase 2 → Phase 3, etc.) to make room for the new schema phase.

github-actions · 2026-04-10T18:51:47Z

🤖 GPT-5.4 Code Review

MINOR ISSUES

Real issues

1. Dependency policy violation in docs

The RFC explicitly proposes/mentions:

roctx integration (“No roctx markers integration”, “roctx marker rtl_drain”)
rocprof / rocprofiler-sdk as comparison points

Per your review criteria, the hard violation to watch is actual dependency/reference to roctracer, rocprofiler-sdk, or libroctx64. This diff is docs-only, so there is no code dependency introduced, but it does reference rocprofiler-sdk and roctx/libroctx64 concepts. If the project policy is “no references at all,” this should be removed or reworded. If the policy is “no build/runtime dependency,” then this is fine.

2. Fallback from `RTL_MODE=hip` to `default` is risky

The RFC says if HIP CLR symbols are unavailable, fall back to RTL_MODE=default.

That is not obviously safe:

user may have selected hip specifically to avoid the known HSA-mode multiprocess crash / graph interference
silent behavioral downgrade can reintroduce the exact failure mode they were avoiding

Better behavior: fail closed for RTL_MODE=hip unless an explicit opt-in fallback is set.

3. `dlopen(..., RTLD_NOLOAD)` may incorrectly fail

The RFC assumes:

dlopen("libamdhip64.so", RTLD_NOW | RTLD_NOLOAD);

This is brittle because the loaded SONAME may differ (libamdhip64.so.6, path-qualified load, etc.). In that case RTLD_NOLOAD can fail even when HIP is already present.

If implemented as written, this is a correctness issue in capability detection. At minimum the RFC should call out SONAME/version handling or a secondary probe strategy.

4. Shutdown-only drain has real memory-risk

The RFC acknowledges profiler-side accumulation, but the plan of record is still “shutdown-only drain.” For long-running services this is a real performance/stability concern, not just an open question:

unbounded record growth
large final shutdown latency
possible OOM on high-dispatch workloads

Given the upstream contract, background draining is unsafe, but the RFC should more strongly constrain v1 scope to bounded jobs/tests only, or require a documented memory cap / disable behavior.

Things I did not flag

No actual roctracer dependency introduced in this diff.
No SQLite/thread-safety/signal-handling bug in code, because this is docs-only.
No concrete HSA API misuse in the diff, though the proposed HIP/HSA lifecycle split should be reviewed carefully once code lands.

Recommendation

Acceptable as an RFC, but before implementation:

make hip mode failure explicit instead of auto-fallback to default
tighten library probing semantics beyond bare RTLD_NOLOAD
clearly bound or mitigate shutdown-only memory growth
clarify whether “no dependency” also forbids doc references to roctx/rocprofiler-sdk

Model: gpt-5.4

Implements the design from RFC-002 (#84). Adds a new profiling mode that uses the HIP CLR built-in profiler API from ROCm/rocm-systems@5dc10a8 instead of HSA signal injection. Benefits over HSA modes: - Multi-process safe (no 0x1009 in ATOM/vLLM subprocess spawn) - CUDAGraph native (CLR profiler handles graph nodes, no batch skip) - HIP API CPU timeline (first time RTL can measure launch latency) - SDMA copy tracking with byte counts - Demangled kernel names native from HIP runtime (no symbol iteration) ## Changes ### TraceDB schema (src/trace_db.cpp) Add correlation_id column to rocpd_op and rocpd_api (default 0, backward compatible). Add indexes on correlation_id. Migration via ALTER TABLE ADD COLUMN for pre-existing trace files; ignored errors make it idempotent. Prepared statements now bind correlation_id; record_hip_api / record_kernel / record_copy / record_roctx all persist the parameter that was previously silently dropped. ### HSA interception (src/hsa_intercept.cpp) Add RtlMode::HIP enum value. Parse "hip" from RTL_MODE env var. When mode is HIP, OnLoad() skips queue intercept setup / signal pool / worker thread, registers shutdown handler, and returns. shutdown() branches early for HIP mode and calls hip_intercept::hip_profiler_drain(). Teardown-order invariant (worker.join() before DB close) preserved for HSA modes — HIP branch does not explicitly close the DB, relying on the existing lazy_init atexit handler in trace_db.cpp. ### HIP profiler drain (src/hip_intercept.cpp) Rewrite from placeholder to actual implementation. Declare the HipClrApiRecord / HipClrGpuActivity types locally (upstream header not yet shipped). dlopen libamdhip64.so with RTLD_NOLOAD (so we don't force-load HIP when the app never used it), try multiple .so name variants for ROCm 6 / 7 / symlink cases. dlsym the 5 hipClrProfiler* functions. Drain sequence: Disable() -> GetRecords() -> iterate into trace_db -> Reset(). Graceful fallback when symbols are missing. Maps dispatch/copy/barrier op codes to record_kernel / record_copy. ### Launcher (rocm_trace_lite/cli.py, cmd_trace.py) Add "hip" to --mode choices. When mode=hip, cmd_trace sets GPU_CLR_PROFILE=/dev/null in subprocess env. This lets the CLR profiler self-activate during hip::init() (librtl.so can't call hipClrProfilerEnable() from its OnLoad because HIP is mid-init at that point). /dev/null suppresses the profiler's own JSON autosave; rtl extracts records via GetRecords() and writes SQLite itself. ## Testing - make -j: clean build, no new dependencies (dl already linked) - make test-cpu: 216 passed, 34 skipped, 0 failed - test_source_guard still green (no roctracer/rocprofiler leaks) - Existing HSA mode behavior unchanged GPU validation on MI300X with a ROCm build that includes the CLR profiler patch is tracked separately and gated by the validation protocol from RFC-002 section "Kernel launch latency analysis". Refs #79, #84

Implements the design from RFC-002 (#84). Adds a new profiling mode that uses the HIP CLR built-in profiler API from ROCm/rocm-systems@5dc10a8 instead of HSA signal injection. Benefits over HSA modes: - Multi-process safe (no 0x1009 in ATOM/vLLM subprocess spawn) - CUDAGraph native (CLR profiler handles graph nodes, no batch skip) - HIP API CPU timeline (first time RTL can measure launch latency) - SDMA copy tracking with byte counts - Demangled kernel names native from HIP runtime (no symbol iteration) Add correlation_id column to rocpd_op and rocpd_api (default 0, backward compatible). Add indexes on correlation_id. Migration via ALTER TABLE ADD COLUMN for pre-existing trace files; ignored errors make it idempotent. Prepared statements now bind correlation_id; record_hip_api / record_kernel / record_copy / record_roctx all persist the parameter that was previously silently dropped. Add RtlMode::HIP enum value. Parse "hip" from RTL_MODE env var. When mode is HIP, OnLoad() skips queue intercept setup / signal pool / worker thread, registers shutdown handler, and returns. shutdown() branches early for HIP mode and calls hip_intercept::hip_profiler_drain(). Teardown-order invariant (worker.join() before DB close) preserved for HSA modes — HIP branch does not explicitly close the DB, relying on the existing lazy_init atexit handler in trace_db.cpp. Rewrite from placeholder to actual implementation. Declare the HipClrApiRecord / HipClrGpuActivity types locally (upstream header not yet shipped). dlopen libamdhip64.so with RTLD_NOLOAD (so we don't force-load HIP when the app never used it), try multiple .so name variants for ROCm 6 / 7 / symlink cases. dlsym the 5 hipClrProfiler* functions. Drain sequence: Disable() -> GetRecords() -> iterate into trace_db -> Reset(). Graceful fallback when symbols are missing. Maps dispatch/copy/barrier op codes to record_kernel / record_copy. Add "hip" to --mode choices. When mode=hip, cmd_trace sets GPU_CLR_PROFILE=/dev/null in subprocess env. This lets the CLR profiler self-activate during hip::init() (librtl.so can't call hipClrProfilerEnable() from its OnLoad because HIP is mid-init at that point). /dev/null suppresses the profiler's own JSON autosave; rtl extracts records via GetRecords() and writes SQLite itself. - make -j: clean build, no new dependencies (dl already linked) - make test-cpu: 216 passed, 34 skipped, 0 failed - test_source_guard still green (no roctracer/rocprofiler leaks) - Existing HSA mode behavior unchanged GPU validation on MI300X with a ROCm build that includes the CLR profiler patch is tracked separately and gated by the validation protocol from RFC-002 section "Kernel launch latency analysis". Refs #79, #84

sunway513 mentioned this pull request Apr 10, 2026

feat: HIP CLR profiler integration (hipProfiler*Ext v0.1.0) #85

Open

5 tasks

sunway513 marked this pull request as draft April 16, 2026 18:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC-002: HIP-level profiling mode (RTL_MODE=hip)#84

RFC-002: HIP-level profiling mode (RTL_MODE=hip)#84
sunway513 wants to merge 2 commits into
mainfrom
rfc/hip-profiling-mode

sunway513 commented Apr 10, 2026

Uh oh!

github-actions Bot commented Apr 10, 2026

Uh oh!

sunway513 commented Apr 10, 2026

Uh oh!

github-actions Bot commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sunway513 commented Apr 10, 2026

Summary

Key benefits over existing HSA modes

What's in the RFC

Upstream dependency

Requesting review

Test plan

Uh oh!

github-actions Bot commented Apr 10, 2026

🤖 GPT-5.4 Code Review

1. Dependency / compatibility risk

2. Correctness

3. Performance

4. Security / robustness

5. Scope / consistency

Uh oh!

sunway513 commented Apr 10, 2026

P1 — Persist API/GPU linkage (resolved)

P1 — Async launch ordering in gating test (resolved)

P2 — Timer-thread drain conflicts with upstream contract (resolved)

Uh oh!

github-actions Bot commented Apr 10, 2026

🤖 GPT-5.4 Code Review

Real issues

1. Dependency policy violation in docs

2. Fallback from RTL_MODE=hip to default is risky

3. dlopen(..., RTLD_NOLOAD) may incorrectly fail

4. Shutdown-only drain has real memory-risk

Things I did not flag

Recommendation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

2. Fallback from `RTL_MODE=hip` to `default` is risky

3. `dlopen(..., RTLD_NOLOAD)` may incorrectly fail