Skip to content

RFC-002: HIP-level profiling mode (RTL_MODE=hip)#84

Draft
sunway513 wants to merge 2 commits into
mainfrom
rfc/hip-profiling-mode
Draft

RFC-002: HIP-level profiling mode (RTL_MODE=hip)#84
sunway513 wants to merge 2 commits into
mainfrom
rfc/hip-profiling-mode

Conversation

@sunway513
Copy link
Copy Markdown
Owner

Summary

Proposes a new RTL_MODE=hip profiling mode that uses the HIP CLR built-in profiler API (hipClrProfiler*) from ROCm/rocm-systems@5dc10a8 instead of HSA signal injection. Addresses the limitations documented in #79.

This is an RFC only — no code changes yet. Requesting review on the design before implementation starts.

Key benefits over existing HSA modes

  • Multi-process safe — no more HSA_STATUS_ERROR_INVALID_PACKET_FORMAT (0x1009) in ATOM/vLLM subprocess spawn (the Add trace_matmul.py example #1 blocker from Feature Request: HIP-level kernel completion callback for non-invasive profiling #79)
  • CUDAGraph native — CLR profiler handles graph nodes, no need for batch skip
  • Demangled kernel names — native from HIP runtime, including Triton JIT kernels
  • HIP API timeline — CPU start/end timestamps for every HIP runtime call
  • SDMA copy tracking — memcpy operations with byte counts and GPU timing
  • Kernel launch latency — first time RTL can independently measure this (CPU + GPU timestamps per dispatch)

What's in the RFC

  • Architecture and 4-phase implementation plan
  • File-by-file change list
  • Alternatives considered (callback API, GPU_CLR_PROFILE direct use, roctracer)
  • Kernel launch latency analysis — what we can compute, clock-domain caveat, 4-test validation protocol, go/no-go decision point for the feature
  • Overhead benchmark matrix — extending benchmarks/overhead_bench.py with hip mode across 5 workloads
  • E2E validation plan — 6 scenarios including vLLM GPT-OSS 120B, ATOM DeepSeek-R1, long-run soak test for CLR profiler memory accumulation
  • 6 open questions, including the go/no-go for launch latency feature

Upstream dependency

Needs ROCm/rocm-systems@5dc10a8 to be available. RTL will dlopen the HIP library and probe for the 5 hipClrProfiler* symbols; if absent, falls back gracefully to RTL_MODE=default.

Requesting review

Particularly interested in feedback on:

  1. Clock-domain alignment validation strategy (section "Kernel launch latency analysis")
  2. Periodic drain strategy for long-running workloads (open question Fix timestamp domain mismatch between host and GPU events #4)
  3. Whether the fallback path is acceptable or we should hard-fail when the API is missing

Closes #79 (on merge of implementation PR, not this RFC).

Test plan

  • Review RFC structure and content
  • Confirm upstream CLR profiler API timeline with HIP team
  • Validate clock-domain alignment assumption with test workload on MI300X
  • Iterate on open questions before implementation starts

Proposes a new profiling mode that uses the HIP CLR built-in profiler
API (hipClrProfiler*) instead of HSA signal injection. Addresses the
key limitations documented in issue #79:

- Multi-process crashes (ATOM/vLLM subprocess spawn) — no more 0x1009
- CUDAGraph interference — CLR profiler handles graph nodes natively
- Kernel name gaps (Triton JIT) — native demangled names from HIP
- No HIP API timeline — CPU start/end timestamps per API call
- No SDMA copy tracking — copies captured with byte counts
- No launch latency measurement — CPU and GPU timestamps per dispatch

Upstream dependency: ROCm/rocm-systems@5dc10a8 (German Andryeyev's
clr: add internal profiler commit).

Includes:
- Architecture and implementation plan (4 phases)
- Kernel launch latency analysis with clock-domain validation protocol
- Overhead benchmark matrix and e2e validation plan
- Long-run soak test for CLR profiler memory accumulation
- 6 open questions including go/no-go for launch latency feature
@github-actions
Copy link
Copy Markdown

🤖 GPT-5.4 Code Review

MINOR ISSUES

Only docs changed, so no code-level dependency/correctness bugs to verify. A few real concerns in the proposed design:

1. Dependency / compatibility risk

  • Potential dependency-policy violation if hip/amd_detail/hip_clr_profiler_ext.h is included at build time.
    The RFC says “optional runtime capability (dlopen + dlsym probe)” and “Makefile no changes,” which is good, but this must remain true in implementation. If the code includes that header unconditionally, RTL will gain a build-time dependency on a non-stable HIP internal header.
    Ask: keep it fully runtime-resolved, with local typedefs guarded in RTL code.

  • libamdhip64.so + internal symbols are not stable ABI.
    Not a blocker, but the RFC should be clearer that this is experimental and version-gated. kHipClrApiNames[] especially sounds like an internal data symbol that may disappear or get hidden.

2. Correctness

  • dlopen("libamdhip64.so", RTLD_NOW | RTLD_NOLOAD) may fail on systems where the loaded soname is versioned.
    Real-world HIP loads are often via libamdhip64.so.<ver>. RTLD_NOLOAD with the unversioned name is not guaranteed to find an already loaded versioned object. The fallback behavior is fine, but this may cause false “API not available” on supported systems.
    Ask: document a more robust probe strategy.

  • Shutdown-only drain is risky for long-running jobs.
    The RFC itself notes unbounded in-memory accumulation. If implementation follows shutdown-only collection, correctness degrades into OOM / lost profiling on long runs.
    Ask: periodic drain should be part of the initial design, not an open question, unless CLR guarantees bounded memory.

  • hipClrProfilerDisable() “drains in-flight GPU work internally” is assumed, not verified.
    If that assumption is wrong, shutdown may either block unexpectedly long or miss tail records. This needs validation before relying on it in OnUnload().

3. Performance

  • Timer thread / periodic drain could contend with SQLite if writes happen directly from that thread.
    Not a blocker, but if added, it should preserve the current DB threading model. SQLite connection ownership/thread mode must stay explicit.

  • Shutdown-time bulk conversion may create large latency spikes.
    For very large traces, converting all CLR records to DB rows only at exit can make teardown expensive and memory-heavy. This reinforces the need for incremental draining.

4. Security / robustness

  • Do not trust kernel_name / API name pointers blindly.
    Since these come from runtime-owned buffers, implementation should null-check and copy defensively before reset/unload. If Reset() invalidates pointers, using them after reset would be a UAF bug.

  • Unchecked dlsym for data symbols is easy to get wrong.
    The RFC proposes resolving kHipClrApiNames and count. Implementation must check both symbol presence and bounds before indexing by api_id.

5. Scope / consistency

  • The RFC says “4 profiler symbols” but lists 5 APIs including hipClrProfilerWriteJson().
    Minor doc inconsistency; likely intentional since JSON export is “bonus,” but should be clarified.

Overall: direction is fine and does not introduce forbidden roctracer/rocprofiler-sdk/libroctx64 dependencies in this diff, but the implementation should be careful around runtime symbol probing, bounded memory, and pointer lifetime.


Model: gpt-5.4

…n safety

Addresses three P1/P2 review findings on the RFC:

P1 — Persist API/GPU linkage in TraceDB schema

The original RFC claimed no TraceDB changes were needed, but the
existing schema has no persisted correlation_id column on rocpd_op or
rocpd_api, and record_hip_api/record_kernel/record_copy silently drop
the correlation_id argument. Without persistence, CPU->GPU correlation
is lost on write and launch-latency queries / Perfetto arrows cannot be
generated as described.

Fix: add a new Phase 2 that introduces an additive correlation_id
column to rocpd_op and rocpd_api plus indexes, updates the INSERT
prepared statements to bind the parameter, and adds the corresponding
SQL join for the converter. Backward compatible (default 0); older
traces render via the pre-change code path.

P1 — Launch-latency gating test used impossible async ordering

The validation protocol asserted cpu_start < gpu_begin < gpu_end <
cpu_end for hipLaunchKernel, but hipLaunchKernel is asynchronous:
cpu_end is the launch API return time, and gpu_end normally occurs
after it. As written, the gate would reject a correct implementation.

Fix: rewrite Test 1 to use a bounding sync call. Assert causality
cpu_start_L < gpu_begin and cpu_end_L < cpu_end_S (sync cannot return
before the kernel completes), explicitly note that gpu_end > cpu_end_L
is the expected async ordering, and remove the bogus assertion.

P2 — Timer-thread drain conflicts with upstream CLR contract

hip_clr_profiler_ext.h documents that the records buffer is
profiler-owned, valid only until Reset() or unload, and that callers
should process records before issuing further HIP calls. A background
timer drain while the app runs races with record insertion and
invalidates buffers.

Fix: drop the "drain every 30s from a timer thread" recommendation.
Plan of record is shutdown-only drain for v1, measured by the
30-minute vLLM soak test. If soak shows unacceptable memory growth,
escalate to a cooperative stop/drain/resume via roctx marker or
hipDeviceSynchronize interception, or request a streaming API
upstream. Explicitly rule out a background timer thread as unsafe.
@sunway513
Copy link
Copy Markdown
Owner Author

Thanks for the review. All three findings are valid. Pushed 43eb115 to address them.

P1 — Persist API/GPU linkage (resolved)

Verified the finding against src/trace_db.cpp:

  • rocpd_op schema (line 82-92) and rocpd_api (line 93-101) have no correlation_id column
  • record_hip_api (line 281), record_kernel (line 307), record_copy (line 331) all accept a correlation_id parameter but never bind it in the INSERT statements
  • rocpd_api_ops junction table is declared but unused

Fix: added a new Phase 2 to the implementation plan that introduces an additive correlation_id INTEGER DEFAULT 0 column on both rocpd_op and rocpd_api, plus indexes, and updates the writers to bind the parameter. Backward compatible — existing readers see 0 and fall back. The converter's Perfetto flow events are driven by a SQL join on correlation_id. File-changes table updated to reflect that trace_db.cpp is a required change, not "no changes".

P1 — Async launch ordering in gating test (resolved)

You're right — hipLaunchKernel is async, so cpu_end_L is the launch-API return time and gpu_end > cpu_end_L is the correct ordering for any working implementation. The original Test 1 would have rejected a perfectly healthy profiler.

Fix: rewrote Test 1 to use a bounding sync call. New assertions:

  • cpu_start_L < gpu_begin (causality)
  • cpu_end_L < cpu_end_S (the sync cannot return before the kernel completes)
  • gpu_end < cpu_end_S (the real upper-bound guarantee)

Explicitly noted in the RFC that gpu_end > cpu_end_L is expected async ordering and must NOT be asserted. Added an implementation note about pairing the launch and sync records by thread-local interleaving order in the HipClrApiRecord stream.

P2 — Timer-thread drain conflicts with upstream contract (resolved)

Re-read hip_clr_profiler_ext.h in ROCm/rocm-systems@5dc10a8:

The returned pointer is owned by the profiler. It remains valid until the next call to hipClrProfilerReset() or library unload. [...] callers should process records before issuing further HIP calls when profiling is active.

A background timer calling GetRecords / Reset while the app runs races with record insertion and invalidates the export buffer. The original recommendation was unsafe.

Fix: dropped the timer-thread plan. Plan of record is now:

  • v1 (default): shutdown-only drain. Simple, safe, correct.
  • Soak test as the gate: the existing 30-minute vLLM run in the validation plan measures actual RSS growth. If it stays under budget, v1 ships as-is.
  • v2 (only if soak fails): cooperative stop/drain/resume triggered by roctx marker OR hipDeviceSynchronize interception (device is known-idle at that moment), OR upstream streaming API request. Explicitly rules out the timer-thread approach as unsafe.

Also bumped the Phase numbering throughout (old Phase 2 → Phase 3, etc.) to make room for the new schema phase.

@github-actions
Copy link
Copy Markdown

🤖 GPT-5.4 Code Review

MINOR ISSUES

Real issues

1. Dependency policy violation in docs

The RFC explicitly proposes/mentions:

  • roctx integration (“No roctx markers integration”, “roctx marker rtl_drain”)
  • rocprof / rocprofiler-sdk as comparison points

Per your review criteria, the hard violation to watch is actual dependency/reference to roctracer, rocprofiler-sdk, or libroctx64. This diff is docs-only, so there is no code dependency introduced, but it does reference rocprofiler-sdk and roctx/libroctx64 concepts. If the project policy is “no references at all,” this should be removed or reworded. If the policy is “no build/runtime dependency,” then this is fine.

2. Fallback from RTL_MODE=hip to default is risky

The RFC says if HIP CLR symbols are unavailable, fall back to RTL_MODE=default.

That is not obviously safe:

  • user may have selected hip specifically to avoid the known HSA-mode multiprocess crash / graph interference
  • silent behavioral downgrade can reintroduce the exact failure mode they were avoiding

Better behavior: fail closed for RTL_MODE=hip unless an explicit opt-in fallback is set.

3. dlopen(..., RTLD_NOLOAD) may incorrectly fail

The RFC assumes:

dlopen("libamdhip64.so", RTLD_NOW | RTLD_NOLOAD);

This is brittle because the loaded SONAME may differ (libamdhip64.so.6, path-qualified load, etc.). In that case RTLD_NOLOAD can fail even when HIP is already present.

If implemented as written, this is a correctness issue in capability detection. At minimum the RFC should call out SONAME/version handling or a secondary probe strategy.

4. Shutdown-only drain has real memory-risk

The RFC acknowledges profiler-side accumulation, but the plan of record is still “shutdown-only drain.” For long-running services this is a real performance/stability concern, not just an open question:

  • unbounded record growth
  • large final shutdown latency
  • possible OOM on high-dispatch workloads

Given the upstream contract, background draining is unsafe, but the RFC should more strongly constrain v1 scope to bounded jobs/tests only, or require a documented memory cap / disable behavior.

Things I did not flag

  • No actual roctracer dependency introduced in this diff.
  • No SQLite/thread-safety/signal-handling bug in code, because this is docs-only.
  • No concrete HSA API misuse in the diff, though the proposed HIP/HSA lifecycle split should be reviewed carefully once code lands.

Recommendation

Acceptable as an RFC, but before implementation:

  1. make hip mode failure explicit instead of auto-fallback to default
  2. tighten library probing semantics beyond bare RTLD_NOLOAD
  3. clearly bound or mitigate shutdown-only memory growth
  4. clarify whether “no dependency” also forbids doc references to roctx/rocprofiler-sdk

Model: gpt-5.4

sunway513 added a commit that referenced this pull request Apr 10, 2026
Implements the design from RFC-002 (#84). Adds a new profiling mode
that uses the HIP CLR built-in profiler API from
ROCm/rocm-systems@5dc10a8 instead of HSA signal injection.

Benefits over HSA modes:
- Multi-process safe (no 0x1009 in ATOM/vLLM subprocess spawn)
- CUDAGraph native (CLR profiler handles graph nodes, no batch skip)
- HIP API CPU timeline (first time RTL can measure launch latency)
- SDMA copy tracking with byte counts
- Demangled kernel names native from HIP runtime (no symbol iteration)

## Changes

### TraceDB schema (src/trace_db.cpp)
Add correlation_id column to rocpd_op and rocpd_api (default 0,
backward compatible). Add indexes on correlation_id. Migration via
ALTER TABLE ADD COLUMN for pre-existing trace files; ignored errors
make it idempotent. Prepared statements now bind correlation_id;
record_hip_api / record_kernel / record_copy / record_roctx all
persist the parameter that was previously silently dropped.

### HSA interception (src/hsa_intercept.cpp)
Add RtlMode::HIP enum value. Parse "hip" from RTL_MODE env var. When
mode is HIP, OnLoad() skips queue intercept setup / signal pool / worker
thread, registers shutdown handler, and returns. shutdown() branches
early for HIP mode and calls hip_intercept::hip_profiler_drain().
Teardown-order invariant (worker.join() before DB close) preserved
for HSA modes — HIP branch does not explicitly close the DB, relying
on the existing lazy_init atexit handler in trace_db.cpp.

### HIP profiler drain (src/hip_intercept.cpp)
Rewrite from placeholder to actual implementation. Declare the
HipClrApiRecord / HipClrGpuActivity types locally (upstream header
not yet shipped). dlopen libamdhip64.so with RTLD_NOLOAD (so we don't
force-load HIP when the app never used it), try multiple .so name
variants for ROCm 6 / 7 / symlink cases. dlsym the 5 hipClrProfiler*
functions. Drain sequence: Disable() -> GetRecords() -> iterate into
trace_db -> Reset(). Graceful fallback when symbols are missing.
Maps dispatch/copy/barrier op codes to record_kernel / record_copy.

### Launcher (rocm_trace_lite/cli.py, cmd_trace.py)
Add "hip" to --mode choices. When mode=hip, cmd_trace sets
GPU_CLR_PROFILE=/dev/null in subprocess env. This lets the CLR
profiler self-activate during hip::init() (librtl.so can't call
hipClrProfilerEnable() from its OnLoad because HIP is mid-init at
that point). /dev/null suppresses the profiler's own JSON autosave;
rtl extracts records via GetRecords() and writes SQLite itself.

## Testing

- make -j: clean build, no new dependencies (dl already linked)
- make test-cpu: 216 passed, 34 skipped, 0 failed
- test_source_guard still green (no roctracer/rocprofiler leaks)
- Existing HSA mode behavior unchanged

GPU validation on MI300X with a ROCm build that includes the CLR
profiler patch is tracked separately and gated by the validation
protocol from RFC-002 section "Kernel launch latency analysis".

Refs #79, #84
sunway513 added a commit that referenced this pull request Apr 11, 2026
Implements the design from RFC-002 (#84). Adds a new profiling mode
that uses the HIP CLR built-in profiler API from
ROCm/rocm-systems@5dc10a8 instead of HSA signal injection.

Benefits over HSA modes:
- Multi-process safe (no 0x1009 in ATOM/vLLM subprocess spawn)
- CUDAGraph native (CLR profiler handles graph nodes, no batch skip)
- HIP API CPU timeline (first time RTL can measure launch latency)
- SDMA copy tracking with byte counts
- Demangled kernel names native from HIP runtime (no symbol iteration)

Add correlation_id column to rocpd_op and rocpd_api (default 0,
backward compatible). Add indexes on correlation_id. Migration via
ALTER TABLE ADD COLUMN for pre-existing trace files; ignored errors
make it idempotent. Prepared statements now bind correlation_id;
record_hip_api / record_kernel / record_copy / record_roctx all
persist the parameter that was previously silently dropped.

Add RtlMode::HIP enum value. Parse "hip" from RTL_MODE env var. When
mode is HIP, OnLoad() skips queue intercept setup / signal pool / worker
thread, registers shutdown handler, and returns. shutdown() branches
early for HIP mode and calls hip_intercept::hip_profiler_drain().
Teardown-order invariant (worker.join() before DB close) preserved
for HSA modes — HIP branch does not explicitly close the DB, relying
on the existing lazy_init atexit handler in trace_db.cpp.

Rewrite from placeholder to actual implementation. Declare the
HipClrApiRecord / HipClrGpuActivity types locally (upstream header
not yet shipped). dlopen libamdhip64.so with RTLD_NOLOAD (so we don't
force-load HIP when the app never used it), try multiple .so name
variants for ROCm 6 / 7 / symlink cases. dlsym the 5 hipClrProfiler*
functions. Drain sequence: Disable() -> GetRecords() -> iterate into
trace_db -> Reset(). Graceful fallback when symbols are missing.
Maps dispatch/copy/barrier op codes to record_kernel / record_copy.

Add "hip" to --mode choices. When mode=hip, cmd_trace sets
GPU_CLR_PROFILE=/dev/null in subprocess env. This lets the CLR
profiler self-activate during hip::init() (librtl.so can't call
hipClrProfilerEnable() from its OnLoad because HIP is mid-init at
that point). /dev/null suppresses the profiler's own JSON autosave;
rtl extracts records via GetRecords() and writes SQLite itself.

- make -j: clean build, no new dependencies (dl already linked)
- make test-cpu: 216 passed, 34 skipped, 0 failed
- test_source_guard still green (no roctracer/rocprofiler leaks)
- Existing HSA mode behavior unchanged

GPU validation on MI300X with a ROCm build that includes the CLR
profiler patch is tracked separately and gated by the validation
protocol from RFC-002 section "Kernel launch latency analysis".

Refs #79, #84
@sunway513 sunway513 marked this pull request as draft April 16, 2026 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: HIP-level kernel completion callback for non-invasive profiling

1 participant