Refit for embedded use as well by mwootton · Pull Request #103 · sunway513/rocm-trace-lite

mwootton · 2026-05-14T03:02:43Z

Changes to allow the core tracing to be embedded into an application.

Based on registration, defer to callback functions to handle record output rather than trace_db.

github-actions · 2026-05-14T03:03:10Z

🤖 GPT-5.4 Code Review

NEEDS CHANGES

Real issues

1. Data race on callback registration/use

trace_db.cpp stores callbacks and user-data in plain globals:

static ApiEventCallback g_api_event_cb = nullptr;
static void* g_api_event_cb_data = nullptr;
...

These are read from app threads (hip_api_intercept.cpp) and the completion worker (hsa_intercept.cpp) while potentially being written by the embedder. That is a C++ data race/UB.

This matters even if README says “register before OnLoad”; the API itself exposes mutable global state and rtl_trigger_shutdown() is documented as idempotent/safe. At minimum, callback pointers and user-data should be atomics or protected by a lock, and reads should snapshot callback + userdata consistently.

2. Inconsistent shutdown behavior when only API callback is set

In hsa_intercept.cpp:

if (!trace_db::get_kernel_event_callback()) {
    trace_db::get_trace_db().flush();
    trace_db::get_trace_db().close();
}

and in OnLoad:

if (!trace_db::get_kernel_event_callback()) {
    (void)trace_db::get_trace_db();
}

But HIP API interception uses get_api_event_callback(). If an embedder sets only the API callback:

OnLoad still opens SQLite unnecessarily
shutdown still flushes/closes SQLite unnecessarily

That contradicts the README claim that “No SQLite file is created” when callbacks are set. The condition should consider either callback, not just kernel callback.

3. Callback lifetime contract is unsafe as implemented

Both callback record structs carry raw const char* pointers to transient storage:

HIP path passes stack buffers like char args[64]
kernel path passes name.c_str() from a local std::string

Those pointers are valid only for the duration of the callback invocation. The README does not state this explicitly, and embedders may reasonably retain the record or enqueue it asynchronously, causing use-after-scope bugs.

Not necessarily a bug in current internal use if callbacks consume synchronously, but the API contract is unsafe/underspecified and easy to misuse. Either:

document that strings are only valid during the callback, or
pass owned strings / lengths / stable storage.

Other checks

Dependency violations

No new references to roctracer, rocprofiler-sdk, or libroctx64 in this diff.

Performance

No obvious new hot-path allocation regressions beyond existing behavior. Callback path avoids SQLite, which is good.

Security

No new format-string issue seen. snprintf usage is bounded.

Recommendation

Fix the callback synchronization and the SQLite gating logic before merging. Also tighten/document the callback string lifetime contract.

Model: gpt-5.4

github-actions · 2026-05-14T03:29:19Z

🤖 GPT-5.4 Code Review

NEEDS CHANGES

Real issues

1. Data race on callback pointers/user-data

trace_db.h says callbacks are “read-only from hot paths via atomic loads”, but the implementation uses plain globals:

static ApiEventCallback g_api_event_cb = nullptr;
static void* g_api_event_cb_data = nullptr;
static KernelEventCallback g_kernel_event_cb = nullptr;
static void* g_kernel_event_cb_data = nullptr;

These are read from app threads and the completion worker without synchronization. Even if the intended contract is “set before OnLoad”, the code also exposes rtl_trigger_shutdown() and documents idempotent/multi-call behavior, so this should be made robust and match the comment/API contract.

Why it matters: unsynchronized cross-thread reads of non-atomic globals are UB in C++.

Fix: use std::atomic<ApiEventCallback>, std::atomic<KernelEventCallback>, and atomic void*/uintptr_t for user data, with acquire/release loads/stores.

2. `rtl_trigger_shutdown()` is not obviously idempotent/thread-safe

README claims:

Safe to call multiple times (idempotent).

But this patch only adds:

void rtl_trigger_shutdown() {
    hsa_intercept::shutdown();
}

Unless hsa_intercept::shutdown() is already guarded against concurrent/repeated calls, this is a correctness bug in the documented API. The diff doesn’t show any new once-guard.

Fix: wrap shutdown with std::once_flag or an atomic state machine and document whether concurrent calls are allowed.

3. Callback path can execute arbitrary user code on hot/intercept threads with no containment

Both:

HIP API wrappers call the API callback on the application thread
kernel completion worker calls the kernel callback on the worker thread

That may be intended, but there is no protection against callback reentrancy/exception propagation.

In particular, deliver_hip_api() invokes user callback while ScopedReentrancyGuard is active, which avoids self-recursion into HIP wrappers, but:

a throwing C++ callback will unwind through C-facing interceptors / worker thread
a blocking callback can stall the completion worker and back up dispatch processing

Fix: at minimum, document callbacks must be noexcept and non-blocking; better, catch all exceptions around callback invocation and drop/report the event.

Other observations

Dependency violations

No new references to roctracer, rocprofiler-sdk, or libroctx64 in this diff.

Performance

completion_worker() still does std::string name = lookup_kernel_name(...) before checking whether callback/DB path is used. That’s pre-existing shape, but for embedded mode this still allocates/copies on every completion. Not a blocker unless lookup_kernel_name already returns interned storage.
No new thread-per-dispatch issue introduced here.

Security

snprintf usage looks bounded and format strings are constant.
No obvious buffer overflow introduced in this patch.

Style

Generally consistent, but the header comment says “atomic loads” while implementation is non-atomic; that inconsistency should be fixed with the race above.

Model: gpt-5.4

github-actions · 2026-05-14T03:35:51Z

🤖 GPT-5.4 Code Review

NEEDS CHANGES

Real issues

1. Data race on callback pointers/user_data

trace_db.cpp stores callbacks in plain globals:

static ApiEventCallback g_api_event_cb = nullptr;
static void* g_api_event_cb_data = nullptr;

and reads them from multiple threads/hot paths (hip_api_intercept.cpp, hsa_intercept.cpp).

The comment says registration happens before OnLoad, but the public API does not enforce that, and the README explicitly describes embedding into another library where init ordering can be subtle. As written, any late registration/clear is UB. Also, callback and user_data are read separately, so even with “mostly correct” ordering you can observe mismatched pairs.

Fix: store callback + user_data atomically as one immutable struct pointer, or use atomics for both with acquire/release semantics and document that changing after startup is unsupported.

2. Exceptions escaping callbacks can terminate/interfere with runtime threads

Both callback paths invoke embedder code directly:

HIP wrappers on app threads
HSA completion worker thread

If a C++ callback throws through these interception frames, behavior is bad at best and can terminate the process. The header comment says “must be noexcept”, but this is not enforced.

Fix: wrap callback invocation in try/catch (...) and drop/log on failure. Especially important on the completion worker thread.

3. `rtl_trigger_shutdown()` exported from `trace_db` can race with in-flight HIP API callbacks

The README claims:

After it returns, no more callbacks will fire.

But rtl_trigger_shutdown() only forwards to hsa_intercept::shutdown(), which joins the completion worker and drains HSA-side state. It does not stop application threads from continuing to enter HIP wrappers and calling deliver_hip_api() concurrently after shutdown returns.

So the documented guarantee is false unless shutdown also disables HIP API recording globally and callers ensure no app threads are still running through wrappers.

Fix: either:

make shutdown disable HIP interception/callback delivery before returning, or
weaken the API contract/docs to only guarantee no more kernel completion callbacks.

Other checks

Dependency violations

No new references to roctracer, rocprofiler-sdk, or libroctx64 in this diff.

Performance

No major regression in this diff. Callback path avoids SQLite work. The kernel path still allocates std::string name = lookup_kernel_name(...); if that was already present, this PR doesn’t worsen it.

Security

No obvious buffer overflow/format-string issue introduced. snprintf usage remains bounded.

Style

Generally consistent with existing code.

Model: gpt-5.4

sunway513 · 2026-05-14T15:16:53Z

Test report — PR #103 (refit for embedded use)

Verdict: ✅ ready to merge. Overhead-neutral vs main, new callback API works as documented, all CPU tests pass.

Tested on mi355-gpu-15 (8× MI355X) inside rocm/atom-dev:latest (2026-04-24). PR head e9cafb59, base dde1e969 (main HEAD).

Build

Clean dep chain: libhsa-runtime64, libsqlite3 only (no roctracer / rocprofiler-sdk / libamdhip64) ✅
librtl.so size: 1300072 → 1304000 bytes (+3928 bytes for new callback path)
New exported symbols verified via nm -D:
- trace_db::set_api_event_callback
- trace_db::set_kernel_event_callback
- trace_db::rtl_trigger_shutdown

CPU tests

pytest tests/: 243 passed, 47 skipped ✅
ruff check: clean ✅
Source guard tests pass (no roctracer / rocprofiler-sdk references) ✅

GPU smoke (50× 1024×1024 matmul, RTL lite)

	intercept	inject	recorded	DB size
main `librtl.so`	57	55	55	49 KB
PR `librtl.so` (default)	57	55	55	49 KB
PR `librtl.so` + LD_PRELOAD callback	57	55	55	0 KB

Default path: bit-for-bit identical to main. With a callback registered, SQLite write is suppressed and 55 callback invocations fire instead — confirms the "instead of TraceDB" semantics in the header comment.

E2E serving overhead — GPT-OSS 120B TP=8

ATOM serving + benchmark_serving, 64 prompts × ISL 1024 × OSL 128, max-concurrency 8, ignore-eos, 4 warmups.

phase          dur(s)  out_tok/s   med_ttft   med_itl   p99_itl  mean_tpot
baseline         3.98     2057.3      85.60     3.168     4.172      3.229
main_rtl         5.67     1444.2     103.98     3.217    14.546      4.748
pr103_rtl        5.62     1458.9      99.03     3.224    14.574      4.768

PR vs main (apples-to-apples, both RTL lite):

output_throughput: +1.0%
median_itl: +0.2%
mean_tpot: +0.4%
median_ttft: −4.8%

All deltas inside run-to-run noise on a 4–6 second wall. Refactor is overhead-neutral. ✅

Gap (non-blocking)

PR adds three exported C++ symbols (set_api_event_callback, set_kernel_event_callback, rtl_trigger_shutdown) but no test or example exercising them. Per our convention (every major code change ships docs + test), please add a minimal tests/test_embedded_callback.cpp or examples/embedded_consumer.cpp. Happy to drop in the 30-line LD_PRELOAD harness I used for validation if helpful.

RTL-vs-baseline aside (not a PR issue)

Both RTL phases show ~30% wall-time penalty on this short workload (4s → 5.7s), driven by mean_tpot +47% and p99_itl 3.5× — but median_itl only +1.5%. The gap suggests a tail of high-TPOT events under signal injection. This affects main and PR equally, so it's not a regression for this PR. Worth a separate look at the mean/p99 ITL spread on short-context decode paths.

github-actions · 2026-05-15T15:14:22Z

🤖 GPT-5.4 Code Review

NEEDS CHANGES

Real issues

1. Data race on callback pointers/user_data

trace_db.cpp stores callbacks in plain globals and reads them from multiple threads:

app threads in HIP wrappers
completion worker thread in hsa_intercept.cpp
shutdown path

The comment says registration happens before OnLoad, but the API is public and nothing enforces that. More importantly, reads can race with teardown/unset or late registration by an embedder. This is UB in C++.

Where

src/trace_db.cpp
- static ApiEventCallback g_api_event_cb = nullptr;
- static void* g_api_event_cb_data = nullptr;
- static KernelEventCallback g_kernel_event_cb = nullptr;
- static void* g_kernel_event_cb_data = nullptr;

Fix
Use atomics for callback pointers and user-data, or encapsulate both in one immutable struct published atomically.

2. Callback exceptions can unwind through C/HSA/HIP interception paths

The header documents callbacks “must be noexcept”, but the implementation does not enforce containment. If an embedder throws from either callback, that exception can unwind through:

HIP interposed C ABI functions
HSA completion worker thread

That can terminate the process or corrupt shutdown behavior.

Where

src/hip_api_intercept.cpp in deliver_hip_api()
src/hsa_intercept.cpp in completion_worker() callback path

Fix
Wrap callback invocation in try/catch (...) and drop/log on failure.

3. `rtl_trigger_shutdown()` can run before `OnLoad` and force DB creation unexpectedly

The README says embedders can call rtl_trigger_shutdown() safely/idempotently. But shutdown() now skips DB flush/close only if callbacks are set. If rtl_trigger_shutdown() is called before OnLoad and before callbacks are registered, this path still does:

trace_db::get_trace_db().flush();
trace_db::get_trace_db().close();

That can lazily create/open the SQLite DB during shutdown, which violates the documented “no SQLite file is created” behavior for embedding scenarios.

Where

src/hsa_intercept.cpp shutdown()

Fix
Guard flush/close on actual DB initialization state, not just callback presence.

Other checks

Dependency violations

No new references to:

roctracer
rocprofiler-sdk
libroctx64

Good.

Performance

No obvious new hot-path heap allocation beyond existing lookup_kernel_name() string creation. Callback dispatch itself is fine. No thread-per-dispatch issue introduced.

Security

No format-string bug introduced. snprintf usage here is bounded. No obvious overflow in this diff.

Style

Consistent with existing code overall.

Model: gpt-5.4

sunway513 · 2026-05-20T06:34:07Z

@mwootton friendly ping — the embedded-tests commit (8c9de8f) added the test we asked for (thanks!) but landed three blockers. Status today:

Lint (trivial): tests/test_embedded_callback.py has 3 unused imports (subprocess, tempfile, pytest) — ruff check --fix will clear it.

GPT-5.4 review (NEEDS CHANGES) — three real correctness items on the new callback API:

Data race — g_api_event_cb / g_kernel_event_cb / their _data slots are plain globals, read from app threads + completion worker. Header says "register before OnLoad" but nothing enforces it. UB in C++. Fix: std::atomic for the pointers, or atomically publish an immutable struct.
Exception safety — header says callbacks "must be noexcept" but the call sites in deliver_hip_api() and completion_worker() don't enforce it. An embedder throw unwinds through HIP C ABI / HSA worker → process abort. Fix: try { cb(ev, ud); } catch (...) { drop+log; }.
rtl_trigger_shutdown() lazy-creates SQLite — if called before OnLoad and before any callback is registered, shutdown() still hits get_trace_db().flush()/.close(), which materializes the SQLite file. Violates the documented "no SQLite file in embed mode" contract. Fix: guard flush/close on actual DB init state, not callback presence.

Once those are in, I'll rerun the full recipe (~30 min on mi355-gpu-15) against the new head and update the verdict. Build + GPU 8× tests are already passing, so I expect the fixes to be ~50 LOC and the re-test to come back green.

Refit for embedded use as well

381b79d

Fix a race. Avoid using sql during callback mode

664d686

Try again. Missed a file.

e9cafb5

Add embedded tests

8c9de8f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refit for embedded use as well#103

Refit for embedded use as well#103
mwootton wants to merge 4 commits into
mainfrom
users/mwootton/rpd_changes

mwootton commented May 14, 2026

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

sunway513 commented May 14, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

sunway513 commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mwootton commented May 14, 2026

Uh oh!

github-actions Bot commented May 14, 2026

🤖 GPT-5.4 Code Review

Real issues

1. Data race on callback registration/use

2. Inconsistent shutdown behavior when only API callback is set

3. Callback lifetime contract is unsafe as implemented

Other checks

Dependency violations

Performance

Security

Recommendation

Uh oh!

github-actions Bot commented May 14, 2026

🤖 GPT-5.4 Code Review

Real issues

1. Data race on callback pointers/user-data

2. rtl_trigger_shutdown() is not obviously idempotent/thread-safe

3. Callback path can execute arbitrary user code on hot/intercept threads with no containment

Other observations

Dependency violations

Performance

Security

Style

Uh oh!

github-actions Bot commented May 14, 2026

🤖 GPT-5.4 Code Review

Real issues

1. Data race on callback pointers/user_data

2. Exceptions escaping callbacks can terminate/interfere with runtime threads

3. rtl_trigger_shutdown() exported from trace_db can race with in-flight HIP API callbacks

Other checks

Dependency violations

Performance

Security

Style

Uh oh!

sunway513 commented May 14, 2026

Test report — PR #103 (refit for embedded use)

Build

CPU tests

GPU smoke (50× 1024×1024 matmul, RTL lite)

E2E serving overhead — GPT-OSS 120B TP=8

Gap (non-blocking)

RTL-vs-baseline aside (not a PR issue)

Uh oh!

github-actions Bot commented May 15, 2026

🤖 GPT-5.4 Code Review

Real issues

1. Data race on callback pointers/user_data

2. Callback exceptions can unwind through C/HSA/HIP interception paths

3. rtl_trigger_shutdown() can run before OnLoad and force DB creation unexpectedly

Other checks

Dependency violations

Performance

Security

Style

Uh oh!

sunway513 commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

2. `rtl_trigger_shutdown()` is not obviously idempotent/thread-safe

3. `rtl_trigger_shutdown()` exported from `trace_db` can race with in-flight HIP API callbacks

3. `rtl_trigger_shutdown()` can run before `OnLoad` and force DB creation unexpectedly