Skip to content

Refit for embedded use as well#103

Open
mwootton wants to merge 4 commits into
mainfrom
users/mwootton/rpd_changes
Open

Refit for embedded use as well#103
mwootton wants to merge 4 commits into
mainfrom
users/mwootton/rpd_changes

Conversation

@mwootton
Copy link
Copy Markdown
Owner

Changes to allow the core tracing to be embedded into an application.

Based on registration, defer to callback functions to handle record output rather than trace_db.

@github-actions
Copy link
Copy Markdown

🤖 GPT-5.4 Code Review

NEEDS CHANGES

Real issues

1. Data race on callback registration/use

trace_db.cpp stores callbacks and user-data in plain globals:

static ApiEventCallback g_api_event_cb = nullptr;
static void* g_api_event_cb_data = nullptr;
...

These are read from app threads (hip_api_intercept.cpp) and the completion worker (hsa_intercept.cpp) while potentially being written by the embedder. That is a C++ data race/UB.

This matters even if README says “register before OnLoad”; the API itself exposes mutable global state and rtl_trigger_shutdown() is documented as idempotent/safe. At minimum, callback pointers and user-data should be atomics or protected by a lock, and reads should snapshot callback + userdata consistently.

2. Inconsistent shutdown behavior when only API callback is set

In hsa_intercept.cpp:

if (!trace_db::get_kernel_event_callback()) {
    trace_db::get_trace_db().flush();
    trace_db::get_trace_db().close();
}

and in OnLoad:

if (!trace_db::get_kernel_event_callback()) {
    (void)trace_db::get_trace_db();
}

But HIP API interception uses get_api_event_callback(). If an embedder sets only the API callback:

  • OnLoad still opens SQLite unnecessarily
  • shutdown still flushes/closes SQLite unnecessarily

That contradicts the README claim that “No SQLite file is created” when callbacks are set. The condition should consider either callback, not just kernel callback.

3. Callback lifetime contract is unsafe as implemented

Both callback record structs carry raw const char* pointers to transient storage:

  • HIP path passes stack buffers like char args[64]
  • kernel path passes name.c_str() from a local std::string

Those pointers are valid only for the duration of the callback invocation. The README does not state this explicitly, and embedders may reasonably retain the record or enqueue it asynchronously, causing use-after-scope bugs.

Not necessarily a bug in current internal use if callbacks consume synchronously, but the API contract is unsafe/underspecified and easy to misuse. Either:

  • document that strings are only valid during the callback, or
  • pass owned strings / lengths / stable storage.

Other checks

Dependency violations

No new references to roctracer, rocprofiler-sdk, or libroctx64 in this diff.

Performance

No obvious new hot-path allocation regressions beyond existing behavior. Callback path avoids SQLite, which is good.

Security

No new format-string issue seen. snprintf usage is bounded.

Recommendation

Fix the callback synchronization and the SQLite gating logic before merging. Also tighten/document the callback string lifetime contract.


Model: gpt-5.4

@github-actions
Copy link
Copy Markdown

🤖 GPT-5.4 Code Review

NEEDS CHANGES

Real issues

1. Data race on callback pointers/user-data

trace_db.h says callbacks are “read-only from hot paths via atomic loads”, but the implementation uses plain globals:

static ApiEventCallback g_api_event_cb = nullptr;
static void* g_api_event_cb_data = nullptr;
static KernelEventCallback g_kernel_event_cb = nullptr;
static void* g_kernel_event_cb_data = nullptr;

These are read from app threads and the completion worker without synchronization. Even if the intended contract is “set before OnLoad”, the code also exposes rtl_trigger_shutdown() and documents idempotent/multi-call behavior, so this should be made robust and match the comment/API contract.

Why it matters: unsynchronized cross-thread reads of non-atomic globals are UB in C++.

Fix: use std::atomic<ApiEventCallback>, std::atomic<KernelEventCallback>, and atomic void*/uintptr_t for user data, with acquire/release loads/stores.


2. rtl_trigger_shutdown() is not obviously idempotent/thread-safe

README claims:

Safe to call multiple times (idempotent).

But this patch only adds:

void rtl_trigger_shutdown() {
    hsa_intercept::shutdown();
}

Unless hsa_intercept::shutdown() is already guarded against concurrent/repeated calls, this is a correctness bug in the documented API. The diff doesn’t show any new once-guard.

Fix: wrap shutdown with std::once_flag or an atomic state machine and document whether concurrent calls are allowed.


3. Callback path can execute arbitrary user code on hot/intercept threads with no containment

Both:

  • HIP API wrappers call the API callback on the application thread
  • kernel completion worker calls the kernel callback on the worker thread

That may be intended, but there is no protection against callback reentrancy/exception propagation.

In particular, deliver_hip_api() invokes user callback while ScopedReentrancyGuard is active, which avoids self-recursion into HIP wrappers, but:

  • a throwing C++ callback will unwind through C-facing interceptors / worker thread
  • a blocking callback can stall the completion worker and back up dispatch processing

Fix: at minimum, document callbacks must be noexcept and non-blocking; better, catch all exceptions around callback invocation and drop/report the event.


Other observations

Dependency violations

No new references to roctracer, rocprofiler-sdk, or libroctx64 in this diff.

Performance

  • completion_worker() still does std::string name = lookup_kernel_name(...) before checking whether callback/DB path is used. That’s pre-existing shape, but for embedded mode this still allocates/copies on every completion. Not a blocker unless lookup_kernel_name already returns interned storage.
  • No new thread-per-dispatch issue introduced here.

Security

  • snprintf usage looks bounded and format strings are constant.
  • No obvious buffer overflow introduced in this patch.

Style

  • Generally consistent, but the header comment says “atomic loads” while implementation is non-atomic; that inconsistency should be fixed with the race above.

Model: gpt-5.4

@github-actions
Copy link
Copy Markdown

🤖 GPT-5.4 Code Review

NEEDS CHANGES

Real issues

1. Data race on callback pointers/user_data

trace_db.cpp stores callbacks in plain globals:

static ApiEventCallback g_api_event_cb = nullptr;
static void* g_api_event_cb_data = nullptr;

and reads them from multiple threads/hot paths (hip_api_intercept.cpp, hsa_intercept.cpp).

The comment says registration happens before OnLoad, but the public API does not enforce that, and the README explicitly describes embedding into another library where init ordering can be subtle. As written, any late registration/clear is UB. Also, callback and user_data are read separately, so even with “mostly correct” ordering you can observe mismatched pairs.

Fix: store callback + user_data atomically as one immutable struct pointer, or use atomics for both with acquire/release semantics and document that changing after startup is unsupported.


2. Exceptions escaping callbacks can terminate/interfere with runtime threads

Both callback paths invoke embedder code directly:

  • HIP wrappers on app threads
  • HSA completion worker thread

If a C++ callback throws through these interception frames, behavior is bad at best and can terminate the process. The header comment says “must be noexcept”, but this is not enforced.

Fix: wrap callback invocation in try/catch (...) and drop/log on failure. Especially important on the completion worker thread.


3. rtl_trigger_shutdown() exported from trace_db can race with in-flight HIP API callbacks

The README claims:

After it returns, no more callbacks will fire.

But rtl_trigger_shutdown() only forwards to hsa_intercept::shutdown(), which joins the completion worker and drains HSA-side state. It does not stop application threads from continuing to enter HIP wrappers and calling deliver_hip_api() concurrently after shutdown returns.

So the documented guarantee is false unless shutdown also disables HIP API recording globally and callers ensure no app threads are still running through wrappers.

Fix: either:

  • make shutdown disable HIP interception/callback delivery before returning, or
  • weaken the API contract/docs to only guarantee no more kernel completion callbacks.

Other checks

Dependency violations

No new references to roctracer, rocprofiler-sdk, or libroctx64 in this diff.

Performance

No major regression in this diff. Callback path avoids SQLite work. The kernel path still allocates std::string name = lookup_kernel_name(...); if that was already present, this PR doesn’t worsen it.

Security

No obvious buffer overflow/format-string issue introduced. snprintf usage remains bounded.

Style

Generally consistent with existing code.


Model: gpt-5.4

@sunway513
Copy link
Copy Markdown
Owner

Test report — PR #103 (refit for embedded use)

Verdict: ✅ ready to merge. Overhead-neutral vs main, new callback API works as documented, all CPU tests pass.

Tested on mi355-gpu-15 (8× MI355X) inside rocm/atom-dev:latest (2026-04-24). PR head e9cafb59, base dde1e969 (main HEAD).

Build

  • Clean dep chain: libhsa-runtime64, libsqlite3 only (no roctracer / rocprofiler-sdk / libamdhip64) ✅
  • librtl.so size: 1300072 → 1304000 bytes (+3928 bytes for new callback path)
  • New exported symbols verified via nm -D:
    • trace_db::set_api_event_callback
    • trace_db::set_kernel_event_callback
    • trace_db::rtl_trigger_shutdown

CPU tests

  • pytest tests/: 243 passed, 47 skipped
  • ruff check: clean ✅
  • Source guard tests pass (no roctracer / rocprofiler-sdk references) ✅

GPU smoke (50× 1024×1024 matmul, RTL lite)

intercept inject recorded DB size
main librtl.so 57 55 55 49 KB
PR librtl.so (default) 57 55 55 49 KB
PR librtl.so + LD_PRELOAD callback 57 55 55 0 KB

Default path: bit-for-bit identical to main. With a callback registered, SQLite write is suppressed and 55 callback invocations fire instead — confirms the "instead of TraceDB" semantics in the header comment.

E2E serving overhead — GPT-OSS 120B TP=8

ATOM serving + benchmark_serving, 64 prompts × ISL 1024 × OSL 128, max-concurrency 8, ignore-eos, 4 warmups.

phase          dur(s)  out_tok/s   med_ttft   med_itl   p99_itl  mean_tpot
baseline         3.98     2057.3      85.60     3.168     4.172      3.229
main_rtl         5.67     1444.2     103.98     3.217    14.546      4.748
pr103_rtl        5.62     1458.9      99.03     3.224    14.574      4.768

PR vs main (apples-to-apples, both RTL lite):

  • output_throughput: +1.0%
  • median_itl: +0.2%
  • mean_tpot: +0.4%
  • median_ttft: −4.8%

All deltas inside run-to-run noise on a 4–6 second wall. Refactor is overhead-neutral. ✅

Gap (non-blocking)

PR adds three exported C++ symbols (set_api_event_callback, set_kernel_event_callback, rtl_trigger_shutdown) but no test or example exercising them. Per our convention (every major code change ships docs + test), please add a minimal tests/test_embedded_callback.cpp or examples/embedded_consumer.cpp. Happy to drop in the 30-line LD_PRELOAD harness I used for validation if helpful.

RTL-vs-baseline aside (not a PR issue)

Both RTL phases show ~30% wall-time penalty on this short workload (4s → 5.7s), driven by mean_tpot +47% and p99_itl 3.5× — but median_itl only +1.5%. The gap suggests a tail of high-TPOT events under signal injection. This affects main and PR equally, so it's not a regression for this PR. Worth a separate look at the mean/p99 ITL spread on short-context decode paths.

@github-actions
Copy link
Copy Markdown

🤖 GPT-5.4 Code Review

NEEDS CHANGES

Real issues

1. Data race on callback pointers/user_data

trace_db.cpp stores callbacks in plain globals and reads them from multiple threads:

  • app threads in HIP wrappers
  • completion worker thread in hsa_intercept.cpp
  • shutdown path

The comment says registration happens before OnLoad, but the API is public and nothing enforces that. More importantly, reads can race with teardown/unset or late registration by an embedder. This is UB in C++.

Where

  • src/trace_db.cpp
    • static ApiEventCallback g_api_event_cb = nullptr;
    • static void* g_api_event_cb_data = nullptr;
    • static KernelEventCallback g_kernel_event_cb = nullptr;
    • static void* g_kernel_event_cb_data = nullptr;

Fix
Use atomics for callback pointers and user-data, or encapsulate both in one immutable struct published atomically.


2. Callback exceptions can unwind through C/HSA/HIP interception paths

The header documents callbacks “must be noexcept”, but the implementation does not enforce containment. If an embedder throws from either callback, that exception can unwind through:

  • HIP interposed C ABI functions
  • HSA completion worker thread

That can terminate the process or corrupt shutdown behavior.

Where

  • src/hip_api_intercept.cpp in deliver_hip_api()
  • src/hsa_intercept.cpp in completion_worker() callback path

Fix
Wrap callback invocation in try/catch (...) and drop/log on failure.


3. rtl_trigger_shutdown() can run before OnLoad and force DB creation unexpectedly

The README says embedders can call rtl_trigger_shutdown() safely/idempotently. But shutdown() now skips DB flush/close only if callbacks are set. If rtl_trigger_shutdown() is called before OnLoad and before callbacks are registered, this path still does:

trace_db::get_trace_db().flush();
trace_db::get_trace_db().close();

That can lazily create/open the SQLite DB during shutdown, which violates the documented “no SQLite file is created” behavior for embedding scenarios.

Where

  • src/hsa_intercept.cpp shutdown()

Fix
Guard flush/close on actual DB initialization state, not just callback presence.


Other checks

Dependency violations

No new references to:

  • roctracer
  • rocprofiler-sdk
  • libroctx64

Good.

Performance

No obvious new hot-path heap allocation beyond existing lookup_kernel_name() string creation. Callback dispatch itself is fine. No thread-per-dispatch issue introduced.

Security

No format-string bug introduced. snprintf usage here is bounded. No obvious overflow in this diff.

Style

Consistent with existing code overall.


Model: gpt-5.4

@sunway513
Copy link
Copy Markdown
Owner

@mwootton friendly ping — the embedded-tests commit (8c9de8f) added the test we asked for (thanks!) but landed three blockers. Status today:

Lint (trivial): tests/test_embedded_callback.py has 3 unused imports (subprocess, tempfile, pytest) — ruff check --fix will clear it.

GPT-5.4 review (NEEDS CHANGES) — three real correctness items on the new callback API:

  1. Data raceg_api_event_cb / g_kernel_event_cb / their _data slots are plain globals, read from app threads + completion worker. Header says "register before OnLoad" but nothing enforces it. UB in C++. Fix: std::atomic for the pointers, or atomically publish an immutable struct.
  2. Exception safety — header says callbacks "must be noexcept" but the call sites in deliver_hip_api() and completion_worker() don't enforce it. An embedder throw unwinds through HIP C ABI / HSA worker → process abort. Fix: try { cb(ev, ud); } catch (...) { drop+log; }.
  3. rtl_trigger_shutdown() lazy-creates SQLite — if called before OnLoad and before any callback is registered, shutdown() still hits get_trace_db().flush()/.close(), which materializes the SQLite file. Violates the documented "no SQLite file in embed mode" contract. Fix: guard flush/close on actual DB init state, not callback presence.

Once those are in, I'll rerun the full recipe (~30 min on mi355-gpu-15) against the new head and update the verdict. Build + GPU 8× tests are already passing, so I expect the fixes to be ~50 LOC and the re-test to come back green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants