feat(observability): surface bundle start failures at boot #213

Open
Ovaculos wants to merge 5 commits into NimbleBrainInc:main from Ovaculos:fix/bundle-start-failed-observability

Ovaculos commented May 13, 2026

Closes #7

Summary

Boot-time bundle startup failures previously went only to container stderr, so operators had to grep the container output to discover that a workspace bundle had silently failed to come up. The user-visible symptom was a missing tool with no event trail in the workspace log, no SSE notification, and no entry in /v1/health.

This branch adds three observability channels for boot failures and four small follow-ups from review.

What changed

Base feature (1cf008d): feat(observability): surface bundle start failures at boot

Three changes on the catch path in startWorkspaceBundles:

  1. New bundle.startFailed engine event. Routed workspace-scoped via SSE_ROUTES (drives SSE fan-out) and WORKSPACE_EVENTS (persists to the workspace log).
  2. startWorkspaceBundles returns a failures: BundleStartFailure[] array alongside entries. Runtime stashes it on _bundleStartFailures and exposes it via bundleStartFailures().
  3. HealthMonitor takes a startFailures option at construction; getStatus() merges them in as terminal dead entries so /v1/health reflects the failed bundle instead of omitting it.
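The catch path described above could look roughly like the following. This is a minimal sketch, not the code in this branch: the function name `startWorkspaceBundlesSketch`, the `BundleSpec` type, and the `startOne` / `emit` callbacks are hypothetical, and the `BundleStartFailure` fields are inferred from the fields the tests in this PR check (`wsId`, `serverName`, `bundleName`, `error`).

```typescript
// Field names taken from the PR text; the exact type in the codebase may differ.
interface BundleStartFailure {
  wsId: string;
  serverName: string;
  bundleName: string;
  error: string;
}

// Hypothetical helper types, for illustration only.
interface BundleSpec {
  serverName: string;
  bundleName: string;
}
interface StartFailedEvent {
  type: "bundle.startFailed";
  data: BundleStartFailure;
}

async function startWorkspaceBundlesSketch(
  wsId: string,
  specs: BundleSpec[],
  startOne: (spec: BundleSpec) => Promise<string>, // assumed: resolves to a source name
  emit: (event: StartFailedEvent) => void,
): Promise<{ entries: string[]; failures: BundleStartFailure[] }> {
  const entries: string[] = [];
  const failures: BundleStartFailure[] = [];
  for (const spec of specs) {
    try {
      entries.push(await startOne(spec));
    } catch (err) {
      // Record the failure and emit an event instead of only logging to stderr.
      const failure: BundleStartFailure = {
        wsId,
        serverName: spec.serverName,
        bundleName: spec.bundleName,
        error: err instanceof Error ? err.message : String(err),
      };
      failures.push(failure);
      emit({ type: "bundle.startFailed", data: failure });
      // Continue the loop so sibling bundles still start.
    }
  }
  return { entries, failures };
}
```

Because the `try`/`catch` wraps each bundle individually, one failing bundle in this sketch does not prevent the others in the same workspace from starting.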

This event is distinct from bundle.crashed, which requires a running source that went away. A start failure means no McpSource ever existed, so the record can't be restarted by the periodic health-check loop; the dead entry is terminal.
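To illustrate the merge in getStatus(), here is a small sketch under stated assumptions. `HealthMonitorSketch`, the `live` constructor argument, the `"healthy"` state, and the choice of `serverName` as the entry name are hypothetical; only the `startFailures` option, the `dead` state, and the optional `wsId` on `BundleHealth` come from this PR.

```typescript
interface BundleStartFailure {
  wsId: string;
  serverName: string;
  bundleName: string;
  error: string;
}

// Only `name`, the "dead" state, and the optional `wsId` are named in the PR;
// "healthy" is an assumed placeholder for live states.
interface BundleHealth {
  name: string;
  state: "healthy" | "dead";
  wsId?: string;
}

class HealthMonitorSketch {
  private readonly live: BundleHealth[];
  private readonly startFailures: readonly BundleStartFailure[];

  constructor(
    live: BundleHealth[],
    opts: { startFailures?: readonly BundleStartFailure[] } = {},
  ) {
    this.live = live;
    this.startFailures = opts.startFailures ?? [];
  }

  getStatus(): BundleHealth[] {
    // Start failures never had a source to restart, so they are reported
    // as terminal "dead" entries next to the live ones.
    const dead = this.startFailures.map(
      (f): BundleHealth => ({ name: f.serverName, state: "dead", wsId: f.wsId }),
    );
    return [...this.live, ...dead];
  }
}
```

With two workspaces failing on the same connector, the two dead rows in this sketch share a `name` but differ in `wsId`, which is what lets a consumer tell them apart.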

Follow-ups (review fixes)

  • f7372c1: feat(health-monitor): propagate wsId on dead BundleHealth entries. Two workspaces installing the same connector produce identical source names, so same-named start failures would render as indistinguishable dead rows. Adds an optional wsId?: string on BundleHealth, populated only for failures (live sources don't carry a wsId on McpSource).
  • 4170a2d: refactor(runtime): defensive-copy bundleStartFailures() return value. Matches the bundleNames() pattern one method up.
  • af8c59f: refactor(events): rename bundle.start_failed to bundle.startFailed. Matches the casing of sibling bundle.* events (installed, uninstalled, crashed, recovered, dead).
  • e55eefb: test(workspace-runtime): assert shape over error message string. Decouples the tests from buildLocalSource error wording.

Test plan

  • bun run test:unit — 48 affected tests pass
  • bunx tsc --noEmit — typecheck clean
  • bun run verify — full CI parity locally
  • Manual: install a path-bundle pointing at a nonexistent directory; verify (a) workspace log contains a bundle.startFailed line, (b) /v1/health lists the bundle as state: "dead" with wsId populated, (c) SSE client receives the event.

🤖 Generated with Claude Code

Ovaculos and others added 5 commits May 13, 2026 14:57

feat(observability): surface bundle start failures at boot

Boot-time bundle startup failures previously went only to container
stderr — operators had to grep to discover that a workspace bundle
silently never came up. The user-visible symptom was a missing tool
with no event trail in the workspace log, no SSE notification, and
no entry in `/v1/health`.

Three changes, all on the catch path in `startWorkspaceBundles`:

1. New `bundle.start_failed` engine event. Routed workspace-scoped
   via `SSE_ROUTES` (drives SSE fan-out) and `WORKSPACE_EVENTS`
   (persists to the workspace log).
2. `startWorkspaceBundles` returns a `failures: BundleStartFailure[]`
   array alongside `entries`. Runtime stashes it on `_bundleStartFailures`
   and exposes via `bundleStartFailures()`.
3. `HealthMonitor` takes a `startFailures` option at construction;
   `getStatus()` merges them in as terminal `dead` entries so
   `/v1/health` reflects the failed bundle instead of omitting it.

Distinct from `bundle.crashed`, which requires a running source that
went away. A start failure means no `McpSource` ever existed, so the
record can't be restarted by the periodic health-check loop; the
`dead` entry is terminal.

Tests cover that the catch path leaves sibling bundles unaffected, the
merged status in `getStatus()`, SSE routing, and workspace-log persistence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(health-monitor): propagate wsId on dead BundleHealth entries

`BundleHealth.name` is the source name (slugged from manifest or
URL). Two workspaces installing the same connector produce identical
names, so a same-named start failure across workspaces would render
as indistinguishable `dead` rows in `/v1/health`.

Add an optional `wsId?: string` to `BundleHealth`. Populated only
for boot-time start failures (the data is on `BundleStartFailure`);
live entries leave it undefined because `McpSource` doesn't carry a
wsId. Consumers can disambiguate without a schema migration on the
live path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

refactor(runtime): defensive-copy bundleStartFailures() return value

Return a shallow copy instead of the internal array reference, matching
the pattern used by `bundleNames()` one method up. HealthMonitor stores
the reference and doesn't mutate it today, but exposing the live array
invites future callers to splice or push into it and silently corrupt
the boot-time failure record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
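
The defensive copy in the commit above can be sketched as follows. `RuntimeSketch` and its constructor are hypothetical stand-ins for the real Runtime class; only the `_bundleStartFailures` field and the `bundleStartFailures()` accessor are named in this PR.

```typescript
interface BundleStartFailure {
  wsId: string;
  serverName: string;
  bundleName: string;
  error: string;
}

class RuntimeSketch {
  private _bundleStartFailures: BundleStartFailure[];

  constructor(failures: BundleStartFailure[]) {
    this._bundleStartFailures = failures;
  }

  // Return a shallow copy so callers cannot push into or splice the
  // internal boot-time failure record.
  bundleStartFailures(): BundleStartFailure[] {
    return [...this._bundleStartFailures];
  }
}
```

The copy is shallow: the array is new, but the failure objects inside it are still shared, so this protects the list itself rather than the individual records.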

refactor(events): rename bundle.start_failed to bundle.startFailed

Match the casing convention of sibling bundle.* events (`installed`,
`uninstalled`, `crashed`, `recovered`, `dead`) which use a single
camel/lowercase token after the dot. Pure rename — no payload or
routing change. Internal-only event with no external subscribers
yet, so safe to rename without a deprecation window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(workspace-runtime): assert shape over error message string

The startWorkspaceBundles failure tests asserted on a substring of
the error message ("Local bundle not found") emitted by buildLocalSource.
That couples the test to wording inside a different module — rewording
the error there would break tests here for no behavioral reason.

Switch to shape assertions: error and bundleName are non-empty strings,
plus the existing wsId / serverName equality checks. The behavior under
test is "a failure was recorded with the expected fields populated,"
not "the message reads exactly this."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
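
A shape assertion of the kind described in the commit above might look like this. It is a sketch only: `assertFailureShape` and `isNonEmptyString` are hypothetical helpers written with plain `throw` statements rather than the project's test framework, and the field names follow the ones listed in the commit message.

```typescript
interface BundleStartFailure {
  wsId: string;
  serverName: string;
  bundleName: string;
  error: string;
}

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.length > 0;
}

// Checks that a failure was recorded with the expected fields populated,
// without depending on the exact wording of the error message.
function assertFailureShape(
  failure: BundleStartFailure,
  expected: { wsId: string; serverName: string },
): void {
  if (!isNonEmptyString(failure.error)) {
    throw new Error("error should be a non-empty string");
  }
  if (!isNonEmptyString(failure.bundleName)) {
    throw new Error("bundleName should be a non-empty string");
  }
  if (failure.wsId !== expected.wsId) {
    throw new Error("unexpected wsId");
  }
  if (failure.serverName !== expected.serverName) {
    throw new Error("unexpected serverName");
  }
}
```

Rewording the message produced by buildLocalSource would leave a check like this passing, while a failure recorded with an empty `error` or the wrong `wsId` would still be caught.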

Development

Successfully merging this pull request may close these issues.

Bundle startup failures not logged to workspace or surfaced to UI
