feat(audit): replay endpoint and SSE live tail (#8) by cjimti · Pull Request #11 · plexara/mcp-test

cjimti · 2026-05-06T08:46:34Z

Third slice of #8. Lands the two state-changing / streaming endpoints from the inspection roadmap. Closes off both backend follow-ups so the next branch can focus entirely on the portal UI rewrite.

What's in this PR

Replay endpoint (`POST /api/v1/portal/audit/events/{id}/replay`)

Re-invokes a captured tool call through an in-process MCP client and writes a new audit row tagged source=portal-replay with replayed_from = {id}. The replay is fired with the portal-authenticated identity (NOT the original caller's), so the new row reflects who triggered it.

Per-identity rate limiting. Token bucket: 5 burst, one token / 12s = ~5/min sustained. Exhausted callers get 429 Too Many Requests with Retry-After. Bucket is keyed by <auth_type>:<subject>; nil or empty identity returns 401 (fails closed; the limiter's empty-key fail-open path is unreachable from this handler).
Refusals (4xx, no tool call made).
- 400 invalid UUID
- 404 event not in the audit store
- 400 original event has no captured payload (capture was disabled when it was written)
- 400 any captured parameter value is the literal [redacted] (replaying with placeholders would mislead about what the call did)
- 400 named tool is no longer registered
- 429 per-identity rate limit exhausted
Response shape. Top-level success boolean so callers don't have to introspect the SDK-shaped result; replay_event_id for follow-up /events/{id} linking; replayed_from echo. HTTP 502 Bad Gateway on transport-level callErr OR tool-side IsError (mirrors /admin/tryit semantics).
error_category mirrors pkg/mcpmw/audit.go precedence. When both callErr != nil and cr.IsError, the bucket is "handler" (not "tool"), matching what a native call would record. Operators filtering ?error_category=tool over /events see consistent bucketing across native and replayed calls.
Deep-copies request params. audit.SanitizeParameters returns the input map AS-IS when redactKeys is empty (fast path). Without a deep copy, the SDK / tool handler could mutate the original event's RequestParams via the shared map pointer. deepCopyMap clones JSON-shape maps before passing to CallTool and recordReplayAudit.
Audit Log uses background ctx. recordReplayAudit uses context.WithTimeout(context.Background(), 5s) instead of r.Context(). A client disconnect at the moment we're persisting the replay row must not drop it; the response promised replay_event_id and that id has to lead to a real /events row.
CSRF gate. Mounted via requireCSRFHeader so the endpoint requires X-Requested-With (the SPA sets this; a forged <form> POST cannot).
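
The aliasing hazard behind the deep-copy item can be shown with a minimal sketch. The name `deepCopyMap` comes from this PR, but the body below is an assumption about what a JSON-shape clone looks like, not the code in `pkg/httpsrv`; `deepCopyValue` is a hypothetical helper.

```go
package main

import "fmt"

// deepCopyValue clones JSON-shape values: maps and slices are copied
// recursively; scalars (string, float64, bool, nil) are returned as-is.
func deepCopyValue(v any) any {
	switch t := v.(type) {
	case map[string]any:
		return deepCopyMap(t)
	case []any:
		out := make([]any, len(t))
		for i, e := range t {
			out[i] = deepCopyValue(e)
		}
		return out
	default:
		return t
	}
}

// deepCopyMap returns a copy that shares no map or slice with src.
// A nil input stays nil.
func deepCopyMap(src map[string]any) map[string]any {
	if src == nil {
		return nil
	}
	dst := make(map[string]any, len(src))
	for k, v := range src {
		dst[k] = deepCopyValue(v)
	}
	return dst
}

func main() {
	original := map[string]any{
		"message": "hello",
		"opts":    map[string]any{"upper": false},
		"tags":    []any{"a", "b"},
	}
	params := deepCopyMap(original)

	// Simulate a tool handler mutating the params it was handed.
	params["opts"].(map[string]any)["upper"] = true
	params["tags"].([]any)[0] = "changed"

	fmt.Println(original["opts"].(map[string]any)["upper"]) // false
	fmt.Println(original["tags"].([]any)[0])                // a
}
```

A shallow copy of the top-level map would not be enough here: the nested `opts` map and `tags` slice would still be shared with the stored event.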

SSE live tail (`GET /api/v1/portal/audit/stream`)

Emits one SSE event per newly-written audit row. New audit.SubscribingLogger optional capability; consumers type-assert the configured logger.

Capability design. SubscribingLogger.Subscribe(buf int) (<-chan Event, func()) returns a buffered receive-only channel and a cancel func. The caller MUST call cancel; otherwise the registry leaks. AsyncLogger broadcasts AFTER inner.Log() returns nil, so subscribers see only persisted events; failed writes don't surface to live tail. MemoryLogger broadcasts on every Log.
Concurrency: per-subscriber mutex. A subscriber struct gates send (broadcast) and close (cancel) via its own sync.Mutex so a concurrent cancel can't race with broadcast on a closed channel. Race-tested with a 100-event slow-consumer drop test.
Slow consumers drop. Buffer-full subscribers silently drop the event for that subscriber; producer never blocks. Default buffer is 64 events; SSE consumers should drain promptly during bursts.
SSE framing. Opening : connected comment confirms the connection before the first audit row; : keepalive comment every 30s prevents intermediate proxies from killing idle connections; event: audit\ndata: <event JSON> per event. Frame is encoded into a bytes.Buffer and written in a single w.Write so a partial encoder failure can't ship a half-formed frame.
Headers. Content-Type: text/event-stream, Cache-Control: no-store, Connection: keep-alive, X-Accel-Buffering: no (nginx hint to disable proxy-side buffering).
History vs tail. Subscribers see only events written AFTER they subscribe. For history, use /events or /export.
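
The single-write framing described above can be sketched as follows. This is an illustration under assumptions, not the handler in this PR: `auditEvent`, `encodeFrame`, and `writeFrame` are hypothetical names, and the event fields are invented for the example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// auditEvent is a hypothetical stand-in for the real audit event type.
type auditEvent struct {
	ID   string `json:"id"`
	Tool string `json:"tool"`
}

// encodeFrame builds a complete "event: audit" frame in memory.
// json.Encoder.Encode appends one newline after the JSON, so a second
// newline is added to produce the blank line that terminates an SSE event.
func encodeFrame(ev auditEvent) ([]byte, error) {
	var buf bytes.Buffer
	buf.WriteString("event: audit\ndata: ")
	if err := json.NewEncoder(&buf).Encode(ev); err != nil {
		return nil, err // nothing has reached the client yet
	}
	buf.WriteByte('\n')
	return buf.Bytes(), nil
}

// writeFrame sends the whole frame with a single Write call.
func writeFrame(w io.Writer, ev auditEvent) error {
	frame, err := encodeFrame(ev)
	if err != nil {
		return err
	}
	_, err = w.Write(frame)
	return err
}

func main() {
	if err := writeFrame(os.Stdout, auditEvent{ID: "42", Tool: "echo"}); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
	}
}
```

In an HTTP handler the writer would be the `http.ResponseWriter`, followed by a flush through `http.Flusher` so the frame leaves the server immediately.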

`audit` package: capability + MemoryLogger consistency

New audit.SubscribingLogger interface in pkg/audit/logger.go (alongside PayloadLogger and StreamingLogger).
AsyncLogger.Subscribe + broadcast in pkg/audit/async.go. Broadcast is non-blocking per subscriber via the per-subscriber mutex pattern.
MemoryLogger.Subscribe + broadcast in pkg/audit/memory.go. Used by tests that bypass AsyncLogger.
MemoryLogger.Log auto-assigns Event.ID (uuid.NewString) when the caller leaves it empty, matching Postgres's behavior. Tests no longer need to set IDs explicitly to round-trip through /events/{id} or /replay.
MemoryLogger now implements PayloadLogger (returns the stored Payload pointer) so the replay handler can fetch the original params in unit tests without a Postgres dependency.
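
A minimal sketch of the per-subscriber mutex pattern, assuming a simple registry. Only the `Subscribe(buf int) (<-chan Event, func())` shape is taken from this PR; `hub`, `subscriber`, `trySend`, and the one-field `Event` are hypothetical stand-ins, and the real AsyncLogger and MemoryLogger may be organized differently.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical stand-in for the audit event type.
type Event struct{ ID string }

// subscriber guards send and close with one mutex so a cancel can
// never close the channel while a broadcast is sending on it.
type subscriber struct {
	mu     sync.Mutex
	ch     chan Event
	closed bool
}

// trySend delivers ev if there is buffer space and reports whether it did.
// It never blocks: a full buffer drops the event for this subscriber only.
func (s *subscriber) trySend(ev Event) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return false
	}
	select {
	case s.ch <- ev:
		return true
	default:
		return false
	}
}

// close is idempotent, so calling cancel twice is safe.
func (s *subscriber) close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.closed {
		s.closed = true
		close(s.ch)
	}
}

// hub is a hypothetical registry of live subscribers.
type hub struct {
	mu   sync.Mutex
	subs map[*subscriber]struct{}
}

func newHub() *hub { return &hub{subs: make(map[*subscriber]struct{})} }

// Subscribe returns a receive-only channel and a cancel func that the
// caller must invoke, otherwise the registry entry leaks.
func (h *hub) Subscribe(buf int) (<-chan Event, func()) {
	s := &subscriber{ch: make(chan Event, buf)}
	h.mu.Lock()
	h.subs[s] = struct{}{}
	h.mu.Unlock()
	return s.ch, func() {
		h.mu.Lock()
		delete(h.subs, s)
		h.mu.Unlock()
		s.close()
	}
}

// broadcast snapshots the registry, then sends without holding the hub lock.
func (h *hub) broadcast(ev Event) {
	h.mu.Lock()
	subs := make([]*subscriber, 0, len(h.subs))
	for s := range h.subs {
		subs = append(subs, s)
	}
	h.mu.Unlock()
	for _, s := range subs {
		s.trySend(ev)
	}
}

func main() {
	h := newHub()
	ch, cancel := h.Subscribe(2)
	for i := 0; i < 100; i++ {
		h.broadcast(Event{ID: fmt.Sprint(i)})
	}
	cancel()
	n := 0
	for range ch {
		n++
	}
	fmt.Println("delivered:", n) // 2: the other 98 were dropped
}
```

The `closed` flag is what makes a late broadcast safe: a subscriber captured in the snapshot just before cancel sees `closed == true` under the lock and skips the send instead of panicking on a closed channel.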

Rate limiter (`pkg/httpsrv/ratelimit.go`)

Per-key token bucket with injectable clock. Burst, refill rate, idle GC. Currently used only by /replay; reusable for any future per-identity-rate-limited endpoint.
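
A minimal sketch of a per-key token bucket with an injectable clock, using the replay numbers from this PR (5 burst, one token per 12s). All names here (`limiter`, `allow`, `fakeClock`, `newReplayLimiter`) and the `apikey:alice` key are hypothetical; the idle GC and the empty-key fail-open path that `pkg/httpsrv/ratelimit.go` has are left out for brevity.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type bucket struct {
	tokens float64
	last   time.Time
}

// limiter is a per-key token bucket. now is injectable so tests can
// advance time without sleeping.
type limiter struct {
	mu      sync.Mutex
	burst   float64
	every   time.Duration // time needed to refill one token
	now     func() time.Time
	buckets map[string]*bucket
}

func newLimiter(burst int, every time.Duration, now func() time.Time) *limiter {
	return &limiter{burst: float64(burst), every: every, now: now, buckets: map[string]*bucket{}}
}

// newReplayLimiter uses the numbers quoted for /replay.
func newReplayLimiter(now func() time.Time) *limiter {
	return newLimiter(5, 12*time.Second, now)
}

// allow takes one token for key. When the bucket is empty it returns
// false plus the wait until the next token (a Retry-After candidate).
func (l *limiter) allow(key string) (bool, time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := l.now()
	b, ok := l.buckets[key]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[key] = b
	}
	// Refill in proportion to elapsed time, capped at burst.
	b.tokens += float64(now.Sub(b.last)) / float64(l.every)
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true, 0
	}
	return false, time.Duration((1 - b.tokens) * float64(l.every))
}

// fakeClock is a manually advanced clock for tests and this demo.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) advance(d time.Duration) { c.t = c.t.Add(d) }

func main() {
	clock := &fakeClock{}
	lim := newReplayLimiter(clock.Now)
	for i := 1; i <= 6; i++ {
		ok, wait := lim.allow("apikey:alice")
		fmt.Println(i, ok, wait) // calls 1-5 are allowed; call 6 is refused
	}
	clock.advance(12 * time.Second)
	fmt.Println(lim.allow("apikey:alice")) // one token has refilled
}
```

Injecting `now` is what lets the burst-then-refill timing be asserted deterministically instead of with real sleeps.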

Tests

| File | Coverage |
| --- | --- |
| `pkg/audit/subscribe_test.go` | Race-tested: deliver-after-Log, fan-out to multiple subscribers, cancel-stops-delivery, slow-consumer-drops-events (100 events into buf=2), failed-inner-Log-does-not-broadcast |
| `pkg/httpsrv/ratelimit_test.go` | Burst-then-refill timing with injected clock; per-key independence; empty-key fail-open; idle GC after 10 minutes |
| `pkg/httpsrv/portal_api_replay_test.go` | Real `mcp.Server` with identity toolkit registered; happy-path replay via in-process MCP client; 503-when-mcpServer-nil; 404-on-event-not-found; 400-on-redacted-params; 400-on-no-payload; 429 after burst exhaust (`Retry-After` header asserted); `hasRedactedParam` table; `identityKey` table; `deepCopyMap` aliasing test; `callToolResultToMap` content-type matrix; SSE deliver-and-comment via real `httptest.Server` (`http.Flusher`) |
| `tests/audit_replay_test.go` | Full HTTP stack via `portalApp`: replay roundtrip + assert new audit row carries `replayed_from`; redacted refusal returns 400; invalid UUID returns 400 |
| `tests/audit_stream_test.go` | Full HTTP stack: opening `: connected` comment fires immediately; SSE delivers tool call's audit event within 3s |

Docs

docs/operations/audit.md: new "Replay a captured call" subsection with curl examples and a side-effects warning ("Replay re-runs the tool's side effects. If the original call wrote to a database, sent a notification, or charged a card, the replay does it again."); new "Live tail" subsection.
docs/reference/http-api.md: rows for /replay (rate limit, refusals, CSRF) and /stream (SSE format, keepalive, X-Accel-Buffering).

Closes (subtasks of #8)

Replay endpoint. Per-identity rate-limited, refuses unsafe replays, writes new audit row with replayed_from linkage.
SSE live tail. New SubscribingLogger capability, fan-out broadcast on AsyncLogger and MemoryLogger, atomic SSE framing, heartbeat, race-tested.

Does not close the umbrella issue. Remaining: portal UI inspection drawer, comparison page, and docs/operations/inspection.md walkthrough. All three are frontend / docs-only on top of these endpoints.

Process: pre-commit adversarial review gate

This branch is the first to land through the pre-commit hook installed in the previous round. The gate ran 3 rounds against the working tree before allowing git commit:

Round 1: 17 findings. Doc/comment rate-limit math wrong by 5x, recordReplayAudit Payload missing ResponseResult/Error, audit Log used request ctx (cancellation drops row), tool IsError returned 200 hiding failure, SSE 3-stage write could ship half-formed frame, SanitizeParameters fast-path aliasing, stale comment about NewEvent UUIDs, nil identity could fail-open, ErrorCategory bucketed wrong vs middleware, plus 8 minor / skipped.
Round 2: 6 findings. ErrorCategory precedence reversed (regression in the round-1 fix), callToolResultToMap mirror dropped detail keys, empty &Identity{} could still fail-open. Three doc/comment minors skipped.
Round 3: CLEAN.

All real round-1 and round-2 findings are addressed in this commit. There is no fix(...): address PR review follow-up commit on this branch — by the gate's design, there shouldn't be one ever.

Verification

make verify green: gofmt, vet, race-tested unit suite, lint (golangci-lint v2.11.4), gosec (v2.25.0), govulncheck, semgrep, coverage gate at 80% (filtered total 80.0%).
go test -tags integration -count=1 -timeout 300s ./... green: postgres testcontainer suite + tests/ HTTP stack.

Test plan

make verify
go test -tags integration ./...
Manual: make dev, fire an echo call, then:
- curl -X POST -H "X-API-Key: $KEY" -H "X-Requested-With: x" "$BASE/api/v1/portal/audit/events/<id>/replay" returns {"success": true, "replay_event_id": "<new-uuid>", "replayed_from": "<id>", ...}
- Re-fire the replay 6 times in a row; the 6th returns 429 with a Retry-After header
- curl -N -H "X-API-Key: $KEY" "$BASE/api/v1/portal/audit/stream" shows : connected immediately, then event: audit\ndata: ... lines as new tool calls fire from a second terminal
Manual: try to /replay an event whose original captured params include a [redacted] value; observe 400 with the message about re-staging via Try-It

Adds two backend features from #8 to the audit pipeline.

Replay endpoint (POST /audit/events/{id}/replay):
- Re-invokes the captured tool call through an in-process MCP client. Writes a new audit row tagged source=portal-replay with replayed_from set; the row carries the portal-authenticated identity, not the original caller's, so an operator can see who fired the replay.
- Per-identity token bucket (5 burst, ~5/min sustained); 429 with Retry-After when exhausted.
- Refuses (400) when the original event has no captured payload, has redacted parameter values, or names a tool no longer registered. CSRF-gated via X-Requested-With.
- Response includes top-level success, replay_event_id, and replayed_from. HTTP 502 on transport-level callErr OR tool-side IsError, mirroring /admin/tryit.
- error_category mirrors mcpmw/audit middleware precedence (tool only when callErr==nil; handler overwrites tool when callErr != nil) so /events ?error_category= filtering buckets native and replayed calls the same way.

SSE live tail (GET /audit/stream):
- New audit.SubscribingLogger optional capability. AsyncLogger broadcasts to subscribers AFTER inner.Log() succeeds (failed writes don't surface to live tail). MemoryLogger broadcasts on every Log.
- Per-subscriber mutex serializes send vs cancel so cancellation can't race with broadcast on a closed channel; race-tested.
- Each subscriber gets a buffered channel; slow consumers drop events for that subscriber, never block the producer.
- SSE handler emits opening ': connected' comment, ': keepalive' every 30s, and one 'event: audit\ndata: <json>' per write. Frame is encoded into a bytes.Buffer and written atomically so a partial Encode failure can't ship a half-formed frame.
- Per-row r.Context().Err() check so a client disconnect doesn't waste a full Postgres page (1000 rows) before the page-level ctx check inside Stream.

MemoryLogger / Logger interface changes:
- MemoryLogger.Log auto-assigns Event.ID (uuid.NewString) when unset, matching Postgres. Tests no longer need to set IDs explicitly to round-trip through /events/{id}.
- MemoryLogger now implements PayloadLogger (returns the stored Payload pointer) so the replay handler can fetch original params in unit tests without a Postgres dependency.

Tests:
- pkg/audit/subscribe_test.go: race-tested AsyncLogger.Subscribe (delivery, fan-out, cancel, slow-consumer drop, failed inner write skips broadcast).
- pkg/httpsrv/ratelimit_test.go: token bucket burst, refill, per-key independence, empty-key fail-open, GC of idle buckets.
- pkg/httpsrv/portal_api_replay_test.go: full replay happy path via in-process MCP server, 503/400/404/429 paths, deepCopyMap, callToolResultToMap content-type matrix, identityKey, SSE stream-delivers-event end-to-end.
- tests/audit_replay_test.go + audit_stream_test.go: full HTTP stack via portalApp; replay roundtrip, redacted-refusal, invalid-UUID; SSE delivers within 3s, opening ': connected' comment.

Docs:
- docs/operations/audit.md: replay subsection (with example), live-tail subsection, replay-rerunning-side-effects warning.
- docs/reference/http-api.md: replay + stream endpoint rows.

Process:
- Pre-commit adversarial-review gate ran 3 rounds: 17 + 6 + 0 findings. All real findings fixed before this commit. make verify green; race-tested; integration tests green.

CodeQL on PR #11 fired go/clear-text-logging on the audit.Log(*ev) call: err.Error() flows into Event.ErrorMessage, then *ev is passed to a function literally named Log. The chain is real but the audit logger's contract is to capture this verbatim (sanitized via redact_keys) for forensics, so the rule doesn't apply by design.

Fix:
- .github/codeql/codeql-config.yml: exclude go/clear-text-logging with documented justification.
- .github/workflows/codeql.yml: switch from queries: security-and-quality to config-file. CI now reads the same exclusions a local run does.
- Makefile: make codeql target. Uses the project config; runs the same query suite CI runs. Heavy (~3 min on first invocation, ~1 min cached); not part of make verify by default.
- scripts/codeql-gate.sh: parses the SARIF output, drops findings matching the config's query-filters.exclude.id list, exits non-zero if any unfiltered finding remains. Portable bash 3.2 (no mapfile / readarray) so it runs on the macOS dev path.

Local verification: make codeql produced 6 findings, all ruleId=go/clear-text-logging; codeql-gate filtered them all; clean.

The pre-commit-gate hook (~/.claude/hooks/review-gate.sh) does NOT run codeql by default because it is too slow for every commit. The memory checklist + prompt template are updated to require make codeql before push for substantial branches. This was the gap PR #11 hit: gosec + semgrep don't have CodeQL's data-flow rules, so the local verify said clean but CodeQL on push caught it.

github-advanced-security AI found potential problems May 6, 2026

Comment thread pkg/httpsrv/portal_api.go Fixed

cjimti merged commit 8922aaa into main May 6, 2026
8 checks passed

cjimti deleted the feat/audit-replay-sse branch May 6, 2026 09:06

This was referenced May 6, 2026

audit-inspection: complete the inspection / debugging utility #8

Closed

feat(audit): portal inspection UI — drawer, compare page, walkthrough docs #12

Merged
