Performance optimizations by gateixeira · Pull Request #6 · gateixeira/live-actions

gateixeira · 2026-05-18T09:44:24Z

This pull request introduces several improvements to both the backend and frontend of the application, focusing on reliability, observability, and user experience. The most significant changes include the addition of a robust /readyz readiness endpoint, enhancements to metrics and database handling, and new frontend hooks to provide smoother live updates without UI jitter during user interaction.

Previously, internal/database used a single *sql.DB capped at one open connection. Every read was serialized behind every write, so a slow analytics query could stall webhook ingestion. Open two *sql.DB handles against the same SQLite file: - writeDB: 1 connection (SQLite has a single writer) - readDB: pooled (max(NumCPU, 4) connections) With WAL enabled, readers no longer block the writer (and vice versa), so concurrent SELECTs run in parallel with webhook writes. Also encode connection pragmas (journal_mode, busy_timeout, synchronous, foreign_keys) in the DSN via _pragma= so they are applied to every pooled connection rather than only the first one (pragmas other than journal_mode are connection-scoped). Routing: BeginTx/ExecContext go to writeDB; QueryContext/QueryRowContext go to readDB. All ExecContext sites in this package are INSERT/UPDATE/DELETE, so this routing is correct. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The HTTP webhook handler used to call EventOrderingService.AddEvent, which synchronously INSERTed into webhook_events on the single writer connection. Each request thus blocked on a writer-pool slot, contending with the flush worker's UPSERTs into workflow_jobs/runs. Replace this with a buffered ingest pipeline: POST /webhook -> validate + parse + build OrderedEvent -> push onto in-memory channel (default capacity: 10000) -> 202 Accepted ingestWorker (new goroutine in EventOrderingService) -> drains the channel -> batches up to 200 events or 50ms -> persists via DBWrapper.StoreWebhookEvents (single transaction with one prepared INSERT statement) flushWorker (existing) -> unchanged: replays persisted events through processFunc Back-pressure: AddEvent uses a non-blocking fast path, then blocks for up to enqueueTimeout (default 8s, under GitHub's ~10s webhook timeout) when the channel is full. On timeout it returns ErrIngestQueueFull and the HTTP handler responds 503. GitHub does not auto-retry webhooks, so the warning log + 503 surface signals the operator to manually redeliver from the GitHub UI for the affected delivery_id. Shutdown ordering: Stop() cancels the service context. ingestWorker drains any remaining buffered events using a fresh background context (s.ctx is already cancelled), closes ingestDoneCh, then exits. The flush worker waits on ingestDoneCh before running its final flushAll, so events that were in-flight at shutdown are persisted to webhook_events before the final replay attempt. Crash window: events still in the in-memory channel at the moment of a hard process crash are lost. This is bounded by ingestBatchWait (50ms) times ingestBatchSize (200) and is the deliberate trade-off for the throughput improvement. Adds DBWrapper.StoreWebhookEvents (batched UPSERT) to the database interface and mock; updates the affected ordering-service tests to assert the new channel + batch semantics rather than the old synchronous DB write. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Previously the flushWorker drained at most batchSize=100 events every flushInterval=5s, capping end-to-end processing at ~20 events/sec — the dominant bottleneck once async ingest landed. Sustained load above that rate produced an unboundedly growing webhook_events backlog and made the UI lag minutes behind real time. Changes: - Bump defaults: batchSize 100 -> 500, flushInterval 5s -> 1s, raising per-tick capacity ~25x. - Drain loop in flushReadyEvents pulls successive batches until the query returns < batchSize, bounded by a 500ms time budget and 20 iterations to keep the single SQLite writer available to the ingest worker and to respect ctx cancellation between batches. - flushAll on shutdown now also loops until the queue is empty (or its fresh 30s deadline expires), instead of stopping after one 1000-row fetch. - Drop the redundant StoreWebhookEvent UPSERT inside processOrderedEvent: the row is already persisted by ingestWorker, so re-upserting it added one writer-pool roundtrip per event with no observable effect. Adds a multi-iteration drain test and updates the default-value assertions. Full suite passes with -race. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The dashboard's Peak Demand card reads MAX(running+queued) from metrics_snapshots, which was being written every 10s. Between snapshot ticks the live Running/Queued cards (computed from workflow_jobs at read time) could exceed the recorded peak, producing the visible skew of 'sum of cards > peak demand'. Tightening the tick to 2s reduces the worst-case undercounting window from 10s to 2s and is well within the writer pool's headroom now that the flush worker no longer monopolises it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Under sustained load (e.g. RATE=200 webhooks/s), the server pushed one SSE per processed job/run event, causing the React dashboard to re-render ~200 times/sec. Latency was fine but the UI flickered and felt unresponsive because every metrics_update toggled live counters and every workflow_update triggered a table refetch. Add a server-side coalescer that keeps only the latest payload per event type and emits at most once per 500ms via a single ticker goroutine. SendMetricsUpdate and SendWorkflowUpdate now route through the coalescer; SendEvent remains a direct/immediate path for callers that need it. Since the dashboard always refetches detail data via REST after each SSE notification, dropping intermediate updates is safe — the client still observes the most recent state, just at a steady ~2/sec/type cadence instead of N/sec. Tests: add stopLeakedSSECoalescer helper called from each setup function so the background goroutine does not race with subsequent tests' logger init. Make stop() idempotent via sync.Once. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The dashboard previously re-rendered on every SSE event, which under sustained load made the metrics cards twitch and forced the workflow table to refetch while the user was scrolling or expanding rows. Add useThrottledLive: a hook that buffers high-frequency updates in a ref and only commits them to React state every 30s. Calling its setter does not trigger a re-render — re-renders happen on the timer tick or when interaction ends. SSE events are still received and captured; they are just applied to the UI on a calmer cadence. Wire the dashboard's live running/queued counters and the workflow refresh signal through the hook. Pause commits while the pointer is over the workflow table or focus is inside it, and flush any buffered update immediately when interaction ends so the user sees fresh data the moment they look away. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add three connection-level pragmas to both the read and write pools: - cache_size=-65536: 64 MiB page cache per connection so hot indexes and pages stay resident across queries (default is 2 MiB). - temp_store=MEMORY: keep temporary tables and sort scratch in RAM instead of spilling to disk, speeding up the GROUP BY / ORDER BY queries used by the dashboard. - mmap_size=256 MiB: memory-map the main database file for reads, cutting syscall overhead on pooled SELECTs. These are zero-semantic-change tuning knobs and apply uniformly to the existing DSN-based pragma plumbing. Update pool_smoke_test.go to assert the new values are seen by both pools. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Every workflow_job webhook used to call sendMetricsUpdate(), which ran GetCurrentJobCounts against the read pool and pushed an SSE event. Under sustained load that meant one extra DB query per ingested job event, even though MetricsUpdateService already polls the same counts every 2s and pushes them through the SSE coalescer. Remove the per-event call (and the helper function) so the only producer of metrics_update events is MetricsUpdateService. The UI cadence stays the same — coalesced server-side at 500ms, throttled client-side to 30s. Drop the now-obsolete TestWorkflowJobHandler_HandleEvent_GetCurrentJob CountsError and the GetCurrentJobCounts mock setups from the remaining job-handler tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Previously the SSE handler had a single shared events channel that every HTTP connection's forwarder goroutine read from. Go delivers each value on a channel to exactly one receiver, so with N open browser tabs each event landed in only one tab — multi-tab dashboards saw a roughly 1/N slice of updates while keep-alives masked the issue. Replace the shared channel with a registry of per-connection subscribers. sendEventNow snapshots the subscriber set under a mutex and does a non-blocking send to each (full per-subscriber buffer is dropped for that subscriber only, never blocks peers). HandleSSE subscribes on entry, defers unsubscribe, and reads its own channel directly — no forwarder goroutine. Adds regression tests covering fanout-to-all-subscribers, slow-subscriber-does-not-block-peers, and unsubscribe semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds operational visibility into the webhook ingest pipeline and SSE fanout, plus a real readiness probe that load balancers can act on. Endpoints - /readyz: pings both DB pools (write/read) and refuses traffic when the in-memory ingest queue is >= 95% full. /healthz remains a simple process-liveness probe. New Prometheus metrics - live_actions_webhook_events_total{event_type, outcome} outcome ∈ {accepted, rejected_queue_full, rejected_invalid, ignored} - live_actions_ingest_queue_depth (GaugeFunc) - live_actions_ingest_queue_capacity - live_actions_flush_batch_duration_seconds (Histogram) - live_actions_flush_batch_events (Histogram) - live_actions_sse_events_broadcast_total{type} - live_actions_sse_events_dropped_total{type} - live_actions_sse_subscribers (GaugeFunc) - go_sql_* via collectors.NewDBStatsCollector for the write and read pools Plumbing - EventOrderingService.IngestQueueLen / IngestQueueCap accessors - WebhookHandler.OrderingService accessor (for readyz + metric registration) - SSEHandler.SubscriberCount exported (kept lower-case alias for old tests) - flushReadyEvents now records duration + processed-event count Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…r aggregates The dashboard previously had two parallel refresh sources for the same data: the 30s REST poll refreshed every metric (including running/queued) AND SSE metrics_update events were buffered in a 30s-throttled commit. With the throttle, a 'live' counter could lag a real event by up to 30s for no benefit. Refactor so each path has a single, well-defined responsibility: - SSE metrics_update is the sole source of truth for the live running/queued counters. Updates land sub-second (paused while the user is interacting with the table to keep the UI stable). - The 30s REST poll only refreshes the chart history and the aggregate metrics (avg_queue_time, avg_run_time, peak_demand). Its current_metrics running_jobs/queued_jobs is used solely as the initial seed before the first SSE event arrives. - WorkflowTable refetches on workflow_update, kept on the throttled path so a burst of events does not trigger an API refetch storm. Adds usePausableLive — a thin variant of useThrottledLive without the 30s commit interval — for live counters that must update immediately when not paused. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…appy-path skip) Centralize status-priority logic in models/status_priority.go and harden AddOrUpdateJob/AddOrUpdateRun with a full lifecycle transition rule, so typed-table writes are correct regardless of arrival order. This is the foundation for the upcoming skip-webhook-events cut-over. Transition rules now enforced in the DB layer: - absent -> insert - terminal == same -> idempotent no-op (false, nil) - terminal != same -> reject (preserves recorded outcome; fixes the cancelled/completed equal-priority collision) - non-terminal lower -> reject (no lifecycle downgrade) - otherwise -> apply Other correctness fixes in the same area: - Reject unknown statuses up-front in both job and run handlers and in the DB layer (was: substitute priority 999, which would always win and poison rows on the upcoming hot path). - WorkflowJobHandler.HandleEvent now propagates AddOrUpdateJob errors instead of swallowing them, matching the run handler. Required so the ordering layer can spill / retry transient DB failures. - Handlers' GetStatusPriority delegates to the shared helpers; unknown statuses now return priority 0 instead of 999. Tests: new internal/database/transition_test.go exercises the rules against a real on-disk SQLite (terminal collision, stale blocks further writes, idempotent terminal replay, lifecycle progression, downgrade rejection, unknown-status rejection for both job and run). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tryable errors The ingest worker now invokes processFunc directly for each event off the in-memory channel. webhook_events is only written to when processing fails with a transient (or unclassified) error — those events are spilled in batches so the cold-path flush worker can retry them. This removes 2 of the 3 SQLite writes per webhook delivery: before: INSERT webhook_events + typed UPSERT + UPDATE webhook_events after: typed UPSERT (+ optional spill INSERT on failure) Now safe because the typed-table UPSERTs landed in the previous commit are order-independent: events arriving out of order are arbitrated by the DB layer's transition rules, not by the webhook_events ordering buffer. Mechanics: - models.OrderedEvent gains Persisted bool. Set true only when the event is read back from webhook_events; processOrderedEvent uses this to decide whether to call MarkEventProcessed/MarkEventFailed (skipped on the happy path because no row exists). - New typed sentinels services.ErrPermanent / ErrTransient. Permanent failures (no handler registered, etc.) are dropped; transient and unclassified errors are spilled. Default is conservative: spill on unknown. - flushInterval bumped 1s -> 5s; that worker only services the cold spillover path now. - New WebhookEventsTotal outcome labels: spilled, spill_failed, permanent_failure for visibility into the cold path. Crash durability: in-flight events in ingestCh are lost on hard kill. Graceful shutdown still drains the channel (via process + spill) before the flush worker runs flushAll. Documented at the top of ingestWorker. Measured (single-instance, local laptop, sustained scenario): before (step1): 500 req/s, p95=1.66ms, p99=6.23ms, max=134ms after (step2): 2000 req/s, p95=1.24ms, p99=9.99ms, max=78ms, 0 errors, 0 queue-full -> ~4x throughput at lower max latency stress at 756 req/s achieved: p99 6.83 -> 3.69 ms (-46%) Tests: replaced TestEventOrderingService_IngestWorker_BatchesInserts with three new cases: HappyPathSkipsSpill (no DB write at all on success), TransientFailureSpills (DB write happens on error), and PermanentFailureDropped (no spill for permanent errors). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR focuses on improving webhook ingest throughput and overall responsiveness by introducing an async in-memory ingest pipeline with spill-to-SQLite retry, splitting SQLite into read/write pools for better concurrency, adding a proper /readyz readiness probe, improving Prometheus observability, and smoothing frontend “live” updates to avoid UI jitter during user interaction.

Changes:

Added an async ingest queue + spill/retry path for webhook events, with backpressure and readiness integration.
Split SQLite access into separate write/read pools with per-connection PRAGMAs and expanded transition-ordering logic to be lifecycle-order independent.
Implemented SSE coalescing and new frontend hooks to pause/throttle live updates during interaction; refreshed k6 load testing scripts/docs.

Show a summary per file

File	Description
pkg/metrics/metrics.go	Adds new Prometheus metrics and dynamic collectors (DB stats, ingest depth, SSE subscribers).
models/status_priority.go	Centralizes lifecycle priority + terminal-state helpers for jobs/runs.
models/models.go	Adds `OrderedEvent.Persisted` to distinguish hot vs cold path processing.
load-tests/webhook-load.js	Reworks k6 test into constant-arrival-rate benchmark with better metrics and JSON payloads.
load-tests/results/.gitignore	Ignores generated load test output artifacts.
load-tests/README.md	Documents benchmark rationale, scenarios, and comparison workflow.
internal/services/event_ordering_service.go	Implements ingest worker, spill batching, queue backpressure, and improved flush draining.
internal/services/event_ordering_service_test.go	Updates/extends tests for async ingest, spill behavior, and drain loop.
internal/database/workflow.go	Enforces lifecycle monotonicity/terminal guards and routes reads/writes to proper pools.
internal/database/transition_test.go	Adds end-to-end transition rule tests against real SQLite DB.
internal/database/pool_smoke_test.go	Validates PRAGMAs are applied to both DB pools.
internal/database/mock.go	Extends DB mock with batch webhook event insert method.
internal/database/metrics.go	Routes metrics snapshot writes to writer pool and reads to reader pool.
internal/database/label_demand.go	Routes analytics queries to reader pool.
internal/database/interface.go	Updates DB wrapper to maintain separate read/write pools and expands interface.
internal/database/failure_analytics.go	Routes failure analytics queries to reader pool.
internal/database/event.go	Adds transactional batch insert for webhook_events and routes queries to reader pool.
internal/database/db.go	Creates read/write DB handles, applies PRAGMAs via DSN per-connection, tunes pool sizing.
handlers/workflow_run_handler.go	Rejects unknown statuses up-front; uses shared status priority mapping.
handlers/workflow_run_handler_test.go	Aligns tests to new status priority behavior and shared SSE setup.
handlers/workflow_job_handler.go	Rejects unknown statuses; propagates DB upsert errors; uses shared status priority mapping.
handlers/workflow_job_handler_test.go	Updates expectations after metrics update removal + error propagation changes.
handlers/webhook_handler.go	Uses async ordering service enqueue, adds webhook ingest metrics, and only marks persisted events processed/failed.
handlers/webhook_handler_test.go	Ensures SSE coalescer cleanup occurs in setup for isolation.
handlers/sse_handler.go	Replaces single global channel with per-subscriber buffers + coalescer to reduce jitter and fanout contention.
handlers/sse_handler_test.go	Updates/extends tests for new subscriber model, fanout, and coalesced sends.
handlers/setup_test.go	Adds helper intended to stop leaked coalescer goroutines between tests.
handlers/health_handler.go	Introduces `/readyz` that checks DB pools + ingest queue saturation.
handlers/health_handler_test.go	Adds tests for readiness behavior (DB down, queue saturated, nil queue).
handlers/event_handler.go	Exposes ordering service for readiness/metrics wiring.
handlers/api_test.go	Ensures SSE coalescer cleanup occurs in setup for isolation.
frontend/src/hooks/useThrottledLive.ts	Adds throttled “latest value” hook with pause support to reduce refetch storms/jitter.
frontend/src/hooks/usePausableLive.ts	Adds pauseable “live value” hook to defer updates during interaction.
frontend/src/App.tsx	Uses new hooks to pause dashboard updates while interacting and throttle workflow refresh.
cmd/server/server.go	Wires read/write DB pools, `/readyz`, and registers dynamic Prometheus collectors.

Copilot's findings

Files reviewed: 35/35 changed files
Comments generated: 7

Copilot · 2026-05-18T11:48:12Z

+	// Webhook ingest counters: outcome ∈ {accepted, rejected_queue_full,
+	// rejected_invalid, ignored}.
+	WebhookEventsTotal *prometheus.CounterVec


Updated in commit 1975cee: the WebhookEventsTotal outcome comment now includes all currently emitted values (accepted, rejected_queue_full, rejected_invalid, ignored, spilled, spill_failed, permanent_failure).

+g := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
+Name: "live_actions_ingest_queue_depth",
+Help: "Current number of events buffered in the in-memory ingest queue",
+}, depth)
+_ = prometheus.Register(g)


+func stopLeakedSSECoalescer() {
+	if sseHandler != nil && sseHandler.coalescer != nil {
+		sseHandler.coalescer.stop()
+	}
+}


+import (
+"path/filepath"
+"testing"
+
+"github.com/gateixeira/live-actions/pkg/logger"


+
+  useEffect(() => {
+    pausedRef.current = paused


+
+  useEffect(() => {
+    pausedRef.current = paused


+g := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
+Name: "live_actions_ingest_queue_depth",
+Help: "Current number of events buffered in the in-memory ingest queue",
+}, depth)
+_ = prometheus.Register(g)


gateixeira and others added 13 commits April 27, 2026 10:50

gateixeira requested a review from Copilot May 18, 2026 11:39

Copilot started reviewing on behalf of gateixeira May 18, 2026 11:39 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Copilot started work on behalf of gateixeira May 18, 2026 11:44 View session

Update webhook outcome comment to include all label values

1975cee

Copilot finished work on behalf of gateixeira May 18, 2026 11:48

gateixeira merged commit 066ae82 into main May 18, 2026

gateixeira deleted the feat/performance-optimizations branch May 18, 2026 11:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance optimizations#6

Performance optimizations#6
gateixeira merged 14 commits into
mainfrom
feat/performance-optimizations

gateixeira commented May 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

gateixeira commented May 18, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

Copilot AI May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants