Performance optimizations#6
Merged
Merged
Conversation
Previously, internal/database used a single *sql.DB capped at one open connection. Every read was serialized behind every write, so a slow analytics query could stall webhook ingestion. Open two *sql.DB handles against the same SQLite file: - writeDB: 1 connection (SQLite has a single writer) - readDB: pooled (max(NumCPU, 4) connections) With WAL enabled, readers no longer block the writer (and vice versa), so concurrent SELECTs run in parallel with webhook writes. Also encode connection pragmas (journal_mode, busy_timeout, synchronous, foreign_keys) in the DSN via _pragma= so they are applied to every pooled connection rather than only the first one (pragmas other than journal_mode are connection-scoped). Routing: BeginTx/ExecContext go to writeDB; QueryContext/QueryRowContext go to readDB. All ExecContext sites in this package are INSERT/UPDATE/DELETE, so this routing is correct. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The HTTP webhook handler used to call EventOrderingService.AddEvent,
which synchronously INSERTed into webhook_events on the single writer
connection. Each request thus blocked on a writer-pool slot, contending
with the flush worker's UPSERTs into workflow_jobs/runs.
Replace this with a buffered ingest pipeline:
POST /webhook
-> validate + parse + build OrderedEvent
-> push onto in-memory channel (default capacity: 10000)
-> 202 Accepted
ingestWorker (new goroutine in EventOrderingService)
-> drains the channel
-> batches up to 200 events or 50ms
-> persists via DBWrapper.StoreWebhookEvents (single transaction
with one prepared INSERT statement)
flushWorker (existing)
-> unchanged: replays persisted events through processFunc
Back-pressure: AddEvent uses a non-blocking fast path, then blocks for
up to enqueueTimeout (default 8s, under GitHub's ~10s webhook timeout)
when the channel is full. On timeout it returns ErrIngestQueueFull and
the HTTP handler responds 503. GitHub does not auto-retry webhooks, so
the warning log + 503 surface signals the operator to manually
redeliver from the GitHub UI for the affected delivery_id.
Shutdown ordering: Stop() cancels the service context. ingestWorker
drains any remaining buffered events using a fresh background context
(s.ctx is already cancelled), closes ingestDoneCh, then exits. The
flush worker waits on ingestDoneCh before running its final flushAll,
so events that were in-flight at shutdown are persisted to
webhook_events before the final replay attempt.
Crash window: events still in the in-memory channel at the moment of a
hard process crash are lost. This is bounded by ingestBatchWait (50ms)
times ingestBatchSize (200) and is the deliberate trade-off for the
throughput improvement.
Adds DBWrapper.StoreWebhookEvents (batched UPSERT) to the database
interface and mock; updates the affected ordering-service tests to
assert the new channel + batch semantics rather than the old
synchronous DB write.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously the flushWorker drained at most batchSize=100 events every flushInterval=5s, capping end-to-end processing at ~20 events/sec — the dominant bottleneck once async ingest landed. Sustained load above that rate produced an unboundedly growing webhook_events backlog and made the UI lag minutes behind real time. Changes: - Bump defaults: batchSize 100 -> 500, flushInterval 5s -> 1s, raising per-tick capacity ~25x. - Drain loop in flushReadyEvents pulls successive batches until the query returns < batchSize, bounded by a 500ms time budget and 20 iterations to keep the single SQLite writer available to the ingest worker and to respect ctx cancellation between batches. - flushAll on shutdown now also loops until the queue is empty (or its fresh 30s deadline expires), instead of stopping after one 1000-row fetch. - Drop the redundant StoreWebhookEvent UPSERT inside processOrderedEvent: the row is already persisted by ingestWorker, so re-upserting it added one writer-pool roundtrip per event with no observable effect. Adds a multi-iteration drain test and updates the default-value assertions. Full suite passes with -race. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The dashboard's Peak Demand card reads MAX(running+queued) from metrics_snapshots, which was being written every 10s. Between snapshot ticks the live Running/Queued cards (computed from workflow_jobs at read time) could exceed the recorded peak, producing the visible skew of 'sum of cards > peak demand'. Tightening the tick to 2s reduces the worst-case undercounting window from 10s to 2s and is well within the writer pool's headroom now that the flush worker no longer monopolises it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Under sustained load (e.g. RATE=200 webhooks/s), the server pushed one SSE per processed job/run event, causing the React dashboard to re-render ~200 times/sec. Latency was fine but the UI flickered and felt unresponsive because every metrics_update toggled live counters and every workflow_update triggered a table refetch. Add a server-side coalescer that keeps only the latest payload per event type and emits at most once per 500ms via a single ticker goroutine. SendMetricsUpdate and SendWorkflowUpdate now route through the coalescer; SendEvent remains a direct/immediate path for callers that need it. Since the dashboard always refetches detail data via REST after each SSE notification, dropping intermediate updates is safe — the client still observes the most recent state, just at a steady ~2/sec/type cadence instead of N/sec. Tests: add stopLeakedSSECoalescer helper called from each setup function so the background goroutine does not race with subsequent tests' logger init. Make stop() idempotent via sync.Once. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The dashboard previously re-rendered on every SSE event, which under sustained load made the metrics cards twitch and forced the workflow table to refetch while the user was scrolling or expanding rows. Add useThrottledLive: a hook that buffers high-frequency updates in a ref and only commits them to React state every 30s. Calling its setter does not trigger a re-render — re-renders happen on the timer tick or when interaction ends. SSE events are still received and captured; they are just applied to the UI on a calmer cadence. Wire the dashboard's live running/queued counters and the workflow refresh signal through the hook. Pause commits while the pointer is over the workflow table or focus is inside it, and flush any buffered update immediately when interaction ends so the user sees fresh data the moment they look away. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add three connection-level pragmas to both the read and write pools:
- cache_size=-65536: 64 MiB page cache per connection so hot indexes
and pages stay resident across queries (default is 2 MiB).
- temp_store=MEMORY: keep temporary tables and sort scratch in RAM
instead of spilling to disk, speeding up the GROUP BY / ORDER BY
queries used by the dashboard.
- mmap_size=256 MiB: memory-map the main database file for reads,
cutting syscall overhead on pooled SELECTs.
These are zero-semantic-change tuning knobs and apply uniformly to the
existing DSN-based pragma plumbing. Update pool_smoke_test.go to assert
the new values are seen by both pools.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Every workflow_job webhook used to call sendMetricsUpdate(), which ran GetCurrentJobCounts against the read pool and pushed an SSE event. Under sustained load that meant one extra DB query per ingested job event, even though MetricsUpdateService already polls the same counts every 2s and pushes them through the SSE coalescer. Remove the per-event call (and the helper function) so the only producer of metrics_update events is MetricsUpdateService. The UI cadence stays the same — coalesced server-side at 500ms, throttled client-side to 30s. Drop the now-obsolete TestWorkflowJobHandler_HandleEvent_GetCurrentJob CountsError and the GetCurrentJobCounts mock setups from the remaining job-handler tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously the SSE handler had a single shared events channel that every HTTP connection's forwarder goroutine read from. Go delivers each value on a channel to exactly one receiver, so with N open browser tabs each event landed in only one tab — multi-tab dashboards saw a roughly 1/N slice of updates while keep-alives masked the issue. Replace the shared channel with a registry of per-connection subscribers. sendEventNow snapshots the subscriber set under a mutex and does a non-blocking send to each (full per-subscriber buffer is dropped for that subscriber only, never blocks peers). HandleSSE subscribes on entry, defers unsubscribe, and reads its own channel directly — no forwarder goroutine. Adds regression tests covering fanout-to-all-subscribers, slow-subscriber-does-not-block-peers, and unsubscribe semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds operational visibility into the webhook ingest pipeline and SSE fanout,
plus a real readiness probe that load balancers can act on.
Endpoints
- /readyz: pings both DB pools (write/read) and refuses traffic when the
in-memory ingest queue is >= 95% full. /healthz remains a simple
process-liveness probe.
New Prometheus metrics
- live_actions_webhook_events_total{event_type, outcome}
outcome ∈ {accepted, rejected_queue_full, rejected_invalid, ignored}
- live_actions_ingest_queue_depth (GaugeFunc)
- live_actions_ingest_queue_capacity
- live_actions_flush_batch_duration_seconds (Histogram)
- live_actions_flush_batch_events (Histogram)
- live_actions_sse_events_broadcast_total{type}
- live_actions_sse_events_dropped_total{type}
- live_actions_sse_subscribers (GaugeFunc)
- go_sql_* via collectors.NewDBStatsCollector for the write and read pools
Plumbing
- EventOrderingService.IngestQueueLen / IngestQueueCap accessors
- WebhookHandler.OrderingService accessor (for readyz + metric registration)
- SSEHandler.SubscriberCount exported (kept lower-case alias for old tests)
- flushReadyEvents now records duration + processed-event count
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r aggregates
The dashboard previously had two parallel refresh sources for the same data:
the 30s REST poll refreshed every metric (including running/queued) AND SSE
metrics_update events were buffered in a 30s-throttled commit. With the
throttle, a 'live' counter could lag a real event by up to 30s for no
benefit.
Refactor so each path has a single, well-defined responsibility:
- SSE metrics_update is the sole source of truth for the live
running/queued counters. Updates land sub-second (paused while the user
is interacting with the table to keep the UI stable).
- The 30s REST poll only refreshes the chart history and the aggregate
metrics (avg_queue_time, avg_run_time, peak_demand). Its current_metrics
running_jobs/queued_jobs is used solely as the initial seed before the
first SSE event arrives.
- WorkflowTable refetches on workflow_update, kept on the throttled path
so a burst of events does not trigger an API refetch storm.
Adds usePausableLive — a thin variant of useThrottledLive without the
30s commit interval — for live counters that must update immediately when
not paused.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…appy-path skip)
Centralize status-priority logic in models/status_priority.go and harden
AddOrUpdateJob/AddOrUpdateRun with a full lifecycle transition rule, so
typed-table writes are correct regardless of arrival order. This is the
foundation for the upcoming skip-webhook-events cut-over.
Transition rules now enforced in the DB layer:
- absent -> insert
- terminal == same -> idempotent no-op (false, nil)
- terminal != same -> reject (preserves recorded outcome; fixes the
cancelled/completed equal-priority collision)
- non-terminal lower -> reject (no lifecycle downgrade)
- otherwise -> apply
Other correctness fixes in the same area:
- Reject unknown statuses up-front in both job and run handlers and in the
DB layer (was: substitute priority 999, which would always win and
poison rows on the upcoming hot path).
- WorkflowJobHandler.HandleEvent now propagates AddOrUpdateJob errors
instead of swallowing them, matching the run handler. Required so the
ordering layer can spill / retry transient DB failures.
- Handlers' GetStatusPriority delegates to the shared helpers; unknown
statuses now return priority 0 instead of 999.
Tests: new internal/database/transition_test.go exercises the rules
against a real on-disk SQLite (terminal collision, stale blocks further
writes, idempotent terminal replay, lifecycle progression, downgrade
rejection, unknown-status rejection for both job and run).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tryable errors
The ingest worker now invokes processFunc directly for each event off the
in-memory channel. webhook_events is only written to when processing fails
with a transient (or unclassified) error — those events are spilled in
batches so the cold-path flush worker can retry them.
This removes 2 of the 3 SQLite writes per webhook delivery:
before: INSERT webhook_events + typed UPSERT + UPDATE webhook_events
after: typed UPSERT (+ optional spill INSERT on failure)
Now safe because the typed-table UPSERTs landed in the previous commit are
order-independent: events arriving out of order are arbitrated by the DB
layer's transition rules, not by the webhook_events ordering buffer.
Mechanics:
- models.OrderedEvent gains Persisted bool. Set true only when the event
is read back from webhook_events; processOrderedEvent uses this to
decide whether to call MarkEventProcessed/MarkEventFailed (skipped on
the happy path because no row exists).
- New typed sentinels services.ErrPermanent / ErrTransient. Permanent
failures (no handler registered, etc.) are dropped; transient and
unclassified errors are spilled. Default is conservative: spill on
unknown.
- flushInterval bumped 1s -> 5s; that worker only services the cold
spillover path now.
- New WebhookEventsTotal outcome labels: spilled, spill_failed,
permanent_failure for visibility into the cold path.
Crash durability: in-flight events in ingestCh are lost on hard kill.
Graceful shutdown still drains the channel (via process + spill) before
the flush worker runs flushAll. Documented at the top of ingestWorker.
Measured (single-instance, local laptop, sustained scenario):
before (step1): 500 req/s, p95=1.66ms, p99=6.23ms, max=134ms
after (step2): 2000 req/s, p95=1.24ms, p99=9.99ms, max=78ms,
0 errors, 0 queue-full
-> ~4x throughput at lower max latency
stress at 756 req/s achieved: p99 6.83 -> 3.69 ms (-46%)
Tests: replaced TestEventOrderingService_IngestWorker_BatchesInserts with
three new cases: HappyPathSkipsSpill (no DB write at all on success),
TransientFailureSpills (DB write happens on error), and
PermanentFailureDropped (no spill for permanent errors).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR focuses on improving webhook ingest throughput and overall responsiveness by introducing an async in-memory ingest pipeline with spill-to-SQLite retry, splitting SQLite into read/write pools for better concurrency, adding a proper /readyz readiness probe, improving Prometheus observability, and smoothing frontend “live” updates to avoid UI jitter during user interaction.
Changes:
- Added an async ingest queue + spill/retry path for webhook events, with backpressure and readiness integration.
- Split SQLite access into separate write/read pools with per-connection PRAGMAs and expanded transition-ordering logic to be lifecycle-order independent.
- Implemented SSE coalescing and new frontend hooks to pause/throttle live updates during interaction; refreshed k6 load testing scripts/docs.
Show a summary per file
| File | Description |
|---|---|
| pkg/metrics/metrics.go | Adds new Prometheus metrics and dynamic collectors (DB stats, ingest depth, SSE subscribers). |
| models/status_priority.go | Centralizes lifecycle priority + terminal-state helpers for jobs/runs. |
| models/models.go | Adds OrderedEvent.Persisted to distinguish hot vs cold path processing. |
| load-tests/webhook-load.js | Reworks k6 test into constant-arrival-rate benchmark with better metrics and JSON payloads. |
| load-tests/results/.gitignore | Ignores generated load test output artifacts. |
| load-tests/README.md | Documents benchmark rationale, scenarios, and comparison workflow. |
| internal/services/event_ordering_service.go | Implements ingest worker, spill batching, queue backpressure, and improved flush draining. |
| internal/services/event_ordering_service_test.go | Updates/extends tests for async ingest, spill behavior, and drain loop. |
| internal/database/workflow.go | Enforces lifecycle monotonicity/terminal guards and routes reads/writes to proper pools. |
| internal/database/transition_test.go | Adds end-to-end transition rule tests against real SQLite DB. |
| internal/database/pool_smoke_test.go | Validates PRAGMAs are applied to both DB pools. |
| internal/database/mock.go | Extends DB mock with batch webhook event insert method. |
| internal/database/metrics.go | Routes metrics snapshot writes to writer pool and reads to reader pool. |
| internal/database/label_demand.go | Routes analytics queries to reader pool. |
| internal/database/interface.go | Updates DB wrapper to maintain separate read/write pools and expands interface. |
| internal/database/failure_analytics.go | Routes failure analytics queries to reader pool. |
| internal/database/event.go | Adds transactional batch insert for webhook_events and routes queries to reader pool. |
| internal/database/db.go | Creates read/write DB handles, applies PRAGMAs via DSN per-connection, tunes pool sizing. |
| handlers/workflow_run_handler.go | Rejects unknown statuses up-front; uses shared status priority mapping. |
| handlers/workflow_run_handler_test.go | Aligns tests to new status priority behavior and shared SSE setup. |
| handlers/workflow_job_handler.go | Rejects unknown statuses; propagates DB upsert errors; uses shared status priority mapping. |
| handlers/workflow_job_handler_test.go | Updates expectations after metrics update removal + error propagation changes. |
| handlers/webhook_handler.go | Uses async ordering service enqueue, adds webhook ingest metrics, and only marks persisted events processed/failed. |
| handlers/webhook_handler_test.go | Ensures SSE coalescer cleanup occurs in setup for isolation. |
| handlers/sse_handler.go | Replaces single global channel with per-subscriber buffers + coalescer to reduce jitter and fanout contention. |
| handlers/sse_handler_test.go | Updates/extends tests for new subscriber model, fanout, and coalesced sends. |
| handlers/setup_test.go | Adds helper intended to stop leaked coalescer goroutines between tests. |
| handlers/health_handler.go | Introduces /readyz that checks DB pools + ingest queue saturation. |
| handlers/health_handler_test.go | Adds tests for readiness behavior (DB down, queue saturated, nil queue). |
| handlers/event_handler.go | Exposes ordering service for readiness/metrics wiring. |
| handlers/api_test.go | Ensures SSE coalescer cleanup occurs in setup for isolation. |
| frontend/src/hooks/useThrottledLive.ts | Adds throttled “latest value” hook with pause support to reduce refetch storms/jitter. |
| frontend/src/hooks/usePausableLive.ts | Adds pauseable “live value” hook to defer updates during interaction. |
| frontend/src/App.tsx | Uses new hooks to pause dashboard updates while interacting and throttle workflow refresh. |
| cmd/server/server.go | Wires read/write DB pools, /readyz, and registers dynamic Prometheus collectors. |
Copilot's findings
- Files reviewed: 35/35 changed files
- Comments generated: 7
Comment on lines
+24
to
+26
| // Webhook ingest counters: outcome ∈ {accepted, rejected_queue_full, | ||
| // rejected_invalid, ignored}. | ||
| WebhookEventsTotal *prometheus.CounterVec |
Contributor
There was a problem hiding this comment.
Updated in commit 1975cee: the WebhookEventsTotal outcome comment now includes all currently emitted values (accepted, rejected_queue_full, rejected_invalid, ignored, spilled, spill_failed, permanent_failure).
Comment on lines
+145
to
+149
| g := prometheus.NewGaugeFunc(prometheus.GaugeOpts{ | ||
| Name: "live_actions_ingest_queue_depth", | ||
| Help: "Current number of events buffered in the in-memory ingest queue", | ||
| }, depth) | ||
| _ = prometheus.Register(g) |
Comment on lines
+7
to
+11
| func stopLeakedSSECoalescer() { | ||
| if sseHandler != nil && sseHandler.coalescer != nil { | ||
| sseHandler.coalescer.stop() | ||
| } | ||
| } |
Comment on lines
+3
to
+7
| import ( | ||
| "path/filepath" | ||
| "testing" | ||
|
|
||
| "github.com/gateixeira/live-actions/pkg/logger" |
Comment on lines
+31
to
+33
|
|
||
| useEffect(() => { | ||
| pausedRef.current = paused |
Comment on lines
+31
to
+33
|
|
||
| useEffect(() => { | ||
| pausedRef.current = paused |
Comment on lines
+145
to
+149
| g := prometheus.NewGaugeFunc(prometheus.GaugeOpts{ | ||
| Name: "live_actions_ingest_queue_depth", | ||
| Help: "Current number of events buffered in the in-memory ingest queue", | ||
| }, depth) | ||
| _ = prometheus.Register(g) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request introduces several improvements to both the backend and frontend of the application, focusing on reliability, observability, and user experience. The most significant changes include the addition of a robust
/readyzreadiness endpoint, enhancements to metrics and database handling, and new frontend hooks to provide smoother live updates without UI jitter during user interaction.