Skip to content

Performance optimizations#6

Merged
gateixeira merged 14 commits into
mainfrom
feat/performance-optimizations
May 18, 2026
Merged

Performance optimizations#6
gateixeira merged 14 commits into
mainfrom
feat/performance-optimizations

Conversation

@gateixeira
Copy link
Copy Markdown
Owner

This pull request introduces several improvements to both the backend and frontend of the application, focusing on reliability, observability, and user experience. The most significant changes include the addition of a robust /readyz readiness endpoint, enhancements to metrics and database handling, and new frontend hooks to provide smoother live updates without UI jitter during user interaction.

gateixeira and others added 13 commits April 27, 2026 10:50
Previously, internal/database used a single *sql.DB capped at one open
connection. Every read was serialized behind every write, so a slow
analytics query could stall webhook ingestion.

Open two *sql.DB handles against the same SQLite file:
  - writeDB: 1 connection (SQLite has a single writer)
  - readDB:  pooled (max(NumCPU, 4) connections)

With WAL enabled, readers no longer block the writer (and vice versa),
so concurrent SELECTs run in parallel with webhook writes.

Also encode connection pragmas (journal_mode, busy_timeout, synchronous,
foreign_keys) in the DSN via _pragma= so they are applied to every
pooled connection rather than only the first one (pragmas other than
journal_mode are connection-scoped).

Routing: BeginTx/ExecContext go to writeDB; QueryContext/QueryRowContext
go to readDB. All ExecContext sites in this package are
INSERT/UPDATE/DELETE, so this routing is correct.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The HTTP webhook handler used to call EventOrderingService.AddEvent,
which synchronously INSERTed into webhook_events on the single writer
connection. Each request thus blocked on a writer-pool slot, contending
with the flush worker's UPSERTs into workflow_jobs/runs.

Replace this with a buffered ingest pipeline:

  POST /webhook
    -> validate + parse + build OrderedEvent
    -> push onto in-memory channel (default capacity: 10000)
    -> 202 Accepted

  ingestWorker (new goroutine in EventOrderingService)
    -> drains the channel
    -> batches up to 200 events or 50ms
    -> persists via DBWrapper.StoreWebhookEvents (single transaction
       with one prepared INSERT statement)

  flushWorker (existing)
    -> unchanged: replays persisted events through processFunc

Back-pressure: AddEvent uses a non-blocking fast path, then blocks for
up to enqueueTimeout (default 8s, under GitHub's ~10s webhook timeout)
when the channel is full. On timeout it returns ErrIngestQueueFull and
the HTTP handler responds 503. GitHub does not auto-retry webhooks, so
the warning log + 503 surface signals the operator to manually
redeliver from the GitHub UI for the affected delivery_id.

Shutdown ordering: Stop() cancels the service context. ingestWorker
drains any remaining buffered events using a fresh background context
(s.ctx is already cancelled), closes ingestDoneCh, then exits. The
flush worker waits on ingestDoneCh before running its final flushAll,
so events that were in-flight at shutdown are persisted to
webhook_events before the final replay attempt.

Crash window: events still in the in-memory channel at the moment of a
hard process crash are lost. This is bounded by ingestBatchWait (50ms)
times ingestBatchSize (200) and is the deliberate trade-off for the
throughput improvement.

Adds DBWrapper.StoreWebhookEvents (batched UPSERT) to the database
interface and mock; updates the affected ordering-service tests to
assert the new channel + batch semantics rather than the old
synchronous DB write.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously the flushWorker drained at most batchSize=100 events every
flushInterval=5s, capping end-to-end processing at ~20 events/sec — the
dominant bottleneck once async ingest landed. Sustained load above that
rate produced an unboundedly growing webhook_events backlog and made
the UI lag minutes behind real time.

Changes:
- Bump defaults: batchSize 100 -> 500, flushInterval 5s -> 1s, raising
  per-tick capacity ~25x.
- Drain loop in flushReadyEvents pulls successive batches until the
  query returns < batchSize, bounded by a 500ms time budget and 20
  iterations to keep the single SQLite writer available to the ingest
  worker and to respect ctx cancellation between batches.
- flushAll on shutdown now also loops until the queue is empty (or its
  fresh 30s deadline expires), instead of stopping after one 1000-row
  fetch.
- Drop the redundant StoreWebhookEvent UPSERT inside
  processOrderedEvent: the row is already persisted by ingestWorker, so
  re-upserting it added one writer-pool roundtrip per event with no
  observable effect.

Adds a multi-iteration drain test and updates the default-value
assertions. Full suite passes with -race.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The dashboard's Peak Demand card reads MAX(running+queued) from
metrics_snapshots, which was being written every 10s. Between snapshot
ticks the live Running/Queued cards (computed from workflow_jobs at
read time) could exceed the recorded peak, producing the visible skew
of 'sum of cards > peak demand'.

Tightening the tick to 2s reduces the worst-case undercounting window
from 10s to 2s and is well within the writer pool's headroom now that
the flush worker no longer monopolises it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Under sustained load (e.g. RATE=200 webhooks/s), the server pushed one
SSE per processed job/run event, causing the React dashboard to
re-render ~200 times/sec. Latency was fine but the UI flickered and
felt unresponsive because every metrics_update toggled live counters
and every workflow_update triggered a table refetch.

Add a server-side coalescer that keeps only the latest payload per
event type and emits at most once per 500ms via a single ticker
goroutine. SendMetricsUpdate and SendWorkflowUpdate now route through
the coalescer; SendEvent remains a direct/immediate path for callers
that need it.

Since the dashboard always refetches detail data via REST after each
SSE notification, dropping intermediate updates is safe — the client
still observes the most recent state, just at a steady ~2/sec/type
cadence instead of N/sec.

Tests: add stopLeakedSSECoalescer helper called from each setup
function so the background goroutine does not race with subsequent
tests' logger init. Make stop() idempotent via sync.Once.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The dashboard previously re-rendered on every SSE event, which under
sustained load made the metrics cards twitch and forced the workflow
table to refetch while the user was scrolling or expanding rows.

Add useThrottledLive: a hook that buffers high-frequency updates in a
ref and only commits them to React state every 30s. Calling its setter
does not trigger a re-render — re-renders happen on the timer tick or
when interaction ends. SSE events are still received and captured;
they are just applied to the UI on a calmer cadence.

Wire the dashboard's live running/queued counters and the workflow
refresh signal through the hook. Pause commits while the pointer is
over the workflow table or focus is inside it, and flush any buffered
update immediately when interaction ends so the user sees fresh data
the moment they look away.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add three connection-level pragmas to both the read and write pools:

  - cache_size=-65536: 64 MiB page cache per connection so hot indexes
    and pages stay resident across queries (default is 2 MiB).
  - temp_store=MEMORY: keep temporary tables and sort scratch in RAM
    instead of spilling to disk, speeding up the GROUP BY / ORDER BY
    queries used by the dashboard.
  - mmap_size=256 MiB: memory-map the main database file for reads,
    cutting syscall overhead on pooled SELECTs.

These are zero-semantic-change tuning knobs and apply uniformly to the
existing DSN-based pragma plumbing. Update pool_smoke_test.go to assert
the new values are seen by both pools.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Every workflow_job webhook used to call sendMetricsUpdate(), which ran
GetCurrentJobCounts against the read pool and pushed an SSE event.
Under sustained load that meant one extra DB query per ingested job
event, even though MetricsUpdateService already polls the same counts
every 2s and pushes them through the SSE coalescer.

Remove the per-event call (and the helper function) so the only
producer of metrics_update events is MetricsUpdateService. The UI
cadence stays the same — coalesced server-side at 500ms, throttled
client-side to 30s.

Drop the now-obsolete TestWorkflowJobHandler_HandleEvent_GetCurrentJob
CountsError and the GetCurrentJobCounts mock setups from the remaining
job-handler tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously the SSE handler had a single shared events channel that every
HTTP connection's forwarder goroutine read from. Go delivers each value
on a channel to exactly one receiver, so with N open browser tabs each
event landed in only one tab — multi-tab dashboards saw a roughly 1/N
slice of updates while keep-alives masked the issue.

Replace the shared channel with a registry of per-connection
subscribers. sendEventNow snapshots the subscriber set under a mutex
and does a non-blocking send to each (full per-subscriber buffer is
dropped for that subscriber only, never blocks peers). HandleSSE
subscribes on entry, defers unsubscribe, and reads its own channel
directly — no forwarder goroutine.

Adds regression tests covering fanout-to-all-subscribers,
slow-subscriber-does-not-block-peers, and unsubscribe semantics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds operational visibility into the webhook ingest pipeline and SSE fanout,
plus a real readiness probe that load balancers can act on.

Endpoints
- /readyz: pings both DB pools (write/read) and refuses traffic when the
  in-memory ingest queue is >= 95% full. /healthz remains a simple
  process-liveness probe.

New Prometheus metrics
- live_actions_webhook_events_total{event_type, outcome}
  outcome ∈ {accepted, rejected_queue_full, rejected_invalid, ignored}
- live_actions_ingest_queue_depth (GaugeFunc)
- live_actions_ingest_queue_capacity
- live_actions_flush_batch_duration_seconds (Histogram)
- live_actions_flush_batch_events (Histogram)
- live_actions_sse_events_broadcast_total{type}
- live_actions_sse_events_dropped_total{type}
- live_actions_sse_subscribers (GaugeFunc)
- go_sql_* via collectors.NewDBStatsCollector for the write and read pools

Plumbing
- EventOrderingService.IngestQueueLen / IngestQueueCap accessors
- WebhookHandler.OrderingService accessor (for readyz + metric registration)
- SSEHandler.SubscriberCount exported (kept lower-case alias for old tests)
- flushReadyEvents now records duration + processed-event count

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r aggregates

The dashboard previously had two parallel refresh sources for the same data:
the 30s REST poll refreshed every metric (including running/queued) AND SSE
metrics_update events were buffered in a 30s-throttled commit. With the
throttle, a 'live' counter could lag a real event by up to 30s for no
benefit.

Refactor so each path has a single, well-defined responsibility:
  - SSE metrics_update is the sole source of truth for the live
    running/queued counters. Updates land sub-second (paused while the user
    is interacting with the table to keep the UI stable).
  - The 30s REST poll only refreshes the chart history and the aggregate
    metrics (avg_queue_time, avg_run_time, peak_demand). Its current_metrics
    running_jobs/queued_jobs is used solely as the initial seed before the
    first SSE event arrives.
  - WorkflowTable refetches on workflow_update, kept on the throttled path
    so a burst of events does not trigger an API refetch storm.

Adds usePausableLive — a thin variant of useThrottledLive without the
30s commit interval — for live counters that must update immediately when
not paused.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…appy-path skip)

Centralize status-priority logic in models/status_priority.go and harden
AddOrUpdateJob/AddOrUpdateRun with a full lifecycle transition rule, so
typed-table writes are correct regardless of arrival order. This is the
foundation for the upcoming skip-webhook-events cut-over.

Transition rules now enforced in the DB layer:
- absent             -> insert
- terminal == same   -> idempotent no-op (false, nil)
- terminal != same   -> reject (preserves recorded outcome; fixes the
                        cancelled/completed equal-priority collision)
- non-terminal lower -> reject (no lifecycle downgrade)
- otherwise          -> apply

Other correctness fixes in the same area:
- Reject unknown statuses up-front in both job and run handlers and in the
  DB layer (was: substitute priority 999, which would always win and
  poison rows on the upcoming hot path).
- WorkflowJobHandler.HandleEvent now propagates AddOrUpdateJob errors
  instead of swallowing them, matching the run handler. Required so the
  ordering layer can spill / retry transient DB failures.
- Handlers' GetStatusPriority delegates to the shared helpers; unknown
  statuses now return priority 0 instead of 999.

Tests: new internal/database/transition_test.go exercises the rules
against a real on-disk SQLite (terminal collision, stale blocks further
writes, idempotent terminal replay, lifecycle progression, downgrade
rejection, unknown-status rejection for both job and run).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tryable errors

The ingest worker now invokes processFunc directly for each event off the
in-memory channel. webhook_events is only written to when processing fails
with a transient (or unclassified) error — those events are spilled in
batches so the cold-path flush worker can retry them.

This removes 2 of the 3 SQLite writes per webhook delivery:
  before: INSERT webhook_events  + typed UPSERT + UPDATE webhook_events
  after:  typed UPSERT             (+ optional spill INSERT on failure)

Now safe because the typed-table UPSERTs landed in the previous commit are
order-independent: events arriving out of order are arbitrated by the DB
layer's transition rules, not by the webhook_events ordering buffer.

Mechanics:
  - models.OrderedEvent gains Persisted bool. Set true only when the event
    is read back from webhook_events; processOrderedEvent uses this to
    decide whether to call MarkEventProcessed/MarkEventFailed (skipped on
    the happy path because no row exists).
  - New typed sentinels services.ErrPermanent / ErrTransient. Permanent
    failures (no handler registered, etc.) are dropped; transient and
    unclassified errors are spilled. Default is conservative: spill on
    unknown.
  - flushInterval bumped 1s -> 5s; that worker only services the cold
    spillover path now.
  - New WebhookEventsTotal outcome labels: spilled, spill_failed,
    permanent_failure for visibility into the cold path.

Crash durability: in-flight events in ingestCh are lost on hard kill.
Graceful shutdown still drains the channel (via process + spill) before
the flush worker runs flushAll. Documented at the top of ingestWorker.

Measured (single-instance, local laptop, sustained scenario):
  before (step1):        500 req/s, p95=1.66ms,  p99=6.23ms,  max=134ms
  after  (step2): 2000 req/s,  p95=1.24ms,  p99=9.99ms,  max=78ms,
                  0 errors, 0 queue-full
                  -> ~4x throughput at lower max latency
  stress at 756 req/s achieved: p99 6.83 -> 3.69 ms (-46%)

Tests: replaced TestEventOrderingService_IngestWorker_BatchesInserts with
three new cases: HappyPathSkipsSpill (no DB write at all on success),
TransientFailureSpills (DB write happens on error), and
PermanentFailureDropped (no spill for permanent errors).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR focuses on improving webhook ingest throughput and overall responsiveness by introducing an async in-memory ingest pipeline with spill-to-SQLite retry, splitting SQLite into read/write pools for better concurrency, adding a proper /readyz readiness probe, improving Prometheus observability, and smoothing frontend “live” updates to avoid UI jitter during user interaction.

Changes:

  • Added an async ingest queue + spill/retry path for webhook events, with backpressure and readiness integration.
  • Split SQLite access into separate write/read pools with per-connection PRAGMAs and expanded transition-ordering logic to be lifecycle-order independent.
  • Implemented SSE coalescing and new frontend hooks to pause/throttle live updates during interaction; refreshed k6 load testing scripts/docs.
Show a summary per file
File Description
pkg/metrics/metrics.go Adds new Prometheus metrics and dynamic collectors (DB stats, ingest depth, SSE subscribers).
models/status_priority.go Centralizes lifecycle priority + terminal-state helpers for jobs/runs.
models/models.go Adds OrderedEvent.Persisted to distinguish hot vs cold path processing.
load-tests/webhook-load.js Reworks k6 test into constant-arrival-rate benchmark with better metrics and JSON payloads.
load-tests/results/.gitignore Ignores generated load test output artifacts.
load-tests/README.md Documents benchmark rationale, scenarios, and comparison workflow.
internal/services/event_ordering_service.go Implements ingest worker, spill batching, queue backpressure, and improved flush draining.
internal/services/event_ordering_service_test.go Updates/extends tests for async ingest, spill behavior, and drain loop.
internal/database/workflow.go Enforces lifecycle monotonicity/terminal guards and routes reads/writes to proper pools.
internal/database/transition_test.go Adds end-to-end transition rule tests against real SQLite DB.
internal/database/pool_smoke_test.go Validates PRAGMAs are applied to both DB pools.
internal/database/mock.go Extends DB mock with batch webhook event insert method.
internal/database/metrics.go Routes metrics snapshot writes to writer pool and reads to reader pool.
internal/database/label_demand.go Routes analytics queries to reader pool.
internal/database/interface.go Updates DB wrapper to maintain separate read/write pools and expands interface.
internal/database/failure_analytics.go Routes failure analytics queries to reader pool.
internal/database/event.go Adds transactional batch insert for webhook_events and routes queries to reader pool.
internal/database/db.go Creates read/write DB handles, applies PRAGMAs via DSN per-connection, tunes pool sizing.
handlers/workflow_run_handler.go Rejects unknown statuses up-front; uses shared status priority mapping.
handlers/workflow_run_handler_test.go Aligns tests to new status priority behavior and shared SSE setup.
handlers/workflow_job_handler.go Rejects unknown statuses; propagates DB upsert errors; uses shared status priority mapping.
handlers/workflow_job_handler_test.go Updates expectations after metrics update removal + error propagation changes.
handlers/webhook_handler.go Uses async ordering service enqueue, adds webhook ingest metrics, and only marks persisted events processed/failed.
handlers/webhook_handler_test.go Ensures SSE coalescer cleanup occurs in setup for isolation.
handlers/sse_handler.go Replaces single global channel with per-subscriber buffers + coalescer to reduce jitter and fanout contention.
handlers/sse_handler_test.go Updates/extends tests for new subscriber model, fanout, and coalesced sends.
handlers/setup_test.go Adds helper intended to stop leaked coalescer goroutines between tests.
handlers/health_handler.go Introduces /readyz that checks DB pools + ingest queue saturation.
handlers/health_handler_test.go Adds tests for readiness behavior (DB down, queue saturated, nil queue).
handlers/event_handler.go Exposes ordering service for readiness/metrics wiring.
handlers/api_test.go Ensures SSE coalescer cleanup occurs in setup for isolation.
frontend/src/hooks/useThrottledLive.ts Adds throttled “latest value” hook with pause support to reduce refetch storms/jitter.
frontend/src/hooks/usePausableLive.ts Adds pauseable “live value” hook to defer updates during interaction.
frontend/src/App.tsx Uses new hooks to pause dashboard updates while interacting and throttle workflow refresh.
cmd/server/server.go Wires read/write DB pools, /readyz, and registers dynamic Prometheus collectors.

Copilot's findings

  • Files reviewed: 35/35 changed files
  • Comments generated: 7

Comment thread pkg/metrics/metrics.go
Comment on lines +24 to +26
// Webhook ingest counters: outcome ∈ {accepted, rejected_queue_full,
// rejected_invalid, ignored}.
WebhookEventsTotal *prometheus.CounterVec
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in commit 1975cee: the WebhookEventsTotal outcome comment now includes all currently emitted values (accepted, rejected_queue_full, rejected_invalid, ignored, spilled, spill_failed, permanent_failure).

Comment thread pkg/metrics/metrics.go
Comment on lines +145 to +149
g := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
Name: "live_actions_ingest_queue_depth",
Help: "Current number of events buffered in the in-memory ingest queue",
}, depth)
_ = prometheus.Register(g)
Comment thread handlers/setup_test.go
Comment on lines +7 to +11
func stopLeakedSSECoalescer() {
if sseHandler != nil && sseHandler.coalescer != nil {
sseHandler.coalescer.stop()
}
}
Comment on lines +3 to +7
import (
"path/filepath"
"testing"

"github.com/gateixeira/live-actions/pkg/logger"
Comment on lines +31 to +33

useEffect(() => {
pausedRef.current = paused
Comment on lines +31 to +33

useEffect(() => {
pausedRef.current = paused
Comment thread pkg/metrics/metrics.go
Comment on lines +145 to +149
g := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
Name: "live_actions_ingest_queue_depth",
Help: "Current number of events buffered in the in-memory ingest queue",
}, depth)
_ = prometheus.Register(g)
@gateixeira gateixeira merged commit 066ae82 into main May 18, 2026
@gateixeira gateixeira deleted the feat/performance-optimizations branch May 18, 2026 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants