fix(apigateway): heartbeat, per-chunk persistence, configurable embed-job knobs #480

Merged: cjimti merged 1 commit into main from fix/embed-jobs-heartbeat-incremental-479 on May 25, 2026

@cjimti cjimti commented May 25, 2026

Closes #479.

Summary

Closes the three problems #479 documented:

  1. Configurable timeout and batch size with sensible defaults. apigateway.embed_jobs.batch_size (default 32) and apigateway.embed_jobs.lease_duration (default 10m) are now operator-tunable. Startup logs a warning when embed_timeout >= lease_duration.
  2. Forward progress resets the timeout. A heartbeat goroutine renews the DB lease every lease_duration / 3 while Compute is running, and each completed chunk is written to api_catalog_operation_embeddings immediately via a new additive UpsertOperationEmbeddingsBatch. A job that fails on chunk N now leaves chunks 0..N-1 persisted for the next attempt's dedup pass.
  3. Operator-visible status. A pending job with attempts > 0 now renders in the catalog UI as retrying (N tries) with the last error in the tooltip, instead of being indistinguishable from queued. The reproduction case (a 148-op spec that doom-looped through 6 attempts with no visible signal) is the failure mode this badge change closes.

Reproduction this fixes

Upload a ~150-op OpenAPI spec to a deployment using CPU-only Ollama embeddings. Before this PR: each attempt runs ~10m30s, hits the worker's LeaseDuration + 30s grace ctx ceiling at the last chunk, throws away the prior chunks' work, retries from scratch, repeats forever. UI shows queued through every attempt.

After this PR: the heartbeat keeps the lease alive past lease_duration while Compute is genuinely progressing; per-chunk persistence preserves work across retries; the badge surfaces retry state so a doom loop is visible at a glance.

Architecture changes

Backend

  • embedjobs.Store.RenewLease(ctx, id, workerID, duration) added; Postgres impl issues UPDATE ... SET lease_expires_at = NOW() + interval WHERE id AND worker_id AND status='running', returns ErrNotFound on lease rotation.
  • embedjobs.NewPostgresStore accepts WithLeaseDuration(d) and exposes LeaseDuration().
  • embedjobs.WorkerConfig gains LeaseDuration and BatchSize; defaults are DefaultLeaseDuration = 10m and DefaultEmbedBatchSize = 32.
  • Worker spawns a heartbeat goroutine on every process() that renews the lease while Compute runs. Stops on context cancel or ErrNotFound (lease rotated to another worker).
  • drainQueue's per-iteration ctx ceiling changed from LeaseDuration + 30s grace to a 1-hour processSafetyBound. The earlier ceiling silently defeated the heartbeat (DB lease alive but local ctx canceled at the lease window). The DB lease, heartbeat-maintained, is now the authoritative deadline; the 1-hour bound is a wall-clock backstop against a Compute that hangs without forward progress.
  • EmbeddingComputer.Compute now takes a ComputeRequest struct (BatchSize + PersistBatch callback added).
  • ComputeOperationEmbeddings takes a ComputeRequest and threads BatchSize + PersistBatch through fillFreshEmbeddings into a new embedInBatchesIter (per-chunk callback variant of embedInBatches).
  • catalog.Store.UpsertOperationEmbeddingsBatch added: INSERT ... ON CONFLICT (catalog_id, spec_name, operation_id) DO UPDATE. Preserves rows outside the batch. Memory and Postgres backends both implemented.
  • The atomic UpsertOperationEmbeddings at job completion still runs as the canonical full replacement, so removed operations are cleaned up.
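
To illustrate why both write paths exist, here is a small in-memory sketch of the two contracts. The Row and memStore types and the ReplaceAll method name are hypothetical simplifications, not the real catalog.Store; the sketch only shows that an additive batch upsert keeps rows outside the batch while a full replacement drops them.

```go
package main

import "fmt"

// Row is a hypothetical, simplified embedding row keyed by operation ID.
type Row struct {
	OperationID string
	Vector      []float32
}

// memStore is an illustrative in-memory stand-in, not the real catalog.Store.
type memStore struct{ rows map[string]Row }

func newMemStore() *memStore { return &memStore{rows: map[string]Row{}} }

// UpsertBatch is additive: rows in the batch are inserted or updated,
// and rows outside the batch are left untouched.
func (s *memStore) UpsertBatch(batch []Row) {
	for _, r := range batch {
		s.rows[r.OperationID] = r
	}
}

// ReplaceAll is the full replacement: anything not in rows is dropped,
// which is how operations removed from a spec get cleaned up.
func (s *memStore) ReplaceAll(rows []Row) {
	s.rows = make(map[string]Row, len(rows))
	s.UpsertBatch(rows)
}

func main() {
	s := newMemStore()
	s.UpsertBatch([]Row{{OperationID: "listPets"}, {OperationID: "getPet"}})
	s.UpsertBatch([]Row{{OperationID: "deletePet"}}) // earlier rows survive
	fmt.Println(len(s.rows))                         // prints 3

	s.ReplaceAll([]Row{{OperationID: "listPets"}}) // stale rows are removed
	fmt.Println(len(s.rows))                       // prints 1
}
```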

UI

EmbeddingStatusBadge in ui/src/pages/settings/CatalogsPanel.tsx adds a sixth state. When job_status === "pending" && job_attempts > 0, renders retrying (N tries) with the last error in the tooltip.

Config

apigateway:
  embed_jobs:
    workers: 1            # goroutines per pod
    embed_timeout: 5m     # per-batch HTTP call timeout
    lease_duration: 10m   # claim window; the heartbeat re-stamps it every lease_duration/3
    batch_size: 32        # texts per upstream EmbedBatch call

Tests

New unit tests:

  • TestWorker_HeartbeatRenewsLeaseWhileComputing — heartbeat fires during a long Compute.
  • TestWorker_HeartbeatStopsAfterCompute — goroutine exits on the deferred cancel.
  • TestWorker_HeartbeatLetsComputeOutlastLeaseDuration — regression gate against a future ctx ceiling tied to LeaseDuration.
  • TestWorker_PersistBatchForwardsToUpsertBatch — PersistBatch callback reaches Persister.UpsertBatch.
  • TestWorker_PersistBatchErrorFailsCompute — DB error during incremental persist surfaces as a retryable failure.
  • TestWorker_BatchSizeFlowsToComputeRequest — config flows from WorkerConfig.BatchSize to ComputeRequest.BatchSize.
  • TestNewWorker_DefaultsLeaseAndBatchSize — defaults apply when caller omits both.
  • TestRenewLease_HappyPath / _NotFoundOnLeaseRotation / _DBErrorWrapped / _NonPositiveDurationFallsBackToConfigured — Postgres semantics.
  • TestWithLeaseDuration_StampsConfiguredValueOnClaim / _NonPositiveKeepsDefault — option plumbing.
  • TestMemoryStore_UpsertOperationEmbeddingsBatch_AdditiveSemantics / _UpdatesExisting / _WithoutSpec — memory backend additive contract.
  • TestUpsertOperationEmbeddingsBatch_OnConflictUpdates / _EmptyRowsShortCircuits / _FKViolation_ReturnsNotFound / _CommitError — Postgres backend.

Integration test:

  • TestWorker_IntegrationResumeAfterMidJobFailure — wires the real Worker against a fake queue and an in-memory persister, drives a 3-chunk job that fails on chunk 1 on attempt 1 and succeeds on attempt 2. Asserts all rows are present in the persister after the retry, proving prior chunks survived the failure.
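
The resume behavior that test exercises can be sketched without the real Worker. Below, embedInChunks is a hypothetical analogue of a per-chunk iterator with a persist callback, and the done map stands in for the dedup pass; neither is the actual embedInBatchesIter or persister. The scenario uses 6 ops in chunks of 2, so attempt 1 fails on chunk 1 after chunk 0 has been persisted.

```go
package main

import (
	"errors"
	"fmt"
)

// embedInChunks skips IDs already marked done, embeds the rest in chunks
// of batchSize, and calls persist right after each successful chunk.
func embedInChunks(ids []string, batchSize int, done map[string]bool,
	embed func(chunk []string) error, persist func(chunk []string) error) error {
	if batchSize <= 0 {
		batchSize = 1
	}
	var todo []string
	for _, id := range ids {
		if !done[id] {
			todo = append(todo, id)
		}
	}
	for start := 0; start < len(todo); start += batchSize {
		end := start + batchSize
		if end > len(todo) {
			end = len(todo)
		}
		chunk := todo[start:end]
		if err := embed(chunk); err != nil {
			return fmt.Errorf("embed chunk at offset %d: %w", start, err)
		}
		if err := persist(chunk); err != nil {
			return fmt.Errorf("persist chunk at offset %d: %w", start, err)
		}
	}
	return nil
}

// runTwoAttempts fails the second embed call on attempt 1, then retries
// with a healthy embedder and reports how much work the retry needed.
func runTwoAttempts() (firstErr error, embeddedOnRetry, persisted int) {
	ids := []string{"op0", "op1", "op2", "op3", "op4", "op5"}
	done := map[string]bool{}
	persist := func(chunk []string) error {
		for _, id := range chunk {
			done[id] = true
		}
		return nil
	}
	calls := 0
	flaky := func([]string) error {
		calls++
		if calls == 2 {
			return errors.New("embedder unavailable")
		}
		return nil
	}
	firstErr = embedInChunks(ids, 2, done, flaky, persist)

	healthy := func(chunk []string) error {
		embeddedOnRetry += len(chunk)
		return nil
	}
	_ = embedInChunks(ids, 2, done, healthy, persist)
	return firstErr, embeddedOnRetry, len(done)
}

func main() {
	firstErr, retried, total := runTwoAttempts()
	fmt.Println(firstErr != nil, retried, total) // prints true 4 6
}
```

The retry embeds 4 operations rather than 6, which is the same signal the test plan checks for when it compares the Ollama call count on the second attempt against the total op count.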

make verify passes (fmt, race tests, ≥80% coverage, golangci-lint with --new-from-rev, gosec, govulncheck, semgrep, dead-code, mutation ≥60%, doc-check, release-check).

Deferred (pre-existing, not introduced here)

catalog.ErrNotFound returned by Upsert and the new UpsertBatch is wrapped by the persister with fmt.Errorf("catalogEmbeddingPersister: %w", err) and retried up to MaxAttempts instead of failing terminally. The existing UpsertOperationEmbeddings path has the same behavior and predates this PR. A deleted-mid-job spec costs ~155s of retry backoff before moving to failed. Worth a separate change to translate to a terminal sentinel.

Operator notes

Existing deployments with embed_timeout already configured but no lease_duration get the 10m default for the lease; the startup warning will fire if embed_timeout >= 10m. Recommendation for CPU-only Ollama on large specs:

apigateway:
  embed_jobs:
    embed_timeout: 15m
    lease_duration: 20m

Test plan

  • CI builds the image
  • Bump lease_duration in the deployment configmap; roll
  • Re-upload a ~150-op OpenAPI spec; confirm the job completes and the UI shows green N/M indexed
  • Force a transient failure (e.g., temporarily blackhole the embedder); confirm the badge shows retrying (N tries) with the upstream error in the tooltip
  • After recovery, confirm the retry resumes via the dedup map (Ollama call count for the second attempt < total ops, proving prior chunks were preserved)

…-job knobs (#479)

The embed-job worker held a fixed 10-minute lease and persisted all
chunks in one final atomic write. On a CPU-only embedder, a 148-op
spec ran ~10.5 min per attempt, hit the reaper's lease ceiling at
the last chunk, threw away the prior chunks' work, and retried from
scratch forever, silently rendering "queued" in the UI.

This change:

- Adds Store.RenewLease + a worker heartbeat goroutine that
  re-stamps lease_expires_at every lease_duration/3 while Compute
  runs, so a slow embed pass no longer looks abandoned.
- Removes the worker's local LeaseDuration+30s ctx ceiling that
  silently defeated the heartbeat. drainQueue now uses a 1-hour
  processSafetyBound as the wall-clock backstop; the DB lease
  (heartbeat-maintained) is the authoritative deadline for normal
  operation.
- Threads a PersistBatch callback through ComputeOperationEmbeddings
  and embedInBatchesIter so each chunk's rows hit
  api_catalog_operation_embeddings via a new
  UpsertOperationEmbeddingsBatch (INSERT ON CONFLICT DO UPDATE,
  preserves rows outside the batch). The final atomic Upsert at
  job completion still does the canonical full replacement.
- Adds apigateway.embed_jobs.batch_size and lease_duration as
  config knobs with sensible defaults (32, 10m) and a startup
  warning when embed_timeout >= lease_duration.
- Updates EmbeddingStatusBadge to render a pending job with
  attempts > 0 as "retrying (N tries)" with the last error in
  the tooltip, closing the silent-failure mode that hid the doom loop.

Includes an end-to-end integration test that proves a mid-job
failure preserves prior chunks across attempts, plus a regression
gate (TestWorker_HeartbeatLetsComputeOutlastLeaseDuration) that
fails if a future refactor reintroduces a ctx ceiling tied to
LeaseDuration.

Deferred (pre-existing, not introduced by this PR): catalog.ErrNotFound
returned by Upsert / UpsertBatch is wrapped by the persister and
retried up to MaxAttempts instead of terminating immediately. The
behavior matches the pre-existing UpsertOperationEmbeddings path
and a deleted-mid-job spec costs ~155s of retry backoff before
failing; worth a separate change to translate to a terminal sentinel.
@cjimti cjimti merged commit 443fae7 into main May 25, 2026
7 checks passed

codecov Bot commented May 25, 2026

Codecov Report

❌ Patch coverage is 83.70044% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.19%. Comparing base (b1f7651) to head (a79dc48).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/platform/apigateway_embed_jobs.go | 68.51% | 16 Missing and 1 partial ⚠️ |
| pkg/toolkits/apigateway/embedjobs/worker.go | 87.17% | 3 Missing and 2 partials ⚠️ |
| pkg/toolkits/apigateway/catalog/store_postgres.go | 86.66% | 2 Missing and 2 partials ⚠️ |
| pkg/toolkits/apigateway/ranking.go | 73.33% | 2 Missing and 2 partials ⚠️ |
| pkg/toolkits/apigateway/embed_spec.go | 92.30% | 2 Missing and 1 partial ⚠️ |
| pkg/toolkits/apigateway/catalog/memory.go | 88.88% | 1 Missing and 1 partial ⚠️ |
| ...kg/toolkits/apigateway/embedjobs/store_postgres.go | 93.75% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #480      +/-   ##
==========================================
- Coverage   86.21%   86.19%   -0.03%     
==========================================
  Files         237      237              
  Lines       32792    32969     +177     
==========================================
+ Hits        28272    28416     +144     
- Misses       3272     3296      +24     
- Partials     1248     1257       +9     

Development

Successfully merging this pull request may close these issues.

apigateway embed jobs: configurable batch size, lease heartbeat, and operator-visible status