feat(health): /health and /ready endpoints per RUN_MODE (EVO-1226)#73
Conversation
…ters (EVO-1226)
Readiness needs to verify broker liveness plus existence of the topics a mode
consumes. Add `healthCheck(expectedTopics)` to the broker contract:
- Kafka: confirm producer/admin active, then filter `admin.listTopics()`.
- RabbitMQ: probe the resolved EXCHANGE via `checkExchange` on a throwaway
channel (a failed check closes the channel, so it must never run on the live
consumer channel; bare queue names don't exist — the queue is `${mode}-${topic}`).
Empty `expectedTopics` => connection-only check.
Standardized Kubernetes/Cloud Run probes across the pipeline runner modes (campaign-packer/sender/tracker, event-receiver, event-process): - HealthModule with GET /health (liveness, no dependency checks) and GET /ready (200 only when every mode-relevant dependency answers, else 503 naming the failing indicator). - Composite indicators: Postgres SELECT 1, Redis PING, broker connection+topics, ClickHouse SELECT 1 (event-process only). Each wraps its call in withTimeout and never rejects; the controller aggregates with allSettled. - Worker modes now open an HTTP listener via AppFactory.shouldServeHttp() so the probes are reachable; /health and /ready are excluded from the api/v1 prefix. - @SkipResponseTransform keeps the probe body un-wrapped so the contract is identical whether or not the global response interceptor is registered. - Remove the now-duplicate GET /health from AppController.
Reviewer's GuideAdds a new HealthModule with /health and /ready endpoints wired per RUN_MODE, including dependency health indicators (Postgres, Redis, broker, ClickHouse), a broker healthCheck contract for Kafka/RabbitMQ, unified HTTP listener logic for worker modes, and a SkipResponseTransform mechanism so probe responses bypass the global response wrapper. Sequence diagram for /ready readiness probe flowsequenceDiagram
actor Probe
participant HttpServer
participant HealthController
participant BrokerHealthIndicator
participant IMessageBroker
Probe->>HttpServer: GET /ready
HttpServer->>HealthController: readiness(response)
HealthController->>BrokerHealthIndicator: check()
BrokerHealthIndicator->>IMessageBroker: healthCheck(expectedTopics)
IMessageBroker-->>BrokerHealthIndicator: BrokerHealth
BrokerHealthIndicator-->>HealthController: IndicatorResult
HealthController-->>Probe: 200 or 503 with {status,failing,checks}
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In
HealthController.readinessthe implementation wraps eachcheck()in an individual try/catch while the comment still talks aboutallSettled; consider either actually usingPromise.allSettledor updating the comment to match the current pattern to avoid confusion for future maintainers. - Both HTTP gating methods in
AppFactorynow hard-code slightly differentRunModelists (shouldStartHttpServervsshouldServeHttp); consider centralizing these mode sets or deriving one from the other so future mode additions don’t accidentally drift between the two.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `HealthController.readiness` the implementation wraps each `check()` in an individual try/catch while the comment still talks about `allSettled`; consider either actually using `Promise.allSettled` or updating the comment to match the current pattern to avoid confusion for future maintainers.
- Both HTTP gating methods in `AppFactory` now hard-code slightly different `RunMode` lists (`shouldStartHttpServer` vs `shouldServeHttp`); consider centralizing these mode sets or deriving one from the other so future mode additions don’t accidentally drift between the two.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
…bypass (EVO-1226)
Boots a real Nest HTTP app with the HealthController + the global
ResponseTransformInterceptor and asserts over supertest: /health 200 {status:ok},
/ready 200/503 with the failing indicator named, the body stays un-wrapped
(@SkipResponseTransform honored), and a normal controller is still wrapped
(the bypass is selective, not a globally-disabled interceptor).
dpaes
left a comment
There was a problem hiding this comment.
✅ Approved
Reviewed by cloning the branch and running an adversarial multi-agent pass, then cross-checking the highest-risk seams by hand. All 5 ACs are satisfied at runtime and there are no blockers. Merging to develop.
What I verified (all clean)
- DI resolves in every RUN_MODE:
DataSourcefromTypeOrmModule.forRoot(base imports),IMESSAGE_BROKERfrom the@GlobalBrokerModule,ClickHouseServiceprovided+exported byProcessingModule.ClickHouseHealthIndicatoris constructed in all modes, butClickHouseServicehas no constructor deps (connection is deferred toonModuleInit, a pre-existing side effect sinceProcessingModulewas already in base imports) — no boot crash. - HTTP boot:
NestFactory.create(AppModule.forRoot())for all modes → full HTTP app; single unifiedapp.listen()gated byshouldServeHttp(); no double-listen; legacy workers stay port-less (temporal-worker behavior preserved per scope note). - AC5:
HealthModulein base imports → bare/health+/readyexposed identically across the 7 serving modes;@SkipResponseTransformkeeps the body un-wrapped (e2e proves the bypass is selective). - Broker readiness:
resolveExchangetargets the same exchange producers/consumers use;checkExchangeruns on a throwaway channel (never tears down the live consumer); no cold-start flap (subscribes run inapp.init()beforeapp.listen(), and consumers self-assert the exchange/topic on subscribe). - AC4: ClickHouse indicator added to the active set only when
runMode === EVENT_PROCESS.
Non-blocking findings (follow-ups, not gating this merge)
- (medium)
campaign-packerdoesn't gatecampaigns.controlin/ready. The packer consumes two topics (campaigns.pack+campaigns.control, registered incampaign-packer.module.ts:21-22, subscribed incampaigns-control.consumer.ts:39-40), buthealth-topics.ts:21-22lists only[campaigns.pack]. This violates the file's own "list only consumed topics" rule and is inconsistent with the sender (which correctly includescontrol);health-topics.spec.ts:11codifies the wrong assumption. Happy path passes (consumer self-creates the object on subscribe), but ifcampaigns.control's broker-side object disappears under a running packer, pause/stop signals are silently lost while/readystill returns 200. Trivial fix: addCAMPAIGNS_CONTROL_TOPICto the packer case + fix the spec. - (medium/low)
apimode/healthcontract changed + stale Dockerfile probe. Route movedapi/v1/health(enveloped,{status:'healthy',timestamp}) → bare/health({status:'ok'}, raw).Dockerfile:83still curls…/api/v1/health→ now 404, masked by|| exit 0(container stays healthy, but the api-mode liveness probe is now a no-op). Suggest updating the Dockerfile to bare/health+ a release note. (Pre-existing port mismatch too: Dockerfile${PORT:-3005}vs main.ts3000vs .env.example3334.) - (medium) Test + smoke gaps. The two riskiest pieces — the AC4 mode-resolution factory and
shouldServeHttp()— have no tests; the e2e injectsACTIVE_INDICATORSas a fixed array, bypassing both the factory and the worker-mode listener boot. CI is Sourcery-only (no jest/tsc), so the green results are self-reported, and the real-service smoke wasn't run. Recommendnpm test+ a smoke against ≥1 worker mode. - (low)
/readydiscardsdetail/error.BrokerHealthIndicatorcomputesdetail:{missingTopics}+error, but the controller only surfacesname/status. AC2 is met (indicator is named), but the debug-useful enrichment is dropped — cheap to surface. - (low/info) Redis gate is over-strict for modes that don't use Redis (but matches the card spec);
allSettledcomment vsPromise.all+catch(equivalent); extra dedicated Redis client per pod (intentional, hardened).
Solid, well-structured work — thanks for the thorough specs and the clear PR write-up of the deltas vs. the ticket.
Summary
Standardized Kubernetes/Cloud Run liveness + readiness probes across the new pipeline runner modes so the platform can auto-restart a hung process and hold traffic until dependencies are reachable (FR37, NFR18).
HealthModule(src/health/) exposing two unauthenticated, un-prefixed endpoints:GET /health— liveness, 200 whenever the process is alive, no dependency checks.GET /ready— readiness, 200 only after every dependency relevant to the mode responds; otherwise 503 with{ status, failing, checks }naming the failing indicator.SELECT 1, RedisPING, broker (connection + topic existence), ClickHouseSELECT 1(event-processonly). Each wraps its call inwithTimeoutand never rejects; the controller aggregates withallSettled.IMessageBroker.healthCheck(expectedTopics)implemented by both adapters — Kafka viaadmin.listTopics(), RabbitMQ viacheckExchangeon a throwaway channel (never the live consumer channel; the real queue is${mode}-${topic}, andcampaigns.*are exchanges).AppFactory.shouldServeHttp();/health+/readyare excluded from theapi/v1prefix;@SkipResponseTransform()keeps the probe body un-wrapped so the contract is byte-identical whether or not the global response interceptor is registered.GET /healthfromAppController.Scope notes (as-built vs ticket)
developa 5th pipeline runner exists —campaign-tracker— so it was included for probe consistency (same worker-without-HTTP pattern). Health-served modes:campaign-packer,campaign-sender,campaign-tracker,event-receiver,event-process(+single/api).campaign-senderis already a real, wired module ondevelop(the ticket's "un-stub" step is obsolete —STUB_RUN_MODESis empty).campaigns.pack, sender→campaigns.send+campaigns.control, tracker→campaigns.tracked, receiver/process→connection-only.api/temporal-workerhealth, k8s manifests.Validation
evo-flow: npx tsc -b --noEmit→ clean (pre-existing legacymain.ts/interceptor debt untouched)evo-flow: npx eslint <new files>→ cleanevo-flow: npx jest src/health+ adapter health specs → 33/33evo-flow: existing kafka/rabbitmq adapter specs→ 53/53 (interface change didn't break mocks)evo-flow: src/runners specs→ 231/231RUN_MODE=… npm start+curl /health/ready, Redis-down → 503failing:['redis'],event-processincludes clickhouse) — pending reviewer/local stack.Changed / new files
src/health/**(module, controller, 4 indicators,health-topics.ts,with-timeout.ts, specs),src/common/decorators/skip-response-transform.decorator.ts, adapter health specs.src/main.ts,src/app-factory.ts,src/app.module.ts,src/app.controller.ts,src/common/interceptors/response-transform.interceptor.ts,IMessageBroker+ Kafka/RabbitMQ adapters.Linked Issue
Summary by Sourcery
Introduce a unified health module providing standardized liveness and readiness endpoints and broker health checks across all relevant run modes.
New Features:
Enhancements:
Tests: