feat(health): /health and /ready endpoints per RUN_MODE (EVO-1226)#73

Merged
dpaes merged 3 commits into develop from danilocarneiro/evo-1226-51-implementar-health-e-readiness-endpoints-por-modo on Jun 18, 2026

@daniloleonecarneiro daniloleonecarneiro commented Jun 18, 2026

Summary

Standardized Kubernetes/Cloud Run liveness + readiness probes across the new pipeline runner modes so the platform can auto-restart a hung process and hold traffic until dependencies are reachable (FR37, NFR18).

  • HealthModule (src/health/) exposing two unauthenticated, un-prefixed endpoints:
    • GET /health — liveness, 200 whenever the process is alive, no dependency checks.
    • GET /ready — readiness, 200 only after every dependency relevant to the mode responds; otherwise 503 with { status, failing, checks } naming the failing indicator.
  • Composite indicators: Postgres SELECT 1, Redis PING, broker (connection + topic existence), ClickHouse SELECT 1 (event-process only). Each wraps its call in withTimeout and never rejects; the controller aggregates with allSettled.
  • Broker contract: new IMessageBroker.healthCheck(expectedTopics) implemented by both adapters — Kafka via admin.listTopics(), RabbitMQ via checkExchange on a throwaway channel (never the live consumer channel; the real queue is ${mode}-${topic}, and campaigns.* are exchanges).
  • Reachability: worker modes now open an HTTP listener via AppFactory.shouldServeHttp(); /health + /ready are excluded from the api/v1 prefix; @SkipResponseTransform() keeps the probe body un-wrapped so the contract is byte-identical whether or not the global response interceptor is registered.
  • Removed the now-duplicate GET /health from AppController.
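
The timeout-wrapped, never-rejecting indicators and the aggregated /ready body described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/health/: the `makeIndicator` and `readiness` helpers, the `IndicatorResult` fields, the 1000 ms default, and the `"ok"`/`"error"` status strings are all assumptions made for the example.

```typescript
// Hypothetical shapes: names echo the PR description, bodies are assumptions.
interface IndicatorResult {
  name: string;
  status: "up" | "down";
  error?: string;
}

interface HealthIndicator {
  name: string;
  check(): Promise<IndicatorResult>;
}

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Wraps a raw probe (e.g. SELECT 1 or PING) so check() always resolves.
function makeIndicator(name: string, probe: () => Promise<unknown>, ms = 1000): HealthIndicator {
  return {
    name,
    async check(): Promise<IndicatorResult> {
      try {
        await withTimeout(probe(), ms, name);
        return { name, status: "up" };
      } catch (err) {
        return { name, status: "down", error: err instanceof Error ? err.message : String(err) };
      }
    },
  };
}

interface ReadinessBody {
  status: "ok" | "error"; // assumed status strings
  failing: string[];
  checks: IndicatorResult[];
}

// Aggregates every indicator; a rejected check is still reported as "down".
async function readiness(
  indicators: HealthIndicator[],
): Promise<{ httpStatus: 200 | 503; body: ReadinessBody }> {
  const settled = await Promise.allSettled(indicators.map((i) => i.check()));
  const checks = settled.map((s, idx): IndicatorResult =>
    s.status === "fulfilled"
      ? s.value
      : { name: indicators[idx].name, status: "down", error: String(s.reason) },
  );
  const failing = checks.filter((c) => c.status === "down").map((c) => c.name);
  const healthy = failing.length === 0;
  return {
    httpStatus: healthy ? 200 : 503,
    body: { status: healthy ? "ok" : "error", failing, checks },
  };
}
```

Because each indicator already converts failures into a "down" result, the allSettled branch for rejections acts only as a safety net for an indicator that breaks the never-reject rule.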

Scope notes (as-built vs ticket)

  • The ticket named 4 modes; on current develop a 5th pipeline runner exists — campaign-tracker — so it was included for probe consistency (same worker-without-HTTP pattern). Health-served modes: campaign-packer, campaign-sender, campaign-tracker, event-receiver, event-process (+ single/api).
  • campaign-sender is already a real, wired module on develop (the ticket's "un-stub" step is obsolete — STUB_RUN_MODES is empty).
  • Per-mode broker topic checks gate only on consumed topics (a published topic auto-creates on first publish): packer→campaigns.pack, sender→campaigns.send+campaigns.control, tracker→campaigns.tracked, receiver/process→connection-only.
  • Out of scope (unchanged): alerting (5.3), legacy api/temporal-worker health, k8s manifests.
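
The per-mode topic gating in the third note can be pictured as a small lookup. The sketch below mirrors the as-built mapping listed there; the `PipelineMode` union and the exact signature of `expectedBrokerTopics` are assumptions for illustration and may differ from health-topics.ts.

```typescript
// Mode names as listed in this PR; the union type itself is hypothetical.
type PipelineMode =
  | "campaign-packer"
  | "campaign-sender"
  | "campaign-tracker"
  | "event-receiver"
  | "event-process";

// Returns only the topics a mode consumes; an empty list means a
// connection-only broker check.
function expectedBrokerTopics(mode: PipelineMode): string[] {
  switch (mode) {
    case "campaign-packer":
      return ["campaigns.pack"];
    case "campaign-sender":
      return ["campaigns.send", "campaigns.control"];
    case "campaign-tracker":
      return ["campaigns.tracked"];
    case "event-receiver":
    case "event-process":
      return [];
  }
}
```

Keeping published-only topics out of the list avoids a readiness failure for a topic that is created on first publish.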

Validation

  • evo-flow: npx tsc -b --noEmit → clean (pre-existing legacy main.ts/interceptor debt untouched)
  • evo-flow: npx eslint <new files> → clean
  • evo-flow: npx jest src/health + adapter health specs → 33/33
  • evo-flow: existing kafka/rabbitmq adapter specs → 53/53 (interface change didn't break mocks)
  • evo-flow: src/runners specs → 231/231
  • Manual smoke (RUN_MODE=… npm start + curl /health /ready, Redis-down → 503 failing:['redis'], event-process includes clickhouse) — pending reviewer/local stack.

Changed / new files

  • New: src/health/** (module, controller, 4 indicators, health-topics.ts, with-timeout.ts, specs), src/common/decorators/skip-response-transform.decorator.ts, adapter health specs.
  • Edited: src/main.ts, src/app-factory.ts, src/app.module.ts, src/app.controller.ts, src/common/interceptors/response-transform.interceptor.ts, IMessageBroker + Kafka/RabbitMQ adapters.

Linked Issue

  • EVO-1226

Summary by Sourcery

Introduce a unified health module providing standardized liveness and readiness endpoints and broker health checks across all relevant run modes.

New Features:

  • Add HealthModule with /health liveness and /ready readiness endpoints shared across all applicable RUN_MODEs.
  • Introduce pluggable health indicators for Postgres, Redis, broker, and ClickHouse (event-process only), with a common contract and timeout handling.
  • Expose broker-level healthCheck on IMessageBroker and implement it for Kafka and RabbitMQ adapters, including per-mode expected topic checks.
  • Add a SkipResponseTransform decorator to allow selected endpoints (health probes) to bypass the global response transformation.

Enhancements:

  • Extend AppFactory and bootstrap logic so both full API modes and pipeline worker modes can open an HTTP listener for health/readiness probes while keeping legacy workers unchanged.
  • Update global ResponseTransformInterceptor to be reflector-aware and honor endpoints marked to skip response wrapping.
  • Remove the legacy /health endpoint from AppController in favor of the centralized HealthController implementation.

Tests:

  • Add unit tests for the health controller, health indicators, broker topic mapping, timeout helper, and broker adapter healthCheck implementations.

Danilo Leone added 2 commits June 18, 2026 13:58
…ters (EVO-1226)

Readiness needs to verify broker liveness plus existence of the topics a mode
consumes. Add `healthCheck(expectedTopics)` to the broker contract:

- Kafka: confirm producer/admin active, then filter `admin.listTopics()`.
- RabbitMQ: probe the resolved EXCHANGE via `checkExchange` on a throwaway
  channel (a failed check closes the channel, so it must never run on the live
  consumer channel; bare queue names don't exist — the queue is `${mode}-${topic}`).

Empty `expectedTopics` => connection-only check.

Standardized Kubernetes/Cloud Run probes across the pipeline runner modes
(campaign-packer/sender/tracker, event-receiver, event-process):

- HealthModule with GET /health (liveness, no dependency checks) and GET /ready
  (200 only when every mode-relevant dependency answers, else 503 naming the
  failing indicator).
- Composite indicators: Postgres SELECT 1, Redis PING, broker connection+topics,
  ClickHouse SELECT 1 (event-process only). Each wraps its call in withTimeout
  and never rejects; the controller aggregates with allSettled.
- Worker modes now open an HTTP listener via AppFactory.shouldServeHttp() so the
  probes are reachable; /health and /ready are excluded from the api/v1 prefix.
- @SkipResponseTransform keeps the probe body un-wrapped so the contract is
  identical whether or not the global response interceptor is registered.
- Remove the now-duplicate GET /health from AppController.

sourcery-ai Bot commented Jun 18, 2026

Reviewer's Guide

Adds a new HealthModule with /health and /ready endpoints wired per RUN_MODE, including dependency health indicators (Postgres, Redis, broker, ClickHouse), a broker healthCheck contract for Kafka/RabbitMQ, unified HTTP listener logic for worker modes, and a SkipResponseTransform mechanism so probe responses bypass the global response wrapper.

Sequence diagram for /ready readiness probe flow

sequenceDiagram
  actor Probe
  participant HttpServer
  participant HealthController
  participant BrokerHealthIndicator
  participant IMessageBroker

  Probe->>HttpServer: GET /ready
  HttpServer->>HealthController: readiness(response)
  HealthController->>BrokerHealthIndicator: check()
  BrokerHealthIndicator->>IMessageBroker: healthCheck(expectedTopics)
  IMessageBroker-->>BrokerHealthIndicator: BrokerHealth
  BrokerHealthIndicator-->>HealthController: IndicatorResult
  HealthController-->>Probe: 200 or 503 with {status,failing,checks}
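
The healthCheck(expectedTopics) step in the diagram could look roughly like the Kafka-style sketch below, which filters the result of a listTopics() call and degrades to connected:false on error instead of throwing. The `TopicAdmin` interface is a stand-in for the real admin client, and the `BrokerHealth` fields shown here are assumptions, since the PR text does not spell out the type.

```typescript
// Assumed result shape; the real BrokerHealth type may carry different fields.
interface BrokerHealth {
  connected: boolean;
  missingTopics: string[];
  error?: string;
}

// Minimal stand-in for an admin client exposing listTopics().
interface TopicAdmin {
  listTopics(): Promise<string[]>;
}

// Never throws: a metadata failure is reported as connected:false.
async function kafkaStyleHealthCheck(
  admin: TopicAdmin,
  expectedTopics: string[],
): Promise<BrokerHealth> {
  try {
    // Assumption: the metadata call doubles as the connectivity check,
    // so it runs even when expectedTopics is empty.
    const existing = new Set(await admin.listTopics());
    const missingTopics = expectedTopics.filter((t) => !existing.has(t));
    return { connected: true, missingTopics };
  } catch (err) {
    return {
      connected: false,
      missingTopics: expectedTopics,
      error: err instanceof Error ? err.message : String(err),
    };
  }
}
```

A BrokerHealthIndicator built on this would presumably report "down" when `connected` is false or `missingTopics` is non-empty.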

File-Level Changes

Change: Introduce centralized HealthModule with liveness/readiness endpoints and pluggable indicators used across run modes.
Details:
  • Add HealthModule that imports ProcessingModule, registers HealthController, and wires Postgres/Redis/Broker/ClickHouse indicators via ACTIVE_INDICATORS based on RunMode.
  • Implement HealthController exposing GET /health (no checks) and GET /ready aggregating indicator.check() results into { status, checks, failing } with 503 on any down indicator.
  • Define HealthIndicator contract, ACTIVE_INDICATORS DI token, withTimeout helper, and per-dependency indicators for Postgres, Redis, broker (using expectedBrokerTopics), and ClickHouse, plus corresponding unit tests.
Files:
  • src/health/health.module.ts
  • src/health/health.controller.ts
  • src/health/indicators/health-indicator.interface.ts
  • src/health/indicators/postgres.health-indicator.ts
  • src/health/indicators/redis.health-indicator.ts
  • src/health/indicators/broker.health-indicator.ts
  • src/health/indicators/clickhouse.health-indicator.ts
  • src/health/with-timeout.ts
  • src/health/health-topics.ts
  • src/health/**/*.spec.ts

Change: Extend broker interface and adapters with a non-throwing healthCheck used by the broker readiness indicator.
Details:
  • Add BrokerHealth type and healthCheck(expectedTopics) method to IMessageBroker with documentation that it must never throw.
  • Implement KafkaBrokerAdapter.healthCheck using admin.listTopics() to detect missing topics and degrade to connected:false on metadata fetch errors.
  • Implement RabbitMQBrokerAdapter.healthCheck that verifies per-topic exchanges using a throwaway channel and safely handles missing exchanges and channel errors.
  • Add adapter health specs to validate the new healthCheck behavior.
Files:
  • src/shared/broker/interfaces/message-broker.interface.ts
  • src/shared/broker/adapters/kafka-broker.adapter.ts
  • src/shared/broker/adapters/rabbitmq-broker.adapter.ts
  • src/shared/broker/adapters/kafka-broker.adapter.health.spec.ts
  • src/shared/broker/adapters/rabbitmq-broker.adapter.health.spec.ts

Change: Unify HTTP server startup across run modes and expose health endpoints without the API prefix while preserving raw responses.
Details:
  • Add AppFactory.shouldServeHttp() to decide which RUN_MODEs open an HTTP listener (API, SINGLE, EVENT_RECEIVER, CAMPAIGN_* and EVENT_PROCESS).
  • Move app.listen() into a unified listener block gated by shouldServeHttp(), reusing the same app for worker modes and only logging Swagger URL for full API modes via shouldStartHttpServer().
  • Exclude GET /health and /ready paths from the global api/v1 prefix in main.ts so probes are un-prefixed.
  • Remove legacy /health handler from AppController now that HealthController owns the route.
  • Wire HealthModule into AppModule so all modes get the health endpoints.
Files:
  • src/app-factory.ts
  • src/main.ts
  • src/app.controller.ts
  • src/app.module.ts

Change: Allow certain endpoints (notably health probes) to bypass the global response transformation.
Details:
  • Add SKIP_RESPONSE_TRANSFORM metadata key and @SkipResponseTransform decorator to mark controllers/handlers that should return raw bodies.
  • Update ResponseTransformInterceptor to accept an optional Reflector, read SKIP_RESPONSE_TRANSFORM metadata, and short-circuit transformation when set.
  • Change main.ts to construct ResponseTransformInterceptor with app.get(Reflector).
  • Annotate HealthController with @SkipResponseTransform so /health and /ready responses are unwrapped regardless of interceptor registration.
Files:
  • src/common/decorators/skip-response-transform.decorator.ts
  • src/common/interceptors/response-transform.interceptor.ts
  • src/main.ts
  • src/health/health.controller.ts

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • In HealthController.readiness the implementation wraps each check() in an individual try/catch while the comment still talks about allSettled; consider either actually using Promise.allSettled or updating the comment to match the current pattern to avoid confusion for future maintainers.
  • Both HTTP gating methods in AppFactory now hard-code slightly different RunMode lists (shouldStartHttpServer vs shouldServeHttp); consider centralizing these mode sets or deriving one from the other so future mode additions don’t accidentally drift between the two.
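
One way to act on the second point is to define each mode set once and derive the wider one from the narrower. The sketch below is only an illustration of that idea: the `RunMode` string values, the assumption that "full API modes" means api and single, and the free-function form (rather than AppFactory methods) are all hypothetical.

```typescript
// Hypothetical string values for RunMode; temporal-worker stands in for a
// legacy worker that stays port-less.
type RunMode =
  | "api"
  | "single"
  | "campaign-packer"
  | "campaign-sender"
  | "campaign-tracker"
  | "event-receiver"
  | "event-process"
  | "temporal-worker";

// Assumed: modes that serve the full API (and log the Swagger URL).
const FULL_API_MODES: readonly RunMode[] = ["api", "single"];

// Pipeline workers that open a listener only for /health and /ready.
const PROBE_ONLY_MODES: readonly RunMode[] = [
  "campaign-packer",
  "campaign-sender",
  "campaign-tracker",
  "event-receiver",
  "event-process",
];

// Derived, so a new full-API mode automatically serves HTTP as well.
const HTTP_SERVING_MODES: readonly RunMode[] = [...FULL_API_MODES, ...PROBE_ONLY_MODES];

function shouldStartHttpServer(mode: RunMode): boolean {
  return FULL_API_MODES.includes(mode);
}

function shouldServeHttp(mode: RunMode): boolean {
  return HTTP_SERVING_MODES.includes(mode);
}
```

With this layout the two predicates cannot drift apart: every mode accepted by shouldStartHttpServer is accepted by shouldServeHttp by construction.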

…bypass (EVO-1226)

Boots a real Nest HTTP app with the HealthController + the global
ResponseTransformInterceptor and asserts over supertest: /health 200 {status:ok},
/ready 200/503 with the failing indicator named, the body stays un-wrapped
(@SkipResponseTransform honored), and a normal controller is still wrapped
(the bypass is selective, not a globally-disabled interceptor).

@dpaes dpaes left a comment

✅ Approved

Reviewed by cloning the branch and running an adversarial multi-agent pass, then cross-checking the highest-risk seams by hand. All 5 ACs are satisfied at runtime and there are no blockers. Merging to develop.

What I verified (all clean)

  • DI resolves in every RUN_MODE: DataSource from TypeOrmModule.forRoot (base imports), IMESSAGE_BROKER from the @Global BrokerModule, ClickHouseService provided+exported by ProcessingModule. ClickHouseHealthIndicator is constructed in all modes, but ClickHouseService has no constructor deps (connection is deferred to onModuleInit, a pre-existing side effect since ProcessingModule was already in base imports) — no boot crash.
  • HTTP boot: NestFactory.create(AppModule.forRoot()) for all modes → full HTTP app; single unified app.listen() gated by shouldServeHttp(); no double-listen; legacy workers stay port-less (temporal-worker behavior preserved per scope note).
  • AC5: HealthModule in base imports → bare /health+/ready exposed identically across the 7 serving modes; @SkipResponseTransform keeps the body un-wrapped (e2e proves the bypass is selective).
  • Broker readiness: resolveExchange targets the same exchange producers/consumers use; checkExchange runs on a throwaway channel (never tears down the live consumer); no cold-start flap (subscribes run in app.init() before app.listen(), and consumers self-assert the exchange/topic on subscribe).
  • AC4: ClickHouse indicator added to the active set only when runMode === EVENT_PROCESS.

Non-blocking findings (follow-ups, not gating this merge)

  1. (medium) campaign-packer doesn't gate campaigns.control in /ready. The packer consumes two topics (campaigns.pack + campaigns.control, registered in campaign-packer.module.ts:21-22, subscribed in campaigns-control.consumer.ts:39-40), but health-topics.ts:21-22 lists only [campaigns.pack]. This violates the file's own "list only consumed topics" rule and is inconsistent with the sender (which correctly includes control); health-topics.spec.ts:11 codifies the wrong assumption. Happy path passes (consumer self-creates the object on subscribe), but if campaigns.control's broker-side object disappears under a running packer, pause/stop signals are silently lost while /ready still returns 200. Trivial fix: add CAMPAIGNS_CONTROL_TOPIC to the packer case + fix the spec.
  2. (medium/low) api mode /health contract changed + stale Dockerfile probe. Route moved api/v1/health (enveloped, {status:'healthy',timestamp}) → bare /health ({status:'ok'}, raw). Dockerfile:83 still curls …/api/v1/health → now 404, masked by || exit 0 (container stays healthy, but the api-mode liveness probe is now a no-op). Suggest updating the Dockerfile to bare /health + a release note. (Pre-existing port mismatch too: Dockerfile ${PORT:-3005} vs main.ts 3000 vs .env.example 3334.)
  3. (medium) Test + smoke gaps. The two riskiest pieces — the AC4 mode-resolution factory and shouldServeHttp() — have no tests; the e2e injects ACTIVE_INDICATORS as a fixed array, bypassing both the factory and the worker-mode listener boot. CI is Sourcery-only (no jest/tsc), so the green results are self-reported, and the real-service smoke wasn't run. Recommend npm test + a smoke against ≥1 worker mode.
  4. (low) /ready discards detail/error. BrokerHealthIndicator computes detail:{missingTopics} + error, but the controller only surfaces name/status. AC2 is met (indicator is named), but the debug-useful enrichment is dropped — cheap to surface.
  5. (low/info) Redis gate is over-strict for modes that don't use Redis (but matches the card spec); allSettled comment vs Promise.all+catch (equivalent); extra dedicated Redis client per pod (intentional, hardened).
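
For finding 4, surfacing the dropped enrichment could be as small as copying the optional fields through when the /ready body is built. This is a hedged sketch: `IndicatorResult`, `toCheckEntry`, `buildReadyBody`, and the status strings are illustrative names, not the controller's actual code.

```typescript
// Assumed indicator result shape, including the optional enrichment fields.
interface IndicatorResult {
  name: string;
  status: "up" | "down";
  detail?: Record<string, unknown>;
  error?: string;
}

// Copies optional fields only when present so healthy entries stay compact.
function toCheckEntry(result: IndicatorResult): IndicatorResult {
  const entry: IndicatorResult = { name: result.name, status: result.status };
  if (result.detail !== undefined) entry.detail = result.detail;
  if (result.error !== undefined) entry.error = result.error;
  return entry;
}

// Builds the { status, failing, checks } body with detail/error preserved.
function buildReadyBody(results: IndicatorResult[]) {
  const checks = results.map(toCheckEntry);
  const failing = checks.filter((c) => c.status === "down").map((c) => c.name);
  return { status: failing.length === 0 ? "ok" : "error", failing, checks };
}
```

A probe client that only reads the HTTP status is unaffected, while an operator running curl against /ready would see which topics are missing.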

Solid, well-structured work — thanks for the thorough specs and the clear PR write-up of the deltas vs. the ticket.

@dpaes dpaes merged commit 1284e03 into develop Jun 18, 2026
5 checks passed
@dpaes dpaes deleted the danilocarneiro/evo-1226-51-implementar-health-e-readiness-endpoints-por-modo branch June 18, 2026 19:47