Skip to content

feat(evo-flow): pause/stop fast-path via campaigns.control consumers (EVO-1222)#59

Merged
dpaes merged 2 commits into
developfrom
feat/EVO-1222
Jun 12, 2026
Merged

feat(evo-flow): pause/stop fast-path via campaigns.control consumers (EVO-1222)#59
dpaes merged 2 commits into
developfrom
feat/EVO-1222

Conversation

@nickoliveira23

Copy link
Copy Markdown

Summary

Story 4.8 — the broker fast-path half of the hybrid pause/stop design. The REST pause/stop/resume endpoints already wrote the authoritative Campaign.status and the sender already rechecked it (4.3), but propagation waited out the 5s status-cache TTL. This publishes campaigns.control so the change is honored in <1s, with the Postgres flag remaining the source of truth.

  • CampaignsService.pause/resume/stop (+ bulk) publish campaigns.control { campaignId, action, correlationId } after the status save. Best-effort publish: a broker outage never fails the transition (the sender honors the flag at its next TTL recheck) — so it cannot trip the controller's workflow compensation.
  • CampaignsControlConsumer (sender) drops the campaign's cached status → next dispatch recheck re-reads Postgres immediately.
  • CampaignsControlConsumer (packer) aborts an in-flight pagination on pause/stop, clears on resume (best-effort; the sender guard is authoritative).
  • Kafka adapter provisions campaigns.control single-partition (ordered pause/resume) + 24h retention.

Scope (Option A, approved): the REST endpoints already exist and are wired to the legacy Temporal workflow. This story is purely additive — it layers the broker-native fast-path without touching the Temporal wiring. AC2 (authoritative-flag fallback) was already satisfied by the existing sender recheck.

Security

  • No new endpoint / auth surface (Bearer guard unchanged). Control payload carries no PII. correlationId flows from the request CLS (minted if absent).

Test plan

  • evo-flow: npm run typecheck → clean
  • evo-flow: npx jest campaigns-control.consumer campaigns.service.spec campaign-packer.service.spec campaign-sender.service.spec kafka-broker.adapter → 68 passed
  • evo-flow: npx eslint <changed sources> → clean (no new findings)

Changed Files

  • src/modules/campaigns/services/campaigns.service.ts (+spec) — publish on transition (best-effort)
  • src/runners/campaign-sender/{consumers/campaigns-control.consumer.ts (+spec),services/campaign-sender.service.ts,campaign-sender.module.ts} — cache invalidation
  • src/runners/campaign-packer/{consumers/campaigns-control.consumer.ts (+spec),services/campaign-packer.service.ts (+spec),campaign-packer.module.ts} — pagination abort
  • src/shared/broker/adapters/kafka-broker.adapter.ts — per-topic partition/retention override

AC re-check (5/5)

  • AC1 pause <30s + status Paused → publish + sender cache invalidation + recheck. ✓
  • AC2 control lost → sender rechecks Postgres at TTL → still aborts (pre-existing). ✓
  • AC3 resume → Sending + publish + tabular idempotency (only PENDING). ✓
  • AC4 stop → Stopped + publish + resume blocked (only from PAUSED). ✓
  • AC5 {campaignId, action, correlationId} validated by both consumers. ✓

Self-review (code review pass)

  • HIGH fixed: the fast-path publish was awaited+propagated — a broker outage would fail the already-committed transition and trip the controller's compensation (Postgres=PAUSED + workflow resumed + 500). Now best-effort (logged, never throws). Regression test added.
  • Test added: packer abort break-loop (publishes page 1, control lands, page 2 onward skipped).

Known limitations (by design — degrade gracefully to the authoritative flag)

  • Fan-out: if senders share a work-queue consumer group, only one replica invalidates its cache on a control message; the others honor the change via the 5s TTL recheck (within NFR5's 30s). Same for the packer abort (only the instance that both paginates and consumes the control fires it).
  • Partition/retention override applies at topic creation only — on a cluster where campaigns.control already exists (auto-created at 12 partitions), the 1-partition override does not retroactively apply (and Kafka cannot reduce partitions). Ordering is non-critical here (the authoritative recheck guarantees correctness).
  • Pre-existing on develop (not introduced here): campaigns.controller.spec.ts has one red test (stop expects a 2-arg signature) and campaigns.service.ts has 4 pre-existing eslint findings in create()/findAll(). The crm/evo-flow CI does not gate jest/eslint.

Linked Issue

  • EVO-1222

🤖 Generated with Claude Code

…(EVO-1222)

Story 4.8. The pause/stop/resume REST endpoints already wrote the authoritative
Campaign.status flag and the sender already rechecked it (4.3), but propagation
waited out the 5s status-cache TTL. This adds the broker fast-path so the change
is honored in <1s, keeping the Postgres flag as the source of truth.

- CampaignsService.pause/resume/stop (and bulk) publish campaigns.control
  { campaignId, action, correlationId } after the status save. The publish is
  best-effort: a broker outage never fails the transition (the sender honors the
  flag at the next TTL recheck), so it cannot trip the controller's workflow
  compensation.
- CampaignsControlConsumer on the sender drops the campaign's cached status so
  the next dispatch recheck re-reads Postgres immediately.
- CampaignsControlConsumer on the packer aborts an in-flight pagination on
  pause/stop and clears the flag on resume (best-effort; the sender guard is
  authoritative).
- Kafka adapter provisions campaigns.control single-partition (ordered
  pause/resume) with 24h retention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @nickoliveira23, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@dpaes dpaes left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed — requesting changes 🔴

The five ACs are logically delivered and the best-effort publish (the HIGH self-fix) checks out — a broker outage can't fail the committed transition or trip the controller compensation. But there's one HIGH finding that silently defeats the story's headline deliverable (the <1s fast-path), independently verified against the code at the head ref.

HIGH — campaigns.control publisher emits a correlationId its own consumers reject.

  • publishControl (campaigns.service.ts:153-157) publishes correlationId = correlation.resolveIncoming(correlation.getCorrelationId()). resolveIncoming preserves any inbound value matching SAFE_CORRELATION_ID = /^[A-Za-z0-9._:-]{1,128}$/ verbatim — i.e. non-v4 tokens (abc-123, uuid v1/v7) pass through. RequestContextMiddleware seeds the CLS correlationId from the inbound X-Correlation-Id via the same resolveIncoming, so an upstream non-UUID id propagates straight into the payload.
  • The contract validates correlationId: z.uuidv4() (.strict). Both consumers (packer and sender) safeParse against it; a non-v4 id fails validation → nack(requeue=false) + an "Invalid payload" warning — on a message the system itself produced. contracts.spec.ts:134-143 explicitly asserts the guard rejects a non-UUID and a non-v4 UUID.
  • Net: whenever the request carries a non-UUID X-Correlation-Id, the control message is dropped on both consumers and the <1s fast-path silently no-ops, degrading to the 5s TTL fallback. Not data loss (the authoritative Postgres flag still pauses within NFR5), but it's the exact value this story exists to deliver. The new specs mask it — campaigns.service.spec mocks resolveIncoming to a hardcoded valid UUID and both consumer specs use a literal v4.
  • This is the same SAFE_CORRELATION_ID vs z.uuidv4() drift already seen on EVO-1209 — campaigns.pack avoids it by minting a fresh uuid4() in the workflow; #59 is the first producer to feed the request CLS id into a z.uuidv4 contract.

Fix: either mint a fresh UUID for the control event's correlationId (as campaigns.pack does) and keep the request id in a separate field/log, or relax the contract to z.string().regex(SAFE_CORRELATION_ID) consistently across the sibling pack/send/tracked contracts. Add a regression spec that drives a non-UUID correlationId end-to-end (publish → guard) instead of a hardcoded UUID.

Non-blocking (verified, not introduced here): the pre-existing red campaigns.controller.spec.ts (line 98, 2-arg stop expectation) and the create()/findAll() eslint findings are not in this PR's diff — your "pre-existing" call holds. Separately, SENDING_TESTAB campaigns aren't pausable via the pre-existing state machine — worth a product call on whether A/B-test runs should support pause/stop.

Leaving the PR open for the correlationId fix.

…ionId (EVO-1222)

The control event carried `resolveIncoming(getCorrelationId())`, which preserves
non-UUIDv4 request tokens (SAFE_CORRELATION_ID is looser than v4). The contract
is `z.uuidv4()` strict, so both consumers nack(requeue=false) a message the
system itself produced — the <1s fast-path silently degraded to the 5s TTL
fallback whenever the inbound X-Correlation-Id was non-v4.

Mint a fresh UUID v4 per control event instead, matching the campaigns.pack
producer (correlation ids are producer-minted across the pipeline). Drop the
now-unused CorrelationContext dependency. The service spec no longer mocks id
resolution; it validates the published payload through isCampaignsControlContract
(the exact check both consumers run) plus an explicit v4-format regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@dpaes dpaes left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — fix verified, approved ✅

806926a resolves the HIGH from the previous round. Verified against the head ref:

  • Root-cause fix is correct. publishControl now sets correlationId: randomUUID() (Node crypto.randomUUID() → UUID v4), so the producer no longer feeds the request CLS id into the z.uuidv4() contract. Matches the campaigns.pack producer (producer-minted ids across the pipeline), so a non-v4 upstream X-Correlation-Id can no longer be rejected by the packer/sender consumers. Contract and consumers are unchanged — correctly, they're the pipeline standard.
  • Clean removal. The now-unused CorrelationContext dependency is gone from the import and constructor; no orphan this.correlation reference remains and no other call site constructs CampaignsService with the old 3-arg signature.
  • Specs un-masked — real guard now. The service spec drops the correlation mock and asserts each published payload passes isCampaignsControlContract(payload) (the exact validator both consumers run), plus an explicit mints a fresh uuid v4 correlationId regression test. A revert to resolveIncoming would now turn these red.
  • Best-effort publish, state machine, and tabular idempotency are unchanged from the prior round (already verified).

Non-blocking note: the control event no longer carries the originating request's correlation id (the strict contract has no field for it, and this mirrors campaigns.pack). If request→control tracing is wanted later, that's a transport-header/log concern, not this payload.

The previously-flagged pre-existing items (red campaigns.controller.spec.ts, create()/findAll() eslint, SENDING_TESTAB non-pausable) remain out of scope as agreed.

@dpaes dpaes merged commit 350cc28 into develop Jun 12, 2026
6 checks passed
@dpaes dpaes deleted the feat/EVO-1222 branch June 12, 2026 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants