Skip to content

feat(broker): redelivery backstop with delivery-limit + DLQ/DLT (EVO-1677)#44

Merged
dpaes merged 1 commit into
developfrom
feat/EVO-1677
Jun 10, 2026
Merged

feat(broker): redelivery backstop with delivery-limit + DLQ/DLT (EVO-1677)#44
dpaes merged 1 commit into
developfrom
feat/EVO-1677

Conversation

@nickoliveira23

Copy link
Copy Markdown

Summary

Adds a broker-level redelivery ceiling (EVO-1677) so a poison message can never block a queue/partition indefinitely — independent of whether the application classified the error as terminal (that is EVO-1676's typed TerminalError). Implemented uniformly across both adapters via a header-based attempt counter; no RabbitMQ quorum-queue migration.

On nack(msg, requeue=true):

  • Under the limit → republish to the same topic/queue with an incremented x-delivery-attempt (retry goes to the tail, not the head), then ack/commit the original.
  • At the limit → route to <name>.dlq + increment evo_broker_dead_lettered_total, then ack/commit the original.
  • RabbitMQ: sendToQueue on the existing classic durable queues; <queue>.dlq asserted lazily.
  • Kafka: republish via the producer + commit past the original (an in-place seek would rewind and block the partition); <topic>.dlq.

nack(msg, requeue=false) (explicit terminal drop) is unchanged.

Security

  • Broker-layer, no auth surface. Republish preserves at-least-once; a duplicate under partial channel failure is covered by downstream idempotency (EVO-1204/1211).

Test plan

  • evo-flow: npm run typecheck — clean
  • evo-flow: npx jest src/shared/broker — 155 passed, 21 skipped (integration specs gated by *_INTEGRATION=1)
  • evo-flow: npx eslint <changed .ts> — clean

Covers ACs: RabbitMQ N-failures dead-letter to <queue>.dlq (AC1); Kafka N-failures route to <topic>.dlq (AC2); both increment evo_broker_dead_lettered_total (AC3); metric exposed on /metrics + alert rule documented (AC4, alerting is external/5.3); design documented in REDELIVERY.md (AC5).

Review notes

  • Verified the worst case (dead-letter loop): the Kafka pattern RegExp ^events\.received\.[^.]+$ (single segment) does NOT match events.received.<platform>.dlq (two segments), and the RabbitMQ DLQ is reached via sendToQueue (bypasses the exchange/binding) — so a DLQ message is never re-consumed.
  • Fixed during review: declaredDlqs is now cleared on reconnect alongside declaredExchanges, so a post-reconnect dead-letter re-asserts the DLQ queue instead of silently dropping.

Scope / boundary

  • The broker DLQ (<name>.dlq) is intentionally separate from the app-level events.failed (EVO-1214); routing one into the other is a future decision.
  • No automatic DLQ reprocessing consumer / replay UI (manual in MVP, per the card). True DLQ-depth gauge needs broker-admin polling and is deferred; the dead_lettered_total counter is the alert signal.

Changed Files

  • src/shared/broker/redelivery.ts (new) + redelivery.spec.ts (new)
  • src/shared/broker/metrics/broker-metrics.ts
  • src/shared/broker/adapters/rabbitmq-broker.adapter.ts (+ spec)
  • src/shared/broker/adapters/kafka-broker.adapter.ts (+ spec)
  • src/shared/broker/REDELIVERY.md (new)

Linked Issue

  • EVO-1677

…1677)

Adds a broker-level ceiling on redeliveries so a poison message can never block
a queue/partition indefinitely, regardless of whether the application classified
the error (EVO-1676 is the orthogonal typed classification). Uniform across both
adapters via a header-based attempt counter — no RabbitMQ quorum-queue migration.

- redelivery.ts: x-delivery-attempt header, dlqNameFor (<name>.dlq), BROKER_DELIVERY_LIMIT (default 3)
- nack(requeue=true): republish with incremented attempt under the limit; route
  to <name>.dlq and increment evo_broker_dead_lettered_total at the limit
  - RabbitMQ: sendToQueue on the existing classic durable queues; <queue>.dlq asserted lazily
  - Kafka: republish via producer + commit past original (seek would block the partition); <topic>.dlq
- broker-metrics: evo_broker_dead_lettered_total{broker,topic}
- declaredDlqs cleared on reconnect alongside declaredExchanges
- REDELIVERY.md documents the limit, DLQ naming, alert rule and replay policy
- specs: under-limit republish, at-limit dead-letter (+metric) for both brokers; redelivery helper unit

nack(requeue=false) terminal drop is unchanged. The broker DLQ is intentionally
separate from the app-level events.failed (EVO-1214); integration is a future decision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @nickoliveira23, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@dpaes dpaes left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved — EVO-1677

Backstop is correct end-to-end. AC1 (RabbitMQ republish with x-delivery-attempt++, then <queue>.dlq at the limit), AC2 (Kafka: seek replaced by republish + producer.send(<topic>.dlq) awaited before commitPastOriginal → no partition rewind, no loss), AC3 (evo_broker_dead_lettered_total{broker,topic} counter exposed on /metrics), and AC5 (REDELIVERY.md: limit default 3 validated >=1, <name>.dlq naming + x-dlq-reason/x-original-topic, manual replay) all verified. Loop-prevention confirmed: the 3-segment Kafka subscribe pattern can't match the 4-segment .dlq topic; the RabbitMQ .dlq is asserted but never consumed.

AC4 (alert on DLQ depth) — accepted as deferred. The metric is a Counter (dead-lettered rate — the practical alert signal), not a depth gauge (broker-admin polling, explicitly out of scope). The alert rule itself belongs in [5.3] EVO-1224 (external Prometheus/Alertmanager). Approving on the team convention that alerting config lives externally — landing the DLQ alert in 5.3 is at your discretion as a follow-up.

Accepted-LOW: the RabbitMQ DLQ leg uses non-confirm sendToQueue then ack(original) → a narrow broker-failure window could lose instead of dead-letter (same fire-and-forget property as the existing publish; the Kafka path is safe — send is awaited + idempotent). A confirm-channel on the DLQ leg would close it — your call.

Merging to develop.

@dpaes dpaes merged commit f34921f into develop Jun 10, 2026
4 checks passed
@dpaes dpaes deleted the feat/EVO-1677 branch June 10, 2026 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants