feat(broker): redelivery backstop with delivery-limit + DLQ/DLT (EVO-1677)#44
Conversation
…1677)
Adds a broker-level ceiling on redeliveries so a poison message can never block
a queue/partition indefinitely, regardless of whether the application classified
the error (EVO-1676 is the orthogonal typed classification). Uniform across both
adapters via a header-based attempt counter — no RabbitMQ quorum-queue migration.
- redelivery.ts: x-delivery-attempt header, dlqNameFor (<name>.dlq), BROKER_DELIVERY_LIMIT (default 3)
- nack(requeue=true): republish with incremented attempt under the limit; route
to <name>.dlq and increment evo_broker_dead_lettered_total at the limit
- RabbitMQ: sendToQueue on the existing classic durable queues; <queue>.dlq asserted lazily
- Kafka: republish via producer + commit past original (seek would block the partition); <topic>.dlq
- broker-metrics: evo_broker_dead_lettered_total{broker,topic}
- declaredDlqs cleared on reconnect alongside declaredExchanges
- REDELIVERY.md documents the limit, DLQ naming, alert rule and replay policy
- specs: under-limit republish, at-limit dead-letter (+metric) for both brokers; redelivery helper unit
nack(requeue=false) terminal drop is unchanged. The broker DLQ is intentionally
separate from the app-level events.failed (EVO-1214); integration is a future decision.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Sorry @nickoliveira23, you have reached your weekly rate limit of 500000 diff characters.
Please try again later or upgrade to continue using Sourcery
dpaes
left a comment
There was a problem hiding this comment.
✅ Approved — EVO-1677
Backstop is correct end-to-end. AC1 (RabbitMQ republish with x-delivery-attempt++, then <queue>.dlq at the limit), AC2 (Kafka: seek replaced by republish + producer.send(<topic>.dlq) awaited before commitPastOriginal → no partition rewind, no loss), AC3 (evo_broker_dead_lettered_total{broker,topic} counter exposed on /metrics), and AC5 (REDELIVERY.md: limit default 3 validated >=1, <name>.dlq naming + x-dlq-reason/x-original-topic, manual replay) all verified. Loop-prevention confirmed: the 3-segment Kafka subscribe pattern can't match the 4-segment .dlq topic; the RabbitMQ .dlq is asserted but never consumed.
AC4 (alert on DLQ depth) — accepted as deferred. The metric is a Counter (dead-lettered rate — the practical alert signal), not a depth gauge (broker-admin polling, explicitly out of scope). The alert rule itself belongs in [5.3] EVO-1224 (external Prometheus/Alertmanager). Approving on the team convention that alerting config lives externally — landing the DLQ alert in 5.3 is at your discretion as a follow-up.
Accepted-LOW: the RabbitMQ DLQ leg uses non-confirm sendToQueue then ack(original) → a narrow broker-failure window could lose instead of dead-letter (same fire-and-forget property as the existing publish; the Kafka path is safe — send is awaited + idempotent). A confirm-channel on the DLQ leg would close it — your call.
Merging to develop.
Summary
Adds a broker-level redelivery ceiling (EVO-1677) so a poison message can never block a queue/partition indefinitely — independent of whether the application classified the error as terminal (that is EVO-1676's typed
TerminalError). Implemented uniformly across both adapters via a header-based attempt counter; no RabbitMQ quorum-queue migration.On
nack(msg, requeue=true):x-delivery-attempt(retry goes to the tail, not the head), then ack/commit the original.<name>.dlq+ incrementevo_broker_dead_lettered_total, then ack/commit the original.sendToQueueon the existing classic durable queues;<queue>.dlqasserted lazily.seekwould rewind and block the partition);<topic>.dlq.nack(msg, requeue=false)(explicit terminal drop) is unchanged.Security
Test plan
evo-flow: npm run typecheck— cleanevo-flow: npx jest src/shared/broker— 155 passed, 21 skipped (integration specs gated by*_INTEGRATION=1)evo-flow: npx eslint <changed .ts>— cleanCovers ACs: RabbitMQ N-failures dead-letter to
<queue>.dlq(AC1); Kafka N-failures route to<topic>.dlq(AC2); both incrementevo_broker_dead_lettered_total(AC3); metric exposed on/metrics+ alert rule documented (AC4, alerting is external/5.3); design documented inREDELIVERY.md(AC5).Review notes
^events\.received\.[^.]+$(single segment) does NOT matchevents.received.<platform>.dlq(two segments), and the RabbitMQ DLQ is reached viasendToQueue(bypasses the exchange/binding) — so a DLQ message is never re-consumed.declaredDlqsis now cleared on reconnect alongsidedeclaredExchanges, so a post-reconnect dead-letter re-asserts the DLQ queue instead of silently dropping.Scope / boundary
<name>.dlq) is intentionally separate from the app-levelevents.failed(EVO-1214); routing one into the other is a future decision.dead_lettered_totalcounter is the alert signal.Changed Files
src/shared/broker/redelivery.ts(new) +redelivery.spec.ts(new)src/shared/broker/metrics/broker-metrics.tssrc/shared/broker/adapters/rabbitmq-broker.adapter.ts(+ spec)src/shared/broker/adapters/kafka-broker.adapter.ts(+ spec)src/shared/broker/REDELIVERY.md(new)Linked Issue