
Conversation


@Arseniy-Popov Arseniy-Popov commented Dec 2, 2025

Description

Work-in-progress for a relational database based broker.
Fixes #799

Motivation

The primary benefit of a message queue built on top of a relational database is the ability to insert messages transactionally, atomically with other database operations, which enables the transactional outbox pattern. The relational database is also usually the most readily available, already-provisioned piece of infrastructure for a given service. While implementing all the patterns and semantics of a full-blown message queue or streaming platform (e.g. Kafka-like partition-based horizontal scaling with local ordering) would be problematic, a relational-database-based queue is, given a proper understanding of the trade-offs involved, an appropriate tool for many low-to-medium-throughput, latency-tolerant uses, including as part of a larger messaging flow that involves a "proper" queue (e.g. as a transactional layer between a service and the queue).
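
For illustration, an outbox-style publish could look roughly like the sketch below. The table and column names (`outbox_messages`, etc.) are placeholders, not this PR's actual schema; the point is only that the message insert shares a transaction with the business write.

```python
from sqlalchemy import create_engine, text

# Placeholder DSN and schema; illustrative only.
engine = create_engine("postgresql+psycopg://user:pass@localhost/app")


def place_order_and_publish(order_id: int, payload: str) -> None:
    # Both inserts commit (or roll back) together, so the message is published
    # if and only if the business change is persisted.
    with engine.begin() as conn:
        conn.execute(text("INSERT INTO orders (id) VALUES (:id)"), {"id": order_id})
        conn.execute(
            text(
                "INSERT INTO outbox_messages (queue, body, status, next_attempt_at) "
                "VALUES (:queue, :body, 'PENDING', now())"
            ),
            {"queue": "order-created", "body": payload},
        )
```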

Design

The key components are in message.py, usecase.py, and client.py.

Flow:

On start, the subscriber spawns four types of concurrent loops:

  1. Fetch loop: Periodically fetches PENDING or RETRYABLE messages from the database, simultaneously updating them in the database: marking them as PROCESSING, setting acquired_at to now, and incrementing attempts_count. Only messages with next_attempt_at <= now are fetched, ordered by next_attempt_at. The fetched messages are placed into an internal queue. The fetch limit is the minimum of fetch_batch_size and the free buffer capacity (fetch_batch_size * overfetch_factor minus currently queued messages). If the last fetch was "full" (returned as many messages as the limit), the next fetch happens after min_fetch_interval; otherwise after max_fetch_interval. A sketch of the claim query is included after this list.

  2. Worker loops (max_workers instances): Each worker takes a message from the internal queue and checks whether the attempt is allowed by the retry_strategy. If allowed, the message is processed; if not, it is Reject'ed. Depending on the processing result, the AckPolicy, and manual Ack/Nack/Reject, the message is Ack'ed, Nack'ed, or Reject'ed. For Nack'ed messages the retry_strategy is consulted to determine if and when the message may be retried: if a retry is allowed, the message is marked as RETRYABLE, otherwise as FAILED. Ack'ed messages are marked as COMPLETED and Reject'ed messages are marked as FAILED. The message is then buffered for flushing.

  3. Flush loop: Periodically flushes the buffered message state changes to the database. COMPLETED and FAILED messages are moved from the primary table to the archive table. The state of RETRYABLE messages is updated in the primary table. A sketch of the flush step is also included after this list.

  4. Release stuck loop: Periodically releases messages that have been stuck in PROCESSING state for longer than release_stuck_timeout since acquired_at. These messages are marked back as PENDING.
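
A rough sketch of what the fetch loop's claim query could look like on Postgres, assuming the column names used above (status, acquired_at, attempts_count, next_attempt_at) and a `messages` table; the actual query in the PR's usecase/client code may differ:

```python
from sqlalchemy import text

# Illustrative only: claims a batch atomically, so concurrent workers skip
# rows already locked by another fetch ("SELECT ... FOR UPDATE SKIP LOCKED").
CLAIM_BATCH = text(
    """
    UPDATE messages
    SET status = 'PROCESSING',
        acquired_at = now(),
        attempts_count = attempts_count + 1
    WHERE id IN (
        SELECT id
        FROM messages
        WHERE status IN ('PENDING', 'RETRYABLE')
          AND next_attempt_at <= now()
        ORDER BY next_attempt_at
        LIMIT :limit
        FOR UPDATE SKIP LOCKED
    )
    RETURNING *
    """
)


def fetch_batch(conn, fetch_batch_size: int, overfetch_factor: float, queued: int):
    # The limit is the smaller of the batch size and the remaining buffer
    # capacity (fetch_batch_size * overfetch_factor minus queued messages).
    limit = min(fetch_batch_size, int(fetch_batch_size * overfetch_factor) - queued)
    if limit <= 0:
        return []
    return conn.execute(CLAIM_BATCH, {"limit": limit}).fetchall()
```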
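
Similarly, the flush step could be sketched as below; the archive table name, columns, and single-transaction layout are assumptions about the design, not the PR's actual code:

```python
from sqlalchemy import text

# Illustrative only: move finished messages to the archive and reschedule
# retryable ones, all within one transaction.
ARCHIVE = text(
    "INSERT INTO messages_archive (id, body, status) VALUES (:id, :body, :status)"
)
DELETE = text("DELETE FROM messages WHERE id = :id")
RESCHEDULE = text(
    "UPDATE messages SET status = 'RETRYABLE', next_attempt_at = :next_attempt_at "
    "WHERE id = :id"
)


def flush(engine, finished, retryable) -> None:
    # `finished` holds messages whose buffered state is COMPLETED or FAILED;
    # `retryable` holds messages to be rescheduled.
    with engine.begin() as conn:
        for msg in finished:
            conn.execute(ARCHIVE, {"id": msg.id, "body": msg.body, "status": msg.status})
            conn.execute(DELETE, {"id": msg.id})
        for msg in retryable:
            conn.execute(
                RESCHEDULE, {"id": msg.id, "next_attempt_at": msg.next_attempt_at}
            )
```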

On stop, all loops are gracefully stopped. Messages that have been acquired but are not yet being processed are drained from the internal queue and marked back as PENDING. The subscriber waits for all tasks to complete within graceful_shutdown_timeout, then performs a final flush.

Notes:

This design allows work sharing between processes/nodes because messages are claimed with "SELECT ... FOR UPDATE SKIP LOCKED".

This design adheres to the "at least once" processing guarantee because changes are flushed to the database only after a processing attempt. Messages might be processed more times than the retry_strategy allows if, among other things, the flush does not happen because of a crash or failure after a message has been processed.

This design handles the poison-message problem (messages that crash the worker without the ability to catch the exception, e.g. due to OOM termination) because attempts_count is incremented and retry_strategy is consulted prior to the processing attempt.
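
A rough sketch of the worker-side flow described above; the retry_strategy interface (max_attempts, next_delay) and the Message fields are assumptions for illustration, not the PR's actual API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Message:
    body: bytes
    attempts_count: int
    status: str = "PROCESSING"
    next_attempt_at: Optional[datetime] = None


async def handle(message: Message, retry_strategy, process) -> None:
    # attempts_count was already incremented by the claim query, so a message
    # that keeps killing the worker (e.g. via OOM) eventually exhausts its
    # attempts without ever reaching user code again.
    if message.attempts_count > retry_strategy.max_attempts:
        message.status = "FAILED"  # Reject'ed before processing
        return
    try:
        await process(message)
        message.status = "COMPLETED"  # Ack'ed
    except Exception:
        delay: Optional[timedelta] = retry_strategy.next_delay(message.attempts_count)
        if delay is None:
            message.status = "FAILED"  # Nack'ed, no retries left
        else:
            message.status = "RETRYABLE"  # Nack'ed, will be retried
            message.next_attempt_at = datetime.now(timezone.utc) + delay
```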

Why not use LISTEN/NOTIFY? It is specific to Postgres, and it is preferable to start with functionality universal to any database. When using multiple nodes/processes, distributing messages among them would still require "SELECT FOR UPDATE SKIP LOCKED", because a notification is delivered to all nodes/processes. A notification may also fail to arrive, especially if a node restarts. That is, polling is needed in any case. And once polling is in place, LISTEN/NOTIFY can be integrated to "wake up" the polling loop earlier than the interval-based schedule would.
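
If LISTEN/NOTIFY is added later, it could nudge the polling loop roughly as in the sketch below (not part of this PR; the channel name and DSN are placeholders, and asyncpg is used directly for brevity):

```python
import asyncio
import contextlib

import asyncpg

wakeup = asyncio.Event()


def _on_notify(connection, pid, channel, payload) -> None:
    # Called by asyncpg whenever a NOTIFY arrives; just nudges the poller.
    wakeup.set()


async def fetch_loop(dsn: str, max_fetch_interval: float) -> None:
    conn = await asyncpg.connect(dsn)
    await conn.add_listener("faststream_messages", _on_notify)
    try:
        while True:
            # Poll on a timer, but wake up early if a notification arrives.
            with contextlib.suppress(asyncio.TimeoutError):
                await asyncio.wait_for(wakeup.wait(), timeout=max_fetch_interval)
            wakeup.clear()
            ...  # run the claim query (see the sketch above)
    finally:
        await conn.close()
```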

Type of change

Please delete options that are not relevant.

  • Documentation (typos, code examples, or any documentation updates)
  • Bug fix (a non-breaking change that resolves an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a fix or feature that would disrupt existing functionality)
  • This change requires a documentation update

Checklist

  • My code adheres to the style guidelines of this project (just lint shows no errors)
  • I have conducted a self-review of my own code
  • I have made the necessary changes to the documentation
  • My changes do not generate any new warnings
  • I have added tests to validate the effectiveness of my fix or the functionality of my new feature
  • Both new and existing unit tests pass successfully on my local environment by running just test-coverage
  • I have ensured that static analysis tests are passing by running just static-analysis
  • I have included code examples to illustrate the modifications


CLAassistant commented Dec 2, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Dec 2, 2025
@Arseniy-Popov Arseniy-Popov force-pushed the feat/sqla-broker branch 3 times, most recently from 9b0d481 to 90721ee on December 3, 2025 at 15:56