[GSoC Proposal Draft] - Digvijay Rawat - SQL Adapter for Background Jobs #240

Digvijay-x1 · 2026-03-13T17:32:29Z

SQL Adapter for Background Jobs

Introduction

Rage is a Ruby web framework built for speed, using fibers and the Iodine HTTP server. Its built-in background job system, Rage::Deferred, is intentionally lightweight and works well for single-node deployments.

The problem is durability in cloud-native environments. Today, deferred jobs are persisted to local disk. In Kubernetes and similar setups, local storage is ephemeral and coordination through file locks is node-local. This creates a real risk of lost jobs during pod restarts.

This proposal introduces a SQL backend for Rage::Deferred that keeps Rage's execution model intact while adding shared durability and multi-pod coordination.

The backend will use Active Record as the database abstraction layer, so Rage does not need separate backend implementations per SQL engine.

Problem Understanding

What Rage does today (from the codebase)

Current Rage::Deferred behavior has a clean separation between queue logic and backend storage:

The queue persists and schedules tasks.
The backend contract is storage-only (add, remove, pending_tasks).
On startup, deferred tasks are loaded from backend and scheduled by the queue.

So the existing dependency direction is one-way: queue -> backend.
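As a rough illustration of that storage-only contract, here is a minimal in-memory sketch. The class name `MemoryBackend` and the id format are assumptions made for this example; only the three method names and the `[task_id, context, publish_at]` tuple shape come from the proposal.

```ruby
# Hypothetical in-memory stand-in for a Rage::Deferred backend.
# It only stores and returns tasks; it never schedules them.
class MemoryBackend
  def initialize
    @tasks = {} # task_id => [context, publish_at]
    @next_id = 0
  end

  # Persist a task and return its id. The id format is made up for this sketch.
  def add(context, publish_at: nil, task_id: nil)
    task_id ||= "task-#{@next_id += 1}"
    @tasks[task_id] = [context, publish_at]
    task_id
  end

  # Forget a task once the queue reports it as done.
  def remove(task_id)
    @tasks.delete(task_id)
    nil
  end

  # Everything still stored, as [task_id, context, publish_at] tuples.
  def pending_tasks
    @tasks.map { |task_id, (context, publish_at)| [task_id, context, publish_at] }
  end
end
```

The queue is the only caller here, which is what keeps the dependency one-way.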

Why this is a problem in Kubernetes

The disk backend relies on per-node file locking (flock) and local files.

Pod restart can wipe local WAL files.
flock does not coordinate across nodes/pods.
Work cannot be naturally rebalanced across pods.

That means Rage::Deferred jobs can be silently lost in normal cloud operations (rollouts, evictions, OOM kills).

Design Goals

Preserve Rage's existing queue/backend boundaries.
Keep at-least-once semantics.
Add shared durability and safe multi-worker claiming.
Avoid introducing heavyweight infra (Redis/Sidekiq).
Fit Rage's existing migration/task workflow.
Use Active Record so applications keep a single database adapter gem for their chosen engine.

Technical Approach

1. Keep one-way architecture (Queue -> Backend)

A key update based on mentor feedback:

The SQL backend will not schedule jobs or own Iodine timers.
Scheduling remains in the queue layer.
The backend only provides data operations and coordination primitives.

2. Backend responsibilities

Rage::Deferred::Backends::Sql will handle:

Insert task row (add).
Delete task row on completion (remove).
Return startup-claimable rows (pending_tasks).
Worker registration, heartbeat updates, stale-worker cleanup.
Atomic task claiming using SQL row-level locks.
Sending unrecoverable rows to a dedicated DLQ table.

Queue/deferred lifecycle code will handle:

startup loading and scheduling,
periodic claiming loop for due delayed tasks,
periodic stale-worker sweep trigger,
graceful release on shutdown.
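One way to picture this split is a queue-side helper that pulls from the backend and hands the result to a scheduler, so the backend never calls into the queue. This is only a sketch: `ClaimTick`, the `schedule` callable, and the `claim_due_tasks` method name on the backend are assumptions for illustration.

```ruby
# Hypothetical queue-side helper: it pulls from the backend and hands the
# result to a scheduler. The backend never calls back into the queue.
class ClaimTick
  def initialize(backend:, batch_size:, schedule:)
    @backend = backend
    @batch_size = batch_size
    @schedule = schedule # callable owned by the queue, e.g. wrapping a timer
  end

  # One polling step; the queue would invoke this from its own periodic timer.
  # Returns the number of tasks handed to the scheduler.
  def call
    claimed = @backend.claim_due_tasks(@batch_size)
    claimed.each { |task_id, context, publish_at| @schedule.call(task_id, context, publish_at) }
    claimed.size
  end
end
```

Because the scheduler is injected, the same helper could be exercised in tests with a stub backend and a plain lambda.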

3. Schema design

I am proposing three tables:

CREATE TABLE rage_active_workers (
  id                varchar(255) PRIMARY KEY,
  worker_heartbeat  timestamp NOT NULL,
  created_at        timestamp NOT NULL
);

CREATE TABLE rage_deferred_tasks (
  id                varchar(255) PRIMARY KEY,
  owner_id          varchar(255),
  serialized_task   text NOT NULL,
  publish_at        timestamp,
  created_at        timestamp NOT NULL
);

CREATE INDEX idx_rage_tasks_owner ON rage_deferred_tasks(owner_id);
CREATE INDEX idx_rage_tasks_claim ON rage_deferred_tasks(owner_id, publish_at);

CREATE TABLE rage_deferred_dead_tasks (
  id                     bigint PRIMARY KEY,
  task_id                varchar(255) NOT NULL,
  serialized_task        text NOT NULL,
  owner_id               varchar(255),
  publish_at             timestamp,
  failed_execution_count integer NOT NULL,
  failure_class          varchar(255),
  failure_message        text,
  failed_at              timestamp NOT NULL,
  created_at             timestamp NOT NULL
);

CREATE INDEX idx_rage_dead_task_id ON rage_deferred_dead_tasks(task_id);
CREATE INDEX idx_rage_dead_failed_at ON rage_deferred_dead_tasks(failed_at);

Notes:

owner_id IS NULL means unclaimed.
We still avoid a separate status column for active tasks.
DLQ is a separate table (per mentor feedback), so max-failure policy changes do not accidentally revive dead rows.

4. Enqueue and execution flow

Immediate task:

Queue calls backend add with owner_id = current_worker_id.
Queue schedules execution immediately (Iodine.run_after(nil)).
On success, queue calls backend remove.

Delayed task:

Queue calls backend add with owner_id = NULL and publish_at.
Queue-level periodic claim loop fetches due unowned rows.
Claimed rows are scheduled by queue.

This keeps SQL as durability/coordination, not execution.
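The ownership rule in the two flows above can be sketched as a small helper that builds the row to insert. The helper name and its keyword arguments are hypothetical; the column names follow the schema in section 3.

```ruby
# Hypothetical helper: builds the row attributes the backend would insert.
# Immediate tasks are owned by the enqueuing worker; delayed tasks start unowned.
def deferred_row_attributes(task_id:, serialized_task:, worker_id:, publish_at: nil, now: Time.now)
  {
    id: task_id,
    owner_id: publish_at.nil? ? worker_id : nil,
    serialized_task: serialized_task,
    publish_at: publish_at,
    created_at: now
  }
end
```

Leaving `owner_id` empty for delayed tasks is what lets any worker pick them up in a later claim cycle.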

5. Active Record claiming flow

Task claiming will be implemented through Active Record transactions and locking APIs, keeping the backend implementation framework-level instead of adapter-specific.

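def claim_due_tasks(batch_size)
  with_connection do
    RageDeferredTask.transaction do
      # Lock a batch of due, unowned rows; rows locked by other workers are skipped.
      rows = RageDeferredTask
        .where(owner_id: nil)
        .where("publish_at IS NULL OR publish_at <= ?", Time.current)
        .order(created_at: :asc)
        .limit(batch_size)
        .lock("FOR UPDATE SKIP LOCKED")
        .to_a

      # `next` leaves the block normally so the transaction commits;
      # an early `return` here is handled differently across Rails versions.
      next [] if rows.empty?

      # Mark the locked rows as owned by this worker before the locks are released.
      RageDeferredTask.where(id: rows.map(&:id)).update_all(owner_id: @worker_id)

      rows.map do |row|
        [row.id, Marshal.load(row.serialized_task), row.publish_at&.to_i]
      end
    end
  end
end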

The implementation remains in Active Record and uses the application's configured DB adapter.
FOR UPDATE SKIP LOCKED is passed through Active Record locking APIs to avoid double-claiming across workers.
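The property this claim step is meant to guarantee (each due row ends up with exactly one owner) can be modelled without a database. The sketch below is an assumption-laden stand-in: `ClaimModel` is hypothetical, and a `Mutex` replaces row locks, so claimers are serialized rather than skipping locked rows as `SKIP LOCKED` would.

```ruby
# In-memory model of the claim step. A Mutex stands in for the database's
# row locks, so this serializes claimers instead of skipping locked rows.
class ClaimModel
  Row = Struct.new(:id, :owner_id, :publish_at, :created_at)

  def initialize(rows)
    @rows = rows
    @lock = Mutex.new
  end

  # Claim up to batch_size due, unowned rows for worker_id; oldest first.
  def claim(worker_id, batch_size, now: Time.now)
    @lock.synchronize do
      due = @rows.select { |r| r.owner_id.nil? && (r.publish_at.nil? || r.publish_at <= now) }
      batch = due.sort_by(&:created_at).first(batch_size)
      batch.each { |r| r.owner_id = worker_id }
      batch.map(&:id)
    end
  end
end
```

A model like this could double as a reference when writing the "no duplicate execution" integration test against a real database.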

5.1 Contract mapping for `add` / `remove` / `pending_tasks`

To stay aligned with the existing backend contract, SQL behavior maps directly to these methods:

| Deferred backend contract | SQL backend method (planned) | Behavior |
| --- | --- | --- |
| `add(task, publish_at:, task_id:)` | `add` | Insert into `rage_deferred_tasks`; `owner_id` is current worker for immediate tasks, `NULL` for delayed tasks. |
| `remove(task_id)` | `remove` | Delete task row by primary key after successful completion or terminal handling. |
| `pending_tasks` | `pending_tasks` | Claim a bounded batch of due unowned rows, deserialize, and return `[task_id, context, publish_at]` tuples for queue scheduling. |

6. Crash recovery and lifecycle hooks

Iodine already exposes worker lifecycle events (:on_start, :on_finish).
Plan:

On worker start (on_start):
- register worker,
- start heartbeat timer,
- start queue-owned claim timer,
- run initial claim for orphaned due tasks.
On worker finish (on_finish):
- release owned tasks (owner_id = NULL),
- unregister worker.
Hard crash path:
- surviving workers sweep stale heartbeats,
- orphan stale-worker tasks,
- reclaim them in next claim cycle.
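The hard-crash path above can be sketched as a pure-Ruby model. The function name and the in-memory data shapes are hypothetical; in the SQL backend the same steps would run against `rage_active_workers` and `rage_deferred_tasks`.

```ruby
# Hypothetical pure-Ruby model of the stale-worker sweep.
# workers: { worker_id => last_heartbeat (Time) }
# tasks:   array of hashes with an :owner_id key
# stale_threshold is in seconds.
def sweep_stale_workers(workers, tasks, stale_threshold:, now: Time.now)
  cutoff = now - stale_threshold
  stale_ids = workers.select { |_id, heartbeat| heartbeat < cutoff }.keys

  # Orphan the tasks first so they become claimable, then drop the workers.
  tasks.each { |task| task[:owner_id] = nil if stale_ids.include?(task[:owner_id]) }
  stale_ids.each { |id| workers.delete(id) }
  stale_ids
end
```

Orphaned tasks are not rescheduled here; they simply become visible to the next claim cycle.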

7. Dead-letter handling

For deserialization/infrastructure failures:

In a transaction, insert failed row into rage_deferred_dead_tasks with failure metadata.
Delete that row from rage_deferred_tasks.
Continue processing remaining claimed rows.

This creates a clear operator workflow: inspect, requeue manually if needed, or purge by retention policy.
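A minimal in-memory sketch of that flow, assuming failures surface as exceptions from `Marshal.load`. The function name and the hash/array stand-ins for the two tables are hypothetical.

```ruby
# Hypothetical in-memory model of dead-letter handling for claimed rows.
# rows:  array of hashes with :id and :serialized_task
# tasks: { id => row } standing in for rage_deferred_tasks
# dead:  array standing in for rage_deferred_dead_tasks
def load_claimed_rows(rows, tasks:, dead:, now: Time.now)
  rows.filter_map do |row|
    begin
      [row[:id], Marshal.load(row[:serialized_task])]
    rescue StandardError => e
      # In SQL, this insert and delete would share one transaction.
      dead << { task_id: row[:id], serialized_task: row[:serialized_task],
                failure_class: e.class.name, failure_message: e.message, failed_at: now }
      tasks.delete(row[:id])
      nil # skip this row and keep processing the rest
    end
  end
end
```

The unreadable payload is copied verbatim into the dead-letter record so an operator can still inspect it.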

8. Migration strategy (instead of runtime DDL)

Another important update from feedback: no create_tables call during runtime boot.

Rage already has migration tooling and task loading.

So schema will be delivered through migrations:

rage g migration create_rage_deferred_sql_tables
rage db:migrate

This avoids assuming production app DB users have DDL permission.

Configuration API

Rage::Configuration::Deferred currently supports :disk and nil only.

The project will extend that API with :sql while preserving existing behavior:

Rage.configure do
  config.deferred.backend = :sql, {
    batch_size: 150,            # tasks per claim cycle
    claim_interval: 5_000,      # ms
    heartbeat_interval: 30_000, # ms
    stale_threshold: 90         # seconds
  }
end

Backend will use ActiveRecord::Base.connection_pool.with_connection for DB operations.
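Option handling for the `:sql` backend could look roughly like the sketch below. Treating the example values as defaults, the validation rules, and the names `SQL_BACKEND_DEFAULTS` and `normalize_sql_backend_options` are all assumptions for illustration, not part of Rage's current API.

```ruby
# Hypothetical option handling for the :sql backend.
# Using the example values as defaults is an assumption of this sketch.
SQL_BACKEND_DEFAULTS = {
  batch_size: 150,
  claim_interval: 5_000,      # ms
  heartbeat_interval: 30_000, # ms
  stale_threshold: 90         # seconds
}.freeze

def normalize_sql_backend_options(options = {})
  unknown = options.keys - SQL_BACKEND_DEFAULTS.keys
  raise ArgumentError, "unknown options: #{unknown.join(', ')}" unless unknown.empty?

  opts = SQL_BACKEND_DEFAULTS.merge(options)
  opts.each do |key, value|
    raise ArgumentError, "#{key} must be a positive integer" unless value.is_a?(Integer) && value.positive?
  end

  # A worker should miss more than one heartbeat before it is considered stale.
  if opts[:stale_threshold] * 1_000 <= opts[:heartbeat_interval]
    raise ArgumentError, "stale_threshold must exceed heartbeat_interval"
  end

  opts
end
```

With the example values, a worker is treated as stale after 90 seconds, which is three missed 30-second heartbeats.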

Testing Plan

Unit tests

backend add/remove/claim APIs,
stale worker sweep,
worker registration lifecycle,
DLQ move-on-failure behavior.

Integration tests

multi-worker claiming with SKIP LOCKED,
delayed task promotion via periodic claim,
graceful shutdown release and reclaim,
hard crash simulation and stale recovery,
no duplicate execution within a claim window.

E2E validation

A test Rage app on Kind with 2+ pods and PostgreSQL to validate:

jobs survive pod restart,
jobs redistribute across pods,
no silent loss under rolling updates.

Milestones & Timeline

| Week | Milestone | Deliverables |
| --- | --- | --- |
| 1-2 | Architecture alignment | Final SQL backend contract aligned to current queue/backend boundary. |
| 3-4 | Core SQL backend | `add/remove/pending_tasks/claim` + heartbeat/worker registry primitives. |
| 5-6 | Migration and config support | `:sql` backend option, option parsing, migration generator/docs. |
| 7-8 | Reliability features | stale-worker recovery + separate DLQ table and retention/requeue docs. |
| 9-10 | Test infrastructure | integration suite (multi-worker, delayed claims, crash recovery). |
| 11-12 | Final polish | docs, CHANGELOG, review feedback, upstream PR prep. |

Deliverables

Rage::Deferred::Backends::Sql implemented with ActiveRecord.
:sql support in deferred backend configuration.
Migration files for active worker/tasks/dead-task tables and indexes.
Unit + integration tests for distribution and recovery behavior.
Documentation: setup, migration steps, operational notes (DLQ and requeue).
E2E Kubernetes validation report with reproducible steps.

Validation Criteria

The feature is considered complete when all of the following pass:

Enqueued tasks are persisted and executed successfully.
Delayed tasks execute after becoming eligible.
Tasks survive worker/pod restarts.
Stale-worker recovery reassigns orphaned tasks.
Multi-worker instances claim disjoint batches under load.
Corrupt/unreadable rows are moved to DLQ, not silently dropped.

About Me

I'm an undergraduate student who started learning Ruby while preparing for my college technical society, where we maintain campus systems like ERP, event sites, and internal admin tools. Since our ERP stack uses Ruby on Rails, Ruby became my natural starting point.

In my second year, I began contributing to Ruby open source projects and found myself increasingly interested in infrastructure-level work. This Rage SQL adapter project fits that interest perfectly: it solves a practical reliability gap for production deployments while staying close to framework internals.

I want to work on this because it combines distributed coordination, failure recovery, and API design in a way that is both challenging and directly useful to Rage users.

1. How much time would you be able to devote to the project?

I have summer break from May to July (21 May to 19 July), during which I can dedicate about 40 hours/week for 8 weeks (around 320 hours).

From 27 July to 15 September (about 7 weeks), my semester will be active, and I can dedicate around 20 hours/week (around 140 hours).

Overall, this is enough time for the project, and I can adjust hours if needed near deadlines.

2. What other obligations might you need to work around during the summer?

Mid-semester exams are expected around 10 September (date not finalized yet).
No other major summer obligations planned.

3. How often, and through which channel(s), do you plan on communicating with your mentor?

I plan to share daily progress updates and keep communication regular.

Discord
2 meetings/week (flexible), on Google Meet or mentor-preferred platform

If selected for Rage in GSoC, I will focus on producing high-quality contributions, staying active in mentor/community communication, and continuing long-term contributions even after the program.

Thanks and regards,
Digvijay Rawat (Digvijay-x1)

rsamoilov · 2026-03-16T12:48:46Z

Maintainer

Hi @Digvijay-x1

Overall, you have a very strong grasp of the core problem. I appreciate that you explicitly noted this project is about adding a data durability layer for stateless environments, rather than trying to build a distributed queue.

Your approach to handling graceful shutdowns with SIGTERM and mitigating thundering herds on boot using FOR UPDATE SKIP LOCKED is spot on.

There are a few edge cases I'd love for you to clarify or rethink for your next revision:

1. Database Connections & Fiber Concurrency

There is a bit of a contradiction in the proposal regarding Active Record. In the Solution section, you mention "no new dependencies beyond activerecord", but your code snippets use the raw pg gem, and your timeline mentions building an Active Record adapter later. We need to align on exactly which approach you are proposing as the primary deliverable.
More importantly, there is a misconception about fibers in your statement: "Rage is fiber-based and single-threaded per worker, so connection pooling is unnecessary". While a worker is single-threaded, fibers execute concurrently. If Fiber A makes a database query and yields on I/O, Fiber B might wake up and try to send a query over that exact same raw PG::Connection. This will corrupt the connection state. Whether you use Active Record or raw pg, you will need a fiber-aware connection pool. One advantage of Active Record is that such a pool already exists.

2. Dead Letter Queue

Your current DB cleanup strategy relies on hard deletes upon completion, which is great for keeping the table lean. However, the proposal is currently missing a strategy for poison pills.

For example, in your pending_tasks snippet, if a task fails to deserialize via Marshal.load, the error is rescued and ignored. Because the task is never deleted or updated, it will sit in the database forever. If a pod crashes, the sweeper will orphan it, another worker will pick it up, fail to deserialize it again, and the cycle continues. Your proposal should include a plan for a Dead Letter Queue (e.g., moving failed/poisoned tasks to a separate table or marking them with an error status) so they don't clog the system.

3. SQLite Support

Your proposal relies on FOR UPDATE SKIP LOCKED, which is perfect for Postgres and MySQL. However, SQLite does not support this clause. How would you approach the pending_tasks query for a SQLite adapter to prevent contention when multiple worker processes boot simultaneously?

Looking forward to your updated version! Let me know if you have any questions about this feedback.

1 reply

Digvijay-x1 Mar 21, 2026
Author

Hello Roman,

Thank you so much for taking the time to review the proposal and point out the edge cases. I have updated the proposal.

Database Connections & Fiber Concurrency

I have changed the dependency from pg to AR; the SQL code is only for reference.

Dead Letter Queue

Added a DLQ

SQLite Support

I have updated the timeline for it

rsamoilov · 2026-03-24T20:01:14Z

Maintainer

Hi @Digvijay-x1

I love this! Added a couple of suggestions, but this is already very strong! Make sure to submit your proposal before March 31!

> Polling loop for delayed tasks (every 5s) instead of per-task timers.

I can see the appeal here - we could use the SQL backend to remove one of the trade-offs Rage::Deferred makes, i.e. storing all delayed tasks in memory. I'm on the fence about this approach, but if you want to go this way, I'd suggest describing how exactly this will be accomplished. The problem here is that we want to encapsulate the DB logic in the backend class. If the backend class polls the database and schedules tasks, it creates a two-way dependency where the queue depends on the backend and the backend depends on the queue. Ideally, we'd want to avoid this and have a one-way queue -> backend dependency.

This is also very different from how Rage::Deferred operates today, so abstracting this new behaviour away while preserving how the existing disk backend works is also something you'd need to think through.

> The DLQ is implemented via the failed_execution_count column on the existing rage_deferred_tasks table. No separate table is needed.

I think you do need a separate table. The limit for failed_execution_count can be changed. For example, the task has failed 5 times and is marked as dead. Then, new code is deployed where this task class has the max_retries 10 setting, and the backend suddenly thinks the task is not dead anymore.

> Create tables if they don't exist
> with_connection { |conn| self.class.create_tables(conn) }

Great thinking, but this will be problematic in real-world deployments. This approach essentially hides some of the tables from the schema file. Additionally, it assumes the DB credentials given to the app process have the DDL permissions, which is not necessarily the case. If an app process only has DML permissions, this line will crash the app on boot.

Instead, there will need to be a command to generate the migrations for the SQL backend.

> A periodic sweeper (running every 60 seconds) on surviving workers detects stale heartbeats and reclaims tasks

I have the same suggestion here as with the polling loop - think through the abstractions and try to avoid the two-way dependency where the backend schedules the tasks in the queue.

0 replies
