
PSv2: Async jobs hang forever when NATS tasks exhaust max_deliver without posting results #1168

@mihow

Description

Summary

When an external ML worker pulls tasks from a NATS-backed async job but fails to post results back, the tasks silently die in NATS after exhausting max_deliver retries. Django is never notified, no error is logged on the job record, and the job remains in STARTED status indefinitely.

Reproduction

  1. Create an async_api job with N images
  2. Have a worker pull tasks via GET /jobs/{id}/tasks/
  3. Worker fails to process or crashes — never calls POST /jobs/{id}/result/
  4. Wait for NATS ack_wait (30s default) × max_deliver (5) = ~2.5 minutes per message
  5. All messages become dead in NATS
  6. Job stays STARTED forever with 0 errors logged
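The wait in step 4 follows directly from the consumer settings; a quick sanity check of the worst-case time for a single message to die:

```python
# Worst-case time for one message to exhaust its retries, using the
# defaults from the reproduction steps: ack_wait=30s, max_deliver=5.
ack_wait_s = 30
max_deliver = 5
worst_case_s = ack_wait_s * max_deliver
print(worst_case_s)  # 150 seconds, i.e. ~2.5 minutes per message
```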

Observed Behavior

  • NATS consumer state: num_pending=0, num_ack_pending=0, num_redelivered=756
  • Job progress: 168/925 processed, 757 remaining, 0 failed
  • Job logs: empty (no errors, no warnings)
  • Job status: STARTED (never transitions to FAILURE)

Root Cause

The result-processing path (process_nats_pipeline_result) is the only place that updates job progress and logs errors. When a worker pulls a task but never posts results, this code path is never invoked. NATS handles retries internally and eventually drops the message — but there is no callback, webhook, or polling mechanism for Django to detect that messages have been permanently dropped.

The _fail_job() function added in #1162 only triggers when Redis state is missing during result processing. It does not cover the case where result processing never happens at all.

Proposed Solution

Add a stale consumer detection mechanism. There are two basic approaches, which can also be combined:

Option A: Check inside the /tasks/ endpoint

When reserve_tasks() returns an empty list, check the NATS consumer state. If num_pending == 0 and num_ack_pending == 0 but the job still has remaining images (from Redis or the Job progress), mark the job as FAILURE with a descriptive error message.

Pros: Runs naturally as workers poll, no extra infrastructure.
Cons: Requires a worker to keep polling; if all workers stop, the check never runs.
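A minimal sketch of the Option A check. The helper name, the endpoint wiring in the comments, and the job fields are all hypothetical; only the consumer-state field names (`num_pending`, `num_ack_pending`) mirror the NATS consumer info shown under Observed Behavior:

```python
# Hypothetical sketch of Option A: decide whether a job should be failed
# based on NATS JetStream consumer state. Field names mirror the consumer
# info in "Observed Behavior"; the function name is an assumption.

def consumer_exhausted(num_pending: int, num_ack_pending: int, remaining: int) -> bool:
    """True when NATS has nothing left to deliver or redeliver, yet the
    job still expects results: every outstanding task hit max_deliver
    and was silently dropped."""
    return num_pending == 0 and num_ack_pending == 0 and remaining > 0


# Inside the /tasks/ endpoint, after reserve_tasks() returns an empty
# list (illustrative, not the project's actual code):
#
#     info = js.consumer_info(stream, durable)      # NATS consumer state
#     remaining = job.total_images - job.processed  # from Redis or Job progress
#     if consumer_exhausted(info.num_pending, info.num_ack_pending, remaining):
#         job.fail("NATS tasks exhausted max_deliver without posting results")
```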

Option B: Periodic Celery beat task

Add a beat task that runs every few minutes, queries all STARTED async_api jobs, checks their NATS consumer state, and fails any job where the consumer is exhausted but progress is incomplete.

Pros: Catches stalled jobs even if no workers are polling. Can also detect jobs where the NATS stream was deleted (e.g., container restart with ephemeral storage).
Cons: Adds a periodic task and requires NATS connectivity from the beat worker.
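The Option B sweep could look something like the sketch below. The core decision is kept as a pure function over plain data so it stays testable; the Celery wiring is shown only as comments, and every name here (job fields, task name, schedule) is an assumption:

```python
# Hypothetical sketch of Option B: a periodic sweep over STARTED
# async_api jobs. Jobs and consumer states are passed in as plain data;
# all field names are assumptions.

def find_stalled_jobs(jobs, consumer_states):
    """Return ids of jobs whose NATS consumer is exhausted, or whose
    stream/consumer is gone entirely, while progress is incomplete.

    jobs:            iterable of dicts with "id" and "remaining" keys
    consumer_states: dict mapping job id -> {"num_pending": int,
                     "num_ack_pending": int}, or None if the stream or
                     consumer no longer exists (e.g. NATS restarted
                     with ephemeral /tmp storage)
    """
    stalled = []
    for job in jobs:
        if job["remaining"] <= 0:
            continue  # job actually finished; nothing to do
        state = consumer_states.get(job["id"])
        if state is None:
            stalled.append(job["id"])  # stream deleted: results can never arrive
        elif state["num_pending"] == 0 and state["num_ack_pending"] == 0:
            stalled.append(job["id"])  # all messages dead after max_deliver
    return stalled


# Celery wiring (illustrative):
#
#     @app.task
#     def sweep_stalled_async_jobs():
#         jobs = Job.objects.filter(status="STARTED", dispatch_mode="async_api")
#         ...  # build consumer_states via js.consumer_info(), fail stalled jobs
#
#     app.conf.beat_schedule["sweep-stalled-async-jobs"] = {
#         "task": "sweep_stalled_async_jobs",
#         "schedule": 300.0,  # every 5 minutes
#     }
```

Keeping the predicate separate from the Celery task also lets the same check back both options: the /tasks/ endpoint (Option A) and the beat task (Option B) can share it.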

Option C: Both

Use Option A for fast detection during active polling, and Option B as a safety net.

Additional Context

  • Related PR: Support ML async job cancellation, fail jobs on redis errors #1162 (async job cancellation + Redis error handling)
  • NATS consumer config: max_deliver=5, ack_wait=30s (configurable via NATS_TASK_TTR)
  • The job's dispatch_mode is async_api and pipeline is set
  • JetStream storage is ephemeral (/tmp/nats/jetstream), so a NATS container restart also causes silent data loss — the beat task (Option B) would catch this too

Acceptance Criteria

  • A job whose NATS tasks have all been exhausted (dead) is detected and marked as FAILURE
  • An error message is logged on the job record explaining that tasks were dropped
  • The detection works even if no external workers are actively polling
  • Existing tests pass; new test covers the dead-message scenario
