research: graceful restart improvements — reduce downtime and user impact #287

## Context

Research into whether Untether's restart experience can be improved beyond the current graceful drain approach. Investigated zero-downtime patterns, Telegram-specific constraints, and subprocess preservation techniques.

**TL;DR:** True zero-downtime restart is impossible for our architecture (Telegram long-polling + long-lived subprocesses). But several practical improvements can reduce the gap from ~15-30s to ~5s and improve UX significantly.

## Current approach (already good)

Untether already implements:
- Graceful drain: on SIGTERM, waits up to 120s for active runs to finish
- Per-chat "Restarting — waiting for your run to finish…" notifications
- Timeout notification: "Restart timed out — N run(s) interrupted"
- Progress persistence: orphan messages edited to "⚠️ interrupted by restart" on next startup
- `KillMode=mixed` + `TimeoutStopSec=150` for clean subprocess cleanup

## Why true zero-downtime is impossible

**Telegram constraint:** Only one process can call `getUpdates` per bot token. A second caller gets `409 Conflict`. The Bot API frameworks checked (python-telegram-bot, aiogram) do not work around this; the usual advice is either "accept the gap" or "use webhook mode."

**Subprocess constraint:** Active Claude Code runs communicate via PTY FDs and stdout pipes. When the parent process exits, its ends of these FDs are closed, so the child loses its reader. Reconnecting to a running subprocess's JSONL output stream after parent death requires either:
- `SCM_RIGHTS` fd transfer over Unix socket (complex, fragile)
- `os.execv()` with inheritable FDs (destroys asyncio event loop, requires full state serialisation)
- `pidfd_getfd` (Linux 5.6+, requires ptrace permissions)

None of these are production-ready patterns for our use case.
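
For reference, the mechanics of the first option look like this in isolation. The sketch below uses a `socketpair` as a stand-in for the Unix socket between old and new process and a plain pipe as a stand-in for a child's stdout; both stand-ins and the `run-1` label are hypothetical. It requires Python 3.9+ on a Unix platform.

```python
import os
import socket

# Stand-ins: a socketpair for the old/new process link, a pipe for child stdout.
old_proc, new_proc = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
read_end, write_end = os.pipe()

# "Old process": send the read end of the pipe as SCM_RIGHTS ancillary data.
socket.send_fds(old_proc, [b"run-1"], [read_end])
os.close(read_end)  # the sender's copy is no longer needed

# "New process": receive a new descriptor that refers to the same pipe.
label, fds, _flags, _addr = socket.recv_fds(new_proc, 1024, 1)
received_fd = fds[0]

# Output written by the "child" is readable through the transferred descriptor.
os.write(write_end, b'{"type": "result"}\n')
line = os.read(received_fd, 1024)

for fd in (received_fd, write_end):
    os.close(fd)
old_proc.close()
new_proc.close()
```

The transfer itself is short. The fragile part is everything around it: both processes must be alive at the same time, any partially read JSONL buffer has to be handed over too, and the PTY side still needs separate handling.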

## Practical improvements (recommended)

### Tier 1: Low effort, high value

**1. Persist `update_id` offset** 
Write the last confirmed Telegram `update_id` to a state file. On restart, resume from `offset=last_id+1`. Telegram holds undelivered updates for 24 hours. This means zero lost messages — just a brief polling gap (~5-15s).

Currently, the new process starts with offset=0 and might re-process recent updates (deduplication prevents issues, but it's wasteful).
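
A minimal sketch of the offset file, assuming a JSON state file and hypothetical helper names (`save_offset`, `next_offset`). The write goes through a temp file and `os.replace` so a crash mid-write cannot leave a truncated file behind.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_offset(path: Path, update_id: int) -> None:
    """Atomically record the last confirmed update_id."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"last_update_id": update_id}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)
        raise


def next_offset(path: Path) -> Optional[int]:
    """Return the offset to pass to getUpdates, or None if nothing usable is stored."""
    try:
        data = json.loads(path.read_text())
        return int(data["last_update_id"]) + 1
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        return None
```

On startup the poller would pass `next_offset(state_path)` as `offset` when it is not `None` and fall back to the current behaviour otherwise.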

**2. `Type=notify` systemd service**
Use `sd_notify(READY=1)` after the event loop starts and first `getUpdates` succeeds. This tells systemd the new process is actually healthy, not just "PID exists." Also enables `STOPPING=1` during drain for better status reporting. Library: `async-sdnotify` on PyPI (or stdlib `socket` with `$NOTIFY_SOCKET`).
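
The stdlib route is small enough to sketch here. The helper name `sd_notify` is chosen for illustration; the protocol is a datagram sent to the Unix socket named in `$NOTIFY_SOCKET`.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string such as "READY=1" or "STOPPING=1" to systemd.

    Returns False when NOTIFY_SOCKET is unset, i.e. when the process is
    not running under a Type=notify unit.
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), addr)
    return True
```

Under this sketch the bot would call `sd_notify("READY=1")` once the first `getUpdates` returns and `sd_notify("STOPPING=1")` when the drain begins. Outside systemd both calls are no-ops.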

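**3. Set `RestartSec` explicitly**
The unit currently relies on the systemd default (100ms), so there is little restart delay left to remove. Note that `RestartSec=2` would lengthen the delay, not shorten it; an explicit value is only worth setting to make the behaviour visible in the unit file or to throttle crash loops.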

### Tier 2: Medium effort, targeted value

**4. Socket activation for webhook/trigger server**
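Define `untether.socket` listening on the trigger port. During restarts, the kernel holds incoming webhook connections in the socket's listen backlog until the new process accepts them, so webhooks are delayed rather than dropped (as long as senders do not time out first). Requires: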
- `untether.socket` unit file
- Detect `LISTEN_FDS` env var in trigger server startup
- Use `web.SockSite(runner, sock)` instead of `web.TCPSite`
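
The detection step could look roughly like this. `inherited_fds` is a hypothetical helper; the environment variables and the starting descriptor number follow the systemd socket-activation convention. The aiohttp wiring is shown only as a comment because it depends on how the trigger server is set up.

```python
import os
from typing import Mapping, Optional

SD_LISTEN_FDS_START = 3  # first inherited descriptor under socket activation


def inherited_fds(
    environ: Mapping[str, str] = os.environ, pid: Optional[int] = None
) -> list:
    """Return the descriptors systemd passed to this process, if any."""
    pid = os.getpid() if pid is None else pid
    # LISTEN_PID guards against the variables leaking into a child process.
    if environ.get("LISTEN_PID") != str(pid):
        return []
    try:
        count = int(environ.get("LISTEN_FDS", "0"))
    except ValueError:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


# In the trigger server startup (sketch, assuming an existing aiohttp runner):
#
#   fds = inherited_fds()
#   if fds:
#       sock = socket.socket(fileno=fds[0])
#       site = web.SockSite(runner, sock)
#   else:
#       site = web.TCPSite(runner, host, port)
```

Falling back to `web.TCPSite` keeps the server usable when started by hand, outside systemd.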

**5. Better drain UX**
- **Broadcast restart notice** to all active chats (not just those with running tasks) — "🔄 Untether is restarting, back in ~10s"
- **`/restart --force`** option to skip drain wait
- **Per-run drain priority** — let short-running tasks finish, cancel long ones first
- **Drain progress in Telegram** — edit a message with countdown: "Waiting for 2 runs to finish (45s remaining)"

**6. Pre-restart state snapshot enhancement**
Before exiting, write richer state: active session IDs, resume tokens, engine types, run durations. The new process reads this on startup and can:
- Show "Your previous run was interrupted. Resume with /continue" with session details
- Provide better resume guidance (which topic, which engine)
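
One possible shape for the snapshot, as a sketch only. The `RunSnapshot` fields mirror the list above; `chat_id`, the `version` key, and all function names are assumptions added for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunSnapshot:
    chat_id: int        # hypothetical: where to send the resume hint
    session_id: str
    resume_token: str
    engine: str
    duration_s: float


def dump_snapshot(runs: list) -> str:
    """Serialise active runs just before exit."""
    return json.dumps({"version": 1, "runs": [asdict(r) for r in runs]})


def load_snapshot(text: str) -> list:
    """Read a snapshot on startup; unknown versions are ignored."""
    data = json.loads(text)
    if data.get("version") != 1:
        return []
    return [RunSnapshot(**r) for r in data["runs"]]


def resume_hint(run: RunSnapshot) -> str:
    return (
        f"Your previous {run.engine} run was interrupted after "
        f"{run.duration_s:.0f}s. Resume with /continue"
    )
```

The version key lets a newer build discard a snapshot it cannot interpret instead of failing at startup.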

### Tier 3: Future consideration (high effort)

**7. Webhook mode for planned restarts**
On `/restart`, temporarily call `setWebhook` pointing to the trigger HTTP server, restart, then `deleteWebhook` to resume polling. Eliminates the polling gap entirely for planned restarts. The trigger aiohttp server is already running. Complex but technically feasible.
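
The two Bot API calls involved can be sketched as plain URL builders. `method_url` and `planned_restart_calls` are hypothetical names, and the sketch builds the request URLs without sending them.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.telegram.org"


def method_url(token: str, method: str, **params: str) -> str:
    """Build a Bot API request URL for the given method."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{API_ROOT}/bot{token}/{method}{query}"


def planned_restart_calls(token: str, webhook_url: str) -> list:
    """Requests to issue around a planned restart, in order."""
    return [
        # Before exiting: route updates to the trigger HTTP server.
        method_url(token, "setWebhook", url=webhook_url),
        # After the new process is up: drop the webhook and resume polling.
        method_url(token, "deleteWebhook"),
    ]
```

Two constraints to keep in mind: `getUpdates` cannot be used while a webhook is set, and Telegram only delivers webhooks to a publicly reachable HTTPS endpoint, so the trigger server would need to be exposed accordingly.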

**8. Subprocess output redirection**
Before exiting, redirect each child's stdout to a temp file (via `/proc/<pid>/fd/`), persist the path, and have the new process `tail -f` to resume JSONL translation. Fragile but would preserve runs across restarts.

## What NOT to pursue

- **Blue-green dual-process:** Impossible — Telegram enforces single poller per token
- **`os.execv()` in-place restart:** Destroys asyncio event loop, requires full state serialisation
- **`reptyr` terminal reattachment:** Works for terminals, not for stdout pipes
- **Redis/queue intermediary:** Overkill for single-user service, doesn't solve subprocess problem

## The honest conclusion

**The hot-reload work in PR #285 (issue #269) is actually the highest-value improvement.** It eliminates restarts for the most common config change scenario (trigger updates). The remaining restart cases (bot token, server port, engine binaries) are rare.

For those rare restarts, the current drain approach is solid. Tier 1 improvements (offset persistence + `Type=notify`) would reduce the gap to ~5s with minimal effort. Tier 2 improvements polish the UX but don't fundamentally change the architecture.

## Related

- #269 — hot-reload trigger configuration (PR #285)
- #286 — unfreeze TelegramBridgeConfig for more hot-reloadable settings
