
research: graceful restart improvements — reduce downtime and user impact #287

@nathanschram

Description

Context

Research into whether Untether's restart experience can be improved beyond the current graceful drain approach. Investigated zero-downtime patterns, Telegram-specific constraints, and subprocess preservation techniques.

TL;DR: True zero-downtime restart is impossible for our architecture (Telegram long-polling + long-lived subprocesses). But several practical improvements can reduce the gap from ~15-30s to ~5s and improve UX significantly.

Current approach (already good)

Untether already implements:

  • Graceful drain: on SIGTERM, waits up to 120s for active runs to finish
  • Per-chat "Restarting — waiting for your run to finish…" notifications
  • Timeout notification: "Restart timed out — N run(s) interrupted"
  • Progress persistence: orphan messages edited to "⚠️ interrupted by restart" on next startup
  • KillMode=mixed + TimeoutStopSec=150 for clean subprocess cleanup

Why true zero-downtime is impossible

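Telegram constraint: Only one process can call getUpdates per bot token. A second caller gets 409 Conflict. Neither Bot API framework surveyed (python-telegram-bot, aiogram) has solved this; both recommend either "accept the gap" or "use webhook mode." Telethon was also surveyed, but it speaks MTProto rather than the Bot API getUpdates endpoint, so the constraint does not apply to it in the same form.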

Subprocess constraint: Active Claude Code runs communicate via PTY FDs and stdout pipes. When the parent process exits, its ends of these FDs are closed, so a surviving child loses its reader. Reconnecting to a running subprocess's JSONL output stream after parent death requires one of:

  • SCM_RIGHTS fd transfer over Unix socket (complex, fragile)
  • os.execv() with inheritable FDs (destroys asyncio event loop, requires full state serialisation)
  • pidfd_getfd (Linux 5.6+, requires ptrace permissions)

None of these are production-ready patterns for our use case.
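For reference, the fd transfer itself is short; a minimal sketch using the stdlib `socket.send_fds` / `socket.recv_fds` helpers (Python 3.9+, Unix only) is shown here. The function names are hypothetical and nothing like this exists in Untether. It only moves the descriptor: it assumes the old process is still alive to hand it over, and it does nothing about partially read JSONL lines or PTY state, which is where the fragility comes from.

```python
import os
import socket


def send_pipe(sock: socket.socket, fd: int) -> None:
    """Hand one open descriptor to the peer via SCM_RIGHTS."""
    # At least one byte of ordinary data must accompany the ancillary message.
    socket.send_fds(sock, [b"fd"], [fd])


def recv_pipe(sock: socket.socket) -> int:
    """Receive one descriptor; the returned fd is a new number in this process."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]
```

The receiving side would still need to know which run the descriptor belongs to and how far the old process had already parsed the stream.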

Practical improvements (recommended)

Tier 1: Low effort, high value

1. Persist update_id offset
Write the last confirmed Telegram update_id to a state file. On restart, resume from offset=last_id+1. Telegram holds undelivered updates for up to 24 hours, so as long as the restart completes within that window no messages are lost; there is only a brief polling gap (~5-15s).

Currently, the new process starts with offset=0 and might re-process recent updates (deduplication prevents issues, but it's wasteful).
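A minimal sketch of the offset persistence, assuming a JSON state file and hypothetical helper names (`save_offset`, `load_next_offset`); the actual state location and format in Untether may differ. The write goes to a temp file and is renamed into place so a crash mid-write cannot leave a truncated file.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_offset(path: Path, update_id: int) -> None:
    """Atomically persist the last confirmed update_id."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump({"last_update_id": update_id}, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic rename on POSIX
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)
        raise


def load_next_offset(path: Path) -> Optional[int]:
    """Return the offset for the first getUpdates call, or None if unknown."""
    try:
        data = json.loads(path.read_text())
        return int(data["last_update_id"]) + 1
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        return None  # missing or corrupt state: fall back to current behaviour
```

On startup the poller would pass the returned value as `offset` when it is not `None`, and call `save_offset` after each batch has been handled.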

2. Type=notify systemd service
Use sd_notify(READY=1) after the event loop starts and first getUpdates succeeds. This tells systemd the new process is actually healthy, not just "PID exists." Also enables STOPPING=1 during drain for better status reporting. Library: async-sdnotify on PyPI (or stdlib socket with $NOTIFY_SOCKET).
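The stdlib route is small enough to sketch. This follows the documented sd_notify protocol (a datagram sent to the Unix socket named in $NOTIFY_SOCKET); the function name and the choice to return a bool are assumptions for illustration.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string such as "READY=1" to systemd.

    Returns False when NOTIFY_SOCKET is unset (not running under Type=notify).
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        target = b"\0" + addr[1:].encode()
    else:
        target = addr.encode()
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(target)
        sock.sendall(state.encode())
    return True
```

The service would call `sd_notify("READY=1")` after the first successful getUpdates and `sd_notify("STOPPING=1")` when the drain begins.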

3. Reduce RestartSec
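The unit currently uses the systemd default (100ms), so setting RestartSec=2 would lengthen the delay rather than shorten it. This item only helps if the unit file ends up with a larger RestartSec, in which case it should be lowered (for example to 2s). Note that RestartSec applies to automatic restarts configured with Restart=, not to a manual systemctl restart.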

Tier 2: Medium effort, targeted value

4. Socket activation for webhook/trigger server
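Define untether.socket listening on the trigger port. During restarts, systemd keeps the listening socket open and the kernel queues incoming connections in the listen backlog, so webhook HTTP requests are delayed rather than dropped (within the backlog size and the sender's timeout). Requires: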

  • untether.socket unit file
  • Detect LISTEN_FDS env var in trigger server startup
  • Use web.SockSite(runner, sock) instead of web.TCPSite
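The detection step above can be sketched as follows, assuming the standard systemd protocol (LISTEN_PID must match our PID, inherited descriptors start at fd 3). `inherited_sockets` and `start_site` are hypothetical names; only `web.SockSite` and `web.TCPSite` are real aiohttp classes.

```python
import os
import socket
from typing import List

SD_LISTEN_FDS_START = 3  # first inherited fd, per sd_listen_fds(3)


def inherited_sockets() -> List[socket.socket]:
    """Return sockets passed in by systemd, or [] when started normally."""
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return []
    try:
        count = int(os.environ.get("LISTEN_FDS", "0"))
    except ValueError:
        return []
    return [
        socket.socket(fileno=fd)  # family and type are detected from the fd
        for fd in range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count)
    ]


async def start_site(runner, host: str, port: int):
    """Serve on the inherited socket if present, else bind a fresh TCP port."""
    from aiohttp import web  # lazy import keeps the helper above stdlib-only

    socks = inherited_sockets()
    site = web.SockSite(runner, socks[0]) if socks else web.TCPSite(runner, host, port)
    await site.start()
    return site
```

Falling back to `TCPSite` keeps the service runnable outside systemd, for example during local development.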

5. Better drain UX

  • Broadcast restart notice to all active chats (not just those with running tasks) — "🔄 Untether is restarting, back in ~10s"
  • /restart --force option to skip drain wait
  • Per-run drain priority — let short-running tasks finish, cancel long ones first
  • Drain progress in Telegram — edit a message with countdown: "Waiting for 2 runs to finish (45s remaining)"
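The countdown idea could look roughly like this. It is a sketch under assumptions: `notify` stands in for whatever coroutine edits the Telegram message, and `drain` / `drain_status` are hypothetical names, not existing Untether functions.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


def drain_status(active: int, remaining: float) -> str:
    """Format the countdown text shown in the chat."""
    noun = "run" if active == 1 else "runs"
    return f"Waiting for {active} {noun} to finish ({int(remaining)}s remaining)"


async def drain(
    tasks: Iterable["asyncio.Task"],
    timeout: float,
    notify: Callable[[str], Awaitable[None]],
    interval: float = 5.0,
) -> int:
    """Wait up to `timeout` seconds for tasks; return how many are still running."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    pending = set(tasks)
    while pending:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        await notify(drain_status(len(pending), remaining))
        # Wake up every `interval` seconds to refresh the countdown.
        _, pending = await asyncio.wait(pending, timeout=min(interval, remaining))
    return len(pending)
```

A non-zero return value corresponds to the existing "Restart timed out" notification; a `--force` option would simply call this with `timeout=0`.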

6. Pre-restart state snapshot enhancement
Before exiting, write richer state: active session IDs, resume tokens, engine types, run durations. The new process reads this on startup and can:

  • Show "Your previous run was interrupted. Resume with /continue" with session details
  • Provide better resume guidance (which topic, which engine)
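One possible shape for the richer snapshot, as a sketch only: the field names, the JSON layout and the helper names are assumptions chosen to match the list above, not the existing state format.

```python
import json
import time
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class RunSnapshot:
    chat_id: int
    session_id: str
    resume_token: str
    engine: str
    duration_s: float  # how long the run had been going at shutdown


def write_snapshot(path: Path, runs: List[RunSnapshot]) -> None:
    """Serialise the active runs just before the process exits."""
    payload = {"written_at": time.time(), "runs": [asdict(r) for r in runs]}
    path.write_text(json.dumps(payload))


def read_snapshot(path: Path) -> List[RunSnapshot]:
    """Load the previous snapshot; a missing or corrupt file yields no runs."""
    try:
        payload = json.loads(path.read_text())
        return [RunSnapshot(**r) for r in payload["runs"]]
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        return []


def resume_hint(run: RunSnapshot) -> str:
    """Build the per-chat message shown after startup."""
    minutes = int(run.duration_s // 60)
    return (
        f"Your previous {run.engine} run (session {run.session_id}) "
        f"was interrupted after {minutes} min. Resume with /continue"
    )
```

On startup the new process would iterate over `read_snapshot(...)` and send `resume_hint(run)` to each `chat_id`.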

Tier 3: Future consideration (high effort)

7. Webhook mode for planned restarts
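On /restart, temporarily call setWebhook pointing to the trigger HTTP server, restart, then call deleteWebhook to resume polling. This could eliminate the polling gap for planned restarts, provided the webhook endpoint stays reachable while the process is down (for example via the socket activation in item 4). Telegram also requires the webhook URL to be HTTPS on port 443, 80, 88 or 8443. The trigger aiohttp server is already running, so this is complex but technically feasible.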

8. Subprocess output redirection
Before exiting, redirect each child's stdout to a temp file (via /proc/<pid>/fd/), persist the path, and have the new process tail -f to resume JSONL translation. Fragile but would preserve runs across restarts.

What NOT to pursue

  • Blue-green dual-process: Impossible — Telegram enforces single poller per token
  • os.execv() in-place restart: Destroys asyncio event loop, requires full state serialisation
  • reptyr terminal reattachment: Works for terminals, not for stdout pipes
  • Redis/queue intermediary: Overkill for single-user service, doesn't solve subprocess problem

The honest conclusion

The hot-reload work in PR #285 (issue #269) is actually the highest-value improvement. It eliminates restarts for the most common config change scenario (trigger updates). The remaining restart cases (bot token, server port, engine binaries) are rare.

For those rare restarts, the current drain approach is solid. Tier 1 improvements (offset persistence + Type=notify) would reduce the gap to ~5s with minimal effort. Tier 2 improvements polish the UX but don't fundamentally change the architecture.
