Skip to content

Add agent health monitoring and graceful shutdown support#15

Open
97nitt wants to merge 15 commits intomainfrom
cfritz/refactor-agent-health-tracking
Open

Add agent health monitoring and graceful shutdown support#15
97nitt wants to merge 15 commits intomainfrom
cfritz/refactor-agent-health-tracking

Conversation

@97nitt
Copy link
Copy Markdown
Contributor

@97nitt 97nitt commented Apr 9, 2026

Summary

Phase 2 agent-side changes for the egress agent pull model improvements — goodbye event handling, graceful shutdown, heartbeat during backpressure, and instance ID tracking.

Goodbye Event Handling:

  • EventsClient handles new goodbye SSE event type from orchestrator
  • BaseEgressAgentService._handle_goodbye triggers graceful shutdown on goodbye

Graceful Shutdown:

  • _trigger_graceful_shutdown: notifies orchestrator, stops threads, signals process to exit
  • register_signal_handlers: registers SIGTERM/SIGINT handlers (called from gunicorn post_worker_init hook or local dev main)
  • _shutting_down flag prevents double shutdown when both goodbye and signal paths fire
  • Under gunicorn, signals master process so all workers shut down
  • SSE receiver and ops_runner/results_publisher threads run as daemon threads so they don't block process exit

Backpressure Heartbeat:

  • OperationsPoller sends heartbeat to orchestrator during backpressure to maintain activity tracking
  • BackendClient.send_heartbeat: POST /api/v1/agent/heartbeat

Shutdown Notification:

  • BackendClient.notify_shutdown: POST /api/v1/agent/shutdown (15s timeout)

Instance ID (Phase 3 prep):

  • BackendClient generates a UUID instance ID on construction
  • All requests include x-mcd-agent-instance-id header (orchestrator ignores it for now)
  • SSE connection includes the header via extra_headers on SSEClientReceiver
  • Instance ID logged at startup

Other:

  • Fixed misleading "Woken by work_available notification" log during shutdown (was actually woken by stop())

Test plan

  • Unit tests pass
  • Integration tested on dev (hermes-agent, sna-artemis-agent)
  • Verify goodbye flow: orchestrator sends goodbye → agent logs shutdown → process exits
  • Verify SIGTERM flow: kill -TERM <pid> → agent notifies orchestrator → process exits
  • Verify heartbeat during backpressure: all worker threads busy → agent still sends heartbeats

Dependencies

@97nitt 97nitt requested a review from a team as a code owner April 9, 2026 11:53
Copy link
Copy Markdown
Contributor

@mrostan mrostan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, left a nit

@97nitt
Copy link
Copy Markdown
Contributor Author

97nitt commented Apr 9, 2026

Code Review Summary

0 blockers, 2 issues, 3 suggestions, 1 nit — reviewed by correctness, security, devdocs, testing, architecture agents.

Fixed (2bed82d)

  • F1 [ISSUE] _trigger_graceful_shutdown double-execution guard was not thread-safe — replaced _shutting_down boolean with threading.Lock
  • F2 [ISSUE] Missing tests for BackendClient.send_heartbeat and notify_shutdown — added test_backend_client.py with 7 tests
  • F3 [SUGGESTION] notify_shutdown docstring said "Best-effort" but raises on failure — fixed docstring
  • F5 [SUGGESTION] Missing test for heartbeat during backpressure — added test_backpressure_sends_heartbeat

Skipped

  • F4 [SUGGESTION] Backpressure heartbeat frequency — activity timeout (120s) is 2x poll interval (60s), sufficient margin
  • F6 [NIT] _headers(**extra: str) type annotation — correct Python, left as-is

@97nitt 97nitt force-pushed the cfritz/refactor-agent-health-tracking branch from 2bed82d to d22e5cb Compare April 9, 2026 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants