
fix: prevent interactive workers from getting stuck in idle after outcome#551

Open
marcmantei wants to merge 31 commits into spacedriveapp:main from marcmantei:fix/idle-worker-outcome-bug

Conversation

@marcmantei

Problem

Interactive workers that call set_status(kind="outcome") get permanently stuck in idle status, causing a crash-loop:

  1. Worker completes task and calls set_status(kind="outcome")
  2. Worker unconditionally enters the follow-up loop and sets itself to idle
  3. input_rx.recv().await blocks forever — circular dependency:
    • WorkerComplete only fires after worker.run() returns
    • worker.run() only returns when the follow-up loop exits
    • The loop only exits when the sender is dropped
    • The sender is only dropped on WorkerComplete
  4. Health watchdog detects no activity after 600s → kills process
  5. On restart, idle worker is resumed but produces no activity → repeat

This results in a permanent crash-loop with restarts every ~10 minutes. The restart counter reached 1194+ on our instance.

Fix

src/agent/worker.rs: Before entering the follow-up loop, check outcome_signaled(). If the worker already signaled a terminal outcome, drop input_rx immediately so WorkerComplete can fire normally.
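The shape of that check can be sketched with a std-mpsc stand-in for the async channel (a minimal illustration of the described fix, not the actual worker code; only `outcome_signaled` and `input_rx` are names from the description above):

```rust
use std::sync::mpsc;

// Sketch of the fix: if a terminal outcome was already signaled, drop the
// input receiver instead of entering the follow-up loop, so run() can
// return and WorkerComplete can fire.
fn run_follow_up_loop(outcome_signaled: bool, input_rx: mpsc::Receiver<String>) -> &'static str {
    if outcome_signaled {
        drop(input_rx); // closes the channel; nothing ever blocks on recv()
        return "complete";
    }
    // Normal follow-up path: loop until the sender side is dropped.
    for _msg in input_rx.iter() {
        // handle follow-up input here
    }
    "follow-up-finished"
}

fn main() {
    // Outcome already signaled: must return immediately, not block.
    let (_tx, rx) = mpsc::channel::<String>();
    assert_eq!(run_follow_up_loop(true, rx), "complete");

    // No outcome: the loop runs until the sender is dropped.
    let (tx, rx) = mpsc::channel::<String>();
    drop(tx);
    assert_eq!(run_follow_up_loop(false, rx), "follow-up-finished");
}
```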

src/conversation/history.rs: Harden log_worker_idle() with a conditional UPDATE that skips terminal states (done/failed/completed/cancelled). This prevents a race condition where two fire-and-forget tokio::spawn DB updates execute out of order.
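The guard amounts to a single conditional UPDATE; this sketch uses illustrative table and column names, not the project's actual schema:

```rust
// Sketch of the guarded update in log_worker_idle(): the WHERE clause makes
// the idle write a no-op once a terminal status has landed, so out-of-order
// fire-and-forget updates cannot downgrade a finished worker.
// Table and column names here are illustrative.
const GUARDED_IDLE_UPDATE: &str = "UPDATE workers \
     SET status = 'idle' \
     WHERE id = ?1 \
       AND status NOT IN ('done', 'failed', 'completed', 'cancelled')";

// The same guard as a plain predicate, usable on the Rust side.
fn is_terminal(status: &str) -> bool {
    matches!(status, "done" | "failed" | "completed" | "cancelled")
}

fn main() {
    assert!(is_terminal("cancelled"));
    assert!(!is_terminal("idle"));
    assert!(GUARDED_IDLE_UPDATE.contains("NOT IN"));
}
```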

Test plan

  • cargo check passes
  • Deployed to production instance — no more crash-loop, workers complete normally
  • Verify workers that need follow-up (non-outcome) still enter idle correctly

🤖 Generated with Claude Code
via Happy

marcmantei and others added 30 commits March 12, 2026 17:10
Add a SQLite-backed registry that auto-discovers GitHub repositories
via `gh repo list`, syncs periodically in the background, and
optionally auto-clones new repos. Per-repo config overrides (model,
enabled) are preserved across syncs.

New modules:
- registry/store.rs — RegistryStore CRUD (upsert, list, overrides, link)
- registry/sync.rs — discovery via gh CLI, reconciliation, auto-clone
- api/registry.rs — 5 REST endpoints for repos, overrides, sync, status

Config: [defaults.registry] section with github_owners, clone_base_dir,
sync_interval_secs, auto_clone, exclude_patterns.
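A plausible Rust shape for that section, with field names taken from the commit message; the defaults here are guesses, and the real type would derive serde's Deserialize:

```rust
// Hypothetical mirror of the [defaults.registry] TOML section.
// Field names follow the commit message; defaults are illustrative only.
#[derive(Debug, Clone)]
pub struct RegistryConfig {
    pub github_owners: Vec<String>,
    pub clone_base_dir: String,
    pub sync_interval_secs: u64,
    pub auto_clone: bool,
    pub exclude_patterns: Vec<String>,
}

impl Default for RegistryConfig {
    fn default() -> Self {
        Self {
            github_owners: Vec::new(),
            clone_base_dir: "~/repos".to_string(),
            sync_interval_secs: 3600,
            auto_clone: false,
            exclude_patterns: Vec::new(),
        }
    }
}

fn main() {
    let cfg = RegistryConfig::default();
    assert!(!cfg.auto_clone);
    assert_eq!(cfg.sync_interval_secs, 3600);
}
```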

Wired into RuntimeConfig (hot-reloadable), AgentDeps, ApiState, and
main agent init loop with background sync task.

9 new tests, all 633 tests pass.

Refs: #1

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Send a summary message via MessagingManager when registry sync discovers
new repos or archives removed ones. Configurable via `notification_target`
in [defaults.registry] (e.g. "telegram:1285309093").

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
The sync loop was spawned before messaging_manager was set on AgentDeps,
so notifications were never sent. Now spawns after init, reading the
sync status from ApiState.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
1. Fix cortex profile 404: default routing model was
   "anthropic/claude-sonnet-4" (doesn't exist), now uses
   "anthropic/claude-sonnet-4-20250514".

2. Fix worker outcome reporting: when a webhook-triggered worker
   completes and Lira relays the result via send_message_to_another_channel
   (e.g. to Telegram), the retrigger system didn't recognize it as a
   successful relay because only the reply tool set the replied_flag.
   Now SendMessageTool also sets the flag on successful delivery.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
The agent now uses the registry as its dynamic project list:

1. System prompt injection: enabled registry repos are appended to the
   project context so the agent knows about all discovered repos
   (name, path, language, model override, default branch).

2. Webhook event filtering: events from repos not in the registry or
   with enabled=false are dropped before reaching the agent. This
   replaces the need for a hardcoded project table in ROLE.md.

Combined with the GitHub App (all repos) and registry sync (auto-discover),
this completes the zero-config flow: new repo → auto-discovered →
webhooks delivered → agent reacts.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
…-app

feat: Dynamic Project Registry + GitHub App Migration
Browser tools fail to launch Chromium because libxkbcommon.so.0 is not
installed in the container.

Closes spacedriveapp#425
Every developer worker now receives the full CUPID framework
(cupid.dev) as non-negotiable coding principles: Composable, Unix
philosophy, Predictable, Idiomatic, Domain-based — with detailed
actionable guidance for each.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Automates the source-based upgrade flow: git fetch → check for new
commits → git pull → cargo build --release → stop daemon → atomic
binary swap → systemctl restart. Supports --check-only for dry runs
and --no-restart for manual service management.

Also updates Deployment::Native detection in update.rs to report
can_apply=true when a source directory is found.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
When all worker slots were occupied (especially by idle resumed workers),
new spawn requests were rejected with WorkerLimitReached. The LLM then
hallucinated a queue mechanism that didn't exist, telling users tasks
would "auto-spawn when a slot opens" — but they were silently dropped.

This commit fixes three issues:

1. Worker queue: spawn requests that hit the limit are now queued in a
   per-channel VecDeque and auto-spawned when a WorkerComplete event
   frees a slot.

2. Idle worker eviction: when the limit is reached, the oldest idle
   worker (e.g. resumed after restart but doing nothing) is cancelled
   to make room for new active work.

3. Structured response: the LLM now receives `queued: true` with a
   clear message instead of a tool error, eliminating hallucinated
   queue behavior.
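The three behaviors above can be condensed into one sketch (illustrative names and a synchronous stand-in for the async channel state, not the actual spawn_or_queue_worker implementation):

```rust
use std::collections::VecDeque;

// Minimal model of the capacity logic: spawn when under the limit, evict
// the oldest idle worker when full, otherwise queue and report `queued`
// instead of a tool error.
struct ChannelState {
    active: Vec<(&'static str, bool)>, // (worker_id, is_idle)
    worker_queue: VecDeque<&'static str>,
    limit: usize,
}

#[derive(Debug, PartialEq)]
enum SpawnResult {
    Spawned,
    Queued, // surfaced to the LLM as `queued: true`
}

fn spawn_or_queue(state: &mut ChannelState, task: &'static str) -> SpawnResult {
    if state.active.len() < state.limit {
        state.active.push((task, false));
        return SpawnResult::Spawned;
    }
    // Evict the oldest idle worker to make room for active work.
    if let Some(pos) = state.active.iter().position(|(_, idle)| *idle) {
        state.active.remove(pos);
        state.active.push((task, false));
        return SpawnResult::Spawned;
    }
    state.worker_queue.push_back(task);
    SpawnResult::Queued
}

// Called from the WorkerComplete handler to fill freed slots.
fn drain_worker_queue(state: &mut ChannelState) {
    while state.active.len() < state.limit {
        match state.worker_queue.pop_front() {
            Some(task) => state.active.push((task, false)),
            None => break,
        }
    }
}

fn main() {
    let mut state = ChannelState {
        active: vec![("w1", false), ("w2", true)],
        worker_queue: VecDeque::new(),
        limit: 2,
    };
    // Limit reached but an idle worker exists: evict it and spawn.
    assert_eq!(spawn_or_queue(&mut state, "w3"), SpawnResult::Spawned);
    // Limit reached with no idle workers: queue instead of erroring.
    assert_eq!(spawn_or_queue(&mut state, "w4"), SpawnResult::Queued);
    // A completed worker frees a slot; the queue drains.
    state.active.remove(0);
    drain_worker_queue(&mut state);
    assert!(state.worker_queue.is_empty());
    assert_eq!(state.active.len(), 2);
}
```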

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Adds a background loop that periodically polls open PRs across all
registered repos using `gh pr list`. When a PR with mergeable status
CONFLICTING is detected, a message is injected into the webhook
channel so the agent can spawn a worker to resolve the conflicts.

Configurable via `pr_conflict_check_interval_secs` in [registry]
(default: 600s = 10 minutes, set to 0 to disable). Tracks notified
PRs to avoid duplicate messages; clears tracking when conflicts are
resolved.
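The deduplication core of that loop can be sketched as a pure function over the (number, mergeable) pairs that `gh pr list` would report; the function name and signature are illustrative, not the actual pr_conflicts.rs API:

```rust
use std::collections::HashSet;

// Given (pr_number, mergeable) pairs from a poll, return only the newly
// CONFLICTING PRs. The notified set suppresses duplicate messages and is
// cleared for a PR once its conflict is resolved, so a later conflict on
// the same PR notifies again.
fn newly_conflicting(prs: &[(u64, &str)], notified: &mut HashSet<u64>) -> Vec<u64> {
    let mut fresh = Vec::new();
    for &(number, mergeable) in prs {
        if mergeable == "CONFLICTING" {
            if notified.insert(number) {
                fresh.push(number);
            }
        } else {
            // Conflict resolved: clear tracking.
            notified.remove(&number);
        }
    }
    fresh
}

fn main() {
    let mut notified = HashSet::new();
    // First poll: PR 7 conflicts, PR 9 is clean.
    assert_eq!(newly_conflicting(&[(7, "CONFLICTING"), (9, "MERGEABLE")], &mut notified), vec![7]);
    // Still conflicting on the next poll: no duplicate message.
    assert_eq!(newly_conflicting(&[(7, "CONFLICTING")], &mut notified), Vec::<u64>::new());
    // Resolved, then conflicting again: notify again.
    newly_conflicting(&[(7, "MERGEABLE")], &mut notified);
    assert_eq!(newly_conflicting(&[(7, "CONFLICTING")], &mut notified), vec![7]);
}
```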

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Startup catch-up: after boot, scans all enabled repos for open issues
created in the last 24h that lack the `prd-generated` label. Injects
them as synthetic webhook messages so the agent processes missed issues
from downtime or frozen states.

Health watchdog: background task that monitors message processing
activity. If no messages are processed for 10 minutes, exits with
code 1 so systemd auto-restarts the service. Prevents silent freezes
from going undetected.
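The watchdog condition reduces to a single time comparison; a minimal sketch under the 10-minute window described above (the function name is illustrative):

```rust
use std::time::{Duration, Instant};

// No processed message within the timeout means the process should exit
// non-zero and let systemd restart it.
const WATCHDOG_TIMEOUT: Duration = Duration::from_secs(600);

fn should_restart(last_activity: Instant, now: Instant) -> bool {
    now.duration_since(last_activity) >= WATCHDOG_TIMEOUT
}

fn main() {
    let last = Instant::now();
    assert!(!should_restart(last, last + Duration::from_secs(599)));
    assert!(should_restart(last, last + Duration::from_secs(600)));
}
```

In the real loop this check would run periodically and call `std::process::exit(1)` when it fires.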

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
- New "Ember" dark theme with dark red/orange accent colors
- 3-tier Pricing page (Free OSS / Teams / Enterprise) inspired by Cline
- Kanban board: task dependency support via metadata (depends_on)
- Kanban board: blocked task indicators and dependency chips
- Kanban board: dependency field in Create Task dialog
- Sidebar: Pricing page navigation link

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
- New `spacebot local` subcommand: starts in foreground, auto-opens browser
- Optional `--port` flag for custom port
- Welcome page (/welcome) with onboarding checklist for first-time users
- Guides through: provider setup, agent creation, platform connection

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Python-based MCP server that exposes Spacebot's HTTP API as MCP tools.
Zero dependencies — works with Python 3.8+.

Tools: list_agents, list/create/update/execute tasks, list_workers,
send_message, list_channels, system status.

Enables Claude Code, Codex, and VS Code to orchestrate Spacebot agents
via the standard MCP protocol over stdio.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
- Task cards show assigned_agent badge (e.g. "claude-code", "codex")
- Task cards show worktree indicator when metadata.worktree is set
- Worker ID shown with first 8 chars instead of generic "Worker" label
- Create dialog: new "Assign Agent" field alongside dependencies
- Detail dialog: shows agent assignment and worktree metadata

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
- New TaskDependencyGraph component using graphology + Sigma.js
- Board/Graph toggle in Kanban toolbar
- Force-directed layout for task dependency visualization
- Color-coded by status, sized by priority
- Blocked tasks shown in red
- Click node to open task detail
- Legend with status colors

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Backend:
- New GET /agents/tasks/{number}/diff endpoint
- Reads worktree/branch from task metadata, runs git diff
- Returns unified diff with branch/worktree info

Frontend:
- New DiffViewer component with syntax-highlighted unified diffs
- Green/red line coloring for additions/removals
- Sticky file headers, hunk separators
- Auto-fetches diff when task has worktree/branch metadata
- Shown in task detail dialog before metadata section

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Backend:
- POST /agents/tasks/{number}/worktree — creates branch task/{N} + worktree
- DELETE /agents/tasks/{number}/worktree — removes worktree, keeps branch ref
- Worktree stored in .spacebot-worktrees/ adjacent to repo
- Task metadata updated with worktree path and branch name

Frontend:
- "Create Worktree" button in task detail dialog (shown for non-done tasks)
- Worktree badge on task cards

MCP Bridge:
- New spacebot_create_worktree tool
- New spacebot_task_diff tool
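The naming scheme above can be sketched as two small helpers (hypothetical names; the real handlers would additionally shell out to `git worktree add <path> -b task/{N}`):

```rust
use std::path::{Path, PathBuf};

// Branch task/{N}, worktree in a .spacebot-worktrees/ directory adjacent
// to the repository checkout, per the commit message above.
fn task_branch(number: u64) -> String {
    format!("task/{number}")
}

fn worktree_path(repo: &Path, number: u64) -> PathBuf {
    repo.parent()
        .unwrap_or(repo)
        .join(".spacebot-worktrees")
        .join(format!("task-{number}"))
}

fn main() {
    assert_eq!(task_branch(42), "task/42");
    let path = worktree_path(Path::new("/srv/repos/spacebot"), 42);
    assert_eq!(path, Path::new("/srv/repos/.spacebot-worktrees/task-42"));
}
```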

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
- Run cargo fmt on channel.rs, channel_dispatch.rs, tasks.rs, main.rs,
  pr_conflicts.rs, spawn_worker.rs, upgrade.rs
- Fix missing closing > on JSX div element in AgentTasks.tsx (line 252)
- Fix TS2322: cast metadata?.worktree and metadata?.assigned_agent with
  !! to satisfy ReactNode constraint (Record<string, unknown> values)
task.metadata is serde_json::Value not Option<Value>.
Remove incorrect .as_ref()/.unwrap_or_else() calls and call
.get() directly on the Value in task_diff, create_worktree,
and delete_worktree handlers.
Welcome Page:
- Detects Claude Max OAuth (anthropic provider check)
- Shows provider type ("Anthropic (Claude Max / OAuth)")
- Agent status with direct navigation
- Kanban Board quick-link when agent exists

Kanban Board:
- Project filter dropdown (all repos from registry)
- GitHub Issues toggle with count badge
- GitHub Issues column showing issues from all registry repos
- Issues show repo name, labels, assignees, timestamp
- Click issue to open on GitHub

Backend:
- New GET /api/registry/issues endpoint
- Fetches issues from all registry repos via gh CLI
- Supports repo filter and state filter (open/closed)
- Sorted by updated_at descending

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
- Replace non-null assertions with safe variable access
- Use href links instead of TanStack navigate for dynamic agent routes
- Prevents blank page when agents query is still loading

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
Backend:
- Fix Anthropic OAuth detection in providers API (anthropic_oauth.json)
- /api/providers now returns anthropic: true for Claude Max users

Frontend:
- New useSpacebotConfig hook (merges agentConfig + messaging + providers)
- Kanban is now the default agent landing page (Board tab)
- Overview moved to /agents/$agentId/overview tab
- Config status bar: worker model, concurrency, worktrees, platforms, auth
- GitHub Issue import button (creates Kanban task from issue)
- Already-imported issues show "Imported" badge
- Project filter now filters both tasks AND issues

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
…come

Workers that call set_status(kind="outcome") now skip the idle follow-up
loop entirely. Previously, they would unconditionally enter the loop and
block forever on input_rx.recv(), creating a circular dependency:
- WorkerComplete only fires after worker.run() returns
- worker.run() only returns when the follow-up loop exits
- The follow-up loop only exits when the sender is dropped
- The sender is only dropped on WorkerComplete

This caused a permanent crash-loop: the health watchdog would kill the
process every ~10 minutes, and on restart the idle worker would be
resumed but produce no activity.

Also hardens log_worker_idle() to not overwrite terminal states
(done/failed/completed/cancelled), preventing a race condition between
fire-and-forget tokio::spawn database updates.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <[email protected]>
Co-Authored-By: Happy <[email protected]>
@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

Walkthrough

This pull request introduces a comprehensive GitHub registry integration system with automated repository discovery, syncing, and dependency management. It adds UI improvements including new routes (Welcome, Pricing), theme customization, task dependency visualization, and worker capacity management with queueing. Additional features include CLI upgrade support, health watchdog monitoring, and an MCP bridge for external integrations.

Changes

Cohort / File(s) Summary
Dockerfile Runtime Dependencies
Dockerfile
Added libxkbcommon0 to runtime image package installation list for Chrome/Chromium shared-library dependencies.
UI Components — New & Updated
interface/src/components/DiffViewer.tsx, interface/src/components/TaskDependencyGraph.tsx, interface/src/components/AgentTabs.tsx, interface/src/components/Sidebar.tsx
Added DiffViewer (diff syntax highlighting and rendering), TaskDependencyGraph (Graphology/Sigma-based task dependency visualization with node size/color by priority/status), updated AgentTabs route mappings, and added pricing navigation link to Sidebar with new icon.
UI Routes & Pages
interface/src/routes/Welcome.tsx, interface/src/routes/Pricing.tsx, interface/src/routes/AgentTasks.tsx, interface/src/router.tsx
Added Welcome (onboarding/setup status page) and Pricing pages, new /agents/$agentId/overview route, refactored AgentTasks to support dependency handling, GitHub issues integration, Kanban/graph view switching, Spacebot config display, task metadata (worktree, assigned_agent), diff viewing, and task creation/detail form extensions.
Theme System
interface/src/hooks/useTheme.ts, interface/src/routes/Settings.tsx, interface/src/ui/style/colors.scss
Added "ember" theme variant (ThemeId, THEMES entry, CSS custom properties), and strongly-typed ThemePreview colors map.
Interface Hooks
interface/src/hooks/useSpacebotConfig.ts
Added useSpacebotConfig hook to fetch and consolidate agent configuration, messaging status, and provider info with derived enabled platforms and availability flags.
Registry Backend System
src/registry.rs, src/registry/store.rs, src/registry/sync.rs, src/registry/pr_conflicts.rs, migrations/20260312000001_registry.sql
Introduced GitHub repository auto-discovery and registry layer with SQLite persistence (RegistryStore), sync operations (sync_registry), PR conflict detection loop, and registry_repos table with agent/repo metadata and indexes.
API Registry Endpoints
src/api/registry.rs
Added REST handlers for listing/fetching registry repos, updating per-repo overrides, triggering manual sync, querying sync status, and fetching GitHub issues with optional repository filtering.
API Integration Points
src/api/tasks.rs, src/api/server.rs, src/api/agents.rs, src/api/providers.rs, src/api/state.rs
Added task diff, worktree creation/deletion endpoints; wired registry module and routes; injected registry_store into AgentDeps; added registry stores/sync status tracking to ApiState; updated provider detection to include Anthropic OAuth configuration.
Config System Extensions
src/config/types.rs, src/config/load.rs, src/config/toml_schema.rs, src/config/runtime.rs
Added RegistryConfig with GitHub owners, clone settings, sync intervals, and auto-clone options; wired registry config through TOML parsing, defaults resolution, and hot-reloadable runtime config.
Worker Queueing & Dispatch
src/agent/channel.rs, src/agent/channel_dispatch.rs, src/agent/status.rs, src/agent/worker.rs, src/tools/spawn_worker.rs
Introduced worker capacity queue (worker_queue in ChannelState) with eviction fallback; added spawn_or_queue_worker and drain_worker_queue functions; extended status rendering with "Queued Workers" section; updated SpawnWorkerOutput with queued flag; fixed idle-loop logic for interactive workers with signaled outcomes.
Message Routing & Tools
src/tools.rs, src/tools/send_message_to_another_channel.rs
Configured SendMessageTool with replied flag tracking; added with_replied_flag method and mark-replied logic on successful send operations.
Upgrade & Watchdog Systems
src/upgrade.rs, src/watchdog.rs, src/update.rs, src/main.rs
Added upgrade CLI subcommand with Git operations (fetch, pull, compare), Cargo build, binary installation, and systemd service restart; added Local CLI variant for foreground execution with port override and browser auto-open; implemented startup catch-up and health watchdog loops; integrated registry sync and PR conflict detection into agent initialization.
Core Module Integration
src/lib.rs
Added module exports for registry, upgrade, watchdog; injected registry_store into public AgentDeps struct.
Utilities & Infrastructure
src/api.rs, src/llm/routing.rs, src/conversation/history.rs
Declared internal registry module; updated Anthropic default model to claude-sonnet-4-20250514; added status guard to prevent idle state overwriting terminal worker status.
MCP Bridge
mcp-bridge/spacebot_mcp_server.py, mcp-bridge/README.md
Added Python MCP server (stdin/stdout JSON-RPC) exposing Spacebot tools (agents, tasks, workers, channels, status) with HTTP forwarding to local Spacebot API; documented setup, environment variables, and tool descriptions.
Developer Prompt
prompts/en/worker.md.j2
Added "Coding Principles — CUPID" section outlining development guidelines (composable, Unix philosophy, predictable, idiomatic, domain-based).
Tests
tests/bulletin.rs, tests/context_dump.rs
Initialized registry_store dependency in test bootstrap and context-dump utilities; added worker_queue initialization to channel state tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)
  • Title check: Passed. The title directly and concisely summarizes the main fix: preventing interactive workers from getting stuck in idle after an outcome is signaled.
  • Description check: Passed. The description clearly explains the problem (interactive workers stuck in idle causing a crash-loop), the root cause (circular dependency preventing WorkerComplete from firing), and the solution (check outcome_signaled and drop input_rx, plus harden DB updates).
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.



Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.42.1)
src/api/server.rs



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/api/agents.rs (1)

766-779: ⚠️ Potential issue | 🔴 Critical

Add registry store to ApiState when creating agents.

The code creates registry_store at lines 766–779 but never inserts it into state.registry_stores, unlike project_stores at lines 962–966. Since all /api/registry/* handlers (in src/api/registry.rs) look up stores via state.registry_stores.load(), newly created agents will return StatusCode::NOT_FOUND for every registry request until the server restarts.

After creating the registry store and its status tracker, insert them into ApiState following the same pattern as project_stores:

let mut registry_stores_map = (**state.registry_stores.load()).clone();
registry_stores_map.insert(agent_id.clone(), registry_store);
state.registry_stores.store(std::sync::Arc::new(registry_stores_map));

let mut registry_status_map = (**state.registry_sync_status.load()).clone();
registry_status_map.insert(
    agent_id.clone(),
    Arc::new(ArcSwap::from_pointee(SyncStatus::Idle)),
);
state.registry_sync_status.store(std::sync::Arc::new(registry_status_map));

Also remove registry stores from ApiState in the delete_agent() cleanup (same section where project stores are removed around line 1273).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/agents.rs` around lines 766 - 779, When creating agents, persist the
newly constructed registry_store and its status into ApiState so registry
handlers can find them: after creating registry_store (symbol: registry_store)
and its SyncStatus, clone and insert into state.registry_stores (use
state.registry_stores.load()/store pattern) and into state.registry_sync_status
(insert an Arc::new(ArcSwap::from_pointee(SyncStatus::Idle)) keyed by agent_id),
mirroring how project_stores is handled; likewise, in delete_agent() remove the
agent's entries from state.registry_stores and state.registry_sync_status during
cleanup to avoid stale entries.
src/agent/channel.rs (1)

2843-2892: ⚠️ Potential issue | 🟠 Major

Cancelled workers still bypass the new queue drain.

This drain only runs after the existing worker_handles.remove(...) guard passes. cancel_worker_with_reason() removes that handle before it emits its synthetic WorkerComplete, so cancelled workers free capacity without ever reaching this block and queued workers stay stuck until some unrelated worker finishes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel.rs` around lines 2843 - 2892, The WorkerComplete handler
currently returns early if removing worker_id from self.state.worker_handles
returns None, which causes cancelled workers (which remove the handle earlier in
cancel_worker_with_reason()) to skip running
crate::agent::channel_dispatch::drain_worker_queue and leaves queued workers
stuck; modify the ProcessEvent::WorkerComplete arm so that even when
worker_handles.write().await.remove(worker_id) is None you still call
drain_worker_queue(&self.state).await before returning (or refactor to run the
drain unconditionally after the guard), keeping the existing
logging/queued-spawn behavior; reference symbols: ProcessEvent::WorkerComplete,
cancel_worker_with_reason, self.state.worker_handles, and
crate::agent::channel_dispatch::drain_worker_queue.
src/main.rs (1)

2396-2403: ⚠️ Potential issue | 🟠 Major

Track these registry task handles so agent removal can stop them.

registry_sync_loop and pr_conflict_check_loop are detached and forgotten, while Line 2398 only removes the agent from the map and disconnects MCP. A dynamically removed agent can keep syncing the registry and checking PR conflicts in the background after it is supposedly gone.

Also applies to: 3436-3468

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.rs` around lines 2396 - 2403, When removing an agent you need to
stop its background tasks (registry_sync_loop and pr_conflict_check_loop)
instead of just removing the agent and disconnecting MCP; change the spawn sites
for registry_sync_loop and pr_conflict_check_loop to return and store their
tokio::task::JoinHandle in the agent struct (e.g., add fields like
registry_task_handle and pr_conflict_task_handle to the Agent or AgentDeps), and
in the agent removal branch where agent_remove_rx.recv() leads to
agents.remove(&key) call ensure you abort or join those handles (call
JoinHandle::abort() or await the join) before finishing removal, then disconnect
MCP and log; update the creation paths that currently detach/forget these loops
to assign the handles into the agent so removal can stop them.
🟠 Major comments (21)
src/update.rs-107-115 (1)

107-115: ⚠️ Potential issue | 🟠 Major

Add routing in the apply endpoint for native deployments or defer can_apply for native until the upgrade path is implemented.

Lines 107–115 and 240–254 enable can_apply = true for Deployment::Native when the source directory is resolvable. However, the /update/apply endpoint (line 461 in src/api/settings.rs) unconditionally calls apply_docker_update(), which immediately bails with "not running in Docker" for native deployments (line 334). This creates a broken UX: the UI enables the Apply button for native installs, but the endpoint fails.

The upgrade module (public in src/lib.rs) has all the functions needed for native upgrades (resolve_source_dir, check_for_updates, pull, build_release, install_binary, restart_service), but the API does not route native deployments to them.

Either route native deployments to the upgrade module in the apply endpoint, or set can_apply = false for native deployments until the native upgrade path is wired into the API.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/update.rs` around lines 107 - 115, The API currently enables
status.can_apply for Deployment::Native based on resolve_source_dir but the
/update/apply handler unconditionally calls apply_docker_update(), causing
failures for native installs; either (A) modify the apply endpoint in
src/api/settings.rs to route native deployments to the upgrade module flow (use
resolve_source_dir, check_for_updates, pull, build_release, install_binary,
restart_service in sequence) when deployment == Deployment::Native, or (B)
revert the status logic in update.rs (where Deployment::Native sets
status.can_apply and status.cannot_apply_reason) to set can_apply = false and an
explanatory cannot_apply_reason until the native upgrade path is implemented;
pick one approach and update the corresponding functions (Deployment::Native
handling in update.rs or apply handler in settings.rs) so UI state and API
behavior match.
interface/src/routes/Welcome.tsx-111-117 (1)

111-117: ⚠️ Potential issue | 🟠 Major

Use route-aware navigation for the agent links.

These absolute hrefs bypass the app's configured base path, so they break when the UI is hosted under a subpath. The View link also now resolves to /agents/$agentId, which was repointed to the task board in router.tsx, so the destination no longer matches the label here.

Also applies to: 136-141

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@interface/src/routes/Welcome.tsx` around lines 111 - 117, Replace hardcoded
anchor tags in Welcome.tsx that use href={`/agents/${firstAgentId}`} (and the
similar block at 136-141) with the router-aware Link component (or
useNavigate/to) so navigation respects the app base path; change the destination
to the correct agent-details route (use your named route constant or path helper
instead of `/agents/${id}` since that path now points to the task board) and
pass firstAgentId (or relevant agentId) to that route, preserving the existing
className and children.
src/agent/channel.rs-2119-2169 (1)

2119-2169: ⚠️ Potential issue | 🟠 Major

This makes every channel prompt scale with the full registry.

On agents that discover lots of repos, every turn now injects one ProjectContext per unlinked registry repo, which can materially increase token usage/latency and eventually hit context limits. The if let Ok(...) also hides the failure case, so registry context can disappear without any signal. Please cap or summarize this list and at least log the error path. As per coding guidelines, "Don't silently discard errors. No let _ = on Results. Handle them, log them, or propagate them."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel.rs` around lines 2119 - 2169, The current block calling
self.deps.registry_store.list_repos(&self.deps.agent_id, true) silently ignores
errors and pushes an unbounded ProjectContext per unlinked registry repo (via
ProjectContext and ProjectRepoContext), which can blow up prompt size; change it
to handle the Result error (log it with your agent logger, e.g.,
self.deps.logger/error or similar) instead of using if let Ok(...), and when
successful, cap or summarize the registry_repos you expand into contexts (e.g.,
include only the first N entries, N configurable, and add a single summary
ProjectContext stating "X additional registry repos omitted") so you don’t
inject one context per repo and thereby limit token growth; ensure
linked_project_ids filtering remains and preserve repo metadata for the included
items.
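The cap-and-summarize shape suggested above can be sketched as a pure helper. This is a minimal illustration, not the channel.rs API: `MAX_REGISTRY_CONTEXTS` and `format_registry_contexts` are made-up names, and the real code would build `ProjectContext` values rather than strings.

```rust
/// Hypothetical sketch: cap the registry repos injected into a prompt and
/// summarize the remainder, instead of one context entry per repo.
const MAX_REGISTRY_CONTEXTS: usize = 5;

fn format_registry_contexts(repos: &[String]) -> Vec<String> {
    let mut contexts: Vec<String> = repos
        .iter()
        .take(MAX_REGISTRY_CONTEXTS)
        .map(|name| format!("registry repo: {name}"))
        .collect();
    // One summary line replaces the unbounded tail.
    let omitted = repos.len().saturating_sub(MAX_REGISTRY_CONTEXTS);
    if omitted > 0 {
        contexts.push(format!("{omitted} additional registry repos omitted"));
    }
    contexts
}

fn main() {
    let repos: Vec<String> = (0..8).map(|i| format!("owner/repo-{i}")).collect();
    for line in format_registry_contexts(&repos) {
        println!("{line}");
    }
}
```

Prompt growth is then bounded at N + 1 lines regardless of how many repos the agent discovers.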
src/registry/pr_conflicts.rs-149-149 (1)

149-149: ⚠️ Potential issue | 🟠 Major

Pass the configured webhook bind address to pr_conflict_check_loop instead of hardcoding 127.0.0.1.

The webhook server's bind address is configurable via webhook_config.bind (defaulting to 127.0.0.1), but pr_conflict_check_loop only receives the webhook_port parameter. If the webhook is configured to bind to a different address (e.g., 0.0.0.0 or a custom IP), the hardcoded 127.0.0.1 in line 149 will fail to reach it. Extract and pass webhook_config.bind to the function alongside webhook_port.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/registry/pr_conflicts.rs` at line 149, The code hardcodes "127.0.0.1"
when building webhook_url in pr_conflicts.rs; update the call site that invokes
pr_conflict_check_loop so it also passes webhook_config.bind (the configured
bind address) alongside webhook_port, and then modify pr_conflict_check_loop to
accept a bind parameter and use it when formatting webhook_url (e.g.,
format!("http://{bind}:{webhook_port}/send")). Ensure you update all references
to pr_conflict_check_loop and the parameter list to avoid compile errors.
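A small helper along the lines the prompt describes, assuming the bind value is a plain address string; the function name and the loopback fallback for `0.0.0.0` are this sketch's choices, not pr_conflicts.rs code:

```rust
/// Illustrative helper: build the webhook URL from the configured bind
/// address rather than a hardcoded 127.0.0.1.
fn webhook_url(bind: &str, port: u16) -> String {
    // 0.0.0.0 means "listen on all interfaces"; a client still needs a
    // concrete address to dial, so fall back to loopback for that case.
    let host = if bind == "0.0.0.0" { "127.0.0.1" } else { bind };
    format!("http://{host}:{port}/send")
}

fn main() {
    println!("{}", webhook_url("0.0.0.0", 8080));
    println!("{}", webhook_url("10.0.0.5", 8080));
}
```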
src/registry/sync.rs-179-213 (1)

179-213: ⚠️ Potential issue | 🟠 Major

Use an owner-qualified directory for auto-clones.

clone_base_dir.join(&gh_repo.name) collides as soon as two configured owners both have a repo with the same name. The second repo will reuse or skip the first checkout, and local_path ends up pointing at the wrong repository.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/registry/sync.rs` around lines 179 - 213, The auto-clone path currently
uses config.clone_base_dir.join(&gh_repo.name) which causes collisions for repos
with the same name across different owners; change the directory to be
owner-qualified (for example join the owner and repo name or join a sanitized
full_name) so each repo gets a unique path. Update the clone_dir computation
used by auto_clone_repo, the exists() check, and the subsequent
store.set_local_path calls to use the new owner-qualified path (use
gh_repo.owner.login or full_name sanitized) so full_name, auto_clone_repo,
clone_dir, and set_local_path remain consistent and avoid cross-owner
collisions.
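The owner-qualified layout can be sketched as `<base>/<owner>/<repo>` with separators sanitized out of each segment. `clone_dir` and the `-` replacement are illustrative choices, not the sync.rs implementation:

```rust
use std::path::{Path, PathBuf};

/// Sketch of an owner-qualified clone path so that two owners with a
/// same-named repo no longer collide under clone_base_dir.
fn clone_dir(base: &Path, owner: &str, repo: &str) -> PathBuf {
    // Keep each segment a single path component.
    let sanitize = |s: &str| s.replace('/', "-").replace('\\', "-");
    base.join(sanitize(owner)).join(sanitize(repo))
}

fn main() {
    let base = Path::new("/var/lib/spacebot/repos");
    println!("{}", clone_dir(base, "acme", "api").display());
    println!("{}", clone_dir(base, "globex", "api").display());
}
```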
src/watchdog.rs-203-218 (1)

203-218: ⚠️ Potential issue | 🟠 Major

Page through startup catch-up results instead of stopping at 20 issues.

A single outage window with more than 20 matching issues leaves the remainder unprocessed once they age out of the 24-hour search. Please paginate here, fetching until GitHub returns fewer results than a full page.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/watchdog.rs` around lines 203 - 218, The current gh CLI call that sets
output = tokio::process::Command::new("gh")... with args including
"--limit","20" only fetches a single page; change this to paginate and keep
fetching until a page returns fewer than page_size (20). Implement a loop around
the Command invocation (use a page counter starting at 1 and add "--page",
&page.to_string() to the args) that runs the command for repo and since, parses
each page’s results and accumulates/processes them, and breaks when the returned
list length < 20 (or empty); ensure you still pass the same search (--search
&format!("-label:prd-generated created:>{since}") ) and collect results from
each invocation rather than stopping after the first output.
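The paginate-until-short-page loop is generic enough to sketch with a closure standing in for one `gh issue list --page N` invocation; `fetch_all` is an illustrative helper, not watchdog.rs code:

```rust
/// Keep fetching pages until one comes back shorter than page_size,
/// which signals the end of the result set.
fn fetch_all<T>(page_size: usize, mut fetch_page: impl FnMut(usize) -> Vec<T>) -> Vec<T> {
    let mut all = Vec::new();
    let mut page = 1;
    loop {
        let batch = fetch_page(page);
        let batch_len = batch.len();
        all.extend(batch);
        // A short (or empty) page means there is nothing further.
        if batch_len < page_size {
            break;
        }
        page += 1;
    }
    all
}

fn main() {
    // Simulate 45 issues served in pages of 20: 20, 20, then 5.
    let issues: Vec<u32> = (0..45).collect();
    let got = fetch_all(20, |page| {
        issues.iter().copied().skip((page - 1) * 20).take(20).collect()
    });
    println!("fetched {} issues", got.len());
}
```

Note the exact-multiple case: 40 results yields pages of 20, 20, then 0, and the empty page terminates the loop correctly.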
src/api/registry.rs-143-159 (1)

143-159: ⚠️ Potential issue | 🟠 Major

Record Failed status before returning a 500 from manual sync.

Syncing is stored before sync_registry(), but the ? on error exits before the shared status is updated. After a failed manual sync, /api/registry/status stays stuck on syncing until some later successful run overwrites it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/registry.rs` around lines 143 - 159, The code sets
SyncStatus::Syncing in state.registry_sync_status (via
status_map.get(&query.agent_id)) but uses ? on the await of sync_registry(store,
&query.agent_id, &registry_config) so failures never update the shared status;
change the error path to record a SyncStatus::Failed before returning the 500:
capture the error from sync_registry (remove the direct map_err(?)) so you can,
in the error branch, look up the same status in status_map (using
query.agent_id) and call status.store(Arc::new(SyncStatus::Failed { at:
chrono::Utc::now().to_rfc3339(), error: format!("{}", err) })), then return
StatusCode::INTERNAL_SERVER_ERROR; keep the existing success path that stores
SyncStatus::Completed with result.clone().
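The status bookkeeping can be sketched with a shared mutex instead of the real `status_map`/`Arc` machinery; the enum variants approximate, but do not reproduce, the api/registry.rs types:

```rust
use std::sync::{Arc, Mutex};

#[derive(Clone, Debug, PartialEq)]
enum SyncStatus {
    Syncing,
    Completed,
    Failed { error: String },
}

/// Run a sync, making sure the shared status leaves Syncing on BOTH paths.
fn run_sync(
    status: &Arc<Mutex<SyncStatus>>,
    sync: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    *status.lock().unwrap() = SyncStatus::Syncing;
    match sync() {
        Ok(()) => {
            *status.lock().unwrap() = SyncStatus::Completed;
            Ok(())
        }
        Err(err) => {
            // Record the failure before propagating, so /status never
            // stays stuck on "syncing" after a failed manual sync.
            *status.lock().unwrap() = SyncStatus::Failed { error: err.clone() };
            Err(err)
        }
    }
}

fn main() {
    let status = Arc::new(Mutex::new(SyncStatus::Completed));
    if run_sync(&status, || Err("boom".into())).is_err() {
        println!("{:?}", status.lock().unwrap());
    }
}
```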
src/api/tasks.rs-619-645 (1)

619-645: ⚠️ Potential issue | 🟠 Major

Don't erase worktree metadata when removal failed.

let _ = discards the result of git worktree remove, so a failed removal still clears the DB pointer. That leaves an on-disk checkout orphaned from task metadata and makes later cleanup much harder. As per coding guidelines "Don't silently discard errors. No let _ = on Results. Handle them, log them, or propagate them."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/tasks.rs` around lines 619 - 645, The code currently discards the
result of the git worktree removal (let _ =) so it always removes the "worktree"
metadata even when removal failed; capture and inspect the result of
TokioCommand::new("git")...output().await (or use status) and only call
obj.remove("worktree") and persist via store.update when the command succeeded;
if the command fails, log the error with context (include worktree_path, command
stderr/stdout and the error) and return or map an appropriate StatusCode instead
of silently proceeding. Ensure you modify the branch that runs TokioCommand, the
variable worktree_path, the metadata handling that calls obj.remove("worktree"),
and the subsequent store.update(...) (UpdateTaskInput) so metadata is only
mutated on successful worktree removal.
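The guard can be sketched over a plain map; the metadata shape and function name are simplified stand-ins for the tasks.rs types, with the command outcome reduced to a success flag:

```rust
use std::collections::HashMap;

/// Only drop the worktree pointer from task metadata when the removal
/// command actually succeeded; otherwise surface the failure.
fn finish_worktree_removal(
    removal_succeeded: bool,
    stderr: &str,
    metadata: &mut HashMap<String, String>,
) -> Result<(), String> {
    if !removal_succeeded {
        // Keep the pointer so the on-disk checkout stays discoverable.
        return Err(format!("git worktree remove failed: {stderr}"));
    }
    metadata.remove("worktree");
    Ok(())
}

fn main() {
    let mut meta = HashMap::from([("worktree".to_string(), "/tmp/wt".to_string())]);
    if let Err(err) = finish_worktree_removal(false, "locked", &mut meta) {
        println!("{err}; metadata kept: {}", meta.contains_key("worktree"));
    }
}
```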
src/api/tasks.rs-416-439 (1)

416-439: ⚠️ Potential issue | 🟠 Major

Propagate git diff failures instead of returning an empty diff.

These branches only handle spawn errors; a non-zero git exit currently becomes a 200 with empty stdout, which is indistinguishable from “no changes”. Check output.status.success() and return an error when the diff command fails. As per coding guidelines "Don't silently discard errors. No let _ = on Results. Handle them, log them, or propagate them."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/tasks.rs` around lines 416 - 439, The git diff branches (where
TokioCommand::new("git") is called for worktree_path and branch) only map spawn
errors but ignore non-zero exit statuses; update both places to check
output.status.success() after awaiting output, and if not successful log the
stderr (e.g., with tracing::warn! and
String::from_utf8_lossy(&output.stderr)) and return an
Err(StatusCode::INTERNAL_SERVER_ERROR) (or propagate a suitable error) instead
of treating empty stdout as a valid diff; ensure the diff variable assignment
for worktree_path and branch handles failed exit statuses consistently before
converting stdout to String.
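The exit-status check amounts to the following sketch, with the `std::process::Output` fields passed in directly; `diff_result` is an illustrative name, not the tasks.rs handler:

```rust
/// Distinguish "no changes" (success, empty stdout) from a failed git
/// invocation (non-zero exit) instead of conflating the two.
fn diff_result(success: bool, stdout: Vec<u8>, stderr: Vec<u8>) -> Result<String, String> {
    if !success {
        // A non-zero git exit is an error, not an empty diff.
        return Err(format!(
            "git diff failed: {}",
            String::from_utf8_lossy(&stderr)
        ));
    }
    Ok(String::from_utf8_lossy(&stdout).into_owned())
}

fn main() {
    println!("{:?}", diff_result(true, b"".to_vec(), b"".to_vec()));
    println!("{:?}", diff_result(false, b"".to_vec(), b"fatal: bad revision".to_vec()));
}
```

At the call site this would be fed from `output.status.success()`, `output.stdout`, and `output.stderr` after awaiting the command.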
src/agent/channel_dispatch.rs-646-649 (1)

646-649: ⚠️ Potential issue | 🟠 Major

Handle WorkerLimitReached by falling through to queueing.

The capacity check here isn't atomic with do_spawn(). Another concurrent spawn can consume the slot before spawn_*_from_state runs, and this returns a limit error even though the contract of spawn_or_queue_worker() is to queue on saturation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel_dispatch.rs` around lines 646 - 649, The capacity check
before spawning isn't atomic and currently returns early on any error; change
the logic in spawn_or_queue_worker so that when check_worker_limit(state).await
returns Ok you call do_spawn(state, &request).await.map(Some), but when it
returns the specific WorkerLimitReached error you do not return an error and
instead fall through to the queuing path (i.e., treat limit-reached as "queue
this request"); for other unexpected errors propagate them as before. Update the
conditional around check_worker_limit, referencing check_worker_limit, do_spawn,
spawn_or_queue_worker (and spawn_*_from_state callers) to implement this
selective handling.
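The selective error handling reads roughly like this sketch; the enum, worker-id type, and function names are illustrative rather than the channel_dispatch.rs API:

```rust
#[derive(Debug, PartialEq)]
enum AgentError {
    WorkerLimitReached,
    Other(String),
}

/// Spawn if a slot is free; treat saturation as "queue it" rather than
/// an error, and propagate everything else.
fn spawn_or_queue(
    spawn: impl FnOnce() -> Result<u64, AgentError>,
    queue: &mut Vec<String>,
    task: &str,
) -> Result<Option<u64>, AgentError> {
    match spawn() {
        Ok(worker_id) => Ok(Some(worker_id)),
        // Losing the capacity race is expected under concurrency.
        Err(AgentError::WorkerLimitReached) => {
            queue.push(task.to_string());
            Ok(None)
        }
        Err(other) => Err(other),
    }
}

fn main() {
    let mut queue = Vec::new();
    let r = spawn_or_queue(|| Err(AgentError::WorkerLimitReached), &mut queue, "triage");
    println!("{r:?}, queued = {queue:?}");
}
```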
src/agent/channel_dispatch.rs-757-797 (1)

757-797: ⚠️ Potential issue | 🟠 Major

Don't drop queued work on drain-side spawn races.

This removes the entry from worker_queue/queued_workers before the spawn succeeds. If another request grabs the free slot first, spawn_*_from_state can return WorkerLimitReached here and the queued task is lost instead of retried.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel_dispatch.rs` around lines 757 - 797, The code currently
removes the task from worker_queue / status.queued_workers before attempting to
spawn (see worker_queue.write().await.pop_front(), queued_workers,
spawn_opencode_worker_from_state, spawn_worker_from_state), which loses the task
if spawn fails with WorkerLimitReached; change the flow to keep the request in
the queue and status until spawn succeeds — either delay removing the entry
until after result is Ok(worker_id), or if spawn returns an error (specifically
WorkerLimitReached) push the request back onto worker_queue and re-insert into
status.queued_workers so the task will be retried; only remove from queued lists
on successful spawn or on non-retriable errors.
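The drain-side fix can be sketched as pop, try, and push back on the retriable error; the types here are simplified stand-ins for the real queue entries and spawn functions:

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
enum SpawnError {
    WorkerLimitReached,
    Fatal(String),
}

/// Drain one queued task. If the spawn loses the race for the free slot,
/// the task goes back to the front of the queue instead of being lost.
fn drain_one(
    queue: &mut VecDeque<String>,
    spawn: impl FnOnce(&str) -> Result<u64, SpawnError>,
) -> Option<u64> {
    let task = queue.pop_front()?;
    match spawn(&task) {
        Ok(worker_id) => Some(worker_id),
        Err(SpawnError::WorkerLimitReached) => {
            // Retriable: keep the task for the next drain pass.
            queue.push_front(task);
            None
        }
        Err(SpawnError::Fatal(err)) => {
            // Non-retriable: drop the entry, but loudly.
            eprintln!("dropping queued task {task}: {err}");
            None
        }
    }
}

fn main() {
    let mut queue = VecDeque::from(["fix-ci".to_string()]);
    let spawned = drain_one(&mut queue, |_| Err(SpawnError::WorkerLimitReached));
    println!("spawned={spawned:?}, still queued={}", queue.len());
}
```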
src/api/tasks.rs-427-436 (1)

427-436: ⚠️ Potential issue | 🟠 Major

Resolve the task's repository before running Git.

The branch-diff path and worktree creation both shell out in the API process's current directory, not the repo attached to the task. Imported registry issues can therefore diff or create a worktree in the wrong checkout, or just fail when the daemon wasn't started inside that repo.

Also applies to: 505-540

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/tasks.rs` around lines 427 - 436, The git commands are being run in
the API process CWD instead of the task's repository; update the code that
creates TokioCommand::new("git") (the branch-diff block using args(["diff",
&format!("main...{br}")]) and the separate worktree creation code referenced at
505-540) to first resolve the task's repository path (e.g., via the task
struct's repo path accessor or a helper like
task.repo_dir()/resolve_repo_path(task)) and then call .current_dir(repo_path)
on the TokioCommand before .args()/.output() so the git subcommands execute
inside the correct repository; apply the same change to every git invocation in
this file (including the worktree creation path) and preserve existing error
mapping/logging.
src/agent/channel_dispatch.rs-670-712 (1)

670-712: ⚠️ Potential issue | 🟠 Major

Make duplicate detection and enqueue a single critical section.

Two concurrent callers at capacity can both pass the read-only duplicate checks and then push_back() the same task, because the queue is only locked for the final write. Reuse the reservation mechanism or perform dedupe while holding the queue write lock.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel_dispatch.rs` around lines 670 - 712, Replace the separate
read-only duplicate checks + later push with a single critical section: compute
normalized from request.task, then acquire locks in a consistent order (e.g. let
mut status = state.status_block.write().await; let mut queue =
state.worker_queue.write().await;) and inside that section call
status.find_duplicate_worker_task(&request.task) and check queue.iter().any(...)
using the same normalized logic; if duplicate return
AgentError::DuplicateWorkerTask, otherwise push_back(request) to queue and push
the same QueuedWorkerInfo into status. This ensures the duplicate check and the
push_back/update of queued_workers happen atomically.
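The single critical section looks like this in miniature, using one mutex in place of the real `status_block`/`worker_queue` pair and lowercasing as a stand-in for the real normalization:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

struct WorkQueue {
    inner: Mutex<VecDeque<String>>,
}

impl WorkQueue {
    /// Check for a duplicate and enqueue under ONE lock, so two concurrent
    /// callers cannot both pass the check and push the same task.
    fn enqueue_unique(&self, task: &str) -> Result<(), String> {
        let normalized = task.trim().to_lowercase();
        let mut queue = self.inner.lock().unwrap();
        if queue.iter().any(|t| *t == normalized) {
            return Err(format!("duplicate worker task: {task}"));
        }
        queue.push_back(normalized);
        Ok(())
    }
}

fn main() {
    let queue = WorkQueue { inner: Mutex::new(VecDeque::new()) };
    println!("{:?}", queue.enqueue_unique("Fix CI"));
    println!("{:?}", queue.enqueue_unique("fix ci")); // rejected as duplicate
}
```

In the real code the same idea applies with the async `RwLock`s, as long as both guards are held across the check and the push.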
src/api/registry.rs-261-290 (1)

261-290: ⚠️ Potential issue | 🟠 Major

Don't silently skip a repo when issue JSON parsing fails.

If gh issue list returns malformed or unexpected JSON, this repo is just omitted because the from_slice error is discarded. Log the parse error or return a failure so the UI doesn't silently show partial results. As per coding guidelines "Don't silently discard errors. No let _ = on Results. Handle them, log them, or propagate them."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/registry.rs` around lines 261 - 290, The repo's JSON parsing
currently ignores errors from serde_json::from_slice (used when building
GitHubIssue items) which causes silent omission of a repository's issues; update
the code around the from_slice::<Vec<serde_json::Value>>(&out.stdout) call to
handle the Result: if parsing fails, log the parsing error (including repo_name
and stderr/stdout) via the existing logger or return/propagate an Err from the
enclosing function instead of silently continuing, so failures are visible;
ensure logs reference repo_name and the serde_json error and preserve existing
behavior on success where you iterate and push GitHubIssue entries.
src/watchdog.rs-246-249 (1)

246-249: ⚠️ Potential issue | 🟠 Major

Use a monotonic baseline for watchdog timestamps.

The instant_to_epoch_secs() function relies on SystemTime, which is not monotonic and can jump due to NTP or manual clock adjustments. This causes the elapsed time calculation at line 267 to become incorrect, potentially triggering false restarts (if time jumps forward) or missing actual stalls (if time jumps backward). Store seconds since a fixed Instant instead—such as the watchdog's initialization time.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/watchdog.rs` around lines 246 - 249, The code uses
instant_to_epoch_secs/SystemTime for last_activity which is non-monotonic;
change to a monotonic baseline by capturing a fixed Instant at spawn_watchdog
start (e.g., let baseline = Instant::now()) and store seconds as
(Instant::now().duration_since(baseline).as_secs()) in the AtomicU64
(last_activity). Update all places that call instant_to_epoch_secs or compare
epoch seconds (including the elapsed calculation that compares last_activity to
now) to instead compute monotonic elapsed seconds relative to the same baseline
so the watchdog uses Instant-based, monotonic timestamps; keep the public
signature spawn_watchdog and WatchdogHandle unchanged, only replace
instant_to_epoch_secs usage and ensure the baseline is captured where the
background thread and handle can access it.
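The monotonic baseline can be sketched directly from std types; field and method names are illustrative, not the watchdog.rs API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Activity is recorded as seconds since a fixed Instant captured at
/// startup, so NTP or manual clock changes cannot skew stall detection.
struct Watchdog {
    baseline: Instant,
    last_activity: AtomicU64,
}

impl Watchdog {
    fn new() -> Self {
        Self { baseline: Instant::now(), last_activity: AtomicU64::new(0) }
    }

    fn ping(&self) {
        let secs = self.baseline.elapsed().as_secs();
        self.last_activity.store(secs, Ordering::Relaxed);
    }

    fn idle_secs(&self) -> u64 {
        let now = self.baseline.elapsed().as_secs();
        // Both readings come from the same monotonic baseline.
        now.saturating_sub(self.last_activity.load(Ordering::Relaxed))
    }
}

fn main() {
    let wd = Watchdog::new();
    wd.ping();
    println!("idle for {}s", wd.idle_secs());
}
```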
src/upgrade.rs-130-145 (1)

130-145: ⚠️ Potential issue | 🟠 Major

Avoid deadlock: consume piped stdout or inherit it instead.

cargo build is spawned with both stdout and stderr piped, but only stderr is drained before wait(). Cargo and its build scripts write to stdout, which will fill the pipe buffer and block the child process while the parent is stuck waiting—causing a deadlock. Either inherit stdout (.stdout(Stdio::inherit())) or read both streams concurrently. Additionally, .unwrap() on line 139 should handle the error rather than panicking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/upgrade.rs` around lines 130 - 145, The child process created by
Command::new("cargo") pipes both stdout and stderr but only drains stderr,
risking a deadlock; change this by either making stdout inherited (replace
.stdout(Stdio::piped()) with .stdout(Stdio::inherit())) or by reading stdout
concurrently (spawn a thread/task that wraps child.stdout in a BufReader and drains lines
similar to how stderr is processed), and avoid panicking on
child.stderr.take().unwrap() by handling the Option/Result (e.g., return an
error if stderr is None or map it to a proper anyhow error) so Command::new /
.spawn() -> child, child.stderr, reader.lines(), and on_line are all handled
safely.
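The concurrent-drain option can be sketched with blocking std I/O; this uses `sh` so the example runs anywhere with a POSIX shell, where the real code would spawn `cargo build` (likely via tokio):

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};
use std::thread;

/// Drain stdout on a helper thread while stderr is read on the current
/// one, so neither pipe fills up and blocks the child before wait().
fn run_draining(script: &str) -> std::io::Result<(Vec<String>, Vec<String>, bool)> {
    let mut child = Command::new("sh")
        .args(["-c", script])
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;
    // take() both pipes up front; error instead of unwrap-panicking.
    let stdout = child.stdout.take().ok_or_else(|| {
        std::io::Error::new(std::io::ErrorKind::Other, "stdout not captured")
    })?;
    let stderr = child.stderr.take().ok_or_else(|| {
        std::io::Error::new(std::io::ErrorKind::Other, "stderr not captured")
    })?;
    let out_thread = thread::spawn(move || {
        BufReader::new(stdout).lines().map_while(Result::ok).collect::<Vec<_>>()
    });
    let err_lines: Vec<String> =
        BufReader::new(stderr).lines().map_while(Result::ok).collect();
    let out_lines = out_thread.join().expect("stdout drain thread panicked");
    let status = child.wait()?;
    Ok((out_lines, err_lines, status.success()))
}

fn main() -> std::io::Result<()> {
    let (out, err, ok) = run_draining("echo building; echo 'warning: demo' 1>&2")?;
    println!("stdout={out:?} stderr={err:?} success={ok}");
    Ok(())
}
```

The simpler alternative from the finding, `.stdout(Stdio::inherit())`, avoids the second reader entirely when cargo's stdout doesn't need to be captured.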
src/main.rs-372-379 (1)

372-379: ⚠️ Potential issue | 🟠 Major

Pass an explicit local-mode flag into cmd_start.

Line 467 infers "local mode" from port_override.is_some(), so spacebot local without --port never opens the browser even though the subcommand advertises that behavior. start --foreground and local are indistinguishable once port_override is None, so this needs a separate flag instead of overloading the port override.

🩹 Suggested fix
-        Command::Start { foreground } => cmd_start(cli.config, cli.debug, foreground, None),
+        Command::Start { foreground } => {
+            cmd_start(cli.config, cli.debug, foreground, None, false)
+        }
 ...
-        Command::Restart { foreground } => {
+        Command::Restart { foreground } => {
             cmd_stop_if_running();
-            cmd_start(cli.config, cli.debug, foreground, None)
+            cmd_start(cli.config, cli.debug, foreground, None, false)
         }
 ...
-        Command::Local { port } => cmd_start(cli.config, cli.debug, true, port),
+        Command::Local { port } => cmd_start(cli.config, cli.debug, true, port, true),

 fn cmd_start(
     config_path: Option<std::path::PathBuf>,
     debug: bool,
     foreground: bool,
     port_override: Option<u16>,
+    open_browser: bool,
 ) -> anyhow::Result<()> {
 ...
-    let open_browser = foreground && port_override.is_some();
     let browser_url = format!("http://{}:{}", config.api.bind, config.api.port);

Also applies to: 401-405, 466-468

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.rs` around lines 372 - 379, The subcommand handling is overloading
port_override to signal "local mode", which fails when port is None; update
calls to cmd_start so an explicit local-mode boolean is passed instead of
inferring from port_override. Modify the Command::Local arm to call
cmd_start(cli.config, cli.debug, true, port) with a new local_mode=true
parameter (or equivalent) and update other callers (e.g., Command::Start,
Command::Restart, Command::Status paths that invoke cmd_start) to pass
local_mode=false where appropriate; update the cmd_start signature to accept
this explicit local flag and use it instead of checking port_override.is_some()
to decide whether to open the browser. Ensure references to port_override remain
unaffected for port value only.
src/main.rs-2156-2163 (1)

2156-2163: ⚠️ Potential issue | 🟠 Major

Ping the watchdog from the main loop, not only from message traffic.

Line 2163 arms the timeout immediately, but this file only resets it after inbound/injection routing. A healthy quiet daemon — including setup mode before any providers are configured — will therefore restart every 10 minutes even when nothing is stuck.

🩹 Suggested fix
     let watchdog = spacebot::watchdog::spawn_watchdog(
         std::time::Duration::from_secs(600),  // 10 minute timeout
         std::time::Duration::from_secs(60),    // check every minute
     );
-    // Ping immediately so the watchdog doesn't trigger during initial quiet period
     watchdog.ping();
+    let mut watchdog_heartbeat = tokio::time::interval(std::time::Duration::from_secs(60));
+    watchdog_heartbeat.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);

     // Main event loop: route inbound messages to agent channels
     loop {
 ...
         tokio::select! {
+            _ = watchdog_heartbeat.tick() => {
+                watchdog.ping();
+            }
             Some(mut message) = inbound_next, if agents_initialized => {

Also applies to: 2174-2174, 2387-2389, 2417-2428

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.rs` around lines 2156 - 2163, The watchdog is currently only pinged
during inbound/injection routing (calls to watchdog.ping()), causing
healthy-but-quiet phases to trigger the 10-minute restart; update the main loop
to periodically ping the watchdog (e.g., on each main loop tick or by scheduling
a short interval timer) so that spawn_watchdog(...)’s watchdog.ping() is invoked
independent of message traffic or provider setup; apply the same change to other
occurrences (the other watchdog.ping() sites referenced) so setup mode and idle
daemon states keep the watchdog reset.
src/main.rs-593-618 (1)

593-618: ⚠️ Potential issue | 🟠 Major

Pass --remote and --branch to upgrade::pull() to ensure consistency.

check_for_updates() fetches from and compares against the specified remote/branch (line 77 fetches from remote, line 100 uses HEAD..{remote}/{branch}), but pull() only accepts source_dir and runs a bare git pull (line 125). This means the command can report updates available from one target (e.g., origin/main) but merge from a different tracking branch, creating a mismatch between what the check reports and what actually gets applied.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.rs` around lines 593 - 618, The pull step uses
upgrade::pull(&source_dir) which performs a bare git pull and can pull a
different ref than the earlier upgrade::check_for_updates(&source_dir, &remote,
&branch) used; update the API and call so pull accepts the same remote and
branch (e.g., change upgrade::pull signature to upgrade::pull(source_dir: &Path,
remote: &str, branch: &str) and invoke upgrade::pull(&source_dir, &remote,
&branch) where currently called) so the actual merge target matches the check;
ensure any error handling/return types are adjusted to match the modified
signature.
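The explicit target amounts to building the pull invocation from the same remote and branch the check used; `pull_args` is a hypothetical helper, not the upgrade.rs signature:

```rust
/// `git pull <remote> <branch>` merges exactly the ref the update check
/// reported, instead of whatever tracking branch a bare `git pull` picks.
fn pull_args(remote: &str, branch: &str) -> Vec<String> {
    vec!["pull".into(), remote.into(), branch.into()]
}

fn main() {
    println!("git {}", pull_args("origin", "main").join(" "));
}
```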
src/registry/store.rs-320-329 (1)

320-329: ⚠️ Potential issue | 🟠 Major

Propagate decode failures for nullable fields.

unwrap_or(None) turns missing-column and type-mismatch errors into None, so bad migrations or query regressions look like absent data instead of failing loudly. Keep the fields nullable, but let try_get::<Option<_>, _> errors surface.

🩹 Suggested fix
-        language: row.try_get("language").unwrap_or(None),
-        local_path: row.try_get("local_path").unwrap_or(None),
+        language: row
+            .try_get::<Option<String>, _>("language")
+            .context("missing language")?,
+        local_path: row
+            .try_get::<Option<String>, _>("local_path")
+            .context("missing local_path")?,
 ...
-        worker_model: row.try_get("worker_model").unwrap_or(None),
+        worker_model: row
+            .try_get::<Option<String>, _>("worker_model")
+            .context("missing worker_model")?,
 ...
-        project_id: row.try_get("project_id").unwrap_or(None),
-        last_synced_at: row.try_get("last_synced_at").unwrap_or(None),
+        project_id: row
+            .try_get::<Option<String>, _>("project_id")
+            .context("missing project_id")?,
+        last_synced_at: row
+            .try_get::<Option<String>, _>("last_synced_at")
+            .context("missing last_synced_at")?,

As per coding guidelines "Don't silently discard errors. No let _ = on Results. Handle them, log them, or propagate them."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/registry/store.rs` around lines 320 - 329, The current mapping silently
turns decode errors into None by using unwrap_or(None) for fields like language,
local_path, worker_model, project_id, last_synced_at; change these to call
row.try_get::<Option<_>, _>("field_name")? (or the appropriate concrete
Option<Type>) so that type-mismatch/missing-column errors propagate instead of
being swallowed; keep the fields as Option<T> but return/propagate the Result
error (using ? and existing .context(...) where needed) so failures surface
during row decoding for functions that build the record from DB rows.
src/registry/store.rs-245-247 (1)

245-247: ⚠️ Potential issue | 🟠 Major

Archive all rows when the latest sync is empty.

Returning 0 here means a successful discovery that finds no repos leaves every stale row non-archived forever. If callers want to suppress archive-on-error, that decision belongs in the sync layer, not inside this store method.

🩹 Suggested fix
     if seen_full_names.is_empty() {
-        return Ok(0);
+        let result = sqlx::query(
+            "UPDATE registry_repos \
+             SET is_archived = 1, updated_at = CURRENT_TIMESTAMP \
+             WHERE agent_id = ? AND is_archived = 0",
+        )
+        .bind(agent_id)
+        .execute(&self.pool)
+        .await
+        .context("failed to archive repos for empty registry sync")?;
+
+        return Ok(result.rows_affected());
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/registry/store.rs` around lines 245 - 247, The early return when
seen_full_names.is_empty() prevents archiving stale rows; remove the `return
Ok(0)` and let the rest of the method proceed to mark rows not present in
`seen_full_names` as archived (i.e., perform the same update/DELETE logic that
runs when there are discovered names). Ensure you still handle the empty set
safely (avoid SQL IN () errors) by using the existing code path or a guard that
executes the archive-all branch when `seen_full_names` is empty; do not move
suppression logic here — callers that want to skip archiving should decide in
the sync layer.
🟡 Minor comments (7)
interface/src/ui/style/colors.scss-204-204 (1)

204-204: ⚠️ Potential issue | 🟡 Minor

Fix stylelint errors: add blank lines before inline section comments.

Line 204, Line 208, Line 212, Line 223, and Line 243 violate scss/double-slash-comment-empty-line-before.

Suggested lint-only patch
 .ember-theme {
 	--dark-hue: 8;
 	--color-black: 0, 0%, 0%;
 	--color-white: 0, 0%, 100%;
+
 	// dark red primary, warm orange secondary
 	--color-accent: 4, 72%, 40%;
 	--color-accent-faint: 14, 75%, 50%;
 	--color-accent-deep: 0, 78%, 30%;
+
 	// warm-tinted text
 	--color-ink: 15, 12%, 92%;
 	--color-ink-dull: 10, 8%, 62%;
 	--color-ink-faint: 8, 5%, 44%;
+
 	// sidebar — deep warm black
 	--color-sidebar: 8, 15%, 3%;
 	--color-sidebar-box: 8, 12%, 7%;
@@
 	--color-sidebar-button: 8, 12%, 11%;
 	--color-sidebar-selected: 4, 18%, 16%;
 	--color-sidebar-shade: 8, 15%, 10%;
+
 	// main — warm dark surfaces
 	--color-app: 8, 12%, 5%;
@@
 	--color-app-frame: 8, 10%, 16%;
 	--color-app-slider: 8, 10%, 11%;
 	--color-app-explorer-scrollbar: 4, 14%, 20%;
+
 	// menu
 	--color-menu: 8, 14%, 4%;

Also applies to: 208-208, 212-212, 223-223, 243-243

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@interface/src/ui/style/colors.scss` at line 204, Several inline section
comments (e.g., the comment text "// dark red primary, warm orange secondary"
and the other inline comments at the same positions) violate
scss/double-slash-comment-empty-line-before; fix by inserting a single blank
line immediately above each of those inline // comments so there is an empty
line separating them from the preceding code or block, ensuring the scss linter
rule passes for those comment occurrences.
interface/src/routes/Pricing.tsx-26-26 (1)

26-26: ⚠️ Potential issue | 🟡 Minor

Placeholder CTA link for Teams tier.

The Teams tier CTA has href: "#" which doesn't navigate anywhere meaningful. Users clicking "Get Started" will see no action, which is confusing.

Consider either:

  • Linking to a sign-up flow or waitlist
  • Disabling the button with a "Coming Soon" label
  • Removing the CTA until the Teams tier is available
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@interface/src/routes/Pricing.tsx` at line 26, The Teams tier CTA currently
uses a placeholder href ("#") in the cta object inside Pricing.tsx (cta: {
label: "Get Started", href: "#" }); update this to a meaningful action: either
point href to the sign-up or waitlist route (e.g., "/signup" or "/waitlist"), or
replace the CTA with a disabled/secondary button and change the label to "Coming
Soon" (and remove the href), or remove the cta field altogether until the tier
is available; apply the chosen change to the cta object for the Teams tier so
the button no longer navigates to a dead anchor.
interface/src/routes/Pricing.tsx-42-42 (1)

42-42: ⚠️ Potential issue | 🟡 Minor

Personal email address in Enterprise CTA.

The Enterprise tier uses mailto:[email protected] which appears to be a personal email. For a customer-facing pricing page, consider using a business domain email (e.g., [email protected] or [email protected]).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@interface/src/routes/Pricing.tsx` at line 42, The Enterprise CTA currently
uses a personal mailto address in the cta object (cta: { label: "Contact Sales",
href: "mailto:[email protected]" }); update that href to a company-facing
address (for example "mailto:[email protected]" or
"mailto:[email protected]") so the Pricing component uses a business
email for customer contact.
src/config/types.rs-1026-1039 (1)

1026-1039: ⚠️ Potential issue | 🟡 Minor

Default clone directory uses ephemeral /tmp path.

The default clone_base_dir of /tmp/spacebot-repos may be problematic:

  • /tmp is cleared on reboot on most Linux systems
  • Auto-cloned repos would be lost, triggering re-clones on restart
  • Could cause unexpected disk usage in /tmp which often has size limits

Consider defaulting to a path under the instance directory or agent data directory for persistence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/types.rs` around lines 1026 - 1039, The Default impl for
RegistryConfig sets clone_base_dir to an ephemeral "/tmp/spacebot-repos"; change
it to a persistent data directory (e.g. under the instance or agent data
directory) by replacing the hardcoded PathBuf::from("/tmp/spacebot-repos") in
the RegistryConfig::default() implementation with a PathBuf resolved from the
instance/agent data location (use an existing helper like an
instance_dir/agent_data_dir function or std::env/dirs crate to derive a
persistent path), ensuring the field name clone_base_dir in the Default impl is
updated accordingly.
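A persistent default could be resolved like this sketch, where `data_dir` is a hypothetical stand-in for however spacebot locates its instance directory:

```rust
use std::path::PathBuf;

/// Prefer a clone base under the instance's data directory; fall back to
/// /tmp only when no data directory is known.
fn default_clone_base(data_dir: Option<PathBuf>) -> PathBuf {
    match data_dir {
        Some(dir) => dir.join("repos"),
        // Last resort: ephemeral, but at least explicit about it.
        None => PathBuf::from("/tmp/spacebot-repos"),
    }
}

fn main() {
    println!("{}", default_clone_base(Some(PathBuf::from("/var/lib/spacebot"))).display());
    println!("{}", default_clone_base(None).display());
}
```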
mcp-bridge/spacebot_mcp_server.py-275-275 (1)

275-275: ⚠️ Potential issue | 🟡 Minor

Remove unnecessary f-string prefix.

Static analysis correctly flags this f-string has no placeholders.

🐛 Fix
-        params = f"?limit=100"
+        params = "?limit=100"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcp-bridge/spacebot_mcp_server.py` at line 275, The assignment to the
variable params uses an unnecessary f-string with no placeholders; change the
assignment of params from using f"?limit=100" to a plain string "?limit=100"
(update the params variable initialization) to remove the redundant f-prefix and
satisfy static analysis.
src/registry/pr_conflicts.rs-128-132 (1)

128-132: ⚠️ Potential issue | 🟡 Minor

HTTP response is not checked for success status.

The send().await succeeds if the request was sent, but doesn't verify the response status code. A 4xx/5xx response would silently pass.

🐛 Proposed fix to check response status
     client
         .post(webhook_url)
         .json(&payload)
         .timeout(std::time::Duration::from_secs(10))
         .send()
         .await
-        .map_err(|e| format!("failed to forward conflict message: {e}"))?;
+        .map_err(|e| format!("failed to forward conflict message: {e}"))?
+        .error_for_status()
+        .map_err(|e| format!("webhook returned error: {e}"))?;

     Ok(())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/registry/pr_conflicts.rs` around lines 128 - 132, The code currently only
awaits .send().await (which ensures the request was sent) but doesn't verify
HTTP success; capture the Response returned by .send().await (e.g., let resp =
... .send().await.map_err(...)?) and then call
resp.error_for_status().map_err(|e| format!("failed to forward conflict message:
{e}"))? to surface 4xx/5xx errors before returning Ok(()); update the existing
map_err and return flow around .send().await and add the error_for_status check
to ensure non-success statuses are treated as errors.
interface/src/routes/AgentTasks.tsx-277-284 (1)

277-284: ⚠️ Potential issue | 🟡 Minor

Apply the repo filter consistently outside the board.

With a projectFilter selected, the blocked badge still counts from tasks and graph mode still renders tasks instead of filteredTasks, so the selected repo and the visible totals/nodes disagree.

Also applies to: 391-397

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@interface/src/routes/AgentTasks.tsx` around lines 277 - 284, The blocked
count and graph rendering are using the full tasks array instead of the
repo-filtered list, causing mismatch with the selected projectFilter; change the
code to compute and use a single source variable (e.g., const visibleTasks =
projectFilter ? filteredTasks : tasks) and replace usages of tasks with
visibleTasks for the blocked badge (where isBlocked is called) and for the graph
rendering/graph mode code so both the badge and graph operate on filteredTasks
when a projectFilter is active.
🧹 Nitpick comments (10)
src/agent/status.rs (1)

137-150: Keep queued-worker output bounded.

These queued tasks are injected into the channel status block on every turn, but unlike completed_items they are neither capped nor truncated. Once the worker queue backs up, a few long tasks can dominate prompt space and crowd out actual conversation history. Consider limiting the rendered entries and summarizing each task before writing it into the status block.

Also applies to: 409-421

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/status.rs` around lines 137 - 150, The queued_workers list
(Vec<QueuedWorkerInfo>) is unbounded when serialized into the status block and
can dominate prompts; update the code that renders/serializes queued_workers
(the same pattern used for completed_items) to (1) cap the number of entries
included (e.g., keep the most recent N entries), (2) truncate or summarize each
QueuedWorkerInfo.task to a safe max length (or replace with a short
summary/ellipsis), and (3) if entries are omitted, add a small summary line like
"X more queued workers omitted" so callers know items were dropped; implement
this in the status rendering helper used for queued_workers and reuse the same
bounding/truncation logic for the other affected block referenced (lines
~409-421).
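The capping and truncation described above can be sketched as a small helper. This is a minimal sketch, not the repository's actual code: `QueuedWorkerInfo`, the limits, and `render_queued` are hypothetical names standing in for the status-rendering path referenced in the finding.

```rust
const MAX_QUEUED_ENTRIES: usize = 5;
const MAX_TASK_CHARS: usize = 120;

// Hypothetical stand-in for the QueuedWorkerInfo referenced above.
struct QueuedWorkerInfo {
    task: String,
}

// Render at most MAX_QUEUED_ENTRIES tasks, truncating each to
// MAX_TASK_CHARS and appending a summary line for omitted entries.
fn render_queued(queued: &[QueuedWorkerInfo]) -> Vec<String> {
    let mut lines: Vec<String> = queued
        .iter()
        .take(MAX_QUEUED_ENTRIES)
        .map(|info| {
            if info.task.chars().count() > MAX_TASK_CHARS {
                let truncated: String = info.task.chars().take(MAX_TASK_CHARS).collect();
                format!("{truncated}…")
            } else {
                info.task.clone()
            }
        })
        .collect();
    if queued.len() > MAX_QUEUED_ENTRIES {
        lines.push(format!(
            "{} more queued workers omitted",
            queued.len() - MAX_QUEUED_ENTRIES
        ));
    }
    lines
}
```

The summary line keeps the prompt honest about dropped entries without paying their token cost.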
mcp-bridge/spacebot_mcp_server.py (2)

349-367: Add bounds check on Content-Length to prevent memory exhaustion.

The read_message function reads content_length bytes without validating the size. A malformed or malicious message with an extremely large Content-Length could cause memory exhaustion.

🛡️ Proposed bounds check
+MAX_MESSAGE_SIZE = 10 * 1024 * 1024  # 10 MB

 def read_message() -> dict | None:
     """Read a JSON-RPC message from stdin."""
     # Read headers
     content_length = 0
     while True:
         line = sys.stdin.readline()
         if not line:
             return None
         line = line.strip()
         if not line:
             break
         if line.startswith("Content-Length:"):
             content_length = int(line.split(":")[1].strip())

     if content_length == 0:
         return None
+    if content_length > MAX_MESSAGE_SIZE:
+        return None  # Or raise an error

     content = sys.stdin.read(content_length)
     return json.loads(content)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcp-bridge/spacebot_mcp_server.py` around lines 349 - 367, The read_message
function currently trusts the Content-Length header and may allocate huge
memory; add a bounds check around the parsed content_length in read_message to
validate it's a positive integer within a safe maximum (e.g. MAX_CONTENT_LENGTH
constant), handle non-integer or negative values gracefully, and return None or
log an error when the value is out of bounds before calling
sys.stdin.read(content_length); ensure you reference and validate the
content_length variable and introduce a MAX_CONTENT_LENGTH near read_message to
make the limit configurable.

43-51: Consider validating URL scheme to prevent file:// access.

The SPACEBOT_URL environment variable is used directly in urlopen() without scheme validation. While environment variables are generally trusted, a misconfiguration could expose local file access via file:// URLs.

🛡️ Proposed validation
 def api_request(method: str, path: str, body: dict | None = None) -> dict:
     """Make an HTTP request to the Spacebot API."""
     url = f"{SPACEBOT_URL}/api{path}"
+    if not url.startswith(("http://", "https://")):
+        return {"error": "Invalid URL scheme. SPACEBOT_URL must use http:// or https://"}
     data = json.dumps(body).encode() if body else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcp-bridge/spacebot_mcp_server.py` around lines 43 - 51, Validate the URL
scheme before building/using urllib.request.Request: parse the variable/url with
urllib.parse.urlparse and ensure the scheme is either "http" or "https" (reject
"file", empty, or other schemes); if invalid, return a clear error (same
structure as existing error returns) instead of calling urllib.request.urlopen.
Reference symbols: SPACEBOT_URL, the local variable url, and the code paths
around urllib.request.Request / urllib.request.urlopen to add the validation
early.
src/config/types.rs (1)

1349-1349: Registry config is not overridable per-agent.

Unlike other config sections (e.g., projects, browser), the registry config is always cloned from defaults with no per-agent override. If this is intentional (registry is instance-level), this is fine. If per-agent registry settings might be needed later, consider adding Option<RegistryConfig> to AgentConfig.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/types.rs` at line 1349, The registry field is always cloned from
defaults (registry: defaults.registry.clone()) so agents cannot override it;
update the AgentConfig structure to accept an Option<RegistryConfig> (e.g., add
registry: Option<RegistryConfig>) and change the place using
defaults.registry.clone() to use agent.registry.clone().unwrap_or_else(||
defaults.registry.clone()) (or equivalent) so a per-agent RegistryConfig is
respected when present but falls back to defaults.registry when None.
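The fallback suggested above is a one-liner once the field is an `Option`. A minimal sketch with hypothetical struct shapes (the real `AgentConfig` and `RegistryConfig` carry more fields):

```rust
// Hypothetical shapes for the config types referenced above.
#[derive(Clone)]
struct RegistryConfig {
    clone_base_dir: String,
}

struct Defaults {
    registry: RegistryConfig,
}

struct AgentConfig {
    // Per-agent override; None falls back to the instance-level default.
    registry: Option<RegistryConfig>,
}

// Prefer the agent's own registry config, else clone the defaults.
fn resolve_registry(agent: &AgentConfig, defaults: &Defaults) -> RegistryConfig {
    agent
        .registry
        .clone()
        .unwrap_or_else(|| defaults.registry.clone())
}
```

Existing configs without a per-agent `registry` section keep today's behavior, since a missing field deserializes to `None`.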
interface/src/components/TaskDependencyGraph.tsx (2)

94-128: Force layout may be slow for large graphs.

The inline force-directed layout runs 100 iterations with O(n²) repulsion calculations synchronously during render. For task boards with many items (50+), this could cause noticeable UI jank.

Consider:

  1. Reducing iterations for larger graphs
  2. Running layout in a Web Worker
  3. Using graphology-layout's built-in algorithms which are optimized
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@interface/src/components/TaskDependencyGraph.tsx` around lines 94 - 128, The
synchronous O(n²) force-loop in the TaskDependencyGraph component (the nested
repulsion loops over nodes, positions[], and the graph.forEachEdge attraction
loop) should be made non-blocking for large graphs: detect large graph sizes
(e.g., nodes.length > 50) and reduce the number of iterations (from 100 to a
smaller dynamic value), or move the layout computation into a Web Worker that
performs the repulsion/attraction loops and returns final positions, or replace
the inline loops with a call to an optimized layout routine (e.g.,
graphology-layout / forceatlas2) and use that result to set positions; update
the code paths referencing positions, nodes, and graph.forEachEdge to use the
worker/optimized layout result asynchronously so rendering is not done on the
main thread.

40-48: O(n²) lookup for blocked status calculation.

The nested .find() inside the loop creates O(n²) complexity when determining blocked status. For large task boards, consider pre-building a Map for O(1) lookups:

♻️ Proposed optimization
 const buildGraph = useCallback(() => {
 	const graph = new Graph({ type: "directed" });
+	const taskMap = new Map(tasks.map(t => [t.task_number, t]));

 	// Add nodes
 	for (const task of tasks) {
 		const key = String(task.task_number);
 		const deps = getDeps(task);
 		const isBlocked =
 			deps.length > 0 &&
 			deps.some((d) => {
-				const dt = tasks.find((t) => t.task_number === d);
+				const dt = taskMap.get(d);
 				return dt && dt.status !== "done";
 			});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@interface/src/components/TaskDependencyGraph.tsx` around lines 40 - 48, The
blocked-status calculation in the loop over tasks (the block using getDeps,
isBlocked and task.task_number) is O(n²) due to the inner tasks.find; fix it by
pre-building a Map from task_number to task (e.g., create taskMap from tasks
before the loop) and replace the tasks.find lookup with taskMap.get(d) to
achieve O(1) lookups; update any references in the isBlocked predicate to use
taskMap.get and keep the rest of the logic unchanged.
migrations/20260312000001_registry.sql (1)

26-27: No trigger for updated_at auto-update.

The updated_at column defaults to CURRENT_TIMESTAMP on insert but won't auto-update on modifications. SQLite doesn't have built-in ON UPDATE for timestamps. If auto-updating is needed, add a trigger:

♻️ Proposed trigger addition
CREATE TRIGGER IF NOT EXISTS registry_repos_updated_at
    AFTER UPDATE ON registry_repos
    FOR EACH ROW
BEGIN
    UPDATE registry_repos SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.id;
END;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@migrations/20260312000001_registry.sql` around lines 26 - 27, The updated_at
column in registry_repos is only set to CURRENT_TIMESTAMP on insert and won’t
auto-update on modifications in SQLite; add a trigger (e.g.,
registry_repos_updated_at) on the registry_repos table that runs AFTER UPDATE
FOR EACH ROW and sets updated_at = CURRENT_TIMESTAMP for the updated row (match
by id using NEW.id) so updated_at is refreshed automatically on every update.
src/registry/pr_conflicts.rs (1)

123-133: Reuse reqwest::Client instead of creating per-request.

Creating a new reqwest::Client for each webhook call is inefficient. The client should be created once and reused to benefit from connection pooling.

♻️ Proposed fix
 pub async fn pr_conflict_check_loop(
     store: RegistryStore,
     agent_id: String,
     runtime_config: Arc<crate::config::RuntimeConfig>,
     webhook_port: u16,
 ) {
     // Initial delay to let the agent fully start up and registry sync to run first.
     tokio::time::sleep(Duration::from_secs(90)).await;

     let webhook_url = format!("http://127.0.0.1:{webhook_port}/send");
+    let client = reqwest::Client::new();

     // Track which PRs we've already sent a conflict message for...
     let mut notified: std::collections::HashSet<(String, u64)> = std::collections::HashSet::new();

Then pass &client to forward_to_webhook:

-async fn forward_to_webhook(webhook_url: &str, repo: &str, message: &str) -> Result<(), String> {
+async fn forward_to_webhook(client: &reqwest::Client, webhook_url: &str, repo: &str, message: &str) -> Result<(), String> {
     let conversation_id = format!("github:{repo}");
     let payload = serde_json::json!({
         "conversation_id": conversation_id,
         "sender_id": "pr-conflict-checker",
         "content": message,
     });

-    let client = reqwest::Client::new();
     client
         .post(webhook_url)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/registry/pr_conflicts.rs` around lines 123 - 133, The function forwarding
webhook requests (forward_to_webhook) currently creates a new reqwest::Client
per call; change it to accept and use a reused reqwest::Client instance instead
(e.g., pass &reqwest::Client into forward_to_webhook or store one on the
surrounding struct) so connection pooling is preserved; update call sites to
construct a single Client once (e.g., at registry/handler initialization) and
pass that client into forward_to_webhook and any other places that post to
webhook_url.
interface/src/hooks/useSpacebotConfig.ts (1)

48-55: Consider deduplicating enabled platforms.

If multiple messaging instances share the same platform (e.g., two Discord bots), enabledPlatforms will contain duplicates. This may cause UI issues if the array is used for rendering unique badges.

♻️ Proposed fix using Set for deduplication
 const enabledPlatforms: string[] = [];
 if (messagingData) {
+	const seen = new Set<string>();
 	for (const inst of messagingData.instances ?? []) {
-		if (inst.enabled && inst.platform) {
+		if (inst.enabled && inst.platform && !seen.has(inst.platform)) {
+			seen.add(inst.platform);
 			enabledPlatforms.push(inst.platform);
 		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@interface/src/hooks/useSpacebotConfig.ts` around lines 48 - 55, The
enabledPlatforms array in useSpacebotConfig may contain duplicates because you
push inst.platform for each enabled instance; update the logic to deduplicate
platforms (e.g., use a Set or filter/Array.from) so each platform appears once.
Locate the block referencing messagingData.instances and enabledPlatforms in
useSpacebotConfig, collect only inst.platform values for enabled instances, then
convert to a deduplicated array before returning/using it to ensure unique
badges in the UI. Ensure null/undefined platform values are skipped and the
resulting type remains string[].
src/registry/store.rs (1)

258-260: Rename q to a full word.

This query builder is already dynamic, so the abbreviated local makes the binding flow harder to follow than it needs to be.

As per coding guidelines "Don't abbreviate variable names. Use queue not q, message not msg, channel not ch. Common abbreviations like config are fine."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/registry/store.rs` around lines 258 - 260, Rename the abbreviated local
variable q to a full, descriptive name (e.g., query_builder) in the dynamic SQL
builder flow: replace let mut q = sqlx::query(&query).bind(agent_id); and all
subsequent q = q.bind(name); references with let mut query_builder =
sqlx::query(&query).bind(agent_id); and query_builder =
query_builder.bind(name); (ensure you update any further uses of q in this
scope, including eventual .fetch_/execute calls, to the new name) so the binding
flow using seen_full_names remains clear.

Comment thread src/agent/worker.rs
Comment on lines +525 to +537
// For interactive workers, enter a follow-up loop -- but only if the
// worker has not already signaled a terminal outcome. Workers that called
// set_status(kind="outcome") are done; entering the idle loop would block
// forever and prevent WorkerComplete from firing (circular dependency).
if !resuming && self.hook.outcome_signaled() {
    tracing::info!(
        worker_id = %self.id,
        "worker signaled outcome, skipping idle follow-up loop"
    );
    // Drop the input receiver so the channel cleans up and
    // WorkerComplete fires normally.
    self.input_rx.take();
}

⚠️ Potential issue | 🔴 Critical

Don't re-enter idle after a follow-up already signaled outcome.

This guard only covers the first transition into the follow-up loop. A resumed worker, or any worker that emits set_status(kind="outcome") during a later follow-up, still falls through to Lines 675-678 and goes back to waiting for input, recreating the same circular wait. Please short-circuit the post-follow-up path on outcome_signaled() as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/worker.rs` around lines 525 - 537, The post-follow-up path must
short-circuit if an outcome was signaled: after the follow-up handling (the
branch that may recreate the idle/wait-for-input state), check
self.hook.outcome_signaled() again and, if true, log and drop the input receiver
(self.input_rx.take()) and skip re-entering the idle/wait loop (return/continue
so WorkerComplete can fire). Update the code paths that run when resuming or
after a later follow-up to perform this guard before recreating the
waiting-for-input state to avoid the circular wait.
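The re-check the reviewer asks for can be factored into a small guard that runs after every follow-up turn, not just on first entry. A minimal sketch under stated assumptions: `Worker`, `outcome_signaled`, and `after_follow_up` are simplified stand-ins for the real worker struct, its hook, and the post-follow-up control flow around lines 675-678.

```rust
// Simplified stand-in for the worker; input_rx models the mpsc receiver.
struct Worker {
    outcome_signaled: bool,
    input_rx: Option<()>,
}

impl Worker {
    // Returns true if the follow-up loop should exit. Called after every
    // follow-up turn, because a follow-up can itself call
    // set_status(kind = "outcome") -- checking only on first entry leaves
    // the resumed/later-outcome paths stuck waiting for input.
    fn after_follow_up(&mut self) -> bool {
        if self.outcome_signaled {
            // Drop the receiver so the channel closes and WorkerComplete fires.
            self.input_rx.take();
            return true;
        }
        false // no terminal outcome yet; go back to waiting for input
    }
}
```

Placing the guard at the loop's tail means both entry paths (fresh and resumed) and every subsequent iteration share one exit condition.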
