
fix: improve Codex parity across provider, runtime, and workspace flows#6831

Open
VasuBansal7576 wants to merge 8 commits into aden-hive:main from VasuBansal7576:codex/pr3-codex-workspace-parity

Conversation

Contributor

@VasuBansal7576 VasuBansal7576 commented Mar 27, 2026

Description

Consolidates the Codex parity work into one PR. This includes the adapter-first Codex provider integration plus the shared backend/runtime/workspace fixes for issues that Codex exposed more aggressively than other models.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)

Related Issues

Related to #6741
Related to #6817

Changes Made

  • Add the adapter-first Codex provider/backend path and wire it through the existing provider/runner contracts
  • Harden validation, rerun/input propagation, worker handoff, stop behavior, result visibility, and no-progress detection in the backend/runtime path
  • Improve workspace parity for Codex-built agents by gating Run correctly and preserving structured run inputs for reruns
  • Add Codex parity and Codex-vs-control regression coverage for the targeted flows

Testing

Ran the following tests locally:

  • cd core && uv run pytest tests/test_litellm_provider.py tests/test_config.py -q
  • cd core && uv run pytest tests/test_codex_control_parity.py tests/test_codex_parity_gate.py tests/test_codex_planning_phase.py tests/test_event_loop_node.py tests/test_queen_lifecycle_validation.py tests/test_queen_orchestrator.py tests/test_session_manager_worker_handoff.py tests/test_session_manager_worker_validation.py tests/test_validate_agent_path.py tests/test_worker_monitoring_tools.py framework/server/tests/test_api.py -q
  • cd core/frontend && npm run test -- --run src/lib/run-inputs.test.ts
  • cd core/frontend && npm run build
  • Manual testing performed in the UI on the full stacked branch

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Screenshots (if applicable)

N/A

Summary by CodeRabbit

  • New Features

    • Codex backend support with robust empty-stream fallback and improved tool-call handling.
    • Structured run flow: prompted questions, run-button gating, last-run input reuse, and rerun support.
    • Worker/package validation that can block staging/running with clear validation errors.
  • Improvements

    • Enhanced worker health monitoring with health_status and issue_signals.
    • Cleaner terminal/result notifications that highlight a primary result.
    • Better session state persistence for entry inputs and safer SPA/API routing; chat terminal-stop handling and queen parking.
  • Bug Fixes

    • More reliable stream recovery and streamed tool-call merging.
  • Tests

    • Extensive new test coverage across Codex parity, planning, execution flows, validation, and tooling.


coderabbitai bot commented Mar 28, 2026

📝 Walkthrough

Adds a Codex backend abstraction and adapter with request hardening and empty-stream recovery; refactors LiteLLM integration and prompt composition; introduces subprocess-based worker/package validation and queen parking; adds worker-validation gating across server routes, cursor merge behavior, structured run-input handling, health snapshot tooling, and extensive tests.

Changes

Cohort / File(s) Summary
Codex backend & adapter
core/framework/llm/codex_backend.py, core/framework/llm/codex_adapter.py
New modules: Codex constants, api_base detection/normalization, header/kwargs builders, allowed param merging, system-prompt chunking, tool-choice rules, request hardening, empty-stream retry (non-stream→stream conversion), and streamed tool-call chunk merging. Review request/headers normalization and allowed param logic.
LiteLLM provider & integration
core/framework/llm/litellm.py, core/framework/runner/runner.py, core/framework/config.py
LiteLLM refactor to centralize request/message building, Codex adapter usage, pythonish tool-arg parsing, stream recovery delegation, and token/account handling. Runner uses build_codex_litellm_kwargs and CODEX_API_BASE. Check token expiry, JWT decoding, and provider selection ordering.
Conversation, cursor & prompt composition
core/framework/graph/conversation.py, core/framework/graph/prompt_composer.py
NodeConversation.update_system_prompt gains optional output_keys; added cursor merge helper to avoid clobbering cursor state; removed execution/node-type preambles from composed prompts; adjusted spillover file-load wording. Watch cursor merge behavior and prompt-signature change.
Executor / continuous flow
core/framework/graph/executor.py
Continuous-mode updates now pass output_keys to system-prompt updates; phase stamping occurs earlier; transition marker tool list computed from cumulative + next tools and augmented for outputs/client-facing flags. Inspect transition-marker tool computation and phase ordering.
Event bus & execution stream
core/framework/runtime/event_bus.py, core/framework/runtime/execution_stream.py
Event handler logging changed; emit_client_input_requested extended with auto_blocked, assistant_text_present, assistant_text_requires_input flags; session-state persistence backfills entry-node input keys from result outputs. Check new event fields and persisted state merging.
Server app & routes
core/framework/server/app.py, core/framework/server/routes_execution.py, core/framework/server/routes_sessions.py
validate_agent_path trims and resolves relative to repo root; SPA handler guards /api/*; added worker-validation gating (returns 409) for /trigger, /resume, /restore, /load; /chat removes image handling and adds terminal-stop parking behavior. Review route gating and chat terminal handling.
Session manager & queen orchestrator
core/framework/server/session_manager.py, core/framework/server/queen_orchestrator.py
Subprocess-based validate_agent_package run and structured report parsing; session fields to store validation report/failures; suspend_queen to park queen executor; digest injection gating; primary-result selection and worker-terminal notification formatting. Pay attention to subprocess parsing, error propagation, and suspend semantics.
Queen lifecycle & worker monitoring
core/framework/tools/queen_lifecycle_tools.py, core/framework/tools/worker_monitoring_tools.py
Validation parsing/gating integrated into lifecycle tools, preflight flow centralized, input shaping/rerun support, health snapshot reader/classifier, and added health_status/issue_signals outputs. Review health derivation heuristics and validation gating effects.
Coder tools server & validation CLI
tools/coder_tools_server.py
Validation rewritten to accept export names or agent paths, richer validation steps (including behavior checks), module-local tool discovery, improved reporting, and scaffold defaults changed to avoid placeholder TODOs. Verify CLI discovery and behavior_validation rules.
Frontend structured run inputs
core/frontend/src/lib/run-inputs.ts, core/frontend/src/lib/run-inputs.test.ts, core/frontend/src/pages/workspace.tsx
New utilities/tests for structured run inputs; Run-button gating; pending-question flow for missing inputs; lastRunInputData persistence; UI changes to trigger with shaped inputData. Check UI gating and multi-question flow integration.
Execution, node & conversation tests
core/tests/test_event_loop_node.py, core/tests/test_continuous_conversation.py
Expanded tests for set_output semantics, doom-loop detection, client-facing completion, transition marker/tool listing, and cursor/transition behavior. Review test expectations for completion and tooling.
LiteLLM & Codex tests
core/tests/test_litellm_provider.py, core/tests/test_codex_*.py
Large set of tests covering Codex parity, empty-stream recovery, tool-call chunk merging, planning-phase gating, and control vs Codex streaming equivalence. Inspect new parity and recovery test cases.
Server & session tests
core/framework/server/tests/test_api.py, core/tests/test_session_manager_worker_*, core/tests/test_validate_agent_path.py
Tests added/updated for validation-gated trigger/resume/restore/load, queen parking on terminal choices, session manager validation subprocess behavior, and repo-root-relative path resolution.
Queen lifecycle & monitoring tests
core/tests/test_queen_lifecycle_validation.py, core/tests/test_queen_orchestrator.py, core/tests/test_worker_monitoring_tools.py
Integration/unit tests for validation gating, primary-result selection, health snapshot signals, input backfilling, and rerun semantics.
Tools & coder-tests
tools/tests/test_coder_tools_server.py, tools/coder_tools_server.py
Expanded tool discovery/testing and validator behavior tests; improved FakeToolRegistry in tests. Review discovery fallbacks and module-local tool loading.
Misc config & small fixes
core/framework/config.py, core/tests/test_config.py
Replaced hardcoded Codex URL with CODEX_API_BASE, exposed Codex helpers in tests, and updated config tests for Codex behavior. Confirm config helpers return expected Codex kwargs/headers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through prompts and Codex streams,
Chunked the rules, and mended seams.
Validation gates keep workers tight,
Queen parks gentle in the night.
The rabbit nods — the pipeline gleams. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 49.61%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly summarizes the main change, improving Codex parity across multiple system layers (provider, runtime, and workspace).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 16

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (5)
core/tests/test_event_loop_node.py-872-878 (1)

872-878: ⚠️ Potential issue | 🟡 Minor

Assert the ordering that this regression is about.

These assertions only prove both events were emitted. If CLIENT_INPUT_REQUESTED fires before the streamed explanation, the test still passes.

🧪 Tighten the assertion
         assert output_events
         assert input_events
+        assert received.index(output_events[0]) < received.index(input_events[0])
         assert "Root cause: checkout requests are failing" in output_events[0].data["snapshot"]
         assert input_events[0].data["prompt"] == "What would you like to do next?"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/tests/test_event_loop_node.py` around lines 872 - 878, The test only
asserts both CLIENT_OUTPUT_DELTA and CLIENT_INPUT_REQUESTED exist but not their
ordering; update the test to locate the indices of the first CLIENT_OUTPUT_DELTA
and first CLIENT_INPUT_REQUESTED in received (e.g. using enumerate and next or a
helper) and assert that the CLIENT_OUTPUT_DELTA index is less than the
CLIENT_INPUT_REQUESTED index before keeping the existing checks on
output_events[0].data["snapshot"] and input_events[0].data["prompt"]; reference
the variables received, output_events, input_events and EventType to find and
compare their positions.
core/framework/server/app.py-58-63 (1)

58-63: ⚠️ Potential issue | 🟡 Minor

Trimmed input is checked, but untrimmed input is still resolved.

At Line 51-56 you sanitize agent_path into raw_path, but Line 58 uses Path(agent_path) instead of Path(raw_path). Inputs like " examples/some_agent " will still fail unexpectedly.

💡 Suggested fix
-    candidate = Path(agent_path).expanduser()
+    candidate = Path(raw_path).expanduser()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/app.py` around lines 58 - 63, The code constructs
candidate using Path(agent_path) but earlier sanitizes the user input into
raw_path, so untrimmed inputs like "  examples/some_agent  " bypass trimming;
change the candidate initialization to use Path(raw_path) (i.e., replace
Path(agent_path) with Path(raw_path)) so all subsequent resolution and
_REPO_ROOT joining operate on the trimmed/sanitized value (ensure this change is
applied where candidate is assigned and continues to be used).
core/tests/test_session_manager_worker_handoff.py-174-177 (1)

174-177: ⚠️ Potential issue | 🟡 Minor

Replace fixed sleep with deterministic await to avoid flaky CI.

Using await asyncio.sleep(0.05) at Line 174 makes this test timing-sensitive. Prefer waiting until the mocked coroutine is actually awaited (with timeout).

💡 Suggested fix
-    await asyncio.sleep(0.05)
+    async def _wait_for_consolidate() -> None:
+        while consolidate.await_count == 0:
+            await asyncio.sleep(0)
+
+    await asyncio.wait_for(_wait_for_consolidate(), timeout=1.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/tests/test_session_manager_worker_handoff.py` around lines 174 - 177,
Replace the flaky fixed sleep with a deterministic wait that times out: instead
of await asyncio.sleep(0.05), use asyncio.wait_for with a short polling loop
that checks the AsyncMock state until consolidate has been awaited (e.g., poll
consolidate.await_count or consolidate.awaited flag) and raise on timeout; then
run consolidate.assert_awaited_once() and assert
queen_node.inject_event.await_count == 0. Target the existing mock names
(consolidate and queen_node.inject_event) so the test reliably waits for the
mocked coroutine to be awaited rather than sleeping.
core/framework/runtime/event_bus.py-538-539 (1)

538-539: ⚠️ Potential issue | 🟡 Minor

Keep the traceback when a handler fails.

Line 539 downgrades this to a plain message, but these exceptions are otherwise swallowed by gather(..., return_exceptions=True). Without exc_info, this becomes the only place we could have captured the failing handler's stack trace.

Suggested fix
-                except Exception as e:
-                    logger.error(f"Handler error for {event.type}: {e}")
+                except Exception:
+                    logger.exception("Handler error for %s", event.type)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/runtime/event_bus.py` around lines 538 - 539, The except block
that currently does logger.error(f"Handler error for {event.type}: {e}") should
preserve the stack trace; update the handler exception logging (in the event
dispatch/handler loop in event_bus.py where the except Exception as e block is)
to include the exception info by using logger.error(..., exc_info=True) or
logger.exception(...) so the traceback is retained when a handler fails.
core/tests/test_codex_planning_phase.py-105-111 (1)

105-111: ⚠️ Potential issue | 🟡 Minor

Avoid sleeping to decide when the node is safe to stop.

Line 106 makes this test timing-sensitive: on a slow runner the block event may not have fired yet, and on a fast runner an extra loop can already start. Waiting on the actual signal is more stable.

Suggested fix
     async def shutdown_after_first_block() -> None:
-        await asyncio.sleep(0.05)
+        await asyncio.wait_for(blocked.wait(), timeout=1.0)
         node.signal_shutdown()

Add a shared event next to received, and set it from capture():

blocked = asyncio.Event()

async def capture(event: AgentEvent) -> None:
    received.append(event)
    if _client_input_counts_as_planning_ask(event):
        phase_state.planning_ask_rounds += 1
    blocked.set()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/tests/test_codex_planning_phase.py` around lines 105 - 111, The test is
timing-sensitive because shutdown_after_first_block uses asyncio.sleep; instead
create a shared asyncio.Event (e.g., blocked) alongside received, set
blocked.set() inside the capture(event: AgentEvent) callback after updating
received and phase_state.planning_ask_rounds (and after calling
_client_input_counts_as_planning_ask), and replace the sleep/wait logic in
shutdown_after_first_block with awaiting blocked.wait() before calling
node.signal_shutdown(); keep references to shutdown_after_first_block,
node.execute(ctx), capture, received, phase_state.planning_ask_rounds and
_client_input_counts_as_planning_ask to locate and update the code accordingly.
🧹 Nitpick comments (3)
core/tests/test_session_manager_worker_validation.py (1)

78-81: Avoid hard-coding subprocess arg positions in this assertion.

Lines 78-81 couple the test to the current argv layout instead of the behavior being verified. A harmless refactor of _run_validation_report_sync() can reshuffle the command and break this test even if it still invokes the right validator.

Suggested fix
-    script = captured["cmd"][4]
+    script = next(part for part in captured["cmd"] if "_validate_agent_package_impl" in str(part))
     assert "_validate_agent_package_impl" in script
     assert "validate_agent_package(agent_name)" not in script
-    assert captured["cmd"][6] == "/tmp/demo_agent"
+    assert any(part == "/tmp/demo_agent" for part in captured["cmd"])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/tests/test_session_manager_worker_validation.py` around lines 78 - 81,
The test currently hard-codes subprocess argument positions in assertions
against captured["cmd"]; instead, search the cmd list for the validator script
name and assert presence and correct surrounding args without relying on fixed
indices. Concretely, in the test that exercises _run_validation_report_sync(),
replace assertions like captured["cmd"][4] and captured["cmd"][6] with: assert
any(arg for arg in captured["cmd"] if "_validate_agent_package_impl" in arg),
then locate that element's index via captured["cmd"].index(...) and assert the
next/expected argument equals "/tmp/demo_agent" (and assert that the literal
"validate_agent_package(agent_name)" is not present anywhere using not
any(...)). This makes the test robust to reordering while still verifying the
validator is invoked with the correct agent path.
core/framework/server/session_manager.py (1)

211-250: Persist queen-local subscription IDs before parking.

suspend_queen() only removes the subscriptions stored on Session, but create_queen() also registers queen-local listeners on the shared EventBus. Without persisting those IDs, every park/revive cycle leaks another stale listener because this method has no way to unsubscribe them here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/session_manager.py` around lines 211 - 250,
suspend_queen currently only clears Session.worker_handoff_sub and
Session.memory_consolidation_sub but does not remove queen-local listeners that
create_queen registers on the shared EventBus, causing listener leaks; fix by
having create_queen store the subscription IDs it registers into a new Session
attribute (e.g., session.queen_local_subs or similar) and then modify
suspend_queen to iterate over session.queen_local_subs, call
session.event_bus.unsubscribe(...) for each saved id (with the same try/except
pattern used for other subs), and then clear session.queen_local_subs (set to
None or empty) along with the other session fields so all queen-local listeners
are unsubscribed when parking.
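The bookkeeping suggested above can be sketched as follows. Session, create_queen, suspend_queen, and queen_local_subs are hypothetical stand-ins for the real session_manager.py shapes, and FakeBus is a minimal stand-in for the shared EventBus:

```python
# Hypothetical sketch: record queen-local subscription ids at creation so
# parking can unsubscribe them. Names mirror the review suggestion, not the
# actual session_manager.py API.

class FakeBus:
    """Stand-in event bus with id-based subscriptions."""

    def __init__(self):
        self.subs = {}
        self._counter = 0

    def subscribe(self, event_type, handler):
        self._counter += 1
        sub_id = f"sub-{self._counter}"
        self.subs[sub_id] = (event_type, handler)
        return sub_id

    def unsubscribe(self, sub_id):
        del self.subs[sub_id]


class Session:
    def __init__(self, event_bus):
        self.event_bus = event_bus
        self.queen_local_subs = []


def create_queen(session, handlers):
    # Record every queen-local subscription id so parking can undo it later.
    for event_type, handler in handlers.items():
        session.queen_local_subs.append(
            session.event_bus.subscribe(event_type, handler)
        )


def suspend_queen(session):
    # Unsubscribe queen-local listeners before parking, tolerating failures
    # the same way the existing per-field unsubscribes do.
    for sub_id in session.queen_local_subs:
        try:
            session.event_bus.unsubscribe(sub_id)
        except Exception:
            pass
    session.queen_local_subs.clear()
```

With this shape, every park/revive cycle removes exactly the listeners it registered, so repeated cycles cannot accumulate stale handlers on the shared bus.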
core/framework/tools/queen_lifecycle_tools.py (1)

1059-1078: Resolve the same entry point you trigger later.

This helper assumes entry_points[0] is the default entry, but the execution paths below all call runtime.trigger(entry_point_id="default", ...). Looking up that same entry spec here would make input shaping and reruns safe for multi-entry agents.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/tools/queen_lifecycle_tools.py` around lines 1059 - 1078, The
helper _get_default_entry_input_keys currently picks entry_points[0] but
runtime.trigger uses entry_point_id="default", so change the lookup to find the
entry spec whose identifier/name equals "default" (e.g., check attributes like
id or name on each spec) and fall back to the first spec if not found; then
derive entry_node_id from that matched entry_spec (or graph.entry_node if
missing) and use graph.get_node to return its input_keys (preserving existing
None/empty guards). This ensures _get_default_entry_input_keys resolves the same
entry point used by runtime.trigger.
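The lookup described in that prompt might look like the sketch below. The `id`/`name` attributes on entry specs are assumptions about the graph's data model, not the confirmed queen_lifecycle_tools.py shapes:

```python
from types import SimpleNamespace

# Hypothetical sketch: resolve the same entry point that
# runtime.trigger(entry_point_id="default", ...) will use, instead of
# blindly taking entry_points[0].


def resolve_entry_spec(entry_points, entry_point_id="default"):
    """Return the entry spec matching the requested id, falling back to the
    first spec when no id matches (preserving the old behavior)."""
    for spec in entry_points:
        if entry_point_id in (getattr(spec, "id", None), getattr(spec, "name", None)):
            return spec
    return entry_points[0] if entry_points else None
```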
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/framework/graph/conversation.py`:
- Around line 354-363: The update_system_prompt behavior should clear the
previous node contract when the next node has no outputs; change
update_system_prompt so it sets self._output_keys to the provided output_keys
and if output_keys is None explicitly clear the contract by setting
self._output_keys = [] (rather than leaving it untouched). Update the logic in
update_system_prompt to reference the parameter output_keys and the attribute
_output_keys accordingly so a transition that passes None will remove prior
output keys.

In `@core/framework/llm/codex_adapter.py`:
- Around line 145-156: The harden_request_kwargs method currently uses
cleaned.setdefault("store", False) which allows an explicit store=True in
incoming kwargs to persist; change it to forcefully disable storage by assigning
cleaned["store"] = False (or removing any existing "store" and then setting it
to False) inside harden_request_kwargs so that Codex traffic cannot be
persisted; update references in harden_request_kwargs to ensure "store" is
always False regardless of incoming kwargs.

In `@core/framework/llm/codex_backend.py`:
- Around line 17-29: The current is_codex_api_base uses substring matching
(CODEX_API_BASE in api_base) which is unsafe; update is_codex_api_base(api_base:
str | None) to parse the URL (use urllib.parse.urlparse), validate scheme is
http/https, compare parsed.scheme + "://" + parsed.netloc equals the known
CODEX_API_ORIGIN (or derive origin from CODEX_API_BASE by parsing) and ensure
the parsed.path startswith the expected base path (e.g., "/v1" or the path
portion of CODEX_API_BASE) rather than using substring membership; ensure it
returns False for invalid URLs or different origins. Then update
normalize_codex_api_base to call the new is_codex_api_base and trim a trailing
"/responses" only when the parsed path ends with "/responses" and the
origin/path match, returning the reconstructed normalized origin+path (or None)
and preserving behavior for None inputs and malformed URLs by returning the
original input or None.
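The origin-and-prefix matching described above can be sketched with urllib.parse. The CODEX_API_BASE value here is a placeholder, since the real constant lives in codex_backend.py:

```python
from urllib.parse import urlparse

# Placeholder base URL; the real CODEX_API_BASE is defined in codex_backend.py.
CODEX_API_BASE = "https://codex.example.com/v1"
_BASE = urlparse(CODEX_API_BASE)


def is_codex_api_base(api_base):
    """True only for http(s) URLs on the Codex origin under the base path,
    instead of unsafe substring membership."""
    if not api_base:
        return False
    parsed = urlparse(api_base)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.netloc != _BASE.netloc:
        return False
    return parsed.path.startswith(_BASE.path)


def normalize_codex_api_base(api_base):
    """Trim a trailing '/responses' only for recognized Codex bases; other
    inputs (including None and malformed URLs) are returned unchanged."""
    if not is_codex_api_base(api_base):
        return api_base
    parsed = urlparse(api_base)
    path = parsed.path
    if path.endswith("/responses"):
        path = path[: -len("/responses")]
    return f"{parsed.scheme}://{parsed.netloc}{path}"
```

Because the check compares the parsed netloc exactly, a URL like `https://evil.example/?u=https://codex.example.com/v1` no longer passes, which is the failure mode of the substring test.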

In `@core/framework/llm/litellm.py`:
- Around line 1249-1282: The normalization currently replaces true/false/null
inside quoted strings and the fenced-block stripping leaves language tags,
breaking parsing; update _normalize_pythonish_tool_arguments to only replace
unquoted literals (e.g., detect matches not inside single or double quotes —
either via a regex with lookarounds or by scanning the string and skipping
quoted spans) so "'null hypothesis'" remains unchanged, and change the
fenced-code removal in _parse_tool_call_arguments to strip an optional language
tag and optional newline (e.g., remove a leading pattern like ^```[^\n]*\n? and
a trailing ```), ensuring fenced JSON like ```json\n{...}\n``` parses correctly.
Reference functions: _normalize_pythonish_tool_arguments,
_parse_pythonish_tool_arguments, and _parse_tool_call_arguments.
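Both fixes can be sketched with standalone helpers. These are simplified stand-ins for _normalize_pythonish_tool_arguments and the fence stripping in _parse_tool_call_arguments, and the null-to-None direction is an assumption about which way the real normalizer rewrites literals:

```python
import re

_QUOTED = re.compile(r"'[^']*'|\"[^\"]*\"")
_LIT = re.compile(r"\b(null|true|false)\b")
_MAP = {"null": "None", "true": "True", "false": "False"}
_FENCE = re.compile(r"^```[^\n]*\n?|```\s*$")


def normalize_literals(text):
    """Rewrite unquoted null/true/false, copying quoted spans verbatim so
    strings like 'null hypothesis' are left untouched."""
    parts, last = [], 0
    for match in _QUOTED.finditer(text):
        parts.append(_LIT.sub(lambda m: _MAP[m.group(1)], text[last:match.start()]))
        parts.append(match.group(0))  # quoted span copied as-is
        last = match.end()
    parts.append(_LIT.sub(lambda m: _MAP[m.group(1)], text[last:]))
    return "".join(parts)


def strip_code_fence(text):
    """Drop a leading ```lang line (language tag included) and a trailing
    ``` so fenced JSON payloads parse cleanly."""
    return _FENCE.sub("", text.strip()).strip()
```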

In `@core/framework/server/routes_execution.py`:
- Around line 514-516: The stop endpoint should not be blocked by package
validation: remove or bypass the call to _worker_validation_error in the stop
handler so that the /stop route always proceeds to cancel/terminate the running
job even when validation fails; specifically, locate the stop route handler
where validation_err = await _worker_validation_error(session) is invoked and
either delete that check or add a conditional so that _worker_validation_error
is skipped for the "/stop" handler, ensuring the handler still performs its
existing cancellation logic and returns the appropriate stop response.

In `@core/framework/server/session_manager.py`:
- Around line 119-134: The subprocess.run call in validate_agent_package (the
block invoking ["uv", "run", "python", "-c", script, ...]) can raise
FileNotFoundError, subprocess.TimeoutExpired, or OSError and should be wrapped
in a try/except that catches those exceptions and returns the same structured
validation failure dict used for non-zero exit codes: {"valid": False,
"summary": f"validate_agent_package failed for '{agent_ref_str}'", "steps":
{"validator_subprocess": {"passed": False, "error":
<message_truncated_to_2000_chars>}}}; use str(exception) (or include timeout /
errno details) truncated to 2000 chars for the error field and keep the existing
call to _parse_validation_report(proc.stdout) when the subprocess completes
successfully.
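The wrapping suggested there can be sketched as below; the report shape mirrors the non-zero-exit-code path, and the real _parse_validation_report call on the success path is elided:

```python
import subprocess

# Sketch of the suggested error handling around the validator subprocess.


def run_validator_subprocess(cmd, agent_ref_str, timeout=120.0):
    """Run the validation subprocess, mapping launch and timeout failures
    to the same structured failure report as a non-zero exit code."""

    def _failure(message):
        return {
            "valid": False,
            "summary": f"validate_agent_package failed for '{agent_ref_str}'",
            "steps": {
                "validator_subprocess": {
                    "passed": False,
                    "error": str(message)[:2000],  # truncate long tracebacks
                }
            },
        }

    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (FileNotFoundError, subprocess.TimeoutExpired, OSError) as exc:
        return _failure(exc)
    if proc.returncode != 0:
        return _failure(proc.stderr or proc.stdout)
    # Real code would call _parse_validation_report(proc.stdout) here.
    return {"valid": True, "stdout": proc.stdout}
```

FileNotFoundError is already a subclass of OSError, so catching OSError alone would suffice; listing all three just mirrors the review comment.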

In `@core/framework/tools/queen_lifecycle_tools.py`:
- Around line 4154-4179: rerun_worker_with_last_input currently builds
input_data from allowed_keys and thus drops legacy task-only payloads when an
entry node has no structured input_keys; change the assembly logic so that when
allowed_keys is empty or does not include legacy keys, you merge any existing
legacy fields from _load_recent_worker_input_defaults(runtime, allowed_keys)
(notably "task" and "user_request") into input_data before calling
runtime.trigger; update the code around allowed_keys, input_data, and the call
sites of _load_recent_worker_input_defaults, _normalize_worker_input_value, and
runtime.trigger so legacy "task"/"user_request" values are preserved for older
workers.
- Around line 989-1016: The helper currently scans all sessions via
sessions_dir.glob("session_*/state.json") and can leak inputs across unrelated
runs; restrict the search to the same run/session instead of global scan by
adding and using a run/session identifier (e.g., current_run_id or
current_session_id) and only process state files whose raw_state contains a
matching "run_id"/"session_id" (or whose parent directory matches the current
session identifier) before merging inputs; update the code that iterates
state_path and the caller of this function to accept the current run/session id,
and keep the existing logic around merged_input, work_keys, best_score,
best_updated_at and best_payload unchanged so only same-run state.json files are
considered.
- Around line 3863-3875: The code currently validates the new package after the
current worker has already been unloaded; move the validation and import
precheck to run before removing the active worker so a failed validation doesn't
leave the system without a live worker. Specifically, call
_run_package_validation(str(resolved_path)) and check
_validation_blocks_stage_or_run(validation_report) and _validation_failures(...)
(and the subsequent import precheck) before invoking whatever routine currently
unloads the active worker, and only proceed to unload and replace the worker
after these checks succeed.
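The same-run filtering suggested for the session scan could look like this sketch, under the assumption that each persisted state.json carries a "run_id" field (the real identifier may be a session id instead):

```python
import json
from pathlib import Path

# Hypothetical sketch: only yield state files whose run id matches the
# current run, instead of globbing every session on disk.


def iter_same_run_states(sessions_dir, current_run_id):
    """Yield (path, state) pairs for state.json files from the current run."""
    for state_path in sorted(Path(sessions_dir).glob("session_*/state.json")):
        try:
            raw_state = json.loads(state_path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable state files are skipped, not fatal
        if raw_state.get("run_id") == current_run_id:
            yield state_path, raw_state
```

The caller would pass its own run id, so the existing merged_input/best_score selection logic can stay unchanged while operating only on same-run candidates.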

In `@core/framework/tools/worker_monitoring_tools.py`:
- Around line 103-110: When resolving session_id (the logic around
session_id/default_session_id and sessions_dir) ensure you validate that the
selected session directory actually contains persisted worker state and not an
empty/stale folder; if the resolved (explicit or default) session dir is missing
required state files (the same checks used later for
session_status/health_status), return an error dict (e.g. {"error": "No
persisted worker state for session <id>"} ) instead of allowing session_status
to fall through to "unknown"/"healthy". Update the branch that picks session_id
(and the equivalent logic in the later block around lines 128-202) to perform
this existence/content check on (sessions_dir / session_id) and bail out with an
error when the directory is empty or lacks expected state artifacts.
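The existence check described there can be sketched as follows. Treating state.json as the required artifact is an assumption; the real tool may look for other persisted files:

```python
from pathlib import Path

# Hypothetical sketch: validate the resolved session directory before
# deriving session_status/health_status from it.


def resolve_session_state(sessions_dir, session_id):
    """Return the session dir, or the error dict the tool should surface
    when no persisted worker state exists for that session."""
    session_dir = Path(sessions_dir) / session_id
    if not session_dir.is_dir() or not (session_dir / "state.json").is_file():
        return {"error": f"No persisted worker state for session {session_id}"}
    return session_dir
```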

In `@core/frontend/src/lib/run-inputs.ts`:
- Around line 41-46: The canShowRunButton predicate (canShowRunButton) currently
returns true when sessionId/ready/queenPhase are set, which exposes the Run
button before the workspace's graph/topology (structured input requirements)
finishes loading; update the predicate to also require a graph/topology-ready
flag (e.g., add a new parameter like graphReady or structuredInputsLoaded) and
include it in the Boolean check so the Run button only appears after
RUNNABLE_PHASES.has(queenPhase) && graphReady; alternatively, if you prefer not
to change the UI predicate, modify handleRun to call
getStructuredRunInputKeys(...) and bail (or prompt for inputs) when keys are
missing until graph metadata is loaded—reference canShowRunButton, handleRun,
and getStructuredRunInputKeys to locate where to add the additional readiness
check.

In `@core/frontend/src/pages/workspace.tsx`:
- Around line 2888-2915: The code sets lastRunInputData before
executionApi.trigger resolves, so on non-424 failures the questionnaire is lost;
change behavior to only update lastRunInputData inside the successful .then(...)
handler (i.e., remove lastRunInputData from the initial updateAgentState call)
or, if you prefer to keep the initial optimistic update, restore the
questionnaire in the .catch(...) non-424 branch by calling
updateAgentState(activeWorker, { pendingQuestions: /* original questions */ ,
pendingQuestion: /* original */, pendingOptions: /* original */,
lastRunInputData: null }) so failed validation doesn't clear the widget; update
references to updateAgentState, executionApi.trigger, lastRunInputData,
pendingQuestions and ensure appendSystemMessage and the 424 handling
(setCredentialAgentPath, setCredentialsOpen) remain unchanged.
- Around line 741-759: The cached run input (state.lastRunInputData) is
forwarded whole into inputData even when some structured keys were removed or
renamed; update the assignment for inputData to trim/filter
state.lastRunInputData to only include keys in requiredRunKeys (when
requiredRunKeys.length > 0) so stale fields are not sent to the worker—use
hasAllStructuredRunInputs to gate prompting as before, and when building/passing
inputs (inputData) create a new object that picks only requiredRunKeys from
state.lastRunInputData (falling back to {} if no requiredRunKeys or no cached
data) so updateAgentState and subsequent worker calls receive only the current
schema fields.

In `@tools/coder_tools_server.py`:
- Around line 2124-2143: The tests block incorrectly marks failing or errored
runs as passed; update the logic in the tests result handling (where
steps["tests"] is populated) so that when "error" in test_result OR when not
all_passed (computed from test_result.get("failed"/"errors")), set
steps["tests"]["passed"] = False (and still populate warning/warnings/failures
as currently done) instead of True; ensure validate_agent_package() will then
see the correct failed status by using the updated passed flag.
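A minimal sketch of the corrected logic, assuming test_result carries "failed"/"errors" counts and an optional "error" key as the comment describes; summarize_tests is a hypothetical stand-in for the inline code that populates steps["tests"]:

```python
# Hypothetical stand-in for the steps["tests"] handling described above.
# A run only counts as passed when there is no "error" key and nothing
# failed or errored.
def summarize_tests(test_result: dict) -> dict:
    all_passed = (
        test_result.get("failed", 0) == 0
        and test_result.get("errors", 0) == 0
    )
    passed = "error" not in test_result and all_passed
    step = {"passed": passed}
    if not passed:
        # Keep populating diagnostics as the current code already does.
        step["failures"] = test_result.get("failures", [])
    return step
```

With this shape, validate_agent_package() sees the failed status from the same passed flag it already reads.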
- Around line 510-524: The current classifier treats the "identity_prompt is
blank..." finding as non-blocking, letting agents with only an identity prompt
placeholder pass; update the blocking marker set in tools/coder_tools_server.py
by adding the exact emitted phrase or a robust substring such as
"identity_prompt is blank" (or "identity_prompt" if broader) to blocking_markers
so that when _behavior_validation_errors() yields that finding it is appended to
blocking instead of warnings.
- Line 2306: The current assignment for intro_message inserts the raw result of
_default_intro_message(human_name, _draft_desc) into a quoted Python literal,
which breaks when the draft contains quotes or newlines; replace that raw
interpolation by emitting a properly escaped Python string literal (e.g., use
repr(...) or json.dumps(...) on the _default_intro_message(...) result) so
intro_message receives a safe, escaped string; update the code that builds the
config text where intro_message: str = "{_default_intro_message(human_name,
_draft_desc)}" is produced to instead call repr/_default_escape on
_default_intro_message(human_name, _draft_desc) before embedding.
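The escaping problem can be reproduced in isolation. The _default_intro_message below is only a stand-in with the same shape as the real helper; the point is that repr() (or json.dumps()) emits a literal the generated config can safely embed:

```python
import ast

# Stand-in for the real helper; the exact wording doesn't matter, only
# that the draft may contain quotes and newlines.
def _default_intro_message(name: str, desc: str) -> str:
    return f'Hi, I am "{name}" - {desc}'

intro = _default_intro_message("Codex Agent", 'does "quoted"\nmulti-line work')

# Raw interpolation into a quoted literal breaks on quotes/newlines...
broken = f'intro_message: str = "{intro}"'
# ...while repr() produces a properly escaped Python string literal.
safe = f"intro_message: str = {intro!r}"

# The safe form round-trips through the Python parser.
parsed = ast.literal_eval(safe.split("= ", 1)[1])
```

Here `broken` is not even parseable Python once the draft contains a quote or newline, while `safe` always is.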

---

Minor comments:
In `@core/framework/runtime/event_bus.py`:
- Around line 538-539: The except block that currently does
logger.error(f"Handler error for {event.type}: {e}") should preserve the stack
trace; update the handler exception logging (in the event dispatch/handler loop
in event_bus.py where the except Exception as e block is) to include the
exception info by using logger.error(..., exc_info=True) or
logger.exception(...) so the traceback is retained when a handler fails.
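For illustration, a standalone dispatch loop showing the difference: logger.exception() (shorthand for logger.error(..., exc_info=True)) keeps the traceback that the current f-string call drops:

```python
import io
import logging

logger = logging.getLogger("event_bus_demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)

def dispatch(handler, event_type: str = "NODE_STARTED") -> None:
    try:
        handler()
    except Exception:
        # Was: logger.error(f"Handler error for {event_type}: {e}") -- no traceback.
        logger.exception("Handler error for %s", event_type)

def bad_handler():
    raise ValueError("boom")

dispatch(bad_handler)
log_text = stream.getvalue()
```

The captured log now contains the full "Traceback (most recent call last):" block alongside the message.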

In `@core/framework/server/app.py`:
- Around line 58-63: The code constructs candidate with Path(agent_path) even
though the user input was already sanitized into raw_path, so untrimmed inputs
like "  examples/some_agent  " bypass trimming. Initialize candidate with
Path(raw_path) instead (i.e., replace Path(agent_path) with Path(raw_path)) so
all subsequent resolution and _REPO_ROOT joining operate on the
trimmed/sanitized value; apply the change where candidate is assigned and
continues to be used.
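Why this matters: Path() preserves surrounding whitespace inside path segments, so the untrimmed value never resolves inside the repo. A small sketch with illustrative paths:

```python
from pathlib import PurePosixPath

agent_path = "  examples/some_agent  "   # raw, untrimmed user input
raw_path = agent_path.strip()            # the sanitized value computed earlier

repo_root = PurePosixPath("/repo")
untrimmed = repo_root / PurePosixPath(agent_path)  # whitespace survives in segments
trimmed = repo_root / PurePosixPath(raw_path)
```

Only the trimmed join yields a path that can match an on-disk agent directory.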

In `@core/tests/test_codex_planning_phase.py`:
- Around line 105-111: The test is timing-sensitive because
shutdown_after_first_block uses asyncio.sleep; instead create a shared
asyncio.Event (e.g., blocked) alongside received, set blocked.set() inside the
capture(event: AgentEvent) callback after updating received and
phase_state.planning_ask_rounds (and after calling
_client_input_counts_as_planning_ask), and replace the sleep/wait logic in
shutdown_after_first_block with awaiting blocked.wait() before calling
node.signal_shutdown(); keep references to shutdown_after_first_block,
node.execute(ctx), capture, received, phase_state.planning_ask_rounds and
_client_input_counts_as_planning_ask to locate and update the code accordingly.
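The event-based handshake, reduced to a self-contained example; capture and shutdown_after_first_block here only mimic the shape of the real test:

```python
import asyncio

async def main() -> list:
    received: list = []
    blocked = asyncio.Event()

    def capture(event) -> None:
        received.append(event)
        blocked.set()                  # signal readiness instead of relying on timing

    async def producer():
        await asyncio.sleep(0)         # yield control, as the node under test would
        capture("CLIENT_INPUT_REQUESTED")

    async def shutdown_after_first_block():
        await blocked.wait()           # deterministic: no fixed sleep
        received.append("shutdown")

    await asyncio.gather(producer(), shutdown_after_first_block())
    return received

result = asyncio.run(main())
```

The shutdown path now runs strictly after the first captured event, regardless of scheduler timing.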

In `@core/tests/test_event_loop_node.py`:
- Around line 872-878: The test only asserts both CLIENT_OUTPUT_DELTA and
CLIENT_INPUT_REQUESTED exist but not their ordering; update the test to locate
the indices of the first CLIENT_OUTPUT_DELTA and first CLIENT_INPUT_REQUESTED in
received (e.g. using enumerate and next or a helper) and assert that the
CLIENT_OUTPUT_DELTA index is less than the CLIENT_INPUT_REQUESTED index before
keeping the existing checks on output_events[0].data["snapshot"] and
input_events[0].data["prompt"]; reference the variables received, output_events,
input_events and EventType to find and compare their positions.
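The ordering assertion can be written with enumerate/next; strings stand in for EventType members here, while the real test would compare event.type:

```python
# Strings stand in for EventType members in this sketch.
received = [
    "NODE_STARTED",
    "CLIENT_OUTPUT_DELTA",
    "CLIENT_INPUT_REQUESTED",
]

def first_index(events: list, wanted: str) -> int:
    return next(i for i, ev in enumerate(events) if ev == wanted)

output_idx = first_index(received, "CLIENT_OUTPUT_DELTA")
input_idx = first_index(received, "CLIENT_INPUT_REQUESTED")
assert output_idx < input_idx, "snapshot must be emitted before the input request"
```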

In `@core/tests/test_session_manager_worker_handoff.py`:
- Around line 174-177: Replace the flaky fixed sleep with a deterministic wait
that times out: instead of await asyncio.sleep(0.05), use asyncio.wait_for with
a short polling loop that checks the AsyncMock state until consolidate has been
awaited (e.g., poll consolidate.await_count or consolidate.awaited flag) and
raise on timeout; then run consolidate.assert_awaited_once() and assert
queen_node.inject_event.await_count == 0. Target the existing mock names
(consolidate and queen_node.inject_event) so the test reliably waits for the
mocked coroutine to be awaited rather than sleeping.
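A standalone version of the suggested wait: a short polling loop bounded by asyncio.wait_for instead of a fixed sleep, with consolidate mimicking the mocked coroutine:

```python
import asyncio
from unittest.mock import AsyncMock

async def wait_until_awaited(mock: AsyncMock, timeout: float = 1.0) -> None:
    # Poll await_count; wait_for raises TimeoutError if it never increments.
    async def poll():
        while mock.await_count == 0:
            await asyncio.sleep(0.001)
    await asyncio.wait_for(poll(), timeout=timeout)

async def main() -> int:
    consolidate = AsyncMock()

    async def background():
        await asyncio.sleep(0.01)      # simulates the handoff task's delay
        await consolidate()

    task = asyncio.create_task(background())
    await wait_until_awaited(consolidate)
    consolidate.assert_awaited_once()
    await task
    return consolidate.await_count

count = asyncio.run(main())
```

The same helper can wrap queen_node.inject_event to assert it was never awaited within the timeout window.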

---

Nitpick comments:
In `@core/framework/server/session_manager.py`:
- Around line 211-250: suspend_queen currently only clears
Session.worker_handoff_sub and Session.memory_consolidation_sub but does not
remove queen-local listeners that create_queen registers on the shared EventBus,
causing listener leaks; fix by having create_queen store the subscription IDs it
registers into a new Session attribute (e.g., session.queen_local_subs or
similar) and then modify suspend_queen to iterate over session.queen_local_subs,
call session.event_bus.unsubscribe(...) for each saved id (with the same
try/except pattern used for other subs), and then clear session.queen_local_subs
(set to None or empty) along with the other session fields so all queen-local
listeners are unsubscribed when parking.

In `@core/framework/tools/queen_lifecycle_tools.py`:
- Around line 1059-1078: The helper _get_default_entry_input_keys currently
picks entry_points[0] but runtime.trigger uses entry_point_id="default", so
change the lookup to find the entry spec whose identifier/name equals "default"
(e.g., check attributes like id or name on each spec) and fall back to the first
spec if not found; then derive entry_node_id from that matched entry_spec (or
graph.entry_node if missing) and use graph.get_node to return its input_keys
(preserving existing None/empty guards). This ensures
_get_default_entry_input_keys resolves the same entry point used by
runtime.trigger.
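The lookup the comment describes, sketched with SimpleNamespace stand-ins for the real entry specs:

```python
from types import SimpleNamespace

def find_default_entry(entry_points: list):
    # Prefer the spec whose id/name is "default", matching runtime.trigger.
    for spec in entry_points:
        ident = getattr(spec, "id", None) or getattr(spec, "name", None)
        if ident == "default":
            return spec
    return entry_points[0]  # fall back to the first spec, as suggested

specs = [
    SimpleNamespace(id="batch", entry_node="ingest"),
    SimpleNamespace(id="default", entry_node="process"),
]
chosen = find_default_entry(specs)
```

From `chosen`, the helper would derive entry_node_id (falling back to graph.entry_node) and call graph.get_node to read its input_keys, preserving the existing None/empty guards.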

In `@core/tests/test_session_manager_worker_validation.py`:
- Around line 78-81: The test currently hard-codes subprocess argument positions
in assertions against captured["cmd"]; instead, search the cmd list for the
validator script name and assert presence and correct surrounding args without
relying on fixed indices. Concretely, in the test that exercises
_run_validation_report_sync(), replace assertions like captured["cmd"][4] and
captured["cmd"][6] with: assert any(arg for arg in captured["cmd"] if
"_validate_agent_package_impl" in arg), then locate that element's index via
captured["cmd"].index(...) and assert the next/expected argument equals
"/tmp/demo_agent" (and assert that the literal
"validate_agent_package(agent_name)" is not present anywhere using not
any(...)). This makes the test robust to reordering while still verifying the
validator is invoked with the correct agent path.
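An index-free version of those assertions; `cmd` below is a made-up example of the captured argv, not the project's actual command:

```python
# Illustrative captured argv; only its shape matters.
cmd = [
    "python", "-c",
    "import sys; from tools._validate_agent_package_impl import main; main()",
    "--agent-path", "/tmp/demo_agent",
]

# Assert presence without hard-coding positions.
assert any("_validate_agent_package_impl" in arg for arg in cmd)
idx = next(i for i, arg in enumerate(cmd) if "_validate_agent_package_impl" in arg)
# The agent path must appear after the validator invocation, wherever it sits.
assert "/tmp/demo_agent" in cmd[idx:]
assert not any("validate_agent_package(agent_name)" in arg for arg in cmd)
```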

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e29641bb-ac20-4d06-a0bd-fd343184ec19

📥 Commits

Reviewing files that changed from the base of the PR and between 86ef6fd and e28d989.

📒 Files selected for processing (37)
  • core/framework/config.py
  • core/framework/graph/conversation.py
  • core/framework/graph/event_loop_node.py
  • core/framework/graph/executor.py
  • core/framework/graph/prompt_composer.py
  • core/framework/llm/codex_adapter.py
  • core/framework/llm/codex_backend.py
  • core/framework/llm/litellm.py
  • core/framework/runner/runner.py
  • core/framework/runtime/event_bus.py
  • core/framework/runtime/execution_stream.py
  • core/framework/server/app.py
  • core/framework/server/queen_orchestrator.py
  • core/framework/server/routes_execution.py
  • core/framework/server/routes_sessions.py
  • core/framework/server/session_manager.py
  • core/framework/server/tests/test_api.py
  • core/framework/tools/queen_lifecycle_tools.py
  • core/framework/tools/worker_monitoring_tools.py
  • core/frontend/src/lib/run-inputs.test.ts
  • core/frontend/src/lib/run-inputs.ts
  • core/frontend/src/pages/workspace.tsx
  • core/tests/test_codex_control_parity.py
  • core/tests/test_codex_parity_gate.py
  • core/tests/test_codex_planning_phase.py
  • core/tests/test_config.py
  • core/tests/test_continuous_conversation.py
  • core/tests/test_event_loop_node.py
  • core/tests/test_litellm_provider.py
  • core/tests/test_queen_lifecycle_validation.py
  • core/tests/test_queen_orchestrator.py
  • core/tests/test_session_manager_worker_handoff.py
  • core/tests/test_session_manager_worker_validation.py
  • core/tests/test_validate_agent_path.py
  • core/tests/test_worker_monitoring_tools.py
  • tools/coder_tools_server.py
  • tools/tests/test_coder_tools_server.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
core/framework/tools/queen_lifecycle_tools.py (1)

987-999: ⚠️ Potential issue | 🟠 Major

Stop falling back to unrelated sessions when session_id is known.

If a specific session_id is supplied but its state.json does not exist yet, this still falls back to session_*/state.json and can borrow another run's defaults. That reintroduces cross-session input bleed on fresh runs/reruns.

Suggested fix
     if session_id:
         state_path = sessions_dir / session_id / "state.json"
         candidate_state_paths = [state_path] if state_path.exists() else []
     else:
         candidate_state_paths = []

-    if not candidate_state_paths:
+    if session_id and not candidate_state_paths:
+        return {}
+
+    if not candidate_state_paths:
         candidate_state_paths = sorted(
             sessions_dir.glob("session_*/state.json"),
             key=lambda path: path.stat().st_mtime if path.exists() else 0,
             reverse=True,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/tools/queen_lifecycle_tools.py` around lines 987 - 999, When a
specific session_id is given the code currently falls back to other sessions'
state.json if that session's state file is missing; stop that by only using the
glob-based fallback when no session_id was provided. Change the fallback
condition so the sorted(sessions_dir.glob("session_*/state.json")) assignment
runs only when session_id is falsy (e.g. replace "if not candidate_state_paths:"
with "if not candidate_state_paths and not session_id:" or equivalent). This
touches the candidate_state_paths variable and the session_id/state.json logic
in queen_lifecycle_tools.py (look for the session_id block and the later
fallback that calls sessions_dir.glob).
🧹 Nitpick comments (9)
core/frontend/src/lib/run-inputs.ts (1)

1-8: Consider extracting QueenPhase to a shared location.

The QueenPhase type is duplicated here and in chat-helpers.ts. If phases change, both need updates. A shared types.ts or barrel export would reduce drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/frontend/src/lib/run-inputs.ts` around lines 1 - 8, QueenPhase is
duplicated across modules; extract the type into a single shared module and
import it where needed. Create a shared types file (e.g., export type QueenPhase
= "planning" | "building" | "staging" | "running";), export the QueenPhase type,
then replace the inline declaration in run-inputs.ts (the QueenPhase type) and
the duplicate in chat-helpers.ts to import { QueenPhase } from that shared
module; ensure any existing references (e.g., RUNNABLE_PHASES or functions
expecting QueenPhase) still type-check after the import and update exports if
you need to re-export the type from a barrel.
core/framework/llm/litellm.py (2)

1696-1697: Remove duplicate import.

FinishEvent, TextDeltaEvent, and TextEndEvent are already imported at line 1684-1689 within this function.

♻️ Remove duplicate import
             if not output_text:
                 return []
-            from framework.llm.stream_events import FinishEvent, TextDeltaEvent, TextEndEvent
-
             usage = getattr(response, "usage", None)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/llm/litellm.py` around lines 1696 - 1697, Remove the duplicate
import of FinishEvent, TextDeltaEvent, and TextEndEvent: locate the redundant
"from framework.llm.stream_events import FinishEvent, TextDeltaEvent,
TextEndEvent" inside the same function (the earlier import at the top of that
function already brings these in) and delete this duplicate import so the three
symbols are only imported once.

663-668: Minor: redundant conditional branches for max_tokens.

Both branches of the if not stream / else set max_tokens identically. This could be simplified to a single unconditional assignment.

♻️ Suggested simplification
-        if not stream:
-            kwargs["max_tokens"] = max_tokens
-        else:
-            kwargs["max_tokens"] = max_tokens
-            if not self._is_anthropic_model():
-                kwargs["stream_options"] = {"include_usage": True}
+        kwargs["max_tokens"] = max_tokens
+        if stream and not self._is_anthropic_model():
+            kwargs["stream_options"] = {"include_usage": True}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/llm/litellm.py` around lines 663 - 668, The two-branch
conditional redundantly assigns kwargs["max_tokens"] in litellm.py; simplify by
assigning kwargs["max_tokens"] = max_tokens unconditionally, then keep the
stream-specific logic: if stream is truthy and not self._is_anthropic_model()
add kwargs["stream_options"] = {"include_usage": True}. Update the block around
the existing use of variables stream, max_tokens, kwargs and the call to
self._is_anthropic_model() in the same function to remove the unused if/else.
tools/coder_tools_server.py (2)

2073-2086: In-process import for behavior validation may leak module state across calls.

The behavior validation imports the agent module in-process (lines 2073-2086) after clearing stale entries from sys.modules. While the cache clearing helps, this approach differs from other validation steps that use subprocesses for isolation. If _validate_agent_package_impl is called multiple times in the same process with different agents, the sys.path modification (line 2076) accumulates paths.

This is likely acceptable for the current use case (subprocess-per-validation from session_manager), but consider documenting this limitation or using a subprocess for full isolation if the function is called in-process repeatedly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/coder_tools_server.py` around lines 2073 - 2086, The code in
_validate_agent_package_impl modifies sys.path by inserting package_parent and
clears sys.modules before importlib.import_module(package_name), which can
accumulate package_parent entries if called multiple times; update the function
to either (a) restore sys.path after the import (remove the inserted
package_parent if you added it) to avoid path accumulation and ensure you only
clear the specific sys.modules entries you previously removed, or (b) switch
this validation path to run in a subprocess for full isolation (matching other
validators); also add a short docstring/note on _validate_agent_package_impl
documenting the in-process limitation if you keep the current approach.

477-486: Entry-node intake detection may have false positives for legitimate parsing nodes.

The heuristic at lines 477-486 flags entry nodes that look like "intake/config parsers" based on keyword matching. However, agents that legitimately need to validate/normalize input before doing real work may get blocked. The condition combines intake_like and not direct_work_like with other signals, but terms like "validate", "normalize", "config", "path" (lines 463-475) are common in legitimate work nodes too.

Consider documenting the escape hatch (e.g., adding a direct-work hint to the node description) or relaxing the blocking behavior to a warning for edge cases where real work nodes happen to mention validation.

core/framework/server/tests/test_api.py (4)

663-691: Make terminal-choice test fully deterministic and side-effect explicit.

Using free-form "No, stop here" may couple this unit test to fuzzy matching. Prefer the canonical option string and assert queen.inject_event is not called before queen is parked.

Suggested test hardening
 async def test_chat_done_for_now_parks_queen_without_new_followup(self):
@@
-        async with TestClient(TestServer(app)) as client:
+        queen_node = session.queen_executor.node_registry["queen"]
+        async with TestClient(TestServer(app)) as client:
             resp = await client.post(
                 "/api/sessions/test_agent/chat",
-                json={"message": "No, stop here"},
+                json={"message": "Done for now"},
             )
@@
-        queen_node = session.queen_executor
-        assert queen_node is None
+        queen_node.inject_event.assert_not_awaited()
+        assert session.queen_executor is None
         session.event_bus.emit_client_output_delta.assert_awaited_once()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/tests/test_api.py` around lines 663 - 691, The test
test_chat_done_for_now_parks_queen_without_new_followup should use the exact
terminal option string from the event history instead of free-form input and
explicitly assert queen.inject_event is not called before the queen is parked;
update the POST payload to use "Done for now" (matching the options list from
session.event_bus.get_history) and add an assertion that
session.queen_executor.inject_event was not awaited/called prior to verifying
queen_node is None, ensuring the test checks both deterministic input and the
absence of the side-effect on queen.inject_event.

931-943: Assert stop result payload, not just HTTP status.

The test name says stop behavior should continue despite validation failures; asserting stopped in response would validate the functional outcome too.

Suggested assertion addition
         async with TestClient(TestServer(app)) as client:
             resp = await client.post(
                 "/api/sessions/test_agent/stop",
                 json={"execution_id": "exec_abc"},
             )
             assert resp.status == 200
+            data = await resp.json()
+            assert data["stopped"] is True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/tests/test_api.py` around lines 931 - 943, The test
test_stop_ignores_worker_validation_failure currently only asserts HTTP 200;
after posting to "/api/sessions/test_agent/stop" with execution_id "exec_abc"
using TestClient(TestServer(app)), parse the JSON response body and assert the
expected functional outcome (e.g., that the payload contains "stopped": true or
the appropriate field indicating the execution was stopped) so the test verifies
the stop result in addition to resp.status; update the assertion after the POST
accordingly and keep existing setup of session.worker_validation_failures and
session.worker_runtime._mock_streams["default"]._execution_tasks["exec_abc"]
intact.

867-891: Add a no-execution assertion for resume validation blocking.

Like /trigger, /resume should prove that runtime execution is not started when worker validation fails.

Suggested test hardening
 async def test_resume_blocks_invalid_loaded_worker(self, sample_session, tmp_agent_dir):
@@
         session = _make_session(tmp_dir=tmp_path / ".hive" / "agents" / agent_name)
+        session.worker_runtime.trigger = AsyncMock()
         session.worker_validation_report = {
             "valid": False,
             "steps": {"tool_validation": {"passed": False}},
         }
@@
             assert data["validation_failures"] == [
                 "tool_validation: missing tool execute_command_tool"
             ]
+        session.worker_runtime.trigger.assert_not_awaited()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/tests/test_api.py` around lines 867 - 891, The test
should also assert that runtime execution is not started when worker validation
fails: mock or spy the function that the resume endpoint would call to begin
execution (e.g., the Session.start_runtime or the app's runtime-start function
used by the "/api/sessions/{agent}/resume" handler) before making the POST, then
after the POST assert that this start function was never called (or that no
runtime task/process was created), in addition to the existing 409 and
validation failure assertions; update test_resume_blocks_invalid_loaded_worker
to include that no-start assertion.

588-606: Also assert trigger is not invoked on validation failure.

This test checks the 409 payload, but it should additionally guard against accidental runtime execution when validation fails.

Suggested test hardening
 async def test_trigger_blocks_invalid_loaded_worker(self):
     session = _make_session()
+    session.worker_runtime.trigger = AsyncMock()
     session.worker_validation_report = {
         "valid": False,
         "steps": {"behavior_validation": {"passed": False}},
     }
@@
             data = await resp.json()
             assert "failed validation" in data["error"]
             assert data["validation_failures"] == ["behavior_validation: placeholder prompt"]
+        session.worker_runtime.trigger.assert_not_awaited()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/server/tests/test_api.py` around lines 588 - 606, The test
test_trigger_blocks_invalid_loaded_worker should also assert that the runtime
trigger path is not executed when validation fails; add a spy/mock (e.g.,
unittest.mock.AsyncMock) on the method that the endpoint would call to run the
agent (attach to the session object used in the test — e.g., session.trigger or
the worker run method used by your codebase) before creating the app, perform
the POST as shown, then assert that the mocked method was not awaited/called;
keep the existing 409 and payload assertions and only add this non-invocation
check to prevent accidental runtime execution.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/framework/graph/executor.py`:
- Around line 1477-1480: The call to
continuous_conversation.update_system_prompt currently uses
list(next_spec.output_keys or []) which converts None to an empty list and thus
overwrites prior output keys; change it to preserve None semantics by passing
None when next_spec.output_keys is None and otherwise passing
list(next_spec.output_keys) so that update_system_prompt receives None to mean
"no change" and a list only when keys are explicitly provided.
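In isolation, the None-preserving conversion looks like this (helper name hypothetical):

```python
def normalize_output_keys(output_keys):
    # None means "no change" downstream; only explicit keys become a list.
    return None if output_keys is None else list(output_keys)
```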

In `@core/framework/tools/queen_lifecycle_tools.py`:
- Around line 4192-4200: The persisted defaults loaded via
_load_recent_worker_input_defaults are being replayed wholesale into input_data,
which can reintroduce stale keys; change the construction of input_data to first
filter the dict returned by _load_recent_worker_input_defaults(runtime,
allowed_keys, session_id=session_id) to only include keys present in
allowed_keys (as computed by _get_default_entry_input_keys(runtime)), then pass
those filtered items through _normalize_worker_input_value before building
input_data; preserve the existing legacy/no-input-keys fallback behavior (do not
change the fallback path) and ensure references remain to allowed_keys,
_load_recent_worker_input_defaults, _normalize_worker_input_value, runtime, and
session_id so only current allowed keys are replayed.
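The filtering step, sketched with a stand-in normalizer (the real _normalize_worker_input_value may do more than trim strings):

```python
def _normalize_worker_input_value(value):
    # Stand-in for the real normalizer.
    return value.strip() if isinstance(value, str) else value

def build_input_data(persisted: dict, allowed_keys: list) -> dict:
    return {
        key: _normalize_worker_input_value(value)
        for key, value in persisted.items()
        if key in allowed_keys                 # drop stale/renamed fields
    }

persisted = {"ticker": " AAPL ", "legacy_field": "stale"}
input_data = build_input_data(persisted, ["ticker", "horizon"])
```

Stale keys such as legacy_field are silently dropped, and missing allowed keys simply stay absent rather than being replayed from another schema version.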
- Around line 1228-1238: In _run_package_validation: do not return None when the
validator is missing, when executor raises, or when _parse_validation_report
returns None; instead return a failure report dict so callers treat it as a
validation failure. Locate registry._tools.get("validate_agent_package"), call
validator.executor({"agent_name": agent_ref}) inside a try/except, await
coroutine results as now, and if validator is None or an exception is raised or
_parse_validation_report(result) yields None, return a dict like {"valid":
False, "errors": [...], "raw_result": result_or_error} (or similar schema used
by callers) so the gate fails closed; keep using
_parse_validation_report(result) for the normal success path.
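A fail-closed shape for that helper, with all names illustrative (the real _parse_validation_report and registry wiring differ):

```python
import asyncio
import inspect

def _parse_validation_report(result):
    # Stand-in: accept only dicts that already carry a "valid" flag.
    return result if isinstance(result, dict) and "valid" in result else None

async def run_package_validation(registry_tools: dict, agent_ref: str) -> dict:
    validator = registry_tools.get("validate_agent_package")
    if validator is None:
        return {"valid": False, "errors": ["validator not registered"], "raw_result": None}
    try:
        result = validator({"agent_name": agent_ref})
        if inspect.isawaitable(result):
            result = await result
    except Exception as exc:
        return {"valid": False, "errors": [str(exc)], "raw_result": None}
    report = _parse_validation_report(result)
    if report is None:
        return {"valid": False, "errors": ["unparseable report"], "raw_result": result}
    return report

missing = asyncio.run(run_package_validation({}, "demo_agent"))
ok = asyncio.run(run_package_validation(
    {"validate_agent_package": lambda args: {"valid": True}}, "demo_agent"
))
```

Every degraded path returns a failure dict rather than None, so callers treat a broken validator the same as a failed validation.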


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 274a781f-cf56-4307-8c2b-b13d4022b6c5

📥 Commits

Reviewing files that changed from the base of the PR and between e28d989 and 148f61a.

📒 Files selected for processing (19)
  • core/framework/graph/executor.py
  • core/framework/llm/codex_adapter.py
  • core/framework/llm/codex_backend.py
  • core/framework/llm/litellm.py
  • core/framework/server/routes_execution.py
  • core/framework/server/session_manager.py
  • core/framework/server/tests/test_api.py
  • core/framework/tools/queen_lifecycle_tools.py
  • core/framework/tools/worker_monitoring_tools.py
  • core/frontend/src/lib/run-inputs.test.ts
  • core/frontend/src/lib/run-inputs.ts
  • core/frontend/src/pages/workspace.tsx
  • core/tests/test_config.py
  • core/tests/test_litellm_provider.py
  • core/tests/test_queen_lifecycle_validation.py
  • core/tests/test_session_manager_worker_validation.py
  • core/tests/test_worker_monitoring_tools.py
  • tools/coder_tools_server.py
  • tools/tests/test_coder_tools_server.py
✅ Files skipped from review due to trivial changes (1)
  • tools/tests/test_coder_tools_server.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • core/tests/test_litellm_provider.py
  • core/tests/test_config.py
  • core/tests/test_session_manager_worker_validation.py
  • core/frontend/src/pages/workspace.tsx
  • core/framework/server/session_manager.py


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
core/tests/test_queen_lifecycle_validation.py (1)

596-601: Unused mock attribute can be removed for clarity.

_get_primary_session_state is mocked but never called since run_agent_with_input now passes session_state=None directly. The mock is harmless but misleading.

🔧 Remove unused mock
     runtime = SimpleNamespace(
         resume_timers=MagicMock(),
         trigger=AsyncMock(),
-        _get_primary_session_state=MagicMock(return_value={}),
         graph=SimpleNamespace(nodes=[]),
     )

And similarly at line 637:

     runtime = SimpleNamespace(
         resume_timers=MagicMock(),
         trigger=AsyncMock(return_value="exec-1"),
-        _get_primary_session_state=MagicMock(return_value={}),
         get_entry_points=lambda: [SimpleNamespace(id="default", entry_node="process")],

Also applies to: 637-637

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/tests/test_queen_lifecycle_validation.py` around lines 596-601: The
runtime fixture creation includes an unused mocked attribute
_get_primary_session_state which is no longer called because
run_agent_with_input passes session_state=None; remove the
_get_primary_session_state=MagicMock(...) entry from the SimpleNamespace
initialization in both places (the runtime declarations around the test at lines
~596 and ~637) so the runtime only contains resume_timers, trigger, and graph to
avoid misleading unused mocks.
core/framework/tools/queen_lifecycle_tools.py (2)

2995-3021: Judge pressure detection has a potential edge case with mixed actions.

The streak detection breaks on ACCEPT or any action not in _NON_ACCEPT_JUDGE_ACTIONS, which includes actions not in the expected set. If a new verdict action is added (e.g., "PAUSE"), it would break the streak even though it's not an ACCEPT.

This is likely acceptable since unknown actions are rare, but worth noting for future maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/tools/queen_lifecycle_tools.py` around lines 2995-3021: The
current _get_recent_judge_pressure streak loop breaks on any action not in
_NON_ACCEPT_JUDGE_ACTIONS which causes new/unknown non-ACCEPT actions (e.g.,
"PAUSE") to prematurely end the streak; change the loop in
_get_recent_judge_pressure to only break when action == "ACCEPT" and treat any
other non-empty action as part of the non-ACCEPT streak (append unknown actions
to streak so they count), keeping the existing compression step and final
threshold check; reference the function name _get_recent_judge_pressure and the
constant _NON_ACCEPT_JUDGE_ACTIONS when making this update.

1071-1108: Preflight timeout is appropriate but consider surfacing the warning.

The 15-second timeout with a warning log is reasonable. However, if preflight times out, the worker proceeds without credential validation or MCP resync. The queen won't know about this unless it checks logs.

Consider returning a status flag or including a warning in the trigger response so the queen can inform the user that preflight was incomplete.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/framework/tools/queen_lifecycle_tools.py` around lines 1071-1108:
change the _preflight_worker_run signature and behavior to surface when
preflight did not complete: have _preflight_worker_run return a bool (True =
preflight completed, False = preflight incomplete/timed out) and set a
short-lived flag on the session (e.g., session.preflight_incomplete_warning =
True) when the timeout path is hit so callers (the queen/trigger response) can
surface the warning; specifically, update async def
_preflight_worker_run(session, runtime, timeout_seconds) to return bool, set
cred_error handling and MCP resync as-is but in the except TimeoutError block
set session.preflight_incomplete_warning = True and return False, otherwise
return True when _preflight finishes successfully so callers can include the
warning in the trigger response.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@core/framework/tools/queen_lifecycle_tools.py`:
- Around line 2995-3021: The current _get_recent_judge_pressure streak loop
breaks on any action not in _NON_ACCEPT_JUDGE_ACTIONS which causes new/unknown
non-ACCEPT actions (e.g., "PAUSE") to prematurely end the streak; change the
loop in _get_recent_judge_pressure to only break when action == "ACCEPT" and
treat any other non-empty action as part of the non-ACCEPT streak (append
unknown actions to streak so they count), keeping the existing compression step
and final threshold check; reference the function name
_get_recent_judge_pressure and the constant _NON_ACCEPT_JUDGE_ACTIONS when
making this update.
- Around lines 1071-1108: change the _preflight_worker_run signature and
behavior to surface when preflight did not complete: have _preflight_worker_run
return a bool (True = preflight completed, False = preflight incomplete/timed
out) and set a short-lived flag on the session (e.g.,
session.preflight_incomplete_warning = True) when the timeout path is hit so
callers (the queen/trigger response) can surface the warning; specifically,
update async def _preflight_worker_run(session, runtime, timeout_seconds) to
return bool, set cred_error handling and MCP resync as-is but in the except
TimeoutError block set session.preflight_incomplete_warning = True and return
False, otherwise return True when _preflight finishes successfully so callers
can include the warning in the trigger response.

In `@core/tests/test_queen_lifecycle_validation.py`:
- Around line 596-601: The runtime fixture creation includes an unused mocked
attribute _get_primary_session_state which is no longer called because
run_agent_with_input passes session_state=None; remove the
_get_primary_session_state=MagicMock(...) entry from the SimpleNamespace
initialization in both places (the runtime declarations around the test at lines
~596 and ~637) so the runtime only contains resume_timers, trigger, and graph to
avoid misleading unused mocks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d908ef30-3cd3-44ef-a9da-3ece023ede89

📥 Commits

Reviewing files that changed from the base of the PR and between 148f61a and f56600f.

📒 Files selected for processing (2)
  • core/framework/tools/queen_lifecycle_tools.py
  • core/tests/test_queen_lifecycle_validation.py
