fix: retry compaction-busy turns and close shell fs jail gaps#228

Merged
andersonleal merged 2 commits into main from fix/compaction-busy-shell-jail
Jun 4, 2026

Conversation

@andersonleal
Collaborator

@andersonleal andersonleal commented Jun 4, 2026

Summary

Fixes two failure modes found while investigating session x7tshdba (the chat died with "response failed: from assistant_streaming: compaction already in progress"), plus a critical jail-escape regression caught in the pre-landing review of the fix itself.

harness — compaction busy is transient, not terminal

  • CompactionBusyError now extends TransientError. The async post-turn compaction of a large session can hold the compaction lease (300s TTL) past busyTimeoutMs (30s); compact_now then returns busy, and the resulting throw used to hit runTransition's catch-all and route the session to the terminal failed state. runTransition now re-throws the error so the durable turn-step queue retries the step once the lease releases. This is the same proven path that session-lease contention already uses (run-transition.ts:94).
  • runPreflight no longer conflates compact_now statuses: only 'ok' maps to 'compacted' (caller reloads messages); 'empty'/unknown statuses log a warning and proceed without claiming the context shrank.
  • Documented the orphaned turn-step queue config at TURN_STEP_QUEUE (store.ts): steps go to the engine default queue while engine.config.yaml configures a turn-step FIFO (session grouping, max_retries 5) that nothing uses. Switching is a deliberate follow-up — it changes scheduling semantics for every step.
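
A minimal sketch of the re-throw behaviour described above, not the harness code. TransientError, CompactionBusyError and runTransition are the names used in this PR, but their bodies here are assumptions, the functions are synchronous for brevity, and runWithRetries is a hypothetical stand-in for the durable turn-step queue.

```typescript
// Illustrative only: class and function bodies are assumptions.
class TransientError extends Error {
  constructor(message: string) {
    super(message);
    this.name = new.target.name;
  }
}

// Busy compaction is now a subclass of TransientError, so it is retryable.
class CompactionBusyError extends TransientError {}

type StepOutcome = { state: "running" | "failed" };

// Catch-all: transient errors propagate to the caller (the queue);
// every other error is recorded as a terminal failure.
function runTransition(step: () => void): StepOutcome {
  try {
    step();
    return { state: "running" };
  } catch (err) {
    if (err instanceof TransientError) throw err;
    return { state: "failed" };
  }
}

// Hypothetical toy queue: re-runs the step while it fails transiently
// and gives up after maxRetries retries.
function runWithRetries(
  step: () => void,
  maxRetries: number,
): { outcome: StepOutcome; attempts: number } {
  for (let attempt = 1; ; attempt++) {
    try {
      return { outcome: runTransition(step), attempts: attempt };
    } catch (err) {
      if (!(err instanceof TransientError) || attempt > maxRetries) throw err;
    }
  }
}
```

In this sketch a step that throws CompactionBusyError twice and then succeeds ends in the running state after three attempts, while a plain Error still maps to failed on the first attempt.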

shell — fs jail errors are recoverable, and the jail actually holds

  • validate_path resolves relative paths against the canonical host_root when jailed (agents commonly probe with `.` or bare names, which previously failed with S210). The canonical starts_with check still bounds the joined path, so `..` cannot escape. Empty paths get an explicit S210.
  • S215 escape errors name the canonical jail root so a caller can self-correct in one step instead of guessing.
  • Jail-escape fix (review finding, multi-model confirmed): rm/chmod/mv/sed need lexical (non-canonicalized) operands for unlink/perm/rename semantics and rebuilt them from the raw request string. With relative inputs accepted, the validated path (host_root/<rel>) and the operated-on path (<cwd>/<rel>) diverged — a destructive escape. New lexical_operand helper anchors relative inputs to the same root validate_path validated.
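
The divergence in the last bullet can be illustrated with a small lexical sketch. The real code is Rust (shell/src/fs/host.rs) and canonicalizes paths against the filesystem; this TypeScript version is an assumption-laden illustration that only normalizes strings. The function names, signatures and error wording are hypothetical; only the S210/S215 codes and the host_root/cwd distinction come from the PR.

```typescript
// Lexical-only illustration: no symlink resolution, no filesystem access.
function normalize(absPath: string): string {
  const out: string[] = [];
  for (const part of absPath.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") out.pop();
    else out.push(part);
  }
  return "/" + out.join("/");
}

// Absolute inputs are kept; relative inputs are joined onto `base`.
function anchor(base: string, requested: string): string {
  return normalize(requested.startsWith("/") ? requested : base + "/" + requested);
}

// Component-wise containment, so "/srv/jail2" is not inside "/srv/jail".
function isInside(root: string, p: string): boolean {
  return p === root || p.startsWith(root === "/" ? "/" : root + "/");
}

// Assumed shape of the jailed check: empty -> S210, escape -> S215 naming the root.
function validatePath(hostRoot: string, requested: string): string {
  if (requested === "") throw new Error("S210: empty path");
  const resolved = anchor(hostRoot, requested);
  if (!isInside(hostRoot, resolved)) {
    throw new Error(`S215: path escapes jail root ${hostRoot}`);
  }
  return resolved;
}

// Pre-fix behaviour (sketch): the operand is rebuilt from the raw request,
// so a relative path lands under the worker's cwd instead of the jail.
function operandFromRawRequest(cwd: string, requested: string): string {
  return anchor(cwd, requested);
}

// Post-fix behaviour (sketch): the operand is anchored to the same root
// that validation used.
function lexicalOperand(hostRoot: string, requested: string): string {
  return anchor(hostRoot, requested);
}
```

With host_root /srv/jail and cwd /home/worker, the request notes/a.txt validates as /srv/jail/notes/a.txt, but the pre-fix operand is /home/worker/notes/a.txt. Anchoring the operand to host_root makes the two agree.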

Test plan

  • harness: full vitest suite green — 1107/1107 (regression test proves the busy throw never writes state: failed; instanceof TransientError contract pinned; both new preflight status branches covered)
  • shell: full cargo test --lib green — 151/151 (+10 new: relative resolution under jail, dotdot escape, empty path jailed/unjailed, S215 names the root, cwd≠host_root regressions for rm/mv/chmod/sed)
  • cargo fmt --all --check clean; Biome clean on changed TS
  • Adversarial review: 3 specialists + Claude adversarial + Codex; the one critical finding (jail escape) fixed in-branch
  • Live smoke: rebuild/restart the shell worker and re-run a jailed fs::ls/rm probe with relative paths
  • Live smoke: long-session compaction overlap no longer kills the chat (re-run the x7tshdba recipe)

Follow-ups (flagged, intentionally out of scope)

  • Route turn steps onto the configured turn-step FIFO queue (per-session ordering + bounded retries) — semantics change for every step, needs its own validation
  • Preflight trigger timeout (60s) vs sync-compaction summarizer budget (~120s) incoherence

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed relative path resolution in filesystem operations to respect configured boundaries
    • Improved error handling and retry behavior for context compaction busy scenarios
  • Improvements

    • Enhanced error messages to include canonical path information for better diagnostics

CompactionBusyError from preflight (async post-turn compaction holding the
300s-TTL lease past busyTimeoutMs on large sessions) propagated through
runTransition's catch-all and killed the chat with a terminal
"response failed: from assistant_streaming: compaction already in progress".

- CompactionBusyError now extends TransientError: runTransition re-throws
  and the durable turn-step queue retries once the lease releases, the same
  proven path the session-lease contention already uses.
- preflight no longer conflates compact_now statuses: only 'ok' maps to
  'compacted' (reload messages); 'empty' or an unknown/drifted status logs
  a warning and proceeds without claiming the context shrank.
- Document the orphaned `turn-step` queue config at TURN_STEP_QUEUE: steps
  go to the engine `default` queue; switching is a deliberate follow-up.
- Tests: regression for the transient re-throw (no terminal state::set),
  direct instanceof contract pin, and the two new status branches.
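
The status mapping described in this commit can be sketched as follows. This is an assumed shape, not the preflight.ts code: mapCompactNowStatus and its warn callback are hypothetical, and the 'overflow' branch with ContextOverflowError is taken from the review walkthrough rather than from this commit message.

```typescript
// Illustrative only: class bodies and the helper name are assumptions.
class TransientError extends Error {}
class CompactionBusyError extends TransientError {}
class ContextOverflowError extends Error {}

type PreflightResult = "ok" | "compacted";

// Maps a compact_now status to what preflight reports to its caller.
function mapCompactNowStatus(
  status: string,
  warn: (message: string) => void,
): PreflightResult {
  switch (status) {
    case "ok":
      // Only a successful compaction claims the context shrank,
      // which tells the caller to reload messages.
      return "compacted";
    case "busy":
      // Transient: the queue retries once the lease releases.
      throw new CompactionBusyError("compaction already in progress");
    case "overflow":
      throw new ContextOverflowError("context overflow");
    default:
      // 'empty' or an unknown/drifted status: warn and proceed
      // without claiming that anything was compacted.
      warn(`compact_now returned '${status}'; proceeding without reload`);
      return "ok";
  }
}
```

Routing 'empty' and unknown statuses through the default branch means a future status drift degrades to a logged warning instead of a false 'compacted' result.
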
…e errors

S215/S210 jail errors never named the configured host_root, so jailed
agents could not self-correct: the x7tshdba session burned turns guessing
paths (a relative "." probe died with S210) before falling back to
shell::exec ls.

- validate_path resolves relative paths against the canonical host_root
  when jailed; the canonical starts_with check still bounds the joined
  path, so `..` cannot escape. Empty paths get an explicit S210.
- S215 messages (validate_path and the write parents:true branch) name the
  canonical jail root so a caller can recover in one step.
- lexical_operand: rm/chmod/mv/sed need lexical (non-canonicalized)
  operands for unlink/perm/rename semantics, but rebuilt them from the raw
  request string — with relative inputs now accepted, the validated path
  (host_root/<rel>) and the operated-on path (<cwd>/<rel>) diverged into a
  jail escape. The helper anchors relative inputs to the same root
  validate_path validated, closing the divergence (found in review by two
  independent models).
- Tests: relative resolution under the jail, dotdot escape, empty path
  (jailed and unjailed), S215 names the root (validate_path and
  parents:true write), and cwd!=host_root regressions for rm/mv/chmod/sed.
@vercel

vercel Bot commented Jun 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
workers Ready Ready Preview, Comment Jun 4, 2026 7:16pm

@github-actions
Contributor

github-actions Bot commented Jun 4, 2026

skill-check — worker

0 verified, 14 skipped (no docs/).

Layer Result
structure
vale
ai
render

Four for four. Nicely done.

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9a2b4c0a-f1fa-450f-82c0-d37663ab69c4

📥 Commits

Reviewing files that changed from the base of the PR and between e506635 and faf896d.

📒 Files selected for processing (6)
  • harness/src/turn-orchestrator/errors.ts
  • harness/src/turn-orchestrator/preflight.ts
  • harness/src/turn-orchestrator/state-runtime/store.ts
  • harness/tests/turn-orchestrator/preflight.test.ts
  • harness/tests/turn-orchestrator/run-transition.test.ts
  • shell/src/fs/host.rs

📝 Walkthrough

Walkthrough

This PR makes two independent changes: first, it reclassifies CompactionBusyError as a transient retryable error and updates turn-step preflight/transition logic and tests accordingly; second, it refactors filesystem path handling to properly support relative request paths under a configured host_root, ensuring operations remain jailed.

Changes

Turn-Step Transient Error Handling

Layer / File(s) Summary
Error class hierarchy and transient classification
harness/src/turn-orchestrator/errors.ts
TransientError is reordered before dependent error classes, and CompactionBusyError changes from directly extending Error to extending TransientError, making it automatically classified as retriable by transition logic.
Preflight compaction result handling
harness/src/turn-orchestrator/preflight.ts, harness/tests/turn-orchestrator/preflight.test.ts
runPreflight documentation and logic are updated to clarify CompactionBusyError transience, handle non-compaction outcomes ('empty', unrecognized statuses) by logging a warning and returning 'ok' instead of claiming compaction, and test coverage validates this behavior.
Transient error retry behavior in transitions
harness/tests/turn-orchestrator/run-transition.test.ts
runTransition regression tests verify that CompactionBusyError, as a TransientError instance, is re-thrown during transition (not caught and converted to a failed state), enabling the turn-step queue to retry the operation after compaction completes.
Turn-step queue orchestration documentation
harness/src/turn-orchestrator/state-runtime/store.ts
Documentation clarifies that turn-step wake scheduling is currently wired to the engine's default queue, which leaves the configured turn-step queue orphaned, and notes that switching to the dedicated turn-step queue would intentionally change per-step scheduling semantics and is deferred to a follow-up.

Filesystem Relative Path Support

Layer / File(s) Summary
Path validation and jail-anchored operand computation
shell/src/fs/host.rs
validate_path is updated to reject empty paths, accept relative paths by joining them to the cached canonical host_root, and enforce jail confinement with informative S215 error messages. A new lexical_operand helper produces jail-anchored, lexically-normalized operand paths without forcing canonicalization, ensuring operations target the correct file or link.
Filesystem operations rewired for jail-anchored paths
shell/src/fs/host.rs
Core operations (rm, chmod, mv, sed) are updated to use lexical_operand post-validation instead of direct path normalization, ensuring relative rm/chmod/mv/sed operands remain anchored to the jail root when executed on disk.
Write operation parent-escape error handling
shell/src/fs/host.rs
The write operation's parents:true mode now includes the canonical host_root in the S215 error message when a parent directory would escape the jail, providing clearer feedback about the jail boundary.
Relative path validation and operation correctness tests
shell/src/fs/host.rs
Comprehensive regression test suite covers relative-path validation under host_root, dotdot escape prevention, empty-path rejection, S215 message content naming the jail root, and operation correctness to verify that relative rm/mv/chmod/sed/write with parents:true modify files under the jail rather than the worker's current working directory.

Sequence Diagram(s)

sequenceDiagram
  participant runPreflight
  participant compact_now
  participant caller
  runPreflight->>compact_now: request status
  alt status is 'ok'
    compact_now-->>runPreflight: status: 'ok'
    runPreflight-->>caller: return 'ok'
  else status is 'compacted'
    compact_now-->>runPreflight: status: 'compacted'
    runPreflight-->>caller: return 'compacted'
  else status is 'overflow'
    compact_now-->>runPreflight: status: 'overflow'
    runPreflight-->>runPreflight: throw ContextOverflowError
  else status is 'busy'
    compact_now-->>runPreflight: status: 'busy'
    runPreflight-->>runPreflight: throw CompactionBusyError
  else status is unknown
    compact_now-->>runPreflight: status: 'empty' or other
    runPreflight-->>runPreflight: log warning
    runPreflight-->>caller: return 'ok'
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • iii-hq/workers#208: Modifies TURN_STEP_QUEUE constant value and wake-enqueue tests, related to the queue semantics documented in this PR.
  • iii-hq/workers#198: Changes runTransition to use per-session leases and throw TransientError on contention, foundational to the CompactionBusyError transient classification in this PR.

Suggested reviewers

  • ytallo
  • sergiofilhowz

Poem

🐰 Busy errors now know when to yield and retry,
Paths can roam relative under a jail-root sky,
With anchored operands and clearer errors bright,
Orchestration springs forward—everything working just right!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures both major fixes: compaction-busy retry logic and shell filesystem jail security improvements.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

@andersonleal andersonleal merged commit 1a77553 into main Jun 4, 2026
15 checks passed