Goal
Validate a small Rust-based process supervisor spike for PawWork's child-process lifecycle, with the primary motivation being reliability (not performance).
This issue is intentionally a bounded spike, not an implementation commitment. The current packages/opencode/src/util/process.ts uses cross-spawn + Node child_process and handles abort via a self-written terminateTree(). The known gaps on the reliability axis include Windows grandchild leaks under taskkill /f /t, Unix process group / signal race during terminate, and timeout vs. output-drain ordering.
Scope
In scope:
- Build a minimal Rust supervisor sidecar that covers exactly four lifecycle pieces:
  - spawn (with stdio piping)
  - cancel (graceful → forceful termination ladder)
  - tree termination (Windows Job Objects + Unix process groups via nix)
  - output drain (ensure stdout/stderr buffers are flushed before exit-status finalization)
- Validate on macOS, Linux, and Windows with a fixture matrix:
  - grandchild process tree kill
  - timeout + slow output drain race
  - abort during in-flight write
  - signal forwarding (SIGINT / SIGTERM under Unix; CTRL_C_EVENT under Windows)
- Compare reliability against current terminateTree() baseline, not raw spawn latency.
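To make the "graceful → forceful termination ladder" concrete, here is a minimal sketch of the decision logic only, assuming a single grace period between the two rungs. The names CancelLadder, CancelAction, and step are hypothetical and not part of any existing code. Delivering the signals (SIGTERM/SIGKILL to a process group, or job termination on Windows) is left to the platform backends.

```rust
use std::time::Duration;

/// One rung of the termination ladder (illustrative names, not an existing API).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum CancelAction {
    /// Ask the tree to exit, e.g. SIGTERM to the process group.
    SendGraceful,
    /// Graceful request is outstanding; keep draining output and waiting.
    Wait,
    /// Grace period expired; kill the tree, e.g. SIGKILL or job termination.
    SendForceful,
}

/// Tracks which rungs have already been used for one cancel request.
#[derive(Debug, Default)]
struct CancelLadder {
    graceful_sent: bool,
    forceful_sent: bool,
}

impl CancelLadder {
    /// Decide the next action given the time elapsed since cancel was requested.
    /// Returns None once the forceful rung has been used: nothing is left to escalate to.
    fn step(&mut self, since_cancel: Duration, grace: Duration) -> Option<CancelAction> {
        if self.forceful_sent {
            return None;
        }
        if !self.graceful_sent {
            self.graceful_sent = true;
            return Some(CancelAction::SendGraceful);
        }
        if since_cancel >= grace {
            self.forceful_sent = true;
            return Some(CancelAction::SendForceful);
        }
        Some(CancelAction::Wait)
    }
}
```

A caller would invoke step on each poll tick after a cancel request and stop as soon as the child is reaped. Keeping the policy free of I/O means the same ladder can be unit-tested identically on all three OSes.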
Out of scope:
- No replacement of packages/opencode/src/util/process.ts in production code.
- No global rewrite of cross-spawn callers.
- No Node child_process removal.
- No performance optimization as a stated goal. If spawn p50 changes, it is a side effect, not a success criterion.
- No new dependency added to the published Electron app yet. The supervisor lives only in the spike workspace.
Approach
- Rust supervisor crate using tokio::process for async spawn + stdio + cancel semantics, with kill_on_drop enabled to reduce orphan leaks.
- Unix backend: nix for setsid, killpg, waitpid. Cargo itself uses this pattern for its own subprocess tree cleanup.
- Windows backend: Job Object via the windows crate (or a dedicated helper). Create the job with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE and bind every spawned child to it, so that closing the last job handle terminates all descendants. This matches Cargo's Windows process-tree cleanup pattern.
- Communication boundary: spike supervisor speaks a minimal stdin/stdout protocol (line-based JSON or length-prefixed framing) so the JS layer can swap it in behind a feature flag later without rewriting call sites.
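As one possible shape for the length-prefixed option, the sketch below frames each message as a 4-byte big-endian length followed by the payload. The function names, the 1 MiB cap, and the header width are assumptions made for illustration; the issue does not fix any of them.

```rust
use std::io::{self, Read, Write};

/// Upper bound on one frame's payload; an arbitrary limit chosen for this sketch.
const MAX_FRAME_LEN: u32 = 1024 * 1024;

/// Write one frame: 4-byte big-endian length, then the payload bytes.
fn write_frame<W: Write>(out: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    let len = payload.len() as u32;
    out.write_all(&len.to_be_bytes())?;
    out.write_all(payload)?;
    out.flush()
}

/// Read one frame. Returns Ok(None) on a clean EOF at a frame boundary,
/// and an error if the stream ends partway through a header or payload.
fn read_frame<R: Read>(input: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut header = [0u8; 4];
    let mut filled = 0;
    while filled < header.len() {
        match input.read(&mut header[filled..]) {
            Ok(0) if filled == 0 => return Ok(None),
            Ok(0) => {
                return Err(io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "truncated frame header",
                ))
            }
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    let len = u32::from_be_bytes(header);
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len as usize];
    input.read_exact(&mut payload)?;
    Ok(Some(payload))
}
```

The payload is opaque bytes here, so it could carry a JSON message such as a hypothetical {"op":"cancel","id":1}. Compared with line-based JSON, an explicit length prefix lets the reader tell a clean shutdown apart from a message truncated by a dying process.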
Verification
For the spike to succeed:
- cargo test covers the four lifecycle pieces with the cross-platform fixture matrix above.
- Grandchild tree kill is verifiable on all three OSes:
  - spawn a tree that forks at least one extra layer
  - issue cancel
  - assert no leaked PIDs after a bounded grace period (ps/tasklist snapshot)
- Output drain race: spawn a fast producer + slow consumer, issue cancel mid-stream, assert all buffered bytes arrive before exit-status finalization.
- Cross-OS CI matrix: macOS, Linux, and Windows GitHub-hosted runners run the fixtures, with known sources of flakiness called out (e.g., AV sandbox interference).
- Comparative report against the current terminateTree() baseline: which fixtures the current implementation passes, which it fails, which the supervisor newly covers.
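The ordering the output-drain fixture is meant to assert (all buffered bytes arrive before the exit status is finalized) can be sketched with the standard library alone. This is a simplified illustration, not the proposed tokio-based supervisor: it has no cancel path, and the names Finished and run_and_drain are hypothetical.

```rust
use std::io::{self, Read};
use std::process::{Command, ExitStatus, Stdio};
use std::thread;

/// Everything the caller learns about a finished child, reported together.
struct Finished {
    status: ExitStatus,
    stdout: Vec<u8>,
    stderr: Vec<u8>,
}

/// Spawn `cmd` with piped stdio, drain both pipes to EOF on helper threads,
/// and only then report the exit status alongside the collected bytes.
fn run_and_drain(mut cmd: Command) -> io::Result<Finished> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    let mut out_pipe = child.stdout.take().expect("stdout was piped");
    let mut err_pipe = child.stderr.take().expect("stderr was piped");

    // Separate threads so a full stderr pipe cannot stall stdout, and vice versa.
    let out_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        out_pipe.read_to_end(&mut buf).map(|_| buf)
    });
    let err_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        err_pipe.read_to_end(&mut buf).map(|_| buf)
    });

    // Reap the child, then join the readers before handing anything back:
    // the status is never visible to the caller ahead of the drained bytes.
    let status = child.wait()?;
    let stdout = out_reader.join().expect("stdout reader panicked")?;
    let stderr = err_reader.join().expect("stderr reader panicked")?;
    Ok(Finished { status, stdout, stderr })
}
```

Note that read_to_end returns only once every holder of the pipe's write end has closed it, so a leaked grandchild that inherited stdout would keep this function blocked. That is one reason drain and tree termination need to be tested together rather than separately.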
If verification passes, follow-up tracks adoption behind a feature flag in PawWork's runtime. If verification fails or the maintenance cost outweighs the reliability win, document the negative result and close.
Priority
P2.
This is reliability hardening with a meaningful Windows footprint. Real user-visible regressions in process lifecycle should continue to be filed as narrower P1 bugs and fixed in the existing Node path.
Relevant references
- packages/opencode/src/util/process.ts and current terminateTree() implementation.
Execution mode
Agent should propose a small implementation plan first and confirm the supervisor crate scope before writing Rust. Keep the spike behind a separate workspace path and a feature flag; do not wire it into production code in this issue.