[Task] Spike native Rust process supervisor for reliability (tree kill, cancel race, output drain) #581

@Astro-Han

Goal

Validate a small Rust-based process supervisor spike for PawWork's child-process lifecycle, with the primary motivation being reliability (not performance).

This issue is intentionally a bounded spike, not an implementation commitment. The current packages/opencode/src/util/process.ts uses cross-spawn + Node child_process and handles abort via a hand-written terminateTree(). Known reliability gaps include Windows grandchild leaks under taskkill /f /t, Unix process-group / signal races during termination, and timeout vs. output-drain ordering.

Scope

In scope:

  • Build a minimal Rust supervisor sidecar that covers exactly four lifecycle pieces:
    1. spawn (with stdio piping)
    2. cancel (graceful → forceful termination ladder)
    3. tree termination (Windows Job Objects + Unix process groups via nix)
    4. output drain (ensure stdout/stderr buffers are flushed before exit-status finalization)
  • Validate on macOS, Linux, and Windows with a fixture matrix:
    • grandchild process tree kill
    • timeout + slow output drain race
    • abort during in-flight write
    • signal forwarding (SIGINT / SIGTERM under Unix; CTRL_C_EVENT under Windows)
  • Compare reliability against current terminateTree() baseline, not raw spawn latency.
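
The graceful → forceful ladder in item 2 could be sketched roughly as below. This is only an illustration under stated assumptions: `Target`, `Outcome`, and `cancel` are hypothetical names, and the three trait methods stand in for whatever the platform backend does (for example a signal to the process group on Unix, or terminating the Job Object on Windows). It uses plain std threads instead of tokio to stay dependency-free.

```rust
use std::io;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical abstraction over "the thing being cancelled".
/// A real backend would map these to process-group signals or Job Object calls.
pub trait Target {
    /// Ask the tree to stop politely (e.g. SIGTERM to the group).
    fn request_stop(&mut self) -> io::Result<()>;
    /// Stop the tree unconditionally (e.g. SIGKILL, or terminating the job).
    fn force_stop(&mut self) -> io::Result<()>;
    /// Non-blocking check whether the root child has exited.
    fn has_exited(&mut self) -> io::Result<bool>;
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The child exited within the grace period.
    ExitedGracefully,
    /// The grace period elapsed and the forceful step was issued.
    Forced,
}

/// Graceful -> forceful ladder: request a stop, poll until `grace` elapses,
/// then escalate. Reaping and output drain are left to the caller.
pub fn cancel<T: Target>(target: &mut T, grace: Duration, poll: Duration) -> io::Result<Outcome> {
    target.request_stop()?;
    let deadline = Instant::now() + grace;
    loop {
        if target.has_exited()? {
            return Ok(Outcome::ExitedGracefully);
        }
        if Instant::now() >= deadline {
            break;
        }
        thread::sleep(poll);
    }
    target.force_stop()?;
    Ok(Outcome::Forced)
}
```

Keeping the ladder generic over a trait means the fixture matrix can exercise the escalation logic with a fake target, separately from the OS-specific tree-kill code.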

Out of scope:

  • No replacement of packages/opencode/src/util/process.ts in production code.
  • No global rewrite of cross-spawn callers.
  • No Node child_process removal.
  • No performance optimization as a stated goal. If spawn p50 changes, it is a side effect, not a success criterion.
  • No new dependency added to the published Electron app yet. The supervisor lives only in the spike workspace.

Approach

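  • Rust supervisor crate using tokio::process for async spawn + stdio + cancel semantics, with kill_on_drop enabled to reduce orphan leaks. Note that kill_on_drop only kills the direct child, so tree termination still depends on the platform backends below.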
  • Unix backend: nix for setsid, killpg, waitpid. Cargo itself uses this pattern for its own subprocess tree cleanup.
  • Windows backend: Job Object via the windows crate (or a dedicated helper). Bind every spawned child to the job and set JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, so that closing the job handle terminates all descendants. This matches Cargo's Windows process-tree cleanup pattern.
  • Communication boundary: spike supervisor speaks a minimal stdin/stdout protocol (line-based JSON or length-prefixed framing) so the JS layer can swap it in behind a feature flag later without rewriting call sites.
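
For the communication boundary, the length-prefixed option might look like the sketch below. Nothing here is decided by this issue: the 4-byte big-endian length header, the 1 MiB cap (`MAX_FRAME_LEN`), and the function names `write_frame` / `read_frame` are all assumptions made for illustration. The payload could be a JSON message, but the framing itself is payload-agnostic.

```rust
use std::io::{self, Read, Write};

/// Hypothetical upper bound on a single frame's payload (1 MiB).
const MAX_FRAME_LEN: usize = 1 << 20;

/// Write one frame: a 4-byte big-endian payload length, then the payload.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    let len = payload.len() as u32;
    w.write_all(&len.to_be_bytes())?;
    w.write_all(payload)?;
    w.flush()
}

/// Read one frame. Returns Ok(None) on a clean EOF at a frame boundary,
/// and an error if the stream ends in the middle of a frame.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut header = [0u8; 4];
    let mut filled = 0;
    while filled < header.len() {
        match r.read(&mut header[filled..]) {
            Ok(0) if filled == 0 => return Ok(None),
            Ok(0) => return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated header")),
            Ok(n) => filled += n,
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    let len = u32::from_be_bytes(header) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(Some(payload))
}
```

Compared with line-based JSON, this framing does not need payloads to be newline-free, at the cost of being harder to inspect by hand; the spike can pick either.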

Verification

For the spike to succeed:

  1. cargo test covers the four lifecycle pieces with the cross-platform fixture matrix above.
  2. Grandchild tree kill is verifiable on all three OSes:
    • spawn a tree that forks at least one extra layer
    • issue cancel
    • assert no leaked PIDs after a bounded grace period (ps/tasklist snapshot)
  3. Output drain race: spawn a fast producer + slow consumer, issue cancel mid-stream, assert all buffered bytes arrive before exit-status finalization.
  4. Cross-OS CI matrix: macOS, Linux, and Windows GitHub-hosted runners run the fixtures, with known flake sources called out explicitly (e.g., AV sandbox interference).
  5. Comparative report against the current terminateTree() baseline: which fixtures the current implementation passes, which it fails, which the supervisor newly covers.
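
The ordering that item 3 asserts (all buffered output collected before the exit status is treated as final) can be illustrated with a std-only sketch. `run_and_drain` and `Finished` are hypothetical names, and the spike itself would presumably use tokio::process rather than threads; the point is only the ordering of wait and reader joins.

```rust
use std::io::{self, Read};
use std::process::{Command, ExitStatus, Stdio};
use std::thread;

/// Hypothetical result type: the status is only exposed together with fully drained output.
pub struct Finished {
    pub status: ExitStatus,
    pub stdout: Vec<u8>,
    pub stderr: Vec<u8>,
}

/// Spawn `cmd`, read both pipes to EOF on separate threads, and return the
/// exit status only after both readers have finished.
pub fn run_and_drain(mut cmd: Command) -> io::Result<Finished> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    let mut out_pipe = child.stdout.take().expect("stdout was piped");
    let mut err_pipe = child.stderr.take().expect("stderr was piped");

    // Separate threads so that a full stderr pipe cannot block stdout (and vice versa).
    let out_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        out_pipe.read_to_end(&mut buf).map(|_| buf)
    });
    let err_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        err_pipe.read_to_end(&mut buf).map(|_| buf)
    });

    // Reap the child, then join the readers before reporting anything.
    let status = child.wait()?;
    let stdout = out_reader.join().expect("stdout reader panicked")?;
    let stderr = err_reader.join().expect("stderr reader panicked")?;

    Ok(Finished { status, stdout, stderr })
}
```

One caveat the fixtures should probably cover: read_to_end only returns once every holder of the pipe's write end has closed it, so a leaked grandchild that inherited stdout can stall the drain. That is one reason tree termination and output drain need to be tested together.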

If verification passes, a follow-up issue tracks adoption behind a feature flag in PawWork's runtime. If verification fails or the maintenance cost outweighs the reliability win, document the negative result and close.

Priority

P2.

This is reliability hardening with a meaningful Windows footprint. Real user-visible regressions in process lifecycle should continue to be filed as narrower P1 bugs and fixed in the existing Node path.

Relevant references

Execution mode

The agent should propose a small implementation plan first and confirm the supervisor crate scope before writing Rust. Keep the spike behind a separate workspace path and a feature flag; do not wire it into production code in this issue.

Labels: P2 (Medium priority); harness (Model harness, prompts, tool descriptions, and session mechanics); platform (Electron shell, OS integration, packaging, updater, signing, paths, and permissions); task (Maintainer or agent execution task)
