Add SubprocessEvaluator for process-isolated evaluation #77
When evaluating candidate programs that can corrupt process state (e.g. CUDA kernels with illegal memory access, C extensions that segfault), the in-process Evaluator allows one bad candidate to poison all subsequent evaluations within the same CUDA context.

SubprocessEvaluator provides a middle ground between the in-process Evaluator (no isolation) and ContainerizedEvaluator (requires Docker):
- Each evaluate() call spawns a fresh Python subprocess
- Child process gets its own CUDA context / address space
- If a candidate crashes, only the subprocess dies
- ~100-200ms overhead per evaluation for process startup
- Same evaluate(program_path) -> dict interface as Evaluator

Usage: set `evaluator.subprocess_isolation: true` in config YAML. The auto-detection in create_evaluator() checks this flag after Harbor/Container detection but before falling back to in-process.

Includes 6 tests covering: successful evaluation, noisy stdout parsing, crash isolation, exception handling, crash-then-success recovery, and timeout behavior.
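To make the isolation property concrete, here is a minimal sketch of running one candidate per fresh interpreter and recovering after a crash. It is not the PR's implementation: the `run_isolated` helper, the `__RESULT__` sentinel used to pick the result out of noisy stdout, and the error dict shape are all assumptions made for illustration, and it uses blocking `subprocess.run` for brevity.

```python
import json
import subprocess
import sys

SENTINEL = "__RESULT__"  # hypothetical marker; the PR does not show its stdout format


def run_isolated(code: str, timeout: float = 30.0) -> dict:
    """Run `code` in a fresh Python interpreter and return its metrics dict."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"error": "timeout"}

    # A crash (abort, segfault) only kills the child; the parent sees an exit code.
    if proc.returncode != 0:
        return {"error": f"child exited with code {proc.returncode}"}

    # Scan from the end so unrelated prints before the result are ignored.
    for line in reversed(proc.stdout.splitlines()):
        if line.startswith(SENTINEL):
            return json.loads(line[len(SENTINEL):])
    return {"error": "no result line found"}


# A well-behaved candidate that also prints noise, and one that aborts the process.
GOOD = 'import json; print("noise"); print("__RESULT__" + json.dumps({"score": 1.0}))'
BAD = "import os; os.abort()"
```

Calling `run_isolated(BAD)` followed by `run_isolated(GOOD)` mirrors the crash-then-success recovery case: the second call starts from a clean process, so the earlier abort has no effect on it.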
Code Review
This pull request introduces SubprocessEvaluator to run candidate evaluations in isolated Python subprocesses, preventing crashes from affecting the parent process. Feedback on the implementation highlights several key areas for improvement: handling non-JSON-serializable return types (such as EvaluationResult or numpy arrays) in the wrapper script, fixing a potential resource leak and NameError during temporary file creation, replacing the blocking run_in_executor pattern with a non-blocking asyncio.create_subprocess_exec to avoid orphaned processes on timeout, and removing unnecessary sys.path modifications in the parent process.
- Wrapper template: handle EvaluationResult objects via to_dict(), use default=str for non-serializable types (numpy arrays etc.)
- Fix temp file leak: assign temp_path before write, guard cleanup with `if temp_path and os.path.exists(temp_path)`
- Replace run_in_executor + subprocess.run with asyncio.create_subprocess_exec for proper timeout handling (proc.kill() + await proc.wait() on timeout)
- Remove unnecessary sys.path modification in parent process (child gets eval_dir via PYTHONPATH env var)
- Remove unused subprocess import
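The temp-file guard and the timeout handling described above can be sketched together as follows. This is an illustrative outline under assumptions, not the code from the PR: `run_script` and its return dict are hypothetical names, and only standard-library calls are used.

```python
import asyncio
import os
import sys
import tempfile


async def run_script(source: str, timeout: float) -> dict:
    """Write `source` to a temp file, run it in a child interpreter, clean up."""
    temp_path = None  # bound before the try, so `finally` can never hit a NameError
    try:
        fd, temp_path = tempfile.mkstemp(suffix=".py")
        with os.fdopen(fd, "w") as f:
            f.write(source)

        proc = await asyncio.create_subprocess_exec(
            sys.executable,
            temp_path,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            stdout, _stderr = await asyncio.wait_for(
                proc.communicate(), timeout=timeout
            )
        except asyncio.TimeoutError:
            # Kill and reap the child so no orphaned process outlives the timeout.
            proc.kill()
            await proc.wait()
            return {"timed_out": True, "returncode": proc.returncode, "stdout": ""}

        return {
            "timed_out": False,
            "returncode": proc.returncode,
            "stdout": stdout.decode(),
        }
    finally:
        # Guarded cleanup: skip if mkstemp never succeeded or the file is gone.
        if temp_path and os.path.exists(temp_path):
            os.remove(temp_path)
```

Because the event loop awaits the child directly, a timeout can terminate the process itself, whereas a `subprocess.run` call parked in an executor thread keeps blocking that thread until the child exits.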
Reuse the existing SafeJSONEncoder from checkpoint_manager (with an inline fallback if the import fails in the subprocess) instead of the generic default=str approach.
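A rough sketch of the import-with-fallback pattern mentioned here is shown below. The import path `checkpoint_manager` is a guess (the commit only names the module, not its package), and the fallback's behavior is an assumption; the real `SafeJSONEncoder` may serialize types differently.

```python
import json

try:
    # Hypothetical import path; the commit only says the encoder lives in
    # checkpoint_manager, not where that module sits in the package.
    from checkpoint_manager import SafeJSONEncoder
except ImportError:

    class SafeJSONEncoder(json.JSONEncoder):
        """Inline fallback: degrade unknown objects to something JSON can hold."""

        def default(self, obj):
            if hasattr(obj, "to_dict"):  # e.g. an EvaluationResult-like object
                return obj.to_dict()
            if hasattr(obj, "tolist"):  # numpy arrays and scalars expose tolist()
                return obj.tolist()
            return str(obj)  # last resort, same effect as default=str


def dump_result(result) -> str:
    """Serialize an evaluation result without raising on unusual value types."""
    return json.dumps(result, cls=SafeJSONEncoder)
```

Keeping the fallback inline means the wrapper script still produces output when the child's import path does not include the package that defines the encoder.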
- Fix black violations in subprocess_evaluator.py (argument-per-line, slice spacing)
- Replace sys.modules stubbing in test_subprocess_evaluator.py with normal imports; the stub for skydiscover.config poisoned the module cache and caused ImportError in tests collected afterwards

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Adds `SubprocessEvaluator`, a new evaluator that runs each candidate in a separate Python subprocess, providing process-level isolation without requiring Docker.
Problem
When evaluating candidate programs that can corrupt process state (e.g. CUDA kernels with illegal memory access, C extensions that segfault, memory corruption), the in-process `Evaluator` allows one bad candidate to poison all subsequent evaluations. For GPU workloads, `cudaErrorIllegalAddress` is sticky: once triggered, the CUDA context is permanently corrupted and all further operations fail.
Solution
`SubprocessEvaluator` provides a middle ground between `Evaluator` (fast, no isolation) and `ContainerizedEvaluator` (full Docker).
Usage
Set `evaluator.subprocess_isolation: true` in config YAML. The auto-detection in `create_evaluator()` checks this flag after Harbor/Container detection but before falling back to in-process.
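The selection order described above could look roughly like the sketch below. Everything except the `subprocess_isolation` flag is hypothetical: `EvaluatorConfig`, `use_container`, and `choose_evaluator_kind` are stand-ins, and the single boolean replaces whatever Harbor/Container detection the real `create_evaluator()` performs, which the PR does not show.

```python
from dataclasses import dataclass


@dataclass
class EvaluatorConfig:
    """Hypothetical stand-in for the evaluator section of the config."""

    use_container: bool = False  # placeholder for Harbor/Container detection
    subprocess_isolation: bool = False  # the flag added by this PR


def choose_evaluator_kind(cfg: EvaluatorConfig) -> str:
    """Return which evaluator would be built, in the order the PR describes."""
    if cfg.use_container:  # checked first: container-based evaluation wins
        return "containerized"
    if cfg.subprocess_isolation:  # checked second: opt-in process isolation
        return "subprocess"
    return "in_process"  # default fallback
```

With this ordering, enabling `subprocess_isolation` has no effect on setups that already resolve to a containerized evaluator.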
Test
Includes 6 tests covering: successful evaluation, noisy stdout parsing, crash isolation, exception handling, crash-then-success recovery, and timeout behavior.
Usage