fix(run): bound RAG subprocesses + GEAK_DISABLE_RAG + ccache (follow-up to #319, clears GEAK 7800s hang) #321
Merged
…ache/MAX_JOBS

Follow-up to #316/#319. release/v3.2.2 already degrades RAG on an exception, but the pip-install and index-build subprocess calls had no timeout, so a hang (no exception, just blocking) still consumed the whole per-kernel budget. This is the observed 'TimeoutExpired after 7800s' that dominated the last 24h of GEAK failures (~96%).

- Add timeouts to both RAG subprocesses (GEAK_RAG_INSTALL_TIMEOUT=180s, GEAK_RAG_INDEX_TIMEOUT=600s) so a hang raises TimeoutExpired -> the existing except degrades RAG and continues.
- GEAK_DISABLE_RAG=1 to skip the optional RAG setup entirely.
- MAX_JOBS cap + ccache compiler launchers so CK/.cu harness rebuilds are incremental.

All default-safe (setdefault; RAG stays best-effort).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
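The timeout-then-degrade behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper names `run_bounded` and `setup_rag` and the command arguments are hypothetical, and only the environment variable names and default values come from the commit message.

```python
import os
import subprocess


def run_bounded(cmd, timeout_env, default_timeout):
    """Run cmd with a timeout read from the environment.

    Returns True on success and False on timeout or failure, so the
    caller can degrade instead of blocking. (Hypothetical helper.)
    """
    timeout = float(os.environ.get(timeout_env, default_timeout))
    try:
        subprocess.run(
            cmd,
            check=True,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError) as exc:
        # A hang now surfaces as TimeoutExpired and takes the same
        # degrade path as any other failure.
        print(f"RAG step skipped ({type(exc).__name__}); continuing without RAG")
        return False


def setup_rag(install_cmd, index_cmd):
    """Best-effort RAG setup; returns True only if both steps finished."""
    if os.environ.get("GEAK_DISABLE_RAG") == "1":
        return False  # explicit opt-out: skip the optional setup entirely
    return run_bounded(install_cmd, "GEAK_RAG_INSTALL_TIMEOUT", 180) and run_bounded(
        index_cmd, "GEAK_RAG_INDEX_TIMEOUT", 600
    )
```

`subprocess.run` kills the child and reaps it before raising `TimeoutExpired`, so a hung install cannot outlive its bound.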
…nnected" 401

The core42 SaFE proxy occasionally returns a 401 whose body is "Client is not connected to the query engine, you must call connect()". Despite the 401 status this is a transient proxy-side fault (its upstream connection dropped mid-burst), not a credential failure: the next identical call succeeds. LiteLLM surfaces it as AuthenticationError, which together with its parent APIError sat in the no-retry exclusion list, so a single blip aborted the whole optimization round.

Replace the static retry_if_not_exception_type exclusion with a predicate (should_retry_llm_error) that retries this specific transient signature with the existing exponential backoff, while keeping a genuine bad-key 401 (any other AuthenticationError text) fatal so we don't spin for minutes on an actually invalid credential. Timeouts and rate limits keep retrying as before.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
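A predicate of the kind this commit describes might look like the sketch below. To stay self-contained it uses stand-in exception classes and a hand-written retry loop rather than LiteLLM and the project's retry decorator; only the name `should_retry_llm_error` and the error text come from the commit message, and the treatment of other `APIError` values is an assumption.

```python
import time


class APIError(Exception):
    """Stand-in for the LLM client's base API error (assumption)."""


class AuthenticationError(APIError):
    """Stand-in for the 401 error type (assumption)."""


TRANSIENT_401 = "Client is not connected to the query engine"


def should_retry_llm_error(exc):
    """Return True if the call should be retried after exc."""
    if isinstance(exc, AuthenticationError):
        # Retry only the transient proxy signature; a bad key stays fatal.
        return TRANSIENT_401 in str(exc)
    if isinstance(exc, APIError):
        return False  # assumed: other API errors remain excluded from retry
    return True  # timeouts, rate limits, etc. keep retrying


def call_with_retry(fn, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying with exponential backoff while the predicate allows."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1 or not should_retry_llm_error(exc):
                raise
            sleep(base_delay * 2**attempt)
```

Matching on the message text is brittle if the proxy changes its wording, but it keeps the retry narrowly scoped to the one known transient case.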
yueliu14 approved these changes on Jun 30, 2026
…mbie thread)

_invoke_profiler_mcp used 'with ThreadPoolExecutor(...) as pool', whose __exit__ calls shutdown(wait=True) and joins the worker. When the profiler hangs (ROCprofiler-SDK / LD_PRELOAD contention on a busy host), future.result(timeout) raises on time, but the implicit join then blocks forever on the still-running worker, defeating the timeout and starving the whole preprocess budget.

Manage the pool explicitly and shutdown(wait=False, cancel_futures=True) in finally so the timeout is actually honored and we fall through to a profile-less run.

Complements the GEAK_SKIP_PROFILE default-on behaviour: even when the profiler is opted back in, a single hang can no longer hang the run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
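The pattern described here can be illustrated with a small sketch. The wrapper name `call_with_timeout` is hypothetical and stands in for the body of `_invoke_profiler_mcp`, which is not shown in this conversation; `cancel_futures` requires Python 3.9 or later.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout):
    """Run fn in a worker thread; return None if it exceeds timeout.

    The pool is managed explicitly instead of via `with`, because the
    context manager's exit would call shutdown(wait=True) and join the
    (possibly hung) worker.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn)
        return future.result(timeout=timeout)
    except FutureTimeout:
        return None  # caller falls through to a profile-less run
    finally:
        # Do not join the worker; drop any queued work that has not started.
        pool.shutdown(wait=False, cancel_futures=True)
```

Note that this abandons the worker rather than killing it: the thread keeps running until `fn` returns, and the interpreter still waits for it at exit. Fully isolating a hung profiler would need a subprocess instead of a thread.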
…round FULL_BENCHMARK)

The subagent --benchmark already runs the same full config set as --full-benchmark (harness contract: _cap(ALL_CONFIGS), uncapped under GEAK_MAX_BENCHMARK_SHAPES=0), so the post-round FULL_BENCHMARK re-measures identical shapes - pure redundancy that also times out on heavy CK/.cu rebuilds.

Adopt the in-loop full-set speedup as verified and skip the FB subprocess, while keeping the clean-worktree CORRECTNESS (the only worktree-bypass guard). Opt back in with GEAK_VERIFY_IN_LOOP=0.

Guard: only adopt-and-skip when a usable in-loop number exists; otherwise fall through to FB so the round can still earn a value.

The downstream HL paired same-config A/B stays the real E2E arbiter (it re-measures throughput itself and never reads a GEAK FB artifact), so micro is used only for ranking/promotion.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
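The adopt-or-fall-through guard might be sketched as below. The function name, its arguments, and the definition of a "usable" number (present and positive) are assumptions made for illustration; only the `GEAK_VERIFY_IN_LOOP` variable and its `0` opt-out come from the commit message, and the clean-worktree correctness check is deliberately left out.

```python
import os


def resolve_round_speedup(in_loop_speedup, run_full_benchmark):
    """Pick the speedup to record for a round.

    in_loop_speedup: float or None, measured by the in-loop benchmark.
    run_full_benchmark: zero-argument callable that runs the separate
    full-benchmark step and returns its speedup (hypothetical).
    """
    # Assumed default: adopt the in-loop number unless explicitly set to "0".
    verify_in_loop = os.environ.get("GEAK_VERIFY_IN_LOOP", "1") != "0"
    usable = in_loop_speedup is not None and in_loop_speedup > 0
    if verify_in_loop and usable:
        return in_loop_speedup, "in_loop"
    # No usable in-loop number (or opted out): the round can still earn a value.
    return run_full_benchmark(), "full_benchmark"
```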
rodosingh approved these changes on Jun 30, 2026
Follow-up to #319 (merged). That merge captured only the first commit (`GEAK_VERIFY_IN_LOOP`); this lands the second commit that was pushed afterward.

Why it matters: in the last 24h of CI, GEAK failed ~96% of attempts, dominated by `TimeoutExpired after 7800s`: the agent stalling on a blocking, raise-on-failure `pip install rag-mcp` + FAISS index build at startup, plus uncapped CK/.cu builds. This commit:

- makes RAG setup best-effort (`GEAK_DISABLE_RAG` to skip; otherwise timeout-bounded install/index build that disables RAG and continues instead of hanging/raising);
- caps `MAX_JOBS` + routes the compiler through `ccache` (when present) so CK rebuilds are incremental.

All default-safe (`setdefault`; RAG degrades, never blocks). Without this, `release/v3.2.2` still hangs GEAK at startup.
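A rough sketch of the "default-safe" build settings, assuming a CMake-driven build that honours the `CMAKE_C_COMPILER_LAUNCHER` and `CMAKE_CXX_COMPILER_LAUNCHER` environment variables (CMake 3.17+). The helper name and the job cap of 8 are hypothetical; the PR text does not state the actual value or which launcher variables it sets.

```python
import os
import shutil


def apply_build_defaults(env, max_jobs=8):
    """Fill in build defaults without overriding anything the caller set."""
    # setdefault keeps an operator-provided MAX_JOBS intact.
    env.setdefault("MAX_JOBS", str(max_jobs))
    # Route compiles through ccache only when it is actually installed.
    if shutil.which("ccache"):
        env.setdefault("CMAKE_C_COMPILER_LAUNCHER", "ccache")
        env.setdefault("CMAKE_CXX_COMPILER_LAUNCHER", "ccache")
    return env


if __name__ == "__main__":
    build_env = apply_build_defaults(dict(os.environ))
    print(build_env["MAX_JOBS"])
```

Because every assignment goes through `setdefault`, an environment that already pins these variables behaves exactly as before.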