fix(run): bound RAG subprocesses + GEAK_DISABLE_RAG + ccache (follow-up to #319 — clears GEAK 7800s hang)#321

Merged
iraj465 merged 4 commits into release/v3.2.2 from fix/geak-rag-ccache
Jun 30, 2026

@iraj465 iraj465 commented Jun 30, 2026

Collaborator

Follow-up to #319 (merged) — that merge captured only the first commit (GEAK_VERIFY_IN_LOOP); this lands the second commit that was pushed afterward.

Why it matters: in the last 24h of CI, GEAK failed ~96% of attempts, dominated by TimeoutExpired after 7800s. The agent was stalling at startup on a blocking, raise-on-failure pip install rag-mcp + FAISS index build, and CK/.cu builds ran uncapped. This commit:

  • makes RAG setup bounded + graceful (GEAK_DISABLE_RAG to skip; otherwise timeout-bounded install/index build that disables RAG and continues instead of hanging/raising);
  • sets MAX_JOBS + routes the compiler through ccache (when present) so CK rebuilds are incremental.

All default-safe (setdefault; RAG degrades, never blocks). Without this, release/v3.2.2 still hangs GEAK at startup.
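
For reviewers who want the shape of the "default-safe" build settings, here is a minimal sketch, not the code in this PR. The function name `configure_build_env`, the default of 8 jobs, and the choice of the `CMAKE_C_COMPILER_LAUNCHER` / `CMAKE_CXX_COMPILER_LAUNCHER` variables are assumptions for illustration; the PR only states that MAX_JOBS is set and the compiler is routed through ccache when present.

```python
import shutil


def configure_build_env(env, max_jobs=8, have_ccache=None):
    """Apply build defaults to `env` without overriding values already set.

    Hypothetical helper: the name and the default job count are illustrative.
    """
    # setdefault keeps any MAX_JOBS the operator exported explicitly.
    env.setdefault("MAX_JOBS", str(max_jobs))

    # Probe PATH unless the caller already knows whether ccache is available.
    if have_ccache is None:
        have_ccache = shutil.which("ccache") is not None

    # Only add launchers when ccache exists, so hosts without it are unaffected.
    if have_ccache:
        env.setdefault("CMAKE_C_COMPILER_LAUNCHER", "ccache")
        env.setdefault("CMAKE_CXX_COMPILER_LAUNCHER", "ccache")
    return env
```

Passing a copy of `os.environ` (or `os.environ` itself) as `env` would apply the defaults to child build processes; anything the caller set beforehand wins.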

…ache/MAX_JOBS

Follow-up to #316/#319. release/v3.2.2 already degrades RAG on EXCEPTION, but the
pip-install / index-build subprocess calls had NO timeout, so a HANG (no exception,
just blocking) still consumed the whole per-kernel budget — the observed
'TimeoutExpired after 7800s' that dominated last-24h GEAK failures (~96%).

- Add timeouts to both RAG subprocesses (GEAK_RAG_INSTALL_TIMEOUT=180s,
  GEAK_RAG_INDEX_TIMEOUT=600s) so a hang raises TimeoutExpired -> the existing
  except degrades RAG and continues.
- GEAK_DISABLE_RAG=1 to skip the optional RAG setup entirely.
- MAX_JOBS cap + ccache compiler launchers so CK/.cu harness rebuilds are incremental.

All default-safe (setdefault; RAG stays best-effort).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
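
The bounded RAG setup described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `setup_rag` and its command arguments are hypothetical names, while the environment variables and their defaults (180s, 600s) come from the message above.

```python
import os
import subprocess
import sys


def setup_rag(install_cmd, index_cmd, env=None):
    """Return True if RAG is ready, False if it was skipped or degraded.

    Hypothetical helper: the real install/index commands are not shown here.
    """
    env = os.environ if env is None else env

    # Explicit opt-out: skip the optional RAG setup entirely.
    if env.get("GEAK_DISABLE_RAG") == "1":
        return False

    steps = [
        (install_cmd, float(env.get("GEAK_RAG_INSTALL_TIMEOUT", "180"))),
        (index_cmd, float(env.get("GEAK_RAG_INDEX_TIMEOUT", "600"))),
    ]
    try:
        for cmd, timeout in steps:
            # On timeout, subprocess.run kills the child and raises
            # TimeoutExpired, so a hang becomes an ordinary exception.
            subprocess.run(cmd, check=True, timeout=timeout)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError) as exc:
        # Degrade instead of raising: the run continues without RAG.
        print(f"RAG disabled: {exc}", file=sys.stderr)
        return False
    return True
```

The key point is that the timeout converts a silent hang into the same failure path that already handled exceptions, so the per-kernel budget is no longer consumed by setup.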
…nnected" 401

The core42 SaFE proxy occasionally returns a 401 whose body is "Client is not
connected to the query engine, you must call connect()". Despite the 401 status
this is a transient proxy-side fault (its upstream connection dropped mid-burst),
not a credential failure — the next identical call succeeds. LiteLLM surfaces it
as AuthenticationError, which together with its parent APIError sat in the
no-retry exclusion list, so a single blip aborted the whole optimization round.

Replace the static retry_if_not_exception_type exclusion with a predicate
(should_retry_llm_error) that retries this specific transient signature with the
existing exponential backoff, while keeping a genuine bad-key 401 (any other
AuthenticationError text) fatal so we don't spin for minutes on an actually
invalid credential. Timeouts and rate limits keep retrying as before.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
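
A rough sketch of the predicate idea from this commit message, for readers who want to see the decision logic in isolation. The exception classes below are local stand-ins (assumptions) so the example runs without LiteLLM installed, and `call_with_retry` is a hypothetical simplified loop, not the retry decorator the project uses. How other error types are treated here is also an assumption.

```python
import time


class APIError(Exception):
    """Stand-in for the client's base API error (hypothetical)."""


class AuthenticationError(APIError):
    """Stand-in for the 401 error type surfaced by the client."""


class RateLimitError(APIError):
    """Stand-in for a rate-limit error."""


# Signature of the transient proxy fault quoted in the commit message.
TRANSIENT_401_MARKER = "Client is not connected to the query engine"


def should_retry_llm_error(exc):
    """Retry the transient 'not connected' 401; keep other auth errors fatal."""
    if isinstance(exc, AuthenticationError):
        return TRANSIENT_401_MARKER in str(exc)
    # Timeouts and rate limits keep retrying, as before.
    return isinstance(exc, (TimeoutError, RateLimitError))


def call_with_retry(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying with exponential backoff when the predicate allows."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1 or not should_retry_llm_error(exc):
                raise
            sleep(base_delay * 2 ** attempt)
```

Matching on the message text is brittle by nature, so a change in the proxy's wording would silently turn this back into a fatal error; that is the safe failure direction for a credential check.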
iraj465 and others added 2 commits June 30, 2026 07:16
…mbie thread)

_invoke_profiler_mcp used 'with ThreadPoolExecutor(...) as pool', whose __exit__
calls shutdown(wait=True) and JOINS the worker. When the profiler hangs
(ROCprofiler-SDK / LD_PRELOAD contention on a busy host) future.result(timeout)
raises on time, but the implicit join then blocks forever on the still-running
worker, defeating the timeout and starving the whole preprocess budget. Manage
the pool explicitly and shutdown(wait=False, cancel_futures=True) in finally so
the timeout is actually honored and we fall through to a profile-less run.

Complements the GEAK_SKIP_PROFILE default-on behaviour: even when the profiler
is opted back in, a single hang can no longer hang the run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
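
The difference between the `with` form and explicit pool management can be sketched as below. `call_with_hard_timeout` is a hypothetical name and this is a simplified illustration, not the body of `_invoke_profiler_mcp`; it assumes Python 3.9+ for `cancel_futures`.

```python
import concurrent.futures as cf


def call_with_hard_timeout(fn, timeout_s):
    """Run fn in a worker thread; return None if it exceeds timeout_s.

    Hypothetical helper illustrating the non-joining shutdown.
    """
    # No 'with' block: its __exit__ would call shutdown(wait=True) and join
    # the worker, blocking for as long as fn keeps running.
    pool = cf.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn)
        return future.result(timeout=timeout_s)
    except cf.TimeoutError:
        # Fall through to a result-less path instead of waiting on the worker.
        return None
    finally:
        # Return immediately and drop any queued work; a worker that is
        # already running is abandoned, not killed.
        pool.shutdown(wait=False, cancel_futures=True)
```

Note that the abandoned worker thread keeps running in the background until `fn` returns; this only guarantees the caller is released on time.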
…round FULL_BENCHMARK)

The subagent --benchmark already runs the SAME full config set as --full-benchmark
(harness contract: _cap(ALL_CONFIGS), uncapped under GEAK_MAX_BENCHMARK_SHAPES=0),
so the post-round FULL_BENCHMARK re-measures identical shapes - pure redundancy
that also times out on heavy CK/.cu rebuilds. Adopt the in-loop full-set speedup
as verified and skip the FB subprocess, while keeping the clean-worktree
CORRECTNESS (the only worktree-bypass guard). Opt back in with GEAK_VERIFY_IN_LOOP=0.

Guard: only adopt-and-skip when a usable in-loop number exists; otherwise fall
through to FB so the round can still earn a value. The downstream HL paired
same-config A/B stays the real E2E arbiter (it re-measures throughput itself and
never reads a GEAK FB artifact), so micro is used only for ranking/promotion.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
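
The adopt-and-skip guard described above might look roughly like this. The function name `should_run_full_benchmark` and the exact definition of a "usable" in-loop number (finite and positive) are assumptions for illustration; only the GEAK_VERIFY_IN_LOOP variable and its opt-out value of 0 come from the commit message.

```python
import math
import os


def should_run_full_benchmark(in_loop_speedup, env=None):
    """Decide whether the post-round FULL_BENCHMARK subprocess should run.

    Hypothetical helper; 'usable' is assumed to mean a finite, positive number.
    """
    env = os.environ if env is None else env

    # GEAK_VERIFY_IN_LOOP=0 restores the previous behaviour: always run FB.
    if env.get("GEAK_VERIFY_IN_LOOP", "1") == "0":
        return True

    usable = (
        in_loop_speedup is not None
        and math.isfinite(in_loop_speedup)
        and in_loop_speedup > 0
    )
    # Without a usable in-loop number, fall through to FB so the round
    # can still earn a value.
    return not usable
```

The clean-worktree correctness check is independent of this decision and would still run in either branch.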
@iraj465 iraj465 requested a review from rodosingh June 30, 2026 09:41
@iraj465 iraj465 merged commit f748830 into release/v3.2.2 Jun 30, 2026
7 of 8 checks passed

4 participants