GEAK kernel-build failures block E2E: CK/.cu preprocess, sglang harness-gen, rag-mcp env crash, timeouts #316

## GEAK-internal kernel-build failures blocking E2E wins (CK/.cu + sglang harness + env + budget)

Observed across GEAKv3+HL fleet sessions over the last 30 hours. When HL dispatches a hot kernel to GEAK,
a large fraction of attempts **never produce a usable patch**: they fail inside GEAK
before any microbenchmark or correctness result is produced. These failures are GEAK-side
(harness-gen, preprocess, runtime env, budget) and distinct from HL's promotion logic (fixed in Hyperloom #759).

### Evidence (fleet sessions)
- **zai-org-GLM-4.7-FP8 (sglang)**: 8/8 GEAK kernels failed → all REVERT, E2E 0%.
  - `int8_kernel.py`: `[harness_gen] no @perftest/@benchmark decorated functions found`
    → falls back to `timeout 600 python bench_int8_quant.py`; then
    `rag-mcp package not found, installing automatically...` → **agent exits (log ends)**.
- **openai-gpt-oss-120b (vllm)**: 7 GEAK kernels, all on `.cu/.cuh`:
  - `custom_all_reduce.cuh`, `quant_kernels.cu` → `error_class=preprocess_failed`
    ("preprocess reported 1 error(s)"), `produced_artifact=false`.
  - large kernels → `error_class=agent_error` / **`exit code 124` (timeout)** at ~130 min.

### Failure modes → proposed fixes (v4 parity)
1. **`.cu`/CK `preprocess_failed`** — GEAK preprocess can't build a harness for CK `.cu`
   kernels. → Emit a buildable `.cu` harness (v4's `.cu` harness path).
2. **harness-gen "no `@perftest/@benchmark` decorators"** (sglang) — generator can't find
   a microbench entrypoint and degrades to raw-script execution. → Synthesize a harness
   from the kernel signature / live shape capture instead of requiring decorators.
3. **`rag-mcp` auto-install crash** — agent dies mid-run installing `rag-mcp`. → Pre-provision
   `rag-mcp` in the image, or make the agent resilient to install failure (skip RAG, continue).
4. **`exit 124` timeouts** — budget exhausted on large kernels (~130 min observed). → Per-kernel
   adaptive budget / earlier checkpointing so partial-but-correct results are emitted.
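
As an illustration of fix 4, the sketch below shows one way a per-kernel budget and best-so-far checkpointing could fit together, so that an exhausted budget yields the best correct candidate rather than a bare `exit 124`. Everything here is an assumption for illustration: `Candidate`, `optimize_with_budget`, `adaptive_budget_s`, and the default numbers are hypothetical and do not describe GEAK's actual agent loop. The deadline is only checked between attempts, so a single long-running attempt would still need its own timeout.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Candidate:
    name: str
    correct: bool    # passed the correctness check
    speedup: float   # measured speedup over baseline (1.0 = no change)


def optimize_with_budget(
    attempts: Iterable[Callable[[], Candidate]],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Candidate]:
    """Run attempts until the budget is spent; return the best correct one."""
    deadline = clock() + budget_s
    best: Optional[Candidate] = None
    for attempt in attempts:
        if clock() >= deadline:
            break  # budget exhausted: stop and emit what we have so far
        cand = attempt()
        # Checkpoint: keep only correct candidates, preferring higher speedup.
        if cand.correct and (best is None or cand.speedup > best.speedup):
            best = cand
    return best


def adaptive_budget_s(
    source_lines: int,
    base_s: float = 600.0,     # assumed default, not a GEAK setting
    per_line_s: float = 2.0,   # assumed scaling factor
    cap_s: float = 7200.0,     # assumed upper bound
) -> float:
    """Hypothetical heuristic: scale the budget with kernel size, up to a cap."""
    return min(cap_s, base_s + per_line_s * source_lines)
```

Returning `None` when no correct candidate exists keeps the REVERT path explicit, while any correct partial result survives a budget overrun.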

### Impact if fixed
On these workloads GEAK currently contributes **0%** to E2E even though the targeted kernels
account for a high GPU% (all GLM kernels fail; all gpt-oss `.cu`/`.cuh` kernels fail). Fixing
harness-gen and preprocess unblocks the kernels that HL #759 would then promote to E2E.

