GEAK-internal kernel-build failures blocking E2E wins (CK/.cu + sglang harness + env + budget)
Observed across the last-30h GEAKv3+HL fleet. When HL dispatches a hot kernel to GEAK,
a large fraction of attempts never produce a usable patch — failing inside GEAK
before any micro/correctness result. These are GEAK-side (harness-gen / preprocess /
runtime env / budget), distinct from HL's promotion logic (fixed in Hyperloom #759).
Evidence (fleet sessions)
- zai-org-GLM-4.7-FP8 (sglang): 8/8 GEAK kernels failed → all REVERT, E2E 0%.
int8_kernel.py: [harness_gen] no @perftest/@benchmark decorated functions found
→ falls back to timeout 600 python bench_int8_quant.py; then
rag-mcp package not found, installing automatically... → agent exits (log ends).
- openai-gpt-oss-120b (vllm): 7 GEAK kernels, all on
.cu/.cuh:
custom_all_reduce.cuh, quant_kernels.cu → error_class=preprocess_failed
("preprocess reported 1 error(s)"), produced_artifact=false.
- large kernels →
error_class=agent_error / exit code 124 (timeout) at ~130 min.
Failure modes → proposed fixes (v4 parity)
.cu/CK preprocess_failed — GEAK preprocess can't build a harness for CK .cu
kernels. → Emit a buildable .cu harness (v4's .cu harness path).
- harness-gen "no
@perftest/@benchmark decorators" (sglang) — generator can't find
a microbench entrypoint and degrades to raw-script execution. → Synthesize a harness
from the kernel signature / live shape capture instead of requiring decorators.
rag-mcp auto-install crash — agent dies mid-run installing rag-mcp. → Pre-provision
rag-mcp in the image, or make the agent resilient to install failure (skip RAG, continue).
exit 124 timeouts — budget exhausted on large kernels (~130 min observed). → Per-kernel
adaptive budget / earlier checkpointing so partial-but-correct results are emitted.
Impact if fixed
On these workloads GEAK currently contributes 0% of E2E despite high-GPU% targets
(GLM all-fail; gpt-oss .cu all-fail). Fixing harness/preprocess unblocks the kernels that
HL #759 would then promote to E2E.
GEAK-internal kernel-build failures blocking E2E wins (CK/.cu + sglang harness + env + budget)
Observed across the last-30h GEAKv3+HL fleet. When HL dispatches a hot kernel to GEAK,
a large fraction of attempts never produce a usable patch — failing inside GEAK
before any micro/correctness result. These are GEAK-side (harness-gen / preprocess /
runtime env / budget), distinct from HL's promotion logic (fixed in Hyperloom #759).
Evidence (fleet sessions)
int8_kernel.py:[harness_gen] no @perftest/@benchmark decorated functions found→ falls back to
timeout 600 python bench_int8_quant.py; thenrag-mcp package not found, installing automatically...→ agent exits (log ends)..cu/.cuh:custom_all_reduce.cuh,quant_kernels.cu→error_class=preprocess_failed("preprocess reported 1 error(s)"),
produced_artifact=false.error_class=agent_error/exit code 124(timeout) at ~130 min.Failure modes → proposed fixes (v4 parity)
.cu/CKpreprocess_failed— GEAK preprocess can't build a harness for CK.cukernels. → Emit a buildable
.cuharness (v4's.cuharness path).@perftest/@benchmarkdecorators" (sglang) — generator can't finda microbench entrypoint and degrades to raw-script execution. → Synthesize a harness
from the kernel signature / live shape capture instead of requiring decorators.
rag-mcpauto-install crash — agent dies mid-run installingrag-mcp. → Pre-provisionrag-mcpin the image, or make the agent resilient to install failure (skip RAG, continue).exit 124timeouts — budget exhausted on large kernels (~130 min observed). → Per-kerneladaptive budget / earlier checkpointing so partial-but-correct results are emitted.
Impact if fixed
On these workloads GEAK currently contributes 0% of E2E despite high-GPU% targets
(GLM all-fail; gpt-oss
.cuall-fail). Fixing harness/preprocess unblocks the kernels thatHL #759 would then promote to E2E.