
GEAK kernel-build failures block E2E: CK/.cu preprocess, sglang harness-gen, rag-mcp env crash, timeouts #316

Description

@iraj465

GEAK-internal kernel-build failures blocking E2E wins (CK/.cu + sglang harness + env + budget)

Observed across the last-30h GEAKv3+HL fleet. When HL dispatches a hot kernel to GEAK,
a large fraction of attempts never produce a usable patch: they fail inside GEAK
before any microbenchmark or correctness result is produced. These failures are GEAK-side
(harness-gen / preprocess / runtime env / budget) and are distinct from HL's promotion
logic (fixed in Hyperloom #759).

Evidence (fleet sessions)

  • zai-org-GLM-4.7-FP8 (sglang): 8/8 GEAK kernels failed → all REVERT, E2E 0%.
    • int8_kernel.py: [harness_gen] no @perftest/@benchmark decorated functions found
      → falls back to timeout 600 python bench_int8_quant.py; then
      rag-mcp package not found, installing automatically... → agent exits (log ends).
  • openai-gpt-oss-120b (vllm): 7 GEAK kernels, all on .cu/.cuh:
    • custom_all_reduce.cuh, quant_kernels.cu → error_class=preprocess_failed
      ("preprocess reported 1 error(s)"), produced_artifact=false.
    • large kernels → error_class=agent_error / exit code 124 (timeout) at ~130 min.
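
For triage, the signatures quoted above could be bucketed with a small helper along these lines. This is only a sketch: the patterns are the strings as quoted in this issue (the raw log format may differ), and the bucket names and the `classify` function are illustrative, not existing GEAK identifiers.

```python
import re

# Patterns are the strings quoted in this issue; bucket names are illustrative.
SIGNATURES = [
    ("harness_gen_no_entrypoint",
     re.compile(r"no @perftest/@benchmark decorated functions found")),
    ("rag_mcp_install", re.compile(r"rag-mcp package not found")),
    ("preprocess_failed", re.compile(r"error_class=preprocess_failed")),
    ("timeout", re.compile(r"exit code 124")),
]


def classify(log_text):
    """Return every bucket whose signature appears in the log, in table order."""
    return [name for name, pattern in SIGNATURES if pattern.search(log_text)]
```

A session log can match more than one bucket (the int8_kernel.py run above hits both the harness-gen and the rag-mcp signature), so the helper returns a list rather than a single label.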

Failure modes → proposed fixes (v4 parity)

  1. .cu/CK preprocess_failed — GEAK preprocess can't build a harness for CK .cu
    kernels. → Emit a buildable .cu harness (v4's .cu harness path).
  2. harness-gen "no @perftest/@benchmark decorators" (sglang) — generator can't find
    a microbench entrypoint and degrades to raw-script execution. → Synthesize a harness
    from the kernel signature / live shape capture instead of requiring decorators.
  3. rag-mcp auto-install crash — agent dies mid-run installing rag-mcp. → Pre-provision
    rag-mcp in the image, or make the agent resilient to install failure (skip RAG, continue).
  4. exit 124 timeouts — budget exhausted on large kernels (~130 min observed). → Per-kernel
    adaptive budget / earlier checkpointing so partial-but-correct results are emitted.
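
A minimal sketch of what fixes 3 and 4 could look like, assuming the agent shells out for both the rag-mcp install and the kernel run. Every name here (`ensure_optional_tool`, `kernel_budget_s`, `run_with_budget`) and every constant is hypothetical and not taken from GEAK. The install command is left to the caller because this issue does not show it. The 7800 s cap simply mirrors the ~130 min observed above.

```python
import subprocess

TIMEOUT_EXIT_CODE = 124  # exit status coreutils `timeout` reports when the limit is hit


def ensure_optional_tool(install_cmd, install_timeout_s=120):
    """Try to provision an optional tool; return False instead of raising."""
    try:
        done = subprocess.run(install_cmd, capture_output=True,
                              timeout=install_timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False  # missing binary, permission error, or a hung install
    return done.returncode == 0


def kernel_budget_s(source_lines, base_s=600, per_line_s=2.0, cap_s=7800):
    """Hypothetical adaptive budget: grow with kernel size, clamp to a cap."""
    return int(min(cap_s, base_s + per_line_s * source_lines))


def _as_text(data):
    """Normalize captured output, which may be None, bytes, or str."""
    if data is None:
        return ""
    return data.decode(errors="replace") if isinstance(data, bytes) else data


def run_with_budget(cmd, budget_s):
    """Run cmd under a budget; on timeout, keep whatever output was captured."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=budget_s)
    except subprocess.TimeoutExpired as exc:
        return {"error_class": "timeout", "exit_code": TIMEOUT_EXIT_CODE,
                "stdout": _as_text(exc.stdout)}
    error_class = None if done.returncode == 0 else "agent_error"
    return {"error_class": error_class, "exit_code": done.returncode,
            "stdout": done.stdout}
```

With this shape, a caller would set something like `use_rag = ensure_optional_tool(cmd)` and continue without RAG when it returns False, rather than letting the install failure end the session. The timeout branch returns the captured stdout so that a checkpoint written before the budget ran out can still be recovered.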

Impact if fixed

On these workloads GEAK currently contributes 0% of E2E despite high-GPU% targets
(GLM all-fail; gpt-oss .cu all-fail). Fixing harness/preprocess unblocks the kernels that
HL #759 would then promote to E2E.
