Problem
For aiter/Composable-Kernel .cu/.cuh kernels, GEAK's 900s preprocess soft-cap is exhausted during harness-init (the baseline compile) before a single optimization round runs. Preprocess then borrows from the optimization budget just to finish compiling the baseline, leaving little or no time for actual candidate optimization, and in some runs it never produces a benchmark_baseline.txt at all.
Root cause is the interaction of two facts:
- Hyperloom forces AITER_REBUILD=1 for aiter kernels (kernel-agent backends/forge_submit.py: "each time the op is recompiled, so force AITER_REBUILD=1 for aiter kernels -- edits must recompile"). So the baseline AND every candidate must do a full hipcc recompile; caching the build away is not an option by design.
- Composable Kernel is template-heavy — a single aiter attention instance takes 10–35+ min to compile via hipcc.
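900s simply isn't enough when both hold: 10–35+ min is 600–2100+ s, so the cap covers only the fastest compiles, and a slow baseline compile alone exceeds it more than twice over.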
Data-backed evidence (3 kernels, same pinned pipeline: GEAK v3.2.2, MI300X)
| Kernel | Backend/type | Preprocess result |
|---|---|---|
| dynamic_per_tensor_quant (Llama-3.1-8B) | elementwise, fast compile | baseline committed at +472s, 12 strategies, Round 1 completed ✅ |
| attention_paged_attention_ragged (Llama-3.1-8B) | aiter/CK .cu | 900s soft-cap hit at harness-init, NO benchmark_baseline.txt after 34+ min, borrowing up to 4500s from opt budget |
| mha_batch_prefill (Mixtral-8x7B) | aiter/CK .cu | same: 900s soft-cap hit, no baseline, then correctness gate FAILED; 0 optimization rounds run |
Log excerpts (attention, Llama):
[budget] mode=full, total=10800s, preprocess_soft_cap=900
[preprocess] Original budget (900s) reached at stage 'harness-init'.
[preprocess] Borrowing up to 4500s from optimization budget. (hard cap = 5400s)
[preprocess] soft cap reached during 'harness-init' with no benchmark_baseline.txt yet
Contrast (quant, Llama): preprocess committed at +472s; optimization budget ... → rounds ran normally.
So the elementwise kernel clears preprocess in <8 min and optimizes; the two aiter/CK attention kernels never clear it under the same cap.
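The budget arithmetic behind those log lines can be written out. This is a back-of-the-envelope model, not GEAK's actual accounting: the function names are hypothetical, and it assumes preprocess time is charged against the total and cut off at the hard cap. Only the constants come from the log excerpt.

```python
# Constants copied from the [budget] and [preprocess] log lines.
TOTAL_S = 10800
SOFT_CAP_S = 900
MAX_BORROW_S = 4500
HARD_CAP_S = SOFT_CAP_S + MAX_BORROW_S  # 5400, matches "hard cap = 5400s"


def borrowed(preprocess_s: int) -> int:
    """Seconds taken from the optimization budget (assumed model)."""
    return max(0, min(preprocess_s, HARD_CAP_S) - SOFT_CAP_S)


def optimization_budget_left(preprocess_s: int) -> int:
    """Seconds left for optimization rounds, assuming preprocess stops at the hard cap."""
    return TOTAL_S - min(preprocess_s, HARD_CAP_S)


if __name__ == "__main__":
    # 472s is the quant kernel's commit time; 34 min is only a lower bound
    # for the attention kernel, which had no baseline at that point.
    for label, seconds in [("quant", 472), ("attention (at least)", 34 * 60)]:
        print(label, borrowed(seconds), optimization_budget_left(seconds))
```

Under this model, a run that goes all the way to the hard cap keeps 5400s for optimization. At the 35 min end of the compile range that is about two serial candidate recompiles.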
Impact
aiter/CK attention kernels — which are frequently the dominant editable bottleneck (Llama attention ~23.6% GPU, Mixtral mha) — get effectively zero optimization because the entire budget is consumed compiling the baseline. The bottleneck kernel that most needs optimizing is the one GEAK can't reach.
Possible fixes (in rough priority order)
- Per-kernel-type preprocess budget. Scale preprocess_soft_cap by kernel class: elementwise/triton ~900s is fine; aiter/CK .cu kernels need ~2400–3600s. Detect via source path (aiter/csrc, composable_kernel) or KernelLanguage=hip + CK dependency.
- ccache/sccache for hipcc. Candidates share the bulk of CK template instantiations across rounds; a compiler cache would massively cut per-candidate recompile (the baseline compile is paid once, candidate diffs reuse cached objects). Note: this helps the candidate loop even though AITER_REBUILD=1 forces a rebuild, because ccache short-circuits unchanged translation units.
- Parallelize the CK build. harness-init compiles appear serial; set MAX_JOBS=$(nproc) on the hipcc/ninja build (108 cores idle here while one kernel compiles serially).
- Reuse the already-built baseline .so. The baseline aiter kernel is already compiled in the serving image; if harness-init benchmarked that prebuilt .so for the baseline (only compiling candidate edits), the entire baseline-compile cost at harness-init disappears.
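The first and third items could look roughly like the sketch below. It is an illustration, not GEAK code: the function names, the 3600s value (the upper end of the range suggested above), and the example paths are all hypothetical, and it assumes the build honors a MAX_JOBS environment variable.

```python
import os
from pathlib import PurePosixPath

DEFAULT_SOFT_CAP_S = 900   # current cap; fine for elementwise/triton
CK_SOFT_CAP_S = 3600       # assumed: upper end of the proposed 2400-3600s
CK_PATH_MARKERS = ("aiter/csrc", "composable_kernel")
HIP_SUFFIXES = {".cu", ".cuh"}


def preprocess_soft_cap(source_path: str) -> int:
    """Pick a soft cap from the kernel's source path (path-based detection only)."""
    path = PurePosixPath(source_path)
    is_hip_source = path.suffix in HIP_SUFFIXES
    in_ck_tree = any(marker in path.as_posix() for marker in CK_PATH_MARKERS)
    return CK_SOFT_CAP_S if (is_hip_source and in_ck_tree) else DEFAULT_SOFT_CAP_S


def build_env(base=None) -> dict:
    """Copy the environment and default MAX_JOBS to the core count if unset."""
    env = dict(os.environ if base is None else base)
    env.setdefault("MAX_JOBS", str(os.cpu_count() or 1))
    return env


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(preprocess_soft_cap("aiter/csrc/kernels/example_attention.cu"))
    print(preprocess_soft_cap("ops/example_quant.py"))
    print(build_env({})["MAX_JOBS"])
```

Path matching is the cheaper of the two detection options. The KernelLanguage=hip + CK dependency check would need access to metadata this sketch does not model.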
Happy to provide the full run dirs / logs.