Preprocess soft-cap (900s) too tight for aiter/CK kernels under forced AITER_REBUILD=1 — harness-init starves optimization #298

Description

@iraj465

Problem

For aiter/Composable-Kernel .cu/.cuh kernels, GEAK's 900s preprocess soft-cap is exhausted during harness-init (the baseline compile) before a single optimization round runs. Preprocess then borrows from the optimization budget just to finish compiling the baseline, leaving little or no time for actual candidate optimization, and sometimes it never produces a benchmark_baseline.txt at all.

Root cause is the interaction of two facts:

  1. Hyperloom forces AITER_REBUILD=1 for aiter kernels (kernel-agent backends/forge_submit.py: "each time the op is recompiled, so force AITER_REBUILD=1 for aiter kernels -- edits must recompile"). So the baseline AND every candidate must do a full hipcc recompile — caching the build away is not an option by design.
  2. Composable Kernel is template-heavy — a single aiter attention instance takes 10–35+ min to compile via hipcc.

900s simply isn't enough for (1)+(2).

Data-backed evidence (3 kernels, same pinned pipeline: GEAK v3.2.2, MI300X)

| Kernel | Backend/type | Preprocess result |
| --- | --- | --- |
| dynamic_per_tensor_quant (Llama-3.1-8B) | elementwise, fast compile | baseline committed at +472s, 12 strategies, Round 1 completed ✅ |
| attention_paged_attention_ragged (Llama-3.1-8B) | aiter/CK .cu | 900s soft-cap hit at harness-init, NO benchmark_baseline.txt after 34+ min, borrowing up to 4500s from opt budget |
| mha_batch_prefill (Mixtral-8x7B) | aiter/CK .cu | same: 900s soft-cap hit, no baseline, then correctness gate FAILED; 0 optimization rounds run |

Log excerpts (attention, Llama):

[budget] mode=full, total=10800s, preprocess_soft_cap=900
[preprocess] Original budget (900s) reached at stage 'harness-init'.
[preprocess] Borrowing up to 4500s from optimization budget. (hard cap = 5400s)
[preprocess] soft cap reached during 'harness-init' with no benchmark_baseline.txt yet

Contrast (quant, Llama): preprocess committed at +472s; optimization budget ... → rounds ran normally.

So the elementwise kernel clears preprocess in <8 min and optimizes; the two aiter/CK attention kernels never clear it under the same cap.
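
The budget lines in the log excerpt can be turned into a small worked calculation. This is only a sketch: `preprocess_outcome` is a hypothetical helper (not a GEAK function), and it assumes that the optimization budget is the total minus the soft cap and that every second past the soft cap is borrowed one-for-one, which the logs suggest but do not confirm.

```python
# Numbers taken from the [budget] / [preprocess] log lines above.
TOTAL_BUDGET_S = 10800
SOFT_CAP_S = 900
MAX_BORROW_S = 4500
HARD_CAP_S = SOFT_CAP_S + MAX_BORROW_S  # 5400, matches "hard cap = 5400s"


def preprocess_outcome(preprocess_s: int) -> dict:
    """Hypothetical helper: split the total budget for a given preprocess duration."""
    # Assumption: every second past the soft cap is borrowed, up to MAX_BORROW_S.
    borrowed = min(max(preprocess_s - SOFT_CAP_S, 0), MAX_BORROW_S)
    return {
        "within_soft_cap": preprocess_s <= SOFT_CAP_S,
        "within_hard_cap": preprocess_s <= HARD_CAP_S,
        "borrowed_s": borrowed,
        # Assumption: optimization gets the total minus the soft cap and any borrowing.
        "optimization_left_s": TOTAL_BUDGET_S - SOFT_CAP_S - borrowed,
    }


quant = preprocess_outcome(472)          # elementwise kernel, committed at +472s
attention = preprocess_outcome(34 * 60)  # attention kernel, still no baseline at 34 min
worst = preprocess_outcome(HARD_CAP_S)   # preprocess runs all the way to the hard cap
```

Under these assumptions the attention kernel has already borrowed 1140s at the 34-minute mark without a baseline to show for it, and a run that reaches the hard cap leaves at most 5400s of the 10800s total for optimization rounds.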

Impact

aiter/CK attention kernels — which are frequently the dominant editable bottleneck (Llama attention ~23.6% GPU, Mixtral mha) — get effectively zero optimization because the entire budget is consumed compiling the baseline. The bottleneck kernel that most needs optimizing is the one GEAK can't reach.

Possible solves (in rough priority)

  1. Per-kernel-type preprocess budget. Scale preprocess_soft_cap by kernel class: elementwise/triton ~900s is fine; aiter/CK .cu need ~2400–3600s. Detect via source path (aiter/csrc, composable_kernel) or KernelLanguage=hip + CK dependency.
  2. ccache/sccache for hipcc. Candidates share the bulk of CK template instantiations across rounds; a compiler cache would massively cut per-candidate recompile (the baseline compile is paid once, candidate diffs reuse cached objects). Note: this helps the candidate loop even though AITER_REBUILD=1 forces a rebuild — ccache short-circuits unchanged translation units.
  3. Parallelize the CK build — harness-init compiles appear serial; MAX_JOBS=$(nproc) on the hipcc/ninja build (108 cores idle here while one kernel compiles serially).
  4. Reuse the already-built baseline .so. The baseline aiter kernel is already compiled in the serving image; if harness-init benchmarked that prebuilt .so for the baseline (only compiling candidate edits), the entire baseline-compile cost at harness-init disappears.
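
A minimal sketch of items 1 and 3, to make the proposal concrete. Everything here is illustrative: `preprocess_soft_cap` and `parallel_build_env` are hypothetical names rather than existing GEAK or Hyperloom functions, the 3600s value is simply the upper end of the range suggested above, and the second helper assumes the aiter/CK build actually honors `MAX_JOBS`.

```python
import os
from pathlib import PurePosixPath

DEFAULT_SOFT_CAP_S = 900   # current cap; fine for elementwise/triton kernels
CK_SOFT_CAP_S = 3600       # assumption: upper end of the 2400-3600s range proposed above
CK_PATH_MARKERS = ("aiter/csrc", "composable_kernel")
CK_SUFFIXES = {".cu", ".cuh"}


def preprocess_soft_cap(source_path: str) -> int:
    """Pick a preprocess soft cap (seconds) from the kernel's source path."""
    path = PurePosixPath(source_path)
    looks_like_ck = path.suffix in CK_SUFFIXES and any(
        marker in path.as_posix() for marker in CK_PATH_MARKERS
    )
    return CK_SOFT_CAP_S if looks_like_ck else DEFAULT_SOFT_CAP_S


def parallel_build_env(base_env=None) -> dict:
    """Return a copy of the environment with MAX_JOBS set to the core count."""
    env = dict(os.environ if base_env is None else base_env)
    # Respect an explicit MAX_JOBS if the caller already set one.
    env.setdefault("MAX_JOBS", str(os.cpu_count() or 1))
    return env
```

The path check is deliberately crude; detecting KernelLanguage=hip plus a CK dependency, as mentioned in item 1, would be more robust but depends on metadata not shown in this issue. The environment returned by `parallel_build_env` would be passed to whatever subprocess launches the hipcc/ninja build.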

Happy to provide the full run dirs / logs.
