Preprocess soft-cap (900s) too tight for aiter/CK kernels under forced AITER_REBUILD=1 — harness-init starves optimization

## Problem
For aiter/Composable-Kernel `.cu`/`.cuh` kernels, GEAK's **900s preprocess soft-cap is exhausted during `harness-init`** (the baseline compile) before a single optimization round runs. The kernel then borrows from the optimization budget just to finish compiling the *baseline*, leaving little/no time for actual candidate optimization — and sometimes never produces a `benchmark_baseline.txt` at all.

Root cause is the interaction of two facts:
1. **Hyperloom forces `AITER_REBUILD=1` for aiter kernels** (kernel-agent `backends/forge_submit.py`: *"each time the op is recompiled, so force AITER_REBUILD=1 for aiter kernels -- edits must recompile"*). So the baseline AND every candidate must do a full hipcc recompile — caching the build away is not an option by design.
2. **Composable Kernel is template-heavy** — a single aiter attention instance takes 10–35+ min to compile via hipcc.

900s simply isn't enough for (1)+(2).
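
To make the arithmetic concrete, here is a rough back-of-the-envelope model rather than GEAK's actual accounting. It assumes every build (the baseline and each candidate) is a full serial compile of `compile_s` seconds, that nothing else consumes budget, and that the hard cap equals the soft cap plus the maximum borrow (900s + 4500s = 5400s, as in the log excerpt further down). The function name and the returned keys are hypothetical.

```python
def build_budget(total_s: int, soft_cap_s: int, max_borrow_s: int, compile_s: int) -> dict:
    """Toy model: one full compile for the baseline, one per candidate, all serial."""
    hard_cap_s = soft_cap_s + max_borrow_s  # assumption: hard cap = soft cap + max borrow
    if compile_s > hard_cap_s:
        # Baseline never finishes inside preprocess, so no candidate is ever built.
        return {"baseline_fits": False, "borrowed_s": max_borrow_s, "candidate_builds": 0}
    borrowed_s = max(0, compile_s - soft_cap_s)  # time taken from the optimization budget
    remaining_s = total_s - compile_s            # what is left after the baseline compile
    return {
        "baseline_fits": True,
        "borrowed_s": borrowed_s,
        "candidate_builds": remaining_s // compile_s,  # full recompiles that still fit
    }


if __name__ == "__main__":
    # 600s and 2100s are the 10 and 35 minute ends of the compile range quoted above.
    for compile_s in (600, 2100):
        print(compile_s, build_budget(10800, 900, 4500, compile_s))
```

Under these assumptions a 35-minute compile already overruns the soft cap by 1200s and leaves room for only a handful of candidate builds in a 10800s run, before counting benchmarking or correctness checks.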

## Data-backed evidence (3 kernels, same pinned pipeline: GEAK v3.2.2, MI300X)

| Kernel | Backend/type | Preprocess result |
|---|---|---|
| `dynamic_per_tensor_quant` (Llama-3.1-8B) | elementwise, fast compile | **baseline committed at +472s**, 12 strategies, Round 1 completed ✅ |
| `attention_paged_attention_ragged` (Llama-3.1-8B) | aiter/CK `.cu` | **900s soft-cap hit at `harness-init`, NO `benchmark_baseline.txt` after 34+ min**, borrowing up to 4500s from opt budget |
| `mha_batch_prefill` (Mixtral-8x7B) | aiter/CK `.cu` | **same: 900s soft-cap hit, no baseline, then correctness gate FAILED** — 0 optimization rounds run |

Log excerpts (attention, Llama):
```
[budget] mode=full, total=10800s, preprocess_soft_cap=900
[preprocess] Original budget (900s) reached at stage 'harness-init'.
[preprocess] Borrowing up to 4500s from optimization budget. (hard cap = 5400s)
[preprocess] soft cap reached during 'harness-init' with no benchmark_baseline.txt yet
```
Contrast (quant, Llama): `preprocess committed at +472s; optimization budget ...` → rounds ran normally.

So the elementwise kernel clears preprocess in <8 min and optimizes; the two aiter/CK attention kernels never clear it under the same cap.

## Impact
aiter/CK attention kernels — which are frequently the **dominant editable bottleneck** (Llama attention ~23.6% GPU, Mixtral mha) — get effectively **zero optimization** because the entire budget is consumed compiling the baseline. The bottleneck kernel that most needs optimizing is the one GEAK can't reach.

## Possible solves (in rough priority)
1. **Per-kernel-type preprocess budget.** Scale `preprocess_soft_cap` by kernel class: elementwise/triton ~900s is fine; aiter/CK `.cu` need ~2400–3600s. Detect via source path (`aiter/csrc`, `composable_kernel`) or `KernelLanguage=hip` + CK dependency.
2. **`ccache`/`sccache` for hipcc.** Candidates share the bulk of CK template instantiations across rounds; a compiler cache would massively cut per-candidate recompile (the baseline compile is paid once, candidate diffs reuse cached objects). Note: this helps the *candidate* loop even though `AITER_REBUILD=1` forces a rebuild — ccache short-circuits unchanged translation units.
3. **Parallelize the CK build.** The harness-init compile appears to run serially; setting `MAX_JOBS=$(nproc)` on the hipcc/ninja build would put to work the 108 cores that sit idle here while one kernel compiles.
4. **Reuse the already-built baseline `.so`.** The baseline aiter kernel is already compiled in the serving image; if harness-init benchmarked that prebuilt `.so` for the baseline (only compiling candidate *edits*), the entire baseline-compile cost at harness-init disappears.
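
Option 1 could be sketched roughly as follows. This is a minimal illustration, not GEAK code: the function name, the constants, and the example paths are all hypothetical, and the 3600s value is simply the upper end of the range suggested above. It only covers the source-path heuristic, not the `KernelLanguage=hip` + CK dependency check.

```python
from pathlib import PurePosixPath

# Path fragments and suffixes taken from the detection idea in option 1.
CK_PATH_MARKERS = ("aiter/csrc", "composable_kernel")
CK_SUFFIXES = {".cu", ".cuh"}


def preprocess_soft_cap(source_path: str, default_s: int = 900, ck_s: int = 3600) -> int:
    """Pick a preprocess soft cap (seconds) from the kernel's source path."""
    path = PurePosixPath(source_path)
    in_ck_tree = any(marker in path.as_posix() for marker in CK_PATH_MARKERS)
    is_ck = in_ck_tree and path.suffix in CK_SUFFIXES
    return ck_s if is_ck else default_s


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(preprocess_soft_cap("/opt/aiter/csrc/kernels/attention.cu"))
    print(preprocess_soft_cap("/opt/kernels/quant.py"))
```

A path-based check is cheap but coarse; if it turns out to misclassify kernels, the CK-dependency check mentioned in option 1 would be the more reliable signal.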

Happy to provide the full run dirs / logs.
