This is an autonomous optimization program, in the style of autoresearch/program.md, for a single kernel. The goal is to continuously reduce the target kernel's GPU execution time without breaking correctness; any attempt that regresses performance must be rolled back, and only performance-improving attempts may be kept and committed.
For this run, use the following command as the single source of truth for both the unit test and benchmark:
`python tests/kernels/test_pa.py`

Throughout the entire optimization process, do not modify this benchmark command unless a human explicitly asks you to switch targets.
- Correctness gate: the script must exit successfully, and the `.csv` must contain `err_flydsl_ps == 0`
- Script-side benchmark metric: `us_flydsl_ps` in the `.csv`, lower is better
- Final keep/discard metric: the average GPU kernel time from `rocprofv3` produced by `/kernel-trace-analysis`, lower is better
- Noise rule: changes smaller than `1%` are considered noise by default, unless the same-direction improvement appears twice in a row
- If the benchmark output contains multiple configs, do not look only at a single overall average; you must break results down by config dimensions, and at minimum prioritize any actual `block_size`, `mtp`, `head_size`, plus any other columns that directly affect dispatch
us_flydsl_ps is used for quick regression checks; when deciding whether to keep a modification, prioritize rocprofv3 GPU kernel time. Do not keep a modification just because Python-side timing looks faster if the actual GPU kernel got slower.
When the benchmark returns many configs, do not collapse all rows into a single number. You must first inspect the config patterns and identify which combinations naturally belong to the same class, such as block_size=1024, mtp=1, head_size=128; all later optimization, regression judgment, and kernel-splitting decisions must be analyzed by config class.
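The noise rule can be sketched as a small classifier. This is a minimal illustration, not part of the benchmark: the function names and the `previous_delta` parameter are hypothetical, and treating "twice in a row" as "the previous measurement also improved" is one possible reading of the rule.

```python
from typing import Optional

NOISE_THRESHOLD = 0.01  # the 1% noise band from the noise rule


def relative_change(best_us: float, candidate_us: float) -> float:
    """Signed fractional change; negative means the candidate is faster."""
    return (candidate_us - best_us) / best_us


def classify(best_us: float, candidate_us: float,
             previous_delta: Optional[float] = None) -> str:
    """Return 'improvement', 'regression', or 'noise' (hypothetical helper)."""
    delta = relative_change(best_us, candidate_us)
    if delta < -NOISE_THRESHOLD:
        return "improvement"
    if delta > NOISE_THRESHOLD:
        return "regression"
    # Assumption: a sub-1% gain counts only when the previous measurement
    # of the same change also moved in the improving direction.
    if delta < 0 and previous_delta is not None and previous_delta < 0:
        return "improvement"
    return "noise"
```

Apply the classifier to the `rocprofv3` GPU kernel time per config class when making the keep/discard decision; `us_flydsl_ps` is only the quick check.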
Prepare as follows before starting:
0. Use /FlyDSL as the root directory
- First choose a run tag, for example `apr16-pa-ps`
- Create a dedicated worktree or dedicated branch from the current `HEAD` to avoid polluting the existing workspace. Example: `git worktree add ../kernel-opt-<tag> -b kernel-opt/<tag> HEAD`
- Work in the dedicated worktree; do not experiment directly in the original dirty workspace
- Read first:
  - `tests/kernels/test_pa.py`
  - The kernel implementation file ultimately dispatched by that script
  - Any helper files that directly affect launch config / tile / prefetch / LDS / wait strategy
- Confirm that both `rocprofv3` and `rocm-smi` are available in `PATH`; if either is unavailable, stop and inform the human
- If `performance.md` does not exist, create it
- Record the current commit as: `BEST_COMMIT=$(git rev-parse HEAD)`

`BEST_COMMIT` means "the current latest known-correct and best-performing commit." All later failed or regressive attempts must return to this point before continuing.
Also maintain a regression counter:
`CONSECUTIVE_REGRESSIONS=0`

Here, a "performance regression" specifically means: the candidate passes correctness and benchmark, but compared with the current `BEST_COMMIT`, the target kernel's average GPU time in `rocprofv3` gets worse beyond the noise threshold (> 1%). Only this case should increment the regression counter by +1.
Also maintain a recent regression summary list:
`RECENT_REGRESSIONS=[]  # rolling window, max len = 5`

Whenever a clear performance regression occurs, append a summary to `RECENT_REGRESSIONS` that includes at least:
- Date/time
- Baseline commit: current `BEST_COMMIT`
- Candidate commit or working tree state
- Hypothesis for this round
- Modified files
- `us_flydsl_ps`: best -> candidate
- `rocprofv3` target kernel average GPU time: best -> candidate
- Percentage change relative to best
- Main regression points observed in this round's trace
"Consecutive" is defined as follows:
- Clear performance regression: `CONSECUTIVE_REGRESSIONS += 1`
- Performance improvement is kept: `CONSECUTIVE_REGRESSIONS = 0`, and clear `RECENT_REGRESSIONS`
- Tie within noise: `CONSECUTIVE_REGRESSIONS = 0`, and clear `RECENT_REGRESSIONS`
- Benchmark failure / missing metrics: do not count as a "performance regression" and do not write to `RECENT_REGRESSIONS`; because no comparable performance regression was formed, also reset `CONSECUTIVE_REGRESSIONS = 0` by default
- When correctness fails and the `.csv` shows `err_flydsl_ps != 0`, do not discard the round immediately; first enter a subflow of "up to 3 kernel repair attempts to restore accuracy." Only if all 3 repairs still fail should the optimization proposal be treated as failed, and `CONSECUTIVE_REGRESSIONS` reset to `0`
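The counter and rolling-window rules above can be sketched as a small state holder. The class name, method names, and outcome strings are hypothetical; the sketch assumes that every non-regression outcome both resets the counter and clears the list.

```python
from collections import deque
from typing import Optional

# Hypothetical outcome labels; only "regression" increments the counter.
OUTCOMES = {"regression", "improvement", "noise",
            "benchmark_failure", "repair_failed"}


class RegressionTracker:
    """Tracks CONSECUTIVE_REGRESSIONS and RECENT_REGRESSIONS (sketch)."""

    def __init__(self, window: int = 5, stop_after: int = 5):
        self.consecutive = 0
        self.recent = deque(maxlen=window)  # rolling window, max len = 5
        self.stop_after = stop_after

    def record(self, outcome: str, summary: Optional[dict] = None) -> None:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        if outcome == "regression":
            self.consecutive += 1
            self.recent.append(summary or {})
        else:
            # Improvements, ties, failed benchmarks, and failed repairs
            # all break the streak.
            self.consecutive = 0
            self.recent.clear()

    def should_stop(self) -> bool:
        return self.consecutive >= self.stop_after
```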
Any test, benchmark, correctness retest, or /kernel-trace-analysis run that will actually occupy a GPU must pass the following GPU gate before it starts:
- First use `rocm-smi` to inspect all GPUs and look for an idle GPU
- Define "idle" based on the utilization / memory / process information visible in `rocm-smi`; prefer a card with no active compute process, near-idle GPU utilization, and low memory usage or memory usage that is stable and not growing
- If multiple idle GPUs are found, prefer the most idle one and bind subsequent commands to that GPU, for example by setting `HIP_VISIBLE_DEVICES=<gpu_id>`
- If there is currently no idle GPU, do not preempt; wait `10` minutes, then re-check with `rocm-smi`
- Wait at most `1` hour total, meaning retry every `10` minutes after the initial check; if an idle GPU appears at any point during this period, continue the task
- If there is still no available GPU after a full `1` hour, this is not a normal benchmark failure; it is a stop condition for the entire optimization task: abort immediately and stop trying new optimization ideas
Do not run tests on a busy GPU: the benchmark numbers become noisy, and you may interfere with other people's jobs.
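The retry schedule of the gate (one initial check, then a re-check every 10 minutes for at most 1 hour) can be sketched as follows. The `find_idle_gpu` callable is a hypothetical probe that would wrap `rocm-smi` and return a GPU id or `None`; how it parses `rocm-smi` output is deliberately left out.

```python
import time
from typing import Callable, Optional


def wait_for_idle_gpu(
    find_idle_gpu: Callable[[], Optional[int]],  # hypothetical rocm-smi probe
    retry_seconds: int = 600,                    # re-check every 10 minutes
    max_wait_seconds: int = 3600,                # give up after 1 hour
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Return an idle GPU id, or None once the full wait budget is spent."""
    waited = 0
    while True:
        gpu_id = find_idle_gpu()
        if gpu_id is not None:
            return gpu_id
        if waited >= max_wait_seconds:
            return None  # stop condition for the whole optimization task
        sleep(retry_seconds)
        waited += retry_seconds
```

A `None` result corresponds to the stop condition, not to an ordinary benchmark failure. On success, the caller would export `HIP_VISIBLE_DEVICES` with the returned id before launching the benchmark.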
The first round must be the baseline with unmodified code.
- First execute `GPU Availability Gate`, obtain an idle GPU, then run the fixed benchmark on that GPU and write the full output to the log: `python tests/kernels/test_pa.py`
- Confirm:
  - Process exit code is `0`
  - A `.csv` was generated
  - `err_flydsl_ps == 0`
- Read baseline metrics from the `.csv`:

      python - <<'PY'
      import pandas as pd

      # Replace <filename>.csv with the CSV written by the benchmark run.
      df = pd.read_csv("<filename>.csv")
      # Correctness gate: every config row must report zero error.
      if not (df["err_flydsl_ps"] == 0).all():
          raise SystemExit("correctness failed")
      print(f"baseline_us_flydsl_ps={df['us_flydsl_ps'].mean():.3f}")
      PY

- Execute `GPU Availability Gate` again, then run the same benchmark command with: `/kernel-trace-analysis python tests/kernels/test_pa.py`
- Record from the trace report:
  - Hottest kernel name
  - Average GPU time of that kernel
  - arch VGPR / accum VGPR / SGPR / LDS
  - Top bottlenecks
- Write the baseline into `performance.md` with status `baseline`
The benchmark .csv may contain many configs. During autonomous optimization, you must first understand the config structure before deciding on an optimization strategy:
- First inspect which config columns actually exist in the `.csv`; do not assume the fields are fixed
- Prioritize grouping by actual `block_size`, `mtp`, and `head_size`; if other columns explain performance stratification better, include them in the config classification as well
- Record baseline correctness and performance for each config class; do not record only an overall average
- Every optimization round must state clearly whether it is a "global optimization" or an optimization specific to one config class
- If a hypothesis only makes sense for a certain config class, it is allowed to optimize only that class, but the target config class must be explicitly recorded
- Non-target configs must not be heavily sacrificed for gains on the target configs; do not use "one subset improved a lot" to hide "another subset clearly regressed"
A "config class" means a group of inputs that goes through the same launch / tile / kernel decision path and shows similar performance behavior. Classification must be based on real columns present in benchmark output, not on arbitrary guesswork.
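As a rough sketch of the per-class bookkeeping described above, the following groups benchmark rows by whichever of the preferred key columns actually exist and reports mean time and correctness per class. It uses only the standard library; the function name and the `extra_keys` parameter are hypothetical, and only `us_flydsl_ps` and `err_flydsl_ps` are assumed to be present.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

PREFERRED_KEYS = ("block_size", "mtp", "head_size")


def summarize_by_class(csv_text: str, extra_keys: tuple = ()) -> tuple:
    """Return (key_columns, {class_key: {"mean_us", "correct"}})."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("benchmark CSV has no rows")
    # Use only key columns that really exist; never assume fixed fields.
    keys = [k for k in PREFERRED_KEYS + tuple(extra_keys) if k in rows[0]]
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    summary = {}
    for cls, members in groups.items():
        summary[cls] = {
            "mean_us": mean(float(r["us_flydsl_ps"]) for r in members),
            "correct": all(float(r["err_flydsl_ps"]) == 0 for r in members),
        }
    return keys, summary
```

Class keys are kept as the raw strings from the CSV, so the same grouping works regardless of how each column is typed.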
You may modify:
- The kernel implementation itself
- Runtime dispatch logic so that different config classes use different kernels
- Launch config
- Tile / wave / split / partition choices
- Prefetch / double-buffer / software pipelining
- LDS layout / swizzle / padding
- Instruction scheduling before/after wait/barrier
- Vectorization width
- Local code that affects occupancy or memory-latency hiding
- A specialized kernel added specifically for one config class
You may not modify:
- The fixed benchmark command
- Benchmark shape
- Secretly reducing the amount of work or bypassing correctness just to "run faster"
- Introducing new dependencies
- Keeping a modification that makes GPU kernel time worse
If you must modify the benchmark script itself, it may only be for:
- Fixing the benchmark harness
- Making the metrics more stable and easier to parse
- Fixing an obviously wrong correctness gate
Do not modify the benchmark just to make the numbers look better.
If a candidate round produces a .csv and err_flydsl_ps != 0, that means the optimization broke accuracy. Do not immediately abandon the current idea; you must first try to repair the kernel:
- Keep the current optimization hypothesis unchanged, and allow only the minimal repairs required to make accuracy pass again
- Before each repair attempt, execute `GPU Availability Gate`, then rerun the fixed benchmark and recheck `err_flydsl_ps` in the `.csv`
- A single optimization proposal is allowed at most `3` consecutive repair attempts
- As soon as one repair restores `err_flydsl_ps == 0`, return to the normal performance evaluation flow and continue comparing the candidate against `BEST_COMMIT`
- If `err_flydsl_ps` is still not `0` after `3` repairs, abandon this optimization proposal: revert to `BEST_COMMIT`, do not commit, do not write to `performance.md`, and continue with a new optimization idea
This 3-repair limit is a local limit for a single optimization proposal and does not count toward CONSECUTIVE_REGRESSIONS.
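The bounded repair subflow can be sketched as a loop with injected callbacks. Both callbacks are hypothetical placeholders: `apply_fix` stands for one minimal kernel repair, and `is_correct` stands for "pass the GPU gate, rerun the fixed benchmark, and check `err_flydsl_ps == 0` for every row".

```python
from typing import Callable

MAX_FIX_ATTEMPTS = 3  # local limit for a single optimization proposal


def repair_until_correct(
    apply_fix: Callable[[int], None],  # hypothetical: one minimal repair
    is_correct: Callable[[], bool],    # hypothetical: rerun + CSV check
) -> bool:
    """Return True if correctness is restored within MAX_FIX_ATTEMPTS."""
    for attempt in range(1, MAX_FIX_ATTEMPTS + 1):
        apply_fix(attempt)  # the optimization hypothesis stays unchanged
        if is_correct():
            return True     # resume the normal performance evaluation
    return False            # caller reverts to BEST_COMMIT, no commit
```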
If benchmark results show clear config divergence, for example one config class improves significantly while another class clearly and reproducibly regresses, do not force all configs to share the same optimized version. In this case, prefer splitting:
- Keep the current known-best general-purpose kernel as the default path
- Add a specialized kernel for the config class that clearly benefits
- In runtime dispatch, choose kernels based on config conditions, for example actual `block_size`, `mtp`, `head_size`, or other real configuration columns
- Route the benefiting configs to the new kernel, while keeping the other configs on the old best kernel
- After splitting, rerun the fixed benchmark, verify correctness for all configs, and compare performance again across config classes
Only if "after dispatch splitting, the benefiting configs are clearly faster and the other configs are not slowed down" may this kind of new-kernel approach be kept.
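A dispatch split of this kind might look like the sketch below. Every name here (`PAConfig`, `pa_generic`, `pa_specialized`, `select_kernel`) is hypothetical and does not refer to actual FlyDSL code; the predicate reuses the example class `block_size=1024, mtp=1, head_size=128` purely for illustration.

```python
from typing import Callable, NamedTuple


class PAConfig(NamedTuple):
    """Hypothetical container for the dispatch-relevant config columns."""
    block_size: int
    mtp: int
    head_size: int


def pa_generic(cfg: PAConfig) -> str:
    """Stand-in for the known-best general-purpose kernel."""
    return "generic"


def pa_specialized(cfg: PAConfig) -> str:
    """Stand-in for a kernel tuned for one config class."""
    return "specialized"


def uses_specialized(cfg: PAConfig) -> bool:
    # Illustrative condition only; derive the real one from benchmark data.
    return cfg.block_size == 1024 and cfg.mtp == 1 and cfg.head_size == 128


def select_kernel(cfg: PAConfig) -> Callable[[PAConfig], str]:
    """Route the benefiting class to the new kernel, all others to the old."""
    return pa_specialized if uses_specialized(cfg) else pa_generic
```

Keeping the generic kernel as the fall-through path means any config that the predicate does not match stays on the previous best implementation.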
Starting from the baseline, enter an infinite loop. Do not stop unless a stop condition is triggered or a human manually interrupts you.
LOOP FOREVER:
- Start from `BEST_COMMIT`
- First execute `GPU Availability Gate`, then run or rerun `/kernel-trace-analysis` to make sure the bottleneck profile is up to date
- Read the latest benchmark `.csv`, and by config dimension re-check which config classes are hottest, slowest, or most prone to regressions; at minimum inspect `block_size`, `mtp`, and `head_size`
- Choose exactly one bottleneck-driven hypothesis for this round, and state clearly whether it is a "global optimization" or an optimization specific to one config class
- Write down the hypothesis and target config class in one sentence before editing, for example:
  - Launch next-iteration loads earlier to reduce exposed `s_waitcnt vmcnt(0)`
  - Reduce LDS write/read round-trips to lower `lgkmcnt` and barrier stall
  - Reduce arch VGPR pressure to improve occupancy
  - Insert independent work between MFMA/VALU to fill idle bubbles
- Only modify the minimal set of files needed to support this hypothesis
- Set `FIX_ATTEMPTS=0`, first execute `GPU Availability Gate`, then rerun the fixed benchmark and save the output as `candidate.log`
- If the benchmark fails or metrics are missing:
  - Discard this attempt
  - Revert to `BEST_COMMIT`
  - Reset `CONSECUTIVE_REGRESSIONS` to `0`
  - Clear `RECENT_REGRESSIONS`
  - Continue to the next idea
- If the benchmark succeeds, first read the `.csv` by config class; do not look only at the total average
- If any config class has `err_flydsl_ps != 0`:
  - Enter the correctness repair subflow
  - Repair the kernel while keeping the current optimization direction unchanged, and increment `FIX_ATTEMPTS += 1`
  - Before each repair attempt, execute `GPU Availability Gate`, rerun the fixed benchmark, and recheck `err_flydsl_ps`
  - Proceed to the next step only after all config classes are restored to `err_flydsl_ps == 0`
  - If failure still remains after `FIX_ATTEMPTS >= 3`:
    - Abandon this optimization proposal
    - Revert to `BEST_COMMIT`
    - Do not commit
    - Do not write to `performance.md`
    - Reset `CONSECUTIVE_REGRESSIONS` to `0`
    - Clear `RECENT_REGRESSIONS`
    - Continue to the next idea
- First execute `GPU Availability Gate`, then run `/kernel-trace-analysis` again on the candidate
- When comparing the candidate against the current best, you must look at both the overall result and the per-config result:
  - All config classes must still have `err_flydsl_ps == 0`
  - The target config class must genuinely improve
  - Non-target config classes must not show clear and reproducible severe regressions
  - `us_flydsl_ps` must not improve by sacrificing other config classes
  - The target kernel's average GPU time in `rocprofv3` must improve; if kernels are split, confirm that the new dispatch makes the benefiting configs faster without slowing down the others
- If the candidate is better:
  - Append the result to `performance.md`
  - Run `/format-code`
  - After formatting, first execute `GPU Availability Gate`, then rerun the fixed benchmark
  - If the formatted result is still better, commit the code and `performance.md`
  - Update `BEST_COMMIT` to the new commit
  - Reset `CONSECUTIVE_REGRESSIONS` to `0`
  - Clear `RECENT_REGRESSIONS`
- If the candidate shows a divergence where "some config classes improve a lot but other config classes regress severely":
  - Do not keep this single-kernel version
  - If this divergence is stable and explainable, follow `Kernel Split Rule` to switch to a specialized kernel + dispatch, then reevaluate
  - If you have not split yet, first revert to `BEST_COMMIT`
- If the candidate is tied or worse:
  - Revert all modifications back to `BEST_COMMIT`
  - Do not commit
  - Do not write to `performance.md`
  - If it is a tie within noise, reset `CONSECUTIVE_REGRESSIONS` to `0` and clear `RECENT_REGRESSIONS`
  - If it is a measurable performance regression rather than a tie within noise, increment `CONSECUTIVE_REGRESSIONS += 1` and append this round's summary to `RECENT_REGRESSIONS`
  - If `CONSECUTIVE_REGRESSIONS >= 5`, stop optimization and write the stop reason plus the latest 5 regression summaries into `performance.md`
  - Otherwise continue to the next round
Because the entire experiment runs in a dedicated worktree / dedicated branch, failed or regressive attempts may reset the experiment branch directly back to BEST_COMMIT. Do not do this in a shared workspace that contains unrelated changes.
Every analysis round runs the same fixed command:
/kernel-trace-analysis python tests/kernels/test_pa.py
When reading the report, prioritize:
- High-cycle `s_waitcnt` / `s_barrier`
- High-cycle `buffer_load_*` or other VMEM stalls
- `ds_read_*` / `ds_write_*` / `lgkmcnt` pressure
- Idle bubbles around MFMA/VALU
- Whether arch VGPR or accum VGPR is the occupancy limiter
If there are multiple kernels:
- Optimize the one with the highest GPU time share first
- If both the main kernel and the reduce kernel are hot, prioritize the larger one
- Only handle secondary kernels after the main bottleneck improves
Do not do "guess-and-tune" optimization. Every modification in each round must point back to some bottleneck seen in the trace report.
All of the following conditions must be true before a modification may be kept:
- Correctness passes
- The fixed benchmark passes
- The target config class or target kernel shows improved `rocprofv3` GPU time
- Non-target config classes do not show clear and reproducible severe regressions
- If config divergence appears, it has already been isolated through specialized kernel + dispatch, rather than forcing all configs to share one regressive version
- `/format-code` has been executed
- After formatting, the result is still better
Otherwise, discard the modification immediately and revert it.
If err_flydsl_ps != 0, do not discard it immediately; you must first follow Correctness Repair Rule and attempt up to 3 repairs. Only after 3 repairs still fail to restore it to 0 should the optimization proposal be abandoned.
If script-side us_flydsl_ps improves but rocprofv3 GPU kernel time does not improve, or gets worse, do not keep it.
If only some configs improve while other configs clearly regress, do not directly keep a single-kernel compromise version. Prefer splitting into specialized kernels and letting dispatch route different configs to their more suitable implementations.
If there are 5 consecutive measured performance regressions, end this autonomous optimization round and stop trying new modifications.
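The keep conditions above can be collapsed into a single predicate. This is a sketch under stated assumptions: the `Candidate` fields are hypothetical names for facts the optimizer has already measured, and "improved" is taken to mean better than the best GPU time by more than the 1% noise band.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record of one evaluated candidate."""
    correctness_ok: bool             # err_flydsl_ps == 0 for all classes
    benchmark_ok: bool               # fixed benchmark exited successfully
    gpu_time_best_us: float          # rocprofv3 average at BEST_COMMIT
    gpu_time_candidate_us: float     # rocprofv3 average for the candidate
    non_target_regressed: bool       # clear, reproducible regression elsewhere
    formatted: bool                  # /format-code has been executed
    still_better_after_format: bool  # re-measured after formatting


def should_keep(c: Candidate, noise: float = 0.01) -> bool:
    """True only if every keep condition holds; us_flydsl_ps is not used."""
    gpu_improved = c.gpu_time_candidate_us < c.gpu_time_best_us * (1 - noise)
    return all([
        c.correctness_ok,
        c.benchmark_ok,
        gpu_improved,
        not c.non_target_regressed,
        c.formatted,
        c.still_better_after_format,
    ])
```

The script-side `us_flydsl_ps` value is intentionally absent from the predicate, matching the rule that a faster Python-side timing alone never justifies keeping a change.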
The results table in performance.md should record only two kinds of results:
- Baseline
- A real kept improvement
Do not write failed attempts, crashed attempts, or regressive attempts into the results table.
If optimization stops because of "5 consecutive measured performance regressions," you may append a separate Stop Summary block to performance.md. This block is not part of the results table; it is only used to explain why optimization stopped.
Each kept record must include at least:
- Date
- Commit
- Status: `baseline` or `keep`
- Benchmark command
- Target config class or dispatch condition
- Kernel name
- `us_flydsl_ps`
- Average `rocprofv3` GPU time
- Percentage change relative to the previous best
- Short explanation: which bottleneck this change targeted and what it did
If it is a split-kernel solution, the record should also explain:
- Which configs continue to use the old kernel
- Which configs switch to the new kernel
- Which config conditions are served by the old and new kernels respectively
Example:
| 2026-04-16 | abc1234 | keep | python tests/kernels/test_pa.py | global | pa_persistent_fwd | 142.8 | 118.4 | -6.2% | hoist next-iteration loads to hide vmcnt stall |
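A row in this shape could be produced by a small formatter such as the sketch below. The function name and parameter names are hypothetical, the column order follows the field list for kept records, and rendering a missing percentage as `n/a` for the baseline is an assumption.

```python
from typing import Optional


def format_result_row(date: str, commit: str, status: str, command: str,
                      target: str, kernel: str, us_flydsl_ps: float,
                      gpu_time_us: float, pct_change: Optional[float],
                      note: str) -> str:
    """Format one results-table row for performance.md (sketch)."""
    if status not in ("baseline", "keep"):
        # Failed, crashed, or regressive attempts never enter the table.
        raise ValueError(f"status must be baseline or keep, got {status!r}")
    # Assumption: the baseline has no previous best, so it shows "n/a".
    pct = "n/a" if pct_change is None else f"{pct_change:+.1f}%"
    cells = [date, commit, status, command, target, kernel,
             f"{us_flydsl_ps:.1f}", f"{gpu_time_us:.1f}", pct, note]
    return "| " + " | ".join(cells) + " |"
```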
If a stop condition is triggered, append the following to performance.md:
- Stop time
- Stop reason: `5 consecutive measured regressions` or `no idle GPU available for 60 minutes`
- Current `BEST_COMMIT`
- Current best metrics
- Recent 5 regression summaries
For the recent 5 regression summaries, a markdown table is recommended with at least these columns:
- Attempt
- Hypothesis
- Changed Files
- `us_flydsl_ps` best -> candidate
- GPU Time best -> candidate
- Delta
- Notes
Committing is allowed only after confirming that "this round is faster than the current best."
Before committing, you must:
- Run `/format-code`
- First execute `GPU Availability Gate`, then rerun the fixed benchmark
- Confirm that correctness still passes
- Execute `GPU Availability Gate` again and confirm that `rocprofv3` GPU kernel time is still better
The commit message should explain why this change worked, rather than mechanically listing which files were changed.
Once you enter the Optimization Loop, do not stop and ask the human whether to continue. You are an autonomous optimizer:
- Do the baseline first
- Let trace analysis drive modifications
- Roll back regressions
- Record, format, and commit improvements
- Then continue iterating forward from the new best
If there are 5 consecutive performance regressions, or if no idle GPU becomes available for a continuous 1 hour, then stop; otherwise keep going unless a human explicitly interrupts you.