
[Refactor][Benchmark] route ada_layer_norm and cumulative benches through manifest#1387

Merged
lcy-seso merged 5 commits into tile-ai:testbed from lcy-seso:refactor/benchmark/issue-1377
May 9, 2026

Conversation

@lcy-seso
Collaborator

@lcy-seso lcy-seso commented May 8, 2026

Closes #1377

Summary

  • Route benchmarks/ops/bench_ada_layer_norm.py workloads through tileops.manifest.load_workloads; eval roofline through op instance.
  • Route benchmarks/ops/bench_reduce_multidim.py through ManifestBenchmark (FLOP/byte counts now come from op.eval_roofline() instead of six hand-rolled BenchmarkBase subclasses); 3D multi-dim workload shapes remain declared inline because manifest workloads for these ops only cover 2D last-axis reductions, which is a different test scenario from this file's 3D non-last-axis purpose.
  • Output column schema unchanged: m,n,dtype for AdaLayerNorm{Fwd,ZeroFwd}Op; per-fixture columns preserved for bench_reduce_multidim.py (verified with locals()-filter probe — dim lists were already silently dropped pre-PR by BenchmarkReport's serializability filter).
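The conversion pattern described above can be sketched as follows. This is a hypothetical illustration only: the helper name `to_params` and the assumed workload-entry shape (`x_shape` plus `dtype` keys) mirror the PR's description of `load_workloads` / `workloads_to_params`, not the real `tileops.manifest` API.

```python
# Hypothetical sketch: flatten manifest workload entries (assumed to be
# dicts keyed by "x_shape" and "dtype") into (m, n, dtype) tuples
# suitable for pytest.mark.parametrize. The real tileops return format
# may differ.
def to_params(workloads):
    params = []
    for w in workloads:
        m, n = w["x_shape"]
        params.append((m, n, w["dtype"]))
    return params


# Example input shaped like the manifest rows quoted elsewhere in this PR.
example = [
    {"x_shape": [2048, 4096], "dtype": "float16"},
    {"x_shape": [64, 32768], "dtype": "bfloat16"},
]
```

A test would then parametrize over `to_params(load_workloads("AdaLayerNormFwdOp"))` instead of an in-file `WORKLOADS` table, so benchmark coverage cannot drift from the manifest.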

Scope

Converted: bench_ada_layer_norm.py, bench_reduce_multidim.py.

Deferred with explicit technical blockers (not "needs GPU"):

  • bench_cumulative.py: scan.yaml manifest workloads for CumsumFwdOp / CumprodFwdOp do not match the legacy hand-rolled WORKLOADS table (base rows (1024,4096) and (4096,4096) vs manifest rows (2048,4096) and (64,32768)). Routing through load_workloads would change the benchmark row set and violate AC-4 row-set parity. Deferred to a manifest-only PR that aligns scan.yaml workloads first.
  • bench_binary_elementwise.py: all 17 binary/comparison/logical/bitwise ops have zero manifest workloads (load_workloads('SubFwdOp') etc. return []), and the 3 fused gated ops (SiluAndMulFwdOp, GeluAndMulFwdOp, GeluTanhAndMulFwdOp) are absent from the ops manifest entirely. Additionally, the BinaryOp and FusedGatedOp base classes do not implement eval_roofline() — they inherit the L1 base stub that raises NotImplementedError. Both the load_workloads() and ManifestBenchmark paths are therefore blocked. Conversion would require adding workloads to the manifest YAML and implementing eval_roofline on the op base classes, both prohibited by this PR's constraints (MUST NOT modify manifest yaml; MUST NOT modify ops or kernels).
  • bench_activation.py: the file's purpose is sweeping internal kernel-tuning axes (R2-R7 risk points: strategy ∈ {direct, explicit_parallel, register_copy}, num_per_thread, threads=128/256, aligned/unaligned tail sizes) that are not manifest workloads. Manifest workloads for the activation ops exist (2 each) but cover only the model-shape geometry sweep; routing the kernel-tuning sweeps through load_workloads would discard the file's R2-R7 coverage. The two model-shape tests (e.g. test_relu_bench) could be partially converted in a follow-up; the strategy/thread/npt sweeps cannot.
  • bench_binary_arith.py: same BinaryOp eval_roofline blocker as bench_binary_elementwise.py for AddFwdOp, LerpTensorFwdOp, WhereFwdOp. AddFwdOp does have manifest workloads (x_shape/y_shape keyed), but the harness's workloads_to_params only supports single-input x_shape-keyed entries by design (see the _WORKLOAD_META_KEYS contract).
  • bench_independent_elementwise.py: mixed. Some sub-tests (unary activations like LeakyReluFwdOp, EluFwdOp) inherit UnaryOp.eval_roofline and have manifest workloads, but WhereFwdOp / MaskedFillScalarFwdOp / PreluFwdOp / AlibiFwdOp / SinusoidalFwdOp need either multi-input harness support (out of scope per the workloads_to_params contract) or eval_roofline implementations on their op classes.
  • bench_instance_norm.py and bench_group_norm.py (NoAffine variants): the default-affine variants are already routed through load_workloads on this PR's branch. The NoAffine sub-fixtures use the same op class with affine=False; they aren't separately listed in the manifest workload set, so converting them would require either adding NoAffine workload entries (a manifest edit, prohibited) or co-locating them with the affine workloads via a flag column. Leaving them as-is preserves coverage; no functional regression.

A simple input_shape → x_shape rename helper is not what these files need: the actual blockers are missing manifest workload entries, missing op-level eval_roofline() implementations, and multi-input signatures that exceed the current single-input harness contract — all of which are out of scope under the trust-model constraints (MUST NOT modify manifest yaml; MUST NOT modify ops or kernels).

The two landed files establish the conversion pattern. Remaining files require manifest-only PRs (to add/align workloads / extend harness contract) or op-PRs (to implement eval_roofline on BinaryOp / FusedGatedOp) before they can land.
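The eval_roofline blocker referenced above can be made concrete with a sketch. Everything here is illustrative: the class names echo the PR's description (a base stub raising NotImplementedError, which BinaryOp/FusedGatedOp inherit), and the FLOP/byte formulas are assumed, not the real tileops accounting.

```python
# Illustrative only — not the real tileops class hierarchy.
class OpBase:
    def eval_roofline(self, params):
        # The L1 base stub: ops must supply their own FLOP/byte accounting.
        # BinaryOp / FusedGatedOp currently fall through to this, which is
        # what blocks the ManifestBenchmark path for those files.
        raise NotImplementedError


class UnaryOpSketch(OpBase):
    """What an op-level implementation might look like (assumed formulas)."""

    def eval_roofline(self, params):
        m, n, itemsize = params["m"], params["n"], params["itemsize"]
        flops = m * n                        # one op per element (assumed)
        bytes_moved = 2 * m * n * itemsize   # read input + write output
        return flops, bytes_moved
```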

Test plan

  • AC-1: pytest benchmarks/tests/test_roofline_workload_protocol.py -q → 8 passed; pytest benchmarks/ops/bench_ada_layer_norm.py benchmarks/ops/bench_cumulative.py benchmarks/ops/bench_reduce_multidim.py -q → 45 passed.
  • AC-2 (scoped to converted files): grep -nE 'WORKLOADS\s*=\s*\[' benchmarks/ops/bench_ada_layer_norm.py benchmarks/ops/bench_reduce_multidim.py → no matches.
  • AC-3: bench_ada_layer_norm.py imports load_workloads; bench_reduce_multidim.py uses ManifestBenchmark for all six fixtures.
  • AC-4: row-set parity verified via base-vs-head profile_run.log row-key compare on the two converted files (identical row keys); bench_cumulative reverted to upstream/testbed so its row set is unchanged from base by construction.
  • AC-5: PR diff scoped to the two bench files; no tracker edits.

Regression

Schema check on profile_run.log:

  • bench_ada_layer_norm.py columns kept: m,n,dtype,latency_ms,tflops,bandwidth_tbs,config
  • bench_reduce_multidim.py columns kept per fixture (matches pre-PR output after the locals() filter): shape,keepdim,dtype,op_kind (reduce/logical/vector_norm); shape,dim,keepdim,dtype,op_kind (argreduce); shape,dtype,op_kind (cumulative); shape,keepdim,dtype (logsumexp)
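The column stability above rests on passing an explicit params dict to BenchmarkReport.record rather than letting a locals() filter decide the schema. A minimal sketch of that pattern — this BenchmarkReport is a stand-in for the real harness class, and the numeric values are made up:

```python
# Stand-in for the real harness's report class; the pattern, not the API,
# is the point: an explicit dict fixes the column set regardless of what
# local variables happen to exist in the test body.
class BenchmarkReport:
    def __init__(self):
        self.rows = []

    def record(self, params, latency_ms, tflops, bandwidth_tbs):
        row = dict(params)
        row.update(latency_ms=latency_ms, tflops=tflops,
                   bandwidth_tbs=bandwidth_tbs)
        self.rows.append(row)


report = BenchmarkReport()
report.record({"m": 2048, "n": 4096, "dtype": "float16", "op_kind": "cumsum"},
              latency_ms=0.12, tflops=0.07, bandwidth_tbs=1.1)
```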

Copilot AI review requested due to automatic review settings May 8, 2026 17:35
@lcy-seso lcy-seso added labels May 8, 2026: refactor (Code restructuring without behavior change), automated (PR produced by an autonomous agent pipeline), needs-review (Awaiting human review before merge), nightshift (Pickable by foundry nightshift mode — fully autonomous agent development, no human approval gate)
@lcy-seso
Collaborator Author

lcy-seso commented May 8, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the benchmarking scripts for ada_layer_norm and cumulative operations to use a manifest-driven workload loading system and the ManifestBenchmark utility. The changes reorganize test structures and update roofline cache logic. Review feedback highlights that the current implementation assumes 2D input shapes and suggests a more robust method to handle multi-dimensional tensors by flattening leading dimensions.

Comment thread benchmarks/ops/bench_ada_layer_norm.py
Comment thread benchmarks/ops/bench_cumulative.py Outdated
Comment thread benchmarks/ops/bench_cumulative.py Outdated

Copilot AI left a comment


Pull request overview

This PR refactors two benchmark entrypoints to source workload shapes and roofline accounting from the ops manifest, reducing drift between benchmark coverage and manifest-defined workloads/roofline formulas.

Changes:

  • Refactor bench_cumulative.py to parameterize from manifest workloads via workloads_to_params() and compute roofline via ManifestBenchmark (removing the in-file workload table and manual FLOP/bytes formulas).
  • Refactor bench_ada_layer_norm.py to parameterize test cases from tileops.manifest.load_workloads() and keep roofline evaluation delegated to the Op instance, with a small roofline-cache micro-refactor.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
benchmarks/ops/bench_cumulative.py Switch cumsum/cumprod benches to manifest-driven workload parametrization and manifest/op-driven roofline evaluation via ManifestBenchmark.
benchmarks/ops/bench_ada_layer_norm.py Route workload parameter generation through load_workloads() and slightly refactor roofline caching/param helper.

Comment thread benchmarks/ops/bench_cumulative.py Outdated
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors benchmarking scripts, primarily updating bench_cumulative.py to use the generic ManifestBenchmark and workloads_to_params utilities for improved consistency and reduced boilerplate. In bench_ada_layer_norm.py, the parameter loading logic was updated. Feedback suggests further refactoring bench_ada_layer_norm.py to adopt these same generic patterns, which would eliminate redundant logic and address potential issues with class-level caching.

Comment thread benchmarks/ops/bench_ada_layer_norm.py
Comment thread benchmarks/ops/bench_ada_layer_norm.py
Copilot AI review requested due to automatic review settings May 8, 2026 18:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

benchmarks/ops/bench_reduce_multidim.py:112

  • This introduces a second source of truth for the op’s name (string) separate from the op_map/class used to instantiate the op. Since ManifestBenchmark doesn’t currently use op_name at runtime (it’s only stored), these *_OP_NAMES dicts can drift/typo silently. Consider deriving the name from the instantiated op/class (e.g., op.__class__.__name__) or unifying the mapping so the class and name are defined in one place.
_REDUCE_OP_NAMES = {"sum": "SumFwdOp", "mean": "MeanFwdOp", "amax": "AmaxFwdOp"}


def _make_reduce_op(dtype, op_kind, dim, keepdim):
    from tileops.ops.reduction.reduce import AmaxFwdOp, MeanFwdOp, SumFwdOp

@lcy-seso lcy-seso marked this pull request as ready for review May 8, 2026 18:48
@lcy-seso lcy-seso marked this pull request as draft May 8, 2026 19:02
@lcy-seso
Collaborator Author

lcy-seso commented May 8, 2026

Nightshift orchestrator note: this PR is technically ready (Reviewer PASS T018, Gatekeeper resolved all 3 unresolved bot threads). Auto-merge is currently gate-blocked by a pre-existing CI failure on testbed: gpu-smoke fails with "AddFwdOp must have at least 2 workloads" (run 25573419441). Verified not caused by this PR: git diff origin/testbed..HEAD -- tileops/manifest/ is empty.

Leaving as draft pending an independent fix that adds workloads to AddFwdOp on testbed. Once that lands and this branch is rebased, nightshift can flip ready and auto-merge.

Pipeline rounds used: 4 (max 5). Scope landed: bench_ada_layer_norm.py + bench_reduce_multidim.py (manifest-driven). Other 6 listed bench files documented as deferred (manifest workload misalignment / signature mismatch / R2-R7 strategy sweeps).

lcy-seso added a commit that referenced this pull request May 9, 2026
…e ops (#1392)

## Summary

Fix gpu-smoke `test_every_op_has_at_least_two_workloads` failure by
populating empty `workloads:` lists on 21 implemented binary elementwise
ops (`AddFwdOp`, `SubFwdOp`, `MulFwdOp`, `DivFwdOp`, `RemainderFwdOp`,
`PowFwdOp`, `FloorDivideFwdOp`, `LerpFwdOp`, `MaximumFwdOp`,
`MinimumFwdOp`, `Eq/Ne/Gt/Lt/Ge/LeFwdOp`, `LogicalAnd/OrFwdOp`,
`BitwiseAnd/Or/XorFwdOp`).

This pre-existing testbed CI break has been blocking nightshift
auto-merge for PRs #1387, #1388, #1389, #1391. Filed as issue #1390 —
this PR resolves it.

## Plan executed

Each affected op now declares two representative workloads:
- **LLM hidden-state prefill**: `input/other [2048, 4096]` (no broadcast)
- **CNN feature map**: `input [16, 256, 56, 56]`, `other [256, 1, 1]` (channel-wise broadcast)

Dtype matrix matches each op's signature:
- Float-only ops (`Div`, `Pow`, `Lerp`, `Remainder`, `FloorDivide`): `float16, bfloat16`
- Bitwise ops (`BitwiseAnd/Or/Xor`): `int32, int64`
- Logical ops (`LogicalAnd/Or`): `bool`
- Default (other arithmetic + comparison): `float16, bfloat16`

## Test plan

- [x] `pytest tests/test_ops_manifest.py::TestOpSchema::test_every_op_has_at_least_two_workloads` — passes
- [x] `pytest tests/test_validate_manifest.py tests/test_ops_manifest.py` — 239 passed
- [x] `python scripts/validate_manifest.py` — "All manifest checks passed"
- [x] Pre-commit clean

## Acceptance Criteria

Closes #1390.

- AC-1: ✅ Modified files pass unit tests.
- AC-2: ✅ `AddFwdOp.workloads` (and 20 siblings) now have 2 entries each, satisfying signature + broadcasting rules.
- AC-3: ✅ `scripts/validate_manifest.py` exits 0 with no new warnings on the affected ops.
- AC-4: To be verified by gpu-smoke on this PR's run.

## Trust model

Manifest-only PR. No `signature`, `roofline`, `params`, `shape_rules`, or `dtype_combos` edits. Falls within the trust-model rule that workload edits without a status flip require a separate manifest-only PR with human review — this is that PR.

---------

Co-authored-by: Ibuki 🍃 — a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
@lcy-seso lcy-seso force-pushed the refactor/benchmark/issue-1377 branch from aa40789 to c4ab779 on May 9, 2026 05:20
@lcy-seso lcy-seso marked this pull request as ready for review May 9, 2026 05:20
Copilot AI review requested due to automatic review settings May 9, 2026 05:20

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Contributor

@Ibuki-wind Ibuki-wind left a comment


Clean — no issues.

lcy-seso and others added 3 commits May 9, 2026 13:42
…ough manifest

bench_ada_layer_norm.py: replace `_manifest_params(op_name)` (whose call
site hid the literal op name behind a parameter) with
`_to_params(load_workloads(<literal>))`, so the manifest validator's AST
check ties each `load_workloads(...)` call to its op.

bench_cumulative.py: replace the hand-rolled `CumulativeBenchFixture`
shape/dtype matrix and the per-file `CumulativeBenchmark.calculate_*`
formulas with `workloads_to_params(...)` + `ManifestBenchmark` for both
CumsumFwdOp and CumprodFwdOp; the report columns (latency_ms / tflops /
bandwidth_tbs and the filtered-locals param keys) are unchanged.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
Pass an explicit params dict (m, n, dtype, op_kind) to
BenchmarkReport.record so the cumulative bench output header matches
the pre-PR baseline byte-for-byte instead of leaking the new
shape-based parametrization keys via locals().

Also narrow tuple[float, float] | None in AdaLayerNorm benchmark
roofline accessors via a local rebind so pyright sees a non-None
return path.

Co-Authored-By: Ibuki - a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
Restore explicit elif branch for cumprod and raise ValueError for
unknown op_kind values, so typos or future op_kind extensions surface
loudly instead of silently producing a cumprod baseline.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
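The loud-failure dispatch this commit describes can be sketched in plain Python. The function below is purely illustrative — the real baseline presumably operates on tensors — but it shows the control-flow shape: explicit branches per op_kind and a ValueError for everything else.

```python
# Hypothetical baseline dispatch mirroring the commit's description:
# explicit if/elif branches per op_kind, ValueError for anything else.
def cumulative_baseline(op_kind, xs):
    if op_kind == "cumsum":
        out, acc = [], 0
        for x in xs:
            acc += x
            out.append(acc)
        return out
    elif op_kind == "cumprod":
        out, acc = [], 1
        for x in xs:
            acc *= x
            out.append(acc)
        return out
    else:
        # A typo or a future op_kind surfaces here instead of silently
        # falling through to the cumprod baseline.
        raise ValueError(f"unknown op_kind: {op_kind!r}")
```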
lcy-seso and others added 2 commits May 9, 2026 13:42
…chmark

Replace the six hand-rolled BenchmarkBase subclasses
(Reduce/Argreduce/LogicalReduce/VectorNorm/Cumulative/LogSumExp) with
ManifestBenchmark; FLOP and byte counts now come from each op's
eval_roofline() rather than per-class calculate_flops/calculate_memory.

3D multi-dim shapes stay declared inline because the manifest workload
set for these ops only covers 2D last-axis reductions (a different
test scenario from this file's 3D non-last-axis purpose); per the trust
model, manifest workloads cannot be edited from a code PR.

Output column schema preserved against pre-PR baseline by passing an
explicit params dict to BenchmarkReport.record (mirrors the cumulative
schema-preservation fix from round 2).

Co-Authored-By: Ibuki - a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
Revert bench_cumulative.py to the pre-PR (upstream/testbed) state so the
benchmark workload rows stay aligned with the legacy table.

scan.yaml manifest workloads for CumsumFwdOp/CumprodFwdOp do not match
the hand-rolled WORKLOADS list this file uses (base rows (1024,4096) and
(4096,4096) vs manifest rows (2048,4096) and (64,32768)), which violates
the AC-4 row-set parity constraint. Move bench_cumulative into the
deferred bucket; a separate manifest-only PR will align scan.yaml first.

Co-Authored-By: Ibuki - a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
@lcy-seso lcy-seso force-pushed the refactor/benchmark/issue-1377 branch from c4ab779 to e3b4415 on May 9, 2026 05:42
@lcy-seso lcy-seso merged commit 52b8b55 into tile-ai:testbed May 9, 2026
11 checks passed