[Refactor][Benchmark] route ada_layer_norm and cumulative benches through manifest#1387
Conversation
/gemini review
Code Review
This pull request refactors the benchmarking scripts for ada_layer_norm and cumulative operations to use a manifest-driven workload loading system and the ManifestBenchmark utility. The changes reorganize test structures and update roofline cache logic. Review feedback highlights that the current implementation assumes 2D input shapes and suggests a more robust method to handle multi-dimensional tensors by flattening leading dimensions.
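The review's multi-dimensional-tensor suggestion amounts to collapsing all leading dimensions into a single row axis before the 2D code path runs. A minimal sketch on bare shape tuples (illustrative only, not the PR's actual code):

```python
from math import prod

def flatten_leading_dims(shape):
    """Collapse leading dims so a 2D code path handles any rank:
    (d0, ..., dk, n) -> (d0 * ... * dk, n).  prod(()) == 1 covers 1D input."""
    *leading, last = shape
    return (prod(leading), last)
```

On an actual tensor this corresponds to `x.reshape(-1, x.shape[-1])`, which is cheap for contiguous inputs.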
Pull request overview
This PR refactors two benchmark entrypoints to source workload shapes and roofline accounting from the ops manifest, reducing drift between benchmark coverage and manifest-defined workloads/roofline formulas.
Changes:
- Refactor `bench_cumulative.py` to parameterize from manifest workloads via `workloads_to_params()` and compute roofline via `ManifestBenchmark` (removing the in-file workload table and manual FLOP/bytes formulas).
- Refactor `bench_ada_layer_norm.py` to parameterize test cases from `tileops.manifest.load_workloads()` and keep roofline evaluation delegated to the Op instance, with a small roofline-cache micro-refactor.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| benchmarks/ops/bench_cumulative.py | Switch cumsum/cumprod benches to manifest-driven workload parametrization and manifest/op-driven roofline evaluation via ManifestBenchmark. |
| benchmarks/ops/bench_ada_layer_norm.py | Route workload parameter generation through load_workloads() and slightly refactor roofline caching/param helper. |
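The pattern both files adopt is to read workload dicts from the manifest and flatten them into pytest-style parameter tuples. A sketch assuming single-input `x_shape`-keyed entries; the helper name matches the PR but its real signature and the workload schema are assumptions:

```python
def workloads_to_params(workloads):
    """Flatten manifest workload dicts into (m, n, dtype) parameter tuples,
    assuming single-input entries keyed by 'x_shape' (hypothetical schema)."""
    params = []
    for w in workloads:
        m, n = w["x_shape"]
        params.append((m, n, w["dtype"]))
    return params

# Stand-in for what load_workloads("CumsumFwdOp") might return.
workloads = [
    {"x_shape": [2048, 4096], "dtype": "float16"},
    {"x_shape": [64, 32768], "dtype": "bfloat16"},
]
```

Benchmarks then parametrize over `workloads_to_params(load_workloads(...))` instead of a hand-maintained in-file table.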
Code Review
This pull request refactors benchmarking scripts, primarily updating bench_cumulative.py to use the generic ManifestBenchmark and workloads_to_params utilities for improved consistency and reduced boilerplate. In bench_ada_layer_norm.py, the parameter loading logic was updated. Feedback suggests further refactoring bench_ada_layer_norm.py to adopt these same generic patterns, which would eliminate redundant logic and address potential issues with class-level caching.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
benchmarks/ops/bench_reduce_multidim.py:112
- This introduces a second source of truth for the op's name (string) separate from the `op_map`/class used to instantiate the op. Since `ManifestBenchmark` doesn't currently use `op_name` at runtime (it's only stored), these `*_OP_NAMES` dicts can drift or pick up typos silently. Consider deriving the name from the instantiated op/class (e.g., `op.__class__.__name__`) or unifying the mapping so the class and name are defined in one place.
```python
_REDUCE_OP_NAMES = {"sum": "SumFwdOp", "mean": "MeanFwdOp", "amax": "AmaxFwdOp"}

def _make_reduce_op(dtype, op_kind, dim, keepdim):
    from tileops.ops.reduction.reduce import AmaxFwdOp, MeanFwdOp, SumFwdOp
```
Nightshift orchestrator note: this PR is technically ready (Reviewer PASS T018; Gatekeeper resolved all 3 unresolved bot threads). Auto-merge is currently gate-blocked by a pre-existing CI failure on testbed; leaving as draft pending an independent fix that adds the missing workloads (see #1392 below). Pipeline rounds used: 4 (max 5). Scope landed: […]
…e ops (#1392)

## Summary

Fix the gpu-smoke `test_every_op_has_at_least_two_workloads` failure by populating empty `workloads:` lists on 21 implemented binary elementwise ops (`AddFwdOp`, `SubFwdOp`, `MulFwdOp`, `DivFwdOp`, `RemainderFwdOp`, `PowFwdOp`, `FloorDivideFwdOp`, `LerpFwdOp`, `MaximumFwdOp`, `MinimumFwdOp`, `Eq/Ne/Gt/Lt/Ge/LeFwdOp`, `LogicalAnd/OrFwdOp`, `BitwiseAnd/Or/XorFwdOp`). This pre-existing testbed CI break has been blocking nightshift auto-merge for PRs #1387, #1388, #1389, #1391. Filed as issue #1390; this PR resolves it.

## Plan executed

Each affected op now declares two representative workloads:

- **LLM hidden-state prefill**: `input`/`other` `[2048, 4096]` (no broadcast)
- **CNN feature map**: `input [16, 256, 56, 56]`, `other [256, 1, 1]` (channel-wise broadcast)

The dtype matrix matches each op's signature:

- Float-only ops (`Div`, `Pow`, `Lerp`, `Remainder`, `FloorDivide`): `float16, bfloat16`
- Bitwise ops (`BitwiseAnd/Or/Xor`): `int32, int64`
- Logical ops (`LogicalAnd/Or`): `bool`
- Default (other arithmetic + comparison): `float16, bfloat16`

## Test plan

- [x] `pytest tests/test_ops_manifest.py::TestOpSchema::test_every_op_has_at_least_two_workloads` passes
- [x] `pytest tests/test_validate_manifest.py tests/test_ops_manifest.py`: 239 passed
- [x] `python scripts/validate_manifest.py`: "All manifest checks passed"
- [x] Pre-commit clean

## Acceptance Criteria

Closes #1390.

- AC-1: ✅ Modified files pass unit tests.
- AC-2: ✅ `AddFwdOp.workloads` (and 20 siblings) now have 2 entries each, satisfying signature and broadcasting rules.
- AC-3: ✅ `scripts/validate_manifest.py` exits 0 with no new warnings on the affected ops.
- AC-4: To be verified by gpu-smoke on this PR's run.

## Trust model

Manifest-only PR. No `signature`, `roofline`, `params`, `shape_rules`, or `dtype_combos` edits. Falls within the trust-model rule that workload edits without a status flip require a separate manifest-only PR with human review; this is that PR.

Co-authored-by: Ibuki 🍃 — a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
Force-pushed from aa40789 to c4ab779.
…ough manifest

- `bench_ada_layer_norm.py`: replace `_manifest_params(op_name)` (whose call site hid the literal op name behind a parameter) with `_to_params(load_workloads(<literal>))`, so the manifest validator's AST check ties each `load_workloads(...)` call to its op.
- `bench_cumulative.py`: replace the hand-rolled `CumulativeBenchFixture` shape/dtype matrix and the per-file `CumulativeBenchmark.calculate_*` formulas with `workloads_to_params(...)` + `ManifestBenchmark` for both `CumsumFwdOp` and `CumprodFwdOp`; the report columns (`latency_ms` / `tflops` / `bandwidth_tbs` and the filtered-locals param keys) are unchanged.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
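The AST check this commit caters to can be sketched roughly as follows (a toy version; the actual validator's implementation is not shown in this PR):

```python
import ast

def load_workloads_literals(source):
    """Return the string literals passed to load_workloads(...) calls,
    so a validator can tie each call site to a concrete manifest op."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "load_workloads"
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            names.append(node.args[0].value)
    return names
```

A call hidden behind a parameter, e.g. `load_workloads(op_name)`, yields nothing here, which is exactly why the commit inlines the literal op name at each call site.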
Pass an explicit params dict (m, n, dtype, op_kind) to BenchmarkReport.record so the cumulative bench output header matches the pre-PR baseline byte-for-byte instead of leaking the new shape-based parametrization keys via locals(). Also narrow tuple[float, float] | None in AdaLayerNorm benchmark roofline accessors via a local rebind so pyright sees a non-None return path. Co-Authored-By: Ibuki - a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
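The `tuple[float, float] | None` narrowing via a local rebind is a standard pyright idiom; a toy sketch with hypothetical names, not the bench's actual accessor:

```python
from typing import Optional, Tuple

def roofline_or_default(cached: Optional[Tuple[float, float]]) -> Tuple[float, float]:
    """Rebind to a local before the None check so the type checker narrows
    the local and sees a non-None value on every return path."""
    result = cached
    if result is None:
        result = (0.0, 0.0)  # hypothetical fallback; the real bench recomputes here
    return result
```

Narrowing a plain local survives intervening calls, whereas narrowing an attribute or dict entry can be invalidated by the checker between statements.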
Restore explicit elif branch for cumprod and raise ValueError for unknown op_kind values, so typos or future op_kind extensions surface loudly instead of silently producing a cumprod baseline. Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
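The explicit-elif-plus-raise shape described here can be illustrated with toy list-based baselines (the real bench dispatches to reference kernel implementations):

```python
def cumulative_baseline(op_kind, xs):
    """Explicit branches with a loud failure for unknown kinds, so a typo
    can't silently fall through to the cumprod path."""
    if op_kind == "cumsum":
        acc, out = 0, []
        for v in xs:
            acc += v
            out.append(acc)
    elif op_kind == "cumprod":
        acc, out = 1, []
        for v in xs:
            acc *= v
            out.append(acc)
    else:
        raise ValueError(f"unknown op_kind: {op_kind!r}")
    return out
```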
…chmark Replace the six hand-rolled BenchmarkBase subclasses (Reduce/Argreduce/LogicalReduce/VectorNorm/Cumulative/LogSumExp) with ManifestBenchmark; FLOP and byte counts now come from each op's eval_roofline() rather than per-class calculate_flops/calculate_memory. 3D multi-dim shapes stay declared inline because the manifest workload set for these ops only covers 2D last-axis reductions (a different test scenario from this file's 3D non-last-axis purpose); per the trust model, manifest workloads cannot be edited from a code PR. Output column schema preserved against pre-PR baseline by passing an explicit params dict to BenchmarkReport.record (mirrors the cumulative schema-preservation fix from round 2). Co-Authored-By: Ibuki - a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
Revert bench_cumulative.py to the pre-PR (upstream/testbed) state so the benchmark workload rows stay aligned with the legacy table. scan.yaml manifest workloads for CumsumFwdOp/CumprodFwdOp do not match the hand-rolled WORKLOADS list this file uses (base rows (1024,4096) and (4096,4096) vs manifest rows (2048,4096) and (64,32768)), which violates the AC-4 row-set parity constraint. Move bench_cumulative into the deferred bucket; a separate manifest-only PR will align scan.yaml first. Co-Authored-By: Ibuki - a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
Force-pushed from c4ab779 to e3b4415.
Closes #1377
Summary
- `benchmarks/ops/bench_ada_layer_norm.py`: workloads through `tileops.manifest.load_workloads`; eval roofline through the op instance.
- `benchmarks/ops/bench_reduce_multidim.py`: through `ManifestBenchmark` (FLOP/byte counts now come from `op.eval_roofline()` instead of six hand-rolled `BenchmarkBase` subclasses); 3D multi-dim workload shapes remain declared inline because manifest workloads for these ops only cover 2D last-axis reductions, which is a different test scenario from this file's 3D non-last-axis purpose.
- Report schema preserved: `m,n,dtype` for `AdaLayerNorm{Fwd,ZeroFwd}Op`; per-fixture columns preserved for `bench_reduce_multidim.py` (verified with a locals()-filter probe; `dim` lists were already silently dropped pre-PR by `BenchmarkReport`'s serializability filter).

Scope

Converted: `bench_ada_layer_norm.py`, `bench_reduce_multidim.py`.

Deferred with explicit technical blockers (not "needs GPU"):

- `bench_cumulative.py`: scan.yaml manifest workloads for `CumsumFwdOp`/`CumprodFwdOp` do not match the legacy hand-rolled `WORKLOADS` table (base rows `(1024,4096)` and `(4096,4096)` vs manifest rows `(2048,4096)` and `(64,32768)`). Routing through `load_workloads` would change the benchmark row set and violate AC-4 row-set parity. Defer to a manifest-only PR that aligns scan.yaml workloads first.
- `bench_binary_elementwise.py`: no manifest workloads (`load_workloads('SubFwdOp')` etc. return `[]`); the 3 fused gated ops (`SiluAndMulFwdOp`, `GeluAndMulFwdOp`, `GeluTanhAndMulFwdOp`) are absent from the ops manifest entirely. Additionally, the `BinaryOp` and `FusedGatedOp` base classes do not implement `eval_roofline()`; they inherit the L1 base stub that raises `NotImplementedError`. Both the `load_workloads()` and `ManifestBenchmark` paths are blocked. Conversion would require adding workloads to manifest YAML and implementing `eval_roofline` on the op base classes, both prohibited by this PR's constraints (MUST NOT modify manifest yaml; MUST NOT modify ops or kernels).
- `bench_activation.py`: sweeps kernel-tuning parameters (`num_per_thread`, threads=128/256, aligned/unaligned tail sizes) that are not manifest workloads. Manifest workloads for the activation ops exist (2 each) but cover only the model-shape geometry sweep; routing the kernel-tuning sweeps through `load_workloads` would discard the file's R2-R7 coverage. The two model-shape tests (e.g. `test_relu_bench`) could be partially converted in a follow-up; the strategy/thread/npt sweeps cannot.
- `bench_binary_arith.py`: same `BinaryOp` `eval_roofline` blocker as `bench_binary_elementwise.py` for `AddFwdOp`, `LerpTensorFwdOp`, `WhereFwdOp`. `AddFwdOp` does have manifest workloads (`x_shape`/`y_shape` keyed), but the harness `workloads_to_params` only supports single-input `x_shape`-keyed entries by design (see the `_WORKLOAD_META_KEYS` contract).
- `bench_independent_elementwise.py`: some ops (`LeakyReluFwdOp`, `EluFwdOp`) inherit `UnaryOp.eval_roofline` and have manifest workloads, but `WhereFwdOp`/`MaskedFillScalarFwdOp`/`PreluFwdOp`/`AlibiFwdOp`/`SinusoidalFwdOp` need either multi-input harness support (out of scope per the `workloads_to_params` contract) or `eval_roofline` impls on their op classes.
- `bench_instance_norm.py` / `bench_group_norm.py` (NoAffine variants): […] `load_workloads` in this PR's branch. The NoAffine sub-fixtures use the same op class with `affine=False`; they aren't separately listed in the manifest workload set, so converting them would require either adding NoAffine workload entries (a manifest edit, prohibited) or co-locating them with the affine workloads via a flag column. Leaving as-is preserves coverage; no functional regression.

A simple `input_shape → x_shape` rename helper is not what these files need: the actual blockers are missing manifest workload entries, missing op-level `eval_roofline()` implementations, and multi-input signatures that exceed the current single-input harness contract, all of which are out of scope under the trust-model constraints (MUST NOT modify manifest yaml; MUST NOT modify ops or kernels).

The two landed files establish the conversion pattern. Remaining files require manifest-only PRs (to add/align workloads or extend the harness contract) or op PRs (to implement `eval_roofline` on `BinaryOp`/`FusedGatedOp`) before they can land.

Test plan
- `pytest benchmarks/tests/test_roofline_workload_protocol.py -q` → 8 passed
- `pytest benchmarks/ops/bench_ada_layer_norm.py benchmarks/ops/bench_cumulative.py benchmarks/ops/bench_reduce_multidim.py -q` → 45 passed
- `grep -nE 'WORKLOADS\s*=\s*\[' benchmarks/ops/bench_ada_layer_norm.py benchmarks/ops/bench_reduce_multidim.py` → no matches
- `bench_ada_layer_norm.py` imports `load_workloads`; `bench_reduce_multidim.py` uses `ManifestBenchmark` for all six fixtures

Regression
Schema check on `profile_run.log`:

- `bench_ada_layer_norm.py`: `m,n,dtype,latency_ms,tflops,bandwidth_tbs,config`
- `bench_reduce_multidim.py`: `shape,keepdim,dtype,op_kind` (reduce/logical/vector_norm); `shape,dim,keepdim,dtype,op_kind` (argreduce); `shape,dtype,op_kind` (cumulative); `shape,keepdim,dtype` (logsumexp)