[Refactor][Benchmark] route ada_layer_norm and cumulative benches through manifest#1387
Conversation
/gemini review
Code Review
This pull request refactors the benchmarking scripts for ada_layer_norm and cumulative operations to use a manifest-driven workload loading system and the ManifestBenchmark utility. The changes reorganize test structures and update roofline cache logic. Review feedback highlights that the current implementation assumes 2D input shapes and suggests a more robust method to handle multi-dimensional tensors by flattening leading dimensions.
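The review's multi-dimensional-tensor suggestion amounts to collapsing all leading dimensions into a single row axis before the 2D code path runs. A minimal sketch on bare shape tuples (illustrative only, not the PR's actual code):

```python
from math import prod

def flatten_leading_dims(shape):
    """Collapse leading dims so a 2D code path handles any rank:
    (d0, ..., dk, n) -> (d0 * ... * dk, n).  prod(()) == 1 covers 1D input."""
    *leading, last = shape
    return (prod(leading), last)
```

On an actual tensor this corresponds to `x.reshape(-1, x.shape[-1])`, which is cheap for contiguous inputs.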
Pull request overview
This PR refactors two benchmark entrypoints to source workload shapes and roofline accounting from the ops manifest, reducing drift between benchmark coverage and manifest-defined workloads/roofline formulas.
Changes:
- Refactor `bench_cumulative.py` to parameterize from manifest workloads via `workloads_to_params()` and compute roofline via `ManifestBenchmark` (removing the in-file workload table and manual FLOP/bytes formulas).
- Refactor `bench_ada_layer_norm.py` to parameterize test cases from `tileops.manifest.load_workloads()` and keep roofline evaluation delegated to the Op instance, with a small roofline-cache micro-refactor.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| benchmarks/ops/bench_cumulative.py | Switch cumsum/cumprod benches to manifest-driven workload parametrization and manifest/op-driven roofline evaluation via ManifestBenchmark. |
| benchmarks/ops/bench_ada_layer_norm.py | Route workload parameter generation through load_workloads() and slightly refactor roofline caching/param helper. |
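The pattern both files adopt is to read workload dicts from the manifest and flatten them into pytest-style parameter tuples. A sketch assuming single-input `x_shape`-keyed entries; the helper name matches the PR but its real signature and the workload schema are assumptions:

```python
def workloads_to_params(workloads):
    """Flatten manifest workload dicts into (m, n, dtype) parameter tuples,
    assuming single-input entries keyed by 'x_shape' (hypothetical schema)."""
    params = []
    for w in workloads:
        m, n = w["x_shape"]
        params.append((m, n, w["dtype"]))
    return params

# Stand-in for what load_workloads("CumsumFwdOp") might return.
workloads = [
    {"x_shape": [2048, 4096], "dtype": "float16"},
    {"x_shape": [64, 32768], "dtype": "bfloat16"},
]
```

Benchmarks then parametrize over `workloads_to_params(load_workloads(...))` instead of a hand-maintained in-file table.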
Code Review
This pull request refactors benchmarking scripts, primarily updating bench_cumulative.py to use the generic ManifestBenchmark and workloads_to_params utilities for improved consistency and reduced boilerplate. In bench_ada_layer_norm.py, the parameter loading logic was updated. Feedback suggests further refactoring bench_ada_layer_norm.py to adopt these same generic patterns, which would eliminate redundant logic and address potential issues with class-level caching.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
benchmarks/ops/bench_reduce_multidim.py:112
- This introduces a second source of truth for the op's name (string) separate from the `op_map`/class used to instantiate the op. Since `ManifestBenchmark` doesn't currently use `op_name` at runtime (it's only stored), these `*_OP_NAMES` dicts can drift or pick up typos silently. Consider deriving the name from the instantiated op/class (e.g., `op.__class__.__name__`) or unifying the mapping so the class and name are defined in one place.
```python
_REDUCE_OP_NAMES = {"sum": "SumFwdOp", "mean": "MeanFwdOp", "amax": "AmaxFwdOp"}

def _make_reduce_op(dtype, op_kind, dim, keepdim):
    from tileops.ops.reduction.reduce import AmaxFwdOp, MeanFwdOp, SumFwdOp
```
Nightshift orchestrator note: this PR is technically ready (Reviewer PASS T018; Gatekeeper resolved all 3 unresolved bot threads). Auto-merge is currently gate-blocked by a pre-existing CI failure on testbed; leaving as draft pending an independent fix that adds the missing workloads (see #1392 below). Pipeline rounds used: 4 (max 5). Scope landed: […]
…e ops (#1392)

## Summary

Fix the gpu-smoke `test_every_op_has_at_least_two_workloads` failure by populating empty `workloads:` lists on 21 implemented binary elementwise ops (`AddFwdOp`, `SubFwdOp`, `MulFwdOp`, `DivFwdOp`, `RemainderFwdOp`, `PowFwdOp`, `FloorDivideFwdOp`, `LerpFwdOp`, `MaximumFwdOp`, `MinimumFwdOp`, `Eq/Ne/Gt/Lt/Ge/LeFwdOp`, `LogicalAnd/OrFwdOp`, `BitwiseAnd/Or/XorFwdOp`). This pre-existing testbed CI break has been blocking nightshift auto-merge for PRs #1387, #1388, #1389, #1391. Filed as issue #1390; this PR resolves it.

## Plan executed

Each affected op now declares two representative workloads:

- **LLM hidden-state prefill**: `input`/`other` `[2048, 4096]` (no broadcast)
- **CNN feature map**: `input [16, 256, 56, 56]`, `other [256, 1, 1]` (channel-wise broadcast)

The dtype matrix matches each op's signature:

- Float-only ops (`Div`, `Pow`, `Lerp`, `Remainder`, `FloorDivide`): `float16, bfloat16`
- Bitwise ops (`BitwiseAnd/Or/Xor`): `int32, int64`
- Logical ops (`LogicalAnd/Or`): `bool`
- Default (other arithmetic + comparison): `float16, bfloat16`

## Test plan

- [x] `pytest tests/test_ops_manifest.py::TestOpSchema::test_every_op_has_at_least_two_workloads` passes
- [x] `pytest tests/test_validate_manifest.py tests/test_ops_manifest.py`: 239 passed
- [x] `python scripts/validate_manifest.py`: "All manifest checks passed"
- [x] Pre-commit clean

## Acceptance Criteria

Closes #1390.

- AC-1: ✅ Modified files pass unit tests.
- AC-2: ✅ `AddFwdOp.workloads` (and 20 siblings) now have 2 entries each, satisfying signature and broadcasting rules.
- AC-3: ✅ `scripts/validate_manifest.py` exits 0 with no new warnings on the affected ops.
- AC-4: To be verified by gpu-smoke on this PR's run.

## Trust model

Manifest-only PR. No `signature`, `roofline`, `params`, `shape_rules`, or `dtype_combos` edits. Falls within the trust-model rule that workload edits without a status flip require a separate manifest-only PR with human review; this is that PR.

Co-authored-by: Ibuki 🍃 — a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
Force-pushed from aa40789 to c4ab779.
…ough manifest

- `bench_ada_layer_norm.py`: replace `_manifest_params(op_name)` (whose call site hid the literal op name behind a parameter) with `_to_params(load_workloads(<literal>))`, so the manifest validator's AST check ties each `load_workloads(...)` call to its op.
- `bench_cumulative.py`: replace the hand-rolled `CumulativeBenchFixture` shape/dtype matrix and the per-file `CumulativeBenchmark.calculate_*` formulas with `workloads_to_params(...)` + `ManifestBenchmark` for both `CumsumFwdOp` and `CumprodFwdOp`; the report columns (`latency_ms` / `tflops` / `bandwidth_tbs` and the filtered-locals param keys) are unchanged.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
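The AST check this commit caters to can be sketched roughly as follows (a toy version; the actual validator's implementation is not shown in this PR):

```python
import ast

def load_workloads_literals(source):
    """Return the string literals passed to load_workloads(...) calls,
    so a validator can tie each call site to a concrete manifest op."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "load_workloads"
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            names.append(node.args[0].value)
    return names
```

A call hidden behind a parameter, e.g. `load_workloads(op_name)`, yields nothing here, which is exactly why the commit inlines the literal op name at each call site.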
Pass an explicit params dict (m, n, dtype, op_kind) to BenchmarkReport.record so the cumulative bench output header matches the pre-PR baseline byte-for-byte instead of leaking the new shape-based parametrization keys via locals(). Also narrow tuple[float, float] | None in AdaLayerNorm benchmark roofline accessors via a local rebind so pyright sees a non-None return path. Co-Authored-By: Ibuki - a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
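The `tuple[float, float] | None` narrowing via a local rebind is a standard pyright idiom; a toy sketch with hypothetical names, not the bench's actual accessor:

```python
from typing import Optional, Tuple

def roofline_or_default(cached: Optional[Tuple[float, float]]) -> Tuple[float, float]:
    """Rebind to a local before the None check so the type checker narrows
    the local and sees a non-None value on every return path."""
    result = cached
    if result is None:
        result = (0.0, 0.0)  # hypothetical fallback; the real bench recomputes here
    return result
```

Narrowing a plain local survives intervening calls, whereas narrowing an attribute or dict entry can be invalidated by the checker between statements.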
Restore explicit elif branch for cumprod and raise ValueError for unknown op_kind values, so typos or future op_kind extensions surface loudly instead of silently producing a cumprod baseline. Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
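The explicit-elif-plus-raise shape described here can be illustrated with toy list-based baselines (the real bench dispatches to reference kernel implementations):

```python
def cumulative_baseline(op_kind, xs):
    """Explicit branches with a loud failure for unknown kinds, so a typo
    can't silently fall through to the cumprod path."""
    if op_kind == "cumsum":
        acc, out = 0, []
        for v in xs:
            acc += v
            out.append(acc)
    elif op_kind == "cumprod":
        acc, out = 1, []
        for v in xs:
            acc *= v
            out.append(acc)
    else:
        raise ValueError(f"unknown op_kind: {op_kind!r}")
    return out
```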
…chmark Replace the six hand-rolled BenchmarkBase subclasses (Reduce/Argreduce/LogicalReduce/VectorNorm/Cumulative/LogSumExp) with ManifestBenchmark; FLOP and byte counts now come from each op's eval_roofline() rather than per-class calculate_flops/calculate_memory. 3D multi-dim shapes stay declared inline because the manifest workload set for these ops only covers 2D last-axis reductions (a different test scenario from this file's 3D non-last-axis purpose); per the trust model, manifest workloads cannot be edited from a code PR. Output column schema preserved against pre-PR baseline by passing an explicit params dict to BenchmarkReport.record (mirrors the cumulative schema-preservation fix from round 2). Co-Authored-By: Ibuki - a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
Revert bench_cumulative.py to the pre-PR (upstream/testbed) state so the benchmark workload rows stay aligned with the legacy table. scan.yaml manifest workloads for CumsumFwdOp/CumprodFwdOp do not match the hand-rolled WORKLOADS list this file uses (base rows (1024,4096) and (4096,4096) vs manifest rows (2048,4096) and (64,32768)), which violates the AC-4 row-set parity constraint. Move bench_cumulative into the deferred bucket; a separate manifest-only PR will align scan.yaml first. Co-Authored-By: Ibuki - a wind born from GPTs <Ibuki-wind@users.noreply.github.com>
Force-pushed from c4ab779 to e3b4415.
Closes #1377
Summary
- `benchmarks/ops/bench_ada_layer_norm.py`: workloads through `tileops.manifest.load_workloads`; eval roofline through the op instance.
- `benchmarks/ops/bench_reduce_multidim.py`: through `ManifestBenchmark` (FLOP/byte counts now come from `op.eval_roofline()` instead of six hand-rolled `BenchmarkBase` subclasses); 3D multi-dim workload shapes remain declared inline because manifest workloads for these ops only cover 2D last-axis reductions, which is a different test scenario from this file's 3D non-last-axis purpose.
- Report schema preserved: `m,n,dtype` for `AdaLayerNorm{Fwd,ZeroFwd}Op`; per-fixture columns preserved for `bench_reduce_multidim.py` (verified with a locals()-filter probe; `dim` lists were already silently dropped pre-PR by `BenchmarkReport`'s serializability filter).

Scope

Converted: `bench_ada_layer_norm.py`, `bench_reduce_multidim.py`.

Deferred with explicit technical blockers (not "needs GPU"):

- `bench_cumulative.py`: scan.yaml manifest workloads for `CumsumFwdOp`/`CumprodFwdOp` do not match the legacy hand-rolled `WORKLOADS` table (base rows `(1024,4096)` and `(4096,4096)` vs manifest rows `(2048,4096)` and `(64,32768)`). Routing through `load_workloads` would change the benchmark row set and violate AC-4 row-set parity. Defer to a manifest-only PR that aligns scan.yaml workloads first.
- `bench_binary_elementwise.py`: no manifest workloads (`load_workloads('SubFwdOp')` etc. return `[]`); the 3 fused gated ops (`SiluAndMulFwdOp`, `GeluAndMulFwdOp`, `GeluTanhAndMulFwdOp`) are absent from the ops manifest entirely. Additionally, the `BinaryOp` and `FusedGatedOp` base classes do not implement `eval_roofline()`; they inherit the L1 base stub that raises `NotImplementedError`. Both the `load_workloads()` and `ManifestBenchmark` paths are blocked. Conversion would require adding workloads to manifest YAML and implementing `eval_roofline` on the op base classes, both prohibited by this PR's constraints (MUST NOT modify manifest yaml; MUST NOT modify ops or kernels).
- `bench_activation.py`: sweeps kernel-tuning parameters (`num_per_thread`, threads=128/256, aligned/unaligned tail sizes) that are not manifest workloads. Manifest workloads for the activation ops exist (2 each) but cover only the model-shape geometry sweep; routing the kernel-tuning sweeps through `load_workloads` would discard the file's R2-R7 coverage. The two model-shape tests (e.g. `test_relu_bench`) could be partially converted in a follow-up; the strategy/thread/npt sweeps cannot.
- `bench_binary_arith.py`: same `BinaryOp` `eval_roofline` blocker as `bench_binary_elementwise.py` for `AddFwdOp`, `LerpTensorFwdOp`, `WhereFwdOp`. `AddFwdOp` does have manifest workloads (`x_shape`/`y_shape` keyed), but the harness `workloads_to_params` only supports single-input `x_shape`-keyed entries by design (see the `_WORKLOAD_META_KEYS` contract).
- `bench_independent_elementwise.py`: some ops (`LeakyReluFwdOp`, `EluFwdOp`) inherit `UnaryOp.eval_roofline` and have manifest workloads, but `WhereFwdOp`/`MaskedFillScalarFwdOp`/`PreluFwdOp`/`AlibiFwdOp`/`SinusoidalFwdOp` need either multi-input harness support (out of scope per the `workloads_to_params` contract) or `eval_roofline` impls on their op classes.
- `bench_instance_norm.py` / `bench_group_norm.py` (NoAffine variants): […] `load_workloads` in this PR's branch. The NoAffine sub-fixtures use the same op class with `affine=False`; they aren't separately listed in the manifest workload set, so converting them would require either adding NoAffine workload entries (a manifest edit, prohibited) or co-locating them with the affine workloads via a flag column. Leaving as-is preserves coverage; no functional regression.

A simple `input_shape → x_shape` rename helper is not what these files need: the actual blockers are missing manifest workload entries, missing op-level `eval_roofline()` implementations, and multi-input signatures that exceed the current single-input harness contract, all of which are out of scope under the trust-model constraints (MUST NOT modify manifest yaml; MUST NOT modify ops or kernels).

The two landed files establish the conversion pattern. Remaining files require manifest-only PRs (to add/align workloads or extend the harness contract) or op PRs (to implement `eval_roofline` on `BinaryOp`/`FusedGatedOp`) before they can land.

Test plan
- `pytest benchmarks/tests/test_roofline_workload_protocol.py -q` → 8 passed
- `pytest benchmarks/ops/bench_ada_layer_norm.py benchmarks/ops/bench_cumulative.py benchmarks/ops/bench_reduce_multidim.py -q` → 45 passed
- `grep -nE 'WORKLOADS\s*=\s*\[' benchmarks/ops/bench_ada_layer_norm.py benchmarks/ops/bench_reduce_multidim.py` → no matches
- `bench_ada_layer_norm.py` imports `load_workloads`; `bench_reduce_multidim.py` uses `ManifestBenchmark` for all six fixtures

Regression
Schema check on `profile_run.log`:

- `bench_ada_layer_norm.py`: `m,n,dtype,latency_ms,tflops,bandwidth_tbs,config`
- `bench_reduce_multidim.py`: `shape,keepdim,dtype,op_kind` (reduce/logical/vector_norm); `shape,dim,keepdim,dtype,op_kind` (argreduce); `shape,dtype,op_kind` (cumulative); `shape,keepdim,dtype` (logsumexp)