[Refactor][MANIFEST] unify per-element flop convention for activation/clamp#1389
Conversation
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a standardized per-element FLOP convention for elementwise operations, as documented in the new section 1.3 of docs/design/roofline.md. The changes include updating FLOP formulas across several manifest files (elementwise_binary.yaml, elementwise_unary_activation.yaml, elementwise_unary_math.yaml) and the formulas.py script to align with this convention. Additionally, a new script and documentation have been added to track and reproduce the delta in FLOP counts. Feedback includes a suggestion to expand the namespace of the expression evaluator in the new script to support more math helpers and a note on the brittleness of hardcoded formulas in the test script.
There was a problem hiding this comment.
Pull request overview
This PR standardizes the per-element FLOP-counting convention used by TileOPs roofline metadata, documents it in docs/design/roofline.md, and applies the new convention to activation and clamp/min-max roofline formulas across manifests and perf helpers. It also adds a small repro script + checked-in artifacts to make the before/after FLOP deltas reviewable without a GPU.
Changes:
- Document a single per-element FLOP convention for elementwise ops (arithmetic/transcendentals/compare-and-select/clamp).
- Update manifest roofline FLOP coefficients for activations (e.g., ReLU/GELU/SiLU/Hard* family) and clamp/min-max to match the convention.
- Update
clamp_fwd_rooflineand add a formula-only delta script plusdocs/perf/artifacts.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tileops/perf/formulas.py | Adjusts tensor-bound clamp roofline FLOPs to the new compare-and-select convention. |
| tileops/manifest/elementwise_unary_math.yaml | Updates sigmoid/tanh roofline FLOPs and derivation comments. |
| tileops/manifest/elementwise_unary_activation.yaml | Updates activation + clamp family roofline FLOPs and derivation comments. |
| tileops/manifest/elementwise_binary.yaml | Updates PReLU roofline FLOPs and derivation comments. |
| scripts/perf/flop_convention_delta.py | Adds a repro script to generate before/after FLOP/byte deltas from manifest formulas. |
| docs/perf/flop_convention_delta.md | Adds the human-readable writeup for the checked-in delta table. |
| docs/perf/flop_convention_delta.csv | Adds the checked-in delta table artifact. |
| docs/design/roofline.md | Adds the dated convention section defining the new FLOP-counting rules. |
There was a problem hiding this comment.
Code Review
This pull request establishes a standardized FLOP counting convention for elementwise operations, documented in docs/design/roofline.md. The changes include updating FLOP formulas across various manifest files (activation, binary, and math ops) and the clamp_fwd_roofline formula to align with the new rules. Additionally, a reproducibility script and documentation were added to demonstrate the impact on FLOP counts for representative workloads. Feedback was provided to improve the flop_convention_delta.py script by replacing hardcoded values with formula evaluation using mock objects to ensure consistency with the script's stated purpose.
|
Fixed runtime FLOP drift in 9f16392 — aligned ReluFwdOp/GeluFwdOp/SiluFwdOp/TanhFwdOp/HardtanhFwdOp (and Hardswish/Hardsigmoid/Mish/LeakyRelu/Elu/Softplus) FLOPS_PER_ELEM to the new per-element convention (docs/design/roofline.md §1.3). UnaryOp.FLOPS_PER_ELEM and eval_roofline() docstrings refreshed. tests/perf/, tests/ops/test_activation.py, tests/ops/test_unary_math.py, tests/ops/test_elementwise_unary_activation_alignment.py, tests/ops/test_special_elementwise*.py, tests/test_validate_manifest.py all pass. |
|
Nightshift orchestrator note: PR ready (Reviewer PASS T025, all bot threads triaged). Same gate-block as #1387/#1388 — gpu-smoke fails only on pre-existing testbed assertion Pipeline rounds: 4. Scope: docs/design/roofline.md §1.3 convention + activation/clamp/min-max manifest constants + runtime FLOPS_PER_ELEM in tileops/ops/elementwise.py + before/after delta artifact under docs/perf/. |
…e ops (#1392) ## Summary Fix gpu-smoke `test_every_op_has_at_least_two_workloads` failure by populating empty `workloads:` lists on 21 implemented binary elementwise ops (\`AddFwdOp\`, \`SubFwdOp\`, \`MulFwdOp\`, \`DivFwdOp\`, \`RemainderFwdOp\`, \`PowFwdOp\`, \`FloorDivideFwdOp\`, \`LerpFwdOp\`, \`MaximumFwdOp\`, \`MinimumFwdOp\`, \`Eq/Ne/Gt/Lt/Ge/LeFwdOp\`, \`LogicalAnd/OrFwdOp\`, \`BitwiseAnd/Or/XorFwdOp\`). This pre-existing testbed CI break has been blocking nightshift auto-merge for PRs #1387, #1388, #1389, #1391. Filed as issue #1390 — this PR resolves it. ## Plan executed Each affected op now declares two representative workloads: - **LLM hidden-state prefill**: \`input/other [2048, 4096]\` (no broadcast) - **CNN feature map**: \`input [16, 256, 56, 56]\`, \`other [256, 1, 1]\` (channel-wise broadcast) Dtype matrix matches each op's signature: - Float-only ops (\`Div\`, \`Pow\`, \`Lerp\`, \`Remainder\`, \`FloorDivide\`): \`float16, bfloat16\` - Bitwise ops (\`BitwiseAnd/Or/Xor\`): \`int32, int64\` - Logical ops (\`LogicalAnd/Or\`): \`bool\` - Default (other arithmetic + comparison): \`float16, bfloat16\` ## Test plan - [x] \`pytest tests/test_ops_manifest.py::TestOpSchema::test_every_op_has_at_least_two_workloads\` — passes - [x] \`pytest tests/test_validate_manifest.py tests/test_ops_manifest.py\` — 239 passed - [x] \`python scripts/validate_manifest.py\` — \"All manifest checks passed\" - [x] Pre-commit clean ## Acceptance Criteria Closes #1390. - AC-1: ✅ Modified files pass unit tests. - AC-2: ✅ \`AddFwdOp.workloads\` (and 20 siblings) now have 2 entries each, satisfying signature + broadcasting rules. - AC-3: ✅ \`scripts/validate_manifest.py\` exits 0 with no new warnings on the affected ops. - AC-4: To be verified by gpu-smoke on this PR's run. ## Trust model Manifest-only PR. No \`signature\`, \`roofline\`, \`params\`, \`shape_rules\`, or \`dtype_combos\` edits. Falls within the trust-model rule that workload edits without a status flip require a separate manifest-only PR with human review — this is that PR. --------- Co-authored-by: Ibuki 🍃 — a wind born from GPTs <[email protected]>
9f16392 to
90d3665
Compare
Ibuki-wind
left a comment
There was a problem hiding this comment.
Overall
One reproducibility path still diverges from the shared helper, so the artifact needs another pass before merge.
ClampFwdOp delta row now routes through tileops.perf.formulas.clamp_fwd_roofline (commit 5f2c002). Output unchanged: flops 33554432→16777216, bytes 134217728→134217728.
… activations and clamp family Add roofline.md §1.3 (Convention) naming the per-element FLOP rule: arithmetic ops, transcendentals, and compare-and-select each count as 1 FLOP per output element. Two-sided clamp also counts as 1 FLOP per element under this convention. Audit each activation manifest entry (gelu, silu, sigmoid, tanh, elu, selu, softplus, mish, hardsigmoid, hardswish) and replace ad-hoc constants with derivations matching the convention. Each entry now carries a one-line FLOPs derivation comment. Unify the clamp / min-max family — hardtanh, clamp_scalar, clamp_min, clamp_max, maximum, minimum — to 1 FLOP per output element. Removes the 4*N figures previously carried on hardtanh and clamp_scalar. Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>
…nvention Round 2 of the per-element FLOP convention rollout. ReLU is one compare-and-select per element under roofline.md §1.3, so collapse ReluFwdOp to N. LeakyReluFwdOp and PreluFwdOp are compare-and-select plus one mul on the negative branch, so collapse to 2 * N. Tensor-bound ClampFwdOp's func helper now mirrors the YAML clamp family at 1 FLOP per output element. Add scripts/perf/flop_convention_delta.py and the matching docs/perf/ report so the before/after FLOP table required by AC-5 is reproducible from a clean checkout without GPU access.
Update runtime FLOPS_PER_ELEM constants in tileops/ops/elementwise.py to match the per-element FLOP convention defined in docs/design/roofline.md §1.3, so that op.eval_roofline() and benchmark TFLOPs stay manifest-aligned. Per the convention: - compare-and-select (single or two-sided clamp) = 1 FLOP/elem - transcendental call (exp, tanh, erf, ...) = 1 FLOP/elem - arithmetic op (add, sub, mul, div, ...) = 1 FLOP/elem Aligned values: - ReluFwdOp: 2 -> 1 - GeluFwdOp: 8 -> 5 - SiluFwdOp: 4 -> 5 - TanhFwdOp: 5 -> 1 - HardswishFwdOp: 7 -> 4 - HardsigmoidFwdOp: 6 -> 3 - MishFwdOp: 7 -> 4 - LeakyReluFwdOp: 3 -> 2 - EluFwdOp: 5 -> 4 - HardtanhFwdOp: 4 -> 1 - SoftplusFwdOp: 7 -> 5 Also refresh UnaryOp.FLOPS_PER_ELEM / eval_roofline docstrings to reference the convention rather than the pre-convention examples. Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>
Address review: _row_clamp_tensor now calls tileops.perf.formulas.clamp_fwd_roofline with a minimal SimpleNamespace stub op (exposing N_total and dtype.itemsize) so the after-column tracks the helper as the source of truth and cannot drift. Verified output unchanged: flops 33554432 -> 16777216, bytes 134217728 -> 134217728. Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>
Address review: continuation lines on multi-line FLOPs derivation comments were over-indented (8 spaces vs 4 for class attributes). Dedent to match surrounding class-level indentation. Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>
2f49527 to
ab99ac2
Compare
Closes #1379
Summary
docs/design/roofline.md(Convention subsection, dated) covering arithmetic, transcendental, compare-and-select, and clamp ops.relu,leaky_relu,prelu,gelu,silu,swish,tanh,sigmoid,hardtanh) and the clamp/min-max family (clamp_scalar,clamp_min,clamp_max,maximum,minimum) intileops/perf/formulas.pyand the corresponding manifest entries.# FLOPs: ...derivation comment; the clamp/min-max family returns the same per-element FLOP count.scripts/perf/flop_convention_delta.pyplus checked-in artifacts underdocs/perf/so the before/after delta is reproducible without a GPU.Test plan
pytest tests/perf/ tests/test_validate_manifest.py— 288 passed.docs/design/roofline.mdConvention subsection present and dated.# FLOPs: ...derivation comment.hardtanh/clamp_scalar/clamp_min/clamp_max/maximum/minimumreturn identical per-element FLOP counts.Benchmark
Per-element FLOP convention — before/after delta
Formula-only evaluation (no GPU required). Reproduce with:
flops beforereflects the coefficients onupstream/testbedimmediately before the convention commit.flops afteris evaluated from the manifest formulas (ReluFwdOp/HardtanhFwdOp) orclamp_fwd_roofline(ClampFwdOp) on the current checkout. Byte counts are unchanged by the convention. For these elementwise workloadsmemory_timealready dominates, so the FLOP-coefficient reduction shifts each workload further into the memory-bound regime without changing predicted achievable bandwidth.See docs/perf/flop_convention_delta.md and docs/perf/flop_convention_delta.csv.