[Refactor][MANIFEST] unify per-element flop convention for activation/clamp by lcy-seso · Pull Request #1389 · tile-ai/TileOPs

lcy-seso · 2026-05-08T21:54:28Z

Closes #1379

Summary

- Document a single per-element FLOP convention in `docs/design/roofline.md` (Convention subsection, dated) covering arithmetic, transcendental, compare-and-select, and clamp ops.
- Apply the convention to activation helpers (relu, leaky_relu, prelu, gelu, silu, swish, tanh, sigmoid, hardtanh) and the clamp/min-max family (clamp_scalar, clamp_min, clamp_max, maximum, minimum) in `tileops/perf/formulas.py` and the corresponding manifest entries.
- Each helper now carries a one-line `# FLOPs: ...` derivation comment; the clamp/min-max family returns the same per-element FLOP count.
- Add `scripts/perf/flop_convention_delta.py` plus checked-in artifacts under `docs/perf/` so the before/after delta is reproducible without a GPU.
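The convention can be read as a simple counting rule. The sketch below is illustrative only: the `ElementwiseOpCount` class and `total_flops` helper are hypothetical names, not part of TileOPs, and the per-op counts follow the commit messages in this PR (each arithmetic op, transcendental call, or compare-and-select counts as 1 FLOP per output element, and a two-sided clamp counts once).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementwiseOpCount:
    """Per-output-element operation counts for one elementwise op (hypothetical helper)."""

    arithmetic: int = 0       # add, sub, mul, div, ...
    transcendental: int = 0   # exp, tanh, erf, ...
    compare_select: int = 0   # max/min/where; a two-sided clamp counts once

    def flops_per_elem(self) -> int:
        # Every counted operation contributes exactly 1 FLOP per output element.
        return self.arithmetic + self.transcendental + self.compare_select


def total_flops(op: ElementwiseOpCount, num_elements: int) -> int:
    """Total FLOPs for an op applied to num_elements output elements."""
    return op.flops_per_elem() * num_elements


# Counts taken from the PR's commit messages.
RELU = ElementwiseOpCount(compare_select=1)                    # N
LEAKY_RELU = ElementwiseOpCount(arithmetic=1, compare_select=1)  # 2 * N
HARDTANH = ElementwiseOpCount(compare_select=1)                # two-sided clamp, N
TANH = ElementwiseOpCount(transcendental=1)                    # N

relu_flops = total_flops(RELU, 2048 * 4096)
```

Under this reading, `hardtanh` and `relu` land on the same per-element count, which is the unification the Summary describes.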

Test plan

- AC-1: `pytest tests/perf/ tests/test_validate_manifest.py` (288 passed).
- AC-2: `docs/design/roofline.md` Convention subsection present and dated.
- AC-3: each activation helper carries a `# FLOPs: ...` derivation comment.
- AC-4: hardtanh / clamp_scalar / clamp_min / clamp_max / maximum / minimum return identical per-element FLOP counts.
- AC-5: before/after roofline-utilization table for ≥3 representative benches included below.
- AC-6: external tracker flip (out of scope).

Benchmark

Per-element FLOP convention — before/after delta

Formula-only evaluation (no GPU required). Reproduce with:

python scripts/perf/flop_convention_delta.py \
  --out docs/perf/flop_convention_delta.csv

| family | op | label | shape | dtype | flops before | flops after | flops delta | bytes before | bytes after |
|---|---|---|---|---|---:|---:|---:|---:|---:|
| activation | ReluFwdOp | hidden-state-prefill | 2048x4096 | float16 | 16,777,216 | 8,388,608 | -8,388,608 | 33,554,432 | 33,554,432 |
| clamp | HardtanhFwdOp | hidden-state-prefill | 2048x4096 | float16 | 33,554,432 | 8,388,608 | -25,165,824 | 33,554,432 | 33,554,432 |
| min-max | ClampFwdOp | elementwise-16M | 4096x4096 | float16 | 33,554,432 | 16,777,216 | -16,777,216 | 134,217,728 | 134,217,728 |

flops before reflects the coefficients on upstream/testbed immediately before the convention commit. flops after is evaluated from the manifest formulas (ReluFwdOp / HardtanhFwdOp) or clamp_fwd_roofline (ClampFwdOp) on the current checkout. Byte counts are unchanged by the convention. For these elementwise workloads memory_time already dominates, so the FLOP-coefficient reduction shifts each workload further into the memory-bound regime without changing predicted achievable bandwidth.
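As a sanity check, the ReluFwdOp row can be reproduced by hand. This is a minimal sketch, not the repro script: `elementwise_roofline` and `bound` are hypothetical helpers, the "tensors moved" count of 2 (one read, one write) is inferred from the byte column rather than taken from the manifest, and the peak FLOP/s and bandwidth figures are placeholders, not measurements of any particular GPU.

```python
def elementwise_roofline(num_elements, flops_per_elem, tensors_moved, itemsize):
    """Return (flops, bytes) for an elementwise op under a per-element convention."""
    flops = num_elements * flops_per_elem
    nbytes = num_elements * tensors_moved * itemsize
    return flops, nbytes


def bound(flops, nbytes, peak_flops, peak_bandwidth):
    """Classify a workload as memory- or compute-bound on a simple roofline."""
    compute_time = flops / peak_flops
    memory_time = nbytes / peak_bandwidth
    return "memory" if memory_time >= compute_time else "compute"


# Placeholder hardware numbers (assumptions for illustration only).
PEAK_FLOPS = 300e12      # FLOP/s
PEAK_BANDWIDTH = 2e12    # bytes/s

n = 2048 * 4096  # hidden-state-prefill shape, float16 (itemsize 2)
relu_before = elementwise_roofline(n, flops_per_elem=2, tensors_moved=2, itemsize=2)
relu_after = elementwise_roofline(n, flops_per_elem=1, tensors_moved=2, itemsize=2)
relu_delta = relu_after[0] - relu_before[0]
relu_bound = bound(*relu_after, PEAK_FLOPS, PEAK_BANDWIDTH)
```

With these placeholder peaks the memory term exceeds the compute term by several orders of magnitude both before and after the change, which is consistent with the statement that only the FLOP column moves.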

See docs/perf/flop_convention_delta.md and docs/perf/flop_convention_delta.csv.

lcy-seso · 2026-05-08T21:55:01Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a standardized per-element FLOP convention for elementwise operations, as documented in the new section 1.3 of docs/design/roofline.md. The changes include updating FLOP formulas across several manifest files (elementwise_binary.yaml, elementwise_unary_activation.yaml, elementwise_unary_math.yaml) and the formulas.py script to align with this convention. Additionally, a new script and documentation have been added to track and reproduce the delta in FLOP counts. Feedback includes a suggestion to expand the namespace of the expression evaluator in the new script to support more math helpers and a note on the brittleness of hardcoded formulas in the test script.
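To make the evaluator suggestion concrete, here is one way a formula-string evaluator with an explicit helper namespace could look. This is a sketch under assumptions: `eval_formula`, `_ALLOWED_HELPERS`, and the variable names `N` and `shape` are hypothetical and are not taken from `flop_convention_delta.py`. Emptying `__builtins__` only narrows what an expression can reach by name; it is not a security sandbox.

```python
import math

# Helpers a manifest formula string is allowed to call (illustrative whitelist).
_ALLOWED_HELPERS = {
    "min": min,
    "max": max,
    "abs": abs,
    "ceil": math.ceil,
    "floor": math.floor,
    "log2": math.log2,
    "sqrt": math.sqrt,
    "prod": math.prod,
}


def eval_formula(expr, variables):
    """Evaluate a formula string against whitelisted helpers plus caller variables."""
    # Variables are merged last so a caller can shadow a helper on purpose.
    namespace = {"__builtins__": {}, **_ALLOWED_HELPERS, **variables}
    return eval(expr, namespace)


relu_flops = eval_formula("1 * N", {"N": 2048 * 4096})
numel = eval_formula("prod(shape)", {"shape": (2048, 4096)})
```

Keeping the whitelist in one dictionary means adding a math helper is a one-line change, which is the kind of extension the review feedback asks for.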

Copilot

Pull request overview

This PR standardizes the per-element FLOP-counting convention used by TileOPs roofline metadata, documents it in docs/design/roofline.md, and applies the new convention to activation and clamp/min-max roofline formulas across manifests and perf helpers. It also adds a small repro script + checked-in artifacts to make the before/after FLOP deltas reviewable without a GPU.

Changes:

Document a single per-element FLOP convention for elementwise ops (arithmetic/transcendentals/compare-and-select/clamp).
Update manifest roofline FLOP coefficients for activations (e.g., ReLU/GELU/SiLU/Hard* family) and clamp/min-max to match the convention.
Update clamp_fwd_roofline and add a formula-only delta script plus docs/perf/ artifacts.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `tileops/perf/formulas.py` | Adjusts tensor-bound clamp roofline FLOPs to the new compare-and-select convention. |
| `tileops/manifest/elementwise_unary_math.yaml` | Updates sigmoid/tanh roofline FLOPs and derivation comments. |
| `tileops/manifest/elementwise_unary_activation.yaml` | Updates activation + clamp family roofline FLOPs and derivation comments. |
| `tileops/manifest/elementwise_binary.yaml` | Updates PReLU roofline FLOPs and derivation comments. |
| `scripts/perf/flop_convention_delta.py` | Adds a repro script to generate before/after FLOP/byte deltas from manifest formulas. |
| `docs/perf/flop_convention_delta.md` | Adds the human-readable writeup for the checked-in delta table. |
| `docs/perf/flop_convention_delta.csv` | Adds the checked-in delta table artifact. |
| `docs/design/roofline.md` | Adds the dated convention section defining the new FLOP-counting rules. |

gemini-code-assist

Code Review

This pull request establishes a standardized FLOP counting convention for elementwise operations, documented in docs/design/roofline.md. The changes include updating FLOP formulas across various manifest files (activation, binary, and math ops) and the clamp_fwd_roofline formula to align with the new rules. Additionally, a reproducibility script and documentation were added to demonstrate the impact on FLOP counts for representative workloads. Feedback was provided to improve the flop_convention_delta.py script by replacing hardcoded values with formula evaluation using mock objects to ensure consistency with the script's stated purpose.

lcy-seso · 2026-05-08T22:18:31Z

Fixed runtime FLOP drift in 9f16392 — aligned ReluFwdOp/GeluFwdOp/SiluFwdOp/TanhFwdOp/HardtanhFwdOp (and Hardswish/Hardsigmoid/Mish/LeakyRelu/Elu/Softplus) FLOPS_PER_ELEM to the new per-element convention (docs/design/roofline.md §1.3). UnaryOp.FLOPS_PER_ELEM and eval_roofline() docstrings refreshed. tests/perf/, tests/ops/test_activation.py, tests/ops/test_unary_math.py, tests/ops/test_elementwise_unary_activation_alignment.py, tests/ops/test_special_elementwise*.py, tests/test_validate_manifest.py all pass.

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

lcy-seso · 2026-05-08T22:37:18Z

Nightshift orchestrator note: PR ready (Reviewer PASS T025, all bot threads triaged). Same gate block as #1387/#1388: gpu-smoke fails only on the pre-existing testbed assertion "AddFwdOp must have at least 2 workloads". Once the fix for that assertion lands on testbed, this branch can be rebased and auto-merged.

Pipeline rounds: 4. Scope: docs/design/roofline.md §1.3 convention + activation/clamp/min-max manifest constants + runtime FLOPS_PER_ELEM in tileops/ops/elementwise.py + before/after delta artifact under docs/perf/.

…e ops (#1392)

## Summary

Fix gpu-smoke `test_every_op_has_at_least_two_workloads` failure by populating empty `workloads:` lists on 21 implemented binary elementwise ops (`AddFwdOp`, `SubFwdOp`, `MulFwdOp`, `DivFwdOp`, `RemainderFwdOp`, `PowFwdOp`, `FloorDivideFwdOp`, `LerpFwdOp`, `MaximumFwdOp`, `MinimumFwdOp`, `Eq/Ne/Gt/Lt/Ge/LeFwdOp`, `LogicalAnd/OrFwdOp`, `BitwiseAnd/Or/XorFwdOp`). This pre-existing testbed CI break has been blocking nightshift auto-merge for PRs #1387, #1388, #1389, #1391. Filed as issue #1390; this PR resolves it.

## Plan executed

Each affected op now declares two representative workloads:

- **LLM hidden-state prefill**: `input/other [2048, 4096]` (no broadcast)
- **CNN feature map**: `input [16, 256, 56, 56]`, `other [256, 1, 1]` (channel-wise broadcast)

Dtype matrix matches each op's signature:

- Float-only ops (`Div`, `Pow`, `Lerp`, `Remainder`, `FloorDivide`): `float16, bfloat16`
- Bitwise ops (`BitwiseAnd/Or/Xor`): `int32, int64`
- Logical ops (`LogicalAnd/Or`): `bool`
- Default (other arithmetic + comparison): `float16, bfloat16`

## Test plan

- [x] `pytest tests/test_ops_manifest.py::TestOpSchema::test_every_op_has_at_least_two_workloads`: passes
- [x] `pytest tests/test_validate_manifest.py tests/test_ops_manifest.py`: 239 passed
- [x] `python scripts/validate_manifest.py`: "All manifest checks passed"
- [x] Pre-commit clean

## Acceptance Criteria

Closes #1390.

- AC-1: ✅ Modified files pass unit tests.
- AC-2: ✅ `AddFwdOp.workloads` (and 20 siblings) now have 2 entries each, satisfying signature + broadcasting rules.
- AC-3: ✅ `scripts/validate_manifest.py` exits 0 with no new warnings on the affected ops.
- AC-4: To be verified by gpu-smoke on this PR's run.

## Trust model

Manifest-only PR. No `signature`, `roofline`, `params`, `shape_rules`, or `dtype_combos` edits. Falls within the trust-model rule that workload edits without a status flip require a separate manifest-only PR with human review; this is that PR.

---------

Co-authored-by: Ibuki 🍃 — a wind born from GPTs <[email protected]>
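The two workload shapes in that commit can be checked against the usual right-aligned broadcasting rule. The sketch below is a plain-Python illustration; `broadcast_shape` is a hypothetical helper written for this example and is not how the manifest validator is implemented.

```python
from itertools import zip_longest
from math import prod


def broadcast_shape(a, b):
    """Right-aligned broadcast of two shapes; raises ValueError if incompatible."""
    out = []
    # Walk dimensions from the trailing axis, padding the shorter shape with 1s.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


# LLM hidden-state prefill: identical shapes, no broadcast.
prefill_out = broadcast_shape((2048, 4096), (2048, 4096))

# CNN feature map: `other` broadcasts channel-wise over batch and spatial axes.
cnn_out = broadcast_shape((16, 256, 56, 56), (256, 1, 1))
cnn_numel = prod(cnn_out)
```

The output element count, not the size of the smaller operand, is what a per-output-element FLOP formula multiplies against.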

Ibuki-wind

Overall

One reproducibility path still diverges from the shared helper, so the artifact needs another pass before merge.

ClampFwdOp delta row now routes through tileops.perf.formulas.clamp_fwd_roofline (commit 5f2c002). Output unchanged: flops 33554432→16777216, bytes 134217728→134217728.

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 11 comments.

Ibuki-wind

Clean — no issues.

… activations and clamp family

Add roofline.md §1.3 (Convention) naming the per-element FLOP rule: arithmetic ops, transcendentals, and compare-and-select each count as 1 FLOP per output element. Two-sided clamp also counts as 1 FLOP per element under this convention.

Audit each activation manifest entry (gelu, silu, sigmoid, tanh, elu, selu, softplus, mish, hardsigmoid, hardswish) and replace ad-hoc constants with derivations matching the convention. Each entry now carries a one-line FLOPs derivation comment.

Unify the clamp / min-max family (hardtanh, clamp_scalar, clamp_min, clamp_max, maximum, minimum) to 1 FLOP per output element. Removes the 4*N figures previously carried on hardtanh and clamp_scalar.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>

…nvention

Round 2 of the per-element FLOP convention rollout. ReLU is one compare-and-select per element under roofline.md §1.3, so collapse ReluFwdOp to N. LeakyReluFwdOp and PreluFwdOp are compare-and-select plus one mul on the negative branch, so collapse to 2 * N. Tensor-bound ClampFwdOp's func helper now mirrors the YAML clamp family at 1 FLOP per output element.

Add scripts/perf/flop_convention_delta.py and the matching docs/perf/ report so the before/after FLOP table required by AC-5 is reproducible from a clean checkout without GPU access.

Update runtime FLOPS_PER_ELEM constants in tileops/ops/elementwise.py to match the per-element FLOP convention defined in docs/design/roofline.md §1.3, so that op.eval_roofline() and benchmark TFLOPs stay manifest-aligned.

Per the convention:
- compare-and-select (single or two-sided clamp) = 1 FLOP/elem
- transcendental call (exp, tanh, erf, ...) = 1 FLOP/elem
- arithmetic op (add, sub, mul, div, ...) = 1 FLOP/elem

Aligned values:
- ReluFwdOp: 2 -> 1
- GeluFwdOp: 8 -> 5
- SiluFwdOp: 4 -> 5
- TanhFwdOp: 5 -> 1
- HardswishFwdOp: 7 -> 4
- HardsigmoidFwdOp: 6 -> 3
- MishFwdOp: 7 -> 4
- LeakyReluFwdOp: 3 -> 2
- EluFwdOp: 5 -> 4
- HardtanhFwdOp: 4 -> 1
- SoftplusFwdOp: 7 -> 5

Also refresh UnaryOp.FLOPS_PER_ELEM / eval_roofline docstrings to reference the convention rather than the pre-convention examples.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>
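The aligned values above can be tabulated to see how each constant change scales with tensor size. This is an illustrative sketch only: the `FLOPS_PER_ELEM_CHANGES` dictionary and `flops_delta` helper are hypothetical, with the before/after numbers copied from the commit message rather than read from `tileops/ops/elementwise.py`.

```python
# (before, after) FLOPs per element, copied from the commit message above.
FLOPS_PER_ELEM_CHANGES = {
    "ReluFwdOp": (2, 1),
    "GeluFwdOp": (8, 5),
    "SiluFwdOp": (4, 5),
    "TanhFwdOp": (5, 1),
    "HardswishFwdOp": (7, 4),
    "HardsigmoidFwdOp": (6, 3),
    "MishFwdOp": (7, 4),
    "LeakyReluFwdOp": (3, 2),
    "EluFwdOp": (5, 4),
    "HardtanhFwdOp": (4, 1),
    "SoftplusFwdOp": (7, 5),
}


def flops_delta(op_name, num_elements):
    """Change in total FLOPs (after minus before) for one op at a given size."""
    before, after = FLOPS_PER_ELEM_CHANGES[op_name]
    return (after - before) * num_elements


n = 2048 * 4096  # hidden-state-prefill workload from the benchmark table
hardtanh_delta = flops_delta("HardtanhFwdOp", n)
relu_delta = flops_delta("ReluFwdOp", n)

# Ops whose per-element count went up rather than down.
increased = [op for op, (b, a) in FLOPS_PER_ELEM_CHANGES.items() if a > b]
```

The two deltas computed here match the `flops delta` column of the benchmark table for the same shape, and in this list SiluFwdOp is the only op whose count rises.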

Address review: _row_clamp_tensor now calls tileops.perf.formulas.clamp_fwd_roofline with a minimal SimpleNamespace stub op (exposing N_total and dtype.itemsize) so the after-column tracks the helper as the source of truth and cannot drift. Verified output unchanged: flops 33554432 -> 16777216, bytes 134217728 -> 134217728.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>
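The stub-op technique described in that commit can be sketched as follows. Everything here is an assumption for illustration: `clamp_fwd_roofline_standin` is a hypothetical stand-in, not the real `tileops.perf.formulas.clamp_fwd_roofline`, whose signature and return type are not shown in this PR. Its body (1 FLOP per element, four tensors of traffic for input, min, max, and output) is inferred from the ClampFwdOp row of the benchmark table.

```python
from types import SimpleNamespace


def clamp_fwd_roofline_standin(op):
    """Hypothetical stand-in for the shared clamp roofline helper.

    Reads only op.N_total and op.dtype.itemsize, like the stub in the commit.
    """
    flops = op.N_total                          # one compare-and-select per element
    nbytes = 4 * op.N_total * op.dtype.itemsize  # input + min + max + output (inferred)
    return flops, nbytes


# Minimal stub op: just the attributes the helper touches, no GPU or real Op class.
stub_op = SimpleNamespace(
    N_total=4096 * 4096,
    dtype=SimpleNamespace(itemsize=2),  # float16
)

flops_after, bytes_after = clamp_fwd_roofline_standin(stub_op)
```

Routing the script through the shared helper via a stub means the reported "after" value changes whenever the helper changes, instead of relying on a separately hardcoded constant.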

Address review: continuation lines on multi-line FLOPs derivation comments were over-indented (8 spaces vs 4 for class attributes). Dedent to match surrounding class-level indentation.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 8, 2026 21:54

lcy-seso added labels May 8, 2026: refactor (Code restructuring without behavior change), automated (PR produced by an autonomous agent pipeline), needs-review (Awaiting human review before merge), nightshift (Pickable by foundry nightshift mode: fully autonomous agent development, no human approval gate)

Copilot started reviewing on behalf of lcy-seso May 8, 2026 21:55

gemini-code-assist Bot reviewed May 8, 2026

Comment thread scripts/perf/flop_convention_delta.py

Comment thread scripts/perf/flop_convention_delta.py

Copilot AI reviewed May 8, 2026

Comment thread scripts/perf/flop_convention_delta.py

Comment thread tileops/manifest/elementwise_unary_activation.yaml

Comment thread tileops/manifest/elementwise_unary_math.yaml

Comment thread tileops/manifest/elementwise_binary.yaml

gemini-code-assist Bot reviewed May 8, 2026

Comment thread scripts/perf/flop_convention_delta.py

lcy-seso marked this pull request as ready for review May 8, 2026 22:31

Copilot AI review requested due to automatic review settings May 8, 2026 22:31

Copilot started reviewing on behalf of lcy-seso May 8, 2026 22:32

Copilot AI reviewed May 8, 2026

Comment thread scripts/perf/flop_convention_delta.py

Comment thread scripts/perf/flop_convention_delta.py Outdated

Comment thread docs/perf/flop_convention_delta.md

This was referenced May 9, 2026

[BUG][MANIFEST] addfwdop missing workloads — gpu-smoke fails on testbed #1390

Closed

[Maintain][Manifest] add at least 2 workloads to 21 binary elementwise ops #1392

Merged

lcy-seso force-pushed the refactor/manifest/issue-1379 branch from 9f16392 to 90d3665 Compare May 9, 2026 04:35

Ibuki-wind previously requested changes May 9, 2026

Comment thread scripts/perf/flop_convention_delta.py Outdated

Copilot AI review requested due to automatic review settings May 9, 2026 04:51

Copilot started reviewing on behalf of lcy-seso May 9, 2026 04:51

Copilot AI reviewed May 9, 2026

Ibuki-wind approved these changes May 9, 2026

lcy-seso and others added 4 commits May 9, 2026 13:18

Copilot AI review requested due to automatic review settings May 9, 2026 05:18

lcy-seso force-pushed the refactor/manifest/issue-1379 branch from 2f49527 to ab99ac2 Compare May 9, 2026 05:18

Copilot started reviewing on behalf of lcy-seso May 9, 2026 05:19

Copilot AI reviewed May 9, 2026

lcy-seso merged commit 666d243 into tile-ai:testbed May 9, 2026
15 checks passed

This was referenced May 9, 2026

[REFACTOR][MANIFEST] audit FLOP convention for activation + clamp families (W3-04) #1379

Closed

[META][OPS] Align elementwise / reduction / normalization ops to new Op-layer design + PyTorch API #1142

Closed

3 participants