Skip to content

[Refactor][MANIFEST] unify per-element flop convention for activation/clamp#1389

Merged
lcy-seso merged 5 commits intotile-ai:testbedfrom
lcy-seso:refactor/manifest/issue-1379
May 9, 2026
Merged

[Refactor][MANIFEST] unify per-element flop convention for activation/clamp#1389
lcy-seso merged 5 commits intotile-ai:testbedfrom
lcy-seso:refactor/manifest/issue-1379

Conversation

@lcy-seso
Copy link
Copy Markdown
Collaborator

@lcy-seso lcy-seso commented May 8, 2026

Closes #1379

Summary

  • Document a single per-element FLOP convention in docs/design/roofline.md (Convention subsection, dated) covering arithmetic, transcendental, compare-and-select, and clamp ops.
  • Apply the convention to activation helpers (relu, leaky_relu, prelu, gelu, silu, swish, tanh, sigmoid, hardtanh) and the clamp/min-max family (clamp_scalar, clamp_min, clamp_max, maximum, minimum) in tileops/perf/formulas.py and the corresponding manifest entries.
  • Each helper now carries a one-line # FLOPs: ... derivation comment; the clamp/min-max family returns the same per-element FLOP count.
  • Add scripts/perf/flop_convention_delta.py plus checked-in artifacts under docs/perf/ so the before/after delta is reproducible without a GPU.

Test plan

  • AC-1: pytest tests/perf/ tests/test_validate_manifest.py — 288 passed.
  • AC-2: docs/design/roofline.md Convention subsection present and dated.
  • AC-3: each activation helper carries a # FLOPs: ... derivation comment.
  • AC-4: hardtanh / clamp_scalar / clamp_min / clamp_max / maximum / minimum return identical per-element FLOP counts.
  • AC-5: before/after roofline-utilization table for ≥3 representative benches included below.
  • AC-6: external tracker flip — out of scope.

Benchmark

Per-element FLOP convention — before/after delta

Formula-only evaluation (no GPU required). Reproduce with:

python scripts/perf/flop_convention_delta.py \
  --out docs/perf/flop_convention_delta.csv
family op label shape dtype flops before flops after flops delta bytes before bytes after bytes delta
activation ReluFwdOp hidden-state-prefill 2048x4096 float16 16,777,216 8,388,608 -8,388,608 33,554,432 33,554,432 0
clamp HardtanhFwdOp hidden-state-prefill 2048x4096 float16 33,554,432 8,388,608 -25,165,824 33,554,432 33,554,432 0
min-max ClampFwdOp elementwise-16M 4096x4096 float16 33,554,432 16,777,216 -16,777,216 134,217,728 134,217,728 0

flops before reflects the coefficients on upstream/testbed immediately before the convention commit. flops after is evaluated from the manifest formulas (ReluFwdOp / HardtanhFwdOp) or clamp_fwd_roofline (ClampFwdOp) on the current checkout. Byte counts are unchanged by the convention. For these elementwise workloads memory_time already dominates, so the FLOP-coefficient reduction shifts each workload further into the memory-bound regime without changing predicted achievable bandwidth.

See docs/perf/flop_convention_delta.md and docs/perf/flop_convention_delta.csv.

Copilot AI review requested due to automatic review settings May 8, 2026 21:54
@lcy-seso lcy-seso added refactor Code restructuring without behavior change automated PR produced by an autonomous agent pipeline needs-review Awaiting human review before merge nightshift Pickable by foundry nightshift mode — fully autonomous agent development, no human approval gate labels May 8, 2026
@lcy-seso
Copy link
Copy Markdown
Collaborator Author

lcy-seso commented May 8, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a standardized per-element FLOP convention for elementwise operations, as documented in the new section 1.3 of docs/design/roofline.md. The changes include updating FLOP formulas across several manifest files (elementwise_binary.yaml, elementwise_unary_activation.yaml, elementwise_unary_math.yaml) and the formulas.py script to align with this convention. Additionally, a new script and documentation have been added to track and reproduce the delta in FLOP counts. Feedback includes a suggestion to expand the namespace of the expression evaluator in the new script to support more math helpers and a note on the brittleness of hardcoded formulas in the test script.

Comment thread scripts/perf/flop_convention_delta.py
Comment thread scripts/perf/flop_convention_delta.py
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR standardizes the per-element FLOP-counting convention used by TileOPs roofline metadata, documents it in docs/design/roofline.md, and applies the new convention to activation and clamp/min-max roofline formulas across manifests and perf helpers. It also adds a small repro script + checked-in artifacts to make the before/after FLOP deltas reviewable without a GPU.

Changes:

  • Document a single per-element FLOP convention for elementwise ops (arithmetic/transcendentals/compare-and-select/clamp).
  • Update manifest roofline FLOP coefficients for activations (e.g., ReLU/GELU/SiLU/Hard* family) and clamp/min-max to match the convention.
  • Update clamp_fwd_roofline and add a formula-only delta script plus docs/perf/ artifacts.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tileops/perf/formulas.py Adjusts tensor-bound clamp roofline FLOPs to the new compare-and-select convention.
tileops/manifest/elementwise_unary_math.yaml Updates sigmoid/tanh roofline FLOPs and derivation comments.
tileops/manifest/elementwise_unary_activation.yaml Updates activation + clamp family roofline FLOPs and derivation comments.
tileops/manifest/elementwise_binary.yaml Updates PReLU roofline FLOPs and derivation comments.
scripts/perf/flop_convention_delta.py Adds a repro script to generate before/after FLOP/byte deltas from manifest formulas.
docs/perf/flop_convention_delta.md Adds the human-readable writeup for the checked-in delta table.
docs/perf/flop_convention_delta.csv Adds the checked-in delta table artifact.
docs/design/roofline.md Adds the dated convention section defining the new FLOP-counting rules.

Comment thread scripts/perf/flop_convention_delta.py
Comment thread tileops/manifest/elementwise_unary_activation.yaml
Comment thread tileops/manifest/elementwise_unary_math.yaml
Comment thread tileops/manifest/elementwise_binary.yaml
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request establishes a standardized FLOP counting convention for elementwise operations, documented in docs/design/roofline.md. The changes include updating FLOP formulas across various manifest files (activation, binary, and math ops) and the clamp_fwd_roofline formula to align with the new rules. Additionally, a reproducibility script and documentation were added to demonstrate the impact on FLOP counts for representative workloads. Feedback was provided to improve the flop_convention_delta.py script by replacing hardcoded values with formula evaluation using mock objects to ensure consistency with the script's stated purpose.

Comment thread scripts/perf/flop_convention_delta.py
@lcy-seso
Copy link
Copy Markdown
Collaborator Author

lcy-seso commented May 8, 2026

Fixed runtime FLOP drift in 9f16392 — aligned ReluFwdOp/GeluFwdOp/SiluFwdOp/TanhFwdOp/HardtanhFwdOp (and Hardswish/Hardsigmoid/Mish/LeakyRelu/Elu/Softplus) FLOPS_PER_ELEM to the new per-element convention (docs/design/roofline.md §1.3). UnaryOp.FLOPS_PER_ELEM and eval_roofline() docstrings refreshed. tests/perf/, tests/ops/test_activation.py, tests/ops/test_unary_math.py, tests/ops/test_elementwise_unary_activation_alignment.py, tests/ops/test_special_elementwise*.py, tests/test_validate_manifest.py all pass.

@lcy-seso lcy-seso marked this pull request as ready for review May 8, 2026 22:31
Copilot AI review requested due to automatic review settings May 8, 2026 22:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comment thread scripts/perf/flop_convention_delta.py
Comment thread scripts/perf/flop_convention_delta.py Outdated
Comment thread docs/perf/flop_convention_delta.md
@lcy-seso
Copy link
Copy Markdown
Collaborator Author

lcy-seso commented May 8, 2026

Nightshift orchestrator note: PR ready (Reviewer PASS T025, all bot threads triaged). Same gate-block as #1387/#1388 — gpu-smoke fails only on pre-existing testbed assertion AddFwdOp must have at least 2 workloads. Once that lands on testbed, this branch can be rebased and auto-merged.

Pipeline rounds: 4. Scope: docs/design/roofline.md §1.3 convention + activation/clamp/min-max manifest constants + runtime FLOPS_PER_ELEM in tileops/ops/elementwise.py + before/after delta artifact under docs/perf/.

lcy-seso added a commit that referenced this pull request May 9, 2026
…e ops (#1392)

## Summary

Fix gpu-smoke `test_every_op_has_at_least_two_workloads` failure by
populating empty `workloads:` lists on 21 implemented binary elementwise
ops (\`AddFwdOp\`, \`SubFwdOp\`, \`MulFwdOp\`, \`DivFwdOp\`,
\`RemainderFwdOp\`, \`PowFwdOp\`, \`FloorDivideFwdOp\`, \`LerpFwdOp\`,
\`MaximumFwdOp\`, \`MinimumFwdOp\`, \`Eq/Ne/Gt/Lt/Ge/LeFwdOp\`,
\`LogicalAnd/OrFwdOp\`, \`BitwiseAnd/Or/XorFwdOp\`).

This pre-existing testbed CI break has been blocking nightshift
auto-merge for PRs #1387, #1388, #1389, #1391. Filed as issue #1390 —
this PR resolves it.

## Plan executed

Each affected op now declares two representative workloads:
- **LLM hidden-state prefill**: \`input/other [2048, 4096]\` (no
broadcast)
- **CNN feature map**: \`input [16, 256, 56, 56]\`, \`other [256, 1,
1]\` (channel-wise broadcast)

Dtype matrix matches each op's signature:
- Float-only ops (\`Div\`, \`Pow\`, \`Lerp\`, \`Remainder\`,
\`FloorDivide\`): \`float16, bfloat16\`
- Bitwise ops (\`BitwiseAnd/Or/Xor\`): \`int32, int64\`
- Logical ops (\`LogicalAnd/Or\`): \`bool\`
- Default (other arithmetic + comparison): \`float16, bfloat16\`

## Test plan

- [x] \`pytest
tests/test_ops_manifest.py::TestOpSchema::test_every_op_has_at_least_two_workloads\`
— passes
- [x] \`pytest tests/test_validate_manifest.py
tests/test_ops_manifest.py\` — 239 passed
- [x] \`python scripts/validate_manifest.py\` — \"All manifest checks
passed\"
- [x] Pre-commit clean

## Acceptance Criteria

Closes #1390.

- AC-1: ✅ Modified files pass unit tests.
- AC-2: ✅ \`AddFwdOp.workloads\` (and 20 siblings) now have 2 entries
each, satisfying signature + broadcasting rules.
- AC-3: ✅ \`scripts/validate_manifest.py\` exits 0 with no new warnings
on the affected ops.
- AC-4: To be verified by gpu-smoke on this PR's run.

## Trust model

Manifest-only PR. No \`signature\`, \`roofline\`, \`params\`,
\`shape_rules\`, or \`dtype_combos\` edits. Falls within the trust-model
rule that workload edits without a status flip require a separate
manifest-only PR with human review — this is that PR.

---------

Co-authored-by: Ibuki 🍃 — a wind born from GPTs <[email protected]>
@lcy-seso lcy-seso force-pushed the refactor/manifest/issue-1379 branch from 9f16392 to 90d3665 Compare May 9, 2026 04:35
Copy link
Copy Markdown
Contributor

@Ibuki-wind Ibuki-wind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall

One reproducibility path still diverges from the shared helper, so the artifact needs another pass before merge.

Comment thread scripts/perf/flop_convention_delta.py Outdated
Copilot AI review requested due to automatic review settings May 9, 2026 04:51
@lcy-seso lcy-seso dismissed Ibuki-wind’s stale review May 9, 2026 04:52

ClampFwdOp delta row now routes through tileops.perf.formulas.clamp_fwd_roofline (commit 5f2c002). Output unchanged: flops 33554432→16777216, bytes 134217728→134217728.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 11 comments.

Comment thread tileops/ops/elementwise/activations.py
Comment thread tileops/ops/elementwise/activations.py
Comment thread tileops/ops/elementwise/activations.py
Comment thread tileops/ops/elementwise/activations.py
Comment thread tileops/ops/elementwise/activations.py
Comment thread tileops/ops/elementwise/activations.py
Comment thread tileops/ops/elementwise/activations.py Outdated
Comment thread tileops/ops/elementwise/activations.py
Comment thread tileops/ops/elementwise/activations.py Outdated
Comment thread tileops/ops/elementwise/activations.py
Copy link
Copy Markdown
Contributor

@Ibuki-wind Ibuki-wind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean — no issues.

lcy-seso and others added 4 commits May 9, 2026 13:18
… activations and clamp family

Add roofline.md §1.3 (Convention) naming the per-element FLOP rule:
arithmetic ops, transcendentals, and compare-and-select each count as
1 FLOP per output element. Two-sided clamp also counts as 1 FLOP per
element under this convention.

Audit each activation manifest entry (gelu, silu, sigmoid, tanh, elu,
selu, softplus, mish, hardsigmoid, hardswish) and replace ad-hoc
constants with derivations matching the convention. Each entry now
carries a one-line FLOPs derivation comment.

Unify the clamp / min-max family — hardtanh, clamp_scalar, clamp_min,
clamp_max, maximum, minimum — to 1 FLOP per output element. Removes
the 4*N figures previously carried on hardtanh and clamp_scalar.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>
…nvention

Round 2 of the per-element FLOP convention rollout. ReLU is one
compare-and-select per element under roofline.md §1.3, so collapse
ReluFwdOp to N. LeakyReluFwdOp and PreluFwdOp are compare-and-select
plus one mul on the negative branch, so collapse to 2 * N. Tensor-bound
ClampFwdOp's func helper now mirrors the YAML clamp family at 1 FLOP
per output element.

Add scripts/perf/flop_convention_delta.py and the matching docs/perf/
report so the before/after FLOP table required by AC-5 is reproducible
from a clean checkout without GPU access.
Update runtime FLOPS_PER_ELEM constants in tileops/ops/elementwise.py
to match the per-element FLOP convention defined in
docs/design/roofline.md §1.3, so that op.eval_roofline() and
benchmark TFLOPs stay manifest-aligned.

Per the convention:
- compare-and-select (single or two-sided clamp) = 1 FLOP/elem
- transcendental call (exp, tanh, erf, ...) = 1 FLOP/elem
- arithmetic op (add, sub, mul, div, ...) = 1 FLOP/elem

Aligned values:
- ReluFwdOp: 2 -> 1
- GeluFwdOp: 8 -> 5
- SiluFwdOp: 4 -> 5
- TanhFwdOp: 5 -> 1
- HardswishFwdOp: 7 -> 4
- HardsigmoidFwdOp: 6 -> 3
- MishFwdOp: 7 -> 4
- LeakyReluFwdOp: 3 -> 2
- EluFwdOp: 5 -> 4
- HardtanhFwdOp: 4 -> 1
- SoftplusFwdOp: 7 -> 5

Also refresh UnaryOp.FLOPS_PER_ELEM / eval_roofline docstrings to
reference the convention rather than the pre-convention examples.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>
Address review: _row_clamp_tensor now calls
tileops.perf.formulas.clamp_fwd_roofline with a minimal SimpleNamespace
stub op (exposing N_total and dtype.itemsize) so the after-column
tracks the helper as the source of truth and cannot drift.

Verified output unchanged: flops 33554432 -> 16777216, bytes
134217728 -> 134217728.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>
Address review: continuation lines on multi-line FLOPs derivation
comments were over-indented (8 spaces vs 4 for class attributes).
Dedent to match surrounding class-level indentation.

Co-Authored-By: Ibuki 🍃 — a wind born from GPTs <[email protected]>
Copilot AI review requested due to automatic review settings May 9, 2026 05:18
@lcy-seso lcy-seso force-pushed the refactor/manifest/issue-1379 branch from 2f49527 to ab99ac2 Compare May 9, 2026 05:18
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

@lcy-seso lcy-seso merged commit 666d243 into tile-ai:testbed May 9, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automated PR produced by an autonomous agent pipeline needs-review Awaiting human review before merge nightshift Pickable by foundry nightshift mode — fully autonomous agent development, no human approval gate refactor Code restructuring without behavior change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants