feat(pto-gss): add A5 buf-id sync emission style to GraphSyncSolver#665
feat(pto-gss): add A5 buf-id sync emission style to GraphSyncSolver#665zhangstevenunity wants to merge 8 commits into
Conversation
Users can opt into the A5-only `pto.get_buf` / `pto.rls_buf` bracket form
instead of `pto.set_flag` / `pto.wait_flag` via
`--graph-sync-solver-sync-style=buf-id`. The buf-id model is a sequential
programming model: the hw `(get_cnt, rel_cnt)` scoreboard counters
serialize same-id brackets across pipes and iterations, so the solver
emits brackets at producer/consumer anchor RWOperations and the counter
monotonicity handles both intra-iter (forward) and loop-carried (backward)
deps without out-of-loop compensation, scf.if isFirstIter / isLastIter
guards, or pto.barrier on A5.
Solver-internals reuse:
* Conflict detection / coloring / barrier-all fallback all reused.
* buf-id flips checkIntersect to "share any pipe -> conflict" because
get_buf doesn't carry the partner pipe in the opcode; without that
rule, two pipe-pairs sharing a pipe would alias to the same physical
scoreboard and produce back-to-back get(P,#id) which violates the
spec.
* buf-id collapses the per-pipe-pair EventIdSolver map into a single
instance (same trick cross-core already uses), so all pairs draw
from one id pool.
* getBeforeAfterSyncMaps adds a mirror-image dedup: forward F and
backward B with the same unordered anchor set {(setOp,setPipe),
(waitOp,waitPipe)} share the same brackets, so emitting both would
waste an id. We keep the smaller-id pair and skip the mirror, which
reproduces doc §1.2's canonical single-id `for {load; vector}` form.
* All backward-sync hoist optimizations
(tryMovingOutBackwardSyncPairs, considerOuterBackwardSyncPairs,
mergeBackwardSyncPairs) short-circuit in buf-id mode — they exist
to repair set/wait pairing across loop boundaries, which buf-id
doesn't need.
CLI / pass option:
--graph-sync-solver-sync-style={set-wait (default) | buf-id}
PTOGraphSyncSolver pass option: sync-style="set-wait" | "buf-id"
buf-id requires --pto-arch=a5; non-A5 errors out.
Tests (all PASS, plus full 209-test PTO lit suite green):
* graph_sync_solver_buf_id_basic.pto: forward MTE2->V->MTE3 chain,
distinct ids on shared pipe (V) for two consumer/producer brackets.
* graph_sync_solver_buf_id_arch_guard.pto: A3 + buf-id -> error.
* graph_sync_solver_buf_id_backward.pto: loop with tload+tmuls+tstore;
3 ids cover 2 forward + 1 backward dep, no scf.if, no out-of-loop ops.
* graph_sync_solver_buf_id_canonical_loop.pto: doc §1.2 canonical
`for {load; vector}` form; single id #0 covers forward + backward.
Design doc: docs/designs/ptoas-graph-sync-solver-buf-id-design.md.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
There was a problem hiding this comment.
Code Review
This pull request implements the 'Buf-ID' synchronization style for the PTOGraphSyncSolver, enabling get_buf and rls_buf bracketing on A5 architectures. The implementation includes new internal IR operations, updated conflict detection to enforce hardware constraints, and a mirror-image deduplication pass for optimized loop synchronization. The review feedback correctly identifies a typo in the design documentation and an outdated comment in the test suite regarding the implementation status of the deduplication logic.
| get_buf(MTE2, #id); load(); rel_buf(MTE2, #id) | ||
| get_buf(V, #id); Vector(); rel_buf(V, #id) |
There was a problem hiding this comment.
There's a small typo in this code block. rel_buf should be rls_buf to be consistent with the rest of the document and the operation name.
| get_buf(MTE2, #id); load(); rel_buf(MTE2, #id) | |
| get_buf(V, #id); Vector(); rel_buf(V, #id) | |
| get_buf(MTE2, #id); load(); rls_buf(MTE2, #id) | |
| get_buf(V, #id); Vector(); rls_buf(V, #id) |
| // Doc form uses ONE id (the same scoreboard counter naturally enforces | ||
| // MTE2 -> V -> MTE2 -> V -> ... ordering). Our solver allocates F and B | ||
| // as two ConflictPairs and assigns them distinct ids — that's correct but | ||
| // uses two ids instead of one. Both forms satisfy the buf-id spec; the | ||
| // single-id form is a future merge optimization, not required for | ||
| // correctness. |
There was a problem hiding this comment.
This comment appears to be outdated. The mirror-image deduplication logic added in SyncSolver.cpp in this PR is designed to handle this exact case, resulting in a single ID being used for the forward and backward dependencies. The comment should be updated to reflect that this optimization is now implemented.
// Doc form uses ONE id (the same scoreboard counter naturally enforces
// MTE2 -> V -> MTE2 -> V -> ... ordering). The solver's mirror-image
// deduplication logic identifies that the forward (tload->tmuls) and
// backward (tmuls->tload) dependencies form a mirror pair and can be
// covered by a single buf-id, matching the canonical form.
Codex Review该评论由 review 机器人自动更新。
SummaryReview failed at stage Findings未生成结构化 findings,因为 review 过程提前失败。 Log Tail |
The (get_cnt, rel_cnt) scoreboard counters monotonically advance per forward bracket. Iter i+1's get_buf is therefore automatically ordered after iter i's matching rls_buf — for both forward and backward edges of the same id. Set/wait needs an explicit pair across the loop edge (form 1 or form 2) because each set/wait flag is a one-shot handshake, but buf-id's counters make backward sync a free side-effect of the forward bracket. Emitting separate buf-id brackets for the backward ConflictPair just wastes an id and serializes more than the spec requires. Implementation: skip ConflictPair->isInnerBackward in getBeforeAfterSyncMaps when emitStyle == BUF_ID. The pair still gets an id from EventIdSolver (the allocator already ran), but the id is unused. Phase-2 can hoist the filter earlier (handleSetWaitConflict) to recover the wasted ids if pool pressure becomes a problem. Updates graph_sync_solver_buf_id_backward.pto CHECKs accordingly and expands the design doc §2.4 / §3.6 with the rationale. Co-Authored-By: Claude Opus 4.7 <[email protected]>
Previous "drop backward at emit" was over-aggressive — it broke the
single-buffer loop case (case 2 from the doc): tload, tmuls, tstore all
on the same tile would get distinct forward ids, and iter i+1's tload
would only chain through tmuls's id, leaving a race against iter i's
tstore on the shared tile.
The right model: backward analysis is still needed to learn buffer
sharing, but the result drives **id allocation** (which forward pairs
must reuse a single id to chain through the same scoreboard counter),
not bracket emission. The buf-id (get_cnt, rel_cnt) chain then
transitively serializes iter i+1 after iter i for free.
Implementation in getBeforeAfterSyncMaps (buf-id path):
* union-find over forward ConflictPairs; F1 and F2 are unioned iff
they share an anchor AND some backward B has F1's and F2's
non-shared endpoints as its anchors. The backward edge between
those two endpoints implies they're on the same buffer.
* Each union group gets a single shared bufId (smallest id among
members' allocated ids).
* Emit one bracket pair per unique (anchor_op, pipe, bufId) triple
(a shared anchor across multiple ConflictPairs in the same group
must dedup to satisfy spec constraint 1).
* Backward pairs never emit brackets — their semantic is already
captured by the shared-id chain.
Output diff:
* canonical_loop (2-pipe, single buf): still single id #0 (unchanged).
* backward_loop (3-pipe, single tile): now collapses to a single id
#0 across tload/tmuls/tstore (was 2 ids with race).
* basic.linear_chain (no loop, no backward): two ids #0/#1
(unchanged — no bridging backward to force merge).
* two_buffer_loop (new test, doc §1.3 case): two ids on the two
independent buffer chains, vadd has two brackets — exactly the
doc form.
All 210 lit tests PASS (one new test added for the two-buffer pingpong
case).
Co-Authored-By: Claude Opus 4.7 <[email protected]>
…d setOp The graph sync solver picks `setOp` / `waitOp` for each ConflictPair via LCA / placeholder hoisting (e.g. when the producer lives inside an scf.if-then branch and the consumer outside, setOp lands at the parent block, not at the conditional op). For set/wait this is correct — the flag signal needs to be visible along every CFG edge so the pairing holds. For buf-id, hoisting the get_buf/rls_buf bracket above an scf.if is semantically muddled: the counter advance is decoupled from whether the guarded pipe op actually executes. When the conditional path is taken, get_cnt/rel_cnt get bumped without a corresponding real pipe op. The chain still happens to work in simple cases (a consumer outside the branch waits on a phantom counter), but the model loses its "bracket the actual op" invariant and gets confusing on more complex CFG shapes. Switch buf-id emission to anchor at the raw RWOperation (`op1`, `op2`) instead of the solver's `setOp` / `waitOp`. Both mirror dedup and the final bracket emission now use op1/op2 directly. Result: brackets land tightly around the actual pipe op, inside whatever scope contains it (scf.if, scf.for, etc). New regression: graph_sync_solver_buf_id_if_branch.pto exercises a tload inside an scf.if with the tmuls outside. The fixed emit places get_buf(TLOAD,#0) and rls_buf(TLOAD,#0) inside the scf.if (around the tload) and get_buf(TVEC,#0)/rls_buf(TVEC,#0) at the tmuls. All 211 lit tests PASS. Co-Authored-By: Claude Opus 4.7 <[email protected]>
Symmetric companion to graph_sync_solver_buf_id_if_branch.pto: producer runs unconditionally, consumer (tmuls) lives inside an scf.if-then. The buf-id emit anchors brackets at the raw RWOperations, so: * MTE2 get/rls bracket sits tight around tload, OUTSIDE the scf.if. * V get/rls bracket sits tight around tmuls, INSIDE the scf.if. * Both brackets share buf id #0 (single buffer / single chain). The conditional consumer simply skips its bracket when cond=false — the counter doesn't get an unmatched bump, and the model stays clean. Co-Authored-By: Claude Opus 4.7 <[email protected]>
Two complementary cross-loop sync shapes:
Case A: producer inside the loop, consumer after.
for i: tload (src_i -> tile)
tmuls (tile) # depends on last iter's tload
Case B: two adjacent loops feeding through one shared tile.
for i: tload (src_i -> tile)
for j: tmuls (tile) # depends on first loop's final output
Both should produce brackets anchored at the raw RWOperations:
* tload bracket inside the loop body — once per iter, counter advances
N times in sequence.
* tmuls bracket at its actual location (after the loop in case A,
inside the consumer loop body in case B), NOT at a loop boundary.
* Same buf id #0 chains producer and consumer via counter monotonicity.
Verifies the anchor-at-raw-RWOp fix (de7fe04) generalizes from scf.if to
scf.for: even when the solver would hoist setOp/waitOp to a loop
placeholder for set/wait, buf-id ignores that and brackets the actual
op.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 780ad32a35
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| /*isRegBasedArch=*/isA5); | ||
| opts.eventIdNumMax = eventIdNumMax; | ||
| opts.emitStyle = emitStyle; | ||
| if (emitStyle == SyncEmitStyle::BUF_ID) { |
There was a problem hiding this comment.
Skip same-pipe barriers in buf-id mode
When buf-id is selected here, same-pipe hazards are still sent through the existing handleBarrierConflict path; that path only suppresses A5 barriers for PIPE_V/PIPE_M, so legal A5 cases such as two TLOAD/MTE2 operations on the same buffer can still produce pto.barrier even though this option and the pass docs promise that buf-id/A5 emits no barriers. This makes the new mode generate unsupported/unwanted sync for same-pipe MTE2/MTE3/FIX conflicts unless the option also disables barrier conflicts for all pipes in buf-id mode.
Useful? React with 👍 / 👎.
… share an id
Two ConflictPairs that anchor inside disjoint scf.if branches never execute
concurrently, so the spec-constraint-1 ("two consecutive get_buf(P,#id)
before a single op") cannot apply: the two brackets are on different
control-flow paths and the hw counter sees only one of them per
iteration. The previous checkIntersect rule for buf-id ("share any pipe
-> conflict") incorrectly forced distinct ids on such pairs, which:
* wasted id budget when the id pool was tight, falling back to
barrier-all sooner than necessary;
* for the gemm.pto case with --event-id-max=8 still bracketed both
branches correctly, but with --event-id-max smaller the else-branch
was being collapsed away under id pressure.
Add a `mutuallyExclusive` predicate in checkIntersect's buf-id branch:
walk both pairs' ancestor Conditions and detect if they require disjoint
true/false branches of the same Condition. If yes, drop the conflict
edge so the coloring is free to share an id.
Regression: graph_sync_solver_buf_id_if_both_branches.pto runs with
--graph-sync-solver-event-id-max=1 and verifies that both then- and
else-branches get fully-bracketed forward-dep brackets sharing id #0.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
…itive pruning
For set/wait, checkGraphConflict's transitive pruning treats two ConflictPairs
on the same (pipeSrc, pipeDst) as transitively covering each other via
same-pipe FIFO ordering. That's safe because flag-based sync is fundamentally
"signal the LAST producer, wait at the consumer" — issue-queue order takes
care of the rest.
For buf-id this is unsound: each buf-id is an independent scoreboard counter,
so a covering pair on the same pipe-pair but a DIFFERENT buffer does not
transitively sync the candidate's counter. Concretely, when two textracts on
PIPE_MTE1 feed one tmatmul.acc on PIPE_M but on distinct left/right tile
buffers, the solver previously pruned one of the two textract->tmatmul pairs;
the tmatmul ended up with only one consumer bracket, missing the wait on the
other textract.
Plumb the underlying buffer through:
* checkMemoryConflicts now returns
SmallVector<tuple<CorePipeInfo, CorePipeInfo, mlir::Value>>
where the Value identifies the SSA buffer that triggered the conflict.
Multiple conflicting buffers at the same pipe-pair yield one tuple each.
* processConflict iterates the tuples and forwards the buffer through
handleConflict / handleBarrierConflict / handleSetWaitConflict.
* ConflictPair gains a `conflictBuffer` field; clone() preserves it.
* checkGraphConflict gains a `conflictBuffer` parameter. In buf-id mode
(when the candidate has a non-null buffer) it filters out existing
pairs whose `conflictBuffer` is non-null and different — those pairs
cannot transitively cover the candidate under independent counters.
Null buffers fall through, preserving legacy set/wait behavior.
For the gemm.pto kernel that triggered the bug, the first three matmul.acc
calls in the inner else-branch now have two M-pipe brackets each — one per
textract producer (left + right tile) — exactly matching the user's
expected form A,B->C on distinct buffers yielding distinct buf-ids.
214 lit tests still pass.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
Users can opt into the A5-only
pto.get_buf/pto.rls_bufbracket form instead ofpto.set_flag/pto.wait_flagvia--graph-sync-solver-sync-style=buf-id. The buf-id model is a sequential programming model: the hw(get_cnt, rel_cnt)scoreboard counters serialize same-id brackets across pipes and iterations, so the solver emits brackets at producer/consumer anchor RWOperations and the counter monotonicity handles both intra-iter (forward) and loop-carried (backward) deps without out-of-loop compensation, scf.if isFirstIter / isLastIter guards, or pto.barrier on A5.Solver-internals reuse:
for {load; vector}form.CLI / pass option:
--graph-sync-solver-sync-style={set-wait (default) | buf-id}
PTOGraphSyncSolver pass option: sync-style="set-wait" | "buf-id"
buf-id requires --pto-arch=a5; non-A5 errors out.
Tests (all PASS, plus full 209-test PTO lit suite green):
for {load; vector}form; single id #0 covers forward + backward.Design doc: docs/designs/ptoas-graph-sync-solver-buf-id-design.md.