Add distributed FFN GridPipe TPUSH/TPOP example on A2/A3 #123
Open
chenshengxin2026 wants to merge 1 commit into
Code Review
This pull request introduces GridPipe, a 2D mesh communication abstraction with TPUSH and TPOP primitives, along with a mock backend for A2/A3 silicon that emulates hardware synchronization via global memory atomic flags. It includes runtime initialization helpers and a distributed FFN demo for end-to-end validation. Review feedback identifies opportunities to refactor duplicated logic in the mock SPR layer, improve the robustness of fault code generation by avoiding fragile enum arithmetic, and ensure proper resource cleanup in the host driver by using standard C++ termination functions.
ff2116d to 3039ba8
Adds `kernels/manual/a2a3/distributed_ffn_grid/`, a single-device multi-block FFN demo that drives a real GridPipe `TPUSH<Direction>` / `TPOP<Direction>` pipeline through the existing PTO ISA header stack on A2/A3.

Header layer
- `include/pto/common/grid_pipe.hpp`: `GridDirection`, `GridShape`, `GridCoord`, `NeighborRankForPush / NeighborRankForPop`, `is_grid_pipe_v`.
- `include/pto/common/grid_pipe_mock_spr.hpp`: `MockMtsprCounter` / `MockTryWfeCounter` with the canonical pre-dcci -> store -> post-dcci -> `dsb(DSB_DDR)` publish and `dcci` + `pipe_barrier(PIPE_ALL)` spin. Ready and Free helpers are thin wrappers over the unified counter primitive.
- `include/pto/common/grid_counter_intrinsic.hpp` (new): CCE-intrinsic-style neighbor counter API used by `GridTPush` / `GridTPop`. Exposes `mtspr_neighbor_counter(operand, kind, dir, value)` and `wfe_neighbor_counter(operand, kind, dir, threshold, maxSpins)` with a hardware contract that maps directly to `mtspr SPR_RDY/FREE_<DIR>` and `wfe SPR_RDY/FREE_<DIR>` semantics, plus a `NeighborCounterOperand` carrying the GM-backed counter address.
- `include/pto/npu/a2a3/grid_pipe_runtime.hpp`: window layout and `InitGridPipeFromWindow`.
- `include/pto/npu/a2a3/GridTPush.hpp` / `GridTPop.hpp`: A2/A3 backend. All ready/free synchronization goes through the new intrinsic API; the payload slot transfer still uses `HcclRemotePtr` + the mock window.
- `include/pto/common/pto_instr.hpp` / `pto_instr_impl.hpp`: `GridDirection`-templated `TPUSH` / `TPOP` overloads. `TPUSH<SOURCE>` is rejected at compile time and unsupported targets emit a `static_assert` rather than silently falling back to a GM stub.

Demo layer
- One mixed-arch kernel `distributed_ffn_grid_compute_kernel.cpp` (`dav-c220`). Cube and Vec branches are gated by `__DAV_CUBE__` / `__DAV_VEC__` and exchange gate / up / hidden / down intermediates through regular A2/A3 `TPipe` ready/free handshakes inside a single launch. This replaces the earlier dual `.so` (cube `dav-c220-cube` + vec `dav-c220-vec`) and the host-side phase 0 / phase 1 doorbell sequencing.
- TLOAD/TSTORE are confined to the FFN head and tail only: Cube uses TLOAD to bring `X`, `W_gate`, `W_up`, `W_down` into L1 once; the final-column Vec block uses TSTORE to write the fp32 `[T, H]` row output. Every intermediate (`gate_partial`, `up_partial`, `hidden`, `down_partial`, the EAST reduce payload) flows through `TPUSH`/`TPOP` -- C2V/V2C `TPipe` for intra-cell handoff, GridPipe `TPUSH<EAST>` / `TPOP<EAST>` for inter-cell reduce.
- `main.cpp` / `kernel_launch.hpp` / `ffn_config.hpp` / `gridpipe_payload_inl.hpp`: ACL bootstrap, fake `HcclDeviceContext` with per-cell GridPipe windows on a single device, mixed-kernel single launch, fp32 golden compare on the last-column cell of each row.
- `scripts/gen_data.py`: NumPy fp32 FFN reference with `alpha=0.1`, per-cell fp16 X / W_gate / W_up / W_down shards, fp32 `[T_total, H]` golden tensor.
- `CMakeLists.txt` / `run.sh`: build the single mixed device `.so` plus the host binary, run `gen_data` + cmake, launch on a single device.
- `README.md` / `README_zh.md`: design overview, head/tail TLOAD/TSTORE contract, neighbor counter intrinsic adaptation path, mock SPR/WFE expansion, ready/free flag layout, build/run flow.

Adapting when real hardware supports neighbor counters
- Today on A2/A3, `mtspr_neighbor_counter` / `wfe_neighbor_counter` lower to the GM-counter mock under `grid_pipe_mock_spr.hpp` and use the `NeighborCounterOperand::addr` GM pointer that the GridPipe window pre-allocates per direction.
- When silicon adds a real neighbor SPR / WFE counter, define `PTO_GRID_COUNTER_NATIVE_INTRINSIC` and provide compiler builtins `__builtin_pto_mtspr_neighbor_counter(kind, dir, value)` and `__builtin_pto_wfe_neighbor_counter(kind, dir, threshold)` with the same release / acquire contract. Call sites in `GridTPush` / `GridTPop` do not change; the GM counter address in `NeighborCounterOperand` is ignored on the native path and the GridPipe window's ready/free counter region becomes unused. Payload slot transfer continues through the existing `HcclRemotePtr` adaptor and can be retargeted to the silicon's grid slot address mechanism without touching the demo kernel source.

Validation
- Build: `bash kernels/manual/a2a3/distributed_ffn_grid/run.sh --build-only -v Ascend910B1` produces the mixed kernel `.so` and the host binary.
- NPU 2x2: `task-submit --device auto --run "bash kernels/manual/a2a3/distributed_ffn_grid/run.sh -r npu -v Ascend910B1 --grid-rows 2 --grid-cols 2"` -> 3 / 3 deterministic runs, `ResultCmp=1`, `max_diff=0`.
- Regression: `cmake --build kernels/manual/a2a3/allgather_gemm/build -j8` RC=0.

Limitations
- A2/A3 mock backend is not silicon validation evidence for LPU WSE. The design-doc section 5.5 RFCs (cross-core SPR visibility, `TMOV` slot address space) remain open and gate the LPU WSE silicon lowering.
- Negative / fault-injection kernels are planned as a follow-up.
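The release / acquire contract stated for the neighbor counters can be illustrated with a host-only analogue. The following is a minimal sketch, assuming standard C++ atomics in place of the device-side `dcci` / `dsb(DSB_DDR)` sequence; `HostCounter`, `PublishCounter`, and `TryWaitCounter` are hypothetical names that do not appear in this PR, and the `bool` return of the wait is an assumption rather than the actual `wfe_neighbor_counter` signature.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for one GM-backed neighbor counter.
struct HostCounter {
    std::atomic<std::uint32_t> value{0};
};

// Publish side (analogue of mtspr_neighbor_counter): the release store makes
// every earlier payload write visible to a thread that later acquires `value`.
inline void PublishCounter(HostCounter& counter, std::uint32_t next) {
    // Counters are monotonically increasing; wraparound is ignored in this sketch.
    assert(next >= counter.value.load(std::memory_order_relaxed));
    counter.value.store(next, std::memory_order_release);
}

// Wait side (analogue of wfe_neighbor_counter): poll with acquire loads until
// the counter reaches `threshold`, giving up after `maxSpins` polls.
inline bool TryWaitCounter(const HostCounter& counter, std::uint32_t threshold,
                           std::uint32_t maxSpins) {
    for (std::uint32_t spin = 0; spin < maxSpins; ++spin) {
        if (counter.value.load(std::memory_order_acquire) >= threshold) {
            return true;
        }
    }
    return false;
}
```

In this model a producer would write its payload slot and then call `PublishCounter(ready, n)`, while the consumer calls `TryWaitCounter(ready, n, maxSpins)` before reading the slot. A bounded spin lets the caller report a timeout instead of hanging when the peer never publishes.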
3039ba8 to b4e83e8
Summary

Adds `kernels/manual/a2a3/distributed_ffn_grid/`, an A2/A3 single-device multi-block FFN demo that drives a real GridPipe `TPUSH<Direction>` / `TPOP<Direction>` pipeline through the existing PTO ISA header stack.

The demo intentionally restricts TLOAD/TSTORE to the head and tail of the FFN. Every intermediate tile flows through the TPipe / GridPipe pop/push surface, so the same kernel source can lower to silicon GridPipe without revisiting the dataflow.

TLOAD/TSTORE only at head and tail

- `X` / `W_gate` / `W_up` / `W_down` → L1: `TLOAD` (Cube, once per cell)
- `TPUSH<C2V>` / `TPOP<C2V>` (TPipe)
- `TPUSH<V2C>` / `TPOP<V2C>` (TPipe)
- `TPUSH<EAST>` / `TPOP<EAST>` (GridPipe)
- `yOutput[row]`: `TSTORE` (final-column block, fp32 `[T, H]`)

`gate_partial`, `up_partial`, `hidden`, `down_partial`, and the EAST reduce payload never touch GM through TLOAD/TSTORE; they ride the `TPipe` / `GridPipe` FIFO surface. This keeps the demo aligned with the silicon GridPipe programming model, where intermediate-tile spill / fill is a back-end choice, not a kernel-source decision.

Mock CCE-intrinsic API for neighbor counters
New header `include/pto/common/grid_counter_intrinsic.hpp` introduces two CCE-intrinsic-style functions that sit at the same layer as `dcci`:

- `mtspr_neighbor_counter(operand, kind, dir, value)`: publish a monotonically increasing `Ready` or `Free` counter to the neighbor in direction `dir`. Hardware contract maps to `mtspr SPR_RDY_<DIR>, value` / `mtspr SPR_FREE_<DIR>, value`. Release ordering: prior payload writes must be globally visible before the peer observes the counter.
- `wfe_neighbor_counter(operand, kind, dir, threshold, maxSpins)`: block until the local mirror of the neighbor-produced counter reaches `threshold`. Hardware contract maps to `wfe SPR_RDY_<DIR>, threshold` / `wfe SPR_FREE_<DIR>, threshold`. Acquire ordering: subsequent loads must not be reordered above the wait.

`NeighborCounterOperand` carries the backend operand (today: a GM counter pointer). On A2/A3 the function bodies dispatch to `grid_pipe_mock_spr.hpp` (`MockMtsprCounter` + `MockTryWfeCounter`), which already implement the canonical pre-`dcci` → store → post-`dcci` → `dsb(DSB_DDR)` publish and the `dcci` + `pipe_barrier(PIPE_ALL)` spin.

`GridTPush.hpp` and `GridTPop.hpp` now go through these intrinsics for all ready/free synchronization; call sites no longer reference `grid_mock::*` for counters. Slot transfer remains on `HcclRemotePtr` plus the existing window layout.

Adapting when real hardware supports neighbor SPR / WFE counters
When silicon exposes native neighbor counters, the demo kernel source does not change:

- Define `PTO_GRID_COUNTER_NATIVE_INTRINSIC` for the device target.
- `NeighborCounterOperand::addr` is ignored on the native path, so the GM ready/free counter region inside the GridPipe window becomes unused (slot region and `HcclRemotePtr` payload addressing remain).
- Payload slot transfer (`a2a3_grid_payload::CopyGmSlotToTile` / `WriteTileToPeerGmSlot`) can be retargeted to the silicon's grid slot mechanism in a follow-up by replacing only the payload adaptor, with no changes to `GridTPush.hpp` / `GridTPop.hpp` or to the demo kernel.

This separation is the reason the new intrinsic API was introduced rather than calling the mock helpers directly: only one file in the build (`grid_counter_intrinsic.hpp`) chooses between native lowering and the GM-counter mock.

Header layer
- `include/pto/common/grid_pipe.hpp`: `GridDirection`, `GridShape`, `GridCoord`, `NeighborRankForPush / NeighborRankForPop`, `is_grid_pipe_v`.
- `include/pto/common/grid_pipe_mock_spr.hpp`: unified `MockMtsprCounter` / `MockTryWfeCounter` with the canonical publish/spin sequence; `Ready` and `Free` helpers are thin wrappers.
- `include/pto/common/grid_counter_intrinsic.hpp`: new CCE-intrinsic-style API (see above).
- `include/pto/npu/a2a3/grid_pipe_runtime.hpp`: window layout (`kWindowBytes`, `kReadyFlagOffset`, `kFreeFlagOffset`, `kSlotOffset`) and `InitGridPipeFromWindow`.
- `include/pto/npu/a2a3/GridTPush.hpp` / `GridTPop.hpp`: A2/A3 backend; counters go through the intrinsic API, payload still uses `HcclRemotePtr`.
- `include/pto/common/pto_instr.hpp` / `pto_instr_impl.hpp`: `GridDirection`-templated `TPUSH` / `TPOP` overloads. `TPUSH<SOURCE>` is rejected at compile time; unsupported targets emit a `static_assert`.

Demo layer (single mixed-arch kernel)
- `distributed_ffn_grid_compute_kernel.cpp` (`dav-c220`): single kernel with Cube and Vec branches gated by `__DAV_CUBE__` / `__DAV_VEC__`. They exchange gate / up / hidden / down intermediates through A2/A3 `TPipe` ready/free handshakes inside one launch, and run the row-local EAST `fp32` reduce via `TPOP<EAST>` + `TADD` + `TPUSH<EAST>`, with the final-column block doing the tail TSTORE. This replaces the earlier dual `.so` (cube + vec) and host-side phase 0 / phase 1 doorbell sequencing.
- `main.cpp` / `kernel_launch.hpp` / `ffn_config.hpp` / `gridpipe_payload_inl.hpp`: ACL bootstrap, fake `HcclDeviceContext` with per-cell GridPipe windows on a single device, single-launch mixed kernel, `fp32` golden compare on the last-column block of each row.
- `scripts/gen_data.py`: NumPy `fp32` FFN reference with `alpha=0.1`, per-cell `fp16` X / W_gate / W_up / W_down shards, `fp32` `[T_total, H]` golden tensor.
- `CMakeLists.txt` / `run.sh`: single mixed device `.so` plus host binary; runs `gen_data` + cmake; launches on a single device.
- `README.md` / `README_zh.md`: design overview, head/tail TLOAD/TSTORE contract, neighbor counter intrinsic adaptation path, mock SPR/WFE expansion, ready/free flag layout, build/run flow.

Validation
- Build: `bash kernels/manual/a2a3/distributed_ffn_grid/run.sh --build-only -v Ascend910B1` produces the mixed `.so` and the host binary.
- NPU 2x2: `task-submit --device auto --run "bash kernels/manual/a2a3/distributed_ffn_grid/run.sh -r npu -v Ascend910B1 --grid-rows 2 --grid-cols 2"` → 3 / 3 deterministic runs, `ResultCmp=1`, `max_diff=0`.
- Regression: `cmake --build kernels/manual/a2a3/allgather_gemm/build -j8` RC=0.

Limitations

- The design-doc section 5.5 RFCs (cross-core SPR visibility, `TMOV` slot address space) remain open and gate the LPU WSE silicon lowering.

Test plan
- `bash kernels/manual/a2a3/distributed_ffn_grid/run.sh --build-only -v Ascend910B1` produces the mixed kernel `.so` and the host binary.
- `task-submit --device auto --run "bash kernels/manual/a2a3/distributed_ffn_grid/run.sh -r npu -v Ascend910B1 --grid-rows 2 --grid-cols 2"` prints `[SUCCESS]` and `ResultCmp=1` for all final-column blocks.
- `cmake --build kernels/manual/a2a3/allgather_gemm/build -j8` remains green.
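To make the golden comparison on the final-column blocks easier to reason about, the row-local EAST reduce (`TPOP<EAST>` + `TADD` + `TPUSH<EAST>`) can be modeled on the host as a running sum along one grid row. This is only a sketch under stated assumptions: `Tile` and `ReduceRowEast` are hypothetical names, the tiles are plain `float` vectors, and column 0 is assumed to have no partial to pop and to forward its own partial unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of one fp32 tile held by a grid cell.
using Tile = std::vector<float>;

// Models the chain along one row: column c receives the running sum pushed by
// column c-1, adds its own partial, and forwards the result east. The value
// left in the final column is what the tail TSTORE would write for that row.
Tile ReduceRowEast(const std::vector<Tile>& partials) {
    assert(!partials.empty());
    Tile running = partials.front();  // assumption: column 0 has nothing to pop
    for (std::size_t col = 1; col < partials.size(); ++col) {
        assert(partials[col].size() == running.size());
        for (std::size_t i = 0; i < running.size(); ++i) {
            running[i] += partials[col][i];  // element-wise add in fp32
        }
    }
    return running;
}
```

For a 2x2 grid, each row would call this with two partial tiles, and the returned tile is the one a host-side check would compare against that row's slice of the golden tensor.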