Add distributed FFN GridPipe TPUSH/TPOP example on A2/A3 by chenshengxin2026 · Pull Request #123 · hw-native-sys/pto-isa

chenshengxin2026 · 2026-05-12T08:46:37Z

Summary

Adds kernels/manual/a2a3/distributed_ffn_grid/, an A2/A3 single-device multi-block FFN demo that drives a real GridPipe TPUSH<Direction> / TPOP<Direction> pipeline through the existing PTO ISA header stack.

The demo intentionally restricts TLOAD/TSTORE to the head and tail of the FFN. Every intermediate tile flows through the TPipe / GridPipe pop/push surface, so the same kernel source can lower to silicon GridPipe without revisiting the dataflow.

TLOAD/TSTORE only at head and tail

| Stage | Movement | Instruction |
| --- | --- | --- |
| Head | GM `X` / `W_gate` / `W_up` / `W_down` → L1 | `TLOAD` (Cube, once per cell) |
| Cube → Vec (gate / up / down) | L1 / Acc → Vec UB | `TPUSH<C2V>` / `TPOP<C2V>` (`TPipe`) |
| Vec → Cube (hidden) | Vec UB → L1 | `TPUSH<V2C>` / `TPOP<V2C>` (`TPipe`) |
| Inter-cell EAST reduce | Vec UB ↔ peer GridPipe slot | `TPUSH<EAST>` / `TPOP<EAST>` (`GridPipe`) |
| Tail | Vec UB → GM `yOutput[row]` | `TSTORE` (final-column block, fp32 `[T, H]`) |

gate_partial, up_partial, hidden, down_partial, and the EAST reduce payload never touch GM through TLOAD/TSTORE — they ride the TPipe / GridPipe FIFO surface. This keeps the demo aligned with the silicon GridPipe programming model where intermediate-tile spill / fill is a back-end choice, not a kernel-source decision.

Mock CCE-intrinsic API for neighbor counters

New header include/pto/common/grid_counter_intrinsic.hpp introduces two CCE-intrinsic-style functions that sit at the same layer as dcci:

mtspr_neighbor_counter(operand, kind, dir, value) — publish a monotonically-increasing Ready or Free counter to the neighbor in direction dir. Hardware contract maps to mtspr SPR_RDY_<DIR>, value / mtspr SPR_FREE_<DIR>, value. Release ordering: prior payload writes must be globally visible before the peer observes the counter.
wfe_neighbor_counter(operand, kind, dir, threshold, maxSpins) — block until the local mirror of the neighbor-produced counter reaches threshold. Hardware contract maps to wfe SPR_RDY_<DIR>, threshold / wfe SPR_FREE_<DIR>, threshold. Acquire ordering: subsequent loads must not be reordered above the wait.
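The release / acquire contract of the two functions above can be illustrated on the host with `std::atomic`. This is only a minimal sketch under stated assumptions: `CounterCell`, `PublishCounter`, and `WaitCounter` are hypothetical names, not part of the PR, and the `bool` timeout return is an assumption suggested by the `MockTryWfeCounter` name. Counter wraparound is not handled.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one neighbor-visible counter (Ready or Free).
struct CounterCell {
    std::atomic<uint32_t> value{0};
};

// Publish side: the release store guarantees that payload writes made before
// this call are visible to any peer that observes the new counter value.
inline void PublishCounter(CounterCell &cell, uint32_t value)
{
    cell.value.store(value, std::memory_order_release);
}

// Wait side: poll until the counter reaches `threshold` or `maxSpins` polls
// have elapsed. The acquire load keeps later payload reads from being
// reordered above the wait. Returns true on success, false on timeout.
inline bool WaitCounter(const CounterCell &cell, uint32_t threshold, uint32_t maxSpins)
{
    for (uint32_t spin = 0; spin < maxSpins; ++spin) {
        if (cell.value.load(std::memory_order_acquire) >= threshold) {
            return true;
        }
    }
    return false;
}
```

Because the counters are monotonically increasing, the waiter compares with `>=` rather than `==`, so a waiter that arrives late still sees the condition as satisfied.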

NeighborCounterOperand carries the backend operand (today: a GM counter pointer). On A2/A3 the function bodies dispatch to grid_pipe_mock_spr.hpp (MockMtsprCounter + MockTryWfeCounter), which already implement the canonical pre-dcci → store → post-dcci → dsb(DSB_DDR) publish and the dcci + pipe_barrier(PIPE_ALL) spin.

GridTPush.hpp and GridTPop.hpp now go through these intrinsics for all ready/free synchronization — call sites no longer reference grid_mock::* for counters. Slot transfer remains on HcclRemotePtr plus the existing window layout.
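To make the ready/free synchronization concrete, the following host-side sketch models one producer and one consumer sharing a small slot ring guarded by a Ready counter and a Free counter. Everything here is hypothetical: `MockWindow`, `TryPush`, `TryPop`, and the ring depth `kSlots` are illustration names, and a single `float` stands in for a tile. The real backend moves payloads through `HcclRemotePtr` and the window layout, which this sketch does not model.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSlots = 2; // hypothetical ring depth

struct MockWindow {
    std::array<float, kSlots> slot{};     // payload slots (one float per "tile")
    std::atomic<uint32_t> readyCount{0};  // tiles published by the producer
    std::atomic<uint32_t> freeCount{0};   // tiles released by the consumer
};

struct Producer { uint32_t pushed = 0; };
struct Consumer { uint32_t popped = 0; };

// Push one tile if a slot is free: check Free, write payload, publish Ready.
inline bool TryPush(MockWindow &w, Producer &p, float tile)
{
    if (p.pushed - w.freeCount.load(std::memory_order_acquire) >= kSlots) {
        return false; // every slot still holds an unconsumed tile
    }
    w.slot[p.pushed % kSlots] = tile;
    ++p.pushed;
    w.readyCount.store(p.pushed, std::memory_order_release);
    return true;
}

// Pop one tile if one is ready: check Ready, read payload, publish Free.
inline bool TryPop(MockWindow &w, Consumer &c, float &tile)
{
    if (w.readyCount.load(std::memory_order_acquire) <= c.popped) {
        return false; // nothing published yet
    }
    tile = w.slot[c.popped % kSlots];
    ++c.popped;
    w.freeCount.store(c.popped, std::memory_order_release);
    return true;
}
```

The payload write sits between the Free check and the Ready publish, which is the ordering the release / acquire contract is meant to protect.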

Adapting when real hardware supports neighbor SPR / WFE counters

When silicon exposes native neighbor counters, the demo kernel source does not need to change:

Define PTO_GRID_COUNTER_NATIVE_INTRINSIC for the device target.

Provide compiler builtins with the same contract:

__builtin_pto_mtspr_neighbor_counter(uint32_t kind, uint32_t dir, uint32_t value);   // release
__builtin_pto_wfe_neighbor_counter (uint32_t kind, uint32_t dir, uint32_t threshold); // acquire

The header automatically swaps the mock body for the builtin path. NeighborCounterOperand::addr is ignored on the native path, so the GM ready/free counter region inside the GridPipe window becomes unused (slot region and HcclRemotePtr payload addressing remain).
Payload slot delivery (a2a3_grid_payload::CopyGmSlotToTile / WriteTileToPeerGmSlot) can be retargeted to the silicon's grid slot mechanism in a follow-up by replacing only the payload adaptor — no changes to GridTPush.hpp / GridTPop.hpp or to the demo kernel.

This separation is the reason the new intrinsic API was introduced rather than calling the mock helpers directly: only one file in the build (grid_counter_intrinsic.hpp) chooses between native lowering and the GM-counter mock.
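The single-file switch described above might look roughly like the sketch below. The macro name, the builtin names, and the function signatures are taken from this description; the field type of `NeighborCounterOperand`, the `bool` return value, and the simplified mock bodies (a plain atomic store and a bounded poll instead of the `dcci` / `dsb` sequence) are assumptions made so the sketch compiles on a host.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Name follows the PR text; the field type is an assumption for this sketch.
struct NeighborCounterOperand {
    std::atomic<uint32_t> *addr; // GM counter pointer, used on the mock path only
};

inline void mtspr_neighbor_counter(NeighborCounterOperand operand, uint32_t kind,
                                   uint32_t dir, uint32_t value)
{
#if defined(PTO_GRID_COUNTER_NATIVE_INTRINSIC)
    (void)operand; // addr is ignored on the native path
    __builtin_pto_mtspr_neighbor_counter(kind, dir, value);
#else
    (void)kind;
    (void)dir;
    operand.addr->store(value, std::memory_order_release); // simplified mock publish
#endif
}

inline bool wfe_neighbor_counter(NeighborCounterOperand operand, uint32_t kind,
                                 uint32_t dir, uint32_t threshold, uint32_t maxSpins)
{
#if defined(PTO_GRID_COUNTER_NATIVE_INTRINSIC)
    (void)operand;
    (void)maxSpins; // assumption: the native wait blocks in hardware
    __builtin_pto_wfe_neighbor_counter(kind, dir, threshold);
    return true;
#else
    (void)kind;
    (void)dir;
    for (uint32_t spin = 0; spin < maxSpins; ++spin) { // simplified mock spin
        if (operand.addr->load(std::memory_order_acquire) >= threshold) {
            return true;
        }
    }
    return false;
#endif
}
```

Call sites see the same two functions on both paths, which is why only the build definition changes when the native lowering becomes available.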

Header layer

include/pto/common/grid_pipe.hpp — GridDirection, GridShape, GridCoord, NeighborRankForPush / NeighborRankForPop, is_grid_pipe_v.
include/pto/common/grid_pipe_mock_spr.hpp — unified MockMtsprCounter / MockTryWfeCounter with the canonical publish/spin sequence; Ready and Free helpers are thin wrappers.
include/pto/common/grid_counter_intrinsic.hpp — new CCE-intrinsic-style API (see above).
include/pto/npu/a2a3/grid_pipe_runtime.hpp — window layout (kWindowBytes, kReadyFlagOffset, kFreeFlagOffset, kSlotOffset) and InitGridPipeFromWindow.
include/pto/npu/a2a3/GridTPush.hpp / GridTPop.hpp — A2/A3 backend; counters go through the intrinsic API, payload still uses HcclRemotePtr.
include/pto/common/pto_instr.hpp / pto_instr_impl.hpp — GridDirection-templated TPUSH / TPOP overloads. TPUSH<SOURCE> is rejected at compile time, and unsupported targets emit a static_assert rather than silently falling back to a GM stub.
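As a rough illustration of the header layer, the sketch below shows how a direction-templated neighbor lookup can reject `SOURCE` at compile time. The type and function names mirror those listed above, but their definitions here are assumptions: the enumerator set other than `EAST` and `SOURCE`, the row-major rank numbering, the `kNoNeighbor` sentinel, and the reading that `TPOP<EAST>` receives data travelling east (that is, from the west neighbor) are all inferred for illustration. It requires C++17.

```cpp
#include <cassert>
#include <cstdint>

// Assumed enumerators; the PR text only names EAST and SOURCE explicitly.
enum class GridDirection : uint32_t { EAST, WEST, SOUTH, NORTH, SOURCE };

struct GridShape { uint32_t rows; uint32_t cols; };
struct GridCoord { uint32_t row; uint32_t col; };

constexpr int32_t kNoNeighbor = -1; // hypothetical "edge of grid" sentinel

// Assumed row-major rank numbering.
constexpr int32_t RankOf(GridShape s, uint32_t row, uint32_t col)
{
    return static_cast<int32_t>(row * s.cols + col);
}

// Rank that receives a tile pushed in direction Dir, or kNoNeighbor at an edge.
template <GridDirection Dir>
constexpr int32_t NeighborRankForPush(GridShape s, GridCoord c)
{
    static_assert(Dir != GridDirection::SOURCE, "SOURCE is not a valid push direction");
    if constexpr (Dir == GridDirection::EAST) {
        return (c.col + 1 < s.cols) ? RankOf(s, c.row, c.col + 1) : kNoNeighbor;
    } else if constexpr (Dir == GridDirection::WEST) {
        return (c.col > 0) ? RankOf(s, c.row, c.col - 1) : kNoNeighbor;
    } else if constexpr (Dir == GridDirection::SOUTH) {
        return (c.row + 1 < s.rows) ? RankOf(s, c.row + 1, c.col) : kNoNeighbor;
    } else {
        return (c.row > 0) ? RankOf(s, c.row - 1, c.col) : kNoNeighbor;
    }
}

// Rank that produced a tile travelling in direction Dir (EAST only, for brevity).
template <GridDirection Dir>
constexpr int32_t NeighborRankForPop(GridShape s, GridCoord c)
{
    static_assert(Dir == GridDirection::EAST, "sketch covers EAST only");
    return NeighborRankForPush<GridDirection::WEST>(s, c);
}
```

Instantiating `NeighborRankForPush<GridDirection::SOURCE>` fails to compile, which is the same style of early rejection the overloads above are described as providing.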

Demo layer (single mixed-arch kernel)

distributed_ffn_grid_compute_kernel.cpp (dav-c220) — a single kernel with Cube and Vec branches gated by __DAV_CUBE__ / __DAV_VEC__. The branches exchange gate / up / hidden / down intermediates through A2/A3 TPipe ready/free handshakes inside one launch, and run the row-local EAST fp32 reduce via TPOP<EAST> + TADD + TPUSH<EAST>, with the final-column block doing the tail TSTORE. This replaces the earlier dual .so (cube dav-c220-cube + vec dav-c220-vec) and the host-side phase 0 / phase 1 doorbell sequencing.
main.cpp / kernel_launch.hpp / ffn_config.hpp / gridpipe_payload_inl.hpp — ACL bootstrap, fake HcclDeviceContext with per-cell GridPipe windows on a single device, single-launch mixed kernel, fp32 golden compare on the last-column block of each row.
scripts/gen_data.py — NumPy fp32 FFN reference with alpha=0.1, per-cell fp16 X / W_gate / W_up / W_down shards, fp32 [T_total, H] golden tensor.
CMakeLists.txt / run.sh — single mixed device .so plus host binary; runs gen_data + cmake; launches on a single device.
README.md / README_zh.md — design overview, head/tail TLOAD/TSTORE contract, neighbor counter intrinsic adaptation path, mock SPR/WFE expansion, ready/free flag layout, build/run flow.
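The row-local EAST reduce in the compute kernel can be pictured with a host-side simulation. This is a sketch under assumptions, not the kernel code: `Tile` and `ReduceRowEast` are hypothetical names, all tiles are assumed to have the same length, and the first column is assumed to push its own partial without popping anything.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tile = std::vector<float>; // stand-in for one fp32 tile

// Simulate the EAST reduce along one grid row. partials[col] is the partial
// result owned by the cell in column col. Returns the tile the final-column
// cell would write out at the tail.
inline Tile ReduceRowEast(const std::vector<Tile> &partials)
{
    Tile carry; // tile in flight between neighboring cells
    for (std::size_t col = 0; col < partials.size(); ++col) {
        Tile acc = partials[col];
        if (col > 0) {
            // Models TPOP<EAST> (receive from the west neighbor) followed by TADD.
            for (std::size_t i = 0; i < acc.size(); ++i) {
                acc[i] += carry[i];
            }
        }
        // Models TPUSH<EAST>; for the last column this is the value stored.
        carry = acc;
    }
    return carry;
}
```

With three columns holding `{1, 2}`, `{3, 4}`, and `{5, 6}`, the final-column tile is `{9, 12}`: each cell adds its own partial to the running sum before passing it on.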

Validation

Build: bash kernels/manual/a2a3/distributed_ffn_grid/run.sh --build-only -v Ascend910B1 produces the mixed .so and the host binary.
NPU 2x2: task-submit --device auto --run "bash kernels/manual/a2a3/distributed_ffn_grid/run.sh -r npu -v Ascend910B1 --grid-rows 2 --grid-cols 2" → 3 / 3 deterministic runs, ResultCmp=1, max_diff=0.
Regression: cmake --build kernels/manual/a2a3/allgather_gemm/build -j8 RC=0.
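For readers unfamiliar with the `ResultCmp` / `max_diff` figures, a golden comparison of this kind can be sketched as follows. The function and struct names are hypothetical, and the tolerance parameter is an assumption; the PR does not state which threshold, if any, the host driver applies.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct CompareResult {
    bool pass;     // plays the role of ResultCmp=1 in the log
    float maxDiff; // largest absolute element-wise difference
};

// Compare a device output buffer against the golden tensor element by element.
inline CompareResult CompareWithGolden(const std::vector<float> &out,
                                       const std::vector<float> &golden,
                                       float tolerance)
{
    const float inf = std::numeric_limits<float>::infinity();
    if (out.size() != golden.size()) {
        return {false, inf};
    }
    float maxDiff = 0.0f;
    for (std::size_t i = 0; i < out.size(); ++i) {
        const float diff = std::fabs(out[i] - golden[i]);
        if (std::isnan(diff)) {
            return {false, inf}; // a NaN anywhere is treated as a failure
        }
        maxDiff = std::max(maxDiff, diff);
    }
    return {maxDiff <= tolerance, maxDiff};
}
```

A reported `max_diff=0` corresponds to `maxDiff == 0.0f` here, meaning the output matched the golden tensor exactly.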

Limitations

A2/A3 mock backend is not silicon validation evidence for LPU WSE. The design-doc section 5.5 RFCs (cross-core SPR visibility, TMOV slot address space) remain open and gate the LPU WSE silicon lowering.
Negative / fault-injection kernels are planned as a follow-up PR.

Test plan

bash kernels/manual/a2a3/distributed_ffn_grid/run.sh --build-only -v Ascend910B1 produces the mixed kernel .so and the host binary.
task-submit --device auto --run "bash kernels/manual/a2a3/distributed_ffn_grid/run.sh -r npu -v Ascend910B1 --grid-rows 2 --grid-cols 2" prints [SUCCESS] and ResultCmp=1 for all final-column blocks.
cmake --build kernels/manual/a2a3/allgather_gemm/build -j8 remains green.

gemini-code-assist

Code Review

This pull request introduces GridPipe, a 2D mesh communication abstraction with TPUSH and TPOP primitives, along with a mock backend for A2/A3 silicon that emulates hardware synchronization via global memory atomic flags. It includes runtime initialization helpers and a distributed FFN demo for end-to-end validation. Review feedback identifies opportunities to refactor duplicated logic in the mock SPR layer, improve the robustness of fault code generation by avoiding fragile enum arithmetic, and ensure proper resource cleanup in the host driver by using standard C++ termination functions.


gemini-code-assist Bot reviewed May 12, 2026

chenshengxin2026 force-pushed the feat/distributed-ffn-grid-tpush-tpop branch 2 times, most recently from ff2116d to 3039ba8 (May 14, 2026 11:19)

chenshengxin2026 force-pushed the feat/distributed-ffn-grid-tpush-tpop branch from 3039ba8 to b4e83e8 (May 14, 2026 11:22)

chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from chenshengxin2026:feat/distributed-ffn-grid-tpush-tpop
