Add distributed FFN GridPipe TPUSH/TPOP example on A2/A3 #123

Open
chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from chenshengxin2026:feat/distributed-ffn-grid-tpush-tpop
chenshengxin2026 commented on May 12, 2026

Summary

Adds kernels/manual/a2a3/distributed_ffn_grid/, an A2/A3 single-device multi-block FFN demo that drives a real GridPipe TPUSH<Direction> / TPOP<Direction> pipeline through the existing PTO ISA header stack.

The demo intentionally restricts TLOAD/TSTORE to the head and tail of the FFN. Every intermediate tile flows through the TPipe / GridPipe pop/push surface, so the same kernel source can lower to silicon GridPipe without revisiting the dataflow.

TLOAD/TSTORE only at head and tail

| Stage | Movement | Instruction |
| --- | --- | --- |
| Head | GM X / W_gate / W_up / W_down → L1 | TLOAD (Cube, once per cell) |
| Cube → Vec (gate / up / down) | L1 / Acc → Vec UB | TPUSH<C2V> / TPOP<C2V> (TPipe) |
| Vec → Cube (hidden) | Vec UB → L1 | TPUSH<V2C> / TPOP<V2C> (TPipe) |
| Inter-cell EAST reduce | Vec UB ↔ peer GridPipe slot | TPUSH<EAST> / TPOP<EAST> (GridPipe) |
| Tail | Vec UB → GM yOutput[row] | TSTORE (final-column block, fp32 [T, H]) |

gate_partial, up_partial, hidden, down_partial, and the EAST reduce payload never touch GM through TLOAD/TSTORE — they ride the TPipe / GridPipe FIFO surface. This keeps the demo aligned with the silicon GridPipe programming model where intermediate-tile spill / fill is a back-end choice, not a kernel-source decision.
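
The row-local EAST reduce described above can be modelled on the host as a west-to-east chain. The following is a minimal sketch, not the kernel code: `Tile`, `AddTiles`, and `ReduceRowEast` are hypothetical names, a `std::deque` stands in for the GridPipe slot between adjacent cells, and it assumes column 0 has nothing to pop and only the final column writes the result.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical host-side model; Tile stands in for an fp32 [T, H] tile.
using Tile = std::vector<float>;

// Element-wise add, standing in for TADD on two tiles of equal size.
inline Tile AddTiles(const Tile &a, const Tile &b)
{
    assert(a.size() == b.size());
    Tile out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Reduce one grid row from west to east. partials[c] models the down_partial
// of column c. Returns the tile the final-column block would TSTORE.
inline Tile ReduceRowEast(const std::vector<Tile> &partials)
{
    assert(!partials.empty());
    std::deque<Tile> eastLink; // models the slot between adjacent cells
    Tile result;
    for (std::size_t col = 0; col < partials.size(); ++col) {
        Tile acc = partials[col];
        if (col > 0) {
            // Models TPOP<EAST>: take the running sum sent by the west neighbor.
            acc = AddTiles(eastLink.front(), acc);
            eastLink.pop_front();
        }
        if (col + 1 < partials.size()) {
            // Models TPUSH<EAST>: forward the running sum to the east neighbor.
            eastLink.push_back(acc);
        } else {
            // Tail: only the final column produces the row output.
            result = acc;
        }
    }
    return result;
}
```

In this model no intermediate sum is written anywhere except the link between neighbors, which mirrors the head/tail-only TLOAD/TSTORE contract.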

Mock CCE-intrinsic API for neighbor counters

New header include/pto/common/grid_counter_intrinsic.hpp introduces two CCE-intrinsic-style functions that sit at the same layer as dcci:

  • mtspr_neighbor_counter(operand, kind, dir, value) — publish a monotonically-increasing Ready or Free counter to the neighbor in direction dir. Hardware contract maps to mtspr SPR_RDY_<DIR>, value / mtspr SPR_FREE_<DIR>, value. Release ordering: prior payload writes must be globally visible before the peer observes the counter.
  • wfe_neighbor_counter(operand, kind, dir, threshold, maxSpins) — block until the local mirror of the neighbor-produced counter reaches threshold. Hardware contract maps to wfe SPR_RDY_<DIR>, threshold / wfe SPR_FREE_<DIR>, threshold. Acquire ordering: subsequent loads must not be reordered above the wait.

NeighborCounterOperand carries the backend operand (today: a GM counter pointer). On A2/A3 the function bodies dispatch to grid_pipe_mock_spr.hpp (MockMtsprCounter + MockTryWfeCounter), which already implement the canonical pre-dcci → store → post-dcci → dsb(DSB_DDR) publish and the dcci + pipe_barrier(PIPE_ALL) spin.
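
The release/acquire contract of the two intrinsics can be illustrated with a host-side analogue built on `std::atomic`. This is only a sketch under stated assumptions: `HostCounter`, `PublishCounter`, and `WaitCounter` are hypothetical names, the real functions take `(operand, kind, dir, ...)`, and returning `false` when `maxSpins` is exhausted is an assumption about the timeout behaviour. Counter wrap-around is not modelled.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a memory-backed neighbor counter.
struct HostCounter {
    std::atomic<std::uint32_t> value{0};
};

// Publish a monotonically increasing value with release ordering: writes made
// before this call are visible to a waiter that observes the new value.
inline void PublishCounter(HostCounter &c, std::uint32_t value)
{
    assert(value >= c.value.load(std::memory_order_relaxed));
    c.value.store(value, std::memory_order_release);
}

// Poll until the counter reaches `threshold` or `maxSpins` polls have been made.
// The acquire load prevents later reads from moving above the wait.
// Returns true if the threshold was reached (assumed timeout behaviour).
inline bool WaitCounter(const HostCounter &c, std::uint32_t threshold, std::uint32_t maxSpins)
{
    for (std::uint32_t spin = 0; spin < maxSpins; ++spin) {
        if (c.value.load(std::memory_order_acquire) >= threshold) {
            return true;
        }
    }
    return false;
}
```

Because the wait compares against a threshold rather than an exact value, a waiter that arrives late still succeeds after the counter has advanced further.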

GridTPush.hpp and GridTPop.hpp now go through these intrinsics for all ready/free synchronization — call sites no longer reference grid_mock::* for counters. Slot transfer remains on HcclRemotePtr plus the existing window layout.
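
As an illustration of how a Ready/Free counter pair can gate a payload slot, here is a single-threaded sketch of a one-slot channel. It is not taken from `GridTPush.hpp` / `GridTPop.hpp`: `SlotChannel`, `TryPush`, and `TryPop` are hypothetical, the slot depth of one is an assumption, and the threshold arithmetic shown is one plausible scheme rather than the one the PR necessarily uses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical one-slot channel guarded by a Ready/Free counter pair.
struct SlotChannel {
    std::uint32_t readyCount = 0; // published by the producer after writing the slot
    std::uint32_t freeCount = 0;  // published by the consumer after draining the slot
    float slot = 0.0f;            // stands in for the payload slot
    std::uint32_t pushed = 0;     // producer-local number of completed pushes
    std::uint32_t popped = 0;     // consumer-local number of completed pops
};

// Producer: the slot may be overwritten only after every earlier push has been
// drained, i.e. once freeCount has caught up with the number of pushes.
inline bool TryPush(SlotChannel &ch, float payload)
{
    if (ch.freeCount < ch.pushed) {
        return false; // previous payload not consumed yet
    }
    ch.slot = payload;         // write the payload first
    ch.pushed += 1;
    ch.readyCount = ch.pushed; // then publish Ready
    return true;
}

// Consumer: the slot may be read only after the matching Ready has arrived.
inline bool TryPop(SlotChannel &ch, float &out)
{
    if (ch.readyCount < ch.popped + 1) {
        return false; // nothing new in the slot
    }
    out = ch.slot;            // read the payload first
    ch.popped += 1;
    ch.freeCount = ch.popped; // then publish Free
    return true;
}
```

Both counters only ever increase, so each side can compare the peer counter against its own local progress without resetting any flag.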

Adapting when real hardware supports neighbor SPR / WFE counters

When silicon exposes native neighbor counters, no demo kernel source changes:

  1. Define PTO_GRID_COUNTER_NATIVE_INTRINSIC for the device target.
  2. Provide compiler builtins with the same contract:
    __builtin_pto_mtspr_neighbor_counter(uint32_t kind, uint32_t dir, uint32_t value);   // release
    __builtin_pto_wfe_neighbor_counter (uint32_t kind, uint32_t dir, uint32_t threshold); // acquire
    
  3. The header automatically swaps the mock body for the builtin path. NeighborCounterOperand::addr is ignored on the native path, so the GM ready/free counter region inside the GridPipe window becomes unused (slot region and HcclRemotePtr payload addressing remain).
  4. Payload slot delivery (a2a3_grid_payload::CopyGmSlotToTile / WriteTileToPeerGmSlot) can be retargeted to the silicon's grid slot mechanism in a follow-up by replacing only the payload adaptor — no changes to GridTPush.hpp / GridTPop.hpp or to the demo kernel.

This separation is the reason the new intrinsic API was introduced rather than calling the mock helpers directly: only one file in the build (grid_counter_intrinsic.hpp) chooses between native lowering and the GM-counter mock.
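
The single-file switch between native lowering and the mock could look roughly like the sketch below. The macro and builtin names come from the steps above; everything else (`NeighborCounterOperandSketch`, `MtsprNeighborCounterSketch`, the use of `std::atomic` as the mock counter) is hypothetical and only approximates the real header.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical operand; per the description it carries a counter address today.
struct NeighborCounterOperandSketch {
    std::atomic<std::uint32_t> *addr;
};

inline void MtsprNeighborCounterSketch(NeighborCounterOperandSketch operand, std::uint32_t kind,
                                       std::uint32_t dir, std::uint32_t value)
{
#if defined(PTO_GRID_COUNTER_NATIVE_INTRINSIC)
    // Native path: the address is ignored and the builtin targets the neighbor SPR.
    (void)operand;
    __builtin_pto_mtspr_neighbor_counter(kind, dir, value);
#else
    // Mock path: publish through the memory-backed counter with release ordering.
    (void)kind;
    (void)dir;
    assert(operand.addr != nullptr);
    operand.addr->store(value, std::memory_order_release);
#endif
}
```

Call sites pass the same arguments on both paths, which is why defining the macro does not require touching the push/pop headers or the kernel.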

Header layer

  • include/pto/common/grid_pipe.hpp: GridDirection, GridShape, GridCoord, NeighborRankForPush / NeighborRankForPop, is_grid_pipe_v.
  • include/pto/common/grid_pipe_mock_spr.hpp: unified MockMtsprCounter / MockTryWfeCounter with the canonical publish/spin sequence; Ready and Free helpers are thin wrappers.
  • include/pto/common/grid_counter_intrinsic.hpp: new CCE-intrinsic-style API (see above).
  • include/pto/npu/a2a3/grid_pipe_runtime.hpp: window layout (kWindowBytes, kReadyFlagOffset, kFreeFlagOffset, kSlotOffset) and InitGridPipeFromWindow.
  • include/pto/npu/a2a3/GridTPush.hpp / GridTPop.hpp: A2/A3 backend; counters go through the intrinsic API, payload still uses HcclRemotePtr.
  • include/pto/common/pto_instr.hpp / pto_instr_impl.hpp: GridDirection-templated TPUSH / TPOP overloads. TPUSH<SOURCE> is rejected at compile time; unsupported targets emit a static_assert.
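
To make the neighbor-rank helpers concrete, the sketch below computes the rank of the adjacent cell in a 2D grid. It is an assumption-laden illustration, not the contents of `grid_pipe.hpp`: the names `Dir`, `Shape`, `Coord`, `NeighborRank`, and `kNoNeighbor` are hypothetical, ranks are assumed to be row-major, South is assumed to mean increasing row index, and edges are assumed not to wrap around.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types; the real ones are GridDirection, GridShape, GridCoord.
enum class Dir : std::uint8_t { North, South, West, East };

struct Shape {
    std::uint32_t rows;
    std::uint32_t cols;
};

struct Coord {
    std::uint32_t row;
    std::uint32_t col;
};

constexpr std::int64_t kNoNeighbor = -1;

// Row-major rank of the cell at (row, col).
constexpr std::int64_t Rank(Shape s, std::uint32_t row, std::uint32_t col)
{
    return static_cast<std::int64_t>(row) * s.cols + col;
}

// Rank of the neighbor in direction d, or kNoNeighbor at the grid edge.
constexpr std::int64_t NeighborRank(Shape s, Coord c, Dir d)
{
    switch (d) {
        case Dir::East:
            return (c.col + 1 < s.cols) ? Rank(s, c.row, c.col + 1) : kNoNeighbor;
        case Dir::West:
            return (c.col > 0) ? Rank(s, c.row, c.col - 1) : kNoNeighbor;
        case Dir::South:
            return (c.row + 1 < s.rows) ? Rank(s, c.row + 1, c.col) : kNoNeighbor;
        case Dir::North:
            return (c.row > 0) ? Rank(s, c.row - 1, c.col) : kNoNeighbor;
    }
    return kNoNeighbor;
}
```

Under these assumptions, in a 2x2 grid the cell at (0, 0) pushes EAST to rank 1, while the cell at (0, 1) has no east neighbor, which matches the final-column block being the one that stores the row result.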

Demo layer (single mixed-arch kernel)

  • distributed_ffn_grid_compute_kernel.cpp (dav-c220) — single kernel with Cube and Vec branches gated by __DAV_CUBE__ / __DAV_VEC__. They exchange gate / up / hidden / down intermediates through A2/A3 TPipe ready/free handshakes inside one launch, and run the row-local EAST fp32 reduce via TPOP<EAST> + TADD + TPUSH<EAST> with the final-column block doing the tail TSTORE. This replaces the earlier dual .so (cube + vec) and host-side phase 0 / phase 1 doorbell sequencing.
  • main.cpp / kernel_launch.hpp / ffn_config.hpp / gridpipe_payload_inl.hpp — ACL bootstrap, fake HcclDeviceContext with per-cell GridPipe windows on a single device, single-launch mixed kernel, fp32 golden compare on the last-column block of each row.
  • scripts/gen_data.py — NumPy fp32 FFN reference with alpha=0.1, per-cell fp16 X / W_gate / W_up / W_down shards, fp32 [T_total, H] golden tensor.
  • CMakeLists.txt / run.sh — single mixed device .so plus host binary; runs gen_data + cmake; launches on a single device.
  • README.md / README_zh.md — design overview, head/tail TLOAD/TSTORE contract, neighbor counter intrinsic adaptation path, mock SPR/WFE expansion, ready/free flag layout, build/run flow.

Validation

  • Build: bash kernels/manual/a2a3/distributed_ffn_grid/run.sh --build-only -v Ascend910B1 produces the mixed .so and the host binary.
  • NPU 2x2: task-submit --device auto --run "bash kernels/manual/a2a3/distributed_ffn_grid/run.sh -r npu -v Ascend910B1 --grid-rows 2 --grid-cols 2" → 3 / 3 deterministic runs, ResultCmp=1, max_diff=0.
  • Regression: cmake --build kernels/manual/a2a3/allgather_gemm/build -j8 RC=0.

Limitations

  • A2/A3 mock backend is not silicon validation evidence for LPU WSE. The design-doc section 5.5 RFCs (cross-core SPR visibility, TMOV slot address space) remain open and gate the LPU WSE silicon lowering.
  • Negative / fault-injection kernels are planned as a follow-up PR.

Test plan

  • bash kernels/manual/a2a3/distributed_ffn_grid/run.sh --build-only -v Ascend910B1 produces the mixed kernel .so and the host binary.
  • task-submit --device auto --run "bash kernels/manual/a2a3/distributed_ffn_grid/run.sh -r npu -v Ascend910B1 --grid-rows 2 --grid-cols 2" prints [SUCCESS] and ResultCmp=1 for all final-column blocks.
  • cmake --build kernels/manual/a2a3/allgather_gemm/build -j8 remains green.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces GridPipe, a 2D mesh communication abstraction with TPUSH and TPOP primitives, along with a mock backend for A2/A3 silicon that emulates hardware synchronization via global memory atomic flags. It includes runtime initialization helpers and a distributed FFN demo for end-to-end validation. Review feedback identifies opportunities to refactor duplicated logic in the mock SPR layer, improve the robustness of fault code generation by avoiding fragile enum arithmetic, and ensure proper resource cleanup in the host driver by using standard C++ termination functions.

Comment thread include/pto/common/grid_pipe_mock_spr.hpp
Comment thread include/pto/common/grid_pipe_mock_spr.hpp
Comment thread include/pto/npu/a2a3/GridTPop.hpp Outdated
Comment thread include/pto/npu/a2a3/GridTPush.hpp Outdated
Comment thread kernels/manual/a2a3/distributed_ffn_grid/main.cpp Outdated
chenshengxin2026 force-pushed the feat/distributed-ffn-grid-tpush-tpop branch 2 times, most recently from ff2116d to 3039ba8 on May 14, 2026 11:19
Adds `kernels/manual/a2a3/distributed_ffn_grid/`, a single-device multi-block
FFN demo that drives a real GridPipe `TPUSH<Direction>` / `TPOP<Direction>`
pipeline through the existing PTO ISA header stack on A2/A3.

Header layer
- `include/pto/common/grid_pipe.hpp`: `GridDirection`, `GridShape`,
  `GridCoord`, `NeighborRankForPush / NeighborRankForPop`, `is_grid_pipe_v`.
- `include/pto/common/grid_pipe_mock_spr.hpp`: `MockMtsprCounter` /
  `MockTryWfeCounter` with the canonical pre-dcci -> store -> post-dcci ->
  `dsb(DSB_DDR)` publish and `dcci` + `pipe_barrier(PIPE_ALL)` spin.  Ready
  and Free helpers are thin wrappers over the unified counter primitive.
- `include/pto/common/grid_counter_intrinsic.hpp` (new): CCE-intrinsic-style
  neighbor counter API used by `GridTPush` / `GridTPop`.  Exposes
  `mtspr_neighbor_counter(operand, kind, dir, value)` and
  `wfe_neighbor_counter(operand, kind, dir, threshold, maxSpins)` with a
  hardware contract that maps directly to `mtspr SPR_RDY/FREE_<DIR>` and
  `wfe SPR_RDY/FREE_<DIR>` semantics, plus a `NeighborCounterOperand` carrying
  the GM-backed counter address.
- `include/pto/npu/a2a3/grid_pipe_runtime.hpp`: window layout and
  `InitGridPipeFromWindow`.
- `include/pto/npu/a2a3/GridTPush.hpp` / `GridTPop.hpp`: A2/A3 backend.
  All ready/free synchronization goes through the new intrinsic API; the
  payload slot transfer still uses `HcclRemotePtr` + the mock window.
- `include/pto/common/pto_instr.hpp` / `pto_instr_impl.hpp`:
  `GridDirection`-templated `TPUSH` / `TPOP` overloads.  `TPUSH<SOURCE>` is
  rejected at compile time and unsupported targets emit a `static_assert`
  rather than silently falling back to a GM stub.

Demo layer
- One mixed-arch kernel `distributed_ffn_grid_compute_kernel.cpp`
  (`dav-c220`).  Cube and Vec branches are gated by `__DAV_CUBE__` /
  `__DAV_VEC__` and exchange gate / up / hidden / down intermediates through
  regular A2/A3 `TPipe` ready/free handshakes inside a single launch.  This
  replaces the earlier dual `.so` (cube `dav-c220-cube` + vec
  `dav-c220-vec`) and the host-side phase 0 / phase 1 doorbell sequencing.
- TLOAD/TSTORE are confined to the FFN head and tail only: Cube uses TLOAD
  to bring `X`, `W_gate`, `W_up`, `W_down` into L1 once; the final-column
  Vec block uses TSTORE to write the fp32 `[T, H]` row output.  Every
  intermediate (`gate_partial`, `up_partial`, `hidden`, `down_partial`, the
  EAST reduce payload) flows through `TPUSH`/`TPOP` -- C2V/V2C `TPipe` for
  intra-cell handoff, GridPipe `TPUSH<EAST>` / `TPOP<EAST>` for inter-cell
  reduce.
- `main.cpp` / `kernel_launch.hpp` / `ffn_config.hpp` /
  `gridpipe_payload_inl.hpp`: ACL bootstrap, fake `HcclDeviceContext` with
  per-cell GridPipe windows on a single device, mixed-kernel single launch,
  fp32 golden compare on the last-column cell of each row.
- `scripts/gen_data.py`: NumPy fp32 FFN reference with `alpha=0.1`,
  per-cell fp16 X / W_gate / W_up / W_down shards, fp32 `[T_total, H]`
  golden tensor.
- `CMakeLists.txt` / `run.sh`: build the single mixed device `.so` plus the
  host binary, run `gen_data` + cmake, launch on a single device.
- `README.md` / `README_zh.md`: design overview, head/tail TLOAD/TSTORE
  contract, neighbor counter intrinsic adaptation path, mock SPR/WFE
  expansion, ready/free flag layout, build/run flow.

Adapting when real hardware supports neighbor counters
- Today on A2/A3, `mtspr_neighbor_counter` / `wfe_neighbor_counter` lower
  to the GM-counter mock under `grid_pipe_mock_spr.hpp` and use the
  `NeighborCounterOperand::addr` GM pointer that the GridPipe window
  pre-allocates per direction.
- When silicon adds a real neighbor SPR / WFE counter, define
  `PTO_GRID_COUNTER_NATIVE_INTRINSIC` and provide compiler builtins
  `__builtin_pto_mtspr_neighbor_counter(kind, dir, value)` and
  `__builtin_pto_wfe_neighbor_counter(kind, dir, threshold)` with the same
  release / acquire contract.  Call sites in `GridTPush` / `GridTPop` do
  not change; the GM counter address in `NeighborCounterOperand` is ignored
  on the native path and the GridPipe window's ready/free counter region
  becomes unused.  Payload slot transfer continues through the existing
  `HcclRemotePtr` adaptor and can be retargeted to the silicon's grid slot
  address mechanism without touching the demo kernel source.

Validation
- Build: `bash kernels/manual/a2a3/distributed_ffn_grid/run.sh --build-only
  -v Ascend910B1` produces the mixed kernel `.so` and the host binary.
- NPU 2x2:
  `task-submit --device auto --run "bash kernels/manual/a2a3/distributed_ffn_grid/run.sh -r npu -v Ascend910B1 --grid-rows 2 --grid-cols 2"`
  -> 3 / 3 deterministic runs, `ResultCmp=1`, `max_diff=0`.
- Regression: `cmake --build kernels/manual/a2a3/allgather_gemm/build -j8`
  RC=0.

Limitations
- A2/A3 mock backend is not silicon validation evidence for LPU WSE.  The
  design-doc section 5.5 RFCs (cross-core SPR visibility, `TMOV` slot
  address space) remain open and gate the LPU WSE silicon lowering.
- Negative / fault-injection kernels are planned as a follow-up.
chenshengxin2026 force-pushed the feat/distributed-ffn-grid-tpush-tpop branch from 3039ba8 to b4e83e8 on May 14, 2026 11:22