Merged
Changes from 2 commits
2 changes: 1 addition & 1 deletion docs/index.md
@@ -66,7 +66,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| `models/deepseek-v2-lite/hf-accuracy-gate.md` | DeepSeek-V2-Lite EP2 HF accuracy gate after PR #149/#150: HF incremental greedy, host-staged EP2, and NCCL EP2 are token/text exact for `Hello`, output_len=16. |
| `models/deepseek-v2-lite/decode-attribution-gate.md` | DeepSeek-V2-Lite EP2 decode attribution gate for `Hello`/16-token batch sizes 1/4/8: structured JSON with accuracy hashes, CPU-side timing, selected CUDA event/NVTX attribution, host-staged/NCCL EP counts, and explicit no-throughput claim boundary. |
| `models/deepseek-v2-lite/source-layout.md` | DeepSeek-V2-Lite runtime layout refactor: `runtime.rs` split by responsibility, HF/host-staged/NCCL EP2 E2E exact on 2x RTX 5090; NCCL CUDA Graph smoke remains a diagnostic blocker on that host, independent of the passed correctness gate. |
| `models/deepseek-v2-lite/device-resident-nccl-combine.md` | Issue #275 record: NCCL decode combine now uses reusable device-resident f32 scratch, HF/host-staged/NCCL exact on 2x RTX 5090, and remaining graph blockers are dense exchange plus host-directed routing. |
| `models/deepseek-v2-lite/device-resident-nccl-combine.md` | Issue #275 record: NCCL decode combine uses reusable device-resident f32 scratch; current NCCL graph-readiness blockers live in `status.md`. |

## models / kimi-k2

18 changes: 13 additions & 5 deletions docs/models/deepseek-v2-lite/decode-attribution-gate.md
@@ -71,7 +71,7 @@ python tools/accuracy/compare_dsv2_lite_ep2_outputs.py \
--require-all-exact
```

Then collect attribution for the same two openinfer backends. Use `--batch-size 1` for the original single-row gate, and `--batch-size 4` / `--batch-size 8` for the true-batch benchmark attribution shape:
Then collect attribution for the same two openinfer backends. Use `--batch-size 1` for the original single-row gate, and `--batch-size 4` / `--batch-size 8` for the true-batch benchmark attribution shape. If the NCCL runtime should come from a Python CUDA wheel rather than the system install, set `OPENINFER_NCCL_PYTHON` or `OPENINFER_TRITON_PYTHON` to that environment's Python for the attribution command too.

```bash
cargo run --release -p openinfer-deepseek-v2-lite \
@@ -133,21 +133,29 @@ ncclMaxSharedMem 82240 exceeds device/fn maxSharedMem 79856
NCCL WARN Cuda failure 1 'invalid argument'
```

Use a newer NCCL runtime through the normal library path if the system runtime fails this way. The project code path should not change just to work around local NCCL installation age.
The NCCL loader now tries explicit overrides first (`OPENINFER_NCCL_LIB`, then `OPENINFER_NCCL_LIB_DIR` / `OPENINFER_NCCL_LIBRARY_PATH`), then Python wheel NCCL directories discoverable from `OPENINFER_NCCL_PYTHON`, `OPENINFER_TRITON_PYTHON`, `VIRTUAL_ENV`, or `CONDA_PREFIX`, and finally the system `libnccl.so.2` / `libnccl.so`. This keeps the code path unchanged while avoiding a stale system NCCL when the validation environment already has a newer CUDA wheel runtime.
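
A minimal sketch of that search order, written in Python for readability (the actual loader lives in the Rust runtime and is not shown in this change). The function name `nccl_candidates`, the `wheel_dirs_for` callback, and the treatment of each override as a single path are assumptions for illustration; only the environment variable names and the priority order come from the paragraph above.

```python
from pathlib import PurePosixPath

# System fallbacks, tried last, in the order named in the text.
SYSTEM_SONAMES = ["libnccl.so.2", "libnccl.so"]

# Environment variables that may point at a Python environment with a CUDA wheel.
PYTHON_ENV_KEYS = (
    "OPENINFER_NCCL_PYTHON",
    "OPENINFER_TRITON_PYTHON",
    "VIRTUAL_ENV",
    "CONDA_PREFIX",
)


def nccl_candidates(env, wheel_dirs_for):
    """Return NCCL library candidates in the documented priority order.

    env: mapping of environment variable names to values.
    wheel_dirs_for: hypothetical callback that maps a Python executable or
        environment prefix to the NCCL wheel library directories under it.
    """
    candidates = []

    # 1. Explicit library file override.
    if env.get("OPENINFER_NCCL_LIB"):
        candidates.append(env["OPENINFER_NCCL_LIB"])

    # 2. Explicit directory overrides (assumed here to hold one directory each).
    for key in ("OPENINFER_NCCL_LIB_DIR", "OPENINFER_NCCL_LIBRARY_PATH"):
        if env.get(key):
            for soname in SYSTEM_SONAMES:
                candidates.append(str(PurePosixPath(env[key]) / soname))

    # 3. Python wheel NCCL directories discoverable from the environment.
    for key in PYTHON_ENV_KEYS:
        if env.get(key):
            for directory in wheel_dirs_for(env[key]):
                for soname in SYSTEM_SONAMES:
                    candidates.append(str(PurePosixPath(directory) / soname))

    # 4. System runtime, resolved by the dynamic linker.
    candidates.extend(SYSTEM_SONAMES)
    return candidates
```

The first candidate that loads would win, so a stale system `libnccl.so.2` is only reached when no override or wheel directory is available.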

The HF oracle needs a Python environment that can load DeepSeek-V2-Lite with `trust_remote_code=True`. The helper script tolerates the model file's optional `flash_attn` import check when FlashAttention is not installed, but the HF environment remains separate from the Rust runtime claim: it is only the truth-source generator for the comparison JSON.

## Latest Validation

The issue #275 refresh was rerun on 2026-06-08 with DeepSeek-V2-Lite snapshot `604d5664dddd88a0433dbae533b7fe9472482de0`, `prompt="Hello"`, `output_len=16`, and 2x RTX 5090. HF, host-staged, and NCCL were dumped from the same model directory and compared with `--require-all-exact`.
The issue #276 refresh was rerun on 2026-06-10 with DeepSeek-V2-Lite snapshot `604d5664dddd88a0433dbae533b7fe9472482de0`, `prompt="Hello"`, `output_len=16`, and 2x RTX 5090. HF, host-staged, and NCCL were dumped from the same model directory and compared with `--require-all-exact`. The Rust path loaded NCCL `2.30.7+cuda12.9` from the Python CUDA wheel path because the system NCCL `2.25.1+cuda12.8` failed the init smoke on this Blackwell host before model-level validation.

- HF / host-staged / NCCL comparison: `all_token_text_exact`.
- Token SHA256: `4fb4c8825fe4d2c4a1d966da25c259abdf675f4de4548daa5d41aea7dfe30225`.
- Text SHA256: `0eedf11429e9ac13bb799c31665c6e9f70a1ac4493a08a3f3da9ecf39c1ec347`.
- Generated text: `, I am a 19 year old girl from the UK. I am`.
- Candidate NCCL attribution: `gpu_timing.sample_count=8384`, `failure_count=0`.
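
The exactness criterion in this list can be sketched as a hash comparison across the three backends. This is illustrative only: `compare_backends`, the `mismatch` status, and the JSON encoding used for token hashing are assumptions, not the behavior of `compare_dsv2_lite_ep2_outputs.py`, so the digests it produces are not expected to match the ones listed above.

```python
import hashlib
import json


def text_sha256(text):
    """SHA256 of the UTF-8 encoded generated text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def token_sha256(token_ids):
    """SHA256 of the token ids.

    Assumption: ids are serialized as a compact JSON list before hashing.
    The real tool's serialization is not shown in this document.
    """
    payload = json.dumps(list(token_ids), separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compare_backends(outputs):
    """Compare token/text hashes across backends.

    outputs: backend name -> {"token_ids": [...], "text": "..."}.
    Returns a dict with a status string and the per-backend hash pairs.
    """
    hashes = {
        name: (token_sha256(out["token_ids"]), text_sha256(out["text"]))
        for name, out in outputs.items()
    }
    # Exact means every backend produced the same (token hash, text hash) pair.
    exact = len(set(hashes.values())) == 1
    status = "all_token_text_exact" if exact else "mismatch"
    return {"status": status, "hashes": hashes}
```

Hashing both tokens and text catches the case where two different token sequences happen to decode to the same string.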

The candidate readiness report still has `full_decode_capture_ready=false`. Compared with the same-host baseline attribution, it removed `nccl_contribution_accumulation_on_host`, `nccl_combine_h2d_contribution_copy`, `nccl_combine_allocates_per_call`, `nccl_combine_syncs_rank_streams`, and `nccl_combine_d2h_result_copy`. The remaining blockers are `nccl_dense_exchange_allocates_per_call`, `nccl_dense_exchange_syncs_rank_streams`, `nccl_route_iteration_on_host`, and `nccl_expert_accumulation_host_directed`.
The candidate readiness report still has `full_decode_capture_ready=false`. Compared with the issue #275 candidate, it removes the dense-exchange allocation/sync blockers. The remaining blockers are `nccl_route_iteration_on_host` and `nccl_expert_accumulation_host_directed`.

Current NCCL attribution for the issue #276 gate:

| Batch | GPU event samples | GPU failures | NCCL exchange/combine calls | Route counters | Readiness blockers |
| ---: | ---: | ---: | --- | --- | --- |
| 1 | 8384 | 0 | `416 / 416` | `local=1284`, `remote=1212`, `combine=2496` | `nccl_route_iteration_on_host`, `nccl_expert_accumulation_host_directed` |
| 4 | 23996 | 0 | `494 / 494` | `local=5136`, `remote=4848`, `combine=9984` | `nccl_route_iteration_on_host`, `nccl_expert_accumulation_host_directed` |
| 8 | 44812 | 0 | `598 / 598` | `local=10272`, `remote=9696`, `combine=19968` | `nccl_route_iteration_on_host`, `nccl_expert_accumulation_host_directed` |
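
The route counters in this table are internally consistent in two ways that are easy to check: `combine` equals `local + remote` in every row, and each counter at batch 4 and batch 8 is exactly 4x and 8x its batch-1 value. The sketch below encodes those two observations. They are read off this table rather than taken from a documented invariant of the attribution binary, and the names `ROWS` and `route_counters_consistent` are hypothetical.

```python
# Route counters copied from the issue #276 attribution table, keyed by batch size.
ROWS = {
    1: {"local": 1284, "remote": 1212, "combine": 2496},
    4: {"local": 5136, "remote": 4848, "combine": 9984},
    8: {"local": 10272, "remote": 9696, "combine": 19968},
}


def route_counters_consistent(rows):
    """Check the two relationships observed in the table.

    1. combine == local + remote within each row.
    2. Every counter scales linearly with batch size relative to batch 1.
    """
    base = rows[1]
    for batch, row in rows.items():
        if row["local"] + row["remote"] != row["combine"]:
            return False
        for key in ("local", "remote", "combine"):
            if row[key] != batch * base[key]:
                return False
    return True


if __name__ == "__main__":
    print(route_counters_consistent(ROWS))
```

The NCCL exchange/combine call counts (`416`, `494`, `598`) are left out on purpose: in this table they grow with batch size but not proportionally, so the linear check does not apply to them.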

The previous A800 strict same-host accuracy gate was rerun on 2026-06-04 with DeepSeek-V2-Lite snapshot `604d5664dddd88a0433dbae533b7fe9472482de0`, `prompt="Hello"`, `output_len=16`, and 2x A800-SXM4-80GB. The token/text oracle is confirmed by a real HF `AutoModelForCausalLM.generate(..., do_sample=False, use_cache=True)` run on the same model directory as the Rust E2E gate.

@@ -158,7 +166,7 @@ The Rust E2E accepts the known HF-confirmed RTX 5090 and A800 hash pairs for thi
- Text SHA256: `4aaafbe4b3a46bc5b9ab5ea8d09d5fad71225006c2e234e87a928e3265b387c6`.
- Generated text: `, I am a 20 year old female and I have been having a`.

The graph-readiness diagnostic was rerun on 2026-06-04 on the same model snapshot and 2x A800-SXM4-80GB:
The historical graph-readiness diagnostic before the #275/#276 device-scratch work was rerun on 2026-06-04 on the same model snapshot and 2x A800-SXM4-80GB:

- `full_decode_capture_ready=false`;
- `status=blocked_full_decode_path`;
10 changes: 5 additions & 5 deletions docs/models/deepseek-v2-lite/device-resident-nccl-combine.md
@@ -1,6 +1,6 @@
# DeepSeek-V2-Lite Device-Resident NCCL Combine

> **TL;DR:** Issue #275 moves the NCCL decode combine path to reusable device-resident f32 scratch buffers. The retained `Hello` / 16-token gate stays HF / host-staged / NCCL exact, and the readiness report no longer lists the old combine H2D/D2H/allocation/sync blockers. Full decode capture is still blocked by dense exchange allocation/sync and host-directed routing.
> **TL;DR:** Issue #275 moves the NCCL decode combine path to reusable device-resident f32 scratch buffers. The retained `Hello` / 16-token gate stays HF / host-staged / NCCL exact, and the readiness report no longer lists the old combine H2D/D2H/allocation/sync blockers. Current NCCL graph-readiness blockers live in `status.md`.

Last touched: 2026-06

@@ -23,11 +23,11 @@ Last touched: 2026-06
1. Add a shared CUDA helper that accumulates a bf16 single-token expert output into a f32 device contribution buffer at a selected token row.
2. Re-export that helper through `openinfer-core::ops`.
3. Add reusable NCCL combine scratch buffers inside `NaiveNcclEp2Backend`, clear the f32 send scratch per MoE call, accumulate local/remote expert outputs on the owning device, all-reduce device buffers, and cast rank0 f32 result to bf16 on device.
4. Update graph-readiness blockers and attribution wording so removed combine H2D/D2H/allocation/sync blockers are no longer claimed, while remaining host routing and dense-exchange blockers stay explicit.
4. Update graph-readiness blockers and attribution wording so removed combine H2D/D2H/allocation/sync blockers are no longer claimed, while the remaining host routing and dense-exchange blockers stay explicit.
5. Run formatting and local compile gates, then use the provided remote GPU host for the DeepSeek-V2-Lite EP2 exactness and attribution gates.
- **Risks / open questions**:
- Device f32 accumulation must preserve the existing expert-id accumulation order before the final bf16 cast.
- Dense exchange and host route selection still block full decode CUDA Graph capture; `full_decode_capture_ready` should remain false unless validation proves otherwise.
- Dense exchange and host route selection still blocked full decode CUDA Graph capture at the time of issue #275; keep the current blocker list in `status.md`.
- The provided SSH credential should stay local to the validation session and must not be echoed into docs or final output.

## Execution Log
@@ -97,7 +97,7 @@ cargo clippy --release -p openinfer-deepseek-v2-lite \

Both commands passed on the same remote source copy after syncing the follow-up `host_ops.rs` and `logging.rs` fixes. The clippy command ran without `clippy::manual_midpoint`, `clippy::needless_range_loop`, or `clippy::option_option` allows.

Before/after readiness comparison for the same model snapshot and diagnostic shape:
Before/after readiness comparison for the same model snapshot and diagnostic shape at issue #275:

| Report | Readiness blockers |
| --- | --- |
@@ -118,4 +118,4 @@ The candidate report replaces the old `nccl_contribution_accumulate` and `nccl_c

The implementation keeps host-staged unchanged as the correctness oracle. The NCCL backend now owns reusable rank0/rank1 f32 send/recv scratch buffers behind `DeviceCombineScratch`; each MoE call clears the f32 send scratch on device, accumulates one-token expert outputs into the owning rank's send scratch with a CUDA helper, runs the f32 NCCL all-reduce, and casts rank0's f32 result back to bf16 on device.
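
The per-call sequence described above (clear, accumulate on the owning rank, all-reduce in f32, cast to bf16) can be modeled on the host with plain Python lists. This is a numerical sketch under stated assumptions, not the CUDA or NCCL code: `DeviceCombineScratchSketch`, `combine`, and the `scale` argument are hypothetical names, the all-reduce is modeled as an elementwise sum of the two send buffers, and bf16 rounding is emulated with round-to-nearest-even on the f32 bit pattern.

```python
import struct


def to_f32(x):
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x):
    """Emulate an f32 -> bf16 cast with round-to-nearest-even.

    Intended for finite values only; NaN handling is out of scope here.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


class DeviceCombineScratchSketch:
    """Host-side stand-in for reusable rank0/rank1 f32 send scratch."""

    def __init__(self, hidden):
        # Allocated once and reused across MoE calls.
        self.send = [[0.0] * hidden, [0.0] * hidden]

    def clear(self):
        # Zero the send scratch in place at the start of every MoE call.
        for buf in self.send:
            for i in range(len(buf)):
                buf[i] = 0.0

    def accumulate(self, rank, expert_out_bf16, scale):
        # Add one token's scaled expert output into the owning rank's f32 buffer.
        buf = self.send[rank]
        for i, value in enumerate(expert_out_bf16):
            buf[i] = to_f32(buf[i] + to_f32(scale * value))


def combine(scratch, contributions):
    """contributions: list of (expert_id, owning_rank, scale, bf16_output)."""
    scratch.clear()
    # Accumulate in expert-id order so the f32 sum order stays fixed.
    for _expert_id, rank, scale, out in sorted(contributions, key=lambda c: c[0]):
        scratch.accumulate(rank, out, scale)
    # Model the f32 all-reduce as an elementwise sum over both ranks.
    reduced = [to_f32(a + b) for a, b in zip(scratch.send[0], scratch.send[1])]
    # Cast the reduced f32 result back to bf16 only once, at the end.
    return [round_to_bf16(v) for v in reduced]
```

Keeping the accumulation in f32 and rounding to bf16 a single time is what lets the order-sensitive part of the sum stay comparable to the host-staged oracle; rounding after every contribution would change the result.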

The final bf16 `HiddenStates` returned to the model is still allocated per combine call. That allocation is outside the removed NCCL contribution/result round trip, so it is not claimed as full CUDA Graph readiness. The remaining readiness blockers are still real and should drive the next slice.
The final bf16 `HiddenStates` returned to the model is still allocated per combine call. That allocation is outside the removed NCCL contribution/result round trip, so issue #275 did not claim full CUDA Graph readiness. The current blocker list should stay in `status.md`.
9 changes: 5 additions & 4 deletions docs/models/deepseek-v2-lite/status.md
@@ -1,6 +1,6 @@
# DeepSeek-V2-Lite Status And Benchmark Ledger

> **TL;DR:** DeepSeek-V2-Lite is a feature-gated EP2 correctness and attribution target. HF, host-staged, and NCCL match for the narrow greedy gate; NCCL decode combine now uses reusable device-resident f32 scratch, while dense exchange and host-directed routing still block full decode graph capture. Current batch and vLLM data remain diagnostic and do not claim production serving parity.
> **TL;DR:** DeepSeek-V2-Lite is a feature-gated EP2 correctness and attribution target. HF, host-staged, and NCCL match for the narrow greedy gate; NCCL decode combine and dense exchange now use reusable device scratch, while host-directed routing/expert accumulation still block full decode graph capture. Current batch and vLLM data remain diagnostic and do not claim production serving parity.

Last touched: 2026-06

@@ -14,7 +14,8 @@ Last touched: 2026-06
| Decode attribution | Available | PR #162 and PR #169 add CPU/GPU attribution, route counts, NCCL counters, CUDA event timing, and optional NVTX correlation. |
| Direct same-prompt diagnostic batch | Available | PR #184 and PR #196 cover batch sizes `1`, `4`, and `8` for the fixed same-prompt direct path. |
| Device-resident NCCL combine | Available | Issue #275 keeps NCCL combine contributions/results on reusable f32 device scratch and preserves the HF / host-staged / NCCL exact gate on 2x RTX 5090. |
| NCCL CUDA Graph readiness | Diagnostic only | The attribution binary now emits `cuda_graph_readiness`. Current NCCL full decode capture is blocked; the optional preallocated f32 NCCL graph smoke captures, replays, and verifies. |
| Device-resident NCCL dense exchange | Available | Issue #276 reuses backend-owned bf16 dense-exchange scratch, clears rank1 zero-send every exchange, removes dense-exchange stream sync from the backend call, and preserves HF / host-staged / NCCL exactness on 2x RTX 5090. |
| NCCL CUDA Graph readiness | Diagnostic only | The attribution binary emits `cuda_graph_readiness`. Current NCCL full decode capture remains blocked by host route iteration and host-directed expert accumulation; the removed dense-exchange allocation/sync blockers should stay absent. |
| Production continuous batching | Not available | The direct diagnostic batch path is not mixed-request HTTP serving. |
| vLLM production parity | Not claimed | The manual vLLM snapshot below is for understanding the gap requested in issue #170. |

@@ -100,15 +101,15 @@ Do not claim:

Issue #205 records the model roadmap. Maintainer feedback there calls out NCCL plus CUDA Graph as the likely best decode direction, with host staging possibly deprecated later. Treat that as a future direction, not as current evidence.

The current graph-readiness diagnostic is intentionally fail-closed: `full_decode_capture_ready=false` for NCCL. Issue #275 removed the old NCCL combine H2D/D2H/allocation/sync blockers from the retained 2x RTX 5090 attribution gate, but dense exchange allocation/sync and host-directed routing remain. A basic preallocated f32 NCCL all-reduce smoke captures, replays, and verifies on the retained A800 run, but that proves only the collective smoke shape. It is not full decode CUDA Graph coverage. HF, host-staged, and NCCL remain token/text exact for the narrow greedy gate.
The current graph-readiness diagnostic is intentionally fail-closed: `full_decode_capture_ready=false` for NCCL. Issue #275 removed the old NCCL combine H2D/D2H/allocation/sync blockers, and issue #276 removed the dense-exchange allocation/sync blockers from the retained 2x RTX 5090 attribution gate. Those removed dense-exchange blockers are absent from the current readiness report. The remaining NCCL blockers are host route iteration and host-directed expert accumulation. The optional f32 NCCL graph smoke is a separate collective-only diagnostic and is not #276 evidence. HF, host-staged, and NCCL remain token/text exact for the narrow greedy gate.
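
A fail-closed check over the readiness report could look like the sketch below. It assumes the report is a JSON object with a boolean `full_decode_capture_ready` and a list under a `blockers` key; the key name `blockers` and the function `check_readiness` are assumptions, while the blocker identifiers are the ones named in the issue #275 and #276 records.

```python
# Blockers removed by issue #275 (combine) and issue #276 (dense exchange).
REMOVED_BLOCKERS = {
    "nccl_contribution_accumulation_on_host",
    "nccl_combine_h2d_contribution_copy",
    "nccl_combine_allocates_per_call",
    "nccl_combine_syncs_rank_streams",
    "nccl_combine_d2h_result_copy",
    "nccl_dense_exchange_allocates_per_call",
    "nccl_dense_exchange_syncs_rank_streams",
}

# Blockers the current report is still expected to list.
EXPECTED_REMAINING = {
    "nccl_route_iteration_on_host",
    "nccl_expert_accumulation_host_directed",
}


def check_readiness(report):
    """Return a list of problems found in a readiness report (empty if none).

    Fail-closed: a missing ready flag is treated as not ready, and a ready
    flag is never accepted while any blocker is still listed.
    """
    problems = []
    blockers = set(report.get("blockers", []))
    ready = bool(report.get("full_decode_capture_ready", False))

    if ready and blockers:
        problems.append("ready flag set while blockers remain")

    regressed = sorted(blockers & REMOVED_BLOCKERS)
    if regressed:
        problems.append("removed blockers reappeared: " + ", ".join(regressed))

    missing = sorted(EXPECTED_REMAINING - blockers)
    if not ready and missing:
        problems.append("expected blockers not reported: " + ", ".join(missing))

    return problems
```

The third check is deliberately strict: if a remaining blocker disappears while the report still says not ready, the report and the documentation have drifted and one of them needs an update.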

The next implementation should be chosen from measured evidence:

1. Move the remaining NCCL decode path toward CUDA Graph coverage.
- keep HF / host-staged / NCCL exact before and after;
- keep host-staged as the correctness baseline while it exists;
- preserve attribution before and after the change;
- attack dense exchange allocation/sync and host route iteration next;
- attack host route iteration and host-directed expert accumulation next;
- avoid broad generic EP or multi-node work;
- judge issue #170 by whether it reduces NCCL decode overhead and makes the path more graph-friendly.

6 changes: 3 additions & 3 deletions openinfer-core/src/ops.rs
@@ -16,9 +16,9 @@ pub use attention::{
pub use openinfer_kernels::ops::{
LoraDecodeGroupedProjection, accumulate_bf16_token_scaled_to_f32_into, add_batch,
add_batch_into, bf16_hidden_to_f32_into, embedding_decode_into, extract_vec, extract_vec_into,
f32_to_bf16_hidden_into, fused_add_rms_norm_into, gather_hidden_tokens_into, gemm,
gemm_into_checked, gemm_per_token, gemv, linear, lora_decode_fused_delta_group3_into,
lora_decode_fused_delta_into, pack_lora_b_rows_into,
extract_vec_ref, extract_vec_ref_into, f32_to_bf16_hidden_into, fused_add_rms_norm_into,
gather_hidden_tokens_into, gemm, gemm_into_checked, gemm_per_token, gemv, linear,
lora_decode_fused_delta_group3_into, lora_decode_fused_delta_into, pack_lora_b_rows_into,
qk_norm_partial_rope_batched_decode_hd256_into, rms_norm, rms_norm_batch_offset_into,
rms_norm_gated_batch_into, rms_norm_into, rms_norm_offset_into, scale_f32_in_place,
scaled_add_batch_into, scaled_add_rows_indexed_into, scaled_add_rows_into,