fix(dsv2lite): reuse NCCL dense exchange scratch by CAICAIIs · Pull Request #339 · openinfer-project/openinfer

CAICAIIs · 2026-06-10T16:23:55Z

Summary

Fixes #276.

- Reuse backend-owned bf16 scratch for the DeepSeek-V2-Lite NCCL dense exchange path.
- Let the remote expert path consume the rank1 receive buffer through HiddenStatesRef.
- Update CUDA Graph readiness so dense-exchange allocation/sync blockers are removed while the remaining host-directed blockers stay explicit.

Details

The NCCL EP2 diagnostic backend now keeps reusable dense-exchange buffers for rank0_recv, rank1_send_zero, and rank1_recv, reallocating only when the dense exchange shape changes. rank1_send_zero is cleared on the rank1 stream for every dense exchange before the grouped bf16 NCCL all-reduce.
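The reuse rule described above can be sketched on the host as follows. This is a minimal illustration, not the backend's code: `DenseShape`, `DenseExchangeScratch`, `prepare`, and `alloc_count` are hypothetical names, and plain `Vec<u16>` buffers (bf16 stored as raw bits) stand in for device allocations and the rank1-stream clear.

```rust
/// Hypothetical shape key: buffers are reallocated only when this changes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DenseShape {
    pub tokens: usize,
    pub hidden_dim: usize,
}

impl DenseShape {
    /// Number of bf16 elements in one dense-exchange buffer.
    pub fn elems(self) -> usize {
        self.tokens * self.hidden_dim
    }
}

/// Host-side stand-in for backend-owned bf16 scratch (assumption: the real
/// buffers live on the device; Vec<u16> is used here only for illustration).
#[derive(Default)]
pub struct DenseExchangeScratch {
    shape: Option<DenseShape>,
    pub rank0_recv: Vec<u16>,
    pub rank1_send_zero: Vec<u16>,
    pub rank1_recv: Vec<u16>,
    /// Counts (re)allocations so that reuse can be observed in tests.
    pub alloc_count: usize,
}

impl DenseExchangeScratch {
    /// Make the buffers match `shape`, then clear the rank1 send buffer.
    pub fn prepare(&mut self, shape: DenseShape) {
        if self.shape != Some(shape) {
            // Shape changed (or first use): allocate fresh buffers.
            let n = shape.elems();
            self.rank0_recv = vec![0; n];
            self.rank1_send_zero = vec![0; n];
            self.rank1_recv = vec![0; n];
            self.shape = Some(shape);
            self.alloc_count += 1;
        }
        // Cleared before every exchange, even when the buffers are reused.
        self.rank1_send_zero.fill(0);
    }
}
```

Calling `prepare` twice with the same shape leaves `alloc_count` at 1, while stale contents of `rank1_send_zero` are still zeroed on the second call.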

The fixed EP2 route logic and host-staged oracle stay unchanged. The readiness report still fails closed for full decode CUDA Graph capture. The remaining NCCL blockers are:

- nccl_route_iteration_on_host
- nccl_expert_accumulation_host_directed
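A fail-closed readiness report of this kind might look like the sketch below. Only the two blocker strings come from this PR; the `GraphBlocker` and `GraphReadiness` types and the `nccl_readiness` function are hypothetical names chosen for illustration and are not the crate's API.

```rust
/// Hypothetical enum for the blockers that remain after this PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum GraphBlocker {
    RouteIterationOnHost,
    ExpertAccumulationHostDirected,
}

impl GraphBlocker {
    /// Blocker name as listed in the PR description.
    pub fn name(self) -> &'static str {
        match self {
            GraphBlocker::RouteIterationOnHost => "nccl_route_iteration_on_host",
            GraphBlocker::ExpertAccumulationHostDirected => {
                "nccl_expert_accumulation_host_directed"
            }
        }
    }
}

/// Hypothetical report type: capture is permitted only with zero blockers.
pub struct GraphReadiness {
    pub blockers: Vec<GraphBlocker>,
}

impl GraphReadiness {
    /// Fail closed: any reported blocker rules out full decode capture.
    pub fn can_capture_full_decode(&self) -> bool {
        self.blockers.is_empty()
    }

    pub fn blocker_names(&self) -> Vec<&'static str> {
        self.blockers.iter().map(|b| b.name()).collect()
    }
}

/// Readiness of the NCCL path as described here: the dense-exchange
/// allocation/sync blockers are gone, the host-directed ones remain.
pub fn nccl_readiness() -> GraphReadiness {
    GraphReadiness {
        blockers: vec![
            GraphBlocker::RouteIterationOnHost,
            GraphBlocker::ExpertAccumulationHostDirected,
        ],
    }
}
```

Keeping the blockers as an explicit list means a regression test can assert the exact remaining set rather than a single boolean.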

This PR does not claim full decode CUDA Graph capture, sparse dispatch, production EP readiness, or a throughput improvement.

Validation

- `cargo fmt --all --check`
- `cargo clippy --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite --bins --tests -- -D warnings`
- `cargo test --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite nccl_readiness_reports_only_remaining_graph_blockers -- --nocapture`
- `cargo test --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite finds_nccl_python_wheel_lib_dir_from_python_executable -- --nocapture`
- HF / host-staged / NCCL EP2 greedy JSON comparison with `--require-all-exact`: all_token_text_exact
- NCCL attribution for batch 1/4/8:
  - batch 1: gpu_timing.failure_count=0, dense/combine 416/416
  - batch 4: gpu_timing.failure_count=0, dense/combine 494/494
  - batch 8: gpu_timing.failure_count=0, dense/combine 598/598

Retained Hello / 16-token oracle hashes:

token SHA256: 4fb4c8825fe4d2c4a1d966da25c259abdf675f4de4548daa5d41aea7dfe30225
text SHA256: 0eedf11429e9ac13bb799c31665c6e9f70a1ac4493a08a3f3da9ecf39c1ec347

Copilot

Pull request overview

This PR improves DeepSeek-V2-Lite’s NCCL EP2 decode path by reusing backend-owned dense-exchange scratch buffers (removing per-call allocation/sync blockers for CUDA Graph readiness), and updates the runtime APIs/docs/tests to reflect the new “borrowed hidden states” flow.

Changes:

- Add HiddenStatesRef-based extract helpers and plumb borrowed hidden state views through the remote expert path.
- Rework NCCL dense exchange to reuse backend-owned bf16 scratch buffers and update graph-readiness blockers/tests accordingly.
- Extend NCCL dynamic library discovery (including Python wheel locations) and update model docs/ledger entries.
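To illustrate the library-discovery item, the sketch below builds an ordered list of candidate directories from an optional override and a Python executable path. Everything here is an assumption made for illustration: the function names, the override-first ordering, the Unix-style `bin/` layout, and the `site-packages/nvidia/nccl/lib` location commonly used by NCCL pip wheels. The actual precedence is the one documented in decode-attribution-gate.md.

```rust
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Derive the environment root from `<root>/bin/python`.
/// Returns None when the executable is not inside a `bin` directory.
pub fn python_env_root(python_exe: &Path) -> Option<PathBuf> {
    let bin = python_exe.parent()?;
    if bin.file_name() != Some(OsStr::new("bin")) {
        return None;
    }
    bin.parent().map(Path::to_path_buf)
}

/// Build candidate directories in precedence order (assumed: override first).
/// `py_lib_dirs` holds names such as "python3.11" found under `<root>/lib`.
pub fn nccl_lib_dir_candidates(
    override_dir: Option<&Path>,
    python_exe: &Path,
    py_lib_dirs: &[&str],
) -> Vec<PathBuf> {
    let mut out = Vec::new();
    if let Some(dir) = override_dir {
        out.push(dir.to_path_buf());
    }
    if let Some(root) = python_env_root(python_exe) {
        for version_dir in py_lib_dirs {
            // Assumed wheel layout: site-packages/nvidia/nccl/lib
            out.push(
                root.join("lib")
                    .join(version_dir)
                    .join("site-packages")
                    .join("nvidia")
                    .join("nccl")
                    .join("lib"),
            );
        }
    }
    out
}
```

The function only computes paths; a loader would still need to check which candidates exist before attempting to open the library.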

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| openinfer-kernels/src/ops/elementwise.rs | Adds `extract_vec_ref` / `extract_vec_ref_into` to support borrowing `HiddenStates` data. |
| openinfer-kernels/src/ops.rs | Re-exports the new `extract_vec_ref*` helpers from kernels ops. |
| openinfer-core/src/ops.rs | Re-exports the new `extract_vec_ref*` helpers via `openinfer-core::ops`. |
| openinfer-deepseek-v2-lite/src/runtime/moe.rs | Uses `HiddenStatesRef` so remote expert forwarding can consume rank1 recv scratch without allocating a new `HiddenStates`. |
| openinfer-deepseek-v2-lite/src/runtime/readiness.rs | Removes dense-exchange allocation/sync blockers from NCCL readiness and wires in a unit test module. |
| openinfer-deepseek-v2-lite/src/runtime/readiness/tests.rs | Adds a regression test asserting only the remaining NCCL graph blockers are reported. |
| openinfer-deepseek-v2-lite/src/nccl_backend.rs | Adds reusable dense-exchange scratch and expands NCCL library candidate discovery (env overrides + Python wheel roots). |
| openinfer-deepseek-v2-lite/src/nccl_backend/tests.rs | Adds unit test for Python env root → NCCL wheel lib dir discovery. |
| docs/models/deepseek-v2-lite/status.md | Updates status ledger to reflect dense-exchange device scratch + remaining readiness blockers. |
| docs/models/deepseek-v2-lite/device-resident-nccl-combine.md | Updates combine doc to defer current blocker list to `status.md`. |
| docs/models/deepseek-v2-lite/decode-attribution-gate.md | Documents NCCL loader precedence and updates validation notes for issue #276. |
| docs/index.md | Updates doc index entry for the combine write-up. |

Comments suppressed due to low confidence (1)

openinfer-kernels/src/ops/elementwise.rs:516

extract_vec_ref_into can panic on an out-of-bounds token_idx because CudaSlice::slice will be called with an invalid range. Since this is a public API that returns Result, it should fail gracefully with an ensure! instead of relying on a panic from slicing.

    let offset = token_idx * batch.hidden_dim;
    let len = batch.hidden_dim;
    anyhow::ensure!(out.len == len, "extract_vec_into len mismatch");
    let src_view = batch.data.slice(offset..offset + len);
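One way to make the extraction fail gracefully is sketched below on host memory. It is an illustration under stated assumptions rather than the crate's code: `HiddenStatesView` and `extract_row_into` are hypothetical names, a `&[u16]` slice stands in for the device buffer, and `Result<(), String>` stands in for the `anyhow::ensure!` checks the review suggests.

```rust
/// Hypothetical borrowed view over row-major hidden states (bf16 as raw u16).
pub struct HiddenStatesView<'a> {
    pub data: &'a [u16],
    pub hidden_dim: usize,
}

/// Copy the row for `token_idx` into `out`, returning Err instead of
/// panicking when the index or the output length is invalid.
pub fn extract_row_into(
    batch: &HiddenStatesView<'_>,
    token_idx: usize,
    out: &mut [u16],
) -> Result<(), String> {
    if out.len() != batch.hidden_dim {
        return Err(format!(
            "output len {} does not match hidden_dim {}",
            out.len(),
            batch.hidden_dim
        ));
    }
    // Checked arithmetic guards against overflow for huge indices.
    let offset = token_idx
        .checked_mul(batch.hidden_dim)
        .ok_or_else(|| "token offset overflows usize".to_string())?;
    let end = offset
        .checked_add(batch.hidden_dim)
        .ok_or_else(|| "token range overflows usize".to_string())?;
    // `get` returns None for an out-of-bounds range rather than panicking.
    let src = batch
        .data
        .get(offset..end)
        .ok_or_else(|| format!("token_idx {} is out of bounds", token_idx))?;
    out.copy_from_slice(src);
    Ok(())
}
```

With six elements and `hidden_dim = 3`, index 1 copies the second row, while index 2 returns an error instead of reaching the slice operation.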

CAICAIIs added 2 commits June 10, 2026 20:37

fix(dsv2lite): reuse nccl dense exchange scratch

a68171e

chore(dsv2lite): tighten nccl dense exchange follow-up

8670692

CAICAIIs requested a review from Copilot June 10, 2026 17:16

Copilot started reviewing on behalf of CAICAIIs June 10, 2026 17:16

Copilot AI reviewed Jun 10, 2026

Comment thread openinfer-deepseek-v2-lite/src/nccl_backend.rs

fix(dsv2lite): address nccl review follow-ups

c205c4e

xiaguan merged commit 8ddd840 into openinfer-project:main Jun 11, 2026
1 check passed

CAICAIIs mentioned this pull request Jun 11, 2026

[Model] DeepSeek-V2-Lite roadmap #205 (Open)

xiaguan merged 3 commits into openinfer-project:main from CAICAIIs:fix/dsv2lite-nccl-dense-exchange-scratch

3 participants
