fix(dsv2lite): reuse NCCL dense exchange scratch#339

Merged
xiaguan merged 3 commits into openinfer-project:main from CAICAIIs:fix/dsv2lite-nccl-dense-exchange-scratch on Jun 11, 2026

@CAICAIIs

Collaborator

Summary

Fixes #276.

  • reuse backend-owned bf16 scratch for the DeepSeek-V2-Lite NCCL dense exchange path
  • let the remote expert path consume the rank1 receive buffer through HiddenStatesRef
  • update CUDA Graph readiness so dense-exchange allocation/sync blockers are removed while the remaining host-directed blockers stay explicit
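The second bullet relies on the remote expert path taking a borrowed view instead of an owned allocation. A minimal sketch of that idea follows. `OwnedHidden`, `HiddenView`, and `sum_bits` are hypothetical stand-ins: the real `HiddenStatesRef` layout is not shown in this PR, and `Vec<u16>` models bf16 device memory on the host.

```rust
/// Owned hidden states (hypothetical stand-in; a Vec models a device allocation).
pub struct OwnedHidden {
    pub data: Vec<u16>,
    pub seq_len: usize,
    pub hidden_dim: usize,
}

/// Borrowed view over hidden-state storage owned elsewhere (hypothetical layout).
#[derive(Clone, Copy)]
pub struct HiddenView<'a> {
    pub data: &'a [u16],
    pub seq_len: usize,
    pub hidden_dim: usize,
}

impl OwnedHidden {
    /// Borrow the owned storage as a view.
    pub fn view(&self) -> HiddenView<'_> {
        HiddenView {
            data: &self.data,
            seq_len: self.seq_len,
            hidden_dim: self.hidden_dim,
        }
    }
}

impl<'a> HiddenView<'a> {
    /// Wrap the front of a scratch buffer without copying.
    /// Returns None if the buffer is too small for the requested shape.
    pub fn from_scratch(buf: &'a [u16], seq_len: usize, hidden_dim: usize) -> Option<Self> {
        let n = seq_len.checked_mul(hidden_dim)?;
        let data = buf.get(..n)?;
        Some(Self { data, seq_len, hidden_dim })
    }
}

/// Stand-in consumer for the remote expert path: it only needs a view,
/// so owned states and receive scratch can both feed it.
pub fn sum_bits(x: HiddenView<'_>) -> u64 {
    x.data.iter().map(|&v| u64::from(v)).sum()
}
```

Because the consumer takes a view, a receive buffer can be passed straight through with `HiddenView::from_scratch(&scratch, seq_len, hidden_dim)` and no new hidden-state allocation is created on the way.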

Details

The NCCL EP2 diagnostic backend now keeps reusable dense-exchange buffers for rank0_recv, rank1_send_zero, and rank1_recv, reallocating only when the dense exchange shape changes. rank1_send_zero is cleared on the rank1 stream for every dense exchange before the grouped bf16 NCCL all-reduce.
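The reuse policy described above can be sketched on the host as follows. This is an illustration under stated assumptions, not the backend's code: `DenseShape`, `DenseExchangeScratch`, `prepare`, and `alloc_count` are hypothetical names, and each `Vec<u16>` stands in for a bf16 device buffer. Only the three buffer names come from the PR.

```rust
/// Shape of one dense exchange (hypothetical): tokens times hidden size.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DenseShape {
    pub tokens: usize,
    pub hidden_dim: usize,
}

impl DenseShape {
    pub fn elems(&self) -> usize {
        self.tokens * self.hidden_dim
    }
}

/// Host-side stand-in for backend-owned scratch. Each Vec<u16> models a
/// bf16 buffer (bf16 is 16 bits wide); real code would hold device memory.
#[derive(Default)]
pub struct DenseExchangeScratch {
    shape: Option<DenseShape>,
    pub rank0_recv: Vec<u16>,
    pub rank1_send_zero: Vec<u16>,
    pub rank1_recv: Vec<u16>,
    /// Counts reallocations so that reuse is observable.
    pub alloc_count: usize,
}

impl DenseExchangeScratch {
    /// Make the buffers match `shape`, reallocating only when the shape
    /// differs from the cached one, then clear the rank1 send buffer.
    pub fn prepare(&mut self, shape: DenseShape) {
        if self.shape != Some(shape) {
            let n = shape.elems();
            self.rank0_recv = vec![0; n];
            self.rank1_send_zero = vec![0; n];
            self.rank1_recv = vec![0; n];
            self.shape = Some(shape);
            self.alloc_count += 1;
        }
        // Cleared on every exchange; this fill writes in place and does not allocate.
        self.rank1_send_zero.fill(0);
    }
}
```

Calling `prepare` repeatedly with the same shape leaves `alloc_count` unchanged, which is the property that removes per-call allocation from the steady-state decode path.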

The fixed EP2 route logic and host-staged oracle stay unchanged. The readiness report still fails closed for full decode CUDA Graph capture. The remaining NCCL blockers are:

  • nccl_route_iteration_on_host
  • nccl_expert_accumulation_host_directed
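A fail-closed readiness report of this kind might look like the sketch below. The two blocker strings are taken from the list above; `ReadinessReport`, `NcclPathState`, and `nccl_readiness` are hypothetical names chosen for illustration and may not match the project's types.

```rust
/// Hypothetical report type: a list of named blockers.
#[derive(Debug, Default)]
pub struct ReadinessReport {
    pub blockers: Vec<&'static str>,
}

impl ReadinessReport {
    /// Fail closed: capture is allowed only when no blocker is recorded.
    pub fn graph_capture_ready(&self) -> bool {
        self.blockers.is_empty()
    }
}

/// Hypothetical flags describing which steps are still driven from the host.
pub struct NcclPathState {
    pub route_iteration_on_host: bool,
    pub expert_accumulation_host_directed: bool,
}

/// Build the report by naming every remaining blocker explicitly.
pub fn nccl_readiness(state: &NcclPathState) -> ReadinessReport {
    let mut report = ReadinessReport::default();
    if state.route_iteration_on_host {
        report.blockers.push("nccl_route_iteration_on_host");
    }
    if state.expert_accumulation_host_directed {
        report.blockers.push("nccl_expert_accumulation_host_directed");
    }
    report
}
```

A regression test can then assert that the blocker list equals exactly these two names, so a reintroduced allocation or sync blocker would show up as a test failure.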

This PR does not claim full decode CUDA Graph capture, sparse dispatch, production EP readiness, or a throughput improvement.

Validation

  • cargo fmt --all --check
  • cargo clippy --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite --bins --tests -- -D warnings
  • cargo test --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite nccl_readiness_reports_only_remaining_graph_blockers -- --nocapture
  • cargo test --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite finds_nccl_python_wheel_lib_dir_from_python_executable -- --nocapture
  • HF / host-staged / NCCL EP2 greedy JSON comparison with --require-all-exact: all_token_text_exact
  • NCCL attribution for batch 1/4/8:
    • batch 1: gpu_timing.failure_count=0, dense/combine 416/416
    • batch 4: gpu_timing.failure_count=0, dense/combine 494/494
    • batch 8: gpu_timing.failure_count=0, dense/combine 598/598

Retained Hello / 16-token oracle hashes:

  • token SHA256: 4fb4c8825fe4d2c4a1d966da25c259abdf675f4de4548daa5d41aea7dfe30225
  • text SHA256: 0eedf11429e9ac13bb799c31665c6e9f70a1ac4493a08a3f3da9ecf39c1ec347

Copilot AI left a comment

Pull request overview

This PR improves DeepSeek-V2-Lite’s NCCL EP2 decode path by reusing backend-owned dense-exchange scratch buffers (removing per-call allocation/sync blockers for CUDA Graph readiness), and updates the runtime APIs/docs/tests to reflect the new “borrowed hidden states” flow.

Changes:

  • Add HiddenStatesRef-based extract helpers and plumb borrowed hidden state views through the remote expert path.
  • Rework NCCL dense exchange to reuse backend-owned bf16 scratch buffers and update graph-readiness blockers/tests accordingly.
  • Extend NCCL dynamic library discovery (including Python wheel locations) and update model docs/ledger entries.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| openinfer-kernels/src/ops/elementwise.rs | Adds extract_vec_ref / extract_vec_ref_into to support borrowing HiddenStates data. |
| openinfer-kernels/src/ops.rs | Re-exports the new extract_vec_ref* helpers from kernels ops. |
| openinfer-core/src/ops.rs | Re-exports the new extract_vec_ref* helpers via openinfer-core::ops. |
| openinfer-deepseek-v2-lite/src/runtime/moe.rs | Uses HiddenStatesRef so remote expert forwarding can consume rank1 recv scratch without allocating a new HiddenStates. |
| openinfer-deepseek-v2-lite/src/runtime/readiness.rs | Removes dense-exchange allocation/sync blockers from NCCL readiness and wires in a unit test module. |
| openinfer-deepseek-v2-lite/src/runtime/readiness/tests.rs | Adds a regression test asserting only the remaining NCCL graph blockers are reported. |
| openinfer-deepseek-v2-lite/src/nccl_backend.rs | Adds reusable dense-exchange scratch and expands NCCL library candidate discovery (env overrides + Python wheel roots). |
| openinfer-deepseek-v2-lite/src/nccl_backend/tests.rs | Adds unit test for Python env root → NCCL wheel lib dir discovery. |
| docs/models/deepseek-v2-lite/status.md | Updates status ledger to reflect dense-exchange device scratch + remaining readiness blockers. |
| docs/models/deepseek-v2-lite/device-resident-nccl-combine.md | Updates combine doc to defer current blocker list to status.md. |
| docs/models/deepseek-v2-lite/decode-attribution-gate.md | Documents NCCL loader precedence and updates validation notes for issue #276. |
| docs/index.md | Updates doc index entry for the combine write-up. |
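The library discovery mentioned for nccl_backend.rs (env overrides plus Python wheel roots) could be sketched roughly as below. Everything here is an assumption made for illustration: the function names, passing the Python version as a parameter, the override-first ordering, and the `site-packages/nvidia/nccl/lib` wheel layout are not confirmed by this PR, and the actual precedence is documented in decode-attribution-gate.md.

```rust
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Derive the environment root from a Python executable path,
/// e.g. `/opt/venv/bin/python3` gives `/opt/venv` (Unix-style layout assumed).
pub fn python_env_root(python_exe: &Path) -> Option<PathBuf> {
    let bin_dir = python_exe.parent()?;
    if bin_dir.file_name() != Some(OsStr::new("bin")) {
        return None;
    }
    bin_dir.parent().map(Path::to_path_buf)
}

/// Build an ordered list of directories to probe for the NCCL library.
/// Assumption: an explicit override is tried before the Python wheel directory.
pub fn nccl_lib_candidates(
    env_override: Option<&str>,
    python_exe: Option<&Path>,
    python_version: &str,
) -> Vec<PathBuf> {
    let mut out = Vec::new();
    // Ignore an override that is set but empty.
    if let Some(dir) = env_override.filter(|d| !d.is_empty()) {
        out.push(PathBuf::from(dir));
    }
    if let Some(root) = python_exe.and_then(python_env_root) {
        // Assumed wheel layout under the environment root.
        out.push(
            root.join("lib")
                .join(format!("python{python_version}"))
                .join("site-packages")
                .join("nvidia")
                .join("nccl")
                .join("lib"),
        );
    }
    out
}
```

Keeping the path derivation pure (no filesystem access) makes it straightforward to unit test with synthetic paths, in the spirit of the Python env root test listed for nccl_backend/tests.rs.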
Comments suppressed due to low confidence (1)

openinfer-kernels/src/ops/elementwise.rs:516

  • extract_vec_ref_into can panic on an out-of-bounds token_idx because CudaSlice::slice will be called with an invalid range. Since this is a public API that returns Result, it should fail gracefully with an ensure! instead of relying on a panic from slicing.
    let offset = token_idx * batch.hidden_dim;
    let len = batch.hidden_dim;
    anyhow::ensure!(out.len == len, "extract_vec_into len mismatch");
    let src_view = batch.data.slice(offset..offset + len);
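One way to get the graceful failure this comment asks for is to validate the range before slicing. The sketch below shows the pattern on a host slice. It is not the project's code: `BatchRef` and `extract_row_into` are hypothetical, a `String` error replaces `anyhow`, and `slice::get` stands in for the device slice call.

```rust
/// Hypothetical borrowed batch: `data` holds `seq_len * hidden_dim` values.
pub struct BatchRef<'a> {
    pub data: &'a [u16],
    pub hidden_dim: usize,
}

/// Copy one token's row into `out`, returning Err instead of panicking
/// when `token_idx` points past the end of the batch.
pub fn extract_row_into(
    batch: &BatchRef<'_>,
    token_idx: usize,
    out: &mut [u16],
) -> Result<(), String> {
    let len = batch.hidden_dim;
    if out.len() != len {
        return Err(format!("len mismatch: out={} hidden_dim={}", out.len(), len));
    }
    // Checked arithmetic so a huge token_idx cannot wrap around.
    let offset = token_idx
        .checked_mul(len)
        .ok_or_else(|| "offset overflow".to_string())?;
    let end = offset
        .checked_add(len)
        .ok_or_else(|| "range overflow".to_string())?;
    // `get` returns None for an out-of-bounds range instead of panicking.
    let src = batch.data.get(offset..end).ok_or_else(|| {
        format!("token_idx {} out of bounds for {} elements", token_idx, batch.data.len())
    })?;
    out.copy_from_slice(src);
    Ok(())
}
```

With the `anyhow` style used in the excerpt, the same guard would be an `ensure!` on `offset + len <= data.len()` placed before the slice call.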

Comment thread openinfer-deepseek-v2-lite/src/nccl_backend.rs
@xiaguan xiaguan merged commit 8ddd840 into openinfer-project:main Jun 11, 2026
1 check passed
Development

Successfully merging this pull request may close these issues.

dsv2lite: preallocate NCCL dense-exchange buffers for decode

3 participants