fix(dsv2lite): reuse NCCL dense exchange scratch #339
Merged
xiaguan merged 3 commits on Jun 11, 2026
Pull request overview
This PR improves DeepSeek-V2-Lite’s NCCL EP2 decode path by reusing backend-owned dense-exchange scratch buffers (removing per-call allocation/sync blockers for CUDA Graph readiness), and updates the runtime APIs/docs/tests to reflect the new “borrowed hidden states” flow.
Changes:
- Add `HiddenStatesRef`-based extract helpers and plumb borrowed hidden state views through the remote expert path.
- Rework NCCL dense exchange to reuse backend-owned bf16 scratch buffers and update graph-readiness blockers/tests accordingly.
- Extend NCCL dynamic library discovery (including Python wheel locations) and update model docs/ledger entries.
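
As a rough illustration of the Python wheel discovery mentioned in the last item, the sketch below derives a candidate NCCL library directory from a Python executable path. It assumes a Unix-style environment layout (`<root>/bin/python3`) and the `nvidia/nccl/lib` directory used by the `nvidia-nccl-*` wheels. The function names are hypothetical and are not taken from this PR, and the PR's actual candidate order and environment overrides are not reproduced here.

```rust
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Hypothetical helper: given a Python executable such as `<root>/bin/python3`,
/// return `<root>`. Returns `None` when the executable is not inside a `bin` dir.
fn python_env_root(python_exe: &Path) -> Option<PathBuf> {
    let bin_dir = python_exe.parent()?;
    if bin_dir.file_name()? != OsStr::new("bin") {
        return None;
    }
    Some(bin_dir.parent()?.to_path_buf())
}

/// Hypothetical helper: build the wheel library directory for one Python
/// version directory name (for example "python3.11"). Assumes the
/// `site-packages/nvidia/nccl/lib` layout; the path is not checked on disk.
fn nccl_wheel_lib_dir(env_root: &Path, python_dir: &str) -> PathBuf {
    env_root
        .join("lib")
        .join(python_dir)
        .join("site-packages")
        .join("nvidia")
        .join("nccl")
        .join("lib")
}
```

A real loader would presumably try explicit overrides first and only then fall back to candidates like this one, checking that each directory exists before attempting to load the library.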
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| openinfer-kernels/src/ops/elementwise.rs | Adds extract_vec_ref / extract_vec_ref_into to support borrowing HiddenStates data. |
| openinfer-kernels/src/ops.rs | Re-exports the new extract_vec_ref* helpers from kernels ops. |
| openinfer-core/src/ops.rs | Re-exports the new extract_vec_ref* helpers via openinfer-core::ops. |
| openinfer-deepseek-v2-lite/src/runtime/moe.rs | Uses HiddenStatesRef so remote expert forwarding can consume rank1 recv scratch without allocating a new HiddenStates. |
| openinfer-deepseek-v2-lite/src/runtime/readiness.rs | Removes dense-exchange allocation/sync blockers from NCCL readiness and wires in a unit test module. |
| openinfer-deepseek-v2-lite/src/runtime/readiness/tests.rs | Adds a regression test asserting only the remaining NCCL graph blockers are reported. |
| openinfer-deepseek-v2-lite/src/nccl_backend.rs | Adds reusable dense-exchange scratch and expands NCCL library candidate discovery (env overrides + Python wheel roots). |
| openinfer-deepseek-v2-lite/src/nccl_backend/tests.rs | Adds unit test for Python env root → NCCL wheel lib dir discovery. |
| docs/models/deepseek-v2-lite/status.md | Updates status ledger to reflect dense-exchange device scratch + remaining readiness blockers. |
| docs/models/deepseek-v2-lite/device-resident-nccl-combine.md | Updates combine doc to defer current blocker list to status.md. |
| docs/models/deepseek-v2-lite/decode-attribution-gate.md | Documents NCCL loader precedence and updates validation notes for issue #276. |
| docs/index.md | Updates doc index entry for the combine write-up. |
Comments suppressed due to low confidence (1)
openinfer-kernels/src/ops/elementwise.rs:516
`extract_vec_ref_into` can panic on an out-of-bounds `token_idx` because `CudaSlice::slice` will be called with an invalid range. Since this is a public API that returns `Result`, it should fail gracefully with an `ensure!` instead of relying on a panic from slicing.
let offset = token_idx * batch.hidden_dim;
let len = batch.hidden_dim;
anyhow::ensure!(out.len == len, "extract_vec_into len mismatch");
let src_view = batch.data.slice(offset..offset + len);
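
The graceful-failure pattern suggested in this comment could look roughly like the sketch below. It is a host-side stand-in only: `HostHiddenStatesRef`, its `seq_len` field, and `extract_row_checked` are hypothetical names, a plain `f32` slice replaces the device buffer, and a `String` error replaces `anyhow` so the example has no dependencies. The point is that both the output length and the token index are validated before any slicing happens.

```rust
/// Hypothetical host-side stand-in for a borrowed hidden-states batch.
struct HostHiddenStatesRef<'a> {
    data: &'a [f32],   // row-major: seq_len rows of hidden_dim values
    hidden_dim: usize,
    seq_len: usize,    // assumed field; the real type may track this differently
}

/// Copy one token's hidden vector into `out`, returning an error instead of
/// panicking when `token_idx` or `out.len()` is invalid.
fn extract_row_checked(
    batch: &HostHiddenStatesRef<'_>,
    token_idx: usize,
    out: &mut [f32],
) -> Result<(), String> {
    let len = batch.hidden_dim;
    if out.len() != len {
        return Err(format!("len mismatch: out={} hidden_dim={}", out.len(), len));
    }
    if token_idx >= batch.seq_len {
        return Err(format!(
            "token_idx {} out of bounds for seq_len {}",
            token_idx, batch.seq_len
        ));
    }
    // Checked arithmetic guards against overflow in the offset computation.
    let offset = token_idx
        .checked_mul(len)
        .ok_or_else(|| "offset overflow".to_string())?;
    let end = offset
        .checked_add(len)
        .ok_or_else(|| "range overflow".to_string())?;
    // `get` returns None rather than panicking if the backing slice is too short.
    let src = batch
        .data
        .get(offset..end)
        .ok_or_else(|| "backing buffer shorter than declared shape".to_string())?;
    out.copy_from_slice(src);
    Ok(())
}
```

In the actual helper the same two checks would presumably be expressed with `anyhow::ensure!` ahead of the `CudaSlice::slice` call.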
Summary
Fixes #276.
Details
The NCCL EP2 diagnostic backend now keeps reusable dense-exchange buffers for `rank0_recv`, `rank1_send_zero`, and `rank1_recv`, reallocating only when the dense exchange shape changes. `rank1_send_zero` is cleared on the rank1 stream for every dense exchange before the grouped bf16 NCCL all-reduce.

The fixed EP2 route logic and host-staged oracle stay unchanged. The readiness report still fails closed for full decode CUDA Graph capture. The remaining NCCL blockers are:

- `nccl_route_iteration_on_host`
- `nccl_expert_accumulation_host_directed`

This PR does not claim full decode CUDA Graph capture, sparse dispatch, production EP readiness, or a throughput improvement.
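
The reuse policy described above (reallocate only on a shape change, clear the zero-send buffer on every exchange) can be sketched on the host as follows. This is a simplified illustration under stated assumptions: `Scratch`, `DenseExchangeScratch`, `ensure`, `prepare`, and the `allocations` counter are hypothetical names, a `Vec<u16>` of bf16 bit patterns stands in for a device allocation, and no streams or NCCL calls are modeled.

```rust
/// Hypothetical stand-in for one backend-owned scratch buffer.
struct Scratch {
    shape: Option<(usize, usize)>, // (tokens, hidden_dim) of the current allocation
    data: Vec<u16>,                // bf16 bit patterns; stands in for device memory
    allocations: usize,            // counts reallocations, for illustration only
}

impl Scratch {
    fn new() -> Self {
        Scratch { shape: None, data: Vec::new(), allocations: 0 }
    }

    /// Return the buffer for `(tokens, hidden_dim)`, reallocating only when
    /// the requested shape differs from the one currently held.
    fn ensure(&mut self, tokens: usize, hidden_dim: usize) -> &mut [u16] {
        if self.shape != Some((tokens, hidden_dim)) {
            self.data = vec![0; tokens * hidden_dim];
            self.shape = Some((tokens, hidden_dim));
            self.allocations += 1;
        }
        &mut self.data
    }
}

/// The three buffers named in the description, grouped in one owner.
struct DenseExchangeScratch {
    rank0_recv: Scratch,
    rank1_send_zero: Scratch,
    rank1_recv: Scratch,
}

impl DenseExchangeScratch {
    fn new() -> Self {
        DenseExchangeScratch {
            rank0_recv: Scratch::new(),
            rank1_send_zero: Scratch::new(),
            rank1_recv: Scratch::new(),
        }
    }

    /// Prepare all buffers for one dense exchange.
    fn prepare(&mut self, tokens: usize, hidden_dim: usize) {
        self.rank0_recv.ensure(tokens, hidden_dim);
        self.rank1_recv.ensure(tokens, hidden_dim);
        // The zero-send buffer is cleared on every exchange, even when reused.
        self.rank1_send_zero.ensure(tokens, hidden_dim).fill(0);
    }
}
```

Calling `prepare` repeatedly with the same shape leaves the allocation count at one per buffer, which is the property that removes per-call allocation from the steady-state decode path.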
Validation
- `cargo fmt --all --check`
- `cargo clippy --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite --bins --tests -- -D warnings`
- `cargo test --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite nccl_readiness_reports_only_remaining_graph_blockers -- --nocapture`
- `cargo test --release -p openinfer-deepseek-v2-lite --features deepseek-v2-lite finds_nccl_python_wheel_lib_dir_from_python_executable -- --nocapture`
- `--require-all-exact`: `all_token_text_exact`
- `1/4/8`:
  - `gpu_timing.failure_count=0`, dense/combine `416/416`
  - `gpu_timing.failure_count=0`, dense/combine `494/494`
  - `gpu_timing.failure_count=0`, dense/combine `598/598`

Retained `Hello` / 16-token oracle hashes:

- `4fb4c8825fe4d2c4a1d966da25c259abdf675f4de4548daa5d41aea7dfe30225`
- `0eedf11429e9ac13bb799c31665c6e9f70a1ac4493a08a3f3da9ecf39c1ec347`