[New Model] DeepSeek-V3.2 (Rebased to Main) #25896
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
fix smoke tests
Signed-off-by: Lucas Wilkinson <[email protected]>
moved to FlashMLA repo
Signed-off-by: Lucas Wilkinson <[email protected]>
removed pytorch shim
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
setup sparse attention backend
Signed-off-by: Lucas Wilkinson <[email protected]>
…ild-sparse-flash-mla Build and bind sparse-FlashMLA kernels
…integration [Feature] DeepGEMM integration
* and env and MQA path for both prefill and decode
  Signed-off-by: Lucas Wilkinson <[email protected]>
* fix shapes
  Signed-off-by: Lucas Wilkinson <[email protected]>
---------
Signed-off-by: Lucas Wilkinson <[email protected]>
* code from ds
  Signed-off-by: youkaichao <[email protected]>
* doc from ds
  Signed-off-by: youkaichao <[email protected]>
* Fixes for support_materials/2-tilelang/
  Signed-off-by: mgoin <[email protected]>
* Fix example 1
  Signed-off-by: mgoin <[email protected]>
* Fix Einsum in deepgemm
* Fix `libc10.so` unimported error
* fix reference code
  Signed-off-by: youkaichao <[email protected]>
* adding missing indexer args
* passing index args into the module
* init
  Signed-off-by: Chen Zhang <[email protected]>
* build indexer k cache metadata
* prefill indexer, but weight_proj will output -inf
* unquantized paged indexer, still have -inf issue
* remove support material
* adding topk_indices mask
* add weight scale
* unittest infrastructure and fix weight_proj, numeric error due to quantization
* varlen prefill passed
* paged prefill
* add indices mask
---------
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Lucia Fang <[email protected]>
* prefill mla
  Signed-off-by: Chen Zhang <[email protected]>
* can run now
  Signed-off-by: Chen Zhang <[email protected]>
* tmp
  Signed-off-by: Chen Zhang <[email protected]>
* can output the first token
  Signed-off-by: Chen Zhang <[email protected]>
* fix bug
  Signed-off-by: Chen Zhang <[email protected]>
* remove some debug
  Signed-off-by: Chen Zhang <[email protected]>
* update
  Signed-off-by: Chen Zhang <[email protected]>
* hack through cu_seqlen_ks exploding issue
* update basic.py
  Signed-off-by: Chen Zhang <[email protected]>
* remove some unnecessary changes
  Signed-off-by: Chen Zhang <[email protected]>
* clean up
  Signed-off-by: Chen Zhang <[email protected]>
---------
Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: Yongye Zhu <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Fix MLA for non dsv32 models
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Locally verified this PR has correct results:
local-completions (model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9613|± |0.0053|
| | |strict-match | 5|exact_match|↑ |0.9613|± |0.0053|
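As a sanity check on the reported standard error, the sketch below recomputes it from the accuracy alone. It assumes the run covered the full GSM8K test split of 1,319 problems (consistent with `limit: None`) and that the reported value is the sample standard error of a mean of 0/1 scores. The function name is illustrative and not taken from the evaluation harness.

```python
import math


def binary_mean_stderr(accuracy: float, n: int) -> float:
    """Sample standard error of the mean of n 0/1 scores whose mean is `accuracy`."""
    # For 0/1 scores the sample variance is p * (1 - p) * n / (n - 1),
    # so the standard error of the mean is sqrt(p * (1 - p) / (n - 1)).
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))


# Assumption: the whole GSM8K test split (1,319 problems) was evaluated.
GSM8K_TEST_SIZE = 1319

stderr = binary_mean_stderr(0.9613, GSM8K_TEST_SIZE)
print(f"{stderr:.4f}")
```

Under these assumptions the result rounds to 0.0053, which agrees with the Stderr column above.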
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
@youkaichao Can you help to try DeepSeek-R1? I got the following errors:

descale_k is None
), "descale_q and descale_k should be both None or both not None"

if (descale_q is not None) and (descale_k is not None):
Suggested change:
- if (descale_q is not None) and (descale_k is not None):
+ if indices is not None:
@LucasWilkinson does this make sense? @heheda12345's error seems to indicate that DeepSeek-R1 goes into this branch and calls torch.ops._flashmla_extension_C.fwd_kvcache_mla_fp8
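To make the difference between the two conditions concrete, here is a minimal sketch that compares them as plain predicates. Only the assertion message and the two `if` conditions come from the diff above; the helper names and the two example calls are hypothetical, and the sketch does not claim to know which arguments DeepSeek-R1 actually passes.

```python
def check_descale_pair(descale_q, descale_k) -> None:
    # Mirrors the assertion quoted in the diff: the two scales travel together.
    assert (descale_q is None) == (descale_k is None), (
        "descale_q and descale_k should be both None or both not None"
    )


def takes_branch_original(descale_q, descale_k, indices) -> bool:
    """Original condition: enter the branch whenever both descale factors are set."""
    check_descale_pair(descale_q, descale_k)
    return (descale_q is not None) and (descale_k is not None)


def takes_branch_suggested(descale_q, descale_k, indices) -> bool:
    """Suggested condition: enter the branch only when sparse indices are given."""
    check_descale_pair(descale_q, descale_k)
    return indices is not None


# Hypothetical calls, for illustration only.
dense_call = dict(descale_q=1.0, descale_k=1.0, indices=None)
sparse_call = dict(descale_q=None, descale_k=None, indices=[0, 3, 7])

for name, call in (("dense", dense_call), ("sparse", sparse_call)):
    print(name, takes_branch_original(**call), takes_branch_suggested(**call))
```

In this sketch, a call that sets both descale factors but passes no indices enters the branch under the original condition and skips it under the suggested one, which is the behavior the review comment is asking about.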
# Note(hc): need revisit when we support DCP with decode query_len > 1.
return out.squeeze(1), softmax_lse.squeeze(-1)
@LucasWilkinson do we need this as well for DCP?
The issue reported by @heheda12345 seems to be a kvcache dtype issue. Merging first to unblock further optimizations. @LucasWilkinson might help investigate further.
It seems DSV3 AWQ-quantized checkpoints are broken after this PR. The error message is below; I will open an issue for it:
RuntimeError: Expected q.dtype() == torch::kFloat8_e4m3fn to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
A CI test was reportedly broken by this (now failing on main):
https://buildkite.com/vllm/ci/builds/32959#01999b1b-bec0-44a2-bca0-2523a6209558
Edit: I have opened a fix here: #25978
@heheda12345 @cjackal can you help check if #25956 solves the problem?
… still failing tests (#292) vllm-project/vllm#25896 --------- Signed-off-by: Chendi Xue <[email protected]>
Confirmed that it works normally after your PR. Thank you for the prompt bugfix!
Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: Yongye Zhu <[email protected]> Signed-off-by: Barry Kang <[email protected]> Signed-off-by: Lucia Fang <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]> Co-authored-by: yewentao256 <[email protected]> Co-authored-by: Wentao Ye <[email protected]> Co-authored-by: mgoin <[email protected]> Co-authored-by: Lucia Fang <[email protected]> Co-authored-by: Lucia Fang <[email protected]> Co-authored-by: NickLucche <[email protected]> Co-authored-by: Siyuan Fu <[email protected]> Co-authored-by: Matthew Bonanni <[email protected]> Co-authored-by: Xiaozhu Meng <[email protected]> Co-authored-by: Barry Kang <[email protected]> Signed-off-by: simon-mo <[email protected]>
… still failing tests (vllm-project#292) vllm-project/vllm#25896 --------- Signed-off-by: Chendi Xue <[email protected]> Signed-off-by: Iryna Boiko <[email protected]>
Rebased dsv32, based on #25869
Run command
gsm8k
gsm8k, 20-shot