feat(deepseek-v4): support mixed prefill/decode batches #122
Merged
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
497854b to aae2bb8
jasl added a commit to jasl/tokenspeed that referenced this pull request on May 13, 2026
Lock in the configuration that lifts SM120 DSv4-Flash decode past the bs=1 ceiling. Headline: c=8 reaches 75.56 tok/s output throughput (vs 22.04 tok/s at c=1, ``+243% / 3.4×``) with all 16 queued prompts completing.

Key knobs vs the prior baseline:

* ``--gpu-memory-utilization 0.98`` (was 0.80): model weights take 74 GiB of the 96 GiB GPU; bumping the cap leaves ~17 GiB for KV pool + CUDA graph pool + activations, matching vLLM's measured layout on the same hardware.
* ``--max-total-tokens 16384`` (was 4096): 8 seqs × 2k tokens needs a 16k KV pool budget; the prior 4k could only host 2 seqs at our prompt size.
* ``--max-num-seqs 8`` (was 4); ``--max-cudagraph-capture-size 4`` (was 2); ``PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`` in the bench wrapper to reduce graph-pool fragmentation.

Also documents:

* Memory layout snapshots (idle 86.2 GiB, post-traffic peak 95.4 GiB).
* Implications for the parked V2 Stage 2 (pre-projection multi-stream): at c=4-8 the input GEMMs grow 4×-8×, so Stage 2 may revive once decode batch consistently sits above ~16.
* TTFT bottleneck at 10-12 s with 8 concurrent prefills serial-queueing; upstream PR ``lightseekorg#122`` (mixed prefill/decode batching) is the next lever once the legacy HTTP server stack reconciliation is handled.
* Known parking-lot item: cherry-picking PR lightseekorg#122 onto this fork hit an HTTP-server-stack mismatch (our rebase removed the legacy ``api_server.py`` / ``http_server.py`` path that the bench uses). Future attempts need either ``smg_grpc_servicer`` installed or a separate "restore HTTP server stack" patch landed first.

Signed-off-by: jasl <jasl9187@hotmail.com>
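The sizing arithmetic in the commit message above can be checked with a short sketch. It assumes "2k" means 2048 tokens per sequence and that the KV pool is budgeted as a flat token count; the helper names are hypothetical and are not part of tokenspeed.

```python
# Figures quoted in the commit message above.
TOK_S_C1 = 22.04   # output throughput at concurrency 1
TOK_S_C8 = 75.56   # output throughput at concurrency 8

# Assumption: "2k tokens" per sequence is read as 2048.
TOKENS_PER_SEQ = 2048


def kv_tokens_needed(num_seqs: int, tokens_per_seq: int) -> int:
    """Token budget required to keep num_seqs sequences resident at once."""
    return num_seqs * tokens_per_seq


def max_resident_seqs(max_total_tokens: int, tokens_per_seq: int) -> int:
    """How many full-length sequences fit in a given token budget."""
    return max_total_tokens // tokens_per_seq


speedup = TOK_S_C8 / TOK_S_C1          # throughput ratio, c=8 vs c=1
gain_pct = (speedup - 1.0) * 100.0     # same ratio as a percentage gain

if __name__ == "__main__":
    print(f"speedup: {speedup:.1f}x ({gain_pct:+.0f}%)")
    print("budget for 8 seqs:", kv_tokens_needed(8, TOKENS_PER_SEQ))
    print("seqs at 4096 budget:", max_resident_seqs(4096, TOKENS_PER_SEQ))
    print("seqs at 16384 budget:", max_resident_seqs(16384, TOKENS_PER_SEQ))
```

Under these assumptions the numbers agree with the message: 8 sequences need a 16384-token budget, and the earlier 4096-token budget holds 2.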
Signed-off-by: lightseek-bot <243258330+lightseek-bot@users.noreply.github.com>
lightseek-bot approved these changes on May 14, 2026
jasl added a commit to jasl/tokenspeed that referenced this pull request on May 14, 2026
…ightseekorg#144)

Rebased onto upstream/main 4d3b7dc which now includes:

- PR lightseekorg#122 (mixed prefill/decode batches in DSv4 indexer)
- PR lightseekorg#144 (switch SMG gateway packages to `tokenspeed-smg`)
- PR lightseekorg#138/lightseekorg#143/lightseekorg#141/lightseekorg#139/lightseekorg#137/lightseekorg#136/lightseekorg#135 (xgrammar, AMD sampling, etc.)

Conflicts resolved (7 files):

- generation_output_processor.py: keep both lightseekorg#122's is_decode_slot and our output_top_logprobs_val/idx handling
- attention/backends/deepseek_v4.py: keep both lightseekorg#122's triton indexer import and our current_platform import
- models/deepseek_v4.py: keep all of lightseekorg#122's prefill/decode metadata dataclasses + chunk helpers and our SM12x decode all-candidate fast path helper + decode-only early-return fast path before lightseekorg#122's mixed-mode branch; took our defensive getattr for indexer_state_block_table
- test_deepseek_v4_config.py: keep lightseekorg#122's DeepseekV4MLP import
- deepseek_v4_attention.cu: keep both lightseekorg#122's gather_paged_indexer_mxfp4_cache function and our SM12x kernels; preserve lightseekorg#122's "materialize K at activation dtype before UE8M0 absmax" correctness tweak using our kv_slot/kv_cache_block_size variable names
- deepseek_v4_attention_binding.cu: union of forward decls
- deepseek_v4_attention.py (kernel shim): both indexer_mxfp4_paged_gather and our SM12x helpers

Local checks: AST-parse all 5 touched Python files, brace-balance both .cu files. Workstation build + smoke test in follow-up.
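The local checks named in the commit message above (AST-parse the Python files, brace-balance the .cu files) could look roughly like the sketch below. This is an illustration, not the script that was actually run: the function names are hypothetical, and the brace counter is deliberately naive in that it does not skip braces inside string literals or comments.

```python
import ast


def python_parses(source: str) -> bool:
    """Return True if the text is syntactically valid Python."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def braces_balanced(source: str) -> bool:
    """Return True if every '{' has a later matching '}'.

    Naive by design: braces in strings and comments are counted too.
    """
    depth = 0
    for ch in source:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace with no opener before it
                return False
    return depth == 0


if __name__ == "__main__":
    print(python_parses("x = 1\n"))
    print(braces_balanced("void f() { if (x) { } }"))
```

To run this over a merge result, each file's text would be read (for example with `pathlib.Path(p).read_text()`) and passed to the matching function. Leftover conflict markers such as `<<<<<<<` make `ast.parse` fail, which is what makes the parse step a useful smoke check after a rebase.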
Summary
Validation
pre-commit run --all-files
exact_match = 0.94 ± 0.0339