feat(deepseek-v4): support mixed prefill/decode batches #122

Merged
lightseek-bot merged 7 commits into main from pr-stack/v4-pr5-reapply
May 14, 2026

Conversation

@dongjiyingdjy
Contributor

Summary

  • Enable scheduler/runtime support for mixed prefill/decode batches for DeepSeek V4.
  • Thread MIXED forward mode through input buffering, model execution, output processing, CUDA graph gating, and DeepSeek V4 attention metadata.
  • Add DeepSeek V4 sparse prefill/indexer support plus kernel wrappers/custom ops needed by mixed batches.
  • Keep mixed batches disabled for speculative/MTP paths; MTP remains for a later PR.
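
To make the idea of a mixed batch concrete, here is a minimal sketch of how a scheduler might pack decode and prefill work into one forward pass under a token budget. It is not the code from this PR: `ForwardMode`, `Request`, `build_batch`, `token_budget`, and `allow_mixed` are hypothetical names, and the "prefill first when mixing is off" fallback is an assumed policy chosen only for illustration (for example, for the speculative/MTP paths where mixed batches stay disabled).

```python
from dataclasses import dataclass
from enum import Enum, auto


class ForwardMode(Enum):
    # Hypothetical enum; the real project may define its modes differently.
    PREFILL = auto()
    DECODE = auto()
    MIXED = auto()


@dataclass
class Request:
    rid: str
    prompt_len: int     # total prompt tokens
    prefilled: int = 0  # prompt tokens already processed

    @property
    def is_decoding(self) -> bool:
        return self.prefilled >= self.prompt_len


def build_batch(requests, token_budget, allow_mixed=True):
    """Return (mode, [(rid, num_tokens), ...]) for one forward pass."""
    decode = [r for r in requests if r.is_decoding]
    prefill = [r for r in requests if not r.is_decoding]
    if not allow_mixed and prefill and decode:
        decode = []  # assumed policy: run prefill alone when mixing is off

    plan, budget = [], token_budget
    # Each decoding request contributes exactly one token.
    for r in decode:
        if budget == 0:
            break
        plan.append((r.rid, 1))
        budget -= 1
    has_decode = bool(plan)

    # Fill the remaining budget with (possibly chunked) prefill tokens.
    has_prefill = False
    for r in prefill:
        if budget == 0:
            break
        chunk = min(r.prompt_len - r.prefilled, budget)
        plan.append((r.rid, chunk))
        budget -= chunk
        has_prefill = True

    if has_decode and has_prefill:
        mode = ForwardMode.MIXED
    elif has_prefill:
        mode = ForwardMode.PREFILL
    else:
        mode = ForwardMode.DECODE
    return mode, plan
```

With one request already decoding and one still waiting on a 100-token prompt, a budget of 32 tokens yields a `MIXED` batch of 1 decode token plus a 31-token prefill chunk; with `allow_mixed=False` the same inputs yield a pure `PREFILL` batch.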

Validation

  • pre-commit run --all-files
  • GSM8K 5-shot limit 50: exact_match = 0.94 ± 0.0339
  • E2E mixed-batch smoke test: scheduler emitted a mixed batch and all requests returned 200
  • Targeted unit tests for scheduler mixed batching, DeepSeek V4 config, CLI compatibility, generation output processing, attention ops, and kernel wrapper coverage
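
As a sanity check on the GSM8K line above, the reported spread is consistent with a standard error of the mean over 50 binary scores. The sketch below assumes that interpretation, and assumes 47 of the 50 samples were correct (inferred from 0.94 × 50, not stated in the PR); `exact_match_stderr` is a hypothetical helper.

```python
import math


def exact_match_stderr(num_correct, num_samples):
    """Mean and standard error of the mean for 0/1 exact-match scores."""
    p = num_correct / num_samples
    # Sample variance of a 0/1 variable (Bessel-corrected, n - 1 denominator).
    sample_var = p * (1 - p) * num_samples / (num_samples - 1)
    return p, math.sqrt(sample_var / num_samples)


p, se = exact_match_stderr(47, 50)  # 47 correct is an inferred count
print(f"{p:.2f} ± {se:.4f}")        # prints "0.94 ± 0.0339"
```

With only 50 samples the interval is wide, so this result is best read as a smoke-level accuracy check rather than a precise benchmark number.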

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
@dongjiyingdjy dongjiyingdjy requested a review from a team as a code owner May 13, 2026 12:21
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
@dongjiyingdjy dongjiyingdjy force-pushed the pr-stack/v4-pr5-reapply branch from 497854b to aae2bb8 Compare May 13, 2026 13:50
jasl added a commit to jasl/tokenspeed that referenced this pull request May 13, 2026
Lock in the configuration that lifts SM120 DSv4-Flash decode past the
bs=1 ceiling. Headline: c=8 reaches 75.56 tok/s output throughput
(vs 22.04 tok/s at c=1, ``+243% / 3.4×``) with all 16 queued prompts
completing.

Key knobs vs the prior baseline:
* ``--gpu-memory-utilization 0.98`` (was 0.80) — model weights take
  74 GiB of the 96 GiB GPU; bumping the cap leaves ~17 GiB for KV pool
  + CUDA graph pool + activations, matching vLLM's measured layout on
  the same hardware.
* ``--max-total-tokens 16384`` (was 4096) — 8 seqs × 2k tokens needs a
  16k KV pool budget; the prior 4k could only host 2 seqs at our prompt
  size.
* ``--max-num-seqs 8`` (was 4); ``--max-cudagraph-capture-size 4``
  (was 2); ``PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`` in the
  bench wrapper to reduce graph-pool fragmentation.

Also documents:
* Memory layout snapshots (idle 86.2 GiB, post-traffic peak 95.4 GiB).
* Implications for the parked V2 Stage 2 (pre-projection multi-stream)
  — at c=4-8 the input GEMMs grow 4×-8×, so Stage 2 may revive once
  decode batch consistently sits above ~16.
* TTFT bottleneck at 10-12 s with 8 concurrent prefills serial-queueing;
  upstream PR ``lightseekorg#122`` (mixed prefill/decode
  batching) is the next lever once the legacy HTTP server stack
  reconciliation is handled.
* Known parking-lot item: cherry-picking PR lightseekorg#122 onto this fork hit
  an HTTP-server-stack mismatch (our rebase removed the legacy
  ``api_server.py`` / ``http_server.py`` path that the bench uses).
  Future attempts need either ``smg_grpc_servicer`` installed or a
  separate "restore HTTP server stack" patch landed first.

Signed-off-by: jasl <jasl9187@hotmail.com>
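
The sizing arithmetic in the commit message above can be reproduced with a few lines. This is a sketch under one stated assumption: "2k tokens" is taken to mean 2048 tokens per sequence. The helper names are hypothetical and only restate the numbers quoted in the message.

```python
def kv_pool_tokens(num_seqs, tokens_per_seq):
    """Token budget needed to keep num_seqs sequences resident."""
    return num_seqs * tokens_per_seq


def max_resident_seqs(pool_tokens, tokens_per_seq):
    """How many full-length sequences fit in a given token budget."""
    return pool_tokens // tokens_per_seq


def speedup(new_tok_s, base_tok_s):
    """Return (ratio, percent gain) between two throughput figures."""
    ratio = new_tok_s / base_tok_s
    return ratio, (ratio - 1) * 100


TOKENS_PER_SEQ = 2048  # assumption: "2k" means 2048

print(kv_pool_tokens(8, TOKENS_PER_SEQ))        # 16384, the new --max-total-tokens
print(max_resident_seqs(4096, TOKENS_PER_SEQ))  # 2, the old ceiling
ratio, pct = speedup(75.56, 22.04)
print(f"{ratio:.1f}x, +{pct:.0f}%")             # 3.4x, +243%
```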
dongjiyingdjy and others added 4 commits May 14, 2026 01:22
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
Signed-off-by: lightseek-bot <243258330+lightseek-bot@users.noreply.github.com>
@lightseek-bot lightseek-bot merged commit 4d3b7dc into main May 14, 2026
8 of 32 checks passed
@lightseek-bot lightseek-bot deleted the pr-stack/v4-pr5-reapply branch May 14, 2026 08:47
jasl added a commit to jasl/tokenspeed that referenced this pull request May 14, 2026
…ightseekorg#144)

Rebased onto upstream/main 4d3b7dc which now includes:
- PR lightseekorg#122 (mixed prefill/decode batches in DSv4 indexer)
- PR lightseekorg#144 (switch SMG gateway packages to `tokenspeed-smg`)
- PR lightseekorg#138/lightseekorg#143/lightseekorg#141/lightseekorg#139/lightseekorg#137/lightseekorg#136/lightseekorg#135 (xgrammar, AMD sampling, etc.)

Conflicts resolved (7 files):
- generation_output_processor.py: keep both lightseekorg#122's is_decode_slot and our
  output_top_logprobs_val/idx handling
- attention/backends/deepseek_v4.py: keep both lightseekorg#122's triton indexer import
  and our current_platform import
- models/deepseek_v4.py: keep all of lightseekorg#122's prefill/decode metadata
  dataclasses + chunk helpers and our SM12x decode all-candidate fast path
  helper + decode-only early-return fast path before lightseekorg#122's mixed-mode
  branch; took our defensive getattr for indexer_state_block_table
- test_deepseek_v4_config.py: keep lightseekorg#122's DeepseekV4MLP import
- deepseek_v4_attention.cu: keep both lightseekorg#122's gather_paged_indexer_mxfp4_cache
  function and our SM12x kernels; preserve lightseekorg#122's "materialize K at
  activation dtype before UE8M0 absmax" correctness tweak using our
  kv_slot/kv_cache_block_size variable names
- deepseek_v4_attention_binding.cu: union of forward decls
- deepseek_v4_attention.py (kernel shim): both indexer_mxfp4_paged_gather
  and our SM12x helpers

Local checks: AST-parse all 5 touched Python files, brace-balance both
.cu files. Workstation build + smoke test in follow-up.
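
The local checks named in the commit message above (AST-parsing the Python files and brace-balancing the `.cu` files) can be approximated as follows. This is an illustrative sketch, not the script the author ran: `python_parses` and `braces_balanced` are hypothetical helpers, and the brace counter is deliberately naive in that it does not skip braces inside string literals or comments.

```python
import ast


def python_parses(source: str) -> bool:
    """True if the text is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def braces_balanced(source: str) -> bool:
    """Naive check that every '{' has a later matching '}'."""
    depth = 0
    for ch in source:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # closing brace with no opener
                return False
    return depth == 0
```

Checks like these catch leftover conflict markers and dropped braces after a merge, but they say nothing about whether the code compiles or behaves correctly, which is why the message defers the build and smoke test to a follow-up.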
3 participants