feat(deepseek-v4): support mixed prefill/decode batches by dongjiyingdjy · Pull Request #122 · lightseekorg/tokenspeed

dongjiyingdjy · 2026-05-13T12:21:10Z

Summary

Enable scheduler/runtime support for mixed prefill/decode batches for DeepSeek V4.
Thread MIXED forward mode through input buffering, model execution, output processing, CUDA graph gating, and DeepSeek V4 attention metadata.
Add DeepSeek V4 sparse prefill/indexer support plus kernel wrappers/custom ops needed by mixed batches.
Keep mixed batches disabled for speculative/MTP paths; MTP remains for a later PR.
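
To illustrate what a mixed batch means at the scheduler level, here is a minimal sketch. It is not the code in this PR: `Request`, `Batch`, `build_batch`, `token_budget`, and `enable_mixed` are hypothetical names chosen for illustration, and the prefill-first fallback is an assumption. Only the MIXED forward mode, the per-slot decode flag (`is_decode_slot`), and the rule that speculative paths never get mixed batches come from the PR itself.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class ForwardMode(Enum):
    PREFILL = auto()
    DECODE = auto()
    MIXED = auto()


@dataclass
class Request:  # hypothetical request record
    rid: str
    pending_prefill_tokens: int  # 0 means the request is already decoding


@dataclass
class Batch:  # hypothetical batch record
    mode: ForwardMode
    rids: List[str]
    token_counts: List[int]
    is_decode_slot: List[bool]


def build_batch(
    requests: List[Request],
    token_budget: int,
    enable_mixed: bool = False,  # mixed scheduling is off unless requested
    speculative: bool = False,
) -> Optional[Batch]:
    # Mixed batches stay disabled whenever a speculative/MTP path is active.
    allow_mixed = enable_mixed and not speculative
    decodes = [r for r in requests if r.pending_prefill_tokens == 0]
    prefills = [r for r in requests if r.pending_prefill_tokens > 0]

    # Assumed fallback for this sketch: without mixing, prefill goes first
    # and decode requests wait for a later step.
    if not allow_mixed and prefills:
        decodes = []

    rids, counts, slots = [], [], []
    budget = token_budget
    for r in decodes:  # each decode slot consumes exactly one token
        if budget < 1:
            break
        rids.append(r.rid)
        counts.append(1)
        slots.append(True)
        budget -= 1
    for r in prefills:  # prefill chunks fill whatever budget remains
        if budget < 1:
            break
        chunk = min(r.pending_prefill_tokens, budget)
        rids.append(r.rid)
        counts.append(chunk)
        slots.append(False)
        budget -= chunk

    if not rids:
        return None
    if all(slots):
        mode = ForwardMode.DECODE
    elif not any(slots):
        mode = ForwardMode.PREFILL
    else:
        mode = ForwardMode.MIXED
    return Batch(mode, rids, counts, slots)
```

With two decoding requests, one request holding 5 pending prefill tokens, and a budget of 6, the sketch yields a MIXED batch with token counts `[1, 1, 4]`. With `speculative=True` the same inputs fall back to a PREFILL-only batch.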

Validation

pre-commit run --all-files
GSM8K 5-shot limit 50: exact_match = 0.94 ± 0.0339
E2E mixed-batch smoke test: scheduler emitted a mixed batch and all requests returned 200
Targeted unit tests for scheduler mixed batching, DeepSeek V4 config, CLI compatibility, generation output processing, attention ops, and kernel wrapper coverage
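
For readers checking the GSM8K line: the reported ± value is consistent with a standard error of the mean computed from the sample variance. The sketch below assumes 47 of 50 answers were correct (implied by 0.94 × 50) and assumes that formula. The PR does not state how the evaluation tool derives the figure.

```python
import math


def mean_and_stderr(scores):
    """Return the mean and its standard error, using the n-1 sample variance."""
    n = len(scores)
    mean = sum(scores) / n
    sample_var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(sample_var / n)


# Assumption: 47 correct and 3 incorrect answers out of the 50-example limit.
scores = [1] * 47 + [0] * 3
mean, stderr = mean_and_stderr(scores)
print(f"exact_match = {mean:.2f} ± {stderr:.4f}")
```

At this sample size the interval is wide, so the run is best read as a smoke check rather than a precise accuracy measurement.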

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

Lock in the configuration that lifts SM120 DSv4-Flash decode past the bs=1 ceiling.

Headline: c=8 reaches 75.56 tok/s output throughput (vs 22.04 tok/s at c=1, ``+243% / 3.4×``) with all 16 queued prompts completing.

Key knobs vs the prior baseline:

* ``--gpu-memory-utilization 0.98`` (was 0.80): model weights take 74 GiB of the 96 GiB GPU; bumping the cap leaves ~17 GiB for KV pool + CUDA graph pool + activations, matching vLLM's measured layout on the same hardware.
* ``--max-total-tokens 16384`` (was 4096): 8 seqs × 2k tokens needs a 16k KV pool budget; the prior 4k could only host 2 seqs at our prompt size.
* ``--max-num-seqs 8`` (was 4); ``--max-cudagraph-capture-size 4`` (was 2); ``PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`` in the bench wrapper to reduce graph-pool fragmentation.

Also documents:

* Memory layout snapshots (idle 86.2 GiB, post-traffic peak 95.4 GiB).
* Implications for the parked V2 Stage 2 (pre-projection multi-stream): at c=4-8 the input GEMMs grow 4×-8×, so Stage 2 may revive once decode batch consistently sits above ~16.
* TTFT bottleneck at 10-12 s with 8 concurrent prefills serial-queueing; upstream PR ``lightseekorg#122`` (mixed prefill/decode batching) is the next lever once the legacy HTTP server stack reconciliation is handled.
* Known parking-lot item: cherry-picking PR lightseekorg#122 onto this fork hit an HTTP-server-stack mismatch (our rebase removed the legacy ``api_server.py`` / ``http_server.py`` path that the bench uses). Future attempts need either ``smg_grpc_servicer`` installed or a separate "restore HTTP server stack" patch landed first.

Signed-off-by: jasl <jasl9187@hotmail.com>
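
The headline ratio and the KV pool sizing in that commit message can be reproduced with a few lines of arithmetic. This is a sketch only: it reads "2k tokens" as 2048 tokens per sequence, which is an assumption, and `max_resident_seqs` is a hypothetical helper, not a function from the repository.

```python
# Throughput figures quoted in the commit message (tok/s).
tput_c1 = 22.04
tput_c8 = 75.56

speedup = tput_c8 / tput_c1
gain_pct = (speedup - 1.0) * 100.0
print(f"speedup = {speedup:.1f}x, gain = +{gain_pct:.0f}%")


def max_resident_seqs(max_total_tokens: int, tokens_per_seq: int) -> int:
    """How many sequences of a given length fit in the KV token budget."""
    return max_total_tokens // tokens_per_seq


TOKENS_PER_SEQ = 2048  # assumption: "2k tokens" means 2048

print(max_resident_seqs(16384, TOKENS_PER_SEQ))  # new --max-total-tokens
print(max_resident_seqs(4096, TOKENS_PER_SEQ))   # prior --max-total-tokens
```

The two budget calls give 8 and 2 resident sequences, which matches the stated reason for raising ``--max-total-tokens`` alongside ``--max-num-seqs 8``.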

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

Signed-off-by: lightseek-bot <243258330+lightseek-bot@users.noreply.github.com>

…ightseekorg#144)

Rebased onto upstream/main 4d3b7dc which now includes:
- PR lightseekorg#122 (mixed prefill/decode batches in DSv4 indexer)
- PR lightseekorg#144 (switch SMG gateway packages to `tokenspeed-smg`)
- PR lightseekorg#138/lightseekorg#143/lightseekorg#141/lightseekorg#139/lightseekorg#137/lightseekorg#136/lightseekorg#135 (xgrammar, AMD sampling, etc.)

Conflicts resolved (7 files):
- generation_output_processor.py: keep both lightseekorg#122's is_decode_slot and our output_top_logprobs_val/idx handling
- attention/backends/deepseek_v4.py: keep both lightseekorg#122's triton indexer import and our current_platform import
- models/deepseek_v4.py: keep all of lightseekorg#122's prefill/decode metadata dataclasses + chunk helpers and our SM12x decode all-candidate fast path helper + decode-only early-return fast path before lightseekorg#122's mixed-mode branch; took our defensive getattr for indexer_state_block_table
- test_deepseek_v4_config.py: keep lightseekorg#122's DeepseekV4MLP import
- deepseek_v4_attention.cu: keep both lightseekorg#122's gather_paged_indexer_mxfp4_cache function and our SM12x kernels; preserve lightseekorg#122's "materialize K at activation dtype before UE8M0 absmax" correctness tweak using our kv_slot/kv_cache_block_size variable names
- deepseek_v4_attention_binding.cu: union of forward decls
- deepseek_v4_attention.py (kernel shim): both indexer_mxfp4_paged_gather and our SM12x helpers

Local checks: AST-parse all 5 touched Python files, brace-balance both .cu files. Workstation build + smoke test in follow-up.
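
The local checks named in that commit message (AST-parse and brace-balance) can be approximated as below. This is a sketch of the idea, not the script the author ran: both function names are hypothetical, and the brace counter is deliberately naive, since it also counts braces inside string literals and comments.

```python
import ast


def python_source_parses(source: str) -> bool:
    """True if the text is syntactically valid Python (no leftover conflict markers)."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def braces_balanced(source: str) -> bool:
    """Naive check that every '{' has a matching '}' in C/CUDA-style text."""
    depth = 0
    for ch in source:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace appeared before its opener
                return False
    return depth == 0


# A stray merge marker is a syntax error, so the first check catches it.
print(python_source_parses("<<<<<<< HEAD\nx = 1\n"))
print(braces_balanced("void f() { if (x) { } }"))
```

Checks like these only catch structural damage from conflict resolution. They say nothing about whether the merged code compiles or behaves correctly, which is why the message defers the build and smoke test to a follow-up.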

feat(deepseek-v4): support mixed prefill decode

21dd822

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

dongjiyingdjy requested a review from a team as a code owner May 13, 2026 12:21

style(scheduler): format mixed forward prefill check

aae2bb8

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

dongjiyingdjy force-pushed the pr-stack/v4-pr5-reapply branch from 497854b to aae2bb8 Compare May 13, 2026 13:50

dongjiyingdjy and others added 4 commits May 14, 2026 01:22

fix(runtime): restore cache sync debug env helper

1801fdf

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

fix(runtime): disable mixed scheduling by default

7d4d91e

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

fix(runtime): preserve speculative usage in raw token output

1f86c32

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

Merge branch 'main' into pr-stack/v4-pr5-reapply

aa8840a

lightseek-bot added the high priority label May 14, 2026

lightseek-bot assigned SimonCqk May 14, 2026

Rename mixed chunk flag to mixed batch

f8701a8

Signed-off-by: lightseek-bot <243258330+lightseek-bot@users.noreply.github.com>

lightseek-bot approved these changes May 14, 2026

lightseek-bot merged commit 4d3b7dc into main May 14, 2026
8 of 32 checks passed

lightseek-bot deleted the pr-stack/v4-pr5-reapply branch May 14, 2026 08:47

rjzhb mentioned this pull request May 18, 2026

feat(trtllm-MHA): support mixed prefill/decode batches #176

Open

2 tasks

lightseek-bot merged 7 commits into main from pr-stack/v4-pr5-reapply

3 participants
