feat: mixed prefill-decode batching (part 3: compatibility with speculative decoding for MLA models) by syuoni · Pull Request #205 · lightseekorg/tokenspeed

syuoni · 2026-05-21T03:57:55Z

Summary

Extend --enable-mixed-batch from DeepSeek V4 sparse attention to Kimi K2.5 (and all MLA backends), with first-class speculative-decoding (spec-dec) support.

Earlier parts of the mixed prefill-decode batching series:

Part 1: feat(deepseek-v4): support mixed prefill/decode batches #122
Part 2: refactor: forward mode consolidation #164

Changes

Scheduler & runtime

C++ scheduler: decode-first priority under enable_mixed_prefill_decode, so MIXED batches are actually emitted under long-prefill workloads.
Lifted the spec-dec gate on --enable-mixed-batch; users opt in per-workload.
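
One plausible reading of decode-first priority, sketched in Python rather than the C++ scheduler's own code: decode requests claim token budget first, and prefill chunks fill whatever remains, so a long prompt cannot crowd decode rows out of the batch. The names `plan_mixed_batch`, `token_budget`, and `tokens_per_decode` are hypothetical and do not come from the PR.

```python
from typing import List, Tuple


def plan_mixed_batch(
    decode_reqs: List[str],
    prefill_reqs: List[Tuple[str, int]],  # (request id, remaining prompt tokens)
    token_budget: int,
    tokens_per_decode: int = 1,
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Hypothetical decode-first planner: returns (prefill chunks, decode ids)."""
    decodes: List[str] = []
    for rid in decode_reqs:
        if token_budget < tokens_per_decode:
            break
        decodes.append(rid)
        token_budget -= tokens_per_decode

    # Prefill only gets the budget that decode rows left over.
    prefills: List[Tuple[str, int]] = []
    for rid, remaining in prefill_reqs:
        if token_budget == 0:
            break
        chunk = min(remaining, token_budget)
        prefills.append((rid, chunk))
        token_budget -= chunk
    return prefills, decodes


prefills, decodes = plan_mixed_batch(["d0", "d1"], [("p0", 10_000)], token_budget=8)
```

With a budget of 8 tokens, both decode requests are admitted and the long prompt receives a 6-token chunk, so the resulting batch contains both kinds of rows.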

MLA backends (all gain MIXED support)

tokenspeed_mla, trtllm_mla, flashmla: init_forward_metadata fills both prefill and decode metadata under MIXED, with a num_extends discriminator on the decode metadata for slicing at kernel-call time.
DeepseekV3AttentionMLA.forward runs pre_attn_comm once, slices by num_prefill_tokens, dispatches prefill and decode through their native paths, then applies a single shared o_proj.
In-place out= plumbing eliminates the per-call BF16 copy in chunked prefill.
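
The control flow of the MIXED forward described above can be sketched as follows. This is a minimal illustration, not the code in DeepseekV3AttentionMLA.forward: the function `mixed_attention_forward` and its callable parameters are hypothetical, tokens are plain strings instead of tensors, and it assumes prefill tokens are laid out before decode tokens.

```python
from typing import Callable, List, Sequence

Step = Callable[[Sequence[str]], List[str]]


def mixed_attention_forward(
    tokens: Sequence[str],
    num_prefill_tokens: int,
    prefill_attn: Step,
    decode_attn: Step,
    o_proj: Step,
) -> List[str]:
    # Layout assumption: prefill tokens first, decode tokens after them.
    prefill_out = prefill_attn(tokens[:num_prefill_tokens])
    decode_out = decode_attn(tokens[num_prefill_tokens:])
    # One output projection over the concatenated attention outputs.
    return o_proj(prefill_out + decode_out)


out = mixed_attention_forward(
    ["p0", "p1", "p2", "d0", "d1"],
    num_prefill_tokens=3,
    prefill_attn=lambda toks: [f"prefill({t})" for t in toks],
    decode_attn=lambda toks: [f"decode({t})" for t in toks],
    o_proj=lambda toks: [f"o_proj({t})" for t in toks],
)
```

Each token goes through exactly one attention path, and the projection runs once over the full batch rather than once per path.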

Spec-dec + MIXED

Wrapper drafter double-init collapsed to a single call: each backend now fills both prefill + decode metadata under is_draft + extend_or_mixed, with seq_lens aliased to the drafter's live buffer for in-place multi-step advance.
LogitsProcessor collapsed from a 4-branch prune into a single gather_ids gather; row indices computed by the caller (ModelExecutor + Eagle drafter). Cleaner contract, fewer special cases, MIXED-with-verify just works.
New spec_num_tokens field on AttentionBackend with sentinel-aware config defaults.
EagleDraftInput.num_extends threaded for correct drafter dispatch under EXTEND target.
Three GSM8K accuracy bugs fixed (per-row vc delta, per-row is_decode_slot gate, hybrid sampler logprob writeback).
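
To illustrate the single-gather contract for LogitsProcessor, here is a sketch of how a caller might compute row indices for a MIXED batch. It assumes prefill requests need logits only for their last token while decode (verify) requests need logits for every one of their q_len_per_req candidate tokens; the helper `build_gather_ids` and that policy are assumptions, not taken from the PR.

```python
from typing import List


def build_gather_ids(
    prefill_lens: List[int], num_decode_reqs: int, q_len_per_req: int
) -> List[int]:
    """Hypothetical caller-side index computation for a prefill-first layout."""
    gather_ids: List[int] = []
    offset = 0
    for n in prefill_lens:
        offset += n
        gather_ids.append(offset - 1)  # last token of each prefill request
    for _ in range(num_decode_reqs):
        # Every candidate token of a decode request keeps its logits row.
        gather_ids.extend(range(offset, offset + q_len_per_req))
        offset += q_len_per_req
    return gather_ids


gather_ids = build_gather_ids([3, 2], num_decode_reqs=2, q_len_per_req=2)
```

With the indices precomputed, the processor itself only has to index the hidden states once, regardless of forward mode.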

Interface cleanup

Dropped redundant num_tokens arg from all init_forward_metadata* signatures.
Renamed forward-time local spec_num_tokens → q_len_per_req to distinguish per-call shape from configured constant.
Replaced set_decode_num_extends(int) setter with override_num_extends(int) context manager that restores prior value.
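
The setter-to-context-manager change can be sketched like this. `DecodeMetadata` is a hypothetical stand-in class; only the name override_num_extends and its restore-on-exit behavior come from the description above.

```python
from contextlib import contextmanager
from typing import Iterator


class DecodeMetadata:
    """Hypothetical stand-in for a backend's decode metadata."""

    def __init__(self, num_extends: int = 0) -> None:
        self.num_extends = num_extends

    @contextmanager
    def override_num_extends(self, num_extends: int) -> Iterator[None]:
        prior = self.num_extends
        self.num_extends = num_extends
        try:
            yield
        finally:
            self.num_extends = prior  # restored even if the body raises


meta = DecodeMetadata(num_extends=2)
with meta.override_num_extends(0):
    inside = meta.num_extends
after = meta.num_extends
```

Compared with a bare setter, the caller cannot forget to undo the override, which matters when the same metadata object is reused across drafter steps.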

CI

4 K2.5 eval YAMLs + 1 perf YAML + 6 agentic-bench shell configs gain --enable-mixed-batch.
Eval YAMLs switched to MLA drafter (tokenspeed_mla + kimi-k2.5-eagle3-mla) for MIXED-safety.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 91482d0584

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f863184d8a

chatgpt-codex-connector · 2026-05-23T02:16:25Z

                        sampling_info=sampling_info,
                        bs=bs,
-                        is_spec_decode=self.drafter is not None and num_extends <= 0,
+                        is_spec_decode=self.drafter is not None and num_extends < bs,


Restrict spec-grammar mode to true decode batches

Changing is_spec_decode to self.drafter is not None and num_extends < bs makes MIXED batches enter spec-grammar mode, but _forward_step still only uploads candidate tokens to the grammar runtime when ctx.forward_mode.is_decode() (schedule_fill(...input_ids_buf_slice=...)). In MIXED + speculative + grammar runs, this marks has_candidates=True while candidate buffers are not refreshed, so the capturable grammar path computes per-token masks from stale/incorrect candidate data, producing wrong constrained sampling behavior.
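
The effect of the predicate change can be tabulated with a small sketch. It assumes num_extends counts the prefill requests in the batch and bs is the total number of requests; the helper names are hypothetical.

```python
def batch_kind(num_extends: int, bs: int) -> str:
    """Classify a batch under the stated assumption about num_extends and bs."""
    if num_extends <= 0:
        return "decode"
    return "mixed" if num_extends < bs else "extend"


def is_spec_decode_old(has_drafter: bool, num_extends: int) -> bool:
    return has_drafter and num_extends <= 0


def is_spec_decode_new(has_drafter: bool, num_extends: int, bs: int) -> bool:
    return has_drafter and num_extends < bs


bs = 4
rows = [
    (n, batch_kind(n, bs), is_spec_decode_old(True, n), is_spec_decode_new(True, n, bs))
    for n in (0, 2, 4)
]
```

Only the mixed row differs between the two predicates, which is exactly the case the comment flags: the flag turns on while candidate upload is still gated on a pure decode forward mode.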

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51f14df981

chatgpt-codex-connector · 2026-05-23T03:52:50Z

                        sampling_info=sampling_info,
                        bs=bs,
-                        is_spec_decode=self.drafter is not None and num_extends <= 0,
+                        is_spec_decode=self.drafter is not None and num_extends < bs,


Keep spec-grammar path off for mixed batches in eager mode

Setting is_spec_decode true for any batch with num_extends < bs makes MIXED speculative batches enter the per-token grammar path even on the eager fallback. In that path, _fill_eager_bitmask assumes a pure decode layout and reads input_ids_buf[:bs*spec_num_tokens].view(bs, spec_num_tokens), but MIXED batches store ragged prefill tokens first and decode candidates after them, so grammar masks are generated from misaligned tokens. This produces incorrect constrained sampling whenever capturable grammar is disabled (or unavailable) and MIXED+spec decode is active.
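
The misalignment can be reproduced with plain lists. The sketch below mimics the `[: bs * spec_num_tokens].view(bs, spec_num_tokens)` read on a buffer that holds one 3-token prefill followed by two decode requests with two candidates each; the token values and the helper `view_rows` are made up for illustration.

```python
from typing import List


def view_rows(flat: List[int], rows: int, cols: int) -> List[List[int]]:
    """List-based analogue of flat[: rows * cols].view(rows, cols)."""
    flat = flat[: rows * cols]
    return [flat[r * cols : (r + 1) * cols] for r in range(rows)]


# Prefill tokens 101..103 first, then decode candidates (7, 8) and (9, 10).
input_ids_buf = [101, 102, 103, 7, 8, 9, 10]
num_prefill_tokens, num_decode_reqs, spec_num_tokens = 3, 2, 2
bs = 1 + num_decode_reqs  # one prefill request plus two decode requests

# Pure-decode assumption: rows straddle prefill and decode tokens.
pure_decode_view = view_rows(input_ids_buf, bs, spec_num_tokens)

# Skipping the prefill tokens first recovers the real candidate rows.
mixed_aware_view = view_rows(
    input_ids_buf[num_prefill_tokens:], num_decode_reqs, spec_num_tokens
)
```

Under the pure-decode assumption the second row pairs a prompt token with a candidate token, so any grammar mask derived from it belongs to no real request.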

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3bf882de69

chatgpt-codex-connector · 2026-05-23T04:38:21Z

+        mla_metadata, num_splits = get_mla_metadata(
+            plan_seq_lens.to(torch.int32),
+            num_heads_plan,
+            1,
        )


Build FlashMLA decode planner from decode rows only

In MIXED batches, this path computes flashmla_metadata/num_splits from full seq_lens (prefill + decode rows), but decode execution later consumes only the decode slice (block_table[num_extends:] in forward_extend/forward_decode). That misaligns planner metadata with the actual decode rows, so when num_extends > 0 the kernel can run with row planning derived from the wrong requests, leading to incorrect attention behavior or shape/runtime failures. The decode planner needs to be built (or consistently sliced) for decode rows only.
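
A minimal sketch of the suggested fix, using lists in place of tensors: slice both the planner input and the block table with the same num_extends offset so they describe the same rows. The helper `split_decode_rows` is hypothetical, and the actual planner call is deliberately left out.

```python
from typing import List, Tuple


def split_decode_rows(
    seq_lens: List[int], block_table: List[List[int]], num_extends: int
) -> Tuple[List[int], List[List[int]]]:
    """Return planner seq_lens and block-table rows for decode requests only."""
    plan_seq_lens = seq_lens[num_extends:]
    decode_block_table = block_table[num_extends:]
    # The planner input and the rows the kernel consumes must line up 1:1.
    assert len(plan_seq_lens) == len(decode_block_table)
    return plan_seq_lens, decode_block_table


# One prefill request (4096 tokens) followed by two decode requests.
plan_seq_lens, decode_block_table = split_decode_rows(
    seq_lens=[4096, 17, 33],
    block_table=[[0, 1], [2], [3]],
    num_extends=1,
)
```

Planning from the full `[4096, 17, 33]` would size the first decode row's work from the prefill request's length, which is the mismatch the comment describes.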

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 452d75aedf

chatgpt-codex-connector · 2026-05-23T04:51:50Z

+            self.draft_attn_backend.init_forward_metadata(
+                bs=padded_bs,
+                num_extends=num_extends,
+                req_pool_indices=req_pool_indices,
+                seq_lens=draft_seq_lens,


Reintroduce decode-metadata init for Triton draft backends

When _init_forward_metadata switched from two draft-backend init calls to a single call, EXTEND/MIXED draft batches no longer force a DECODE-shaped metadata refresh before Eagle step 2+. TritonAttnBackend stores only one forward_metadata, so after the single call it can still hold prefill-style qo_indptr/kv_indptr; then _run_multi_step_decode invokes decode kernels against that stale layout, which can misindex KV ranges or fail on shape assumptions for mixed prefill+decode speculative batches.

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

syuoni changed the title from "[WIP] feat(K2.5): mixed prefill&decode batching" to "[WIP] feat: mixed prefill-decode batching (part 3: compatibility with speculative decoding for MLA models)" May 22, 2026

syuoni force-pushed the feat/k2.5-mixed-batch branch from f2815dc to 91482d0 Compare May 22, 2026 14:56

syuoni requested review from LorrinWWW, dongjiyingdjy, yweng0828 and zhyncs May 22, 2026 14:57

syuoni marked this pull request as ready for review May 22, 2026 14:57

syuoni requested a review from a team as a code owner May 22, 2026 14:57

syuoni changed the title from "[WIP] feat: mixed prefill-decode batching (part 3: compatibility with speculative decoding for MLA models)" to "feat: mixed prefill-decode batching (part 3: compatibility with speculative decoding for MLA models)" May 22, 2026

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/layers/attention/backends/trtllm_mla.py

Comment thread python/tokenspeed/runtime/sampling/sampling_batch_info.py

chatgpt-codex-connector Bot reviewed May 23, 2026

syuoni force-pushed the feat/k2.5-mixed-batch branch from f863184 to 51f14df Compare May 23, 2026 03:47

chatgpt-codex-connector Bot reviewed May 23, 2026

syuoni added 15 commits May 23, 2026 05:58

fix typing · 1878a1a
scheduling order · 92d9bf1
non-spec path · 27882bb
merge o_proj · 8c5564b
inplace prefill attn output · 514a8c5
fix non-mixed spec · f02254c
spec path · 0901eb6
fix trtllm_mla · 8ede2cb
migrate flashmla · 67b3810
migrate other attn backends · 00469aa
clean num_tokens · 7dfe411
fix dsv4 ut · f196575
override_num_extends · 9ef8fc1
enable mixed batch on CI · 307b1ef
fix doc · 2f8a7dc

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

syuoni added 3 commits May 23, 2026 05:58

disable mixed batch on perf CI · e7db3f0
fix lint · 96ea96e
fix ut · 6a7b0a1

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

syuoni force-pushed the feat/k2.5-mixed-batch branch from 452d75a to 6a7b0a1 Compare May 23, 2026 05:59

unregister test_deepseek_v4_config · 0d95c68

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

lightseek-bot approved these changes May 23, 2026

syuoni merged commit d8f3295 into lightseekorg:main May 23, 2026
82 of 87 checks passed

syuoni merged 19 commits into lightseekorg:main from syuoni:feat/k2.5-mixed-batch
