
feat: mixed prefill-decode batching (part 3: compatibility with speculative decoding for MLA models)#205

Merged
syuoni merged 19 commits into
lightseekorg:mainfrom
syuoni:feat/k2.5-mixed-batch
May 23, 2026

Conversation

@syuoni
Member

@syuoni syuoni commented May 21, 2026

Summary

Extend --enable-mixed-batch from V4 sparse attention to Kimi K2.5 (and all MLA backends), with first-class spec-dec support.

Mixed prefill-decode batching:

Changes

Scheduler & runtime

  • C++ scheduler: decode-first priority under enable_mixed_prefill_decode so MIXED actually emits under long-prefill workloads.
  • Lifted the spec-dec gate on --enable-mixed-batch; users opt in per-workload.
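The decode-first priority could be sketched roughly as follows. This is a hypothetical Python illustration, not the C++ scheduler touched by this PR: `Request`, `plan_mixed_batch`, `token_budget`, and `tokens_per_decode` are assumed names introduced only for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    is_decode: bool        # True once the prompt has been fully prefilled
    pending_tokens: int    # prompt tokens still to prefill (unused for decode)


def plan_mixed_batch(requests, token_budget, tokens_per_decode=1):
    """Admit decode requests first, then fill the leftover budget with prefill.

    Returns (prefill_plan, decode_plan); each entry is (rid, num_tokens).
    """
    decode_plan, prefill_plan = [], []
    remaining = token_budget

    # Decode rows go first so long prefills cannot starve them.
    for req in requests:
        if req.is_decode and remaining >= tokens_per_decode:
            decode_plan.append((req.rid, tokens_per_decode))
            remaining -= tokens_per_decode

    # Prefill requests take whatever budget is left, chunked if necessary.
    for req in requests:
        if not req.is_decode and remaining > 0:
            chunk = min(req.pending_tokens, remaining)
            prefill_plan.append((req.rid, chunk))
            remaining -= chunk

    return prefill_plan, decode_plan
```

With a budget of 10 tokens, two decode requests, and one 100-token prefill, this sketch admits both decode rows and an 8-token prefill chunk, so the step is a MIXED batch rather than a prefill-only one.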

MLA backends (all gain MIXED support)

  • tokenspeed_mla, trtllm_mla, flashmla: init_forward_metadata fills both prefill + decode metadata under MIXED, with num_extends discriminator on decode metadata for kernel-call-time slicing.
  • DeepseekV3AttentionMLA.forward runs pre_attn_comm once, slices by num_prefill_tokens, dispatches prefill/decode through their native paths, single shared o_proj.
  • In-place out= plumbing eliminates the per-call BF16 copy in chunked prefill.
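As a rough sketch of the slice-and-dispatch flow described for `DeepseekV3AttentionMLA.forward`, the following pure-Python stand-in splits token rows at `num_prefill_tokens`, runs each part through its own path, and applies one shared projection. The function name `mixed_attention_forward` and the callables passed in are assumptions for illustration; the real code operates on tensors and also runs `pre_attn_comm`, which is omitted here.

```python
def mixed_attention_forward(hidden, num_prefill_tokens, prefill_attn, decode_attn, o_proj):
    """Split a MIXED batch's token rows, run each part natively, share one projection.

    `hidden` is a list of per-token rows laid out as
    [prefill tokens..., decode tokens...].
    """
    prefill_rows = hidden[:num_prefill_tokens]
    decode_rows = hidden[num_prefill_tokens:]

    out = []
    if prefill_rows:
        out.extend(prefill_attn(prefill_rows))   # native prefill path
    if decode_rows:
        out.extend(decode_attn(decode_rows))     # native decode path

    # A single output projection is applied to the concatenated result.
    return [o_proj(row) for row in out]
```

Keeping the projection outside the two branches is what lets both paths share it; the branches only need to agree on output row order.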

Spec-dec + MIXED

  • Wrapper drafter double-init collapsed to a single call: each backend now fills both prefill + decode metadata under is_draft + extend_or_mixed, with seq_lens aliased to the drafter's live buffer for in-place multi-step advance.
  • LogitsProcessor collapsed from a 4-branch prune into a single gather_ids gather; row indices computed by the caller (ModelExecutor + Eagle drafter). Cleaner contract, fewer special cases, MIXED-with-verify just works.
  • New spec_num_tokens field on AttentionBackend with sentinel-aware config defaults.
  • EagleDraftInput.num_extends threaded for correct drafter dispatch under EXTEND target.
  • Three GSM8K accuracy bugs fixed (per-row vc delta, per-row is_decode_slot gate, hybrid sampler logprob writeback).
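One way to picture the `gather_ids` contract is the sketch below. It assumes a layout with ragged prefill tokens first and `spec_num_tokens` candidate rows per decode request afterwards, and it assumes prefill requests need only their last token's logits. The helper names `build_gather_ids` and `gather_rows` are hypothetical and are not taken from the PR.

```python
def build_gather_ids(extend_lens, num_decode_reqs, spec_num_tokens):
    """Row indices into the flattened token dimension whose logits are kept.

    Assumed layout: ragged prefill tokens first, then `spec_num_tokens`
    candidate rows per decode request. Each extend length must be >= 1.
    """
    gather_ids = []
    offset = 0
    # Prefill requests: keep only the last token of each request.
    for n in extend_lens:
        offset += n
        gather_ids.append(offset - 1)
    # Decode requests under verification: keep every candidate row.
    for _ in range(num_decode_reqs):
        gather_ids.extend(range(offset, offset + spec_num_tokens))
        offset += spec_num_tokens
    return gather_ids


def gather_rows(hidden_rows, gather_ids):
    """A single gather standing in for per-mode pruning branches."""
    return [hidden_rows[i] for i in gather_ids]
```

Because the caller computes the indices, the consumer does not need to know whether the batch is EXTEND, DECODE, or MIXED.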

Interface cleanup

  • Dropped redundant num_tokens arg from all init_forward_metadata* signatures.
  • Renamed forward-time local spec_num_tokens → q_len_per_req to distinguish per-call shape from configured constant.
  • Replaced set_decode_num_extends(int) setter with override_num_extends(int) context manager that restores prior value.
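A context manager that restores the prior value might look like the minimal sketch below. `DecodeMetadata` is a hypothetical stand-in class; only the name `override_num_extends` and its restore-on-exit behavior come from the description above.

```python
from contextlib import contextmanager


class DecodeMetadata:
    """Hypothetical stand-in for a backend's decode metadata holder."""

    def __init__(self, num_extends=0):
        self.num_extends = num_extends

    @contextmanager
    def override_num_extends(self, value):
        prior = self.num_extends
        self.num_extends = value
        try:
            yield self
        finally:
            # Restore the prior value even if the body raises.
            self.num_extends = prior
```

Compared with a plain setter, the caller cannot forget to reset the field, and nested overrides unwind in the right order.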

CI

  • 4 K2.5 eval YAMLs + 1 perf YAML + 6 agentic-bench shell configs gain --enable-mixed-batch.
  • Eval YAMLs switched to MLA drafter (tokenspeed_mla + kimi-k2.5-eagle3-mla) for MIXED-safety.

@syuoni syuoni changed the title from "[WIP] feat(K2.5): mixed prefill&decode batching" to "[WIP] feat: mixed prefill-decode batching (part 3: compatibility with speculative decoding for MLA models)" May 22, 2026
@syuoni syuoni force-pushed the feat/k2.5-mixed-batch branch from f2815dc to 91482d0 Compare May 22, 2026 14:56
@syuoni syuoni marked this pull request as ready for review May 22, 2026 14:57
@syuoni syuoni requested a review from a team as a code owner May 22, 2026 14:57
@syuoni syuoni changed the title from "[WIP] feat: mixed prefill-decode batching (part 3: compatibility with speculative decoding for MLA models)" to "feat: mixed prefill-decode batching (part 3: compatibility with speculative decoding for MLA models)" May 22, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 91482d0584

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread python/tokenspeed/runtime/layers/attention/backends/trtllm_mla.py
Comment thread python/tokenspeed/runtime/sampling/sampling_batch_info.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f863184d8a

  sampling_info=sampling_info,
  bs=bs,
- is_spec_decode=self.drafter is not None and num_extends <= 0,
+ is_spec_decode=self.drafter is not None and num_extends < bs,

P1: Restrict spec-grammar mode to true decode batches

Changing is_spec_decode to self.drafter is not None and num_extends < bs makes MIXED batches enter spec-grammar mode, but _forward_step still only uploads candidate tokens to the grammar runtime when ctx.forward_mode.is_decode() (schedule_fill(...input_ids_buf_slice=...)). In MIXED + speculative + grammar runs, this marks has_candidates=True while candidate buffers are not refreshed, so the capturable grammar path computes per-token masks from stale/incorrect candidate data, producing wrong constrained sampling behavior.

@syuoni syuoni force-pushed the feat/k2.5-mixed-batch branch from f863184 to 51f14df Compare May 23, 2026 03:47

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51f14df981

  sampling_info=sampling_info,
  bs=bs,
- is_spec_decode=self.drafter is not None and num_extends <= 0,
+ is_spec_decode=self.drafter is not None and num_extends < bs,

P1: Keep spec-grammar path off for mixed batches in eager mode

Setting is_spec_decode true for any batch with num_extends < bs makes MIXED speculative batches enter the per-token grammar path even on the eager fallback. In that path, _fill_eager_bitmask assumes a pure decode layout and reads input_ids_buf[:bs*spec_num_tokens].view(bs, spec_num_tokens), but MIXED batches store ragged prefill tokens first and decode candidates after them, so grammar masks are generated from misaligned tokens. This produces incorrect constrained sampling whenever capturable grammar is disabled (or unavailable) and MIXED+spec decode is active.
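The misalignment described above can be reproduced with a small pure-Python sketch. Both helpers are hypothetical illustrations written for this example, not the repository's `_fill_eager_bitmask`; plain lists stand in for the tensor slice and `.view(...)`.

```python
def pure_decode_view(input_ids, bs, spec_num_tokens):
    """Reshape the first bs * spec_num_tokens ids into bs rows.

    This mirrors the pure-decode layout assumption.
    """
    flat = input_ids[: bs * spec_num_tokens]
    return [flat[i * spec_num_tokens:(i + 1) * spec_num_tokens] for i in range(bs)]


def mixed_decode_view(input_ids, num_prefill_tokens, num_decode_reqs, spec_num_tokens):
    """Skip the ragged prefill tokens before reshaping the decode candidates."""
    tail = input_ids[num_prefill_tokens:]
    return pure_decode_view(tail, num_decode_reqs, spec_num_tokens)


# One prefill request with 3 tokens, then one decode request with 2 candidates.
input_ids = [101, 102, 103, 7, 8]
misaligned = pure_decode_view(input_ids, bs=2, spec_num_tokens=2)
aligned = mixed_decode_view(input_ids, num_prefill_tokens=3,
                            num_decode_reqs=1, spec_num_tokens=2)
```

Here `misaligned` contains prefill tokens in its candidate rows, while `aligned` holds only the decode request's candidates. The second helper shows one possible way to account for the offset; it is not necessarily the fix adopted in this PR.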

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3bf882de69

Comment on lines +247 to 251
mla_metadata, num_splits = get_mla_metadata(
    plan_seq_lens.to(torch.int32),
    num_heads_plan,
    1,
)

P1: Build FlashMLA decode planner from decode rows only

In MIXED batches, this path computes flashmla_metadata/num_splits from full seq_lens (prefill + decode rows), but decode execution later consumes only the decode slice (block_table[num_extends:] in forward_extend/forward_decode). That misaligns planner metadata with the actual decode rows, so when num_extends > 0 the kernel can run with row planning derived from the wrong requests, leading to incorrect attention behavior or shape/runtime failures. The decode planner needs to be built (or consistently sliced) for decode rows only.
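A minimal sketch of the "decode rows only" idea, using plain lists in place of tensors, is shown below. The helper `slice_decode_inputs` is a hypothetical name; it only illustrates slicing the planner input and the block table at the same `num_extends` boundary and does not call the real `get_mla_metadata`.

```python
def slice_decode_inputs(seq_lens, block_table, num_extends):
    """Return (plan_seq_lens, decode_block_table) covering decode rows only.

    Assumes prefill (extend) requests occupy the first `num_extends` rows.
    """
    if not 0 <= num_extends <= len(seq_lens):
        raise ValueError("num_extends out of range")
    plan_seq_lens = seq_lens[num_extends:]
    decode_block_table = block_table[num_extends:]
    # The planner and the kernel must describe the same requests.
    if len(plan_seq_lens) != len(decode_block_table):
        raise ValueError("seq_lens and block_table disagree on row count")
    return plan_seq_lens, decode_block_table
```

Slicing both inputs in one place makes it harder for the planner metadata and the kernel's block table to drift apart.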

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 452d75aedf

Comment on lines +645 to +649
self.draft_attn_backend.init_forward_metadata(
    bs=padded_bs,
    num_extends=num_extends,
    req_pool_indices=req_pool_indices,
    seq_lens=draft_seq_lens,

P1: Reintroduce decode-metadata init for Triton draft backends

When _init_forward_metadata switched from two draft-backend init calls to a single call, EXTEND/MIXED draft batches no longer force a DECODE-shaped metadata refresh before Eagle step 2+. TritonAttnBackend stores only one forward_metadata, so after the single call it can still hold prefill-style qo_indptr/kv_indptr; then _run_multi_step_decode invokes decode kernels against that stale layout, which can misindex KV ranges or fail on shape assumptions for mixed prefill+decode speculative batches.

syuoni added 15 commits May 23, 2026 05:58
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
syuoni added 3 commits May 23, 2026 05:58
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
@syuoni syuoni force-pushed the feat/k2.5-mixed-batch branch from 452d75a to 6a7b0a1 Compare May 23, 2026 05:59
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
@syuoni syuoni merged commit d8f3295 into lightseekorg:main May 23, 2026
82 of 87 checks passed