perf(K2.5): optimize small kernels in EAGLE3 drafter loop by syuoni · Pull Request #142 · lightseekorg/tokenspeed

syuoni · 2026-05-14T02:23:35Z

Summary

Per-call overhead reductions in the EAGLE3 drafter loop and surrounding metadata prep.

Changes

compute_out_cache_loc uniform variant: added compute_out_cache_loc_uniform
for the drafter's multi-step decode, where every request has input_length=1. It skips
the per-call torch.cumsum + torch.full host-side work and the kernel's GMEM
reads of input_lengths_ptr/cumsum_lengths_ptr (Triton specializes on a
None pointer at JIT time).

req_pool_indices_buf is now int64: this eliminates ~12 implicit int32→int64
unrolled_elementwise cast kernels (~1.6 µs each) along the per-iteration
metadata prep path. It pairs with switching the valid_cache_lengths[idx] fancy
index to valid_cache_lengths.index_select(0, idx), so int32 indices are still
accepted natively where they remain (e.g. in index_add_).

Eagle.draft() cleanup: fused cache_start + 1 into
torch.add(..., out=draft_seq_lens); replaced the post-draft() torch.cat
with direct writes into a shared next_tokens[bs, spec_num_steps+1] buffer;
skipped the last-iteration positions.add_(1) / draft_seq_lens.add_(1), since their
results are not consumed; removed the dead logits.shape[0] != bs fallback branch.

Persistent drafter buffers: draft_seq_lens_buf, draft_out_cache_loc_buf,
draft_input_lengths_buf, and last_index_offsets_buf are hoisted to
Eagle.__init__ to avoid per-call allocation and initialization. The last one
(= torch.arange(max_bs) * spec_num_tokens - 1, int64) replaces two
per-call torch.arange(bs, ...) * spec_num_tokens patterns: one in the
drafter's last-verified-id selection, one in LogitsProcessor's padded-static-len
last-token-select. The precomputed buffer is sliced to [:bs] and plumbed via
ForwardContext.last_index_offsets so LogitsProcessor can skip the
arange + mul + sub triplet. Net: pre-step-0 last-token-select drops from
6 kernels (arange + mul + sub + add + 2 gathers) to 3 (1 add + 2 gathers).
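
To illustrate why the uniform variant can drop the cumsum, here is a minimal pure-Python model of the index arithmetic. It is a sketch under assumptions, not the PR's Triton kernel: the function names, the req_to_token table layout (one row of cache slots per request pool index), and the cache_starts argument are all hypothetical. When every input length is 1, the exclusive prefix sum of the lengths is simply 0, 1, 2, ..., so request i writes to output slot i and neither the lengths nor their cumsum needs to be read.

```python
from itertools import accumulate


def out_cache_loc_general(req_to_token, req_pool_indices, cache_starts, input_lengths):
    """General case: output offsets come from a prefix sum over input_lengths."""
    # Exclusive prefix sum: offsets[i] is where request i's tokens start in `out`.
    offsets = [0] + list(accumulate(input_lengths))
    out = [0] * offsets[-1]
    for i, (pool_idx, start, n) in enumerate(
        zip(req_pool_indices, cache_starts, input_lengths)
    ):
        for j in range(n):
            out[offsets[i] + j] = req_to_token[pool_idx][start + j]
    return out


def out_cache_loc_uniform(req_to_token, req_pool_indices, cache_starts):
    """Uniform case (every input length is 1): request i maps to output slot i."""
    return [req_to_token[p][s] for p, s in zip(req_pool_indices, cache_starts)]


# Hypothetical table: row = request pool index, column = token position.
req_to_token = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
req_pool_indices = [2, 0]
cache_starts = [1, 3]

general = out_cache_loc_general(req_to_token, req_pool_indices, cache_starts, [1, 1])
uniform = out_cache_loc_uniform(req_to_token, req_pool_indices, cache_starts)
```

Both calls produce the same locations for the all-ones case, which is the equivalence the uniform fast path relies on.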
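
The last_index_offsets precomputation can be sketched the same way. This is a pure-Python model of the arithmetic only, assuming a flattened layout in which request i owns spec_num_tokens consecutive slots; the helper names and the accept_lengths argument are hypothetical and do not appear in the PR. The point is that i * spec_num_tokens - 1 depends only on max_bs and spec_num_tokens, so it can be built once and sliced per call, leaving a single add before the gathers.

```python
def build_last_index_offsets(max_bs, spec_num_tokens):
    """Built once at init: models arange(max_bs) * spec_num_tokens - 1."""
    return [i * spec_num_tokens - 1 for i in range(max_bs)]


def last_token_indices(last_index_offsets, accept_lengths):
    """Per call: slice the persistent buffer to [:bs] and do one add."""
    bs = len(accept_lengths)
    return [off + n for off, n in zip(last_index_offsets[:bs], accept_lengths)]


# One-time setup (hypothetical sizes).
offsets_buf = build_last_index_offsets(max_bs=8, spec_num_tokens=4)

# Per-call work: three requests with different accepted lengths (hypothetical).
accept_lengths = [2, 4, 1]
indices = last_token_indices(offsets_buf, accept_lengths)

# Gather the last accepted token of each request from a flat token list.
flat_tokens = list(range(100, 112))  # 3 requests * 4 slots each
last_tokens = [flat_tokens[i] for i in indices]
```

Note that the first entry of the buffer is -1, which is only valid as an index after the accepted length (at least 1 in this model) has been added.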

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

LorrinWWW

lgtm

syuoni requested review from LorrinWWW, dongjiyingdjy, yweng0828 and zhyncs May 14, 2026 13:27

syuoni added 10 commits May 14, 2026 13:28

fix doc

a291826

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

compute_out_cache_loc

eb97f1c

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

int64 req_pool_indices

13f1d3c

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix doc

ffe0854

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

discard long() cast

fd86215

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

hoist

54d9d2a

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

hoist offset

34180cd

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

clean

638d653

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

clean

36c3211

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

polish

80fd19b

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

syuoni force-pushed the opt-eagle3-draft-part2 branch from fc15f9a to 80fd19b Compare May 14, 2026 13:31

syuoni changed the title ~~[WIP] perf(K2.5): optimize small kernels in EAGLE3 drafter loop~~ perf(K2.5): optimize small kernels in EAGLE3 drafter loop May 14, 2026

syuoni marked this pull request as ready for review May 14, 2026 13:33

syuoni requested a review from a team as a code owner May 14, 2026 13:33

fix rebase

8a3105f

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

LorrinWWW approved these changes May 14, 2026

LorrinWWW merged commit 82b4e49 into lightseekorg:main May 14, 2026
52 of 58 checks passed

LorrinWWW merged 11 commits into lightseekorg:main from syuoni:opt-eagle3-draft-part2

syuoni commented May 14, 2026 •

edited

Loading

Uh oh!

LorrinWWW left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

syuoni commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Uh oh!

LorrinWWW left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

syuoni commented May 14, 2026 •

edited

Loading