feat(spec-decode): add n-gram (prompt-lookup) speculative drafter by elwhyjay · Pull Request #145 · lightseekorg/tokenspeed

elwhyjay · 2026-05-14T05:40:57Z

Summary

Adds SpeculativeAlgorithm.NGRAM, a draft-model-free prompt-lookup drafter for chain speculative decoding.

The drafter keeps a CPU-side token history per request-pool slot, finds the longest suffix-matching n-gram in that history, and proposes the continuation tokens as [last_verified, d1, ..., dK]. This reuses the existing target-side chain verify path without adding a draft model, draft KV cache, drafter attention backend, or new verify kernel.
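As a rough illustration of the lookup described above, here is a minimal brute-force sketch in NumPy. It is not the PR's implementation (the test plan mentions a KMP-based suffix lookup); the function name `propose_ngram`, its parameters, and the choice to prefer the most recent earlier match are all assumptions made for this example.

```python
import numpy as np

def propose_ngram(history, num_draft, min_n=2, max_n=4):
    """Return up to `num_draft` tokens that followed an earlier occurrence of
    the longest suffix n-gram of `history`, or an empty array if none matches.

    Hypothetical helper: brute-force window comparison, not KMP.
    """
    history = np.asarray(history, dtype=np.int64)
    total = history.shape[0]
    # Try the longest n-gram first, then fall back to shorter ones.
    for n in range(min(max_n, total - 1), min_n - 1, -1):
        suffix = history[total - n:]
        # All length-n windows except the last one, which is the suffix itself.
        windows = np.lib.stride_tricks.sliding_window_view(history, n)[:-1]
        hits = np.flatnonzero((windows == suffix).all(axis=1))
        if hits.size:
            start = hits[-1] + n  # continuation starts right after the match
            return history[start:start + num_draft]
    return history[:0]

history = [1, 2, 3, 4, 9, 2, 3, 4]
drafts = propose_ngram(history, num_draft=3)
# Chain row in the [last_verified, d1, ..., dK] layout.
row = [history[-1], *drafts.tolist()]
```

Here the suffix `[2, 3, 4]` also occurs at positions 1 to 3, so the tokens that followed it there become the draft.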

This first cut also adds startup validation for NGRAM-specific constraints:

- no draft model path
- topk == 1
- num_draft_tokens == num_steps + 1
- no PD disaggregation
- prefix caching and chunked prefill disabled for now
- eager mode forced for now
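A minimal sketch of what such startup validation could look like follows. The `SpecConfig` dataclass, its field names, and `validate_ngram` are hypothetical and do not mirror tokenspeed's actual server arguments. Only the first four constraints are modeled, and all of them are treated as hard errors, which is an assumption; the remaining items are features the PR disables or forces rather than rejects.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SpecConfig:
    """Hypothetical config container; field names are illustrative only."""
    draft_model_path: Optional[str] = None
    topk: int = 1
    num_steps: int = 3
    num_draft_tokens: int = 4
    pd_disaggregation: bool = False

def validate_ngram(cfg: SpecConfig) -> List[str]:
    """Collect human-readable violations of the NGRAM constraints."""
    errors = []
    if cfg.draft_model_path is not None:
        errors.append("NGRAM does not take a draft model path")
    if cfg.topk != 1:
        errors.append("NGRAM requires topk == 1")
    if cfg.num_draft_tokens != cfg.num_steps + 1:
        errors.append("NGRAM requires num_draft_tokens == num_steps + 1")
    if cfg.pd_disaggregation:
        errors.append("NGRAM does not support PD disaggregation")
    return errors

ok = validate_ngram(SpecConfig())
bad = validate_ngram(SpecConfig(topk=2, num_draft_tokens=5))
```

Returning a list instead of raising on the first violation lets a caller report every problem in one startup message.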

The lookup core is split into a pure-numpy module so the algorithm and batched proposer can be unit-tested without native kernel builds.
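To show why a pure-NumPy core is easy to unit-test, here is a sketch of a batched row builder that needs no GPU or native kernels. The function `build_draft_rows`, the padding value `pad_id=0`, and the right-padding scheme are assumptions for illustration; the PR's actual row layout and padding may differ.

```python
import numpy as np

def build_draft_rows(last_verified, drafts, num_steps, pad_id=0):
    """Pack per-request drafts into a [batch, num_steps + 1] int64 array.

    Column 0 holds each request's last verified token; columns 1..num_steps
    hold the draft tokens, right-padded with `pad_id` (hypothetical choice).
    """
    if len(last_verified) != len(drafts):
        raise ValueError("last_verified and drafts must have the same length")
    rows = np.full((len(drafts), num_steps + 1), pad_id, dtype=np.int64)
    rows[:, 0] = last_verified
    for i, draft in enumerate(drafts):
        draft = np.asarray(draft, dtype=np.int64)[:num_steps]  # truncate extras
        rows[i, 1:1 + draft.size] = draft
    return rows

rows = build_draft_rows([4, 7], [[9, 2, 3], [5]], num_steps=3)
```

A test for this kind of function only has to compare small integer arrays, which is the sort of row layout, padding, and shape check listed in the test plan.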

Test Plan

test/runtime/test_spec_decode_ngram.py
- KMP suffix lookup cases
- batched proposer row layout / padding / shape validation
test/runtime/test_cli_config_compat.py
- NGRAM resolve path
- invalid config rejection cases

lightseek-bot · 2026-05-14T05:44:05Z

@elwhyjay We should also think about whether it makes sense to add N-gram support. With TorchSpec already making training for algorithms like Eagle3 and DFlash simple and production-friendly, the key question is: in what scenarios can N-gram actually outperform algorithms like Eagle3 or DFlash?

Will keep you posted.

Adds `SpeculativeAlgorithm.NGRAM`, a draft-model-free prompt-lookup drafter for chain speculative decoding.

The drafter keeps a CPU-side token history per request-pool slot, finds the longest suffix-matching n-gram in that history, and proposes the continuation tokens as `[last_verified, d1, ..., dK]`. Reuses the existing target-side chain verify path without adding a draft model, draft KV cache, drafter attention backend, or new verify kernel.

Adds startup validation for NGRAM-specific constraints: no draft model path, `topk == 1`, `num_draft_tokens == num_steps + 1`, no PD disaggregation, prefix caching and chunked prefill disabled, eager mode forced.

The lookup core is split into a pure-numpy module so the algorithm and batched proposer can be unit-tested without native kernel builds.

Tests: KMP suffix lookup, batched proposer row layout / padding / shape validation, NGRAM resolve and reject paths in CLI compat.

Signed-off-by: yongjunlee <jqueen.astro@gmail.com>

elwhyjay · 2026-05-14T06:07:52Z

@lightseek-bot Thanks for raising this so quickly. I think NGRAM and trained drafters (EAGLE3, DFlash) sit in different categories rather than competing:

- No training, no draft weights. NGRAM needs neither a trained draft head nor a draft KV pool, so it works on any model the moment it lands in tokenspeed (including new families that don't have an EAGLE3 / DFlash head trained yet). I see it as a bridge rather than a long-term replacement for trained drafters.
- Different acceptance profile. Trained drafters generalize across prompts but rarely hit a 100% match on long verbatim spans. Prompt-lookup is the inverse: it does nothing on novel text but is very strong on patterns that repeat inside the same request: agentic tool-call replays, JSON / XML scaffolding, retrieval-augmented copy, repeated code chunks.
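To make the acceptance-profile point concrete, here is a minimal sketch of greedy chain verification, assuming temperature 0 and that `target_next[i]` holds the target model's argmax token after consuming `last_verified, d1, ..., d_i`. The function name and argument layout are illustrative and are not taken from tokenspeed's verify path.

```python
def greedy_chain_accept(draft, target_next):
    """Return the tokens emitted by one verify step.

    draft:       [d1, ..., dK] proposed tokens.
    target_next: [t0, ..., tK] target argmax at each chain position.
    Accept drafts while they agree with the target, then append the target's
    own token at the first disagreement (or the bonus token if all matched).
    """
    if len(target_next) != len(draft) + 1:
        raise ValueError("target_next must have one more entry than draft")
    accepted = 0
    for d, t in zip(draft, target_next):
        if d != t:
            break
        accepted += 1
    return list(draft[:accepted]) + [target_next[accepted]]

# Verbatim repeat: every draft matches, so the step emits K + 1 tokens.
full = greedy_chain_accept([9, 2, 3], [9, 2, 3, 4])
# Novel text: the first draft is wrong, so the step emits a single token.
miss = greedy_chain_accept([9, 2, 3], [5, 0, 0, 0])
```

On a verbatim span a lookup drafter hits the first case on nearly every step, while on novel text it degrades to the second case, which emits the same single token per step as ordinary decoding while still paying for the extra verify positions.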

This PR is intentionally minimal: chain verify is reused with no new kernel, it is single-rank only, and several features (prefix caching, chunked prefill, PD disaggregation, topk > 1) are auto-disabled or rejected at startup, so it shouldn't get in the way of the trained-drafter track. Smaller follow-ups on top of the same lookup core are planned (CUDA-graph, prefix caching / KVStore, chunked prefill, etc.), and the order is open to maintainer preference.

I might be missing context on TorchSpec's roadmap here, so let me know if there's an overlap I should be aware of, and I can rescope or split things differently.

NGRAM previously forced `chunked_prefill_size = -1` to disable chunked prefill. That value is also used as a scheduler-facing token capacity, so propagating the negative sentinel can prevent SMG startup from completing.

Leave the resolved chunked-prefill budget untouched for NGRAM. The drafter's per-pool token history is updated from the actual extend/verify stream, so it does not require chunked prefill to be disabled.

Tests:
- `python3 -m pytest test/runtime/test_cli_config_compat.py -q`
- `python3 -m pytest test/runtime/test_spec_decode_ngram.py -q`

Signed-off-by: yongjunlee <jqueen.astro@gmail.com>
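The reasoning in this commit message can be sketched as follows: if the history is built purely by appending whatever tokens each extend or verify step actually processed, then splitting a prompt into chunks produces the same history as a single prefill. The `SlotHistory` class and its method names are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

class SlotHistory:
    """Hypothetical per-slot token history, keyed by request-pool slot index."""

    def __init__(self):
        self._tokens = defaultdict(list)

    def append(self, slot, tokens):
        # Called once per extend chunk or verify step with the processed tokens.
        self._tokens[slot].extend(int(t) for t in tokens)

    def release(self, slot):
        # Drop the history when the slot is returned to the pool.
        self._tokens.pop(slot, None)

    def view(self, slot):
        return list(self._tokens.get(slot, []))

prompt = [11, 12, 13, 14, 15, 16]
whole, chunked = SlotHistory(), SlotHistory()
whole.append(0, prompt)                    # single prefill
for i in range(0, len(prompt), 4):         # chunked prefill, chunk size 4
    chunked.append(0, prompt[i:i + 4])
```

Both objects end up with the same token list for slot 0, so under this append-only assumption the lookup has no reason to care how the prefill was chunked.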

elwhyjay · 2026-05-14T10:51:25Z

A quick e2e update on H100 (CUDA 13.0 / torch 2.11.0) with Qwen3-8B, eager mode, `--speculative-num-steps 3 --speculative-ngram-min 2 --speculative-ngram-max 4`, temperature=0:

| Prompt (completion tokens) | Baseline tok/s | NGRAM tok/s | Delta |
| --- | --- | --- | --- |
| JSON-record repeat × 5 (95) | 48.4 | 46.4 | -4.1% |
| Code line repeat × 15 (109) | 36.3 | 49.6 | +36.6% |
| Word repeat × 50 (400) | 50.4 | 50.3 | -0.2% |
| Numbered items × 30 (205) | 49.7 | 50.1 | +0.8% |
| Total | 47.5 | 49.6 | +4.4% |

So the results are within a few percent of baseline on three of the four prompts, and clearly faster on the code line repeat case. This matches the use cases I mentioned above (tool calls, structured output, code chunks). The small change on shorter or less repetitive prompts is also similar to what vLLM's #24986 reports on mt-bench for 8B (mean acceptance length around 2.03).
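For anyone who wants to re-derive the delta column, here is a short sketch using only the numbers in the table. The `aggregate_tok_s` helper reconstructs a token-weighted throughput from the rounded per-row figures; that the Total row was computed this way is an inference from the numbers, not something stated in this thread.

```python
# (completion tokens, baseline tok/s, NGRAM tok/s), copied from the table.
rows = {
    "JSON-record repeat x5": (95, 48.4, 46.4),
    "Code line repeat x15": (109, 36.3, 49.6),
    "Word repeat x50": (400, 50.4, 50.3),
    "Numbered items x30": (205, 49.7, 50.1),
}

def pct_delta(base, new):
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0

def aggregate_tok_s(column):
    """Total tokens / total time, with time rebuilt as tokens / tok_s."""
    total_tokens = sum(r[0] for r in rows.values())
    total_seconds = sum(r[0] / r[column] for r in rows.values())
    return total_tokens / total_seconds

for name, (_, base, ngram) in rows.items():
    print(f"{name}: {pct_delta(base, ngram):+.1f}%")
print(f"aggregate: baseline {aggregate_tok_s(1):.1f}, NGRAM {aggregate_tok_s(2):.1f}")
```

The per-row deltas reproduce the table, and the reconstructed aggregates land within about 0.1 tok/s of the Total row, which is the error one would expect from working with rounded inputs.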

lightseek-bot · 2026-05-14T18:22:02Z

Hi @elwhyjay
As discussed internally, ngram is neither included in nor planned for our roadmap. We have decided not to merge it and instead encourage a fork. Thank you for your understanding. Here are our thoughts on external contributions.
#120 (comment)

elwhyjay requested a review from a team as a code owner May 14, 2026 05:40

elwhyjay force-pushed the feat/spec-decode-ngram branch from 0bb40ed to 13f6ca2 Compare May 14, 2026 05:47

elwhyjay force-pushed the feat/spec-decode-ngram branch from 40d9851 to fb611a5 Compare May 14, 2026 10:45

lightseek-bot closed this May 14, 2026

elwhyjay wants to merge 2 commits into lightseekorg:main from elwhyjay:feat/spec-decode-ngram
