feat(spec-decode): add n-gram (prompt-lookup) speculative drafter #145

Closed
elwhyjay wants to merge 2 commits into lightseekorg:main from elwhyjay:feat/spec-decode-ngram

Conversation

@elwhyjay
Contributor

Summary

Adds SpeculativeAlgorithm.NGRAM, a draft-model-free prompt-lookup drafter for chain speculative decoding.

The drafter keeps a CPU-side token history per request-pool slot, finds the longest suffix-matching n-gram in that history, and proposes the continuation tokens as [last_verified, d1, ..., dK]. This reuses the existing target-side chain verify path without adding a draft model, draft KV cache, drafter attention backend, or new verify kernel.
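The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `propose_ngram`, its parameters, and the choice of the most recent earlier match as the tie-break are all assumptions, and it uses a simple windowed comparison rather than the KMP lookup that the tests exercise.

```python
import numpy as np


def propose_ngram(history, num_steps, ngram_min=2, ngram_max=4):
    """Return up to num_steps continuation tokens, or an empty array.

    Hypothetical helper: names and defaults are illustrative only.
    """
    history = np.asarray(history, dtype=np.int64)
    n_tokens = history.shape[0]
    # Try the longest n-gram first so the longest suffix match wins.
    for n in range(min(ngram_max, n_tokens - 1), ngram_min - 1, -1):
        suffix = history[n_tokens - n:]
        # Every length-n window except the last one, which is the suffix itself.
        windows = np.lib.stride_tricks.sliding_window_view(history, n)[:-1]
        hits = np.flatnonzero((windows == suffix).all(axis=1))
        if hits.size:
            # Assumption: prefer the most recent earlier occurrence.
            start = hits[-1] + n
            return history[start:start + num_steps]
    return history[:0]  # no match: propose nothing


history = [1, 2, 3, 4, 5, 1, 2, 3]
drafts = propose_ngram(history, num_steps=3)
# Chain layout from the summary: [last_verified, d1, ..., dK]
chain = np.concatenate(([history[-1]], drafts))
```

With the history above, the longest matching suffix is `[1, 2, 3]`, so the proposed continuation is the three tokens that followed its earlier occurrence.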

This first cut also adds startup validation for NGRAM-specific constraints:

  • no draft model path
  • topk == 1
  • num_draft_tokens == num_steps + 1
  • no PD disaggregation
  • prefix caching and chunked prefill disabled for now
  • eager mode forced for now
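As a rough illustration of what these startup checks amount to, here is a sketch under stated assumptions. `SpecConfig`, its field names, and `validate_ngram` are hypothetical stand-ins rather than tokenspeed's actual config objects. Chunked prefill is omitted because a later commit in this PR leaves that budget untouched.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SpecConfig:
    """Hypothetical stand-in for the real server arguments."""
    draft_model_path: Optional[str] = None
    topk: int = 1
    num_steps: int = 3
    num_draft_tokens: int = 4
    pd_disaggregation: bool = False
    enable_prefix_caching: bool = False
    enforce_eager: bool = True


def validate_ngram(cfg: SpecConfig) -> SpecConfig:
    """Reject unsupported settings, then apply the forced overrides."""
    errors = []
    if cfg.draft_model_path is not None:
        errors.append("NGRAM does not take a draft model path")
    if cfg.topk != 1:
        errors.append("NGRAM requires topk == 1")
    if cfg.num_draft_tokens != cfg.num_steps + 1:
        errors.append("NGRAM requires num_draft_tokens == num_steps + 1")
    if cfg.pd_disaggregation:
        errors.append("NGRAM does not support PD disaggregation")
    if errors:
        raise ValueError("; ".join(errors))
    # Overrides rather than rejections: prefix caching off, eager mode on.
    return replace(cfg, enable_prefix_caching=False, enforce_eager=True)
```

Collecting all violations before raising is a design choice for this sketch; it lets a user fix an invalid command line in one pass.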

The lookup core is split into a pure-numpy module so the algorithm and batched proposer can be unit-tested without native kernel builds.
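To make the batched row layout concrete, a pure-numpy packing step might look like the sketch below. The function name `pack_draft_rows`, the `-1` pad value, and the returned `lengths` array are assumptions for illustration and are not taken from the PR.

```python
import numpy as np


def pack_draft_rows(last_verified, drafts, num_steps, pad_id=-1):
    """Pack per-request drafts into a (batch, num_steps + 1) array.

    Column 0 holds the last verified token; columns 1..num_steps hold
    the proposed tokens, padded with pad_id when the lookup found fewer.
    """
    last_verified = np.asarray(last_verified, dtype=np.int64)
    if last_verified.ndim != 1:
        raise ValueError("last_verified must be 1-D")
    if len(drafts) != last_verified.shape[0]:
        raise ValueError("expected one draft sequence per request")
    batch = last_verified.shape[0]
    rows = np.full((batch, num_steps + 1), pad_id, dtype=np.int64)
    rows[:, 0] = last_verified
    lengths = np.zeros(batch, dtype=np.int64)
    for i, draft in enumerate(drafts):
        draft = np.asarray(draft, dtype=np.int64)[:num_steps]
        rows[i, 1:1 + draft.size] = draft
        lengths[i] = draft.size
    return rows, lengths


rows, lengths = pack_draft_rows([7, 9], [[4, 5, 1], []], num_steps=3)
```

Because nothing here touches a GPU or a native kernel, a unit test can assert on `rows` and `lengths` directly.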

Test Plan

  • test/runtime/test_spec_decode_ngram.py
    • KMP suffix lookup cases
    • batched proposer row layout / padding / shape validation
  • test/runtime/test_cli_config_compat.py
    • NGRAM resolve path
    • invalid config rejection cases

@elwhyjay elwhyjay requested a review from a team as a code owner May 14, 2026 05:40
@lightseek-bot
Contributor

@elwhyjay We should also think about whether it makes sense to add N-gram support. With TorchSpec already making training for algorithms like Eagle3 and DFlash simple and production-friendly, the key question is: in what scenarios can N-gram actually outperform algorithms like Eagle3 or DFlash?

Will keep you posted.

Adds `SpeculativeAlgorithm.NGRAM`, a draft-model-free prompt-lookup
drafter for chain speculative decoding. The drafter keeps a CPU-side
token history per request-pool slot, finds the longest suffix-matching
n-gram in that history, and proposes the continuation tokens as
`[last_verified, d1, ..., dK]`. Reuses the existing target-side chain
verify path without adding a draft model, draft KV cache, drafter
attention backend, or new verify kernel.

Adds startup validation for NGRAM-specific constraints: no draft
model path, `topk == 1`, `num_draft_tokens == num_steps + 1`, no PD
disaggregation, prefix caching and chunked prefill disabled, eager
mode forced.

The lookup core is split into a pure-numpy module so the algorithm
and batched proposer can be unit-tested without native kernel builds.

Tests: KMP suffix lookup, batched proposer row layout / padding /
shape validation, NGRAM resolve and reject paths in CLI compat.

Signed-off-by: yongjunlee <jqueen.astro@gmail.com>
@elwhyjay elwhyjay force-pushed the feat/spec-decode-ngram branch from 0bb40ed to 13f6ca2 Compare May 14, 2026 05:47
@elwhyjay
Contributor Author

@lightseek-bot Thanks for raising this so quickly. I think NGRAM and trained drafters (EAGLE3, DFlash) sit in different categories rather than competing:

  • No training, no draft weights. NGRAM needs neither a trained draft head nor a draft KV pool, so it works on any model the moment it lands in tokenspeed (including new families that don't have an EAGLE3 / DFlash head trained yet). I see it as a bridge rather than a long-term replacement for trained drafters.
  • Different acceptance profile. Trained drafters generalize across prompts but rarely hit a 100% match on long verbatim spans. Prompt-lookup is the inverse: it does nothing on novel text but is very strong on patterns that repeat inside the same request — agentic tool-call replays, JSON / XML scaffolding, retrieval-augmented copy, repeated code chunks.

This PR is intentionally minimal: chain verify is reused with no new kernel, it is single-rank only, and several features (prefix caching, chunked prefill, PD disaggregation, topk > 1) are auto-disabled or rejected at startup, so it shouldn't get in the way of the trained-drafter track. Smaller follow-ups on top of the same lookup core are planned (CUDA-graph, prefix caching / KVStore, chunked prefill, etc.), and the order is open to maintainer preference.

I might be missing context on TorchSpec's roadmap here, so let me know if there's an overlap I should be aware of, and I can rescope or split things differently.

NGRAM previously forced `chunked_prefill_size = -1` to disable chunked
prefill. That value is also used as a scheduler-facing token capacity, so
propagating the negative sentinel can prevent SMG startup from completing.

Leave the resolved chunked-prefill budget untouched for NGRAM. The drafter's
per-pool token history is updated from the actual extend/verify stream, so it
does not require chunked prefill to be disabled.
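The reasoning above can be illustrated with a small sketch: if the history is only ever appended from the tokens that actually flow through extend and verify, then feeding a prompt in chunks yields the same history as feeding it whole. The `SlotHistory` class and its capacity handling are hypothetical and are not the PR's implementation.

```python
import numpy as np


class SlotHistory:
    """Hypothetical CPU-side token history for one request-pool slot."""

    def __init__(self, capacity):
        self.buf = np.zeros(capacity, dtype=np.int64)
        self.length = 0

    def append(self, tokens):
        """Append tokens seen in one extend chunk or one verify step."""
        tokens = np.asarray(tokens, dtype=np.int64)
        end = self.length + tokens.size
        if end > self.buf.size:
            raise ValueError("history capacity exceeded")
        self.buf[self.length:end] = tokens
        self.length = end

    def view(self):
        return self.buf[:self.length]


prompt = [11, 12, 13, 14, 15, 16]

whole = SlotHistory(capacity=16)
whole.append(prompt)            # prefill in one pass

chunked = SlotHistory(capacity=16)
for i in range(0, len(prompt), 4):
    chunked.append(prompt[i:i + 4])  # prefill in chunks of 4

same = np.array_equal(whole.view(), chunked.view())
```

Here `same` is true regardless of the chunk size, which is why the lookup does not depend on how the scheduler splits the prompt.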

Tests:
- `python3 -m pytest test/runtime/test_cli_config_compat.py -q`
- `python3 -m pytest test/runtime/test_spec_decode_ngram.py -q`

Signed-off-by: yongjunlee <jqueen.astro@gmail.com>
@elwhyjay elwhyjay force-pushed the feat/spec-decode-ngram branch from 40d9851 to fb611a5 Compare May 14, 2026 10:45
@elwhyjay
Contributor Author

A quick e2e update on H100 (CUDA 13.0 / torch 2.11.0) with Qwen3-8B, eager mode, --speculative-num-steps 3 --speculative-ngram-min 2 --speculative-ngram-max 4, temperature=0:

| Prompt (completion tokens) | Baseline tok/s | NGRAM tok/s | delta |
| --- | --- | --- | --- |
| JSON-record repeat × 5 (95) | 48.4 | 46.4 | -4.1% |
| Code line repeat × 15 (109) | 36.3 | 49.6 | +36.6% |
| Word repeat × 50 (400) | 50.4 | 50.3 | -0.2% |
| Numbered items × 30 (205) | 49.7 | 50.1 | +0.8% |
| Total | 47.5 | 49.6 | +4.4% |

So the results are close to baseline on the less repetitive prompts, and clearly faster on the code line repeat case. This is similar to the use cases I mentioned above (tool calls, structured output, code chunks). The small change on shorter or less repetitive prompts is also similar to what vLLM's #24986 reports on mt-bench for 8B (mean acceptance length around 2.03).

@lightseek-bot
Contributor

Hi @elwhyjay
As discussed internally, ngram is neither included in nor planned for our roadmap. We have decided not to merge it and instead encourage a fork. Thank you for your understanding. Here are our thoughts on external contributions.
#120 (comment)
