feat(spec-decode): add n-gram (prompt-lookup) speculative drafter #145
elwhyjay wants to merge 2 commits into
@elwhyjay We should also think about whether it makes sense to add N-gram support. With TorchSpec already making training for algorithms like Eagle3 and DFlash simple and production-friendly, the key question is: in what scenarios can N-gram actually outperform algorithms like Eagle3 or DFlash? Will keep you posted.
Adds `SpeculativeAlgorithm.NGRAM`, a draft-model-free prompt-lookup drafter for chain speculative decoding.

The drafter keeps a CPU-side token history per request-pool slot, finds the longest suffix-matching n-gram in that history, and proposes the continuation tokens as `[last_verified, d1, ..., dK]`. Reuses the existing target-side chain verify path without adding a draft model, draft KV cache, drafter attention backend, or new verify kernel.

Adds startup validation for NGRAM-specific constraints: no draft model path, `topk == 1`, `num_draft_tokens == num_steps + 1`, no PD disaggregation, prefix caching and chunked prefill disabled, eager mode forced.

The lookup core is split into a pure-numpy module so the algorithm and batched proposer can be unit-tested without native kernel builds.

Tests: KMP suffix lookup, batched proposer row layout / padding / shape validation, NGRAM resolve and reject paths in CLI compat.

Signed-off-by: yongjunlee <jqueen.astro@gmail.com>
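To make the lookup concrete, here is a minimal pure-numpy sketch of the prompt-lookup idea described in the commit message. It is not the PR's code: the function name `propose_ngram`, its parameters (`max_ngram`, `min_ngram`, `pad_id`), and the choice of the most recent match are all assumptions for illustration, and it uses a brute-force window comparison rather than the KMP lookup the tests mention. It assumes the history already ends with the last verified token.

```python
import numpy as np


def propose_ngram(history, num_draft, max_ngram=4, min_ngram=1, pad_id=-1):
    """Hypothetical helper: return one row [last_verified, d1, ..., dK].

    Slots with no proposed token are filled with pad_id (an assumed sentinel).
    """
    history = np.asarray(history, dtype=np.int64)
    if history.size == 0:
        raise ValueError("history must contain at least the last verified token")

    row = np.full(num_draft + 1, pad_id, dtype=np.int64)
    row[0] = history[-1]  # last verified token leads the row

    n_hist = history.size
    # Try the longest suffix first, then fall back to shorter ones.
    for n in range(min(max_ngram, n_hist - 1), min_ngram - 1, -1):
        suffix = history[n_hist - n:]
        # Windows over history[:-1] exclude the suffix itself, so every hit
        # is an earlier occurrence with at least one token following it.
        windows = np.lib.stride_tricks.sliding_window_view(history[:-1], n)
        hits = np.flatnonzero((windows == suffix).all(axis=1))
        if hits.size:
            start = hits[-1] + n  # continuation after the most recent match
            cont = history[start:start + num_draft]
            row[1:1 + cont.size] = cont
            return row
    return row  # no match: only the last verified token is set


# The suffix [1, 2, 3] occurred earlier, followed by 4, 1, 2.
print(propose_ngram([1, 2, 3, 4, 1, 2, 3], num_draft=3))
```

Because the row has a fixed length of `num_draft + 1`, rows from several requests can be stacked into one batch, which is where a padding value becomes necessary.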
0bb40ed to 13f6ca2
@lightseek-bot Thanks for raising this so quickly. I think NGRAM and trained drafters (EAGLE3, DFlash) sit in different categories rather than competing:
This PR is intentionally minimal: chain verify is reused with no new kernel, it is single-rank only, and several features (prefix caching, chunked prefill, PD disaggregation, topk > 1) are auto-disabled or rejected at startup, so it shouldn't get in the way of the trained-drafter track. Smaller follow-ups on top of the same lookup core are planned (CUDA-graph, prefix caching / KVStore, chunked prefill, etc.), and the order is open to maintainer preference. I might be missing context on TorchSpec's roadmap here, so let me know if there's an overlap I should be aware of, and I can rescope or split things differently.
NGRAM previously forced `chunked_prefill_size = -1` to disable chunked prefill. That value is also used as a scheduler-facing token capacity, so propagating the negative sentinel can prevent SMG startup from completing.

Leave the resolved chunked-prefill budget untouched for NGRAM. The drafter's per-pool token history is updated from the actual extend/verify stream, so it does not require chunked prefill to be disabled.

Tests:
- `python3 -m pytest test/runtime/test_cli_config_compat.py -q`
- `python3 -m pytest test/runtime/test_spec_decode_ngram.py -q`

Signed-off-by: yongjunlee <jqueen.astro@gmail.com>
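As a rough illustration of why the history does not depend on how prefill is chunked, the sketch below appends whatever tokens each extend or verify step actually delivers. The class `SlotHistory` and its methods are hypothetical names, not the PR's API; the only idea taken from the commit message is that history is kept per request-pool slot and fed from the real token stream.

```python
import numpy as np


class SlotHistory:
    """Hypothetical CPU-side token history keyed by request-pool slot."""

    def __init__(self):
        self._tokens = {}

    def reset(self, slot):
        # Called when a slot is (re)assigned to a new request.
        self._tokens[slot] = []

    def append(self, slot, token_ids):
        # Works the same for a whole prompt, one prefill chunk,
        # or the tokens accepted by a verify step.
        self._tokens.setdefault(slot, []).extend(int(t) for t in token_ids)

    def get(self, slot):
        return np.asarray(self._tokens.get(slot, []), dtype=np.int64)


history = SlotHistory()
history.reset(0)
history.append(0, [11, 12, 13])  # first prefill chunk
history.append(0, [14, 15])      # second prefill chunk
history.append(0, [16])          # token accepted after verification
print(history.get(0))
```

Splitting the prompt into one chunk or several leaves the stored sequence identical, so the lookup sees the same history either way.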
40d9851 to fb611a5
A quick e2e update on H100 (CUDA 13.0 / torch 2.11.0) with Qwen3-8B, eager mode,
So the results are close to baseline on the less repetitive prompts and clearly faster on the code-line-repeat case. This matches the use cases I mentioned above (tool calls, structured output, code chunks). The small change on shorter or less repetitive prompts is also in line with what vLLM's #24986 reports on mt-bench for 8B (mean acceptance length around 2.03).
Hi @elwhyjay
Summary
Adds `SpeculativeAlgorithm.NGRAM`, a draft-model-free prompt-lookup drafter for chain speculative decoding.

The drafter keeps a CPU-side token history per request-pool slot, finds the longest suffix-matching n-gram in that history, and proposes the continuation tokens as `[last_verified, d1, ..., dK]`. This reuses the existing target-side chain verify path without adding a draft model, draft KV cache, drafter attention backend, or new verify kernel.

This first cut also adds startup validation for NGRAM-specific constraints:

- `topk == 1`
- `num_draft_tokens == num_steps + 1`

The lookup core is split into a pure-numpy module so the algorithm and batched proposer can be unit-tested without native kernel builds.
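The startup checks could look roughly like the sketch below. The `SpecConfig` dataclass, its field names, and `validate_ngram` are assumptions made for illustration and do not reflect the project's actual config objects. The three conditions are the ones named in this PR: no draft model path, `topk == 1`, and `num_draft_tokens == num_steps + 1`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecConfig:
    """Hypothetical stand-in for the resolved speculative-decoding config."""
    draft_model_path: Optional[str] = None
    topk: int = 1
    num_steps: int = 3
    num_draft_tokens: int = 4


def validate_ngram(cfg: SpecConfig) -> None:
    """Collect every violated constraint and fail once with all of them."""
    errors = []
    if cfg.draft_model_path is not None:
        errors.append("NGRAM is draft-model-free; draft_model_path must be unset")
    if cfg.topk != 1:
        errors.append(f"NGRAM requires topk == 1, got {cfg.topk}")
    if cfg.num_draft_tokens != cfg.num_steps + 1:
        errors.append(
            "NGRAM requires num_draft_tokens == num_steps + 1, got "
            f"{cfg.num_draft_tokens} and {cfg.num_steps}"
        )
    if errors:
        raise ValueError("; ".join(errors))


validate_ngram(SpecConfig())  # passes: chain layout with 3 steps, 4 tokens
```

The `num_steps + 1` relationship follows from the row layout `[last_verified, d1, ..., dK]`: one slot for the verified token plus one per draft step.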
Test Plan
- `test/runtime/test_spec_decode_ngram.py`
- `test/runtime/test_cli_config_compat.py`