how does vllm handle wrong tokens in speculative decoding? #4284
-
The core problem is that the number of accepted tokens differs across sequences in the same batch.
Thanks.
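For concreteness, here is a minimal sketch of that mismatch, assuming the engine keeps the batch rectangular by padding rejected draft slots with a -1 sentinel (the reply below says it strips these out before appending); all names here are hypothetical:

```python
import torch

k = 3  # draft tokens proposed per sequence
# accepted_counts[i] = how many of the k draft tokens sequence i accepted
accepted_counts = torch.tensor([3, 1, 0])
draft_tokens = torch.tensor([
    [11, 12, 13],  # seq 0: all 3 accepted
    [21, 22, 23],  # seq 1: only the first accepted
    [31, 32, 33],  # seq 2: none accepted
])

positions = torch.arange(k).unsqueeze(0)         # (1, k)
mask = positions < accepted_counts.unsqueeze(1)  # (batch, k), True where accepted
padded = torch.where(mask, draft_tokens, torch.full_like(draft_tokens, -1))
print(padded)
# tensor([[11, 12, 13],
#         [21, -1, -1],
#         [-1, -1, -1]])
```

(This sketch omits the corrected/bonus token the target model contributes; the point is only that per-sequence acceptance lengths are ragged while the batch output stays rectangular.)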
-
Before the engine appends token ids to sequences, it removes the -1 tokens. The logic is here:
vllm/vllm/engine/output_processor/multi_step.py, line 68 at commit 34128a6
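A paraphrase of that filtering step, as a minimal sketch rather than the exact code at the referenced line (`INVALID_TOKEN_ID` and the helper name are assumptions):

```python
from typing import List

INVALID_TOKEN_ID = -1  # sentinel left in rejected draft slots

def append_valid_tokens(seq_token_ids: List[int],
                        sampled_token_ids: List[int]) -> None:
    """Drop the -1 padding from one step's sampler output, then append
    only the tokens that were actually accepted for this sequence."""
    valid = [t for t in sampled_token_ids if t != INVALID_TOKEN_ID]
    seq_token_ids.extend(valid)

seq = [101, 7]
append_valid_tokens(seq, [21, -1, -1])
print(seq)  # [101, 7, 21]
```

The padding keeps every sequence's sampler output the same length, so a single strip pass at append time is enough to recover the ragged per-sequence results.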
-
These lines suggest that only the last token id is appended to input_ids. For example, for the last sequence, I don't understand how the target model computes the key and value for tokens 1, 2, 3.
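One way the keys and values for tokens 1, 2, 3 can already exist, per general speculative-decoding mechanics rather than anything confirmed in this thread: the verification forward pass feeds all draft tokens through the target model at once, writing their keys and values into the KV cache as a side effect, so the next step only needs the newest token as input. A toy single-head sketch, with every name hypothetical:

```python
import torch

torch.manual_seed(0)
d = 8
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
cache_k = torch.empty(0, d)  # grows as positions are processed
cache_v = torch.empty(0, d)

def attend(x: torch.Tensor) -> torch.Tensor:
    """Process new positions: append their K/V to the cache, then attend
    over everything cached so far (single head; the causal mask within
    x is omitted for brevity)."""
    global cache_k, cache_v
    q = x @ Wq
    cache_k = torch.cat([cache_k, x @ Wk])
    cache_v = torch.cat([cache_v, x @ Wv])
    scores = (q @ cache_k.T) / d ** 0.5
    return scores.softmax(dim=-1) @ cache_v

prompt = torch.randn(5, d)
attend(prompt)               # prefill: cache now holds 5 positions
draft = torch.randn(3, d)    # draft tokens "1, 2, 3"
attend(draft)                # verification pass caches K/V for all three at once
new_tok = torch.randn(1, d)  # next step: only the newest token is fed in
attend(new_tok)              # it still attends to the cached K/V of 1, 2, 3
assert cache_k.shape[0] == 9
```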