feat: improve spyre logits processors for CB #527
base: main
Conversation
Signed-off-by: Wallas Santos <[email protected]>
```python
    self.input_batch.refresh_metadata()
else:
    # Due to logits processor we need to refresh metadata at each step
    self.input_batch.refresh_metadata()
```
@tjohnson31415 please see this.
👋 Hi! Thank you for contributing to vLLM support on Spyre.
Or this can be done with
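For example, since both branches end with the same call, the refresh could run unconditionally after the conditional logic; a minimal sketch, where the condition name is an assumption about the surrounding code:

```python
if batch_changed:  # hypothetical condition from the surrounding code
    ...  # existing batch-update handling
# Logits processors require metadata to be refreshed at every step
self.input_batch.refresh_metadata()
```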
Now you are good to go 🚀
Signed-off-by: Wallas Santos <[email protected]>
```diff
 def test_spyre_batch1_logit_bias(model: ModelInfo, backend, monkeypatch,
-                                 use_llm_cache, warmup_shapes):
+                                 use_llm_cache, warmup_shapes, max_model_len,
+                                 max_num_seqs, cb: int):
```
Thoughts on swapping these tests to continuous batching only and not testing static batching at all?
Currently this file takes about 10 minutes to run for static batching, and I'm not sure that makes sense given that we're only focusing on improvements to continuous batching.
```diff
-    warmup_shapes=warmup_shapes,
-)
+    warmup_shapes=warmup_shapes if cb == 0 else None,
+    use_cb=cb == 1)
```
While we're in here, I think the `token_diversity` check could be sped up by:
- using the `n` parameter instead of a for loop for batched decodes
- setting the random seed to a fixed value and using `n < 10`

Running 20 separate batches for one test takes quite a long time on GitHub Actions 🐌🐌🐌 (see the sketch below).
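Something like this, a sketch where the model name and concrete values are placeholders:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="some-model")  # placeholder model name

# One batched request with n parallel samples and a fixed seed replaces
# many separate generate() calls in the token_diversity check.
params = SamplingParams(
    n=8,              # n < 10 sampled completions per prompt
    seed=42,          # fixed seed keeps the check deterministic
    temperature=1.0,  # sampling must stay stochastic to measure diversity
    max_tokens=20,
)
outputs = llm.generate(["Hello, my name is"], params)
# Compare the n completions of the single request for token diversity.
completions = [c.text for c in outputs[0].outputs]
```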
Signed-off-by: Wallas Santos <[email protected]>
bot:test
```python
# Convert logits to probability distribution
probability_values = torch.nn.functional.softmax(logits, dim=-1)
# Calculate maximum probabilities per sequence
max_probabilities = torch.amax(probability_values, dim=-1, keepdim=True)
# Adjust min_p
adjusted_min_p = max_probabilities.mul_(
    self.min_p[self._prefill_index].unsqueeze(0))
# Identify valid tokens using threshold comparison
invalid_token_mask = probability_values < adjusted_min_p
# Apply mask using boolean indexing
logits[invalid_token_mask] = -float('inf')
self._prefill_index = None

return logits
```
This code is identical to the superclass `apply()`. Isn't it better to just call `super().apply()`?
I'd like to, but `self.min_p` contains data for other requests, which I filter out with `self._prefill_index` to get only the request being prefilled.
Ok, I created a new class `PrefillHelperLogitsProcessor`: it instantiates two logits processors, one for prefill and one for decoding, and our builtin logits processors just reuse the existing implementations in vLLM. The class is more efficient than the `LogitsProcessorWrapper`, but it only works if the state between prefill and decode is independent; it won't work for golden token injection, for example. So I think I solved the code deduplication in vllm-spyre.
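A minimal sketch of that shape — the constructor arguments and delegation details are assumptions based on this description, not the PR's exact code:

```python
class PrefillHelperLogitsProcessor:
    """Sketch: hold two independent logits processors, one applied to the
    single sequence being prefilled and one applied to the decode batch."""

    def __init__(self, prefill_proc, decode_proc):
        self._prefill_proc = prefill_proc  # e.g. a vLLM builtin processor
        self._decode_proc = decode_proc
        self._prefill_index: int | None = None

    def set_prefill(self, idx: int) -> None:
        # The runner marks which batch row is the in-flight prefill.
        self._prefill_index = idx

    def apply(self, logits):
        if self._prefill_index is not None:
            # Prefill step: only the prefilled request's logits arrive here.
            out = self._prefill_proc.apply(logits)
            self._prefill_index = None
            return out
        # Decode step: full-batch logits go to the decode-side processor.
        return self._decode_proc.apply(logits)
```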
> but it only works if the state between prefill and decode are independent

I think a well-behaved LogitsProcessor doesn't need persistent state between prefill and decode, i.e. it can be created from a request after output tokens have been generated. This would be needed to support resuming generation after preemption.

Could GTI be updated to remove the persistent state, i.e. to get its state from the content of the batch_update? `current_token_idx` could be set based on the length of the current output tokens, and `has_error` set if the expected and output tokens don't match, or something like that.
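A rough sketch of that idea — the method and attribute names here are hypothetical, not from the PR:

```python
# Hypothetical reconstruction of GoldenTokenInjector state from the batch
# contents, so that nothing persists between prefill and decode.
def rebuild_state(self, output_tokens: list[int],
                  expected_tokens: list[int]) -> None:
    # The next golden token to inject is simply the next position.
    self.current_token_idx = len(output_tokens)
    # Flag divergence between what was generated and what was expected.
    self.has_error = output_tokens != expected_tokens[:len(output_tokens)]
```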
Signed-off-by: Wallas Santos <[email protected]>
bot:test
```python
class SpyreLogitsProcessor:

    def set_prefill(self, idx: int) -> None:
        raise NotImplementedError
```
Can the GoldenTokenInjector implement SpyreLogitsProcessor and move the state from prefill to decode in the `set_prefill` method?
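If so, it might look roughly like this — a sketch in which the internal attributes are assumptions, not the PR's code:

```python
class GoldenTokenInjector(SpyreLogitsProcessor):
    """Sketch: promote per-request state staged at prefill time into the
    decode-side state when the runner signals the prefill index."""

    def __init__(self) -> None:
        self._staged_state: dict[int, object] = {}  # hypothetical
        self._decode_state: dict[int, object] = {}  # hypothetical

    def set_prefill(self, idx: int) -> None:
        # Move the staged prefill state into decode state for this row.
        if idx in self._staged_state:
            self._decode_state[idx] = self._staged_state.pop(idx)
```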
```diff
-self.logitsprocs_wrappers = [lp for lp \
-    in self.logitsprocs.all if isinstance(lp, LogitProcessorWrapper)]
+self.spyre_logitsprocs = [lp for lp \
+    in self.logitsprocs.all if isinstance(lp, SpyreLogitsProcessor)]
```
Do we need to require that all LogitsProcessors be SpyreLogitsProcessors?
Signed-off-by: Travis Johnson <[email protected]>
@maxdebayser found that this PR has a bug that can be reproduced using the approach in #508
Description
This PR adds builtin logits processors for Spyre. These logits processors are the same as vLLM's, but optimized so that they don't need to be wrapped by the `LogitsProcessorWrapper`, which slices logits at each engine step. They work by calling `set_prefill_index` and properly handling prefill in our Spyre model runner. This PR also extends the sampling-params tests to run with continuous batching for the parameters that use logits processors under the hood.
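For context, the runner-side hookup might look roughly like this — a sketch where the call site is an assumption, and the method name follows the `SpyreLogitsProcessor` interface shown above rather than the PR's exact code:

```python
# Sketch: before a prefill step, the Spyre model runner tells each
# Spyre-aware logits processor which batch row is being prefilled, so it
# can apply per-request parameters without a slicing wrapper.
def _prepare_prefill(self, prefill_batch_index: int) -> None:
    for lp in self.spyre_logitsprocs:
        lp.set_prefill(prefill_batch_index)
```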