
Conversation

sducouedic
Collaborator

Consecutive prefill operations are interleaved with a decode step to minimize interruptions of currently running requests. This mitigates peaks in inter-token latency (ITL). A prefill is skipped if the previous step was also a prefill.
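A minimal sketch of the scheduling rule, for illustration only (the names below are hypothetical, not the actual implementation):

# Hypothetical sketch: prefills alternate with decodes once requests are running.
def can_schedule_prefill(last_step_was_prefill: bool,
                         num_running_requests: int) -> bool:
    # With no running requests there is no decode to protect, so allow the prefill.
    if num_running_requests == 0:
        return True
    # Otherwise skip the prefill if the previous step was also a prefill,
    # so every prefill is followed by at least one decode step.
    return not last_step_was_prefill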

@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure your code passes all the linting checks, otherwise your PR can't be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

max_prompt_batch_size = 1
max_context_len = self.scheduler_config.max_model_len

# two consecutive prefill steps are now allowed
Collaborator


nit: I guess you mean "not allowed", not "now allowed", in the comment

@maxdebayser
Collaborator

I think it makes sense to rate limit the prefills, but shouldn't this be regulated by how many requests are in the batch? Let's say the current batch is completely empty and I send N requests: then it will take 2*N steps to ramp up to full compute utilization.
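Back-of-the-envelope illustration of that ramp-up, under the simplifying assumption that the batch starts empty and each prefill step admits exactly one request (names here are hypothetical):

# Simple simulation of the alternation rule (assumptions as above).
def steps_to_admit_all(n_requests: int) -> int:
    steps = 0
    admitted = 0
    last_was_prefill = False
    while admitted < n_requests:
        if admitted == 0 or not last_was_prefill:
            admitted += 1              # prefill step: admit one new request
            last_was_prefill = True
        else:
            last_was_prefill = False   # decode step for the already-running requests
        steps += 1
    return steps

print(steps_to_admit_all(8))  # 15 steps, i.e. roughly 2*N before all N requests are running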
