[DisaggEverything] Tokens in<>out `/generate` endpoint #24261

NickLucche · 2025-09-04T16:21:41Z

Overview

First step in implementing the "Disaggregated Everything" proposal #22817.
This PR focuses on the following component:

In particular, it introduces:

GenerateRequest/Response interface. NOTE: SamplingParams can now be validated and deserialized within a pydantic message (eg input-only). Check out PydanticMsgspecMixin.
/generate tokens-only endpoint
An initial set of endpoint args, mimicking /v1/chat/completions for the most part.
a --tokens-only "modality" for starting up the server, mostly intended to simplify ux.
/abort/request/ endpoint, see below.

Implementation Details

To get a "tokenizer-free" endpoint, one can already use --skip_tokenizer_init and/or detokenize: False sampling option, forcing the use of basic IncrementalDetokenizer.
In order to make ux easier for a Disaggregated Everything setup, a --tokens-only option is added, which enforces the two flags above.
This way the Detokenizer is optional, as intended in the initial design.
INFO 09-10 13:36:17 [arg_utils.py:1281] Skipping tokenizer initialization for tokens-only mode.

Furthermore, it enables the /abort_requests endpoint.

/abort_requests is a solution to the detection of stop strings, which is one of the main challenges to get a real "tokenizer-free" endpoint.
Currently this is done in AsyncLLM output_handler_loop, followed by an IPC abort request back to the EngineCore, like so:

	+-->AsyncLLM---+------------------->API
	|              |
	|ECOs          |stop_string abort
	|              |
EngineCore <-------+

With this Disaggregated Everything, we task the "Coordinator" (to be implemented in a follow-up PR) with detokenization. Hence, the "generate" instance needs to act more as a "remote EngineCore". The workflow is the following:

	+-->AsyncLLM---+------------------->API
	|              |
	|              |stop_string abort
	|              |
GenerateResponse   |
	|			   |				Coordinator Node
____|______________|_______________________________
	+-->AsyncLLM   |/abort_requests
	|              |
	|ECOs          |stop_string abort
	|              |
EngineCore <-------+
									Generate (tokens-only) Node

How to test

# vllm serve Qwen/Qwen3-0.6B
#python examples/online_serving/token_generation_client.py
{'request_id': 'a0e37922547c4d95885b9ce19588b9ef', 'choices': [{'index': 0, 'logprobs': None, 'finish_reason': 'stop', 'token_ids': [785, 7513, 9145, 320, 38807, 8, 17167, 315, 3070, 17, 22, 4462, 5302, 334, 13, 151645]}], 'prompt_logprobs': None, 'kv_transfer_params': None}
--------------------------------------------------
Token generation results:
The European Union (EU) consists of **27 member states**.<|im_end|>

or among other tests

# Ensures tokenizer+/generate+detokenizer == /v1/chat/completions
python -m pytest -v -s tests/entrypoints/openai/test_serving_tokens.py::test_same_response_as_chat_completions

Follow up PRs:

streaming mode
MultiModalFeatureSpec input, will add once Renderer effort progresses
more endpoint params

mergify · 2025-09-08T13:57:49Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

NickLucche · 2025-09-10T15:04:34Z

cc @robertgshaw2-redhat @ywang96 @DarkLight1337

smarterclayton · 2025-09-11T14:41:05Z

EDIT: Moved to #22817 (comment)

russellb · 2025-09-11T15:17:35Z

I see that /v1/generate was added to the OpenAI API endpoint. This API is often exposed directly to end users. Is this API intended to be potentially used directly by end users, or is it more of an internal infrastructure API?

If it's a different audience, it may be better suited to a different HTTP service scoped to a different audience and purpose. I had similar feedback about an earlier version of http metadata exchange for the Nixl connector, but the latest version seems to have moved it to its own http service: #22274

If it is desired to keep this on the existing OpenAI API, I think it'd be nice if we used namespacing to make it clear which APIs are our own custom ones vs. our implementation of APIs defined by OpenAI. One option would be something like v1/vllm/generate where everything under v1/vllm/ is a vllm-custom API aligned with the V1 OpenAI API. Another option is vllm/generate or vllm/v1/generate. I feel less strongly about the specific choice than just doing something to separate our custom APIs.

NickLucche · 2025-09-12T13:12:50Z

Is this API intended to be potentially used directly by end users, or is it more of an internal infrastructure API

We're still discussing with @smarterclayton the full spectrum of intended use cases.
In my view, it's definitely going to be aimed to be used in a larger infrastructure, but there are also nicher cases where someone just wants vLLM for inference but doesn't care about the added overhead of OAI specs (eg RL).

I feel less strongly about the specific choice than just doing something to separate our custom APIs

I understand, would you be in favor of a separate entrypoint altogether? My motivation for keeping things inside the OAI one was to enable easy access to the other endpoints, which are not exclusive, at least in this early stage.

vllm/v1/generate works for me, although @smarterclayton was raising the issue of keeping the interface "open" as in not vllm-exclusive.

russellb · 2025-09-12T14:31:26Z

Is this API intended to be potentially used directly by end users, or is it more of an internal infrastructure API

We're still discussing with @smarterclayton the full spectrum of intended use cases. In my view, it's definitely going to be aimed to be used in a larger infrastructure, but there are also nicher cases where someone just wants vLLM for inference but doesn't care about the added overhead of OAI specs (eg RL).

I feel less strongly about the specific choice than just doing something to separate our custom APIs

I understand, would you be in favor of a separate entrypoint altogether? My motivation for keeping things inside the OAI one was to enable easy access to the other endpoints, which are not exclusive, at least in this early stage.

It's probably fine to keep within the same API. It doesn't seem harmful to expose (like maybe internal infrastructure metadata exchange would be).

vllm/v1/generate works for me, although @smarterclayton was raising the issue of keeping the interface "open" as in not vllm-exclusive.

Fair point. I just think it'd be nice to make it clear where we're copying OpenAI vs. defining our own completely independent APIs. It could be inference/v1/generate or something ...

NickLucche · 2025-09-18T16:38:33Z

@russellb Changed naming to the one you suggested. Let me know if there's something else I should change in this PR in your view, looking to move this forward

rizar · 2025-09-21T01:39:23Z

I'm looking forward to this feature!

Question: will this endpoint propagate data_parallel_rank selector to EngineCoreRequest? Like what is currently added in #24945

mergify · 2025-09-21T23:16:41Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-10-08T14:36:32Z

Documentation preview: https://vllm--24261.org.readthedocs.build/en/24261/

Signed-off-by: NickLucche <[email protected]>

mergify bot added frontend v1 documentation Improvements or additions to documentation labels Sep 4, 2025

mergify bot added the needs-rebase label Sep 8, 2025

NickLucche changed the title ~~[do not merge] Tokens in<>out /generate endpoint~~ [DisaggEverything] Tokens in<>out /generate endpoint Sep 10, 2025

NickLucche marked this pull request as ready for review September 10, 2025 15:04

NickLucche requested review from DarkLight1337, WoosukKwon, aarnphm, alexm-redhat, comaniac, njhill, robertgshaw2-redhat, simon-mo and ywang96 as code owners September 10, 2025 15:04

mergify bot removed the needs-rebase label Sep 10, 2025

tlrmchlsmth mentioned this pull request Sep 11, 2025

[Feature]: Implement SRT generation for audio transcription in vLLM #24302

Open

1 task

NickLucche force-pushed the generate-api branch from 240f870 to 115b87f Compare September 18, 2025 16:38

NickLucche requested a review from chaunceyjiang as a code owner September 18, 2025 16:38

mergify bot added the needs-rebase label Sep 21, 2025

NickLucche mentioned this pull request Oct 3, 2025

[DisaggEverything] DisaggregatedRequestManager aka Coordinator [1/N] #26178

Draft

NickLucche added 16 commits October 24, 2025 10:34

init

6e77c0d

Signed-off-by: NickLucche <[email protected]>

msgspec+pydantic ser

9aa0e8d

Signed-off-by: NickLucche <[email protected]>

msgspec+pydantic ser mixin

49c846c

Signed-off-by: NickLucche <[email protected]>

example script

59ebbb9

Signed-off-by: NickLucche <[email protected]>

example script

4d5bc9a

Signed-off-by: NickLucche <[email protected]>

tests

62a5707

Signed-off-by: NickLucche <[email protected]>

detokenize False

2c6c4e4

Signed-off-by: NickLucche <[email protected]>

support lora

9efd3f7

Signed-off-by: NickLucche <[email protected]>

tokens-only cli arg

5deb313

Signed-off-by: NickLucche <[email protected]>

enforcing tokens-only+abort endpoint

8e37dc6

Signed-off-by: NickLucche <[email protected]>

stop string tests

dec0e97

Signed-off-by: NickLucche <[email protected]>

cruft

a7fcf28

Signed-off-by: NickLucche <[email protected]>

change endpoint name

edcd6f3

Signed-off-by: NickLucche <[email protected]>

precommit and update

0362f42

Signed-off-by: NickLucche <[email protected]>

update tests after v0 deprecation

06785d8

Signed-off-by: NickLucche <[email protected]>

remove openai prefix from oaiservingtoken

fdecd4f

Signed-off-by: NickLucche <[email protected]>

NickLucche force-pushed the generate-api branch from 115b87f to fdecd4f Compare October 24, 2025 14:44

mergify bot removed the needs-rebase label Oct 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[DisaggEverything] Tokens in<>out `/generate` endpoint #24261

[DisaggEverything] Tokens in<>out `/generate` endpoint #24261

NickLucche commented Sep 4, 2025 •

edited by github-actions bot

Loading

Uh oh!

mergify bot commented Sep 8, 2025

Uh oh!

NickLucche commented Sep 10, 2025

Uh oh!

smarterclayton commented Sep 11, 2025 •

edited

Loading

Uh oh!

russellb commented Sep 11, 2025

Uh oh!

NickLucche commented Sep 12, 2025

Uh oh!

russellb commented Sep 12, 2025

Uh oh!

NickLucche commented Sep 18, 2025

Uh oh!

rizar commented Sep 21, 2025

Uh oh!

mergify bot commented Sep 21, 2025

Uh oh!

mergify bot commented Oct 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

[DisaggEverything] Tokens in<>out /generate endpoint #24261

Are you sure you want to change the base?

[DisaggEverything] Tokens in<>out /generate endpoint #24261

Conversation

NickLucche commented Sep 4, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Implementation Details

How to test

Uh oh!

mergify bot commented Sep 8, 2025

Uh oh!

NickLucche commented Sep 10, 2025

Uh oh!

smarterclayton commented Sep 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

russellb commented Sep 11, 2025

Uh oh!

NickLucche commented Sep 12, 2025

Uh oh!

russellb commented Sep 12, 2025

Uh oh!

NickLucche commented Sep 18, 2025

Uh oh!

rizar commented Sep 21, 2025

Uh oh!

mergify bot commented Sep 21, 2025

Uh oh!

mergify bot commented Oct 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[DisaggEverything] Tokens in<>out `/generate` endpoint #24261

[DisaggEverything] Tokens in<>out `/generate` endpoint #24261

NickLucche commented Sep 4, 2025 •

edited by github-actions bot

Loading

smarterclayton commented Sep 11, 2025 •

edited

Loading