
Commit dacd7ed

Rename VLLM to VLLMOffline and create server-based VLLM/AsyncVLLM
1 parent 03b3e7a commit dacd7ed

22 files changed: +1390 −376 lines

docs/reference/models/vllm.md (+24 −41)
@@ -1,68 +1,51 @@
 # vLLM
 
+## Prerequisites
 
-!!! Note "Installation"
+The Outlines `VLLM` model is intended to be used along with a vLLM instance running on a separate server (local or remote). Make sure you have a vLLM server running before using the `VLLM` model. As the client relies on the `openai` Python SDK, you need to have the `openai` package installed. If you instead want to use the vLLM offline inference mode, please refer to the [VLLMOffline model documentation](./vllm_offline.md).
 
-    You need to install the `vllm` library to use the vLLM integration: `pip install vllm`. The default installation only works on machines with a GPU, follow the [installation section][vllm-install-cpu] for instructions to install vLLM for CPU or ROCm.
+## Initialize the model
 
-Consult the [vLLM documentation][vllm-docs] for detailed informations about how to initialize OpenAI clients and the available options.
-
-## Load the model
-
-Outlines supports models available via vLLM's offline batched inference interface. You can load a model using:
+To load the model, use the `from_vllm` function. It takes either an `OpenAI` or an `AsyncOpenAI` instance from the `openai` library as argument. Depending on whether the client is synchronous or asynchronous, you will receive a `VLLM` or an `AsyncVLLM` model instance.
 
 ```python
+import openai
 import outlines
-from vllm import LLM
-
-model = outlines.from_vllm(LLM("microsoft/Phi-3-mini-4k-instruct"))
-```
-
-Models are loaded from the [HuggingFace hub](https://huggingface.co/).
 
+sync_openai_client = openai.OpenAI(base_url="...")
+async_openai_client = openai.AsyncOpenAI(base_url="...")
 
-!!! Warning "Device"
-
-    The default installation of vLLM only allows to load models on GPU. See the [installation instructions][vllm-install-cpu] to run models on CPU.
+sync_model = outlines.from_vllm(sync_openai_client)
+print(type(sync_model)) # <class 'outlines.models.vllm.VLLM'>
 
+async_model = outlines.from_vllm(async_openai_client)
+print(type(async_model)) # <class 'outlines.models.vllm.AsyncVLLM'>
+```
 
 ## Generate text
 
-To generate text, you can just call the model with a prompt as argument:
+To generate text, call the model with a prompt as argument and, optionally, an output type to use structured generation:
 
 ```python
-import outlines
-from vllm import LLM
-
-model = outlines.from_vllm(LLM("microsoft/Phi-3-mini-4k-instruct"))
-answer = model("Write a short story about a cat.")
-```
-
-You can also use structured generation with the `VLLM` model by providing an output type after the prompt:
-
-```python
-import outlines
-from vllm import LLM
 from pydantic import BaseModel
 
 class Character(BaseModel):
     name: str
 
-model = outlines.from_vllm(LLM("microsoft/Phi-3-mini-4k-instruct"))
-answer = model("Create a character.", output_type=Character)
+answer = sync_model("Create a character.", output_type=Character)
+answer = await async_model("Create a character.", output_type=Character)
 ```
 
-The VLLM model supports batch generation. To use it, you can pass a list of strings as prompt instead of a single string.
+The `VLLM` model also supports streaming.
 
-## Optional parameters
-
-When calling the model, you can provide optional parameters on top of the prompt and the output type. Those will be passed on to the `LLM.generate` method of the `vllm` library. An optional parameter of particular interest is `sampling_params`, which is an instance of `SamplingParams`. You can find more information about it in the [vLLM documentation][https://docs.vllm.ai/en/latest/api/inference_params.html].
+```python
+for chunk in sync_model.stream("Write a short story about a cat.", max_tokens=100):
+    print(chunk)
+```
 
-!!! Warning
+## Optional parameters
 
-    Streaming is not available for the offline vLLM integration.
+When calling the model, you can provide optional parameters on top of the prompt and the output type. Those will be passed on to the `openai` client. An optional parameter of particular interest is `extra_body`, a dictionary of arguments that are specific to vLLM and not part of the standard `openai` interface (see the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) on the OpenAI-compatible server for more information).
 
-[vllm-docs]:https://docs.vllm.ai/en/latest/
-[vllm-install-cpu]: https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html
-[vllm-install-rocm]: https://docs.vllm.ai/en/latest/getting_started/amd-installation.html
-[rocm-flash-attention]: https://github.com/ROCm/flash-attention/tree/flash_attention_for_rocm#amd-gpurocm-support
+[vllm-docs]: https://docs.vllm.ai/en/latest/
+[vllm-online-quickstart]: https://docs.vllm.ai/en/latest/getting_started/quickstart.html#quickstart-online
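
The new page shows synchronous streaming only; the asynchronous model exposes a `stream` method as well (see the `AsyncModel` base class later in this commit). The sketch below is illustrative: the `vllm serve` command, base URL, `api_key="EMPTY"`, and the served model name passed as a keyword argument are assumptions, not part of the diff.

```python
# Illustrative async streaming sketch; assumes a vLLM server was started
# separately, e.g. `vllm serve microsoft/Phi-3-mini-4k-instruct`.
import asyncio

import openai
import outlines


async def main():
    # Base URL and api_key are assumptions; vLLM does not check the key unless
    # the server was started with one.
    client = openai.AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    async_model = outlines.from_vllm(client)

    # Keyword arguments are forwarded to the openai client; the model name
    # below is assumed to match the model the server is serving.
    async for chunk in async_model.stream(
        "Write a short story about a cat.",
        model="microsoft/Phi-3-mini-4k-instruct",
        max_tokens=100,
    ):
        print(chunk, end="")


asyncio.run(main())
```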

docs/reference/models/vllm_offline.md (+68 −0)
@@ -0,0 +1,68 @@
+# vLLM Offline Inference mode
+
+
+!!! Note "Installation"
+
+    You need to install the `vllm` library to use the vLLM integration: `pip install vllm`. The default installation only works on machines with a GPU; follow the [installation section][vllm-install-cpu] for instructions to install vLLM for CPU or ROCm.
+
+Consult the [vLLM documentation][vllm-docs] for detailed information about how to initialize OpenAI clients and the available options.
+
+## Load the model
+
+Outlines supports models available via vLLM's offline batched inference interface. You can load a model using:
+
+```python
+import outlines
+from vllm import LLM
+
+model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))
+```
+
+Models are loaded from the [HuggingFace hub](https://huggingface.co/).
+
+
+!!! Warning "Device"
+
+    The default installation of vLLM only allows loading models on GPU. See the [installation instructions][vllm-install-cpu] to run models on CPU.
+
+
+## Generate text
+
+To generate text, you can just call the model with a prompt as argument:
+
+```python
+import outlines
+from vllm import LLM
+
+model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))
+answer = model("Write a short story about a cat.")
+```
+
+You can also use structured generation with the `VLLMOffline` model by providing an output type after the prompt:
+
+```python
+import outlines
+from vllm import LLM
+from pydantic import BaseModel
+
+class Character(BaseModel):
+    name: str
+
+model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))
+answer = model("Create a character.", output_type=Character)
+```
+
+The `VLLMOffline` model supports batch generation. To use it, you can pass a list of strings as prompt instead of a single string.
+
+## Optional parameters
+
+When calling the model, you can provide optional parameters on top of the prompt and the output type. Those will be passed on to the `LLM.generate` method of the `vllm` library. An optional parameter of particular interest is `sampling_params`, which is an instance of `SamplingParams`. You can find more information about it in the [vLLM documentation](https://docs.vllm.ai/en/latest/api/inference_params.html).
+
+!!! Warning
+
+    Streaming is not available for the offline vLLM integration.
+
+[vllm-docs]: https://docs.vllm.ai/en/latest/
+[vllm-install-cpu]: https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html
+[vllm-install-rocm]: https://docs.vllm.ai/en/latest/getting_started/amd-installation.html
+[rocm-flash-attention]: https://github.com/ROCm/flash-attention/tree/flash_attention_for_rocm#amd-gpurocm-support
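
The two features called out above, `sampling_params` and batch generation, can be combined in a single call. A minimal sketch, with illustrative parameter values:

```python
# Minimal sketch combining `sampling_params` with batch generation
# (a list of prompts); temperature and max_tokens values are illustrative.
import outlines
from vllm import LLM, SamplingParams

model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))

params = SamplingParams(temperature=0.7, max_tokens=100)

# One answer is returned per prompt in the list.
answers = model(
    ["Write a haiku about the sea.", "Write a haiku about the mountains."],
    sampling_params=params,
)
print(answers)
```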

examples/vllm_integration.py renamed to examples/vllm_offline_integration.py (+1 −1)
@@ -4,7 +4,7 @@
 from pydantic import BaseModel
 from transformers import AutoTokenizer
 
-from outlines.models.vllm import adapt_tokenizer
+from outlines.models.vllm_offline import adapt_tokenizer
 from outlines.processors import JSONLogitsProcessor
 

outlines/__init__.py (+1 −0)
@@ -19,6 +19,7 @@
     from_llamacpp,
     from_mlxlm,
     from_vllm,
+    from_vllm_offline,
 )
 from outlines.templates import Template, prompt
 from outlines.types import regex, json_schema, cfg

outlines/generator.py (+30 −12)
@@ -1,6 +1,7 @@
 from typing import Any, Optional, Union
 
 from outlines.models import BlackBoxModel, SteerableModel
+from outlines.models.base import AsyncModel
 from outlines.processors import (
     CFGLogitsProcessor,
     GuideLogitsProcessor,
@@ -13,23 +14,14 @@
 
 
 class BlackBoxGenerator:
-    """Represents a generator for which we don't control constrained generation.
-
-    This type of generator only accepts an output type as an argument defining
-    constrained generation. This output type is not modified and thus only
-    passed through to the model.
-    """
+    """Synchronous generator for which we don't control constrained generation."""
     output_type: Optional[Any]
 
     def __init__(self, model, output_type: Optional[Any]):
         self.model = model
         self.output_type = output_type
 
-        if isinstance(self.output_type, CFG):
-            raise NotImplementedError(
-                "CFG generation is not supported for API-based models"
-            )
-        elif isinstance(self.output_type, FSM):
+        if isinstance(self.output_type, FSM):
             raise NotImplementedError(
                 "FSM generation is not supported for API-based models"
             )
@@ -41,6 +33,29 @@ def stream(self, prompt, **inference_kwargs):
         return self.model.generate_stream(prompt, self.output_type, **inference_kwargs)
 
 
+class AsyncBlackBoxGenerator:
+    """Asynchronous generator for which we don't control constrained generation."""
+    output_type: Optional[Any]
+
+    def __init__(self, model, output_type: Optional[Any]):
+        self.model = model
+        self.output_type = output_type
+
+        if isinstance(self.output_type, FSM):
+            raise NotImplementedError(
+                "FSM generation is not supported for API-based models"
+            )
+
+    async def __call__(self, prompt, **inference_kwargs):
+        return await self.model.generate(prompt, self.output_type, **inference_kwargs)
+
+    async def stream(self, prompt, **inference_kwargs):
+        async for chunk in self.model.generate_stream( # pragma: no cover
+            prompt, self.output_type, **inference_kwargs
+        ):
+            yield chunk
+
+
 class SteerableGenerator:
     """Represents a generator for which we control constrained generation.
 
@@ -134,7 +149,10 @@ def Generator(
         if processor is not None:
            raise NotImplementedError("This model does not support logits processors")
         else:
-            return BlackBoxGenerator(model, output_type)
+            if isinstance(model, AsyncModel):
+                return AsyncBlackBoxGenerator(model, output_type)
+            else:
+                return BlackBoxGenerator(model, output_type)
     else:
         if processor is not None:
             return SteerableGenerator.from_processor(model, processor)

outlines/models/__init__.py (+5 −4)
@@ -22,11 +22,12 @@
     TransformersMultiModal,
     from_transformers,
 )
-from .vllm import VLLM, from_vllm
+from .vllm_offline import VLLMOffline, from_vllm_offline
+from .vllm import AsyncVLLM, VLLM, from_vllm
 
 LogitsGenerator = Union[
-    Transformers, LlamaCpp, OpenAI, MLXLM, VLLM, Ollama
+    Transformers, LlamaCpp, OpenAI, MLXLM, VLLMOffline, Ollama
 ]
 
-SteerableModel = Union[LlamaCpp, Transformers, MLXLM, VLLM]
-BlackBoxModel = Union[OpenAI, Anthropic, Gemini, Ollama, Dottxt]
+SteerableModel = Union[LlamaCpp, Transformers, MLXLM, VLLMOffline]
+BlackBoxModel = Union[OpenAI, Anthropic, Gemini, Ollama, Dottxt, AsyncVLLM, VLLM]

outlines/models/base.py (+86 −1)
@@ -38,7 +38,7 @@ def format_output_type(self, output_type):
 
 
 class Model(ABC):
-    """Base class for all models.
+    """Base class for all synchronous models.
 
     This class defines a shared `__call__` method that can be used to call the
     model directly.
@@ -116,3 +116,88 @@ def generate_stream(self, model_input, output_type=None, **inference_kwargs):
 
         """
         ...
+
+
+class AsyncModel(ABC):
+    """Base class for all asynchronous models.
+
+    This class defines a shared `__call__` method that can be used to call the
+    model directly.
+    All models inheriting from this class must define a `type_adapter`
+    attribute of type `ModelTypeAdapter`. The methods of the `type_adapter`
+    attribute are used in the `generate` method to format the input and output
+    types received by the model.
+    Additionally, local models must define a `tensor_library_name` attribute.
+
+    """
+    type_adapter: ModelTypeAdapter
+    tensor_library_name: str
+
+    async def __call__(self, model_input, output_type=None, **inference_kwargs):
+        """Call the model.
+
+        Users can call the model directly, in which case we will create a
+        generator instance with the output type provided and call it.
+        Thus, those commands are equivalent:
+        ```python
+        generator = Generator(model, Foo)
+        await generator("prompt")
+        ```
+        and
+        ```python
+        await model("prompt", Foo)
+        ```
+
+        """
+        from outlines import Generator
+
+        generator = Generator(self, output_type)
+        return await generator(model_input, **inference_kwargs)
+
+    async def stream(self, model_input, output_type=None, **inference_kwargs):
+        """Stream a response from the model.
+
+        Users can use the `stream` method from the model directly, in which
+        case we will create a generator instance with the output type provided
+        and then invoke its `stream` method.
+        Thus, those commands are equivalent:
+        ```python
+        generator = Generator(model, Foo)
+        async for chunk in generator("prompt"):
+            print(chunk)
+        ```
+        and
+        ```python
+        async for chunk in model.stream("prompt", Foo):
+            print(chunk)
+        ```
+
+        """
+        from outlines import Generator
+
+        generator = Generator(self, output_type)
+
+        async for chunk in generator.stream(model_input, **inference_kwargs): # pragma: no cover
+            yield chunk
+
+    @abstractmethod
+    async def generate(self, model_input, output_type=None, **inference_kwargs):
+        """Generate a response from the model.
+
+        The output_type argument contains a logits processor for local models
+        while it contains a type (Json, Enum...) for the API-based models.
+        This method is not intended to be used directly by end users.
+
+        """
+        ...
+
+    @abstractmethod
+    async def generate_stream(self, model_input, output_type=None, **inference_kwargs):
+        """Generate a stream of responses from the model.
+
+        The output_type argument contains a logits processor for local models
+        while it contains a type (Json, Enum...) for the API-based models.
+        This method is not intended to be used directly by end users.
+
+        """
+        ...
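
To make the `AsyncModel` contract concrete, here is a toy subclass. It is a sketch only: the echo logic and the `None` type adapter are placeholders, and the abstract methods are called directly purely for illustration (end users normally go through `__call__` and `stream`, which route via `Generator`).

```python
# Toy AsyncModel subclass; the echo backend and None type adapter are placeholders.
import asyncio

from outlines.models.base import AsyncModel


class EchoAsyncModel(AsyncModel):
    type_adapter = None  # a real model would supply a ModelTypeAdapter

    async def generate(self, model_input, output_type=None, **inference_kwargs):
        # A real backend would call its client here and honor output_type.
        return f"echo: {model_input}"

    async def generate_stream(self, model_input, output_type=None, **inference_kwargs):
        # A real backend would yield chunks as they arrive from the client.
        for token in f"echo: {model_input}".split():
            yield token


async def main():
    model = EchoAsyncModel()
    print(await model.generate("hello world"))
    async for token in model.generate_stream("hello world"):
        print(token)


asyncio.run(main())
```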
