The Outlines `VLLM` model is intended to be used with a vLLM instance running on a separate server (local or remote). Make sure you have a vLLM server running before using the `VLLM` model. As the vLLM client relies on the `openai` Python SDK, you also need the `openai` package installed. If you want to use the vLLM offline inference mode instead, please refer to the [VLLMOffline model documentation](./vllm_offline.md).
## Initialize the model
Consult the [vLLM documentation][vllm-docs] for detailed information about how to initialize OpenAI clients and the available options.
To load the model, you can use the `from_vllm` function. The argument of the function is either an `OpenAI` or `AsyncOpenAI` instance from the `openai` library. Based on whether the `openai` instance is synchronous or asynchronous, you will receive a `VLLM` or `AsyncVLLM` model instance.
```python
import openai
import outlines

# "http://localhost:8000/v1" is vLLM's default server address; the API key is a
# placeholder, as the vLLM server ignores it unless authentication is enabled.
sync_model = outlines.from_vllm(openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy"))
async_model = outlines.from_vllm(openai.AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy"))
```
## Generate text

To generate text, call the model with a prompt. You can also provide an output type after the prompt to use structured generation:

```python
from pydantic import BaseModel

class Character(BaseModel):
    name: str

answer = sync_model("Create a character.", output_type=Character)
answer = await async_model("Create a character.", output_type=Character)  # within an async function
```
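Since the asynchronous call above has to run inside a coroutine, here is a minimal sketch of driving it with `asyncio` (it assumes the `async_model` and `Character` defined in the previous examples):

```python
import asyncio

async def main():
    # `AsyncVLLM` calls return coroutines, so they must be awaited.
    answer = await async_model("Create a character.", output_type=Character)
    print(answer)

asyncio.run(main())
```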
The `VLLM` model also supports streaming:
```python
for chunk in sync_model.stream("Write a short story about a cat.", max_tokens=100):
    print(chunk)
```
## Optional parameters
When calling the model, you can provide optional parameters on top of the prompt and the output type. These will be passed on to the `openai` client. An optional parameter of particular interest is `extra_body`, a dictionary of arguments that are specific to vLLM and not part of the standard `openai` interface (see the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) on the OpenAI-compatible server for more information).
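As a hedged illustration, the sketch below passes a standard OpenAI parameter (`max_tokens`) alongside a vLLM-specific one (`top_k`) through `extra_body`; the exact extra arguments accepted depend on your vLLM server version:

```python
answer = sync_model(
    "Write a short story about a cat.",
    max_tokens=100,
    extra_body={"top_k": 10},  # vLLM-specific sampling option, not part of the standard OpenAI API
)
```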
## The `VLLMOffline` model

You need to install the `vllm` library to use the vLLM offline inference mode: `pip install vllm`. The default installation only works on machines with a GPU; follow the [installation section][vllm-install-cpu] for instructions to install vLLM for CPU or ROCm.
Consult the [vLLM documentation][vllm-docs] for detailed information about how to initialize the `LLM` engine and the available options.
## Load the model
Outlines supports models available via vLLM's offline batched inference interface. You can load a model using:
```python
import outlines
from vllm import LLM

model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))
```
Models are loaded from the [HuggingFace hub](https://huggingface.co/).
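As a non-authoritative sketch, engine options can be passed directly to vLLM's `LLM` constructor when loading a model; `dtype` and `max_model_len` below are illustrative values, so check the vLLM documentation for the options available in your version:

```python
import outlines
from vllm import LLM

# Illustrative engine options; tune them for your hardware and model.
model = outlines.from_vllm_offline(
    LLM("microsoft/Phi-3-mini-4k-instruct", dtype="half", max_model_len=4096)
)
```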
!!! Warning "Device"
    The default installation of vLLM only supports loading models on GPU. See the [installation instructions][vllm-install-cpu] to run models on CPU.
## Generate text
To generate text, you can just call the model with a prompt as argument:
```python
import outlines
from vllm import LLM

model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))

answer = model("Write a short story about a cat.")
```
You can also use structured generation with the `VLLMOffline` model by providing an output type after the prompt:
42
+
43
+
```python
import outlines
from vllm import LLM
from pydantic import BaseModel

class Character(BaseModel):
    name: str

model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))

answer = model("Create a character.", output_type=Character)
```
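Other output types can be used in the same way. As a hedged sketch, assuming your Outlines version accepts `typing.Literal` as an output type for multiple-choice generation:

```python
from typing import Literal

# Constrain the answer to one of a fixed set of choices.
mood = model("Is the character happy or sad?", output_type=Literal["happy", "sad"])
```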
The `VLLMOffline` model supports batch generation. To use it, pass a list of strings as the prompt instead of a single string.
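For instance, a minimal sketch reusing the model loaded above:

```python
# One completion is generated per prompt in the list.
answers = model([
    "Write a haiku about the sea.",
    "Write a haiku about the mountains.",
])
```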
## Optional parameters
When calling the model, you can provide optional parameters on top of the prompt and the output type. These will be passed on to the `LLM.generate` method of the `vllm` library. An optional parameter of particular interest is `sampling_params`, an instance of `SamplingParams`. You can find more information about it in the [vLLM documentation](https://docs.vllm.ai/en/latest/api/inference_params.html).
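For example, a brief sketch with illustrative sampling values:

```python
from vllm import SamplingParams

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=200)
answer = model("Write a short story about a cat.", sampling_params=params)
```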
!!! Warning
    Streaming is not available for the offline vLLM integration.