
Commit dacd7ed

Rename VLLM to VLLMOffline and create server-based VLLM/AsyncVLLM
1 parent 03b3e7a commit dacd7ed

22 files changed: +1390 −376 lines

docs/reference/models/vllm.md (+24 −41)
@@ -1,68 +1,51 @@
 # vLLM
 
+## Prerequisites
 
-!!! Note "Installation"
+The Outlines `VLLM` model is intended to be used along with a vLLM instance running on a separate server (local or remote). Make sure you have a vLLM server running before using the `VLLM` model. As the client relies on the `openai` Python SDK, you need to have the `openai` package installed. If you instead want to use the vLLM offline inference mode, please refer to the [VLLMOffline model documentation](./vllm_offline.md).
 
-    You need to install the `vllm` library to use the vLLM integration: `pip install vllm`. The default installation only works on machines with a GPU, follow the [installation section][vllm-install-cpu] for instructions to install vLLM for CPU or ROCm.
+## Initialize the model
 
-Consult the [vLLM documentation][vllm-docs] for detailed informations about how to initialize OpenAI clients and the available options.
-
-## Load the model
-
-Outlines supports models available via vLLM's offline batched inference interface. You can load a model using:
+To load the model, use the `from_vllm` function. It takes either an `OpenAI` or an `AsyncOpenAI` instance from the `openai` library as argument. Depending on whether the client is synchronous or asynchronous, you will receive a `VLLM` or an `AsyncVLLM` model instance.
 
 ```python
+import openai
 import outlines
-from vllm import LLM
-
-model = outlines.from_vllm(LLM("microsoft/Phi-3-mini-4k-instruct"))
-```
-
-Models are loaded from the [HuggingFace hub](https://huggingface.co/).
 
+sync_openai_client = openai.OpenAI(base_url="...")
+async_openai_client = openai.AsyncOpenAI(base_url="...")
 
-!!! Warning "Device"
-
-    The default installation of vLLM only allows to load models on GPU. See the [installation instructions][vllm-install-cpu] to run models on CPU.
+sync_model = outlines.from_vllm(sync_openai_client)
+print(type(sync_model)) # <class 'outlines.models.vllm.VLLM'>
 
+async_model = outlines.from_vllm(async_openai_client)
+print(type(async_model)) # <class 'outlines.models.vllm.AsyncVLLM'>
+```
 
 ## Generate text
 
-To generate text, you can just call the model with a prompt as argument:
+To generate text, call the model with a prompt as argument and, optionally, an output type to use structured generation:
 
 ```python
-import outlines
-from vllm import LLM
-
-model = outlines.from_vllm(LLM("microsoft/Phi-3-mini-4k-instruct"))
-answer = model("Write a short story about a cat.")
-```
-
-You can also use structured generation with the `VLLM` model by providing an output type after the prompt:
-
-```python
-import outlines
-from vllm import LLM
 from pydantic import BaseModel
 
 class Character(BaseModel):
     name: str
 
-model = outlines.from_vllm(LLM("microsoft/Phi-3-mini-4k-instruct"))
-answer = model("Create a character.", output_type=Character)
+answer = sync_model("Create a character.", output_type=Character)
+answer = await async_model("Create a character.", output_type=Character)
 ```
 
-The VLLM model supports batch generation. To use it, you can pass a list of strings as prompt instead of a single string.
+The `VLLM` model also supports streaming.
 
-## Optional parameters
-
-When calling the model, you can provide optional parameters on top of the prompt and the output type. Those will be passed on to the `LLM.generate` method of the `vllm` library. An optional parameter of particular interest is `sampling_params`, which is an instance of `SamplingParams`. You can find more information about it in the [vLLM documentation][https://docs.vllm.ai/en/latest/api/inference_params.html].
+```python
+for chunk in sync_model.stream("Write a short story about a cat.", max_tokens=100):
+    print(chunk)
+```
 
-!!! Warning
+## Optional parameters
 
-    Streaming is not available for the offline vLLM integration.
+When calling the model, you can provide optional parameters on top of the prompt and the output type. Those will be passed on to the `openai` client. An optional parameter of particular interest is `extra_body`, a dictionary of arguments that are specific to vLLM and not part of the standard `openai` interface (see the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) on the OpenAI-compatible server for more information).
 
-[vllm-docs]:https://docs.vllm.ai/en/latest/
-[vllm-install-cpu]: https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html
-[vllm-install-rocm]: https://docs.vllm.ai/en/latest/getting_started/amd-installation.html
-[rocm-flash-attention]: https://github.com/ROCm/flash-attention/tree/flash_attention_for_rocm#amd-gpurocm-support
+[vllm-docs]: https://docs.vllm.ai/en/latest/
+[vllm-online-quickstart]: https://docs.vllm.ai/en/latest/getting_started/quickstart.html#quickstart-online
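
The new page shows synchronous streaming only; the asynchronous model exposes a `stream` method as well (see the `AsyncModel` base class later in this commit). The sketch below is illustrative: the `vllm serve` command, base URL, `api_key="EMPTY"`, and the served model name passed as a keyword argument are assumptions, not part of the diff.

```python
# Illustrative async streaming sketch; assumes a vLLM server was started
# separately, e.g. `vllm serve microsoft/Phi-3-mini-4k-instruct`.
import asyncio

import openai
import outlines


async def main():
    # Base URL and api_key are assumptions; vLLM does not check the key unless
    # the server was started with one.
    client = openai.AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    async_model = outlines.from_vllm(client)

    # Keyword arguments are forwarded to the openai client; the model name
    # below is assumed to match the model the server is serving.
    async for chunk in async_model.stream(
        "Write a short story about a cat.",
        model="microsoft/Phi-3-mini-4k-instruct",
        max_tokens=100,
    ):
        print(chunk, end="")


asyncio.run(main())
```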

docs/reference/models/vllm_offline.md (+68 −0)
@@ -0,0 +1,68 @@
+# vLLM Offline Inference mode
+
+
+!!! Note "Installation"
+
+    You need to install the `vllm` library to use the vLLM integration: `pip install vllm`. The default installation only works on machines with a GPU; follow the [installation section][vllm-install-cpu] for instructions to install vLLM for CPU or ROCm.
+
+Consult the [vLLM documentation][vllm-docs] for detailed information about how to initialize OpenAI clients and the available options.
+
+## Load the model
+
+Outlines supports models available via vLLM's offline batched inference interface. You can load a model using:
+
+```python
+import outlines
+from vllm import LLM
+
+model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))
+```
+
+Models are loaded from the [HuggingFace hub](https://huggingface.co/).
+
+
+!!! Warning "Device"
+
+    The default installation of vLLM only allows loading models on GPU. See the [installation instructions][vllm-install-cpu] to run models on CPU.
+
+
+## Generate text
+
+To generate text, you can just call the model with a prompt as argument:
+
+```python
+import outlines
+from vllm import LLM
+
+model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))
+answer = model("Write a short story about a cat.")
+```
+
+You can also use structured generation with the `VLLMOffline` model by providing an output type after the prompt:
+
+```python
+import outlines
+from vllm import LLM
+from pydantic import BaseModel
+
+class Character(BaseModel):
+    name: str
+
+model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))
+answer = model("Create a character.", output_type=Character)
+```
+
+The `VLLMOffline` model supports batch generation. To use it, you can pass a list of strings as prompt instead of a single string.
+
+## Optional parameters
+
+When calling the model, you can provide optional parameters on top of the prompt and the output type. Those will be passed on to the `LLM.generate` method of the `vllm` library. An optional parameter of particular interest is `sampling_params`, which is an instance of `SamplingParams`. You can find more information about it in the [vLLM documentation](https://docs.vllm.ai/en/latest/api/inference_params.html).
+
+!!! Warning
+
+    Streaming is not available for the offline vLLM integration.
+
+[vllm-docs]: https://docs.vllm.ai/en/latest/
+[vllm-install-cpu]: https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html
+[vllm-install-rocm]: https://docs.vllm.ai/en/latest/getting_started/amd-installation.html
+[rocm-flash-attention]: https://github.com/ROCm/flash-attention/tree/flash_attention_for_rocm#amd-gpurocm-support
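
The two features called out above, `sampling_params` and batch generation, can be combined in a single call. A minimal sketch, with illustrative parameter values:

```python
# Minimal sketch combining `sampling_params` with batch generation
# (a list of prompts); temperature and max_tokens values are illustrative.
import outlines
from vllm import LLM, SamplingParams

model = outlines.from_vllm_offline(LLM("microsoft/Phi-3-mini-4k-instruct"))

params = SamplingParams(temperature=0.7, max_tokens=100)

# One answer is returned per prompt in the list.
answers = model(
    ["Write a haiku about the sea.", "Write a haiku about the mountains."],
    sampling_params=params,
)
print(answers)
```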

examples/vllm_integration.py renamed to examples/vllm_offline_integration.py (+1 −1)
@@ -4,7 +4,7 @@
 from pydantic import BaseModel
 from transformers import AutoTokenizer
 
-from outlines.models.vllm import adapt_tokenizer
+from outlines.models.vllm_offline import adapt_tokenizer
 from outlines.processors import JSONLogitsProcessor
 

outlines/__init__.py (+1 −0)
@@ -19,6 +19,7 @@
     from_llamacpp,
     from_mlxlm,
     from_vllm,
+    from_vllm_offline,
 )
 from outlines.templates import Template, prompt
 from outlines.types import regex, json_schema, cfg

outlines/generator.py (+30 −12)
@@ -1,6 +1,7 @@
 from typing import Any, Optional, Union
 
 from outlines.models import BlackBoxModel, SteerableModel
+from outlines.models.base import AsyncModel
 from outlines.processors import (
     CFGLogitsProcessor,
     GuideLogitsProcessor,
@@ -13,23 +14,14 @@
 
 
 class BlackBoxGenerator:
-    """Represents a generator for which we don't control constrained generation.
-
-    This type of generator only accepts an output type as an argument defining
-    constrained generation. This output type is not modified and thus only
-    passed through to the model.
-    """
+    """Synchronous generator for which we don't control constrained generation."""
     output_type: Optional[Any]
 
     def __init__(self, model, output_type: Optional[Any]):
         self.model = model
         self.output_type = output_type
 
-        if isinstance(self.output_type, CFG):
-            raise NotImplementedError(
-                "CFG generation is not supported for API-based models"
-            )
-        elif isinstance(self.output_type, FSM):
+        if isinstance(self.output_type, FSM):
             raise NotImplementedError(
                 "FSM generation is not supported for API-based models"
             )
@@ -41,6 +33,29 @@ def stream(self, prompt, **inference_kwargs):
         return self.model.generate_stream(prompt, self.output_type, **inference_kwargs)
 
 
+class AsyncBlackBoxGenerator:
+    """Asynchronous generator for which we don't control constrained generation."""
+    output_type: Optional[Any]
+
+    def __init__(self, model, output_type: Optional[Any]):
+        self.model = model
+        self.output_type = output_type
+
+        if isinstance(self.output_type, FSM):
+            raise NotImplementedError(
+                "FSM generation is not supported for API-based models"
+            )
+
+    async def __call__(self, prompt, **inference_kwargs):
+        return await self.model.generate(prompt, self.output_type, **inference_kwargs)
+
+    async def stream(self, prompt, **inference_kwargs):
+        async for chunk in self.model.generate_stream( # pragma: no cover
+            prompt, self.output_type, **inference_kwargs
+        ):
+            yield chunk
+
+
 class SteerableGenerator:
     """Represents a generator for which we control constrained generation.
 
@@ -134,7 +149,10 @@ def Generator(
         if processor is not None:
            raise NotImplementedError("This model does not support logits processors")
         else:
-            return BlackBoxGenerator(model, output_type)
+            if isinstance(model, AsyncModel):
+                return AsyncBlackBoxGenerator(model, output_type)
+            else:
+                return BlackBoxGenerator(model, output_type)
     else:
         if processor is not None:
             return SteerableGenerator.from_processor(model, processor)

outlines/models/__init__.py (+5 −4)
@@ -22,11 +22,12 @@
     TransformersMultiModal,
     from_transformers,
 )
-from .vllm import VLLM, from_vllm
+from .vllm_offline import VLLMOffline, from_vllm_offline
+from .vllm import AsyncVLLM, VLLM, from_vllm
 
 LogitsGenerator = Union[
-    Transformers, LlamaCpp, OpenAI, MLXLM, VLLM, Ollama
+    Transformers, LlamaCpp, OpenAI, MLXLM, VLLMOffline, Ollama
 ]
 
-SteerableModel = Union[LlamaCpp, Transformers, MLXLM, VLLM]
-BlackBoxModel = Union[OpenAI, Anthropic, Gemini, Ollama, Dottxt]
+SteerableModel = Union[LlamaCpp, Transformers, MLXLM, VLLMOffline]
+BlackBoxModel = Union[OpenAI, Anthropic, Gemini, Ollama, Dottxt, AsyncVLLM, VLLM]

outlines/models/base.py (+86 −1)
@@ -38,7 +38,7 @@ def format_output_type(self, output_type):
 
 
 class Model(ABC):
-    """Base class for all models.
+    """Base class for all synchronous models.
 
     This class defines a shared `__call__` method that can be used to call the
     model directly.
@@ -116,3 +116,88 @@ def generate_stream(self, model_input, output_type=None, **inference_kwargs):
 
         """
         ...
+
+
+class AsyncModel(ABC):
+    """Base class for all asynchronous models.
+
+    This class defines a shared `__call__` method that can be used to call the
+    model directly.
+    All models inheriting from this class must define a `type_adapter`
+    attribute of type `ModelTypeAdapter`. The methods of the `type_adapter`
+    attribute are used in the `generate` method to format the input and output
+    types received by the model.
+    Additionally, local models must define a `tensor_library_name` attribute.
+
+    """
+    type_adapter: ModelTypeAdapter
+    tensor_library_name: str
+
+    async def __call__(self, model_input, output_type=None, **inference_kwargs):
+        """Call the model.
+
+        Users can call the model directly, in which case we will create a
+        generator instance with the output type provided and call it.
+        Thus, those commands are equivalent:
+        ```python
+        generator = Generator(model, Foo)
+        await generator("prompt")
+        ```
+        and
+        ```python
+        await model("prompt", Foo)
+        ```
+
+        """
+        from outlines import Generator
+
+        generator = Generator(self, output_type)
+        return await generator(model_input, **inference_kwargs)
+
+    async def stream(self, model_input, output_type=None, **inference_kwargs):
+        """Stream a response from the model.
+
+        Users can use the `stream` method from the model directly, in which
+        case we will create a generator instance with the output type provided
+        and then invoke its `stream` method.
+        Thus, those commands are equivalent:
+        ```python
+        generator = Generator(model, Foo)
+        async for chunk in generator("prompt"):
+            print(chunk)
+        ```
+        and
+        ```python
+        async for chunk in model.stream("prompt", Foo):
+            print(chunk)
+        ```
+
+        """
+        from outlines import Generator
+
+        generator = Generator(self, output_type)
+
+        async for chunk in generator.stream(model_input, **inference_kwargs): # pragma: no cover
+            yield chunk
+
+    @abstractmethod
+    async def generate(self, model_input, output_type=None, **inference_kwargs):
+        """Generate a response from the model.
+
+        The output_type argument contains a logits processor for local models
+        while it contains a type (Json, Enum...) for the API-based models.
+        This method is not intended to be used directly by end users.
+
+        """
+        ...
+
+    @abstractmethod
+    async def generate_stream(self, model_input, output_type=None, **inference_kwargs):
+        """Generate a stream of responses from the model.
+
+        The output_type argument contains a logits processor for local models
+        while it contains a type (Json, Enum...) for the API-based models.
+        This method is not intended to be used directly by end users.
+
+        """
+        ...
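
To make the `AsyncModel` contract concrete, here is a toy subclass. It is a sketch only: the echo logic and the `None` type adapter are placeholders, and the abstract methods are called directly purely for illustration (end users normally go through `__call__` and `stream`, which route via `Generator`).

```python
# Toy AsyncModel subclass; the echo backend and None type adapter are placeholders.
import asyncio

from outlines.models.base import AsyncModel


class EchoAsyncModel(AsyncModel):
    type_adapter = None  # a real model would supply a ModelTypeAdapter

    async def generate(self, model_input, output_type=None, **inference_kwargs):
        # A real backend would call its client here and honor output_type.
        return f"echo: {model_input}"

    async def generate_stream(self, model_input, output_type=None, **inference_kwargs):
        # A real backend would yield chunks as they arrive from the client.
        for token in f"echo: {model_input}".split():
            yield token


async def main():
    model = EchoAsyncModel()
    print(await model.generate("hello world"))
    async for token in model.generate_stream("hello world"):
        print(token)


asyncio.run(main())
```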
