[example] Add vLLM OpenAI-server example #678

Open · wants to merge 1 commit into base: main
29 changes: 29 additions & 0 deletions examples/vllm_openai_example/README.md
@@ -0,0 +1,29 @@
# vllm_openai_example

This is an example of an OpenAI-compatible server built on vLLM and FastAPI, showcasing integration with Langfuse for event tracing around response generation.

1. **Shutdown Behavior**: The application defines shutdown logic using FastAPI's lifespan feature. On shutdown, it flushes all pending events to Langfuse so that no traces are lost.

2. **Endpoints**:
- `/health`: Returns healthcheck status
- `/v1/models`: Shows available models
- `/version`: Shows current version of vLLM
- `/v1/chat/completions`: Generates chat completions and traces both non-streaming and streaming responses to Langfuse
- `/v1/completions`: Generates completions and traces both non-streaming and streaming responses to Langfuse
- More info can be found [here](https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/openai)

3. **Integration**:
- Langfuse: used for event tracing via the `@observe` decorator. The `transform_to_string` parameter is used to correctly turn streamed (async generator) results into a single string (see the sketch at the end of this README).

4. **Dependencies**:
- Langfuse: Library for event tracing and management.
- vLLM: a fast and easy-to-use library for LLM inference and serving.

5. **Usage**:
- Preparation: Ensure that you meet the requirements specified in the [vLLM repo](https://github.com/vllm-project/vllm) to set up the runtime environment.
- Starting the Server: Navigate to the project directory `langfuse-python/examples/vllm_openai_example` and run `python3 api_server.py --model <MODEL_PATH> --langfuse-public-key <PBK> --langfuse-secret-key <SCK> --langfuse-host <URL>`.
- Access the endpoints at `http://localhost:8000` (a request example follows this list).
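
Since the server implements the standard OpenAI-compatible request format, any HTTP client can be used to query it once it is running. The snippet below is a minimal sketch (not part of this example's code) using the `requests` library; `<MODEL_NAME>` is a placeholder and must match the model the server was started with.

```python
# Minimal request sketch; assumes the server is running locally on port 8000
# and that "<MODEL_NAME>" is replaced with the model passed via --model or
# --served-model-name.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "<MODEL_NAME>",
        "messages": [{"role": "user", "content": "Say hello"}],
        "stream": False,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

Each request handled by `/v1/chat/completions` or `/v1/completions` should then appear as a trace in the configured Langfuse project.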

Consider checking `api_server.py`, `cli_args.py`, `serving_chat.py`, and `serving_completion.py`, as they are modified versions of vLLM code.
Also check `langfuse_utils.py`, as it contains the main logic for correctly processing `stream=True` requests from the API.
For more details on vLLM and Langfuse refer to their respective documentation.
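
As a rough illustration of the integration described above, the following sketch (not part of this example's code) shows how the Langfuse `@observe` decorator with `transform_to_string` can wrap a streaming generator so that the yielded chunks are recorded as a single string. `fake_token_stream` and `join_chunks` are hypothetical stand-ins for the vLLM streaming logic in `langfuse_utils.py`, and the Langfuse credentials are assumed to be set via the `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and `LANGFUSE_HOST` environment variables.

```python
import asyncio

from langfuse.decorators import observe


def join_chunks(chunks) -> str:
    # Called by Langfuse with all yielded chunks so the trace output is
    # stored as one string instead of a list of tokens.
    return "".join(chunks)


@observe(transform_to_string=join_chunks)
async def fake_token_stream(prompt: str):
    # Hypothetical stand-in for vLLM's token-by-token streaming generator.
    for token in ("Hello", ", ", "world", "!"):
        await asyncio.sleep(0)
        yield token


async def main():
    async for token in fake_token_stream("Say hello"):
        print(token, end="")
    print()


if __name__ == "__main__":
    asyncio.run(main())
```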
3,179 changes: 3,179 additions & 0 deletions examples/vllm_openai_example/poetry.lock

Large diffs are not rendered by default.

16 changes: 16 additions & 0 deletions examples/vllm_openai_example/pyproject.toml
@@ -0,0 +1,16 @@
[tool.poetry]
name = "vllm-openai-example"
version = "0.1.0"
description = "This example shows how to integrate Langfuse into a vLLM OpenAI-compatible server"
authors = ["dimaioksha <[email protected]>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
vllm = "^0.4.0"
langfuse = "^2.13.3"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Empty file.
192 changes: 192 additions & 0 deletions examples/vllm_openai_example/vllm_openai_example/api_server.py
@@ -0,0 +1,192 @@
# Adapted from
# https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py
import asyncio
import importlib
import inspect
import os
from contextlib import asynccontextmanager
from http import HTTPStatus

import uvicorn
import vllm
from fastapi import Request, FastAPI
from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse, Response, StreamingResponse
from prometheus_client import make_asgi_app
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.logger import init_logger
from vllm.usage.usage_lib import UsageContext
from langfuse.utils.langfuse_singleton import LangfuseSingleton

from cli_args import make_arg_parser
from protocol import (
ChatCompletionRequest,
ChatCompletionResponse,
CompletionRequest,
ErrorResponse,
)
from serving_chat import OpenAIServingChat
from serving_completion import OpenAIServingCompletion

TIMEOUT_KEEP_ALIVE = 5 # seconds

openai_serving_chat: OpenAIServingChat
openai_serving_completion: OpenAIServingCompletion
logger = init_logger(__name__)


@asynccontextmanager
async def lifespan(app: FastAPI):
async def _force_log():
while True:
await asyncio.sleep(10)
await engine.do_log_stats()

if not engine_args.disable_log_stats:
asyncio.create_task(_force_log())

yield

# Flush all events to be sent to Langfuse on shutdown and terminate all Threads gracefully.
# This operation is blocking.
LangfuseSingleton().get().flush() # IMPORTANT if you do not want to lose logs


app = FastAPI(lifespan=lifespan)


def parse_args():
parser = make_arg_parser()
return parser.parse_args()


# Add prometheus asgi middleware to route /metrics requests
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)


@app.exception_handler(RequestValidationError)
async def validation_exception_handler(_, exc):
err = openai_serving_chat.create_error_response(message=str(exc))
return JSONResponse(err.model_dump(), status_code=HTTPStatus.BAD_REQUEST)


@app.get("/health")
async def health() -> Response:
"""Health check."""
await openai_serving_chat.engine.check_health()
return Response(status_code=200)


@app.get("/v1/models")
async def show_available_models():
models = await openai_serving_chat.show_available_models()
return JSONResponse(content=models.model_dump())


@app.get("/version")
async def show_version():
ver = {"version": vllm.__version__}
return JSONResponse(content=ver)


@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest, raw_request: Request):
generator = await openai_serving_chat.create_chat_completion(request, raw_request)
if isinstance(generator, ErrorResponse):
return JSONResponse(content=generator.model_dump(), status_code=generator.code)
if request.stream:
return StreamingResponse(content=generator, media_type="text/event-stream")
else:
assert isinstance(generator, ChatCompletionResponse)
return JSONResponse(content=generator.model_dump())


@app.post("/v1/completions")
async def create_completion(request: CompletionRequest, raw_request: Request):
generator = await openai_serving_completion.create_completion(request, raw_request)
if isinstance(generator, ErrorResponse):
return JSONResponse(content=generator.model_dump(), status_code=generator.code)
if request.stream:
return StreamingResponse(content=generator, media_type="text/event-stream")
else:
return JSONResponse(content=generator.model_dump())


if __name__ == "__main__":
args = parse_args()

app.add_middleware(
CORSMiddleware,
allow_origins=args.allowed_origins,
allow_credentials=args.allow_credentials,
allow_methods=args.allowed_methods,
allow_headers=args.allowed_headers,
)

    # Initialize the Langfuse singleton when starting the OpenAI server
langfuse_client = LangfuseSingleton().get(
secret_key=args.langfuse_secret_key,
public_key=args.langfuse_public_key,
host=args.langfuse_host,
)

if token := os.environ.get("VLLM_API_KEY") or args.api_key:

@app.middleware("http")
async def authentication(request: Request, call_next):
root_path = "" if args.root_path is None else args.root_path
if not request.url.path.startswith(f"{root_path}/v1"):
return await call_next(request)
if request.headers.get("Authorization") != "Bearer " + token:
return JSONResponse(content={"error": "Unauthorized"}, status_code=401)
return await call_next(request)

for middleware in args.middleware:
module_path, object_name = middleware.rsplit(".", 1)
imported = getattr(importlib.import_module(module_path), object_name)
if inspect.isclass(imported):
app.add_middleware(imported)
elif inspect.iscoroutinefunction(imported):
app.middleware("http")(imported)
else:
raise ValueError(
f"Invalid middleware {middleware}. " f"Must be a function or a class."
)

logger.info(f"vLLM API server version {vllm.__version__}")
logger.info(f"args: {args}")

if args.served_model_name is not None:
served_model_names = args.served_model_name
else:
served_model_names = [args.model]
engine_args = AsyncEngineArgs.from_cli_args(args)
engine = AsyncLLMEngine.from_engine_args(
engine_args, usage_context=UsageContext.OPENAI_API_SERVER
)
openai_serving_chat = OpenAIServingChat(
engine,
served_model_names,
args.response_role,
args.lora_modules,
args.chat_template,
)
openai_serving_completion = OpenAIServingCompletion(
engine, served_model_names, args.lora_modules
)

app.root_path = args.root_path
uvicorn.run(
app,
host=args.host,
port=args.port,
log_level=args.uvicorn_log_level,
timeout_keep_alive=TIMEOUT_KEEP_ALIVE,
ssl_keyfile=args.ssl_keyfile,
ssl_certfile=args.ssl_certfile,
ssl_ca_certs=args.ssl_ca_certs,
ssl_cert_reqs=args.ssl_cert_reqs,
)
152 changes: 152 additions & 0 deletions examples/vllm_openai_example/vllm_openai_example/cli_args.py
@@ -0,0 +1,152 @@
# Adapted from
# https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/cli_args.py
"""
This file contains the command line arguments for the vLLM's
OpenAI-compatible server. It is kept in a separate file for documentation
purposes.
"""

import argparse
import json
import ssl

from vllm.engine.arg_utils import AsyncEngineArgs
from serving_engine import LoRA


class LoRAParserAction(argparse.Action):
def __call__(self, parser, namespace, values, option_string=None):
lora_list = []
for item in values:
name, path = item.split("=")
lora_list.append(LoRA(name, path))
setattr(namespace, self.dest, lora_list)


def make_arg_parser():
parser = argparse.ArgumentParser(
description="vLLM OpenAI-Compatible RESTful API server."
)
parser.add_argument("--host", type=str, default=None, help="host name")
parser.add_argument("--port", type=int, default=8000, help="port number")
parser.add_argument(
"--uvicorn-log-level",
type=str,
default="info",
choices=["debug", "info", "warning", "error", "critical", "trace"],
help="log level for uvicorn",
)
parser.add_argument(
"--allow-credentials", action="store_true", help="allow credentials"
)
parser.add_argument(
"--allowed-origins", type=json.loads, default=["*"], help="allowed origins"
)
parser.add_argument(
"--allowed-methods", type=json.loads, default=["*"], help="allowed methods"
)
parser.add_argument(
"--allowed-headers", type=json.loads, default=["*"], help="allowed headers"
)
parser.add_argument(
"--api-key",
type=str,
default=None,
help="If provided, the server will require this key "
"to be presented in the header.",
)
parser.add_argument(
"--served-model-name",
nargs="+",
type=str,
default=None,
help="The model name(s) used in the API. If multiple "
"names are provided, the server will respond to any "
"of the provided names. The model name in the model "
"field of a response will be the first name in this "
"list. If not specified, the model name will be the "
"same as the `--model` argument.",
)
parser.add_argument(
"--lora-modules",
type=str,
default=None,
nargs="+",
action=LoRAParserAction,
help="LoRA module configurations in the format name=path. "
"Multiple modules can be specified.",
)
parser.add_argument(
"--chat-template",
type=str,
default=None,
help="The file path to the chat template, "
"or the template in single-line form "
"for the specified model",
)
parser.add_argument(
"--response-role",
type=str,
default="assistant",
help="The role name to return if " "`request.add_generation_prompt=true`.",
)
parser.add_argument(
"--ssl-keyfile",
type=str,
default=None,
help="The file path to the SSL key file",
)
parser.add_argument(
"--ssl-certfile",
type=str,
default=None,
help="The file path to the SSL cert file",
)
parser.add_argument(
"--ssl-ca-certs", type=str, default=None, help="The CA certificates file"
)
parser.add_argument(
"--ssl-cert-reqs",
type=int,
default=int(ssl.CERT_NONE),
help="Whether client certificate is required (see stdlib ssl module's)",
)
parser.add_argument(
"--root-path",
type=str,
default=None,
help="FastAPI root_path when app is behind a path based routing proxy",
)
parser.add_argument(
"--middleware",
type=str,
action="append",
default=[],
help="Additional ASGI middleware to apply to the app. "
"We accept multiple --middleware arguments. "
"The value should be an import path. "
"If a function is provided, vLLM will add it to the server "
"using @app.middleware('http'). "
"If a class is provided, vLLM will add it to the server "
"using app.add_middleware(). ",
)

    # Add Langfuse-specific arguments used to initialize the client at server startup
parser.add_argument(
"--langfuse-secret-key",
type=str,
help="Langfuse private key for tracking experiments",
)
parser.add_argument(
"--langfuse-public-key",
type=str,
help="Langfuse public key for tracking experiments",
)
parser.add_argument(
"--langfuse-host",
type=str,
help="Langfuse host in format 'http://{SOME_DOMAIN}:{SOME_PORT}'",
)

parser = AsyncEngineArgs.add_cli_args(parser)
return parser