Lazarus ships a built-in OpenAI-compatible HTTP inference server. Any tool that speaks the OpenAI API — mcp-cli, LangChain, the openai Python SDK, curl — works against it without modification.
The server requires extra dependencies not included in the base install:
uv add "chuk-lazarus[server]"
# or
pip install "chuk-lazarus[server]"

This adds fastapi, uvicorn, and httpx.
# Start the server with any supported model
lazarus serve --model google/gemma-3-4b-it
# Or use the dedicated standalone script
lazarus-serve --model google/gemma-3-4b-it

The server loads the model once, then serves all requests from it. On first run, the model is downloaded from HuggingFace Hub and cached locally.
Loading model: google/gemma-3-4b-it
============================================================
...
============================================================
Lazarus inference server ready
Model : google/gemma-3-4b-it
Protocols : openai
Base URL : http://0.0.0.0:8080
OpenAI URL: http://0.0.0.0:8080/v1
============================================================
lazarus serve [OPTIONS]
lazarus-serve [OPTIONS]

| Option | Default | Description |
|---|---|---|
| --model / -m | required | HuggingFace model ID or local path |
| --host | 0.0.0.0 | Bind address |
| --port / -p | 8080 | Port |
| --protocols | openai | Comma-separated: openai, ollama, anthropic |
| --api-key | None | Bearer token; if set, all requests must include Authorization: Bearer <key> |
| --max-tokens | 512 | Default max_tokens when callers do not specify one |
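
When `--api-key` is set, every request has to carry the bearer header described in the table. The sketch below builds such a request with Python's standard library only. The helper name `build_request` is hypothetical (it is not part of Lazarus), and the URL and key simply mirror the examples in this document.

```python
import json
import urllib.request


def build_request(base_url, path, api_key=None, payload=None):
    """Build (but do not send) an HTTP request for the server.

    Hypothetical helper for illustration; not part of Lazarus.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Header format taken from the --api-key row of the options table.
        headers["Authorization"] = f"Bearer {api_key}"
    # A JSON body turns the request into a POST; without one it is a GET.
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(base_url.rstrip("/") + path, data=data, headers=headers)


req = build_request("http://localhost:8080", "/v1/models", api_key="mysecret")
print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
```

Passing the result to `urllib.request.urlopen(req)` would send it to a running server.
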
# With authentication
lazarus-serve --model google/gemma-3-1b-it --api-key mysecret
# Different port, multiple protocols (once implemented)
lazarus-serve --model google/gemma-3-4b-it --port 9000 --protocols openai,ollama
# Smaller model, higher token limit
lazarus-serve --model google/gemma-3-1b-it --max-tokens 2048

GET /health
{
"status": "ok",
"model": "google/gemma-3-4b-it",
"protocols": ["openai"]
}

GET /v1/models
{
"object": "list",
"data": [
{ "id": "google/gemma-3-4b-it", "object": "model", "owned_by": "lazarus" }
]
}

POST /v1/chat/completions

Full OpenAI ChatCompletion schema, including:

- messages: system, user, assistant, and tool roles
- stream: true for Server-Sent Events streaming
- tools: function definitions for tool calling
- max_tokens, temperature, top_p, stop
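
The fields listed above can be assembled into a request body as in the following sketch. `build_chat_body` is a hypothetical helper, not a Lazarus API. It leaves out any optional field the caller did not set, so that, for example, an omitted `max_tokens` falls back to the server's `--max-tokens` default.

```python
import json


def build_chat_body(model, messages, *, stream=False, tools=None,
                    max_tokens=None, temperature=None, top_p=None, stop=None):
    """Assemble a /v1/chat/completions body, skipping unset optional fields."""
    body = {"model": model, "messages": messages}
    if stream:
        body["stream"] = True
    optional = {
        "tools": tools,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stop": stop,
    }
    # Only include the parameters the caller actually provided.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


body = build_chat_body(
    "google/gemma-3-4b-it",
    [
        {"role": "system", "content": "Answer briefly."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.2,
    stop=["\n\n"],
)
print(json.dumps(body, indent=2))
```
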
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-3-4b-it",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}'

{
"id": "chatcmpl-a1b2c3d4",
"object": "chat.completion",
"created": 1750000000,
"model": "google/gemma-3-4b-it",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "The capital of France is Paris."},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 14, "completion_tokens": 9, "total_tokens": 23}
}

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "google/gemma-3-4b-it", "messages": [...], "stream": true}'

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant","content":""},...}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"The capital"},...}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":" of France"},...}]}
...
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]
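
A client that does not use an SDK can consume the stream shown above by reading `data:` lines until the `[DONE]` sentinel. The following is a minimal sketch that works on any iterable of text lines. The function name `iter_sse_content` and the sample chunks are illustrative; real chunks carry more fields than shown here.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text deltas from OpenAI-style chat.completion.chunk events."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"The capital"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" of France"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))  # prints: The capital of France
```
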
The server supports OpenAI-style function calling. Tool definitions are injected into the model's chat template via tokenizer.apply_chat_template(..., tools=[...]). The model's <tool_call> output blocks are parsed and returned in OpenAI format.
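
To make the parsing step concrete, here is a rough sketch of turning `<tool_call>` blocks into OpenAI-style `tool_calls`. It assumes each block wraps a JSON object with `name` and `arguments` keys; that payload shape is an assumption for illustration, the real one depends on the model's chat template, and the server's actual parser is not shown in this document. `parse_tool_calls` is a hypothetical name.

```python
import json
import re
import uuid

# Assumed block shape: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return (remaining_content, tool_calls) in OpenAI message format."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # ignore blocks that are not valid JSON
        if not isinstance(payload, dict) or "name" not in payload:
            continue
        calls.append({
            "id": f"call_{uuid.uuid4().hex[:6]}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI encodes arguments as a JSON string, not an object.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    # Text outside the blocks stays as content; None when nothing is left.
    content = TOOL_CALL_RE.sub("", text).strip() or None
    return content, calls


content, calls = parse_tool_calls(
    '<tool_call>{"name": "get_current_time", "arguments": {}}</tool_call>'
)
print(content, calls[0]["function"])
```
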
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-3-4b-it",
"messages": [{"role": "user", "content": "What time is it?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_current_time",
"description": "Get the current time",
"parameters": {
"type": "object",
"properties": {},
"required": []
}
}
}]
}'

When the model calls a tool the response has finish_reason: "tool_calls":
{
"choices": [{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [{
"id": "call_a1b2c3",
"type": "function",
"function": {"name": "get_current_time", "arguments": "{}"}
}]
},
"finish_reason": "tool_calls"
}]
}

Send the tool result back as a tool role message:
{
"messages": [
{"role": "user", "content": "What time is it?"},
{"role": "assistant", "content": null, "tool_calls": [{"id": "call_a1b2c3", ...}]},
{"role": "tool", "tool_call_id": "call_a1b2c3", "content": "14:32 UTC"}
]
}

mcp-cli connects to the Lazarus server as an OpenAI-compatible provider:
# Start the server
lazarus-serve --model google/gemma-3-4b-it --api-key lazarus
# In another terminal, start mcp-cli
mcp-cli chat \
--provider lazarus \
--server time \
--model google/gemma-3-4b-it
# With dashboard
mcp-cli chat \
--provider lazarus \
--server time \
--model google/gemma-3-4b-it \
--dashboard

mcp-cli discovers the lazarus provider from its config. The provider entry points to http://localhost:8080/v1 with the model name matching what the server is serving.
Because the server is fully OpenAI-compatible, the openai package works directly:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="lazarus",  # must match --api-key; any non-empty string if auth is disabled
)

# Non-streaming
response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
)
print(response.choices[0].message.content)

# Streaming: print each text delta as it arrives
stream = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Count to ten."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

The server uses a layered protocol router design:
HTTP request
│
▼
FastAPI app (app.py)
│
├── /v1/* → OpenAI router (routers/openai.py)
├── /api/* → Ollama router (routers/ollama.py) ← TODO
└── /v1/* → Anthropic router (routers/anthropic.py) ← TODO
│
▼
ModelEngine (engine.py) ← format-agnostic
│
├── agenerate() → InternalResponse
└── astream() → AsyncIterator[InternalChunk]
│
▼
UnifiedPipeline (inference/)
Each router translates its wire format to InternalRequest, calls the engine, and translates InternalResponse back. The engine knows nothing about protocols.
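
The translate, call, translate-back pattern can be sketched as follows. The class names `InternalRequest` and `InternalResponse` and the method `agenerate()` come from the diagram above, but their fields, the `StubEngine`, and the two translation functions are assumptions made for illustration. This is not the code in routers/openai.py.

```python
import asyncio
import time
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class InternalRequest:  # field names are assumed, not the real definition
    messages: list
    max_tokens: int
    temperature: Optional[float] = None


@dataclass
class InternalResponse:  # field names are assumed, not the real definition
    text: str
    finish_reason: str = "stop"
    prompt_tokens: int = 0
    completion_tokens: int = 0


class StubEngine:
    """Stand-in for ModelEngine: echoes the last message, loads no model."""

    async def agenerate(self, request: InternalRequest) -> InternalResponse:
        last = request.messages[-1]["content"]
        return InternalResponse(text=f"echo: {last}")


def openai_to_internal(body: dict, default_max_tokens: int = 512) -> InternalRequest:
    """Wire format -> protocol-agnostic request."""
    return InternalRequest(
        messages=body["messages"],
        max_tokens=body.get("max_tokens") or default_max_tokens,
        temperature=body.get("temperature"),
    )


def internal_to_openai(response: InternalResponse, model: str) -> dict:
    """Protocol-agnostic response -> OpenAI chat.completion object."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:8]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": response.text},
            "finish_reason": response.finish_reason,
        }],
        "usage": {
            "prompt_tokens": response.prompt_tokens,
            "completion_tokens": response.completion_tokens,
            "total_tokens": response.prompt_tokens + response.completion_tokens,
        },
    }


async def chat_completions(engine, body: dict) -> dict:
    # The engine only ever sees InternalRequest, never the OpenAI body.
    response = await engine.agenerate(openai_to_internal(body))
    return internal_to_openai(response, body["model"])


result = asyncio.run(chat_completions(
    StubEngine(),
    {"model": "google/gemma-3-4b-it", "messages": [{"role": "user", "content": "hi"}]},
))
print(result["choices"][0]["message"]["content"])  # prints: echo: hi
```

Under this layout, supporting another protocol would mean writing a second pair of translation functions while the engine stays unchanged.
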
| Protocol | Status | Endpoints |
|---|---|---|
| OpenAI | Implemented | GET /v1/models, POST /v1/chat/completions |
| Ollama | Planned | GET /api/tags, POST /api/chat, POST /api/generate |
| Anthropic | Planned | POST /v1/messages |
Enable multiple protocols when they are implemented:
lazarus-serve --model google/gemma-3-1b-it --protocols openai,ollama,anthropic

You can embed the server in your own application:
import asyncio

import uvicorn

from chuk_lazarus.server import ModelEngine, Protocol, create_app


async def main():
    # Load the model once; the same engine serves every request.
    engine = await ModelEngine.load("google/gemma-3-1b-it")
    app = create_app(
        engine,
        protocols=[Protocol.OPENAI],
        api_key="secret",
        default_max_tokens=1024,
    )
    # Run uvicorn inside the existing event loop instead of uvicorn.run().
    config = uvicorn.Config(app, host="0.0.0.0", port=8080)
    server = uvicorn.Server(config)
    await server.serve()


asyncio.run(main())

- Client Library: Python client for the server
- Inference Guide: Direct pipeline usage without HTTP
- CLI Reference: lazarus serve command details