Inference Server

Lazarus ships a built-in OpenAI-compatible HTTP inference server. Any tool that speaks the OpenAI API — mcp-cli, LangChain, the openai Python SDK, curl — works against it without modification.

Installation

The server requires extra dependencies not included in the base install:

uv add "chuk-lazarus[server]"
# or
pip install "chuk-lazarus[server]"

This adds fastapi, uvicorn, and httpx.

Quick Start

# Start the server with any supported model
lazarus serve --model google/gemma-3-4b-it

# Or use the dedicated standalone script
lazarus-serve --model google/gemma-3-4b-it

The server loads the model once, then serves all requests from it. On first run, the model is downloaded from HuggingFace Hub and cached locally.

Loading model: google/gemma-3-4b-it
============================================================
...
============================================================
Lazarus inference server ready
  Model     : google/gemma-3-4b-it
  Protocols : openai
  Base URL  : http://0.0.0.0:8080
  OpenAI URL: http://0.0.0.0:8080/v1
============================================================

CLI Options

lazarus serve [OPTIONS]
lazarus-serve [OPTIONS]

Option	Default	Description
`--model` / `-m`	required	HuggingFace model ID or local path
`--host`	`0.0.0.0`	Bind address
`--port` / `-p`	`8080`	Port
`--protocols`	`openai`	Comma-separated: `openai`, `ollama`, `anthropic`
`--api-key`	None	Bearer token — if set all requests must include `Authorization: Bearer <key>`
`--max-tokens`	`512`	Default `max_tokens` when callers do not specify one

# With authentication
lazarus-serve --model google/gemma-3-1b-it --api-key mysecret

# Different port, multiple protocols (once implemented)
lazarus-serve --model google/gemma-3-4b-it --port 9000 --protocols openai,ollama

# Smaller model, higher token limit
lazarus-serve --model google/gemma-3-1b-it --max-tokens 2048

Endpoints

Health

GET /health

{
  "status": "ok",
  "model": "google/gemma-3-4b-it",
  "protocols": ["openai"]
}

OpenAI — List Models

GET /v1/models

{
  "object": "list",
  "data": [
    { "id": "google/gemma-3-4b-it", "object": "model", "owned_by": "lazarus" }
  ]
}

OpenAI — Chat Completions

POST /v1/chat/completions

Full OpenAI ChatCompletion schema, including:

messages — system, user, assistant, and tool roles
stream — true for Server-Sent Events streaming
tools — function definitions for tool calling
max_tokens, temperature, top_p, stop

Non-streaming

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-4b-it",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

{
  "id": "chatcmpl-a1b2c3d4",
  "object": "chat.completion",
  "created": 1750000000,
  "model": "google/gemma-3-4b-it",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The capital of France is Paris."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 14, "completion_tokens": 9, "total_tokens": 23}
}

Streaming

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-4b-it", "messages": [...], "stream": true}'

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant","content":""},...}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"The capital"},...}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":" of France"},...}]}
...
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]

Tool Calling

The server supports OpenAI-style function calling. Tool definitions are injected into the model's chat template via tokenizer.apply_chat_template(..., tools=[...]). The model's <tool_call> output blocks are parsed and returned in OpenAI format.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-4b-it",
    "messages": [{"role": "user", "content": "What time is it?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_current_time",
        "description": "Get the current time",
        "parameters": {
          "type": "object",
          "properties": {},
          "required": []
        }
      }
    }]
  }'

When the model calls a tool the response has finish_reason: "tool_calls":

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_a1b2c3",
        "type": "function",
        "function": {"name": "get_current_time", "arguments": "{}"}
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Send the tool result back as a tool role message:

{
  "messages": [
    {"role": "user", "content": "What time is it?"},
    {"role": "assistant", "content": null, "tool_calls": [{"id": "call_a1b2c3", ...}]},
    {"role": "tool", "tool_call_id": "call_a1b2c3", "content": "14:32 UTC"}
  ]
}

mcp-cli Integration

mcp-cli connects to the Lazarus server as an OpenAI-compatible provider:

# Start the server
lazarus-serve --model google/gemma-3-4b-it --api-key lazarus

# In another terminal, start mcp-cli
mcp-cli chat \
  --provider lazarus \
  --server time \
  --model google/gemma-3-4b-it

# With dashboard
mcp-cli chat \
  --provider lazarus \
  --server time \
  --model google/gemma-3-4b-it \
  --dashboard

mcp-cli discovers the lazarus provider from its config. The provider entry points to http://localhost:8080/v1 with the model name matching what the server is serving.

Using the openai Python SDK

Because the server is fully OpenAI-compatible, the openai package works directly:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="lazarus",  # any non-empty string if auth is disabled
)

# Non-streaming
response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Count to ten."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Protocol Architecture

The server uses a layered protocol router design:

HTTP request
    │
    ▼
FastAPI app (app.py)
    │
    ├── /v1/*  →  OpenAI router   (routers/openai.py)
    ├── /api/* →  Ollama router   (routers/ollama.py)   ← TODO
    └── /v1/*  →  Anthropic router (routers/anthropic.py) ← TODO
         │
         ▼
    ModelEngine (engine.py)          ← format-agnostic
         │
         ├── agenerate()   → InternalResponse
         └── astream()     → AsyncIterator[InternalChunk]
              │
              ▼
    UnifiedPipeline (inference/)

Each router translates its wire format to InternalRequest, calls the engine, and translates InternalResponse back. The engine knows nothing about protocols.

Planned Protocols

Protocol	Status	Endpoints
OpenAI	Implemented	`GET /v1/models`, `POST /v1/chat/completions`
Ollama	Planned	`GET /api/tags`, `POST /api/chat`, `POST /api/generate`
Anthropic	Planned	`POST /v1/messages`

Enable multiple protocols when they are implemented:

lazarus-serve --model gemma-3-1b-it --protocols openai,ollama,anthropic

Programmatic Usage

You can embed the server in your own application:

import asyncio
from chuk_lazarus.server import ModelEngine, Protocol, create_app
import uvicorn

async def main():
    engine = await ModelEngine.load("google/gemma-3-1b-it")
    app = create_app(
        engine,
        protocols=[Protocol.OPENAI],
        api_key="secret",
        default_max_tokens=1024,
    )
    config = uvicorn.Config(app, host="0.0.0.0", port=8080)
    server = uvicorn.Server(config)
    await server.serve()

asyncio.run(main())

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inference Server

Installation

Quick Start

CLI Options

Endpoints

Health

OpenAI — List Models

OpenAI — Chat Completions

Non-streaming

Streaming

Tool Calling

mcp-cli Integration

Using the openai Python SDK

Protocol Architecture

Planned Protocols

Programmatic Usage

See Also

FilesExpand file tree

server.md

Latest commit

History

server.md

File metadata and controls

Inference Server

Installation

Quick Start

CLI Options

Endpoints

Health

OpenAI — List Models

OpenAI — Chat Completions

Non-streaming

Streaming

Tool Calling

mcp-cli Integration

Using the openai Python SDK

Protocol Architecture

Planned Protocols

Programmatic Usage

See Also