Inference Gateway

An OpenAI-compatible API gateway written in Go that sits between clients and vLLM inference engine instances. It provides a model registry, request routing, weighted round-robin load balancing with health checks, API key authentication, rate limiting (RPM + TPM), SSE streaming support, and an admin API for runtime management.

Clients ──> [ Auth + Rate Limit ] ──> [ Route + LB ] ──> [ Proxy + Transform ] ──> vLLM backends
                                            |
                                   [ PostgreSQL + Redis ]

Features

OpenAI-compatible API -- drop-in replacement for /v1/chat/completions and /v1/models
Model registry -- register models with aliases, map to actual vLLM model IDs
Weighted round-robin load balancing -- distribute traffic across multiple vLLM backends per model
Active + passive health checks -- automatic backend failover
API key authentication -- per-key model access control, RPM/TPM limits, expiration
Rate limiting -- sliding window RPM + TPM via Redis sorted sets, fail-open on Redis errors
SSE streaming -- zero-buffering proxy with live TPM tracking via continuous_usage_stats
Request transforms -- default params injection, reasoning config override, system prompt prefix
Admin API -- full CRUD for models, backends, and API keys at runtime
Prometheus metrics -- TTFT, ITL, tokens/s, backend health, active streams, rate limit hits
Structured logging -- JSON request logs with request ID, model, key prefix, backend URL, duration
Redis-backed cache -- shared across gateway pods, periodic refresh + immediate invalidation on admin mutations
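
As an illustration of the sliding-window idea behind the rate limiting bullet, here is a minimal in-memory sketch. It is not the gateway's implementation, which per the list above uses Redis sorted sets. The `slidingWindow` type and the `simulate` helper are hypothetical names; the slice of timestamps plays the role a sorted set would play, with old entries evicted before counting.

```go
package main

import (
	"fmt"
	"time"
)

// slidingWindow is a hypothetical in-memory stand-in for one sorted set:
// it keeps the timestamps of admitted requests that fall inside the window.
type slidingWindow struct {
	limit  int
	window time.Duration
	events []time.Time // ascending by time
}

// Allow evicts timestamps at or before now-window, then admits the request
// only if fewer than limit events remain.
func (s *slidingWindow) Allow(now time.Time) bool {
	cutoff := now.Add(-s.window)
	i := 0
	for i < len(s.events) && !s.events[i].After(cutoff) {
		i++
	}
	s.events = s.events[i:]
	if len(s.events) >= s.limit {
		return false
	}
	s.events = append(s.events, now)
	return true
}

// simulate replays requests at the given second offsets and reports
// which ones were admitted.
func simulate(limit, windowSec int, offsetsSec []int) []bool {
	rl := &slidingWindow{limit: limit, window: time.Duration(windowSec) * time.Second}
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	out := make([]bool, 0, len(offsetsSec))
	for _, off := range offsetsSec {
		out = append(out, rl.Allow(start.Add(time.Duration(off)*time.Second)))
	}
	return out
}

func main() {
	// Limit of 2 per 60s: the third request is rejected, and the fourth is
	// admitted because the first has aged out of the window by then.
	fmt.Println(simulate(2, 60, []int{0, 10, 20, 61})) // prints [true true false true]
}
```

Unlike a fixed-window counter, this approach never lets a burst straddle a window boundary, at the cost of storing one entry per admitted request.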

Quick Start

Prerequisites

Go 1.25+
Docker and Docker Compose
Make (optional)

1. Start infrastructure

docker compose up -d postgres redis

2. Run database migrations

export DATABASE_URL="postgres://gateway:gateway@localhost:5432/gateway?sslmode=disable"
go run ./cmd/gateway migrate up

3. Start the gateway

export DATABASE_URL="postgres://gateway:gateway@localhost:5432/gateway?sslmode=disable"
export REDIS_URL="redis://localhost:6379/0"
export ADMIN_API_KEY="your-admin-secret"
go run ./cmd/gateway serve

The gateway is now listening on :8080.

4. Register a model via the admin API

# Create a model
curl -s -X POST http://localhost:8080/admin/v1/models \
  -H "X-Admin-Key: your-admin-secret" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3",
    "model_id": "meta-llama/Llama-3.1-70B-Instruct"
  }' | jq .

# Add a backend (pointing to your vLLM instance)
MODEL_ID=$(curl -s http://localhost:8080/admin/v1/models \
  -H "X-Admin-Key: your-admin-secret" | jq -r '.[0].id')

curl -s -X POST "http://localhost:8080/admin/v1/models/${MODEL_ID}/backends" \
  -H "X-Admin-Key: your-admin-secret" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "http://localhost:8000",
    "weight": 1
  }' | jq .

5. Create an API key

curl -s -X POST http://localhost:8080/admin/v1/keys \
  -H "X-Admin-Key: your-admin-secret" \
  -H "Content-Type: application/json" \
  -d '{"name": "dev-key"}' | jq .

Save the key field from the response -- it is shown only once.
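
One plausible reason the key is shown only once is that only a digest is persisted; the Key Components table mentions the `ml-` prefix and SHA-256 hashing. The sketch below shows that pattern under stated assumptions: the 24-byte random length, the hex encoding, and the function names `newAPIKey` and `hashKey` are illustrative and not taken from the gateway source.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashKey returns the hex-encoded SHA-256 digest of a plaintext key.
func hashKey(plaintext string) string {
	sum := sha256.Sum256([]byte(plaintext))
	return hex.EncodeToString(sum[:])
}

// newAPIKey returns a random plaintext key with the ml- prefix together
// with the digest a store could persist instead of the plaintext.
// The 24-byte length and hex encoding are assumptions for this sketch.
func newAPIKey() (plaintext, digest string, err error) {
	buf := make([]byte, 24)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	plaintext = "ml-" + hex.EncodeToString(buf)
	return plaintext, hashKey(plaintext), nil
}

func main() {
	key, digest, err := newAPIKey()
	if err != nil {
		panic(err)
	}
	fmt.Println("show once:", key)
	fmt.Println("persist:  ", digest)
	// Later requests are verified by hashing the presented key again.
	fmt.Println("matches:  ", hashKey(key) == digest)
}
```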

6. Make a request

# Non-streaming
curl -s http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer ml-YOUR_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }' | jq .

# Streaming
curl -N http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer ml-YOUR_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3",
    "stream": true,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
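
The same non-streaming request can be built from Go with the standard library. This is a minimal sketch: the `chatRequest` and `chatMessage` types and the `newChatRequest` helper are names chosen here for illustration, and the base URL and key are the placeholders used in the curl examples above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
	Stream   bool          `json:"stream"`
}

// newChatRequest builds a POST to /v1/chat/completions with the bearer
// token and JSON body that the curl example sends.
func newChatRequest(baseURL, apiKey, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newChatRequest("http://localhost:8080", "ml-YOUR_KEY_HERE", "llama-3", "Hello!")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it against a running gateway: resp, err := http.DefaultClient.Do(req)
}
```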

Docker Compose (full stack)

docker compose up --build

This starts the gateway, PostgreSQL, and Redis. It does not apply database migrations: run them as a separate step (see step 2 above) or from an init container.

Configuration

The gateway is configured via a YAML file and/or environment variables. Environment variables take precedence.

# Using a config file
gateway serve gateway.yaml

# Using environment variables only (config file optional)
DATABASE_URL=... REDIS_URL=... ADMIN_API_KEY=... gateway serve

Full Configuration Reference

server:
  port: 8080                # Server listen port (env: PORT)
  read_timeout: 30s
  write_timeout: 300s       # Long timeout for streaming responses
  idle_timeout: 120s

database:
  url: "${DATABASE_URL}"    # PostgreSQL connection string (required)
  max_open_conns: 25
  max_idle_conns: 10

redis:
  url: "${REDIS_URL}"       # Redis connection string (required)

auth:
  admin_key: "${ADMIN_API_KEY}"  # Admin API key for /admin/v1 endpoints

rate_limits:
  default_rpm: 60           # Default requests per minute per key
  default_tpm: 100000       # Default tokens per minute per key

health_check:
  interval: 10s             # Active health check interval
  timeout: 5s               # Health check request timeout
  healthy_threshold: 3      # Consecutive successes to mark healthy
  unhealthy_threshold: 1    # Consecutive failures to mark unhealthy

registry:
  cache_refresh_interval: 30s  # How often to refresh the model cache from PG

logging:
  level: info               # debug, info, warn, error (env: LOG_LEVEL)
  format: json              # json or text (env: LOG_FORMAT)
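
The `healthy_threshold` and `unhealthy_threshold` settings describe a small state machine over consecutive probe results. A minimal sketch of that logic follows, assuming a backend starts out healthy and that any opposite result resets the running count; the `backendHealth` type is hypothetical and not taken from the gateway source.

```go
package main

import "fmt"

// backendHealth tracks consecutive probe results for one backend.
type backendHealth struct {
	healthyThreshold   int // consecutive successes needed to mark healthy
	unhealthyThreshold int // consecutive failures needed to mark unhealthy
	healthy            bool
	successes          int
	failures           int
}

// Observe records one probe result and returns the resulting state.
func (b *backendHealth) Observe(ok bool) bool {
	if ok {
		b.failures = 0
		b.successes++
		if !b.healthy && b.successes >= b.healthyThreshold {
			b.healthy = true
		}
	} else {
		b.successes = 0
		b.failures++
		if b.healthy && b.failures >= b.unhealthyThreshold {
			b.healthy = false
		}
	}
	return b.healthy
}

func main() {
	// Values from the reference above: 3 successes to recover, 1 failure to eject.
	b := &backendHealth{healthyThreshold: 3, unhealthyThreshold: 1, healthy: true}
	for _, ok := range []bool{true, false, true, true, true} {
		fmt.Print(b.Observe(ok), " ")
	}
	fmt.Println() // prints: true false false false true
}
```

With these defaults a backend is ejected on its first failed probe but must pass three probes in a row before it receives traffic again, which favors fast failover over fast recovery.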

Environment Variable Overrides

Variable	Config Path	Description
`DATABASE_URL`	`database.url`	PostgreSQL connection string
`REDIS_URL`	`redis.url`	Redis connection string
`ADMIN_API_KEY`	`auth.admin_key`	Admin API authentication key
`PORT`	`server.port`	Server listen port
`LOG_LEVEL`	`logging.level`	Log level
`LOG_FORMAT`	`logging.format`	Log format (json/text)
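
The precedence rule (environment variables override the YAML file) can be sketched as an overlay applied after the file is loaded. The `config` struct and `applyEnv` function below are hypothetical, and treating an empty variable as "not set" is an assumption of this sketch rather than documented gateway behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// config holds only the fields that have environment overrides.
type config struct {
	DatabaseURL string
	RedisURL    string
	AdminKey    string
	Port        int
	LogLevel    string
	LogFormat   string
}

// applyEnv overlays environment values onto cfg. lookup has the same
// shape as os.LookupEnv so callers can substitute a map in tests.
func applyEnv(cfg config, lookup func(string) (string, bool)) config {
	strs := map[string]*string{
		"DATABASE_URL":  &cfg.DatabaseURL,
		"REDIS_URL":     &cfg.RedisURL,
		"ADMIN_API_KEY": &cfg.AdminKey,
		"LOG_LEVEL":     &cfg.LogLevel,
		"LOG_FORMAT":    &cfg.LogFormat,
	}
	for name, dst := range strs {
		if v, ok := lookup(name); ok && v != "" {
			*dst = v
		}
	}
	// PORT is numeric; an unparsable value leaves the file value in place.
	if v, ok := lookup("PORT"); ok {
		if p, err := strconv.Atoi(v); err == nil {
			cfg.Port = p
		}
	}
	return cfg
}

func main() {
	fromFile := config{Port: 8080, LogLevel: "info", LogFormat: "json"}
	cfg := applyEnv(fromFile, os.LookupEnv)
	fmt.Printf("port=%d level=%s format=%s\n", cfg.Port, cfg.LogLevel, cfg.LogFormat)
}
```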

API Reference

Client API (OpenAI-compatible)

All /v1 endpoints require a valid API key via Authorization: Bearer <key>.

`GET /v1/models`

List available models (filtered by the key's allowed_models).

{
  "object": "list",
  "data": [
    {"id": "llama-3", "object": "model", "created": 1710000000, "owned_by": "inference-gateway"}
  ]
}

`POST /v1/chat/completions`

Proxy chat completions to the appropriate vLLM backend. Supports both streaming ("stream": true) and non-streaming requests. The model field uses the registered alias name (not the underlying vLLM model ID).

Request:

{
  "model": "llama-3",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}

Response (non-streaming):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "llama-3",
  "choices": [{"index": 0, "message": {"role": "assistant", "content": "..."}, "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 10, "completion_tokens": 20, "total_tokens": 30}
}

Response (streaming): Server-Sent Events. Each chunk arrives as a line prefixed with `data: ` followed by a JSON payload, and the model name is rewritten to the alias in every chunk. The stream ends with `data: [DONE]`.
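
On the client side, a streaming response can be consumed line by line. The sketch below collects the payload of each `data: ` line and stops at `[DONE]`; `readSSE` and `parseSSEString` are illustrative names, and the sample chunks are made up for the example rather than captured from a real backend.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readSSE returns the payload of every "data: " line up to, but not
// including, the [DONE] sentinel. Blank separator lines are skipped.
// bufio.Scanner's default line limit (64 KiB) applies.
func readSSE(r io.Reader) ([]string, error) {
	var payloads []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		data, ok := strings.CutPrefix(sc.Text(), "data: ")
		if !ok {
			continue
		}
		if data == "[DONE]" {
			break
		}
		payloads = append(payloads, data)
	}
	return payloads, sc.Err()
}

// parseSSEString is a convenience wrapper for in-memory streams.
func parseSSEString(s string) ([]string, error) {
	return readSSE(strings.NewReader(s))
}

func main() {
	// Made-up chunks in the shape of a streamed chat completion.
	stream := "data: {\"model\":\"llama-3\",\"choices\":[{\"delta\":{\"content\":\"Hi\"}}]}\n\n" +
		"data: {\"model\":\"llama-3\",\"choices\":[{\"delta\":{\"content\":\"!\"}}]}\n\n" +
		"data: [DONE]\n\n"
	chunks, err := parseSSEString(stream)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(chunks), "chunks") // prints: 2 chunks
}
```

In a real client, pass `resp.Body` from the HTTP response to `readSSE` and decode each payload with `encoding/json`.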

`GET /health`

Returns {"status": "ok"} -- no authentication required.

`GET /metrics`

Prometheus metrics endpoint -- no authentication required.

Admin API

All /admin/v1 endpoints require an X-Admin-Key: <admin_key> header.

Models

Method	Endpoint	Description
`GET`	`/admin/v1/models`	List all models
`POST`	`/admin/v1/models`	Create a model
`GET`	`/admin/v1/models/{id}`	Get a model by ID
`PUT`	`/admin/v1/models/{id}`	Update a model (partial)
`DELETE`	`/admin/v1/models/{id}`	Delete a model

Create model request:

{
  "name": "llama-3",
  "model_id": "meta-llama/Llama-3.1-70B-Instruct",
  "active": true,
  "default_params": {"temperature": 0.7, "max_tokens": 4096},
  "reasoning_config": {"enabled": true},
  "transforms": {"system_prompt_prefix": "You are a helpful assistant."}
}
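
To make the effect of `default_params` and `system_prompt_prefix` concrete, here is a rough sketch of how such transforms could be applied before proxying. The merge rules are assumptions, not documented behavior: client-supplied parameters win over defaults, and the prefix is prepended to an existing leading system message or inserted as a new one. The types and the `applyTransforms` function are hypothetical, and `reasoning_config` is left out.

```go
package main

import "fmt"

type message struct {
	Role    string
	Content string
}

type request struct {
	Model    string
	Messages []message
	Params   map[string]any
}

// applyTransforms maps the alias to the backend model ID, fills in
// default parameters the client did not set (assumed precedence), and
// applies the system prompt prefix (assumed placement).
func applyTransforms(req request, modelID string, defaults map[string]any, prefix string) request {
	out := request{Model: modelID, Params: map[string]any{}}
	for k, v := range defaults {
		out.Params[k] = v
	}
	for k, v := range req.Params {
		out.Params[k] = v // client value overrides the default
	}
	rest := req.Messages
	if prefix != "" {
		sys := message{Role: "system", Content: prefix}
		if len(rest) > 0 && rest[0].Role == "system" {
			sys.Content = prefix + "\n" + rest[0].Content
			rest = rest[1:]
		}
		out.Messages = append(out.Messages, sys)
	}
	out.Messages = append(out.Messages, rest...)
	return out
}

func main() {
	in := request{
		Model:    "llama-3",
		Messages: []message{{Role: "user", Content: "Hello!"}},
		Params:   map[string]any{"temperature": 0.2},
	}
	out := applyTransforms(in, "meta-llama/Llama-3.1-70B-Instruct",
		map[string]any{"temperature": 0.7, "max_tokens": 4096},
		"You are a helpful assistant.")
	fmt.Println(out.Model, len(out.Messages), out.Params["temperature"], out.Params["max_tokens"])
}
```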

Backends

Method	Endpoint	Description
`POST`	`/admin/v1/models/{id}/backends`	Add a backend
`PUT`	`/admin/v1/models/{id}/backends/{bid}`	Update a backend
`DELETE`	`/admin/v1/models/{id}/backends/{bid}`	Remove a backend

Create backend request:

{
  "url": "http://vllm-host:8000",
  "weight": 1,
  "active": true
}

API Keys

Method	Endpoint	Description
`GET`	`/admin/v1/keys`	List all keys
`POST`	`/admin/v1/keys`	Create a key (returns plaintext once)
`GET`	`/admin/v1/keys/{id}`	Get key metadata
`PUT`	`/admin/v1/keys/{id}`	Update a key
`DELETE`	`/admin/v1/keys/{id}`	Revoke a key

Create key request:

{
  "name": "production-service",
  "rpm_limit": 120,
  "tpm_limit": 500000,
  "allowed_models": ["llama-3"],
  "expires_at": "2025-12-31T23:59:59Z"
}

Create key response:

{
  "key": "ml-abc123...",
  "api_key": {
    "id": "uuid",
    "name": "production-service",
    "key_prefix": "ml-abc123",
    "active": true,
    "rpm_limit": 120,
    "tpm_limit": 500000,
    "allowed_models": ["llama-3"]
  }
}

Rate Limit Headers

All /v1 responses include rate limit headers:

X-RateLimit-Limit-Requests: 60
X-RateLimit-Remaining-Requests: 58
X-RateLimit-Reset-Requests: 45s
X-RateLimit-Limit-Tokens: 100000
X-RateLimit-Remaining-Tokens: 99000
X-RateLimit-Reset-Tokens: 45s

When rate limited, the gateway returns 429 Too Many Requests with a Retry-After header.
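
A client can use these headers to decide how long to back off after a 429. The sketch below assumes `Retry-After` carries whole seconds and falls back to `X-RateLimit-Reset-Requests` parsed as a Go duration such as `45s`; the exact `Retry-After` format the gateway emits is not stated here, so treat that as an assumption. `retryDelay` is an illustrative name.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// retryDelay reports how long to wait before retrying, based on the
// response headers. The second result is false if no usable header exists.
func retryDelay(h http.Header) (time.Duration, bool) {
	// Assumed format: integer seconds.
	if v := h.Get("Retry-After"); v != "" {
		if secs, err := strconv.Atoi(v); err == nil && secs >= 0 {
			return time.Duration(secs) * time.Second, true
		}
	}
	// Fallback: the reset header shown above, e.g. "45s".
	if v := h.Get("X-RateLimit-Reset-Requests"); v != "" {
		if d, err := time.ParseDuration(v); err == nil && d >= 0 {
			return d, true
		}
	}
	return 0, false
}

func main() {
	h := http.Header{}
	h.Set("Retry-After", "45")
	d, ok := retryDelay(h)
	fmt.Println(d, ok) // prints: 45s true
}
```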

Architecture

┌──────────────────────────────────────────────────────────┐
│                      HTTP Server                          │
│                                                           │
│  Middleware Chain:                                         │
│  Recoverer -> RealIP -> RequestID -> Logging              │
│                                                           │
│  /v1 routes:  Auth -> RateLimit -> Metrics -> Handler     │
│  /admin routes:  AdminAuth -> Handler                     │
│  /health, /metrics:  (no auth)                            │
│                                                           │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────┐ │
│  │ Proxy       │  │ Admin        │  │ Load Balancer    │ │
│  │ Handler     │  │ Handler      │  │ (WRR + Health)   │ │
│  └──────┬──────┘  └──────┬───────┘  └────────┬─────────┘ │
│         │                │                    │           │
│  ┌──────▼──────┐  ┌──────▼───────┐  ┌────────▼─────────┐ │
│  │ Registry    │  │ Auth Store   │  │ Health Checker   │ │
│  │ Cache       │  │ (PG)         │  │ (Active+Passive) │ │
│  │ (Redis)     │  │              │  │                  │ │
│  └──────┬──────┘  └──────────────┘  └──────────────────┘ │
│         │                                                 │
│  ┌──────▼──────┐  ┌──────────────┐                        │
│  │ Registry    │  │ Rate Limiter │                        │
│  │ Store (PG)  │  │ (Redis)      │                        │
│  └─────────────┘  └──────────────┘                        │
└──────────────────────────────────────────────────────────┘
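
The middleware ordering in the diagram can be illustrated with plain `net/http` handlers. This is only a sketch of how such a chain composes: the `chain` and `trace` helpers are hypothetical, each middleware merely records its name, and the gateway's actual router and middleware implementations are not shown in this document.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

type middleware func(http.Handler) http.Handler

// chain wraps h so that the first middleware listed runs first.
func chain(h http.Handler, mws ...middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// trace returns a middleware that records its name and calls the next handler.
func trace(name string, order *[]string) middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*order = append(*order, name)
			next.ServeHTTP(w, r)
		})
	}
}

// runV1Chain sends one request through the /v1 ordering from the diagram
// and returns the order in which each stage saw it.
func runV1Chain() []string {
	var order []string
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		order = append(order, "Handler")
	})
	names := []string{"Recoverer", "RealIP", "RequestID", "Logging", "Auth", "RateLimit", "Metrics"}
	mws := make([]middleware, 0, len(names))
	for _, n := range names {
		mws = append(mws, trace(n, &order))
	}
	h := chain(handler, mws...)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/v1/models", nil))
	return order
}

func main() {
	fmt.Println(runV1Chain())
	// prints: [Recoverer RealIP RequestID Logging Auth RateLimit Metrics Handler]
}
```

Placing Auth before RateLimit means limits can be applied per key, since the key is already resolved when the limiter runs.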

Key Components

Component	Package	Description
Config	`internal/config`	YAML + env var configuration loading
Database	`internal/database`	PostgreSQL connection pool, embedded SQL migrations
Registry	`internal/registry`	Model/backend store (PG) + Redis-backed cache
Auth	`internal/auth`	API key generation (`ml-` prefix), SHA-256 hashing, PG store
Rate Limiter	`internal/ratelimit`	Redis sliding window (Lua script), TPM batcher
Load Balancer	`internal/loadbalancer`	Smooth WRR (Nginx algorithm), active + passive health checks
Proxy	`internal/proxy`	Non-streaming + SSE streaming reverse proxy, request transforms
Admin	`internal/admin`	REST API for model, backend, and key management
Middleware	`internal/middleware`	Auth, rate limit, metrics, request ID, logging
Metrics	`internal/metrics`	Prometheus metric definitions
Server	`internal/server`	HTTP server, router wiring, dependency injection
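
The smooth weighted round-robin algorithm named in the Load Balancer row works as follows: on every pick, each backend's running score grows by its weight, the backend with the highest score is chosen, and the sum of all weights is subtracted from the winner. Below is a minimal sketch of that algorithm. The `backend` and `smoothWRR` types are illustrative, and health filtering and locking are omitted, so this is not the gateway's `internal/loadbalancer` code.

```go
package main

import "fmt"

type backend struct {
	URL     string
	Weight  int
	current int // running score, adjusted on every pick
}

type smoothWRR struct {
	backends []*backend
}

// Next returns the next backend, or nil if none are configured.
// Not safe for concurrent use; a real balancer would guard this with a mutex.
func (s *smoothWRR) Next() *backend {
	var best *backend
	total := 0
	for _, b := range s.backends {
		b.current += b.Weight
		total += b.Weight
		if best == nil || b.current > best.current {
			best = b
		}
	}
	if best == nil {
		return nil
	}
	best.current -= total
	return best
}

func main() {
	lb := &smoothWRR{backends: []*backend{
		{URL: "a", Weight: 5},
		{URL: "b", Weight: 1},
		{URL: "c", Weight: 1},
	}}
	for i := 0; i < 7; i++ {
		fmt.Print(lb.Next().URL, " ")
	}
	fmt.Println() // prints: a a b a c a a
}
```

Compared with naive weighted round-robin, which would send five consecutive requests to `a`, the smooth variant interleaves the lighter backends while keeping the same 5:1:1 ratio over each cycle.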

Development

Build

make build          # Build binary to bin/gateway

Test

make test           # Run all tests (requires Docker for testcontainers)
go test -p 1 ./...  # Run sequentially if Docker resource pressure causes flakes

Lint

make lint           # Requires golangci-lint

Migrations

make migrate-up     # Run pending migrations
make migrate-down   # Roll back last migration

Project Structure

cmd/gateway/main.go                 Entry point (serve + migrate subcommands)
internal/
  config/                           Configuration loading
  database/                         PG connection + migrations (embedded SQL)
  registry/                         Model/backend store + Redis cache
  auth/                             API key management
  ratelimit/                        Redis sliding window rate limiter
  loadbalancer/                     Weighted round-robin + health checks
  proxy/                            Reverse proxy (streaming + non-streaming)
  admin/                            Admin REST API handlers
  middleware/                       HTTP middleware chain
  metrics/                          Prometheus metric definitions
  server/                           HTTP server + route wiring
