139 changes: 139 additions & 0 deletions concepts/openai-proxy.mdx
@@ -0,0 +1,139 @@
---
title: OpenAI-Compatible Proxy API
description: Use Magemaker endpoints through an OpenAI-style REST interface
---

## Overview

Magemaker ships with a lightweight FastAPI server (`server.py`) that lets you interact with any SageMaker endpoint you deployed **as if it were an OpenAI model**.
This is useful when you want to:

- Replace `openai` SDK calls with your own hosted models as a drop-in change
- Integrate Magemaker deployments with existing tooling that expects an OpenAI endpoint (e.g. LangChain, LlamaIndex, ChatGPT-style UIs)

The server exposes two kinds of routes:

1. **Utility routes** to inspect / query individual endpoints
2. An **OpenAI-compatible `/chat/completions` route**

> **Note:** The proxy currently supports **chat-completion style models** only. Text-completion and other modalities will be added in future releases.

---

## Quick Start

1. Ensure you have already **deployed at least one SageMaker endpoint** with Magemaker.
2. Set the required AWS region variable (Magemaker does this automatically when you call the CLI):

```bash
export AWS_REGION_NAME=<your-aws-region> # e.g. us-east-1
```

3. Run the server:

```bash
python -m magemaker.server # or `python server.py` from repo root
```

The server will start on `http://localhost:8000` by default.
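
If you prefer to manage the ASGI process yourself, you can also launch the app with Uvicorn directly (a sketch, assuming the FastAPI instance in `server.py` is named `app`):

```bash
# Serve the proxy on all interfaces, port 8000
uvicorn server:app --host 0.0.0.0 --port 8000
```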

---

## REST Endpoints

### 1. `GET /endpoint/{endpoint_name}`
Returns metadata for a specific SageMaker endpoint.

```bash
curl http://localhost:8000/endpoint/my-bert-endpoint
```

Response example:
```json
{
  "EndpointName": "my-bert-endpoint",
  "EndpointStatus": "InService",
  "CreationTime": "2024-05-28T12:34:56Z",
  ...
}
```
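
Under the hood this route presumably wraps SageMaker's `DescribeEndpoint` API; for comparison, a minimal `boto3` sketch that fetches the same metadata directly (region and endpoint name are illustrative):

```python
import boto3

# Ask SageMaker directly for the endpoint's metadata
sagemaker = boto3.client("sagemaker", region_name="us-east-1")
details = sagemaker.describe_endpoint(EndpointName="my-bert-endpoint")
print(details["EndpointStatus"])  # e.g. "InService"
```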

### 2. `POST /endpoint/{endpoint_name}/query`
Invokes the specified endpoint with an **arbitrary payload**.

Request body (JSON):
```json
{
  "inputs": "Hello, how are you?",
  "context": ""
}
```

The `context` field is optional and defaults to an empty string.

Example cURL:
```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs":"Hello there!"}' \
  http://localhost:8000/endpoint/my-bert-endpoint/query
```
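
The same query from Python with `requests` (endpoint name is illustrative):

```python
import requests

# Send an arbitrary payload to a deployed endpoint through the proxy
resp = requests.post(
    "http://localhost:8000/endpoint/my-bert-endpoint/query",
    json={"inputs": "Hello there!"},
)
resp.raise_for_status()
print(resp.json())
```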

### 3. `POST /chat/completions`
OpenAI-compatible route that dispatches the request to the **first endpoint that was deployed with the requested model id**.

Minimal request example:
```json
{
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "messages": [{"role": "user", "content": "Tell me a joke."}]
}
```
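
As a cURL call, the same request looks like this:

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct","messages":[{"role":"user","content":"Tell me a joke."}]}' \
  http://localhost:8000/chat/completions
```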

Using the `openai` Python SDK (v1+), just point the client at the proxy:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000",  # route requests to the local proxy
    api_key="sk-ignored",              # not validated by the proxy, but required by the client
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```
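
The same approach works for tooling built on the OpenAI client, such as LangChain. A sketch using the `langchain-openai` package (the package is an assumption, not bundled with Magemaker):

```python
from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI chat wrapper at the local proxy
llm = ChatOpenAI(
    base_url="http://localhost:8000",
    api_key="sk-ignored",  # placeholder; the proxy does not check it
    model="meta-llama/Meta-Llama-3-8B-Instruct",
)
print(llm.invoke("Hello!").content)
```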

---

## Environment Variables

The server relies on the same `.env` file generated by Magemaker plus one extra variable:

| Variable | Description | Required | Default |
| -------- | ----------- | -------- | ------- |
| `AWS_REGION_NAME` | AWS region where your SageMaker endpoints live | ✅ | – |

All other variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, etc.) are read automatically through `dotenv` just like the CLI.
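
For example, a minimal `.env` for the proxy might look like this (all values are placeholders):

```bash
AWS_REGION_NAME=us-east-1
AWS_ACCESS_KEY_ID=<your-access-key-id>
AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
```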

---

## Error Handling

- The server returns `404 Not Deployed` if the requested model has **no active endpoints**.
- If an invocation fails, the query routes pass through the raw error from SageMaker.
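
From the client side, the `404` surfaces as a standard API error. A sketch of handling it with the v1 `openai` client (model name is hypothetical):

```python
from openai import NotFoundError, OpenAI

client = OpenAI(base_url="http://localhost:8000", api_key="sk-ignored")

try:
    client.chat.completions.create(
        model="some-model-without-an-endpoint",  # hypothetical, not deployed
        messages=[{"role": "user", "content": "ping"}],
    )
except NotFoundError:
    print("No active endpoint is deployed for this model.")
```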

---

## Production Deployment Tips

1. **Behind a Reverse Proxy** – Run FastAPI with Uvicorn or Gunicorn behind Nginx for TLS termination.
2. **Authentication** – Add an API key or OAuth middleware before exposing the service publicly (see the sketch after this list).
3. **Autoscaling** – Combine with Magemaker’s upcoming autoscaling feature to scale SageMaker instances based on traffic.
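
As an illustration of point 2, an API-key check can be added as FastAPI middleware. A minimal sketch; `PROXY_API_KEY` and the wiring into `server.py` are assumptions, not something Magemaker ships:

```python
import os

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()  # in practice, reuse the app defined in server.py

@app.middleware("http")
async def require_api_key(request: Request, call_next):
    # Reject requests that do not carry the expected bearer token
    expected = os.environ.get("PROXY_API_KEY")
    if expected and request.headers.get("Authorization") != f"Bearer {expected}":
        return JSONResponse(status_code=401, content={"error": "invalid or missing API key"})
    return await call_next(request)
```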

---

## Roadmap

- Support for non-chat completion routes (`/completions`, `/embeddings`)
- Multi-model routing strategies (e.g. round-robin, latency-based)
- Streaming responses
33 changes: 25 additions & 8 deletions mint.json
@@ -38,9 +38,13 @@
     "mode": "auto"
   },
   "navigation": [
-    {
+    {
       "group": "Getting Started",
-      "pages": ["about", "installation", "quick-start"]
+      "pages": [
+        "about",
+        "installation",
+        "quick-start"
+      ]
     },
     {
       "group": "Tutorials",
@@ -64,7 +68,8 @@
       "pages": [
         "concepts/deployment",
         "concepts/models",
-        "concepts/contributing"
+        "concepts/contributing",
+        "concepts/openai-proxy"
       ]
     }
   ],
@@ -77,17 +82,29 @@
     {
       "title": "Documentation",
       "links": [
-        { "label": "Getting Started", "url": "/" },
-        { "label": "Contributing", "url": "/contributing" }
+        {
+          "label": "Getting Started",
+          "url": "/"
+        },
+        {
+          "label": "Contributing",
+          "url": "/contributing"
+        }
       ]
     },
     {
       "title": "Resources",
       "links": [
-        { "label": "GitHub", "url": "https://github.com/slashml/magemaker" },
-        { "label": "Support", "url": "mailto:[email protected]" }
+        {
+          "label": "GitHub",
+          "url": "https://github.com/slashml/magemaker"
+        },
+        {
+          "label": "Support",
+          "url": "mailto:[email protected]"
+        }
       ]
     }
   ]
-}
+}