feat(api_server): Add OpenAI-compatible API server for MaxText models #2313
base: main
Conversation
Force-pushed 400ee8c to ce5fab1.
Great work! I will take a second round of review for the files `maxtext_generator`, `maxtext_server`, and `server_models`.
Force-pushed b70e083 to 18d3055.
Thanks! LGTM in general. Could you leverage Gemini to build some unit tests, especially for `maxtext_generator`? More unit tests are very welcome!
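As a sketch of the kind of unit test being suggested here (the class shape and method names below are hypothetical; the real `maxtext_generator` API may differ), the key idea is that the MaxText engine and tokenizer can be mocked, so the wrapper logic is testable without any TPU:

```python
from unittest import mock

# Hypothetical shape of the generator wrapper described in this PR; the real
# maxtext_generator API may differ. The point is that the engine can be
# replaced by a mock, making the surrounding logic unit-testable off-TPU.
class MaxTextGenerator:
    def __init__(self, engine, tokenizer):
        self.engine = engine
        self.tokenizer = tokenizer

    def generate(self, prompt, max_tokens):
        tokens = self.tokenizer.encode(prompt)
        out_tokens = self.engine.decode(tokens, max_tokens=max_tokens)
        return self.tokenizer.decode(out_tokens)

# Unit test without any real engine: mock both collaborators.
engine = mock.Mock()
engine.decode.return_value = [4, 5, 6]
tokenizer = mock.Mock()
tokenizer.encode.return_value = [1, 2, 3]
tokenizer.decode.return_value = "hello world"

result = MaxTextGenerator(engine, tokenizer).generate("hi", max_tokens=8)
engine.decode.assert_called_once_with([1, 2, 3], max_tokens=8)
```

Tests in this style exercise tokenization and the generation loop's plumbing while leaving engine behavior to integration tests.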
Great job! LGTM to unblock.
+1 to Ran's comment; it would be great to have some unit tests guarding your functionality.
Force-pushed d1b5ad2 to a29735a.
Force-pushed 000872a to 750ffc2.
I am fine with merging these at this point (they live in a separate file and don't break the existing codebase), but could @hengtaoguo or @bvandermoon help test and verify end-to-end? There are currently no tests for these scripts and this functionality.
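One lightweight way to get the end-to-end coverage asked for here, without standing up a real engine, is to smoke-test client code against a stub that speaks the same `/v1/completions` HTTP contract. This sketch uses only the standard library, with a response shape modeled loosely on the OpenAI completions schema (the field set is an assumption, not the PR's exact output):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stub handler that mimics the /v1/completions contract so scripts and client
# code can be smoke-tested without a real MaxText engine behind them.
class StubCompletions(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = {
            "object": "text_completion",
            "model": body.get("model", "stub"),
            "choices": [{"index": 0, "text": "stubbed output", "finish_reason": "stop"}],
        }
        payload = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep test output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), StubCompletions)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/v1/completions",
    data=json.dumps({"model": "stub", "prompt": "hi", "max_tokens": 4}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    completion = json.loads(resp.read())
server.shutdown()
```

The same request, pointed at the real server, doubles as a manual end-to-end check.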
```python
        # engine use its default configured `decode_sampling_strategy`.
        return None

    def _run_prefill_step(self, streams, decode_state, rng, logprobs, echo, temperature, top_k, top_p):
```
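For context, the `return None` fallback above plausibly belongs to a helper like the following. This is a hypothetical reconstruction for discussion, not the PR's actual code; the strategy names and dict keys are assumptions. It maps per-request OpenAI sampling parameters onto an engine strategy and returns `None` when nothing was requested:

```python
# Hypothetical reconstruction of the sampling-strategy helper discussed in
# this thread; names and the strategy vocabulary are assumptions.
def resolve_sampling_strategy(temperature=None, top_k=None, top_p=None):
    if temperature is None and top_k is None and top_p is None:
        # No per-request overrides were supplied: return None so the caller
        # lets the engine use its default configured decode_sampling_strategy.
        return None
    if top_k is not None:
        return {"algorithm": "topk", "topk": top_k, "temperature": temperature}
    if top_p is not None:
        return {"algorithm": "nucleus", "nucleus_topp": top_p, "temperature": temperature}
    return {"algorithm": "weighted", "temperature": temperature}
```

A pure function like this is exactly the kind of helper the review asks to cover with unit tests, since it needs no engine or server to exercise.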
I agree that when a server is involved it may be difficult to mock. My point is that we should at least test the helper functions.

> Since MaxEngine has been in the codebase and verified many times

I think you are assuming something that you have verified many times, but we have not. How do you ensure it's working end-to-end? Currently only you have tested the server and script; without tests, we have to "trust" you.
Force-pushed 92956a1 to c703914.
This commit introduces a fully-featured, OpenAI-compatible RESTful API server for serving MaxText models. The server is built with FastAPI, supports multi-host inference on TPUs, and is designed for both interactive use and large-scale benchmarking.

Key features and additions:

1. **Core Server Implementation:**
   - Adds `maxtext_server.py`, a FastAPI application that serves `/v1/completions` and `/v1/chat/completions` endpoints.
   - Implements dynamic request batching to efficiently utilize the underlying hardware.
   - Uses `maxtext_generator.py` to encapsulate the MaxText inference engine, handling model loading, tokenization, and the generation loop.
   - Includes Pydantic models in `server_models.py` for robust, OpenAI-compliant request and response validation.
2. **Deployment and Utilities:**
   - Provides `start_server.sh` to simplify launching the server from the project root.
   - Adds `port_forward_xpk.sh`, a utility script to automatically find and connect to a server running on a GKE cluster via `xpk`, supporting custom namespaces.
   - Isolates server-specific dependencies in `benchmarks/api_server/requirements.txt` (`uvicorn`, `fastapi`, `openai-harmony`).
3. **Comprehensive Documentation:**
   - A new `README.md` in the `api_server` directory offers a complete guide covering:
     - Installation and environment setup.
     - Launching the server in both single-pod and multi-pod GKE environments.
     - Detailed examples for interacting with the API using `curl` and the `openai` Python client.
     - Step-by-step instructions for running benchmarks with `lm-evaluation-harness` and `evalchemy` for both log-likelihood and generative tasks.
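The "dynamic request batching" mentioned in the commit message can be sketched as a small queue drain that collects requests until the batch is full or a deadline expires, whichever comes first. This is a simplified illustration of the general technique, not the PR's implementation (which presumably also tracks per-request state for streaming and decode slots):

```python
import queue
import time

# Simplified dynamic batcher: block for the first request, then opportunistically
# gather more until the batch fills or the timeout window closes.
def collect_batch(request_queue, max_batch_size=8, timeout_s=0.05):
    batch = [request_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for prompt in ["a", "b", "c"]:
    q.put({"prompt": prompt})
batch = collect_batch(q, max_batch_size=2)
```

The timeout trades a little per-request latency for much better hardware utilization, since the TPU decode step runs over the whole batch at once.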
Force-pushed c703914 to 07a0e66.
Checklist

Before submitting this PR, please make sure (put X in square brackets):