feat(api_server): Add OpenAI-compatible API server for MaxText models #2313
base: main
Conversation
Force-pushed 400ee8c to ce5fab1.
Great work! I will take a second round of review for the files `maxtext_generator`, `maxtext_server`, and `server_models`.
Force-pushed b70e083 to 18d3055.
Thanks! LGTM in general. Could you leverage Gemini to build some unit tests, especially for `maxtext_generator`? More unit tests are very welcome!
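As a sketch of the kind of unit test being suggested here (the class shape and method names below are hypothetical; the real `maxtext_generator` API may differ), the key idea is that the MaxText engine and tokenizer can be mocked, so the wrapper logic is testable without any TPU:

```python
from unittest import mock

# Hypothetical shape of the generator wrapper described in this PR; the real
# maxtext_generator API may differ. The point is that the engine can be
# replaced by a mock, making the surrounding logic unit-testable off-TPU.
class MaxTextGenerator:
    def __init__(self, engine, tokenizer):
        self.engine = engine
        self.tokenizer = tokenizer

    def generate(self, prompt, max_tokens):
        tokens = self.tokenizer.encode(prompt)
        out_tokens = self.engine.decode(tokens, max_tokens=max_tokens)
        return self.tokenizer.decode(out_tokens)

# Unit test without any real engine: mock both collaborators.
engine = mock.Mock()
engine.decode.return_value = [4, 5, 6]
tokenizer = mock.Mock()
tokenizer.encode.return_value = [1, 2, 3]
tokenizer.decode.return_value = "hello world"

result = MaxTextGenerator(engine, tokenizer).generate("hi", max_tokens=8)
engine.decode.assert_called_once_with([1, 2, 3], max_tokens=8)
```

Tests in this style exercise tokenization and the generation loop's plumbing while leaving engine behavior to integration tests.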
Great job! LGTM to unblock.
+1 to Ran's comment; it would be great to have some unit tests guarding your functionality.
Force-pushed d1b5ad2 to a29735a.
Force-pushed 000872a to 750ffc2.
I am fine with merging these at this point (they live in a separate file and don't break the existing codebase), but could @hengtaoguo or @bvandermoon help test and verify end-to-end? There are currently no tests for these scripts and this functionality.
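One lightweight way to get the end-to-end coverage asked for here, without standing up a real engine, is to smoke-test client code against a stub that speaks the same `/v1/completions` HTTP contract. This sketch uses only the standard library, with a response shape modeled loosely on the OpenAI completions schema (the field set is an assumption, not the PR's exact output):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stub handler that mimics the /v1/completions contract so scripts and client
# code can be smoke-tested without a real MaxText engine behind them.
class StubCompletions(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = {
            "object": "text_completion",
            "model": body.get("model", "stub"),
            "choices": [{"index": 0, "text": "stubbed output", "finish_reason": "stop"}],
        }
        payload = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep test output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), StubCompletions)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/v1/completions",
    data=json.dumps({"model": "stub", "prompt": "hi", "max_tokens": 4}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    completion = json.loads(resp.read())
server.shutdown()
```

The same request, pointed at the real server, doubles as a manual end-to-end check.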
```python
        # engine use its default configured `decode_sampling_strategy`.
        return None

    def _run_prefill_step(self, streams, decode_state, rng, logprobs, echo, temperature, top_k, top_p):
```
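For context, the `return None` fallback above plausibly belongs to a helper like the following. This is a hypothetical reconstruction for discussion, not the PR's actual code; the strategy names and dict keys are assumptions. It maps per-request OpenAI sampling parameters onto an engine strategy and returns `None` when nothing was requested:

```python
# Hypothetical reconstruction of the sampling-strategy helper discussed in
# this thread; names and the strategy vocabulary are assumptions.
def resolve_sampling_strategy(temperature=None, top_k=None, top_p=None):
    if temperature is None and top_k is None and top_p is None:
        # No per-request overrides were supplied: return None so the caller
        # lets the engine use its default configured decode_sampling_strategy.
        return None
    if top_k is not None:
        return {"algorithm": "topk", "topk": top_k, "temperature": temperature}
    if top_p is not None:
        return {"algorithm": "nucleus", "nucleus_topp": top_p, "temperature": temperature}
    return {"algorithm": "weighted", "temperature": temperature}
```

A pure function like this is exactly the kind of helper the review asks to cover with unit tests, since it needs no engine or server to exercise.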
I agree that when a server is involved it may be difficult to mock. My point is that we should at least test the helper functions.

> Since MaxEngine has been in the codebase and verified many times

I think you are assuming something that you have verified many times, but we have not. How do you ensure it's working end-to-end? Currently only you have tested the server and script; without tests, we have to "trust" you.
Force-pushed 92956a1 to c703914.
This commit introduces a fully-featured, OpenAI-compatible RESTful API server for serving MaxText models. The server is built with FastAPI, supports multi-host inference on TPUs, and is designed for both interactive use and large-scale benchmarking.

Key features and additions:

1. **Core Server Implementation:**
   - Adds `maxtext_server.py`, a FastAPI application that serves `/v1/completions` and `/v1/chat/completions` endpoints.
   - Implements dynamic request batching to efficiently utilize the underlying hardware.
   - Uses `maxtext_generator.py` to encapsulate the MaxText inference engine, handling model loading, tokenization, and the generation loop.
   - Includes Pydantic models in `server_models.py` for robust, OpenAI-compliant request and response validation.
2. **Deployment and Utilities:**
   - Provides `start_server.sh` to simplify launching the server from the project root.
   - Adds `port_forward_xpk.sh`, a utility script to automatically find and connect to a server running on a GKE cluster via `xpk`, supporting custom namespaces.
   - Isolates server-specific dependencies in `benchmarks/api_server/requirements.txt` (`uvicorn`, `fastapi`, `openai-harmony`).
3. **Comprehensive Documentation:**
   - A new `README.md` in the `api_server` directory offers a complete guide covering:
     - Installation and environment setup.
     - Launching the server in both single-pod and multi-pod GKE environments.
     - Detailed examples for interacting with the API using `curl` and the `openai` Python client.
     - Step-by-step instructions for running benchmarks with `lm-evaluation-harness` and `evalchemy` for both log-likelihood and generative tasks.
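The "dynamic request batching" mentioned in the commit message can be sketched as a small queue drain that collects requests until the batch is full or a deadline expires, whichever comes first. This is a simplified illustration of the general technique, not the PR's implementation (which presumably also tracks per-request state for streaming and decode slots):

```python
import queue
import time

# Simplified dynamic batcher: block for the first request, then opportunistically
# gather more until the batch fills or the timeout window closes.
def collect_batch(request_queue, max_batch_size=8, timeout_s=0.05):
    batch = [request_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for prompt in ["a", "b", "c"]:
    q.put({"prompt": prompt})
batch = collect_batch(q, max_batch_size=2)
```

The timeout trades a little per-request latency for much better hardware utilization, since the TPU decode step runs over the whole batch at once.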
Force-pushed c703914 to 07a0e66.
Checklist

Before submitting this PR, please make sure (put X in square brackets):