
Conversation

babyplutokurt (Collaborator)

This commit introduces a fully featured, OpenAI-compatible RESTful API server for serving MaxText models. The server is built with FastAPI, supports multi-host inference on TPUs, and is designed for both interactive use and large-scale benchmarking.

Key features and additions:

1. **Core Server Implementation:**
   - Adds `maxtext_server.py`, a FastAPI application that serves `/v1/completions` and `/v1/chat/completions` endpoints.
   - Implements dynamic request batching to use the underlying hardware efficiently.
   - Uses `maxtext_generator.py` to encapsulate the MaxText inference engine, handling model loading, tokenization, and the generation loop.
   - Includes Pydantic models in `server_models.py` for robust, OpenAI-compliant request and response validation.

2. **Deployment and Utilities:**
   - Provides `start_server.sh` to simplify launching the server from the project root.
   - Adds `port_forward_xpk.sh`, a utility script that automatically finds and connects to a server running on a GKE cluster via `xpk`, with support for custom namespaces.
   - Isolates server-specific dependencies in `benchmarks/api_server/requirements.txt` (`uvicorn`, `fastapi`, `openai-harmony`).

3. **Comprehensive Documentation:**
   - A new `README.md` in the `api_server` directory offers a complete guide covering:
     - Installation and environment setup.
     - Launching the server in both single-pod and multi-pod GKE environments.
     - Detailed examples for interacting with the API using `curl` and the `openai` Python client (see the sketch after this list).
     - Step-by-step instructions for running benchmarks with `lm-evaluation-harness` and `evalchemy` for both log-likelihood and generative tasks.
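
As a quick illustration, talking to an OpenAI-compatible endpoint like this one typically looks as follows. This is a sketch only: the port, model name, and prompt are placeholders, not values mandated by this server; consult the `README.md` for the real ones.

```python
# Minimal sketch of querying the server with the official `openai` client.
# Assumptions: the server listens on localhost:8000 (e.g. after running
# port_forward_xpk.sh) and the model name below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

completion = client.chat.completions.create(
    model="maxtext-model",  # placeholder; use the name the server reports
    messages=[{"role": "user", "content": "Summarize MaxText in one sentence."}],
    max_tokens=64,
    temperature=0.7,
)
print(completion.choices[0].message.content)
```

The same request can be issued with `curl` against `/v1/chat/completions`; the `README.md` in `benchmarks/api_server` documents both paths.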

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have added the necessary comments to my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@RissyRan RissyRan (Collaborator) left a comment

Great work! I will take a second round of review of the files `maxtext_generator`, `maxtext_server`, and `server_models`.

@babyplutokurt babyplutokurt force-pushed the api_sever_v1 branch 5 times, most recently from b70e083 to 18d3055 on September 11, 2025 at 20:31
@RissyRan RissyRan (Collaborator) left a comment

Thanks! LGTM in general. Could you leverage Gemini to build some unit tests, especially for `maxtext_generator`? More unit tests are very welcome!

@hengtaoguo hengtaoguo (Collaborator) left a comment

Great job! LGTM to unblock.

+1 to Ran's comment; it would be great to have some unit tests guarding your functionality.

@RissyRan RissyRan (Collaborator) left a comment

I am fine with merging this for the moment (it lives in separate files and does not break the existing codebase), but could @hengtaoguo or @bvandermoon help test and verify it end-to-end? There are currently no tests for these scripts and their functionality.

```python
    # engine use its default configured `decode_sampling_strategy`.
    return None

  def _run_prefill_step(self, streams, decode_state, rng, logprobs, echo, temperature, top_k, top_p):
```

I agree that anything involving the server may be difficult to mock. My point is that we should at least test the helper functions.

> Since MaxEngine has been in the codebase and verified many times

I think the assumption here is that *you* have verified it many times, but we have not. How do you ensure it works end-to-end? Currently only you have tested the server and the scripts, so without tests we have to "trust" you.
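
For helpers like the one above, a lightweight pytest could run without standing up the server or any real hardware. A minimal sketch, assuming a `MaxTextGenerator` class whose constructor does the heavy lifting, and using `_resolve_sampling_strategy` as a hypothetical name for the helper (the snippet does not show the real one):

```python
# test_maxtext_generator.py -- a sketch only. The import path, the class name
# `MaxTextGenerator`, and the helper `_resolve_sampling_strategy` are assumed
# names; adjust them to the real module before using this.
from benchmarks.api_server import maxtext_generator


def make_generator():
  # Bypass the expensive constructor (model loading, tokenizer setup) so the
  # pure helper can be exercised in isolation.
  cls = maxtext_generator.MaxTextGenerator
  return cls.__new__(cls)


def test_sampling_strategy_falls_back_to_engine_default():
  gen = make_generator()
  # Per the snippet above: with no per-request overrides, the helper returns
  # None so the engine uses its configured `decode_sampling_strategy`.
  assert gen._resolve_sampling_strategy(temperature=None, top_k=None, top_p=None) is None
```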

@babyplutokurt babyplutokurt force-pushed the api_sever_v1 branch 3 times, most recently from 92956a1 to c703914 on September 19, 2025 at 22:53