gemma4_31b: add OpenAI serving entrypoint#20473
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20473
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This PR needs a
|
There was a problem hiding this comment.
Pull request overview
Adds an OpenAI-compatible serving entrypoint for the Gemma4 31B example, reusing the shared examples/llm_server harness while keeping Gemma-specific logic under examples/models/gemma4_31b. The model runner is refactored to use a dedicated Gemma4_31BEngine / LLMSession adapter, and a new C++ JSONL worker process is introduced for model execution.
Changes:
- Introduce
Gemma4_31BEngine+ session adapter and refactor the existing runner to use it. - Add a C++ JSONL worker (
gemma4_31b_worker) and a Python OpenAI-compatible launcher (serve.py) with Gemma-specific templating/tool parsing. - Wire new targets into CMake presets/Makefile/docs and add hermetic + opt-in on-device + CUDA no-bleed validation tests.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| Makefile | Updates Gemma4 31B build targets to include worker (+ CUDA no-bleed test). |
| examples/models/gemma4_31b/README.md | Documents the serving harness, worker binary, and no-bleed test usage. |
| examples/models/gemma4_31b/CMakePresets.json | Ensures presets build runner + worker (+ CUDA no-bleed test). |
| examples/models/gemma4_31b/CMakeLists.txt | Adds worker + no-bleed test executables and links engine implementation into targets. |
| examples/models/gemma4_31b/export.py | Exports mutable-buffer metadata for multi-session isolation support. |
| examples/models/gemma4_31b/gemma4_31b_engine.h | Declares the Gemma4 engine/session adapter API. |
| examples/models/gemma4_31b/gemma4_31b_engine.cpp | Implements engine/session logic, including mutable-state rebinding (CUDA/MLX). |
| examples/models/gemma4_31b/main.cpp | Refactors runner to use the new engine/session adapter. |
| examples/models/gemma4_31b/gemma4_31b_worker.cpp | Adds JSONL worker entrypoint for OpenAI-serving execution. |
| examples/models/gemma4_31b/serve.py | Adds OpenAI-compatible Python control-plane launcher for Gemma4 31B. |
| examples/models/gemma4_31b/test_serve.py | Adds hermetic launcher/control-plane tests. |
| examples/models/gemma4_31b/test_ondevice_serving.py | Adds opt-in on-device BOS/template regression test. |
| examples/models/gemma4_31b/test_gemma4_31b_nobleed.cpp | Adds CUDA integration proof for isolated interleaved multi-session execution. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| std::string prompt_text = read_prompt(); | ||
| if (!FLAGS_prompt_file.empty() && prompt_text.empty()) { | ||
| return 1; | ||
| } |
| std::vector<std::unique_ptr<llm::LLMSession>> held; | ||
| while (true) { | ||
| auto session = engine->create_session(); | ||
| if (session.error() != Error::Ok) { | ||
| break; | ||
| } | ||
| held.push_back(std::move(session.get())); | ||
| if (held.size() > static_cast<size_t>(config.max_sessions) + 2) { | ||
| break; | ||
| } | ||
| } | ||
| check( | ||
| "capacity enforced: create_session fails past max_sessions", | ||
| held.size() <= static_cast<size_t>(config.max_sessions)); |
|
@claude review the code plz |
|
Claude finished @Gasoonjia's task in 2m 36s —— View job Code Review — Gemma4 31B serving entrypointNice refactor. The Functional concerns1. static bool valid_temperature(float t) {
return t == -1.0f || (t >= 0.0f && t <= 1.0f);
}The OpenAI chat API accepts 2. Empty prompt file is treated as fatal — Tests3. No-bleed capacity assertion keys off the wrong value — Minor / nits
CIThe 3 reported "new failures" are Docker None of the above is blocking; (1) and (3) are the two I'd address before merge. The engine/session contract adherence (poisoning semantics, terminal-step accounting, position bookkeeping) all checks out against |
| #endif | ||
|
|
||
| #if defined(EXECUTORCH_BUILD_CUDA) || defined(EXECUTORCH_BUILD_MLX) | ||
| #define GEMMA_HAS_MUTABLE_STATE 1 |
There was a problem hiding this comment.
i think we can remove this; gemma431b can only run on cuda or mlx -> EXECUTORCH_BUILD_CUDA or EXECUTORCH_BUILD_MLX will always be defined -> GEMMA_HAS_MUTABLE_STATE should always be 1.
| } | ||
|
|
||
| Error set_temperature(float temperature) { | ||
| temp_val_ = (temperature <= 0.0f) ? 1e-6f : temperature; |
There was a problem hiding this comment.
should we check the temperature here then set it?
| decode_pos_data_[0] = pos_; | ||
| std::vector<EValue> inputs; | ||
| #ifdef EXECUTORCH_BUILD_CUDA | ||
| ET_CHECK_OK_OR_RETURN_ERROR(copy_decode_inputs_to_cuda()); |
There was a problem hiding this comment.
why do we need to copy input token from host to gpu? Now input token of decode method should be always on GPU.
| endif() | ||
|
|
||
| if(TARGET mlxdelegate) | ||
| executorch_target_copy_mlx_metallib(gemma4_31b_runner) |
There was a problem hiding this comment.
runner and worker both need the MLX metallib copy
| "get_vocab_size": config.vocab_size, | ||
| "get_n_layers": config.num_hidden_layers, | ||
| "get_max_prefill_chunk": max_prefill, | ||
| "get_min_prefill_chunk": 5, |
There was a problem hiding this comment.
Do you pad up to this chunk size? Should this be backend specific? MLX can handle seqlen=1.
There was a problem hiding this comment.
get_min_prefill_chunk=5 is CUDA-only; MLX uses the default path
| constexpr const char* kMaxPrefillChunk = "get_max_prefill_chunk"; | ||
| constexpr const char* kMinPrefillChunk = "get_min_prefill_chunk"; | ||
|
|
||
| Result<uint64_t> read_sampled_token( |
Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness. This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing. Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.
2e0ac26 to
8cab3d0
Compare
|
@claude Review this PR |
|
Claude finished @mergennachin's task in 11m 11s —— View job Code Review — Gemma4 31B serving entrypointI re-read the current state of the branch against the earlier review round. Most of the previously-flagged items are already fixed in the HEAD commit, so this pass focuses on what remains. Nothing below is blocking. Previously flagged → resolved on this branch
Remaining minor / nits1. Three literal copies of the Gemma chat template. The prompt scaffold
The earlier review noted the main.cpp ↔ serve.py drift; the no-bleed test adds a third. These can silently diverge from the real HF template. A one-line comment in each pointing at the others (and noting the Python path is authoritative for serving) would help; better still, share the C++ scaffold between 2. Export-time mutable-buffer filter is implicitly coupled to the runtime. 3. 4. Correctness spot-checks (all pass)
Overall this is a clean refactor and the engine/session contract adherence checks out. The four items above are all comment-level polish — none should block merge. |
|
@claude How's the optimization for device-host copy? |
|
I'll analyze this and get back to you. |

Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness.
This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing.
Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.
#20001