Skip to content

gemma4_31b: add OpenAI serving entrypoint#20473

Merged
mergennachin merged 1 commit into
mainfrom
gemma4-31b-serving-pr
Jun 25, 2026
Merged

gemma4_31b: add OpenAI serving entrypoint#20473
mergennachin merged 1 commit into
mainfrom
gemma4-31b-serving-pr

Conversation

@mergennachin

@mergennachin mergennachin commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness.

This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing.

Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.

#20001

Copilot AI review requested due to automatic review settings June 24, 2026 02:57
@pytorch-bot

pytorch-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20473

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 23 Pending, 3 Unrelated Failures, 1 Unclassified Failure

As of commit 8cab3d0 with merge base 3169302 (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 24, 2026
@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an OpenAI-compatible serving entrypoint for the Gemma4 31B example, reusing the shared examples/llm_server harness while keeping Gemma-specific logic under examples/models/gemma4_31b. The model runner is refactored to use a dedicated Gemma4_31BEngine / LLMSession adapter, and a new C++ JSONL worker process is introduced for model execution.

Changes:

  • Introduce Gemma4_31BEngine + session adapter and refactor the existing runner to use it.
  • Add a C++ JSONL worker (gemma4_31b_worker) and a Python OpenAI-compatible launcher (serve.py) with Gemma-specific templating/tool parsing.
  • Wire new targets into CMake presets/Makefile/docs and add hermetic + opt-in on-device + CUDA no-bleed validation tests.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
Makefile Updates Gemma4 31B build targets to include worker (+ CUDA no-bleed test).
examples/models/gemma4_31b/README.md Documents the serving harness, worker binary, and no-bleed test usage.
examples/models/gemma4_31b/CMakePresets.json Ensures presets build runner + worker (+ CUDA no-bleed test).
examples/models/gemma4_31b/CMakeLists.txt Adds worker + no-bleed test executables and links engine implementation into targets.
examples/models/gemma4_31b/export.py Exports mutable-buffer metadata for multi-session isolation support.
examples/models/gemma4_31b/gemma4_31b_engine.h Declares the Gemma4 engine/session adapter API.
examples/models/gemma4_31b/gemma4_31b_engine.cpp Implements engine/session logic, including mutable-state rebinding (CUDA/MLX).
examples/models/gemma4_31b/main.cpp Refactors runner to use the new engine/session adapter.
examples/models/gemma4_31b/gemma4_31b_worker.cpp Adds JSONL worker entrypoint for OpenAI-serving execution.
examples/models/gemma4_31b/serve.py Adds OpenAI-compatible Python control-plane launcher for Gemma4 31B.
examples/models/gemma4_31b/test_serve.py Adds hermetic launcher/control-plane tests.
examples/models/gemma4_31b/test_ondevice_serving.py Adds opt-in on-device BOS/template regression test.
examples/models/gemma4_31b/test_gemma4_31b_nobleed.cpp Adds CUDA integration proof for isolated interleaved multi-session execution.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread examples/models/gemma4_31b/main.cpp Outdated
Comment on lines 143 to 146
std::string prompt_text = read_prompt();
if (!FLAGS_prompt_file.empty() && prompt_text.empty()) {
return 1;
}
Comment on lines +241 to +254
std::vector<std::unique_ptr<llm::LLMSession>> held;
while (true) {
auto session = engine->create_session();
if (session.error() != Error::Ok) {
break;
}
held.push_back(std::move(session.get()));
if (held.size() > static_cast<size_t>(config.max_sessions) + 2) {
break;
}
}
check(
"capacity enforced: create_session fails past max_sessions",
held.size() <= static_cast<size_t>(config.max_sessions));
@Gasoonjia

Copy link
Copy Markdown
Contributor

@claude review the code plz

@claude

claude Bot commented Jun 24, 2026

Copy link
Copy Markdown

Claude finished @Gasoonjia's task in 2m 36s —— View job


Code Review — Gemma4 31B serving entrypoint

Nice refactor. The LLMEngine/LLMSession split is clean, the generic worker_loop stays model-agnostic, and the warm-resume invariant (resident_token_ids.size() == position()) is well-documented and looks correct end-to-end. Comments below, ordered by severity.

Functional concerns

1. temperature validation rejects part of the OpenAI rangegemma4_31b_engine.cpp:412

static bool valid_temperature(float t) {
  return t == -1.0f || (t >= 0.0f && t <= 1.0f);
}

The OpenAI chat API accepts temperature ∈ [0, 2]. The worker passes the request value straight through (worker_loop.h:113), so a standards-compliant client sending temperature: 1.5 gets Error::InvalidArgument, which surfaces as a hard "prefill failed" for the whole request rather than a clamp. For something billed as an OpenAI-compatible entrypoint this is a surprising rejection. Consider accepting [0, 2] (or clamping in the control plane). Fix this →

2. Empty prompt file is treated as fatalmain.cpp:68-79, main.cpp:144 (also flagged by Copilot)
read_prompt() returns "" for both an unreadable file and a valid-but-empty file, and main() exits on any empty result. Only an actual open failure should be fatal. Either signal the open error distinctly (e.g. return std::optional) or only bail when !f.is_open().

Tests

3. No-bleed capacity assertion keys off the wrong valuetest_gemma4_31b_nobleed.cpp:248-254 (also flagged by Copilot)
The loop bound and final check both use config.max_sessions, but the engine clamps real capacity to 1 when rebinding is unavailable (and serving_capacity() is the source of truth). As written, the test passes even if create_session() silently clamps to 1 — i.e. it doesn't actually prove the multi-session capacity it's named for. Assert against engine->serving_capacity().max_physical_sessions_without_weight_duplication instead. Fix this →

Minor / nits

  • gemma4_31b_engine.cpp:584add_token_piece(..., "<turn|>") and "<end_of_turn>" are added to eos_ids, but main.cpp's template emits <turn|> as the user-turn terminator inside the prompt. Since these only affect sampled-token EOS detection (not prompt tokens), this is fine — worth a one-line comment noting why both spellings are covered, as the asymmetry with format_prompt is easy to misread.
  • main.cpp vs gemma4_31b_worker.cpp — the chat-template string lives literally in main.cpp:85 (raw runner) while the server path renders via Python ChatTemplate. Two sources of truth for "what a Gemma prompt looks like." Acceptable for an example, but a drift risk worth a comment pointing each at the other.
  • gemma4_31b_engine.cpp:251 — session destructor calls mutable_state_->destroy_session() outside exec_mutex_. The worker loop is single-threaded so this is safe today, but every other mutable-state touch goes through run_locked/with_active_session; a brief note that session lifecycle is single-threaded would prevent a future concurrent caller from tripping here.

CI

The 3 reported "new failures" are Docker exec infra errors (docker exec ... failed with exit code 1) and the aarch64 wheel failure is a pre-existing CoreML InMemoryNode::Kind build break unrelated to this PR — none appear caused by these changes. A rebase onto viable/strict should clear the broken-trunk entries.


None of the above is blocking; (1) and (3) are the two I'd address before merge. The engine/session contract adherence (poisoning semantics, terminal-step accounting, position bookkeeping) all checks out against llm_session.h.
· branch gemma4-31b-serving-pr

Comment thread examples/models/gemma4_31b/gemma4_31b_engine.h Outdated
#endif

#if defined(EXECUTORCH_BUILD_CUDA) || defined(EXECUTORCH_BUILD_MLX)
#define GEMMA_HAS_MUTABLE_STATE 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can remove this; gemma431b can only run on cuda or mlx -> EXECUTORCH_BUILD_CUDA or EXECUTORCH_BUILD_MLX will always be defined -> GEMMA_HAS_MUTABLE_STATE should always be 1.

}

Error set_temperature(float temperature) {
temp_val_ = (temperature <= 0.0f) ? 1e-6f : temperature;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check the temperature here then set it?

decode_pos_data_[0] = pos_;
std::vector<EValue> inputs;
#ifdef EXECUTORCH_BUILD_CUDA
ET_CHECK_OK_OR_RETURN_ERROR(copy_decode_inputs_to_cuda());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to copy input token from host to gpu? Now input token of decode method should be always on GPU.

endif()

if(TARGET mlxdelegate)
executorch_target_copy_mlx_metallib(gemma4_31b_runner)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this repeated twice?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runner and worker both need the MLX metallib copy

"get_vocab_size": config.vocab_size,
"get_n_layers": config.num_hidden_layers,
"get_max_prefill_chunk": max_prefill,
"get_min_prefill_chunk": 5,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you pad up to this chunk size? Should this be backend specific? MLX can handle seqlen=1.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_min_prefill_chunk=5 is CUDA-only; MLX uses the default path

constexpr const char* kMaxPrefillChunk = "get_max_prefill_chunk";
constexpr const char* kMinPrefillChunk = "get_min_prefill_chunk";

Result<uint64_t> read_sampled_token(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we do topk as well?

Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness.

This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing.

Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.
@mergennachin

Copy link
Copy Markdown
Contributor Author

@claude Review this PR

@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Claude finished @mergennachin's task in 11m 11s —— View job


Code Review — Gemma4 31B serving entrypoint

I re-read the current state of the branch against the earlier review round. Most of the previously-flagged items are already fixed in the HEAD commit, so this pass focuses on what remains. Nothing below is blocking.

Previously flagged → resolved on this branch

  • temperature range now accepts [0, 2] (gemma4_31b_engine.cpp:425), matching the OpenAI range. ✅
  • Empty prompt file is no longer fatal — read_prompt() returns std::optional and only bails on an actual open failure (main.cpp:69-80,144). ✅
  • No-bleed capacity assertion now keys off engine->serving_capacity().max_physical_sessions_without_weight_duplication (test_gemma4_31b_nobleed.cpp:242-256) instead of config.max_sessions. ✅
  • GEMMA_HAS_MUTABLE_STATE macro is gone — the header now hard-#errors when neither CUDA nor MLX is defined (gemma4_31b_engine.h:28-34), per Gasoonjia's note that one is always set. ✅
  • set_temperature validation — now checks valid_temperature() before writing (gemma4_31b_engine.cpp:470-473). ✅
  • CMake metallib copy "repeated twice" — confirmed intentional: runner and worker are distinct targets that each need the copy (CMakeLists.txt:89-90). ✅

Remaining minor / nits

1. Three literal copies of the Gemma chat template. The prompt scaffold <|turn>user\n…<turn|>\n<|turn>model\n<|channel>thought\n<channel|> is now hand-written in three places:

  • main.cpp:86-87 (format_prompt)
  • test_gemma4_31b_nobleed.cpp:46-48 (encode_prompt)
  • and rendered separately via Python ChatTemplate on the serving path (serve.py:118-123).

The earlier review noted the main.cpp ↔ serve.py drift; the no-bleed test adds a third. These can silently diverge from the real HF template. A one-line comment in each pointing at the others (and noting the Python path is authoritative for serving) would help; better still, share the C++ scaffold between main.cpp and the test.

2. Export-time mutable-buffer filter is implicitly coupled to the runtime. _mutable_buffer_metadata() (export.py) selects buffers by the substring ".kv_cache.". If a future model renames those buffers, the list comes back empty, register_mutable_fqns() rejects the empty array (gemma4_31b_engine.cpp:182-188), and the engine silently clamps to a single session. It is logged at Info, so it won't be obvious in a serving deployment that --max-sessions 4 collapsed to 1. Worth a comment on the export side that the substring is a load-bearing contract with the runtime.

3. get_min_prefill_chunk = 5 has two sources of truth. It is baked into the exported metadata (export.py:265) and hardcoded as the CUDA default in the engine (gemma4_31b_engine.cpp:660). The engine prefers the metadata value when present, so they agree today, but the literal 5 living in both spots is a drift risk — a brief comment noting the engine default mirrors the export value would prevent silent skew.

4. reset() does not clear temperature_. After reset() (gemma4_31b_engine.cpp:411-417), a subsequent prefill_tokens() with initial_sampling == nullptr would reuse the stale temperature_ from the prior conversation. Harmless on the serving path (the worker always passes a SamplingConfig), but since reset() is meant to restore a clean state, resetting temperature_ to -1.0f there would make the contract self-consistent.

Correctness spot-checks (all pass)

  • The resident_token_ids.size() == position() invariant holds across the EOS/terminal-stop paths: EOS and cooperative-stop steps don't advance pos_ and aren't forwarded (gemma4_31b_engine.cpp:350-352,367-371), matching the worker's "terminal token not counted/emitted" handling (worker_loop.h:228-241).
  • The decode_one stop contract is honored — synthetic terminal step returns {0, "", is_eos=false, is_terminal=true} (gemma4_31b_engine.cpp:351).
  • CUDA decode position lookup is bounded: pos_ < max_context is enforced before pos_table + pos_ indexing (gemma4_31b_engine.cpp:373-382update_decode_pos_on_cuda:522-547).
  • Session capacity: when rebinding is unavailable, serving_capacity() returns the default 1 (gemma4_31b_engine.cpp:774-782), so create_session() and the worker's max_named = cap-1 reservation degrade safely.

Overall this is a clean refactor and the engine/session contract adherence checks out. The four items above are all comment-level polish — none should block merge.
· branch gemma4-31b-serving-pr

@mergennachin

Copy link
Copy Markdown
Contributor Author

@claude How's the optimization for device-host copy?

@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Claude finished @mergennachin's task in 1m 38s —— View job


Device⇄host copy — current state

I traced the CUDA decode/prefill path end-to-end. The optimization that Gasoonjia flagged earlier ("input token of decode should already be on GPU") is now in place, and the steady-state decode loop is essentially free of host↔device round-trips for the token and position. Summary of where each transfer stands:

Token input — no longer copied H2D per step ✅
The sampled token from step N is staged directly into the device-resident decode_tokens_dev_ at the tail of run_locked() via stage_next_decode_token_on_cuda() (gemma4_31b_engine.cpp:566-567). When the model output is a Long device tensor it uses cudaMemcpyDeviceToDevice (:508-512), so step N+1's decode_one() consumes the token already on-GPU and pushes decode_tokens_dev_ without any H2D (:387-392). The prefill→decode handoff is covered the same way, since the last prefill chunk also runs through run_locked and stages its sampled token. copy_decode_token_to_cuda() (:486) survives only as the fallback for a non-Long output.

Decode position — D2D, no host transfer ✅
When decode_pos_table_dev_ is present (the common path; it's a precomputed [0..max_context) table built once on-device, :204-220), update_decode_pos_on_cuda() copies pos_table + pos_ into the position slot with cudaMemcpyDeviceToDevice (:528-532). The H2D branch (:538-542) only fires when the table couldn't be built.

Sampled-token readback — 1 scalar D2H per step, unavoidable
read_sampled_token() (:68-93) still does a cudaMemcpyDeviceToHost of the single scalar. That's required: the host needs the value for EOS detection (:355) and tokenizer_->decode() (:357) to stream text. A 4–8 byte D2H per token is negligible; nothing to do here.

Temperature — 1 scalar H2D per step, cheap but cache-able
set_temperature() (:470-484) issues a cudaMemcpyHostToDevice of one float on every decode_one()/prefill chunk, even when the temperature hasn't changed (the worker passes the same SamplingConfig for the life of a request). It's a 4-byte copy so the win is small, but if you want to shave it, guard the copy on temp_val_ actually changing:

Error set_temperature(float temperature) {
  if (!valid_temperature(temperature)) return Error::InvalidArgument;
  float v = (temperature <= 0.0f) ? 1e-6f : temperature;
  if (v == temp_val_ && temp_initialized_) return Error::Ok;  // skip redundant H2D
  temp_val_ = v;
  ...
}

Apply this →

Verdict: the meaningful per-step copies (token + position) are now D2D, so the decode loop no longer bounces data across PCIe each step — that's the bulk of the win and it's done correctly. The only residual host transfers are the necessary scalar token readback (keep) and a redundant 4-byte temperature H2D (optional micro-opt above). One tiny dead store worth noting: decode_token_data_[0] is still written in decode_one() (:384) but is unused on the CUDA path since decode_tokens_dev_ is what's pushed — harmless, only the non-CUDA path reads it.
· branch gemma4-31b-serving-pr

@mergennachin mergennachin merged commit fa5d85a into main Jun 25, 2026
272 of 278 checks passed
@mergennachin mergennachin deleted the gemma4-31b-serving-pr branch June 25, 2026 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/cuda CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants