gemma4_31b: add OpenAI serving entrypoint by mergennachin · Pull Request #20473 · pytorch/executorch

mergennachin · 2026-06-24T02:57:07Z

Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness.

This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing.

Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.

#20001

pytorch-bot · 2026-06-24T02:57:12Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20473

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 23 Pending, 3 Unrelated Failures, 1 Unclassified Failure

As of commit 8cab3d0 with merge base 3169302 ():

NEW FAILURES - The following jobs have failed:

pull / test-arm-backend-no-driver (test_pytest_ops_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t 5908275b24f75736af5bf61aca65eebf3db6d691de21891c139e3e49a3d46c94 /exec failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 919434cfa5a7d71b53b3c29fd1bb41d7037dd88d9bacd968903f464807736791 /exec failed with exit code 1

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

pull / test-arm-backend-public-api-backward-compatibility / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command docker exec -t 25ca8260dd967459a75a9fed0bb3d3b023338828abd4ecd69f72897646a06fd5 /exec failed with exit code 127

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / unittest / linux / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
pull / unittest-editable / macos / macos-job (gh) (detected as infra flaky with no log or failing log classifier)

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-24T02:58:25Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

Adds an OpenAI-compatible serving entrypoint for the Gemma4 31B example, reusing the shared examples/llm_server harness while keeping Gemma-specific logic under examples/models/gemma4_31b. The model runner is refactored to use a dedicated Gemma4_31BEngine / LLMSession adapter, and a new C++ JSONL worker process is introduced for model execution.

Changes:

Introduce Gemma4_31BEngine + session adapter and refactor the existing runner to use it.
Add a C++ JSONL worker (gemma4_31b_worker) and a Python OpenAI-compatible launcher (serve.py) with Gemma-specific templating/tool parsing.
Wire new targets into CMake presets/Makefile/docs and add hermetic + opt-in on-device + CUDA no-bleed validation tests.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
Makefile	Updates Gemma4 31B build targets to include worker (+ CUDA no-bleed test).
examples/models/gemma4_31b/README.md	Documents the serving harness, worker binary, and no-bleed test usage.
examples/models/gemma4_31b/CMakePresets.json	Ensures presets build runner + worker (+ CUDA no-bleed test).
examples/models/gemma4_31b/CMakeLists.txt	Adds worker + no-bleed test executables and links engine implementation into targets.
examples/models/gemma4_31b/export.py	Exports mutable-buffer metadata for multi-session isolation support.
examples/models/gemma4_31b/gemma4_31b_engine.h	Declares the Gemma4 engine/session adapter API.
examples/models/gemma4_31b/gemma4_31b_engine.cpp	Implements engine/session logic, including mutable-state rebinding (CUDA/MLX).
examples/models/gemma4_31b/main.cpp	Refactors runner to use the new engine/session adapter.
examples/models/gemma4_31b/gemma4_31b_worker.cpp	Adds JSONL worker entrypoint for OpenAI-serving execution.
examples/models/gemma4_31b/serve.py	Adds OpenAI-compatible Python control-plane launcher for Gemma4 31B.
examples/models/gemma4_31b/test_serve.py	Adds hermetic launcher/control-plane tests.
examples/models/gemma4_31b/test_ondevice_serving.py	Adds opt-in on-device BOS/template regression test.
examples/models/gemma4_31b/test_gemma4_31b_nobleed.cpp	Adds CUDA integration proof for isolated interleaved multi-session execution.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+  std::string prompt_text = read_prompt();
+  if (!FLAGS_prompt_file.empty() && prompt_text.empty()) {
+    return 1;
  }


+  std::vector<std::unique_ptr<llm::LLMSession>> held;
+  while (true) {
+    auto session = engine->create_session();
+    if (session.error() != Error::Ok) {
+      break;
+    }
+    held.push_back(std::move(session.get()));
+    if (held.size() > static_cast<size_t>(config.max_sessions) + 2) {
+      break;
+    }
+  }
+  check(
+      "capacity enforced: create_session fails past max_sessions",
+      held.size() <= static_cast<size_t>(config.max_sessions));


Gasoonjia · 2026-06-24T20:20:10Z

@claude review the code plz

claude · 2026-06-24T20:20:55Z

Claude finished @Gasoonjia's task in 2m 36s —— View job

Code Review — Gemma4 31B serving entrypoint

Nice refactor. The LLMEngine/LLMSession split is clean, the generic worker_loop stays model-agnostic, and the warm-resume invariant (resident_token_ids.size() == position()) is well-documented and looks correct end-to-end. Comments below, ordered by severity.

Functional concerns

1. temperature validation rejects part of the OpenAI range — gemma4_31b_engine.cpp:412

static bool valid_temperature(float t) {
  return t == -1.0f || (t >= 0.0f && t <= 1.0f);
}

The OpenAI chat API accepts temperature ∈ [0, 2]. The worker passes the request value straight through (worker_loop.h:113), so a standards-compliant client sending temperature: 1.5 gets Error::InvalidArgument, which surfaces as a hard "prefill failed" for the whole request rather than a clamp. For something billed as an OpenAI-compatible entrypoint this is a surprising rejection. Consider accepting [0, 2] (or clamping in the control plane). Fix this →

2. Empty prompt file is treated as fatal — main.cpp:68-79, main.cpp:144 (also flagged by Copilot)
read_prompt() returns "" for both an unreadable file and a valid-but-empty file, and main() exits on any empty result. Only an actual open failure should be fatal. Either signal the open error distinctly (e.g. return std::optional) or only bail when !f.is_open().

Tests

3. No-bleed capacity assertion keys off the wrong value — test_gemma4_31b_nobleed.cpp:248-254 (also flagged by Copilot)
The loop bound and final check both use config.max_sessions, but the engine clamps real capacity to 1 when rebinding is unavailable (and serving_capacity() is the source of truth). As written, the test passes even if create_session() silently clamps to 1 — i.e. it doesn't actually prove the multi-session capacity it's named for. Assert against engine->serving_capacity().max_physical_sessions_without_weight_duplication instead. Fix this →

Minor / nits

gemma4_31b_engine.cpp:584 — add_token_piece(..., "<turn|>") and "<end_of_turn>" are added to eos_ids, but main.cpp's template emits <turn|> as the user-turn terminator inside the prompt. Since these only affect sampled-token EOS detection (not prompt tokens), this is fine — worth a one-line comment noting why both spellings are covered, as the asymmetry with format_prompt is easy to misread.
main.cpp vs gemma4_31b_worker.cpp — the chat-template string lives literally in main.cpp:85 (raw runner) while the server path renders via Python ChatTemplate. Two sources of truth for "what a Gemma prompt looks like." Acceptable for an example, but a drift risk worth a comment pointing each at the other.
gemma4_31b_engine.cpp:251 — session destructor calls mutable_state_->destroy_session() outside exec_mutex_. The worker loop is single-threaded so this is safe today, but every other mutable-state touch goes through run_locked/with_active_session; a brief note that session lifecycle is single-threaded would prevent a future concurrent caller from tripping here.

CI

The 3 reported "new failures" are Docker exec infra errors (docker exec ... failed with exit code 1) and the aarch64 wheel failure is a pre-existing CoreML InMemoryNode::Kind build break unrelated to this PR — none appear caused by these changes. A rebase onto viable/strict should clear the broken-trunk entries.

None of the above is blocking; (1) and (3) are the two I'd address before merge. The engine/session contract adherence (poisoning semantics, terminal-step accounting, position bookkeeping) all checks out against llm_session.h.
· branch gemma4-31b-serving-pr

Gasoonjia · 2026-06-24T20:23:57Z

+#endif
+
+#if defined(EXECUTORCH_BUILD_CUDA) || defined(EXECUTORCH_BUILD_MLX)
+#define GEMMA_HAS_MUTABLE_STATE 1


i think we can remove this; gemma431b can only run on cuda or mlx -> EXECUTORCH_BUILD_CUDA or EXECUTORCH_BUILD_MLX will always be defined -> GEMMA_HAS_MUTABLE_STATE should always be 1.

Gasoonjia · 2026-06-24T20:27:35Z

+  }
+
+  Error set_temperature(float temperature) {
+    temp_val_ = (temperature <= 0.0f) ? 1e-6f : temperature;


should we check the temperature here then set it?

Gasoonjia · 2026-06-24T20:29:23Z

+    decode_pos_data_[0] = pos_;
+    std::vector<EValue> inputs;
+#ifdef EXECUTORCH_BUILD_CUDA
+    ET_CHECK_OK_OR_RETURN_ERROR(copy_decode_inputs_to_cuda());


why do we need to copy input token from host to gpu? Now input token of decode method should be always on GPU.

metascroy · 2026-06-24T22:25:59Z

 endif()

 if(TARGET mlxdelegate)
  executorch_target_copy_mlx_metallib(gemma4_31b_runner)


Is this repeated twice?

runner and worker both need the MLX metallib copy

metascroy · 2026-06-24T22:28:00Z

            "get_vocab_size": config.vocab_size,
            "get_n_layers": config.num_hidden_layers,
            "get_max_prefill_chunk": max_prefill,
+            "get_min_prefill_chunk": 5,


Do you pad up to this chunk size? Should this be backend specific? MLX can handle seqlen=1.

get_min_prefill_chunk=5 is CUDA-only; MLX uses the default path

metascroy · 2026-06-24T23:05:35Z

+constexpr const char* kMaxPrefillChunk = "get_max_prefill_chunk";
+constexpr const char* kMinPrefillChunk = "get_min_prefill_chunk";
+
+Result<uint64_t> read_sampled_token(


Do we do topk as well?

Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness. This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing. Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.

mergennachin · 2026-06-25T17:33:56Z

@claude Review this PR

claude · 2026-06-25T17:35:23Z

Claude finished @mergennachin's task in 11m 11s —— View job

Code Review — Gemma4 31B serving entrypoint

I re-read the current state of the branch against the earlier review round. Most of the previously-flagged items are already fixed in the HEAD commit, so this pass focuses on what remains. Nothing below is blocking.

Previously flagged → resolved on this branch

temperature range now accepts [0, 2] (gemma4_31b_engine.cpp:425), matching the OpenAI range. ✅
Empty prompt file is no longer fatal — read_prompt() returns std::optional and only bails on an actual open failure (main.cpp:69-80,144). ✅
No-bleed capacity assertion now keys off engine->serving_capacity().max_physical_sessions_without_weight_duplication (test_gemma4_31b_nobleed.cpp:242-256) instead of config.max_sessions. ✅
GEMMA_HAS_MUTABLE_STATE macro is gone — the header now hard-#errors when neither CUDA nor MLX is defined (gemma4_31b_engine.h:28-34), per Gasoonjia's note that one is always set. ✅
set_temperature validation — now checks valid_temperature() before writing (gemma4_31b_engine.cpp:470-473). ✅
CMake metallib copy "repeated twice" — confirmed intentional: runner and worker are distinct targets that each need the copy (CMakeLists.txt:89-90). ✅

Remaining minor / nits

main.cpp:86-87 (format_prompt)
test_gemma4_31b_nobleed.cpp:46-48 (encode_prompt)
and rendered separately via Python ChatTemplate on the serving path (serve.py:118-123).

The earlier review noted the main.cpp ↔ serve.py drift; the no-bleed test adds a third. These can silently diverge from the real HF template. A one-line comment in each pointing at the others (and noting the Python path is authoritative for serving) would help; better still, share the C++ scaffold between main.cpp and the test.

2. Export-time mutable-buffer filter is implicitly coupled to the runtime. _mutable_buffer_metadata() (export.py) selects buffers by the substring ".kv_cache.". If a future model renames those buffers, the list comes back empty, register_mutable_fqns() rejects the empty array (gemma4_31b_engine.cpp:182-188), and the engine silently clamps to a single session. It is logged at Info, so it won't be obvious in a serving deployment that --max-sessions 4 collapsed to 1. Worth a comment on the export side that the substring is a load-bearing contract with the runtime.

3. get_min_prefill_chunk = 5 has two sources of truth. It is baked into the exported metadata (export.py:265) and hardcoded as the CUDA default in the engine (gemma4_31b_engine.cpp:660). The engine prefers the metadata value when present, so they agree today, but the literal 5 living in both spots is a drift risk — a brief comment noting the engine default mirrors the export value would prevent silent skew.

4. reset() does not clear temperature_. After reset() (gemma4_31b_engine.cpp:411-417), a subsequent prefill_tokens() with initial_sampling == nullptr would reuse the stale temperature_ from the prior conversation. Harmless on the serving path (the worker always passes a SamplingConfig), but since reset() is meant to restore a clean state, resetting temperature_ to -1.0f there would make the contract self-consistent.

Correctness spot-checks (all pass)

The resident_token_ids.size() == position() invariant holds across the EOS/terminal-stop paths: EOS and cooperative-stop steps don't advance pos_ and aren't forwarded (gemma4_31b_engine.cpp:350-352,367-371), matching the worker's "terminal token not counted/emitted" handling (worker_loop.h:228-241).
The decode_one stop contract is honored — synthetic terminal step returns {0, "", is_eos=false, is_terminal=true} (gemma4_31b_engine.cpp:351).
CUDA decode position lookup is bounded: pos_ < max_context is enforced before pos_table + pos_ indexing (gemma4_31b_engine.cpp:373-382 → update_decode_pos_on_cuda:522-547).
Session capacity: when rebinding is unavailable, serving_capacity() returns the default 1 (gemma4_31b_engine.cpp:774-782), so create_session() and the worker's max_named = cap-1 reservation degrade safely.

Overall this is a clean refactor and the engine/session contract adherence checks out. The four items above are all comment-level polish — none should block merge.
· branch gemma4-31b-serving-pr

mergennachin · 2026-06-25T17:50:19Z

@claude How's the optimization for device-host copy?

claude · 2026-06-25T17:51:35Z

Claude finished @mergennachin's task in 1m 38s —— View job

Device⇄host copy — current state

I traced the CUDA decode/prefill path end-to-end. The optimization that Gasoonjia flagged earlier ("input token of decode should already be on GPU") is now in place, and the steady-state decode loop is essentially free of host↔device round-trips for the token and position. Summary of where each transfer stands:

Token input — no longer copied H2D per step ✅
The sampled token from step N is staged directly into the device-resident decode_tokens_dev_ at the tail of run_locked() via stage_next_decode_token_on_cuda() (gemma4_31b_engine.cpp:566-567). When the model output is a Long device tensor it uses cudaMemcpyDeviceToDevice (:508-512), so step N+1's decode_one() consumes the token already on-GPU and pushes decode_tokens_dev_ without any H2D (:387-392). The prefill→decode handoff is covered the same way, since the last prefill chunk also runs through run_locked and stages its sampled token. copy_decode_token_to_cuda() (:486) survives only as the fallback for a non-Long output.

Decode position — D2D, no host transfer ✅
When decode_pos_table_dev_ is present (the common path; it's a precomputed [0..max_context) table built once on-device, :204-220), update_decode_pos_on_cuda() copies pos_table + pos_ into the position slot with cudaMemcpyDeviceToDevice (:528-532). The H2D branch (:538-542) only fires when the table couldn't be built.

Sampled-token readback — 1 scalar D2H per step, unavoidable
read_sampled_token() (:68-93) still does a cudaMemcpyDeviceToHost of the single scalar. That's required: the host needs the value for EOS detection (:355) and tokenizer_->decode() (:357) to stream text. A 4–8 byte D2H per token is negligible; nothing to do here.

Temperature — 1 scalar H2D per step, cheap but cache-able
set_temperature() (:470-484) issues a cudaMemcpyHostToDevice of one float on every decode_one()/prefill chunk, even when the temperature hasn't changed (the worker passes the same SamplingConfig for the life of a request). It's a 4-byte copy so the win is small, but if you want to shave it, guard the copy on temp_val_ actually changing:

Error set_temperature(float temperature) {
  if (!valid_temperature(temperature)) return Error::InvalidArgument;
  float v = (temperature <= 0.0f) ? 1e-6f : temperature;
  if (v == temp_val_ && temp_initialized_) return Error::Ok;  // skip redundant H2D
  temp_val_ = v;
  ...
}

Apply this →

Verdict: the meaningful per-step copies (token + position) are now D2D, so the decode loop no longer bounces data across PCIe each step — that's the bulk of the win and it's done correctly. The only residual host transfers are the necessary scalar token readback (keep) and a redundant 4-byte temperature H2D (optional micro-opt above). One tiny dead store worth noting: decode_token_data_[0] is still written in decode_one() (:384) but is unused on the CUDA path since decode_tokens_dev_ is what's pushed — harmless, only the non-CUDA path reads it.
· branch gemma4-31b-serving-pr

Copilot AI review requested due to automatic review settings June 24, 2026 02:57

mergennachin requested review from kirklandsign and larryliu0820 as code owners June 24, 2026 02:57

mergennachin temporarily deployed to cadence June 24, 2026 02:57 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 24, 2026

mergennachin requested review from Gasoonjia, digantdesai and metascroy June 24, 2026 02:57

Copilot started reviewing on behalf of mergennachin June 24, 2026 02:57 View session

mergennachin added the ciflow/cuda label Jun 24, 2026

Copilot AI reviewed Jun 24, 2026

View reviewed changes

mergennachin mentioned this pull request Jun 24, 2026

examples/models/gemma4_31b: CUDA Engine/Session adapter + OpenAI serving #20207

Closed

Gasoonjia reviewed Jun 24, 2026

View reviewed changes

metascroy reviewed Jun 24, 2026

View reviewed changes

metascroy approved these changes Jun 24, 2026

View reviewed changes

mergennachin force-pushed the gemma4-31b-serving-pr branch from 2e0ac26 to 8cab3d0 Compare June 25, 2026 17:23

mergennachin temporarily deployed to cadence June 25, 2026 17:23 — with GitHub Actions Inactive

mergennachin merged commit fa5d85a into main Jun 25, 2026
272 of 278 checks passed

mergennachin deleted the gemma4-31b-serving-pr branch June 25, 2026 18:43

Uh oh!

Conversation

mergennachin commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20473

❌ 2 New Failures, 23 Pending, 3 Unrelated Failures, 1 Unclassified Failure

Uh oh!

github-actions Bot commented Jun 24, 2026

This PR needs a release notes: label

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Gasoonjia commented Jun 24, 2026

Uh oh!

claude Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — Gemma4 31B serving entrypoint

Functional concerns

Tests

Minor / nits

CI

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mergennachin commented Jun 25, 2026

Uh oh!

claude Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — Gemma4 31B serving entrypoint

Previously flagged → resolved on this branch

Remaining minor / nits

Correctness spot-checks (all pass)

Uh oh!

mergennachin commented Jun 25, 2026

Uh oh!

claude Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Device⇄host copy — current state

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mergennachin commented Jun 24, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 24, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 24, 2026 •

edited

Loading

claude Bot commented Jun 25, 2026 •

edited

Loading

claude Bot commented Jun 25, 2026 •

edited

Loading