Add Gemma4 31B model-specific serving support on top of the shared
examples/llm_server harness.
This extracts the existing runner flow into a small
Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner
wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible
launcher. The generic server remains model-agnostic; Gemma-specific
behavior stays in examples/models/gemma4_31b, including chat-template
options, BOS handling, channel cleanup, and Gemma tool-call parsing.
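For illustration only, here is a minimal sketch of the kind of JSONL framing a launcher could use to talk to a worker while keeping sessions separate. The field names (session_id, prompt, text) and both helper functions are hypothetical and are not taken from this change; the actual wire format lives in the worker and launcher sources.

```python
import json


def encode_request(session_id, prompt):
    """Serialize one request as a single JSONL line (hypothetical schema)."""
    return json.dumps({"session_id": session_id, "prompt": prompt}) + "\n"


def route_responses(lines):
    """Group interleaved worker output by session_id.

    Each non-empty line is assumed to be a JSON object carrying a
    session_id and a text fragment. Fragments are concatenated per
    session in arrival order, so output from one session never ends
    up in another session's transcript.
    """
    per_session = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        msg = json.loads(line)
        per_session.setdefault(msg["session_id"], []).append(msg["text"])
    return {sid: "".join(parts) for sid, parts in per_session.items()}


if __name__ == "__main__":
    # Two sessions whose fragments arrive interleaved on one stream.
    stream = [
        '{"session_id": "a", "text": "Hel"}',
        '{"session_id": "b", "text": "Bon"}',
        '{"session_id": "a", "text": "lo"}',
        '{"session_id": "b", "text": "jour"}',
    ]
    print(encode_request("a", "hi"), end="")
    print(route_responses(stream))
```

Tagging every line with a session identifier is one simple way to let a single worker process serve interleaved sessions without mixing their output; it is shown here as an assumption about the design, not a description of it.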
Also wire the worker into the existing Gemma CUDA/MLX CMake presets and
Makefile targets, document the serving harness usage, and add validation
coverage: hermetic launcher tests, an opt-in on-device BOS/template
regression test, and a CUDA no-bleed integration proof for interleaved
multi-session execution.
#20001