fix: Fix Kimi K2.5 function calling and structured outputs #1430

Open

gongwei-130 wants to merge 27 commits into main from wei/kimi-k25-consolidate

Conversation

@gongwei-130 (Collaborator) commented May 2, 2026

Summary

  • Consolidate Kimi K2.5 function-calling, reasoning, logging, health probe, and multimodal handling fixes
  • Add support for response_format.type = regex (a sketch of the new variant follows this list)
  • Ensure regex structured-output responses skip reasoning parsing so content is emitted normally in streaming and non-streaming paths
  • Make /health_generate inference probe timeout configurable via --health-generate-timeout-secs, defaulting to 3 seconds
  • Clean up lint/formatting issues
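
A plausible shape for the new response_format variant, going by the CodeRabbit walkthrough below ("ResponseFormat adds Regex { pattern } variant"); the actual definition lives in crates/protocols and may differ:

#[derive(serde::Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ResponseFormat {
    Text,
    JsonObject,
    JsonSchema { json_schema: serde_json::Value },
    // New in this PR: {"response_format": {"type": "regex", "pattern": "..."}}
    Regex { pattern: String },
}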

Verification

  • cargo test -p smg routers::grpc::utils::parsers::tests::response_format_regex_is_constrained
  • cargo test -p smg config::types::tests::test_router_config_default
  • cargo test -p smg config::types::tests::test_router_config_deserializes_skills_enabled_flag
  • cargo fmt --check
  • cargo clippy -p smg --all-targets -- -D warnings

Summary by CodeRabbit

Release Notes

  • New Features

    • Added optional message hashing for session reconstruction logging via --enable-message-hash flag
    • Added support for user-supplied request ID pass-through in chat, completion, and messages APIs
    • Added ResponseFormat::Regex constraint support across inference backends
    • Added thinking/extended thinking configuration support for compatible models
    • Improved health monitoring with internal probes and recent-traffic detection
  • Improvements

    • Enhanced tool call parsing with argument validation and schema-based coercion
    • Optimized streaming responses with token timing tracking
    • Better handling of special tokens in tokenization
    • Improved reasoning parser detection for tool-call sections
    • Added TypeScript-style tool description formatting for chat templates

vschandramourya and others added 26 commits May 1, 2026 15:17
… argument types

Cherry-pick from mourya/engine_metrics_grpc_fixes (65f6c1c).

1. Register `raise_exception` as a Jinja global in chat template env
   so HuggingFace templates that use `{{ raise_exception("msg") }}`
   don't crash with "undefined function".

2. Add `coerce_tool_args_to_schema()` post-processor in processor.rs:
   after tool calls are parsed, walk the function schema's `properties`
   and coerce values to match declared types:
   - Number/Bool/Null → String when schema says "string"
   - String → Number/Integer when schema says "number"/"integer"
   - String → Bool when schema says "boolean"
   - Scalar → Array when schema says "array"

   Fixes fc-dash regressions where the model emits correct tokens but
   the parsed JSON has wrong types:
   - bfcl/simple_javascript_43: `error: null` → `error: "null"`
   - bfcl/simple_python_218: `patient_id: 546382` → `"546382"`
   - bfcl/parallel_117: `year: 2000` → `"2000"`

Signed-off-by: Connor Li <[email protected]>
Made-with: Cursor
Signed-off-by: Connor Li <[email protected]>
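
A minimal sketch of the coercion pass described in the commit above; the helper name matches the commit message, but the body here is illustrative (assuming serde_json and a schema `properties` object), not the actual implementation:

use serde_json::Value;

// Sketch only: walk the declared property types and coerce parsed argument
// values to match, per the rules listed in the commit message.
fn coerce_tool_args_to_schema(args: &mut Value, properties: &Value) {
    let (Some(props), Some(obj)) = (properties.as_object(), args.as_object_mut()) else {
        return;
    };
    for (key, val) in obj.iter_mut() {
        let declared = props
            .get(key)
            .and_then(|p| p.get("type"))
            .and_then(|t| t.as_str());
        let replacement = match (declared, &*val) {
            // Number/Bool/Null -> String when the schema says "string"
            (Some("string"), v) if !v.is_string() => Some(Value::String(
                if v.is_null() { "null".into() } else { v.to_string() },
            )),
            // String -> Integer / Number when the schema says so
            (Some("integer"), Value::String(s)) => s.parse::<i64>().ok().map(Value::from),
            (Some("number"), Value::String(s)) => s
                .parse::<f64>()
                .ok()
                .and_then(serde_json::Number::from_f64)
                .map(Value::Number),
            // String -> Bool when the schema says "boolean"
            (Some("boolean"), Value::String(s)) => s.parse::<bool>().ok().map(Value::Bool),
            // Scalar -> single-element Array when the schema says "array"
            (Some("array"), v) if !v.is_array() => Some(Value::Array(vec![v.clone()])),
            _ => None,
        };
        if let Some(new_val) = replacement {
            *val = new_val;
        }
    }
}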
…ol-call tokens

- Prioritize explicitly configured tool parser over JSON schema parsing
- Support alternative delimiters (<|func_start|>/<|func_end|>) in KimiK2 parser
- Prevent reasoning parser from consuming tool call markers
- Strip leaked chatml tokens only when a parser is explicitly configured
- Truncate trailing content after first valid JSON object for JSON response formats

Signed-off-by: ConnorLi96 <[email protected]>
Made-with: Cursor
- Fix tiktoken encoder to use encode_with_special_tokens for chat template output
- Prioritize configured tool parser over JSON schema with fallthrough
- Skip reasoning parsing when output is constrained (specific tool_choice / JSON response_format)
- Support alternative delimiters and hyphens in KimiK2 tool call parser
- Prevent reasoning parser from consuming tool call markers
- Keep tool call arguments as strings to preserve formatting
- Generate TypeScript-style tool declarations for Kimi K2.5
- Strip leaked chatml tokens conditionally and clean up JSON response format
- Add name field to Tool ChatMessage variant
- Graceful metrics server bind failure

Signed-off-by: ConnorLi96 <[email protected]>
Made-with: Cursor
…on_config.json

Models like Kimi-K2.5 define different EOS tokens in config.json ([EOS]=163585)
and generation_config.json (<|im_end|>=163586). The engine must stop at both.

SGLang handles this internally by merging both sources, but TRT-LLM's gRPC path
does not. This change:

- Adds load_eos_token_ids() to read and merge EOS IDs from both config files
- Exposes eos_token_ids() on the Tokenizer trait
- Passes merged IDs as stop_token_ids in TRT-LLM gRPC requests
- Removes the hardcoded <|im_end|> stop string workaround (bfab5ad)
- Strips [EOS]/[BOS] text from constrained decoding output

Result: SMG+TRT-LLM fc-dash 55/60 (was 52/60 with hack, 18/60 without)

Signed-off-by: ConnorLi96 <[email protected]>
Made-with: Cursor
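
A sketch of the merge this commit describes, assuming both config files are already parsed to serde_json::Value; the real helper is load_eos_token_ids() in the tokenizer crate and may read and validate differently:

use std::collections::BTreeSet;
use serde_json::Value;

// Collect eos_token_id from both config.json and generation_config.json;
// either file may store a single id or a list of ids.
fn merge_eos_token_ids(config: &Value, generation_config: &Value) -> Vec<u32> {
    let mut ids = BTreeSet::new();
    for cfg in [config, generation_config] {
        match cfg.get("eos_token_id") {
            Some(Value::Number(n)) => {
                if let Some(id) = n.as_u64() {
                    ids.insert(id as u32);
                }
            }
            Some(Value::Array(arr)) => {
                ids.extend(arr.iter().filter_map(Value::as_u64).map(|id| id as u32));
            }
            _ => {}
        }
    }
    // For Kimi-K2.5 this yields both 163585 ([EOS]) and 163586 (<|im_end|>).
    ids.into_iter().collect()
}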
… debug

During streaming detokenization, multi-byte characters (CJK, emoji) are
split across token boundaries. Decoding a partial token sequence produces
incomplete UTF-8, which is expected and handled correctly by the lossy
fallback + streaming buffer. This was logging at WARN level, flooding
production logs on every request containing non-ASCII output.

Made-with: Cursor
…g image_url

1. Add Anthropic-style `thinking` field to ChatCompletionRequest:
   `{"thinking": {"type": "enabled", "budget_tokens": N}}` or
   `{"thinking": {"type": "disabled"}}`. Normalized to
   chat_template_kwargs (both `thinking` and `enable_thinking` keys)
   so it works with Kimi-K2.5 and Qwen3 templates.

2. Support bare string image_url format: accept both
   `"image_url": "https://..."` and `"image_url": {"url": "https://..."}`
   via custom serde deserializer on ImageUrl.

Made-with: Cursor
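
A hedged sketch of how the Anthropic-style field could deserialize; variant and field names follow the commit message, not necessarily the real structs:

use serde::Deserialize;

// {"thinking": {"type": "enabled", "budget_tokens": N}} or {"type": "disabled"}
#[derive(Debug, Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
enum ThinkingConfig {
    Enabled { budget_tokens: Option<u64> },
    Disabled,
}

// Downstream, per the commit, this is normalized into chat_template_kwargs
// under both the `thinking` and `enable_thinking` keys so Kimi-K2.5 and
// Qwen3 templates both pick it up.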
… when thinking is active

When separate_reasoning is true the chat template injects <think> in the
prompt, so the model emits thinking content regardless of backend
constraints.  TRT-LLM does not have TGL's ReasonerGrammarObject that
defers JSON constraint enforcement until after </think>, so the raw
output contains <think>...</think> followed by the actual JSON or
tool-call tokens.

The existing output_is_constrained gate unconditionally skipped the
reasoning parser for response_format json_object/json_schema and
tool_choice Function, assuming constrained decoding produces pure JSON.
This left thinking content in the response, breaking structured output
parsing (0/3 on all 4 structured-output tests) and tool_choice streaming
(0/3).

Changes:
- Gate output_is_constrained on !separate_reasoning so the reasoning
  parser runs when thinking is expected (processor.rs + streaming.rs)
- Use native tool parser instead of JSON parser for streaming
  tool_choice Function when reasoning is active, since the backend may
  emit native tool-call tags rather than constrained JSON
- Strip markdown code fences (```json / ```) from streaming JSON
  responses, including residual language tags split across chunk
  boundaries

Tested: all 5 previously-failing tests now pass 3/3 (47/58 → 52/58).
Made-with: Cursor
The streaming path incorrectly gated reasoning_parser_available on
!has_constrained_output(), preventing the reasoning parser from running
when response_format (json_object/json_schema) was set — even when
separate_reasoning was true. This caused thinking content to leak into
the content field instead of reasoning_content during streaming.

The non-streaming path correctly handled this case by not including the
constrained output check in reasoning_parser_available. The downstream
output_is_constrained variable already correctly gates on
!separate_reasoning, so the upfront check was both redundant and wrong.
Two fixes for Kimi K2.5 fc-dash regressions:

1. Rewrite user-supplied tool_call IDs (e.g. "call_1") to the
   "functions.NAME:INDEX" format that the Kimi K2 chat template and
   parser expect. Without this, the model generates unparseable IDs
   and the tool parser returns raw special tokens in content.

2. Handle markdown code fences (```json) that split across streaming
   chunk boundaries. Add a cross-chunk buffer that holds partial fence
   suffixes and flushes/drops fence residue at end-of-stream.

Fixes: multi-turn/native-tag-leak, native-tag-leak-streaming,
structured-outputs/json-mode-streaming, call-feedback-classification-streaming

Signed-off-by: Connor Li <[email protected]>
…ogging

Replace the cached-flag-only health_generate with a real inference probe:
sends POST /v1/chat/completions (max_tokens=1, temperature=0, stream=false,
3s timeout) through the router's own API port (self-call via 127.0.0.1).

Adds structured INFO/WARN logs at smg::health target:
- "sending real inference probe" — before the request, with model + URL
- "probe succeeded" — on 200, with duration_ms + worker count
- "probe failed (backend error)" — on non-2xx, with status + duration
- "probe failed (transport error)" — on timeout/network error

Widens scope from External-runtime workers only to ALL workers, so
gRPC-connected backends (sglang, TRT-LLM, vLLM) are visible to K8s
readiness probes. Both changes are intentional K8s probe semantic changes.

Signed-off-by: Connor Li <[email protected]>
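
A simplified sketch of the self-call probe, assuming a reqwest client with the json feature; the 3s timeout matches this commit (a later review snippet shows 60s), and error handling is pared down:

use std::time::{Duration, Instant};

async fn probe_health(client: &reqwest::Client, api_port: u16, model: &str) -> bool {
    let url = format!("http://127.0.0.1:{api_port}/v1/chat/completions");
    let body = serde_json::json!({
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
        "temperature": 0,
        "stream": false
    });
    tracing::info!(target: "smg::health", %url, "sending real inference probe");
    let started = Instant::now();
    match client
        .post(&url)
        .json(&body)
        .timeout(Duration::from_secs(3))
        .send()
        .await
    {
        Ok(resp) if resp.status().is_success() => {
            tracing::info!(
                target: "smg::health",
                duration_ms = started.elapsed().as_millis() as u64,
                "probe succeeded"
            );
            true
        }
        Ok(resp) => {
            tracing::warn!(target: "smg::health", status = %resp.status(), "probe failed (backend error)");
            false
        }
        Err(err) => {
            tracing::warn!(target: "smg::health", error = %err, "probe failed (transport error)");
            false
        }
    }
}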
Add request_id field to ChatCompletionRequest, CompletionRequest, and
CreateMessageRequest protocol structs. When a user supplies request_id
in the request payload, SMG now passes it through to the engine as the
gRPC request ID instead of generating chatcmpl-{Uuid}. This enables
correlating engine logs with upstream request tracking.

All 4 API paths updated: Chat, Completions, Messages, and Responses
(Harmony). Each path logs an INFO message when a user-supplied
request_id is used.

Also fixes clippy uninlined_format_args and unnecessary qualification
warnings in server.rs (from cargo fmt reformatting).

Signed-off-by: Connor Li <[email protected]>
Signed-off-by: Connor Li <[email protected]>
…ruction

Add --enable-message-hash CLI flag (default: false) that computes
per-message SHA-256 hashes matching TRT-LLM's openai_server.py format:
sha256(role + "\x00" + content).hexdigest()[:12]

When enabled, hashes are emitted as structured INFO logs on the
smg::request target with the request_id, enabling session
reconstruction for traffic simulation across all engine runtimes
(TRT-LLM, sglang, vLLM).

Hash computation wired through RouterConfig -> pipeline constructors ->
stage constructors for all API paths. Python bindings updated with
enable_message_hash field and --enable-message-hash CLI argument.

Signed-off-by: Connor Li <[email protected]>
Signed-off-by: Connor Li <[email protected]>
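
A sketch of the hash itself, assuming the sha2 and hex crates; it mirrors the formula in the commit message:

use sha2::{Digest, Sha256};

// sha256(role + "\x00" + content).hexdigest()[:12], matching TRT-LLM's
// openai_server.py so hashes line up across runtimes.
fn message_hash(role: &str, content: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(role.as_bytes());
    hasher.update(b"\x00");
    hasher.update(content.as_bytes());
    hex::encode(hasher.finalize())[..12].to_string()
}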
…ripping

The fence_buffer approach from 06574da failed to strip the "json\n"
prefix when markdown code fences split across streaming chunks (e.g.
"```" in chunk N, "json\n{..." in chunk N+1). Replace with a simpler
fence_backticks_stripped state flag that tracks when backticks were
stripped, then removes leading language tags from the next chunk.

Signed-off-by: Connor Li <[email protected]>
Extract inline fence stripping logic into a standalone
strip_json_fence() function for testability. Add 7 unit tests covering:
- full fence in single chunk
- cross-chunk split (backticks then language tag)
- CRLF line endings
- standalone language tag drop
- no-fence passthrough
- uppercase JSON/json variants

No behavior change — pure refactor of commit eae26bc.

Signed-off-by: Connor Li <[email protected]>
Made-with: Cursor
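
A rough sketch of the stripping state machine these two commits describe (names illustrative, not the project's code); note the review threads below flag that a global replace() can eat backticks that are legitimate JSON content:

// `stripped_backticks` persists across chunks so a language tag arriving in
// the next chunk ("json\n{...") can still be dropped.
fn strip_json_fence(chunk: &str, stripped_backticks: &mut bool) -> String {
    let mut out = chunk.replace("```json", "").replace("```JSON", "");
    if out.contains("```") {
        out = out.replace("```", "");
        *stripped_backticks = true;
    } else if *stripped_backticks {
        for tag in ["json\n", "JSON\n", "json\r\n", "JSON\r\n"] {
            if let Some(rest) = out.strip_prefix(tag) {
                out = rest.to_string();
                break;
            }
        }
        *stripped_backticks = false;
    }
    out
}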
Add MessageHash message type and message_hashes field to GenerateRequest
proto. Wire computed hashes from request building stages through the gRPC
client to TRT-LLM backend, gated behind --enable-message-hash.

Signed-off-by: Connor Li <[email protected]>
Signed-off-by: Connor Li <[email protected]>
Prevents false-positive health check failures under load and during
graceful shutdown, which were causing unnecessary k8s pod restarts.
When response_format is JsonObject or JsonSchema, the model produces
JSON content, not reasoning tokens. The kimi_k25 reasoning parser was
consuming this JSON as reasoning_content, causing structured-outputs
fc-dash tests to fail (53/58 -> 58/58 with this fix).

Check response_format directly instead of gating on !separate_reasoning
(which was always false for Kimi, making the constrained-output check
a no-op). Applied to all three code paths: non-streaming single choice,
non-streaming batch, and streaming.

Also fix pre-existing clippy warnings: unnecessary qualifications for
ResponseFormat, unneeded binding pattern in monitor.rs, and missing
too_many_arguments expects in client.rs and trtllm_service.rs. Format
request_building.rs files.

Signed-off-by: Connor Li <[email protected]>
Mirrors TRT-LLM / TGL approach: if any token chunk was forwarded to a
client within the last 10 seconds, /health_generate returns 200
immediately without sending a synthetic inference request.

Under high load the probe adds backpressure and can itself time out
under k8s's 10s kubelet deadline, causing spurious restarts. The
timestamp fast-path eliminates that failure mode.

Implementation:
- AppContext gains last_token_time: Arc<AtomicU64> (unix seconds, 0 = none)
- StreamingProcessor (gRPC path) wraps the SSE channel with a map() that
  stores the current timestamp on every successful chunk via
  build_tracked_sse_response()
- OpenAI HTTP relay (chat.rs) stores the timestamp after each successful
  tx.send() in the bytes relay task
- health_generate checks last_token_time first; falls back to the real
  inference probe only when the server has been idle for >=10s

Signed-off-by: Connor Li <[email protected]>
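
A compact sketch of the fast-path check, with the store side shown as a comment; names are illustrative:

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// On every chunk successfully forwarded to a client, the streaming paths do:
//   last_token_time.store(now_unix_secs(), Ordering::Relaxed);
// health_generate then short-circuits when traffic is recent:
fn recently_served(last_token_time: &AtomicU64) -> bool {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    let last = last_token_time.load(Ordering::Relaxed); // 0 = no token seen yet
    last != 0 && now.saturating_sub(last) < 10
}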
Both TRT-LLM (KimiK25InputProcessor) and TGL/SGLang
(KimiGPUProcessorWrapper) expand a single <|media_pad|> placeholder
to N vision tokens server-side based on grid_thws.

The previous code pre-expanded to N tokens in SMG, causing double
expansion and a "More media placeholder tokens than media items"
400 error from the engine.

Switch from PromptReplacement::repeated(num_tokens) to
PromptReplacement::sequence(vec![1]) so SMG sends exactly 1
placeholder per image, matching the engine contract.

Signed-off-by: Connor Li <[email protected]>
Made-with: Cursor
SGLang uses N <|media_pad|> tokens per image directly for embedding
lookup, while TRT-LLM's KimiK25InputProcessor re-expands each single
placeholder to N tokens based on grid_thws.

The previous commit used 1 placeholder per image unconditionally,
which fixed TRT-LLM (no double-expansion) but broke SGLang (embedding
split_with_sizes mismatch: tensor=3588 vs split_sizes=[1]).

Now: prompt_replacements emits N tokens (SGLang-compatible), and
collapse_media_placeholders() in request_building collapses consecutive
runs to 1 for TRT-LLM only, gated on builder_client.is_trtllm().

Applied to both Chat and Messages API request building paths.

Signed-off-by: Connor Li <[email protected]>
Made-with: Cursor
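
A sketch of the TRT-LLM-only collapse (illustrative signature); one of the CodeRabbit comments below notes an adjacent-image caveat for this approach:

// Squeeze each consecutive run of the media-pad token down to one
// placeholder; TRT-LLM's KimiK25InputProcessor re-expands it server-side.
fn collapse_media_placeholders(token_ids: &[u32], media_pad_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(token_ids.len());
    let mut prev_was_pad = false;
    for &id in token_ids {
        let is_pad = id == media_pad_id;
        if !(is_pad && prev_was_pad) {
            out.push(id);
        }
        prev_was_pad = is_pad;
    }
    out
}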

coderabbitai Bot commented May 2, 2026

📝 Walkthrough

This PR introduces per-message SHA-256 hashing for session reconstruction logging, request ID passthrough across chat/completion/messages APIs, last-token-time tracking for optimized health probing, improved tokenizer EOS token handling, constrained output detection, tool-call argument coercion, Regex response format support across backends, KimiK2 tool-call delimiter variants, chat-template enhancements (TypeScript tool declarations, raise_exception), custom ImageUrl serde, thinking config support, and refactored health probing with intelligent fast-paths.

Changes

Unified Request Tracking, Constraint Handling, and Token Management

  • Data Shape & Protocol Updates (crates/protocols/src/*, crates/grpc_client/proto/*): ChatCompletionRequest/CompletionRequest/CreateMessageRequest gain optional request_id fields; ChatCompletionRequest adds optional thinking: Option<ThinkingConfig>; ResponseFormat adds Regex { pattern } variant; ImageUrl custom serde (conditional detail, accept string or object); GenerateRequest proto gains repeated MessageHash field; new MessageHash proto message added.
  • Core Tokenizer & Trait Updates (crates/tokenizer/src/traits.rs, crates/tokenizer/src/chat_template.rs, crates/tokenizer/src/{huggingface,mock,tiktoken}.rs): SpecialTokens gains eos_token_ids: Vec<TokenIdType> field; Tokenizer trait default eos_token_ids() method added; chat template build_environment registers raise_exception Jinja2 callable; SpecialTokens initialization refactored to use ..Default::default() for consistency.
  • App Context & Router Config (model_gateway/src/app_context.rs, model_gateway/src/config/{types,builder}.rs): AppContext gains last_token_time: Arc<AtomicU64>; RouterConfig gains enable_message_hash: bool field; RouterConfigBuilder adds enable_message_hash() fluent setter.
  • Request Building & Constraint Detection (model_gateway/src/routers/grpc/utils/parsers.rs, model_gateway/src/routers/grpc/common/stages/helpers.rs): New has_constrained_output(tool_choice, response_format) helper (detects specific-function, JSON/Schema, or Regex constraints; a hypothetical sketch follows this list); new message-hashing helpers compute_and_log_message_hashes and compute_and_log_input_message_hashes (SHA-256 truncated to 12 hex chars, logged for audit).
  • gRPC Client & Backend Builders (crates/grpc_client/src/{trtllm_service,sglang_scheduler,vllm_engine}.rs): All three backends extend build_*_request methods to accept and populate message_hashes: Option<Vec<(String, String)>>; TRT-LLM builder gains eos_token_ids parameter and derives stop_token_ids from it; Regex response format support added to build_constraint_for_chat and build_guided_decoding; proto MessageHash entries constructed when hashes provided.
  • Pipeline & Stage Construction (model_gateway/src/routers/grpc/{pipeline,pd_router,router}.rs): All RequestPipeline::new_* methods gain enable_message_hash: bool and last_token_time: Arc<AtomicU64> parameters; constructors thread these into request-building and streaming stages; PD router extracts enable_message_hash from ctx.router_config.
  • Request Building Stages (model_gateway/src/routers/grpc/regular/stages/{chat,messages,completion}/request_building.rs, model_gateway/src/routers/grpc/harmony/stages/request_building.rs): Chat/Messages stages gain enable_message_hash field and conditionally compute message_hashes; request IDs now prefer user-supplied *_request.request_id with info logging, else generate UUID-based defaults; Harmony stage similarly updated; multimodal handling now calls collapse_media_placeholders for TRT-LLM; eos_token_ids sourced from tokenizer.
  • Streaming & SSE Tracking (model_gateway/src/routers/grpc/common/responses/streaming.rs, model_gateway/src/routers/grpc/regular/streaming.rs): New build_tracked_sse_response records Unix epoch seconds into shared Arc<AtomicU64> on successful stream chunks; StreamingProcessor gains last_token_time field; chat/messages/completion streaming now use tracked SSE responses.
  • Streaming Logic Refinements (model_gateway/src/routers/grpc/regular/streaming.rs, model_gateway/src/routers/grpc/regular/processor.rs): Reasoning parsing gated on !has_constrained_output; tool-call parsing prefers configured parser when present and not constrained; ChatML token stripping gated by parser configuration; new strip_json_fence helper for JSON response sanitization; tool-argument parsing now coerces values to schema via coerce_tool_args_to_schema helper; Messages path updated similarly.
  • Tool Parsing & Response Handling (model_gateway/src/routers/grpc/utils/chat_utils.rs): process_tool_call_arguments refactored: validates JSON, canonicalizes tool-call IDs to functions.{name}:{index}, updates tool-role message tool_call_id via mapping; new tools_to_typescript generator (plus helpers) converts OpenAI Tool schemas to TypeScript function declarations; parse_json_schema_response sanitizes output by stripping chat/special tokens.
  • Special Token Tokenization (model_gateway/src/routers/grpc/regular/stages/{chat,messages}/preparation.rs): Tokenization now uses encode(..., true) to recognize special tokens (chat template embeddings, images, control tokens) as single token IDs instead of prior encode(..., false).
  • Multimodal & Tool Utilities (model_gateway/src/routers/grpc/multimodal.rs): New collapse_media_placeholders helper de-duplicates consecutive media placeholder token IDs for TRT-LLM (which re-expands them server-side) vs. SGLang (which uses repeated pad tokens).
  • Protocol Regex Support (crates/grpc_client/src/{sglang_scheduler,vllm_engine}.rs, model_gateway/src/routers/grpc/harmony/stages/preparation.rs): Both SGLang and vLLM engines add ResponseFormat::Regex { pattern } case in constraint building; Harmony preparation treats Regex as structured output (returns Ok(None) without Harmony tag generation).
  • Tool Parser Enhancements (crates/tool_parser/src/parsers/kimik2.rs): KimiK2 regex patterns extended to recognize the alternative <|func_start|>/<|func_end|> delimiters alongside the default tool-call markers.
  • Reasoning Parser Improvements (crates/reasoning_parser/src/parsers/base.rs): When the closing </think> is missing, detect <|tool_calls_section_begin|> and exit reasoning mode so tool-call markers are not consumed as reasoning content.
  • OpenAI Router Tracking (model_gateway/src/routers/openai/{context,router,chat}.rs): SharedComponents gains last_token_time: Arc<AtomicU64>; route_chat streaming now updates last_token_time with Unix epoch seconds on each successful chunk; retries also capture the timestamp handle.
  • Health Probing Refactor (model_gateway/src/server.rs, model_gateway/src/app_context.rs): AppState gains api_port: u16; /health_generate now inlined: fails 503 if no healthy workers, skips probe (returns 200) if last_token_time recent (<10s), else POSTs real /v1/chat/completions request to localhost with 60s timeout, returns response status or 503 with error.
  • CLI & Python Bindings (model_gateway/src/main.rs, bindings/python/src/lib.rs, bindings/python/src/smg/router_args.py): CLI flag --enable-message-hash added (default false, "Logging" group); Python Router.__new__ gains enable_message_hash = false parameter; RouterArgs dataclass gains enable_message_hash: bool field with CLI mapping via --{prefix}enable-message-hash flag.
  • Documentation & Comments (crates/multimodal/src/registry/kimi_k25.rs, crates/tokenizer/src/tiktoken.rs, various inline comments): Inline comments clarified for pad-token collapse behavior, test generalization for "N pad tokens", tiktoken decode fallback level changed from warn! to debug!.
  • Tests & Fixtures (model_gateway/tests/*, crates/ test helpers): Test fixtures updated to include request_id: None/api_port: 0 fields; health-probe tests added for recent-token fast-path, stale-token fallthrough, and no-workers cases; SSE tracking timestamp tests verify update-on-Ok and no-update-on-Err; has_constrained_output unit tests cover tool-choice/response-format combinations; stripped JSON fence unit tests cover LF/CRLF, single/multi-chunk, uppercase variants.
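
For orientation, a hypothetical sketch of the new helper using local stand-in types; the real signature in routers/grpc/utils/parsers.rs differs:

enum ToolChoice { Auto, None, Function { name: String } }
enum ResponseFormat { Text, JsonObject, JsonSchema, Regex }

// Constrained whenever a specific function is forced or the response format
// is JSON/Schema/Regex, per the table entry above.
fn has_constrained_output(
    tool_choice: Option<&ToolChoice>,
    response_format: Option<&ResponseFormat>,
) -> bool {
    matches!(tool_choice, Some(ToolChoice::Function { .. }))
        || matches!(
            response_format,
            Some(ResponseFormat::JsonObject | ResponseFormat::JsonSchema | ResponseFormat::Regex)
        )
}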

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant GrpcRouter
    participant RequestPipeline
    participant ChatRequestBuildingStage
    participant TrtllmServiceClient
    participant TensorRtLlm as TensorRT-LLM Backend

    Client->>GrpcRouter: POST /v1/chat/completions + optional request_id
    GrpcRouter->>RequestPipeline: Build pipeline with enable_message_hash=true, last_token_time
    RequestPipeline->>ChatRequestBuildingStage: Create with enable_message_hash=true
    
    Client->>GrpcRouter: Chat request
    GrpcRouter->>RequestPipeline: execute()
    RequestPipeline->>ChatRequestBuildingStage: execute(chat_request)
    
    alt User-supplied request_id present
        ChatRequestBuildingStage->>ChatRequestBuildingStage: Use request_id from chat_request (log info)
    else No user request_id
        ChatRequestBuildingStage->>ChatRequestBuildingStage: Generate chatcmpl-<uuid>
    end
    
    alt enable_message_hash=true
        ChatRequestBuildingStage->>ChatRequestBuildingStage: compute_and_log_message_hashes()
        ChatRequestBuildingStage->>ChatRequestBuildingStage: SHA-256(role + "\x00" + content), truncate to 12 hex chars
    else enable_message_hash=false
        ChatRequestBuildingStage->>ChatRequestBuildingStage: message_hashes = None
    end
    
    ChatRequestBuildingStage->>TrtllmServiceClient: build_chat_request(..., eos_token_ids, message_hashes)
    TrtllmServiceClient->>TrtllmServiceClient: Convert message_hashes to proto::MessageHash entries
    TrtllmServiceClient->>TensorRtLlm: GenerateRequest with message_hashes field
    TensorRtLlm->>TensorRtLlm: Process request with per-message hashes for audit logging
    TensorRtLlm->>TrtllmServiceClient: Stream response chunks
    
    TrtllmServiceClient->>RequestPipeline: Response stream
    RequestPipeline->>StreamingProcessor: Stream with last_token_time handle
    StreamingProcessor->>StreamingProcessor: build_tracked_sse_response()
    StreamingProcessor->>StreamingProcessor: For each chunk: store Unix epoch seconds in Arc<AtomicU64>
    StreamingProcessor->>Client: SSE stream (tracked timestamps enable health-probe fast-path)
sequenceDiagram
    participant HealthClient
    participant HealthEndpoint as /health_generate Endpoint
    participant WorkerSet
    participant TokenTracker as last_token_time Arc<AtomicU64>
    participant LocalBackend as http://127.0.0.1:api_port

    HealthClient->>HealthEndpoint: GET /health_generate
    HealthEndpoint->>WorkerSet: Get healthy workers
    
    alt No healthy workers
        HealthEndpoint->>HealthClient: 503 Service Unavailable
    else Healthy workers exist
        HealthEndpoint->>TokenTracker: Load last stored epoch seconds
        HealthEndpoint->>TokenTracker: Compare with now()
        
        alt Recent token (< 10 seconds ago)
            HealthEndpoint->>HealthEndpoint: Skip probe (active traffic detected)
            HealthEndpoint->>HealthClient: 200 OK + "probe skipped"
        else Stale or no recent token
            HealthEndpoint->>LocalBackend: POST /v1/chat/completions (ping request, 60s timeout)
            alt Backend responds with 2xx
                HealthEndpoint->>HealthClient: 200 OK + "workers healthy"
            else Backend error or timeout
                HealthEndpoint->>HealthClient: 503 Service Unavailable + error detail
            end
        end
    end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Rationale: This PR spans 50+ files with heterogeneous logic changes touching foundational systems (tokenizer traits, request building, streaming, constraint detection, tool parsing, health probing). Changes include dense logic (message hashing, tool-argument coercion, JSON fence stripping, timestamp tracking), multiple new public functions and struct fields, modified control flow in streaming and request building (reasoning/tool-call gating on constrained output), refactored tool-argument processing with ID canonicalization, custom serde implementations, and multiple concurrent features (request ID passthrough, message hashing, EOS token exposure, Regex format support, KimiK2 parser variants, thinking config, health-probe refactor). Each feature domain demands separate reasoning, and interactions between constrained output detection, tool parsing, and streaming logic are complex.

Possibly related PRs

  • The PR modifies model_gateway/src/routers/openai/chat.rs route_chat to track last_token_time for health-probe optimization, directly related to changes in OpenAI streaming context and routing.
  • The PR extends TrtllmServiceClient/GrpcClient build_generate_request_from_* methods to accept and populate message_hashes, directly related to previous message-building infrastructure PRs.
  • The PR exposes eos_token_ids via SpecialTokens and Tokenizer trait and uses them in request/stop-token derivation, directly related to tokenizer EOS token handling PRs.

Suggested labels

model-gateway, grpc, tokenizer, reasoning-parser, protocols, python-bindings, tests

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237

Poem

🐰 A rabbit's jubilation:

Hash tracking for messages in flight,
Request IDs glowing bright—
Tokens tracked, health probes sprint fast,
Constraints detected, schemas amassed,
The codebase hops to a brighter light! 🌟


@github-actions Bot added labels on May 2, 2026: tokenizer, python-bindings, grpc, benchmarks, tests, tool-parser, reasoning-parser, multimodal, protocols, model-gateway, openai

mergify Bot commented May 2, 2026

Hi @gongwei-130, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window

@gemini-code-assist Bot left a comment

Code Review

This pull request introduces several enhancements and fixes to the model gateway, including message hash logging for session reconstruction, support for user-supplied request IDs, and a fast path for health checks based on recent traffic. It also adds support for regex response formats, improves tool call parsing and argument coercion, and implements special token stripping and JSON fence removal in streaming responses. Feedback focuses on potential data corruption when stripping backticks from JSON responses, the need for global indexing of tool call IDs to ensure uniqueness across turns, and inconsistencies in reasoning parser logic between streaming and non-streaming paths. Additionally, improvements were suggested for JSON extraction to support arrays and for more robust token stripping in streaming contexts to handle cross-chunk splits.

Comment on lines +2789 to +2792
delta = delta
    .replace("```json", "")
    .replace("```JSON", "")
    .replace("```", "");

high

Using a global replace to strip triple backticks ( ``` ) from chunks when is_json_response is true is dangerous and can lead to data corruption. If the model-generated JSON content contains strings that include backticks (e.g., a description of code or a markdown snippet), those backticks will be incorrectly removed from the data. This logic should be more surgical, ideally only targeting fences at the very beginning or end of the stream.

References
  1. When stripping specific marker strings from data that might contain structured formats like JSON, avoid methods that can remove characters that are part of valid data, leading to data corruption.

let Some(args_str) = args.as_str() else {
    continue;
};
for (index, call) in tool_calls.iter_mut().enumerate() {

high

The index used for canonical tool call IDs is local to each message. In a multi-turn conversation with multiple assistant messages containing tool calls, this will result in duplicate IDs (e.g., functions.search:0 appearing in multiple turns). This violates the requirement for unique tool call IDs within a session and will likely cause parsing failures or model confusion when the history is re-processed. You should use a global counter that increments across all tool calls in the entire messages array.

        let mut global_index = 0;
        for msg in messages.iter_mut() {
            if let Some(tool_calls) = msg.get_mut("tool_calls").and_then(|v| v.as_array_mut()) {
                for call in tool_calls.iter_mut() {
                    // ... existing extraction of func_name and old_id ...
                    if let (Some(name), Some(old)) = (func_name, old_id) {
                        let canonical = format!("functions.{name}:{global_index}");
                        global_index += 1;
                        // ... existing id_rewrites bookkeeping using `old` and `canonical` ...
                    }
                }
            }
        }
References
  1. For protocol data structures that mirror an external API (e.g., OpenAI), prioritize alignment with the external specification over internal consistency.


if original_request.separate_reasoning
    && reasoning_parser_available
    && !has_structured_output

medium

The logic to skip reasoning parsing when has_structured_output is true is inconsistent with the streaming path (see streaming.rs line 209). When separate_reasoning is enabled, the chat template typically injects a marker, causing the model to emit reasoning content even if a structured output format is requested. Skipping the reasoning parser here will result in the reasoning content being treated as normal text or potentially interfering with JSON parsing.

Suggested change
-&& !has_structured_output
+&& reasoning_parser_available
References
  1. To maintain consistency, changes to a feature on one execution path (e.g., streaming) should align with its behavior on related paths (e.g., non-streaming).

}
}

if processed_text.starts_with('{') {

medium

The JSON extraction logic is too restrictive. It only handles JSON objects starting with { and ignores top-level JSON arrays starting with [. Additionally, it should probably trim() the processed_text before checking the start character to handle cases where the model outputs leading whitespace without markdown fences.

Suggested change
-if processed_text.starts_with('{') {
+let trimmed = processed_text.trim();
+if trimmed.starts_with('{') || trimmed.starts_with('[') {

Comment on lines +480 to +496
if self.configured_tool_parser.is_some()
    || self.configured_reasoning_parser.is_some()
{
    for token in [
        "<|im_end|>",
        "<|im_start|>",
        "<|im_user|>",
        "<|im_assistant|>",
        "<|im_system|>",
        "<|im_middle|>",
        "</think>",
        "[EOS]",
        "[BOS]",
    ] {
        delta = delta.replace(token, "");
    }
}

medium

Stripping leaked tokens using replace on individual chunks is fragile in a streaming context. If a token like <|im_end|> is split across two chunks (e.g., <|im_ and end|>), it will not be detected or removed. This can lead to special tokens leaking into the client-facing stream. Consider using a small buffer to handle potential cross-chunk token matches, similar to how you handle markdown fences.
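
One way to implement the suggested buffer (a sketch, not the project's code): hold back any chunk suffix that could begin a special token, and emit the rest. Tokens are assumed ASCII here.

fn strip_with_carry(chunk: &str, carry: &mut String, tokens: &[&str]) -> String {
    carry.push_str(chunk);
    let mut text = std::mem::take(carry);
    for tok in tokens {
        text = text.replace(tok, "");
    }
    // Hold back the longest text suffix that is a proper prefix of any token,
    // e.g. "<|im_" when tokens contains "<|im_end|>"; it is re-checked on the
    // next call once the rest of the token arrives.
    let hold = tokens
        .iter()
        .flat_map(|tok| (1..tok.len()).map(move |n| &tok[..n]))
        .filter(|prefix| text.ends_with(*prefix))
        .map(str::len)
        .max()
        .unwrap_or(0);
    let emit = text[..text.len() - hold].to_string();
    *carry = text[text.len() - hold..].to_string();
    emit
}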

Comment on lines +2793 to +2796
let trimmed = delta.trim();
if trimmed == "json" || trimmed == "JSON" {
    delta = String::new();
}

🔴 Important: This unconditional check drops any chunk whose trimmed content is exactly "json" or "JSON", even when it's not preceded by fence backticks. In constrained JSON streaming, the tokenizer can emit a chunk that is just json as part of a key or string value (e.g. {"json_key": 1} chunked as {", json, _key": 1}), which would corrupt the output.

This guard should only fire when fence_backticks_stripped is true — i.e., when a previous chunk already stripped the ``` and this json is the language tag following it. The cross-chunk handler below already covers "json\n", but this check catches the case where the newline hasn't arrived yet.

Suggested change
 let trimmed = delta.trim();
-if trimmed == "json" || trimmed == "JSON" {
+if *fence_backticks_stripped && (trimmed == "json" || trimmed == "JSON") {
     delta = String::new();
 }

Comment on lines +228 to +235
for token in [
    "<|im_end|>",
    "<|im_start|>",
    "<|im_user|>",
    "<|im_assistant|>",
    "<|im_system|>",
    "<|im_middle|>",
] {

🟡 Nit: The streaming path (streaming.rs:483-493) and parse_json_schema_response (chat_utils.rs) strip additional tokens "</think>", "[EOS]", "[BOS]" that are missing here. The non-streaming path could leak these tokens to the client when a model emits them in a non-streaming response. Consider adding the same set of tokens for consistency, or extracting the list to a shared constant to keep all paths in sync.

"stream": false,
"temperature": 0
});


🟡 Nit: The self-referential probe (http://127.0.0.1:{port}/v1/chat/completions) will itself be routed through the same server's middleware, concurrency limiter, etc. Under heavy load, the probe competes with real traffic for the concurrency queue slot. Consider adding a header (e.g. X-Health-Probe: true) so the concurrency limiter can exempt it, or using the gRPC client directly to bypass the HTTP layer.

Comment on lines +130 to +137
if let (Some(name), Some(old)) = (func_name, old_id) {
    let canonical = format!("functions.{name}:{index}");
    if old != canonical {
        id_rewrites.insert(old.to_string(), canonical.clone());
        if let Some(obj) = call.as_object_mut() {
            obj.insert("id".to_string(), Value::String(canonical));
        }
    }

🟡 Nit: When a conversation has multiple assistant messages each containing tool calls, index resets to 0 for each assistant message's tool_calls array. If two different assistant messages both call the same function at index 0, they'll generate the same canonical ID (functions.search:0), and the id_rewrites map will overwrite the first mapping. The second-pass rewrite of tool messages could then map an old ID to the wrong canonical ID.

For example:

  • Assistant msg 1: call_abc → functions.search:0
  • Assistant msg 2: call_xyz → functions.search:0 (overwrites call_abc → functions.search:0)

Both tool response messages will get rewritten to functions.search:0, which is technically correct for Kimi K2's format (it's per-turn), but the id_rewrites map collision means call_abc's mapping is lost. If a tool response references call_abc, it won't be found in the map and won't be rewritten, creating a mismatch.

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2d59ff4cf9


Comment on lines +241 to +244
.post(&probe_url)
.json(&probe_body)
.timeout(Duration::from_secs(60))
.send()

P1: Include auth when issuing health probe request

The new health_generate probe sends a POST to /v1/chat/completions without any Authorization header, even though that route is behind auth_middleware in this server. In deployments with api_key enabled, the self-probe will get 401/403 and health_generate will return SERVICE_UNAVAILABLE despite healthy workers, creating persistent false-negative health checks.


}

let model_id = healthy[0].model_id().to_string();
let probe_url = format!("http://127.0.0.1:{}/v1/chat/completions", state.api_port);

P1: Use correct scheme for loopback health probe URL

The probe URL is hardcoded to http://127.0.0.1:{port}, but this server can run with TLS (bind_rustls). In TLS-enabled deployments, the local port expects HTTPS, so this plain-HTTP probe fails at the transport layer and marks health_generate unhealthy regardless of backend health.


@lightseek-bot lightseek-bot changed the title Fix Kimi K2.5 function calling and structured outputs fix: Fix Kimi K2.5 function calling and structured outputs May 2, 2026
@coderabbitai Bot left a comment

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
model_gateway/tests/common/test_app.rs (1)

103-110: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't hard-code the shared test app's probe port to 0.

AppState.api_port is now a live input to the inference-probe URL built in model_gateway/src/server.rs, so Line 109 and Line 152 make any test that exercises that path probe port 0 instead of the bound test server. Please thread the real bound port into these helpers or explicitly disable that probe in test wiring.

Also applies to: 146-153

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/tests/common/test_app.rs` around lines 103 - 110, The tests
currently construct AppState with api_port: 0 which causes the inference-probe
URL in model_gateway::server.rs to point at port 0; change the test wiring to
pass the actual bound test-server port into AppState.api_port (e.g., obtain the
listener's local_addr().port() or test_server.uri().port()) when constructing
AppState (and any helper that builds it such as the shared test app builder used
at the other affected locations), or alternatively provide a flag to disable the
probe in the test-only helpers and set that flag in the test; update all places
that call AppState creation in these tests (references: AppState, api_port, the
shared test app helper used at those lines) so the probe uses the real bound
port or is explicitly disabled.
model_gateway/src/routers/grpc/utils/chat_utils.rs (1)

98-157: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep canonical Kimi tool-call IDs monotonic across the full history.

enumerate() resets to 0 for every assistant message, so a multi-turn transcript rewrites both turns to IDs like functions.foo:0. That diverges from generate_tool_call_id() below, which uses a history-wide index, and can make prior tool_call_id references ambiguous for Kimi prompts.

Suggested fix
-    for msg in messages.iter_mut() {
+    let mut next_tool_index = 0usize;
+    for msg in messages.iter_mut() {
         let role = msg.get("role").and_then(|v| v.as_str());
         if role != Some("assistant") {
             continue;
         }

         let Some(tool_calls) = msg.get_mut("tool_calls").and_then(|tc| tc.as_array_mut()) else {
             continue;
         };

-        for (index, call) in tool_calls.iter_mut().enumerate() {
+        for call in tool_calls.iter_mut() {
             // Validate arguments JSON
             if let Some(args_str) = call
                 .get("function")
                 .and_then(|f| f.get("arguments"))
                 .and_then(|a| a.as_str())
@@
             let old_id = call.get("id").and_then(|v| v.as_str());

             if let (Some(name), Some(old)) = (func_name, old_id) {
-                let canonical = format!("functions.{name}:{index}");
+                let canonical = format!("functions.{name}:{next_tool_index}");
                 if old != canonical {
                     id_rewrites.insert(old.to_string(), canonical.clone());
                     if let Some(obj) = call.as_object_mut() {
                         obj.insert("id".to_string(), Value::String(canonical));
                     }
                 }
             }
+
+            next_tool_index += 1;
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs` around lines 98 - 157,
The bug is that enumerate() restarts per assistant message so IDs like
"functions.name:0" repeat; replace the per-message enumerate with a single
history-wide counter to keep canonical Kimi tool-call IDs monotonic and match
generate_tool_call_id(). Concretely, add a mutable counter (e.g.,
tool_call_counter: usize = 0) outside the first for loop over messages,
increment it for each tool call instead of using enumerate(), use that counter
when building canonical = format!("functions.{name}:{tool_call_counter}"), and
increment tool_call_counter after assigning; ensure id_rewrites, the ID
replacement in call ("id") and the second-pass replacement of "tool_call_id"
remain unchanged.
model_gateway/src/routers/grpc/regular/streaming.rs (1)

427-451: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Configured tool parsers are still unreachable for constrained chat streaming.

On Line 428, is_specific_function already implies output_is_constrained, so this branch always routes specific-function streaming through process_specific_function_stream(). In the fallback branch, Line 448 derives effective_json_parser from separate_reasoning, but Line 209 already forced separate_reasoning = false whenever has_constrained_output() is true. So constrained chat tool calls still always pick the JSON/specific-function path and never honor an explicitly configured parser here, which regresses the parser-precedence goal. The Messages path on Lines 1879-1881 already uses the expected precedence.

Suggested direction
+ let prefer_configured_tool_parser =
+     self.configured_tool_parser.is_some() && tool_parser_available;
+
- let tool_chunks = if is_specific_function
-     && (output_is_constrained
-         || !(self.configured_tool_parser.is_some() && tool_parser_available))
+ let tool_chunks = if is_specific_function && !prefer_configured_tool_parser
 {
     Self::process_specific_function_stream(...)
 } else {
-    let effective_json_parser = used_json_schema
-        && !(separate_reasoning
-            && self.configured_tool_parser.is_some()
-            && tool_parser_available);
+    let effective_json_parser = used_json_schema && !prefer_configured_tool_parser;
     self.process_tool_calls_stream(..., effective_json_parser).await
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 427 - 451,
The current branch always sends specific-function/constrained output to
process_specific_function_stream because is_specific_function implies
output_is_constrained, which prevents honoring an explicitly configured tool
parser; change the branching so that process_specific_function_stream is only
used when no configured parser is present or available (i.e.,
is_specific_function && !(self.configured_tool_parser.is_some() &&
tool_parser_available)); likewise adjust the effective_json_parser logic to
consider self.configured_tool_parser and tool_parser_available even when
separate_reasoning is false so configured parsers take precedence; update
references in this block: process_specific_function_stream,
is_specific_function, output_is_constrained (or has_constrained_output()),
self.configured_tool_parser, tool_parser_available, effective_json_parser,
separate_reasoning, and used_json_schema.
model_gateway/src/routers/grpc/regular/responses/conversions.rs (1)

109-116: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

name is in scope but set to None — consider populating it.

In the FunctionToolCall branch, name (the function name) is destructured but then discarded. Setting name: Some(name.clone()) would make the tool message self-describing, which can help backends or middleware that inspect this field directly rather than correlating via tool_call_id.

♻️ Optional: populate name when available
 messages.push(ChatMessage::Tool {
     content: MessageContent::Text(output_text.clone()),
     tool_call_id: id.clone(),
-    name: None,
+    name: Some(name.clone()),
 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/responses/conversions.rs` around lines
109 - 116, In the FunctionToolCall branch, the tool message created as
ChatMessage::Tool currently sets name: None even though the function name was
destructured into name; change this to populate the field (e.g., name:
Some(name.clone())) so the ChatMessage::Tool includes the function name
alongside tool_call_id and output_text; update the code that builds the tool
message in the FunctionToolCall handling to use name.clone() for the name field.
model_gateway/src/routers/grpc/harmony/builder.rs (1)

1073-1099: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

LGTM — .. correctly accommodates the new name field.

Minor note: when tool_call_map lookup misses (line 1082), the fallback produces functions.<tool_call_id> instead of the actual function name. Now that name: Option<String> is available, it could serve as an intermediate fallback before tool_call_id:

♻️ Optional: use name as intermediate fallback
-    let function_name = tool_call_map
-        .get(tool_call_id)
-        .cloned()
-        .unwrap_or_else(|| tool_call_id.clone());
+    let function_name = tool_call_map
+        .get(tool_call_id)
+        .cloned()
+        .or_else(|| name.clone())
+        .unwrap_or_else(|| tool_call_id.clone());

All current callers in this PR set name: None, so behaviour is identical today — purely a defensive improvement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/harmony/builder.rs` around lines 1073 - 1099,
When handling ChatMessage::Tool in the builder (the match arm that constructs
harmony_msg), change the logic that computes function_name so it first tries
tool_call_map.get(tool_call_id).cloned(), then falls back to the ChatMessage's
name field if present, and only finally falls back to tool_call_id; reference
the ChatMessage::Tool pattern variables tool_call_id and name and the existing
tool_call_map lookup to implement this additional intermediate fallback so the
Author.name still becomes format!("functions.{function_name}") using the
resolved function_name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/protocols/src/common.rs`:
- Around line 198-258: The schema is wrong because ImageUrl derives
schemars::JsonSchema but the Deserialize impl accepts either a bare string or an
object; fix by aligning schema with deserialization—either convert ImageUrl into
an #[serde(untagged)] enum (like PromptVariable) with variants e.g.
Simple(String) and Object { url: String, detail: Option<String> } and update
usages to map to the original struct shape, or implement a custom impl
schemars::JsonSchema for ImageUrl that returns a schema using anyOf: [string
schema, object schema] so the generated OpenAPI correctly documents both
accepted wire formats (update/remove the derive and add the custom impl).

In `@crates/reasoning_parser/src/parsers/base.rs`:
- Around line 141-149: The parser only checks for the tool-call marker when
self.in_reasoning && self.config.stream_reasoning, which lets a model jump
directly from <think> to "<|tool_calls_section_begin|>" leave the parser stuck
in reasoning mode; change the logic so the
current_text.find("<|tool_calls_section_begin|>") branch runs whenever
self.in_reasoning (regardless of self.config.stream_reasoning), then extract
reasoning_text and normal_text, call self.buffer.clear(), set self.in_reasoning
= false, and return Ok(ParserResult::new(normal_text, reasoning_text))—this
ensures the parser exits reasoning mode and emits the tool-call section even
when stream_reasoning is false.

In `@crates/tokenizer/src/chat_template.rs`:
- Around line 600-607: Add a unit test that verifies the registered
raise_exception template function actually aborts rendering and propagates its
message: create a ChatTemplateState using a template string that calls "{{
raise_exception('unsupported model') }}", unwrap the parsed state, call
state.apply(...) with an empty messages slice and ChatTemplateParams::default(),
assert that apply() returns Err and that the error's string contains
"unsupported model"; this will exercise the env.add_function("raise_exception",
...) path and guard the behavior of raise_exception during rendering.

In `@crates/tool_parser/src/parsers/kimik2.rs`:
- Around line 117-118: The end_pattern used to build tool_call_end_pattern
("end_pattern" / "tool_call_end_pattern") lacks the DOTALL flag so "." won't
match newlines and multi-line JSON tool calls break find() and trigger the
fallback that clears the buffer; update end_pattern to include the (?s) flag
(e.g. prepend "(?s)" to the regex) so the Regex::new(...) for
tool_call_end_pattern can match multi-line content and avoid the erroneous
fallback.

In `@model_gateway/src/routers/grpc/common/responses/streaming.rs`:
- Around line 1061-1068: The current closure in
UnboundedReceiverStream::new(rx).map updates last_token_time for any successful
SSE chunk, which is too broad; remove the last_token_time.store(...) call from
the generic SSE wrapper (the map closure) and instead update last_token_time
only where actual model token deltas are produced (e.g., the code paths that
emit token-delta events or where token pieces are wrapped into responses),
leaving other events like response.created, in_progress, wrapper events, and
tool events unchanged; locate symbols such as last_token_time, the
UnboundedReceiverStream::new(rx).map closure, and the code that emits
token-delta events and move the timestamp update into those token-emission sites
so health fast-path reflects real token activity.

In `@model_gateway/src/routers/grpc/harmony/stages/request_building.rs`:
- Around line 114-125: The current logic computes message_hashes from the
original RequestType::Chat(req).messages, but when modified_request is present
the downstream TRT-LLM builder uses body; update the branches that set
message_hashes (both the block around RequestType::Chat at the shown location
and the similar block around lines 228-239) to resolve the effective chat body
first (use modified_request.body if Some, otherwise req.body/body equivalent)
and then call helpers::compute_and_log_message_hashes(&request_id,
&body.messages) so the hashes reflect the actual messages sent downstream; keep
the existing enable_message_hash guard and RequestType::Chat match but replace
req.messages with the resolved body.messages and preserve request_id and the
helper call.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 693-715: The function collapse_media_placeholders currently
collapses raw runs of im_token_id and can merge two adjacent image expansions
into one placeholder; fix it by changing collapse_media_placeholders to accept
image-boundary metadata (e.g., an additional parameter like image_starts:
&[bool] or a slice of ranges) and use that to reset prev_was_placeholder at each
image boundary so each original image block collapses to its own single
placeholder even when expansions are adjacent, and add a unit/regression test
that constructs two back-to-back image blocks (adjacent im_token_id runs) and
asserts they collapse to two placeholders rather than one; update any call sites
that invoke collapse_media_placeholders to pass the boundary info or a default
when not available.
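A sketch of the boundary-aware collapse described above; the image_starts parameter follows the review suggestion (the real boundary metadata may take another shape), and token IDs are simplified to u32.

/// Collapse each image's run of `im_token_id` to one placeholder, resetting
/// at every image boundary so adjacent images stay distinct.
/// `image_starts[i]` is true where token `i` begins a new image block;
/// the parameter comes from the review suggestion, not the current code.
fn collapse_media_placeholders(
    tokens: &[u32],
    im_token_id: u32,
    image_starts: &[bool],
) -> Vec<u32> {
    let mut result = Vec::with_capacity(tokens.len());
    let mut prev_was_placeholder = false;
    for (i, &tok) in tokens.iter().enumerate() {
        if image_starts.get(i).copied().unwrap_or(false) {
            // New image block: always emit one placeholder for it.
            prev_was_placeholder = false;
        }
        if tok == im_token_id {
            if !prev_was_placeholder {
                result.push(tok);
                prev_was_placeholder = true;
            }
        } else {
            result.push(tok);
            prev_was_placeholder = false;
        }
    }
    result
}

With this shape, two back-to-back image runs with image_starts set at each run's first index collapse to two placeholders instead of one, which is exactly the regression test the comment asks for.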

In `@model_gateway/src/routers/grpc/regular/processor.rs`:
- Around line 746-755: The code is calling parse_tool_calls with None for the
new tools parameter so coerce_tool_args_to_schema never sees the Messages tool
schemas; update the two call sites that pass None (the call in the block gated
by self.configured_tool_parser.is_some() && tool_parser_available and the other
at 771-779) to pass the Messages tool definitions from messages_request (or the
same place you source message schemas) instead of None, so
parse_tool_calls/coerce_tool_args_to_schema can coerce emitted tool args to the
Messages schemas; alternatively, if changing parse_tool_calls signature here is
undesirable, add a Messages-specific coercion step (invoking
coerce_tool_args_to_schema or equivalent) immediately after parse_tool_calls and
before building ContentBlock::ToolUse so tool_use.input has correct types.

In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs`:
- Around line 133-137: The tokenizer is re-applying special-token
post-processing when tokenizing the already-rendered chat template string;
change the call to tokenizer.encode(&processed_messages.text, true) in the chat
preparation step to use add_special_tokens=false (i.e., pass false as the second
argument) so the post-processor is not injected again—update the call site where
encoding is assigned (the variable encoding and the tokenizer.encode invocation)
to stop adding BOS/SEP/etc. on top of the rendered template.

In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 2788-2809: The strip_json_fence function currently uses
replace(...) which mutates every occurrence of fence markers inside the chunk;
instead only strip fences at chunk boundaries: check and remove a leading
language fence (e.g., starts_with "```json" / "```JSON" or "```json\n" variants)
and separately check and remove a trailing fence ("```" possibly
preceded/followed by newline) — do not call replace on the whole string. Update
strip_json_fence to use starts_with/ends_with and slice only the prefix/suffix
when present, preserve any internal occurrences, and keep/adjust the
fence_backticks_stripped boolean to reflect whether the chunk ended with
backticks removed so the next chunk should strip a leading language tag.
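A standalone sketch of boundary-only stripping; the real streaming version threads fence_backticks_stripped across chunks, which this simplified signature only returns.

/// Strip a leading ```json language fence and a trailing ``` fence only at
/// the chunk boundaries, leaving interior backticks untouched. Returns the
/// stripped text plus whether a trailing fence was removed, so the next
/// chunk knows to strip a leading language tag.
fn strip_json_fence(chunk: &str) -> (String, bool) {
    let mut s = chunk;
    for prefix in ["```json\n", "```JSON\n", "```json", "```JSON"] {
        if let Some(rest) = s.strip_prefix(prefix) {
            s = rest;
            break;
        }
    }
    let mut stripped_trailing = false;
    for suffix in ["\n```\n", "\n```", "```\n", "```"] {
        if let Some(rest) = s.strip_suffix(suffix) {
            s = rest;
            stripped_trailing = true;
            break;
        }
    }
    (s.to_string(), stripped_trailing)
}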

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 482-497: The current cleanup in chat_utils.rs mutates any
occurrence of framing tokens inside the JSON by calling clean.replace(...) on
processed_text; instead only remove framing/wrapping tokens that appear as
leading or trailing wrappers around the JSON payload. Change the logic around
processed_text / clean to iterate over the token list and repeatedly apply
strip_prefix and strip_suffix (or anchored regex matches) for tokens like
"<|im_end|>", "<|im_start|>", "<|im_user|>", "<|im_assistant|>",
"<|im_system|>", "<|im_middle|>", "</think>", "[EOS]", "[BOS]" so you only trim
those tokens at the ends (and then trim whitespace), rather than doing a global
replace which corrupts valid JSON fields; keep variable names processed_text and
clean and update the loop that currently calls clean.replace to use
prefix/suffix stripping instead.
- Around line 696-718: The generated TypeScript uses unquoted identifiers so
function names and object property keys like func.name and schema property names
produce invalid TS when they contain characters like hyphens; update
function_to_typescript to emit a quoted/backticked type name when func.name is
not a valid JS identifier (e.g., wrap in backticks) and update the schema-to-TS
generator (the code that builds property strings—look for the part variable and
the function schema_to_typescript / property rendering logic) to quote property
names that are not valid identifiers (wrap in double quotes) and preserve
optional markers and spacing; implement a small helper is_valid_identifier(name:
&str) and use it to decide between raw identifier or quoted/backticked form in
both places.

In `@model_gateway/src/routers/grpc/utils/parsers.rs`:
- Around line 183-192: Update the doc comment for has_constrained_output to
mention regex-based constraints as a form of "constrained decoding": state that
constrained output includes JSON formats (json_object/json_schema) and regex
patterns (ResponseFormat::Regex), and explain that when constrained decoding is
active (e.g., tool_choice required/allowed_tools with mode == "required",
response_format json_object/json_schema/regex, or a specific function), the
reasoning parser must be skipped so constrained content isn't captured as
reasoning_content; reference has_constrained_output and ResponseFormat::Regex in
the comment for clarity.

In `@model_gateway/src/server.rs`:
- Around line 177-186: The health handler is leaking internal worker details by
returning worker.url(), model_id(), and raw backend error bodies to public
callers; instead, change the branches that build and return info.join(...) and
any branch that returns upstream error bodies (also around the 268-295 region)
to log the full details locally (use your existing logging/tracing) and return a
sanitized message like "0/{workers.len()} workers healthy" or a generic
"upstream error" string with the appropriate StatusCode::SERVICE_UNAVAILABLE;
keep uses of workers.iter(), is_healthy(), model_id(), url(), and the StatusCode
branches intact but remove any inclusion of url(), model_id(), or raw body in
the HTTP response.

---

Outside diff comments:
In `@model_gateway/src/routers/grpc/harmony/builder.rs`:
- Around line 1073-1099: When handling ChatMessage::Tool in the builder (the
match arm that constructs harmony_msg), change the logic that computes
function_name so it first tries tool_call_map.get(tool_call_id).cloned(), then
falls back to the ChatMessage's name field if present, and only finally falls
back to tool_call_id; reference the ChatMessage::Tool pattern variables
tool_call_id and name and the existing tool_call_map lookup to implement this
additional intermediate fallback so the Author.name still becomes
format!("functions.{function_name}") using the resolved function_name.

In `@model_gateway/src/routers/grpc/regular/responses/conversions.rs`:
- Around line 109-116: In the FunctionToolCall branch, the tool message created
as ChatMessage::Tool currently sets name: None even though the function name was
destructured into name; change this to populate the field (e.g., name:
Some(name.clone())) so the ChatMessage::Tool includes the function name
alongside tool_call_id and output_text; update the code that builds the tool
message in the FunctionToolCall handling to use name.clone() for the name field.

In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 427-451: The current branch always sends
specific-function/constrained output to process_specific_function_stream because
is_specific_function implies output_is_constrained, which prevents honoring an
explicitly configured tool parser; change the branching so that
process_specific_function_stream is only used when no configured parser is
present or available (i.e., is_specific_function &&
!(self.configured_tool_parser.is_some() && tool_parser_available)); likewise
adjust the effective_json_parser logic to consider self.configured_tool_parser
and tool_parser_available even when separate_reasoning is false so configured
parsers take precedence; update references in this block:
process_specific_function_stream, is_specific_function, output_is_constrained
(or has_constrained_output()), self.configured_tool_parser,
tool_parser_available, effective_json_parser, separate_reasoning, and
used_json_schema.
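The adjusted gate, reduced to a pure function over the flags named above (a sketch, not the actual method):

// Only take the specific-function path when no configured parser can
// handle the stream; otherwise the configured parser takes precedence.
fn use_specific_function_stream(
    is_specific_function: bool,
    configured_tool_parser_set: bool,
    tool_parser_available: bool,
) -> bool {
    is_specific_function && !(configured_tool_parser_set && tool_parser_available)
}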

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 98-157: The bug is that enumerate() restarts per assistant message
so IDs like "functions.name:0" repeat; replace the per-message enumerate with a
single history-wide counter to keep canonical Kimi tool-call IDs monotonic and
match generate_tool_call_id(). Concretely, add a mutable counter (e.g.,
tool_call_counter: usize = 0) outside the first for loop over messages,
increment it for each tool call instead of using enumerate(), use that counter
when building canonical = format!("functions.{name}:{tool_call_counter}"), and
increment tool_call_counter after assigning; ensure id_rewrites, the ID
replacement in call ("id") and the second-pass replacement of "tool_call_id"
remain unchanged.
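A sketch with ToolCall as a stand-in type; the point is the single counter that survives across messages:

use std::collections::HashMap;

struct ToolCall { id: String, name: String } // stand-in for the real type

// Canonical Kimi IDs must be monotonic across the whole history, matching
// generate_tool_call_id(); a per-message enumerate() resets to 0 each turn.
fn build_id_rewrites(history: &[Vec<ToolCall>]) -> HashMap<String, String> {
    let mut id_rewrites = HashMap::new();
    let mut tool_call_counter: usize = 0;
    for calls in history {
        for call in calls {
            let canonical = format!("functions.{}:{}", call.name, tool_call_counter);
            id_rewrites.insert(call.id.clone(), canonical);
            tool_call_counter += 1;
        }
    }
    id_rewrites
}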

In `@model_gateway/tests/common/test_app.rs`:
- Around line 103-110: The tests currently construct AppState with api_port: 0
which causes the inference-probe URL in model_gateway::server.rs to point at
port 0; change the test wiring to pass the actual bound test-server port into
AppState.api_port (e.g., obtain the listener's local_addr().port() or
test_server.uri().port()) when constructing AppState (and any helper that builds
it such as the shared test app builder used at the other affected locations), or
alternatively provide a flag to disable the probe in the test-only helpers and
set that flag in the test; update all places that call AppState creation in
these tests (references: AppState, api_port, the shared test app helper used at
those lines) so the probe uses the real bound port or is explicitly disabled.
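A sketch of the bind-first approach; AppState wiring is omitted and only the port acquisition is shown.

use std::net::TcpListener;

// Bind to an ephemeral port and report the real one, instead of
// hard-coding api_port: 0 in AppState. In practice the listener itself
// should be handed to the test server to avoid a rebind race.
fn bound_test_port() -> u16 {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind test listener");
    listener.local_addr().expect("local addr").port()
}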

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ec3bc316-8dc4-4a73-9a22-cb585f9aca18

📥 Commits

Reviewing files that changed from the base of the PR and between a41aa6c and 2d59ff4.

📒 Files selected for processing (61)
  • bindings/python/src/lib.rs
  • bindings/python/src/smg/router_args.py
  • crates/grpc_client/proto/trtllm_service.proto
  • crates/grpc_client/src/sglang_scheduler.rs
  • crates/grpc_client/src/trtllm_service.rs
  • crates/grpc_client/src/vllm_engine.rs
  • crates/multimodal/src/registry/kimi_k25.rs
  • crates/multimodal/src/registry/mod.rs
  • crates/protocols/src/chat.rs
  • crates/protocols/src/common.rs
  • crates/protocols/src/completion.rs
  • crates/protocols/src/messages.rs
  • crates/reasoning_parser/src/parsers/base.rs
  • crates/tokenizer/src/chat_template.rs
  • crates/tokenizer/src/huggingface.rs
  • crates/tokenizer/src/mock.rs
  • crates/tokenizer/src/tiktoken.rs
  • crates/tokenizer/src/traits.rs
  • crates/tool_parser/src/parsers/kimik2.rs
  • model_gateway/benches/request_processing.rs
  • model_gateway/benches/wasm_middleware_latency.rs
  • model_gateway/src/app_context.rs
  • model_gateway/src/config/builder.rs
  • model_gateway/src/config/types.rs
  • model_gateway/src/main.rs
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/common/responses/mod.rs
  • model_gateway/src/routers/grpc/common/responses/streaming.rs
  • model_gateway/src/routers/grpc/common/stages/helpers.rs
  • model_gateway/src/routers/grpc/harmony/builder.rs
  • model_gateway/src/routers/grpc/harmony/stages/preparation.rs
  • model_gateway/src/routers/grpc/harmony/stages/request_building.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/pd_router.rs
  • model_gateway/src/routers/grpc/pipeline.rs
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/responses/conversions.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/generate/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/request_building.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs
  • model_gateway/src/routers/grpc/router.rs
  • model_gateway/src/routers/grpc/utils/chat_utils.rs
  • model_gateway/src/routers/grpc/utils/message_utils.rs
  • model_gateway/src/routers/grpc/utils/mod.rs
  • model_gateway/src/routers/grpc/utils/parsers.rs
  • model_gateway/src/routers/openai/chat.rs
  • model_gateway/src/routers/openai/context.rs
  • model_gateway/src/routers/openai/router.rs
  • model_gateway/src/server.rs
  • model_gateway/src/service_discovery.rs
  • model_gateway/src/worker/monitor.rs
  • model_gateway/tests/api/api_endpoints_test.rs
  • model_gateway/tests/common/test_app.rs
  • model_gateway/tests/routing/test_openai_routing.rs
  • model_gateway/tests/spec/chat_message.rs
  • model_gateway/tests/wasm_test.rs

Comment on lines +198 to +258
#[derive(Debug, Clone, PartialEq, schemars::JsonSchema)]
pub struct ImageUrl {
    pub url: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub detail: Option<String>, // "auto", "low", or "high"
}

impl Serialize for ImageUrl {
    fn serialize<S: serde::Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        use serde::ser::SerializeStruct;
        let field_count = 1 + self.detail.is_some() as usize;
        let mut state = serializer.serialize_struct("ImageUrl", field_count)?;
        state.serialize_field("url", &self.url)?;
        if let Some(ref detail) = self.detail {
            state.serialize_field("detail", detail)?;
        }
        state.end()
    }
}

impl<'de> Deserialize<'de> for ImageUrl {
    fn deserialize<D: serde::Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        use serde::de;

        struct ImageUrlVisitor;

        impl<'de> de::Visitor<'de> for ImageUrlVisitor {
            type Value = ImageUrl;

            fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
                formatter.write_str("a string URL or an object with url field")
            }

            fn visit_str<E: de::Error>(self, v: &str) -> Result<ImageUrl, E> {
                Ok(ImageUrl {
                    url: v.to_string(),
                    detail: None,
                })
            }

            fn visit_map<M: de::MapAccess<'de>>(self, mut map: M) -> Result<ImageUrl, M::Error> {
                let mut url = None;
                let mut detail = None;
                while let Some(key) = map.next_key::<String>()? {
                    match key.as_str() {
                        "url" => url = Some(map.next_value()?),
                        "detail" => detail = map.next_value()?,
                        _ => {
                            let _ = map.next_value::<de::IgnoredAny>()?;
                        }
                    }
                }
                Ok(ImageUrl {
                    url: url.ok_or_else(|| de::Error::missing_field("url"))?,
                    detail,
                })
            }
        }

        deserializer.deserialize_any(ImageUrlVisitor)
    }
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's verify the file exists and check the actual content at lines 198-258
head -n 258 crates/protocols/src/common.rs | tail -n 61

Repository: lightseekorg/smg

Length of output: 2224


🏁 Script executed:

# Check if there's a custom JsonSchema implementation for ImageUrl
rg "ImageUrl" crates/protocols/src/common.rs -A 5 -B 5

Repository: lightseekorg/smg

Length of output: 2519


🏁 Script executed:

# Look for any schema-related code or schemars usage with ImageUrl
rg "json_schema|JsonSchema" crates/protocols/src/common.rs -B 2 -A 2

Repository: lightseekorg/smg

Length of output: 7624


ImageUrl's generated schema does not match its accepted wire format.

The custom Deserialize impl accepts either a bare string or an object, but ImageUrl derives schemars::JsonSchema without a custom implementation. The derived schema documents only the object shape, causing OpenAPI/SDK validators to reject the bare string format even though deserialization accepts it.

Add a custom JsonSchema impl for ImageUrl that documents both accepted formats (e.g., anyOf: [string, object]), or switch to an #[serde(untagged)] enum pattern like PromptVariable to ensure the schema stays aligned with the actual API.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/protocols/src/common.rs` around lines 198 - 258, The schema is wrong
because ImageUrl derives schemars::JsonSchema but the Deserialize impl accepts
either a bare string or an object; fix by aligning schema with
deserialization—either convert ImageUrl into an #[serde(untagged)] enum (like
PromptVariable) with variants e.g. Simple(String) and Object { url: String,
detail: Option<String> } and update usages to map to the original struct shape,
or implement a custom impl schemars::JsonSchema for ImageUrl that returns a
schema using anyOf: [string schema, object schema] so the generated OpenAPI
correctly documents both accepted wire formats (update/remove the derive and add
the custom impl).
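
A sketch of the untagged-enum route, mirroring the PromptVariable pattern the comment mentions; ImageUrlInput and into_parts are illustrative names, not code in the PR.

use serde::{Deserialize, Serialize};

// Illustrative sketch: with #[serde(untagged)], the derived schemars
// schema (anyOf over both variants) stays aligned with the wire format.
#[derive(Debug, Clone, Serialize, Deserialize, schemars::JsonSchema)]
#[serde(untagged)]
pub enum ImageUrlInput {
    Simple(String),
    Object {
        url: String,
        #[serde(skip_serializing_if = "Option::is_none")]
        detail: Option<String>,
    },
}

impl ImageUrlInput {
    // Convert to the (url, detail) pair the existing struct carries.
    fn into_parts(self) -> (String, Option<String>) {
        match self {
            ImageUrlInput::Simple(url) => (url, None),
            ImageUrlInput::Object { url, detail } => (url, detail),
        }
    }
}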

Comment on lines 141 to +149
        if self.in_reasoning && self.config.stream_reasoning {
            // Some models skip </think> and go straight to tool calls
            if let Some(tool_pos) = current_text.find("<|tool_calls_section_begin|>") {
                let reasoning_text = current_text[..tool_pos].trim().to_string();
                let normal_text = current_text[tool_pos..].to_string();
                self.buffer.clear();
                self.in_reasoning = false;
                return Ok(ParserResult::new(normal_text, reasoning_text));
            }

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle the tool-call marker even when stream_reasoning is off.

Right now this exit path only runs when self.config.stream_reasoning is true. If a model jumps from <think> straight to <|tool_calls_section_begin|> while reasoning streaming is disabled, the parser never leaves reasoning mode here, so the tool-call section is buffered instead of emitted.

Suggested fix
-        // Continue with reasoning content
-        if self.in_reasoning && self.config.stream_reasoning {
-            // Some models skip </think> and go straight to tool calls
-            if let Some(tool_pos) = current_text.find("<|tool_calls_section_begin|>") {
-                let reasoning_text = current_text[..tool_pos].trim().to_string();
-                let normal_text = current_text[tool_pos..].to_string();
-                self.buffer.clear();
-                self.in_reasoning = false;
-                return Ok(ParserResult::new(normal_text, reasoning_text));
-            }
+        // Some models skip </think> and go straight to tool calls.
+        // This transition must work regardless of whether reasoning itself is streamed.
+        if self.in_reasoning {
+            if let Some(tool_pos) = current_text.find("<|tool_calls_section_begin|>") {
+                let reasoning_text = current_text[..tool_pos].trim().to_string();
+                let normal_text = current_text[tool_pos..].to_string();
+                self.buffer.clear();
+                self.in_reasoning = false;
+                return Ok(ParserResult::new(normal_text, reasoning_text));
+            }
+        }
+
+        // Continue with reasoning content
+        if self.in_reasoning && self.config.stream_reasoning {
             // Stream the content immediately
             let reasoning_text = current_text;
             self.buffer.clear();
             Ok(ParserResult::reasoning(reasoning_text))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/reasoning_parser/src/parsers/base.rs` around lines 141 - 149, The
parser only checks for the tool-call marker when self.in_reasoning &&
self.config.stream_reasoning, which lets a model jump directly from <think> to
"<|tool_calls_section_begin|>" leave the parser stuck in reasoning mode; change
the logic so the current_text.find("<|tool_calls_section_begin|>") branch runs
whenever self.in_reasoning (regardless of self.config.stream_reasoning), then
extract reasoning_text and normal_text, call self.buffer.clear(), set
self.in_reasoning = false, and return Ok(ParserResult::new(normal_text,
reasoning_text))—this ensures the parser exits reasoning mode and emits the
tool-call section even when stream_reasoning is false.

Comment on lines +600 to +607
    // HuggingFace's Jinja2 environment registers `raise_exception` as a global
    // callable so templates can do `{{ raise_exception("msg") }}`.
    env.add_function(
        "raise_exception",
        |msg: String| -> Result<String, minijinja::Error> {
            Err(MinijinjaError::new(ErrorKind::InvalidOperation, msg))
        },
    );

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider adding a unit test for raise_exception template invocation.

The implementation correctly mirrors HuggingFace's pattern — fn name(arg: String) -> Result<String, Error> registered via env.add_function is the canonical minijinja 2.x approach, and the error kind choice is appropriate. However, no test exercises this new path; a simple case would lock in the behavior and guard against regressions:

✅ Suggested test
#[test]
fn test_raise_exception_halts_rendering() {
    // Template that unconditionally calls raise_exception
    let result = ChatTemplateState::new(Some(
        "{{ raise_exception('unsupported model') }}".to_string(),
    ));
    // Template itself parses fine; the error surfaces at render time
    let state = result.unwrap();
    let err = state.apply(&[], ChatTemplateParams::default()).unwrap_err();
    assert!(
        err.to_string().contains("unsupported model"),
        "Error should propagate the raise_exception message, got: {err}"
    );
}

Comment on lines +117 to 118
        let end_pattern = r"<\|tool_call_begin\|>.*?(?:<\|tool_call_end\|>|<\|func_end\|>)";
        let tool_call_end_pattern = Regex::new(end_pattern).expect("Valid regex pattern");

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file at the specified lines and surrounding context
cat -n crates/tool_parser/src/parsers/kimik2.rs | sed -n '100,130p'

Repository: lightseekorg/smg

Length of output: 1980


🏁 Script executed:

# Now let's look at the full context where extractors are defined
cat -n crates/tool_parser/src/parsers/kimik2.rs | sed -n '100,150p'

Repository: lightseekorg/smg

Length of output: 2764


🏁 Script executed:

# Check if there are extractor patterns with (?s) mentioned in the review
rg '(?s)' crates/tool_parser/src/parsers/kimik2.rs -n -A 2 -B 2

Repository: lightseekorg/smg

Length of output: 18750


🏁 Script executed:

# Look for the buffer clearing logic mentioned
cat -n crates/tool_parser/src/parsers/kimik2.rs | sed -n '290,360p'

Repository: lightseekorg/smg

Length of output: 4299


Add (?s) to the completed-tool regex to support multi-line JSON.

The extractors at lines 110 and 113 now accept multi-line JSON via (?s), but end_pattern at line 117 omits this flag. When a tool call contains pretty-printed arguments (newlines), the stream extractor matches it as complete, but tool_call_end_pattern.find() at line 347 fails because . doesn't span newlines. The fallback at line 351 then clears the entire buffer, dropping any trailing content from the chunk.

Suggested fix
-        let end_pattern = r"<\|tool_call_begin\|>.*?(?:<\|tool_call_end\|>|<\|func_end\|>)";
+        let end_pattern =
+            r"(?s)<\|tool_call_begin\|>.*?(?:<\|tool_call_end\|>|<\|func_end\|>)";
📝 Committable suggestion


Suggested change
-        let end_pattern = r"<\|tool_call_begin\|>.*?(?:<\|tool_call_end\|>|<\|func_end\|>)";
+        let end_pattern =
+            r"(?s)<\|tool_call_begin\|>.*?(?:<\|tool_call_end\|>|<\|func_end\|>)";
         let tool_call_end_pattern = Regex::new(end_pattern).expect("Valid regex pattern");

Comment on lines +482 to +497
// Strip chatml / special tokens that may trail the constrained JSON output
let mut clean = processed_text.to_string();
for token in [
"<|im_end|>",
"<|im_start|>",
"<|im_user|>",
"<|im_assistant|>",
"<|im_system|>",
"<|im_middle|>",
"</think>",
"[EOS]",
"[BOS]",
] {
clean = clean.replace(token, "");
}
let clean = clean.trim();

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only trim framing tokens around the JSON payload.

This replaces markers everywhere in the response, so valid constrained output like {"text":"[EOS]"} or {"marker":"<|im_end|>"} gets silently mutated before parsing. The cleanup needs to strip wrappers/trailing tokens, not rewrite string content inside the JSON itself.
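
A sketch of end-anchored trimming over the same token list; the loop repeats until a pass removes nothing, so stacked wrappers still come off while interior occurrences survive.

const FRAMING_TOKENS: &[&str] = &[
    "<|im_end|>", "<|im_start|>", "<|im_user|>", "<|im_assistant|>",
    "<|im_system|>", "<|im_middle|>", "</think>", "[EOS]", "[BOS]",
];

// Trim framing tokens (and whitespace) only at the ends of the payload,
// never inside it, so JSON values like {"text":"[EOS]"} survive intact.
fn trim_framing_tokens(processed_text: &str) -> &str {
    let mut clean = processed_text.trim();
    loop {
        let before = clean;
        for token in FRAMING_TOKENS {
            if let Some(rest) = clean.strip_prefix(token) {
                clean = rest.trim_start();
            }
            if let Some(rest) = clean.strip_suffix(token) {
                clean = rest.trim_end();
            }
        }
        if clean == before {
            return clean;
        }
    }
}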


Comment on lines +696 to +718
fn function_to_typescript(func: &openai_protocol::common::Function) -> String {
    let mut out = String::new();

    // Description comment
    if let Some(ref desc) = func.description {
        for line in desc.lines() {
            if line.is_empty() {
                out.push('\n');
            } else {
                out.push_str(&format!("// {line}\n"));
            }
        }
    }

    // Parameters
    let params_str =
        if func.parameters.is_null() || func.parameters == Value::Object(Default::default()) {
            "{}".to_string()
        } else {
            schema_to_typescript(&func.parameters, "", &[])
        };

    out.push_str(&format!("type {} = (_: {params_str}) => any;", func.name));

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n model_gateway/src/routers/grpc/utils/chat_utils.rs | sed -n '690,725p'

Repository: lightseekorg/smg

Length of output: 1491


🏁 Script executed:

cat -n model_gateway/src/routers/grpc/utils/chat_utils.rs | sed -n '745,820p'

Repository: lightseekorg/smg

Length of output: 2978


🏁 Script executed:

cd model_gateway && grep -n "schema_to_typescript" src/routers/grpc/utils/chat_utils.rs | head -20

Repository: lightseekorg/smg

Length of output: 446


🏁 Script executed:

grep -n "function_to_typescript\|tools_ts_str" model_gateway/src/routers/grpc/utils/chat_utils.rs | head -20

Repository: lightseekorg/smg

Length of output: 444


🏁 Script executed:

grep -rn "execute-tool\|user-id" model_gateway/src/routers/grpc/utils/ --include="*.rs" | head -10

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

grep -n "quote\|escape\|identifier" model_gateway/src/routers/grpc/utils/chat_utils.rs

Repository: lightseekorg/smg

Length of output: 185


🏁 Script executed:

cat -n model_gateway/src/routers/grpc/utils/chat_utils.rs | sed -n '330,355p'

Repository: lightseekorg/smg

Length of output: 1379


🏁 Script executed:

cat -n model_gateway/src/routers/grpc/utils/chat_utils.rs | sed -n '675,695p'

Repository: lightseekorg/smg

Length of output: 720


🏁 Script executed:

# Check if there's any TypeScript quoting/escaping logic
grep -A 5 -B 5 "type.*=" model_gateway/src/routers/grpc/utils/chat_utils.rs | head -40

Repository: lightseekorg/smg

Length of output: 1563


🏁 Script executed:

grep -rn "execute-tool\|user-id\|-" model_gateway/src/routers/grpc/utils/chat_utils.rs | grep -E "(name|property|Function)" | head -20

Repository: lightseekorg/smg

Length of output: 252


🏁 Script executed:

# Look for any tests that might show what happens with invalid identifiers
find model_gateway -name "*.rs" -type f -exec grep -l "tools_to_typescript\|function_to_typescript" {} \;

Repository: lightseekorg/smg

Length of output: 111


🏁 Script executed:

# Check if TypeScript quoting is done anywhere in the codebase
rg "\".*\".*:" model_gateway/src/routers/grpc/utils/chat_utils.rs | head -10

Repository: lightseekorg/smg

Length of output: 729


🏁 Script executed:

# Search for actual tool definitions with hyphens to confirm this is a real issue
find model_gateway -name "*.rs" -o -name "*.json" | xargs grep -l "execute-tool\|user-id" 2>/dev/null | head -10

Repository: lightseekorg/smg

Length of output: 108


🏁 Script executed:

# Check OpenAI protocol to see if hyphens are allowed in function/property names
find model_gateway -path "*/openai_protocol*" -name "*.rs" | head -5

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# Let's verify what valid TypeScript looks like with property names
cat > /tmp/test_ts.ts << 'EOF'
// Invalid - hyphens in type name are not valid identifiers
// type execute-tool = (_: {}) => any;

// Valid - quoted property names with hyphens
type executeToolParams = {
  "execute-tool": string;
  "user-id": string;
};

// This is what the current code would generate for a function named "execute-tool":
// type execute-tool = (_: {}) => any;  // INVALID!

// This is what it SHOULD generate:
// type ExecuteTool = (_: {}) => any;  // or use backticks
// type `execute-tool` = (_: {}) => any;  // Valid in TypeScript!
EOF
cat /tmp/test_ts.ts

Repository: lightseekorg/smg

Length of output: 583


🏁 Script executed:

# Check if the code handles any special characters or quoting at all
grep -n "push_str\|format!" model_gateway/src/routers/grpc/utils/chat_utils.rs | grep -E "(name|property)" | head -20

Repository: lightseekorg/smg

Length of output: 378


Quote or backtick tool and field names that are not valid identifiers so the generated TypeScript stays well-formed.

func.name and object property names are inserted directly without quoting. Tool names like execute-tool or schema properties like user-id generate invalid TypeScript (e.g., type execute-tool = ...; and unquoted user-id: string). These are valid in OpenAI schemas but produce malformed declarations that break Kimi-style templates.

Example fixes

Line 718: Use backticks for type names containing non-identifier characters:

out.push_str(&format!("type `{}` = (_: {params_str}) => any;", func.name));

Line 811: Quote property names containing non-identifier characters:

part.push_str(&format!("{child_indent}\"{name}\"{opt_marker}: {type_str}"));
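
A sketch of the is_valid_identifier helper the comment calls for, plus the two quoting decisions; note that backticked type names are not strictly legal TypeScript, but they are an unambiguous convention for prompt-template rendering.

// Rough ASCII-only check for a valid JS identifier; deliberately stricter
// than the full ECMAScript grammar, so it errs on the side of quoting.
fn is_valid_identifier(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || c == '$' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '$')
}

// Type position: backtick-wrap non-identifiers (a template convention).
fn ts_type_name(name: &str) -> String {
    if is_valid_identifier(name) { name.to_string() } else { format!("`{name}`") }
}

// Property position: double-quote non-identifiers, as in plain TS.
fn ts_property_name(name: &str) -> String {
    if is_valid_identifier(name) { name.to_string() } else { format!("\"{name}\"") }
}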

Comment on lines +183 to +192
/// Returns `true` when constrained decoding is active, meaning the model output
/// is structured JSON rather than free-form text. In that case the reasoning
/// parser must be skipped — otherwise it captures the constrained JSON as
/// `reasoning_content` and leaves `content` empty.
///
/// Constrained decoding is triggered by:
/// - `tool_choice` = a specific function, `required`, or `allowed_tools` with
/// `mode == "required"`
/// - `response_format` = `json_object` or `json_schema`
pub(crate) fn has_constrained_output(

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Update doc comments to reflect regex-constrained output.

The helper now treats ResponseFormat::Regex as constrained (Line 208), but the docs still describe only JSON constraints. Keeping this comment accurate avoids future regressions.

Proposed doc-only diff
-/// Returns `true` when constrained decoding is active, meaning the model output
-/// is structured JSON rather than free-form text. In that case the reasoning
+/// Returns `true` when constrained decoding is active, meaning output is
+/// constrained by tool/format rules rather than free-form text. In that case the reasoning
 /// parser must be skipped — otherwise it captures the constrained JSON as
 /// `reasoning_content` and leaves `content` empty.
 ///
 /// Constrained decoding is triggered by:
 /// - `tool_choice` = a specific function, `required`, or `allowed_tools` with
 ///   `mode == "required"`
-/// - `response_format` = `json_object` or `json_schema`
+/// - `response_format` = `json_object`, `json_schema`, or `regex`
📝 Committable suggestion


Suggested change
-/// Returns `true` when constrained decoding is active, meaning the model output
-/// is structured JSON rather than free-form text. In that case the reasoning
+/// Returns `true` when constrained decoding is active, meaning output is
+/// constrained by tool/format rules rather than free-form text. In that case the reasoning
 /// parser must be skipped — otherwise it captures the constrained JSON as
 /// `reasoning_content` and leaves `content` empty.
 ///
 /// Constrained decoding is triggered by:
 /// - `tool_choice` = a specific function, `required`, or `allowed_tools` with
 ///   `mode == "required"`
-/// - `response_format` = `json_object` or `json_schema`
+/// - `response_format` = `json_object`, `json_schema`, or `regex`
 pub(crate) fn has_constrained_output(

Comment on lines +177 to +186
    let healthy: Vec<_> = workers.iter().filter(|w| w.is_healthy()).collect();
    if healthy.is_empty() {
        let info: Vec<_> = workers
            .iter()
            .map(|w| format!("{} ({})", w.model_id(), w.url()))
            .collect();
        return (
            StatusCode::SERVICE_UNAVAILABLE,
            format!("0/{} workers healthy: {}", workers.len(), info.join(", ")),
        )

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't leak internal worker details from a public health endpoint.

/health_generate is public, but these branches return worker URLs and raw backend error bodies to any caller. That exposes internal topology and upstream diagnostics. Keep the detailed data in logs and return a sanitized status message here.

Also applies to: 268-295
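
A sketch of the sanitized shape, with WorkerInfo standing in for the real worker handle and tracing assumed for the local log:

use axum::http::StatusCode;

// `WorkerInfo` is a stand-in for the real worker handle.
struct WorkerInfo { healthy: bool, model_id: String, url: String }

fn health_summary(workers: &[WorkerInfo]) -> (StatusCode, String) {
    let healthy = workers.iter().filter(|w| w.healthy).count();
    if healthy == 0 {
        // Keep the sensitive detail in local logs...
        for w in workers {
            tracing::warn!(model = %w.model_id, url = %w.url, "worker unhealthy");
        }
        // ...and return only a sanitized summary to the public caller.
        return (
            StatusCode::SERVICE_UNAVAILABLE,
            format!("0/{} workers healthy", workers.len()),
        );
    }
    (StatusCode::OK, format!("{healthy}/{} workers healthy", workers.len()))
}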


Comment on lines +219 to +245
    let model_id = healthy[0].model_id().to_string();
    let probe_url = format!("http://127.0.0.1:{}/v1/chat/completions", state.api_port);
    let probe_body = serde_json::json!({
        "model": model_id,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
        "stream": false,
        "temperature": 0
    });

    info!(
        target: "smg::health",
        model = %model_id,
        probe_url = %probe_url,
        max_tokens = 1,
        "health_generate: sending real inference probe"
    );

    let start = Instant::now();
    let probe_result = state
        .context
        .client
        .post(&probe_url)
        .json(&probe_body)
        .timeout(Duration::from_secs(60))
        .send()
        .await;

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The self-probe bypasses real server configuration.

This loopback request hits /v1/chat/completions without auth headers, hard-codes http://127.0.0.1 even when startup later binds TLS, and assumes the first healthy worker can answer chat. Otherwise-healthy deployments that enable auth, serve HTTPS, or host non-chat models will start failing /health_generate once the fast-path expires.
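
One possible direction, sketched with an illustrative ProbeConfig rather than the actual AppState fields: derive the scheme and timeout from configuration instead of hard-coding them.

use std::time::Duration;

// Illustrative config shape; the real AppState fields may differ.
struct ProbeConfig {
    tls_enabled: bool,
    api_port: u16,
    health_generate_timeout_secs: u64,
}

// Build the probe target from configuration so TLS deployments and the
// --health-generate-timeout-secs flag are honored.
fn probe_target(cfg: &ProbeConfig) -> (String, Duration) {
    let scheme = if cfg.tls_enabled { "https" } else { "http" };
    (
        format!("{scheme}://127.0.0.1:{}/v1/chat/completions", cfg.api_port),
        Duration::from_secs(cfg.health_generate_timeout_secs),
    )
}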

@gongwei-130 force-pushed the wei/kimi-k25-consolidate branch from fe53444 to 3c42c86 on May 2, 2026 at 01:28

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c42c86c2e


Comment on lines +746 to +750
        if self.configured_tool_parser.is_some() && tool_parser_available {
            (tool_calls, processed_text) = self
                .parse_tool_calls(
                    &processed_text,
                    &messages_request.model,

P1: Fall back to JSON-schema parsing for Messages tool choices

When configured_tool_parser is set, this branch always runs before the used_json_schema path for Messages API requests, so ToolChoice::Tool/ToolChoice::Any outputs that are constrained JSON are first fed to the structural-tag parser and can come back as tool_calls = None. Because this is an if/else if chain, the JSON-schema parser is then skipped, and the response is emitted as plain text instead of a tool call. This regresses constrained tool calling for Messages on models with configured parsers.


Comment on lines +704 to +707
        if tok == im_token_id {
            if !prev_was_placeholder {
                result.push(tok);
            }

P2: Preserve separate image placeholders when collapsing TRT tokens

This collapse logic removes every consecutive im_token_id after the first, which also merges placeholders from distinct adjacent images (for example, two image parts with no intervening text). In that case TRT receives a single placeholder for multiple images, so placeholder count no longer matches multimodal inputs and can misalign or fail image binding.


Comment on lines 746 to 781
@@ -655,12 +775,28 @@ impl ResponseProcessor {
utils::message_utils::get_history_tool_calls_count_messages(
&messages_request,
),
None,
)
.await;
}

🔴 Important: The Messages API path uses an if / else if / else if chain, but the Chat API path (lines 171–204 above) was refactored to use independent if blocks with tool_calls.is_none() guards. This means the Messages path is missing the fallback behavior.

When all three conditions are true (configured_tool_parser.is_some(), tool_parser_available, and used_json_schema), the first branch runs the configured parser. If it returns tool_calls = None (e.g., model emitted constrained JSON instead of native tool-call tokens), the else if used_json_schema branch is skipped because the first if already matched. The json_schema fallback never runs, and the tool call is silently lost.

The Chat API path handles this correctly with independent if blocks:

if self.configured_tool_parser.is_some() && tool_parser_available && !output_is_constrained {
    (tool_calls, processed_text) = self.parse_tool_calls(...).await;
}
if tool_calls.is_none() && used_json_schema {
    (tool_calls, processed_text) = utils::parse_json_schema_response(...);
}
if tool_calls.is_none() && tool_parser_available {
    (tool_calls, processed_text) = self.parse_tool_calls(...).await;
}

This Messages path should use the same pattern.

Comment on lines +586 to +592
ThinkingConfig::Enabled { .. } => {
    kwargs
        .entry("enable_thinking".to_string())
        .or_insert(Value::Bool(true));
    kwargs
        .entry("thinking".to_string())
        .or_insert(Value::Bool(true));

🟡 Nit: ThinkingConfig::Enabled { .. } silently drops the budget_tokens field. Users who send {"type": "enabled", "budget_tokens": 4096} will get thinking enabled but their budget constraint ignored. Consider passing budget_tokens through as a kwarg (e.g., "budget_tokens") so chat templates or downstream engines that support it can respect the value.

Suggested change
-ThinkingConfig::Enabled { .. } => {
+ThinkingConfig::Enabled { budget_tokens } => {
     kwargs
         .entry("enable_thinking".to_string())
         .or_insert(Value::Bool(true));
     kwargs
         .entry("thinking".to_string())
         .or_insert(Value::Bool(true));
+    kwargs
+        .entry("budget_tokens".to_string())
+        .or_insert(Value::Number((*budget_tokens).into()));

Comment on lines +73 to +84
fn chat_message_text_content(msg: &ChatMessage) -> String {
    match msg {
        ChatMessage::System { content, .. }
        | ChatMessage::User { content, .. }
        | ChatMessage::Tool { content, .. }
        | ChatMessage::Developer { content, .. } => content.to_simple_string(),
        ChatMessage::Assistant { content, .. } => content
            .as_ref()
            .map_or_else(String::new, |c| c.to_simple_string()),
        ChatMessage::Function { content, .. } => content.clone(),
    }
}

🟡 Nit: Assistant messages with tool_calls but no text content all hash to the same value (sha256("assistant\x00")), because tool_calls and reasoning_content are ignored by this extractor. In multi-turn conversations with several tool-calling turns, every assistant tool-call message produces an identical hash, making session reconstruction ambiguous for exactly the messages that matter most in agentic workflows.

If this needs to match TRT-LLM's openai_server.py format exactly, this is correct by spec — but worth a comment noting the limitation.


Labels

  • benchmarks: Benchmark changes
  • grpc: gRPC client and router changes
  • model-gateway: Model gateway crate changes
  • multimodal: Multimodal crate changes
  • openai: OpenAI router changes
  • protocols: Protocols crate changes
  • python-bindings: Python bindings changes
  • reasoning-parser: Reasoning parser changes
  • tests: Test changes
  • tokenizer: Tokenizer related changes
  • tool-parser: Tool/function call parser changes
