fix: Fix Kimi K2.5 function calling and structured outputs #1430

Open

gongwei-130 wants to merge 27 commits into main from wei/kimi-k25-consolidate

Conversation

@gongwei-130 (Collaborator) commented May 2, 2026

Summary

  • Consolidate Kimi K2.5 function-calling, reasoning, logging, health probe, and multimodal handling fixes
  • Add support for response_format.type = regex (a sketch of the new variant follows this list)
  • Ensure regex structured-output responses skip reasoning parsing so content is emitted normally in streaming and non-streaming paths
  • Make /health_generate inference probe timeout configurable via --health-generate-timeout-secs, defaulting to 3 seconds
  • Clean up lint/formatting issues
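
A plausible shape for the new response_format variant, going by the CodeRabbit walkthrough below ("ResponseFormat adds Regex { pattern } variant"); the actual definition lives in crates/protocols and may differ:

#[derive(serde::Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ResponseFormat {
    Text,
    JsonObject,
    JsonSchema { json_schema: serde_json::Value },
    // New in this PR: {"response_format": {"type": "regex", "pattern": "..."}}
    Regex { pattern: String },
}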

Verification

  • cargo test -p smg routers::grpc::utils::parsers::tests::response_format_regex_is_constrained
  • cargo test -p smg config::types::tests::test_router_config_default
  • cargo test -p smg config::types::tests::test_router_config_deserializes_skills_enabled_flag
  • cargo fmt --check
  • cargo clippy -p smg --all-targets -- -D warnings

Summary by CodeRabbit

Release Notes

  • New Features

    • Added optional message hashing for session reconstruction logging via --enable-message-hash flag
    • Added support for user-supplied request ID pass-through in chat, completion, and messages APIs
    • Added ResponseFormat::Regex constraint support across inference backends
    • Added thinking/extended thinking configuration support for compatible models
    • Improved health monitoring with internal probes and recent-traffic detection
  • Improvements

    • Enhanced tool call parsing with argument validation and schema-based coercion
    • Optimized streaming responses with token timing tracking
    • Better handling of special tokens in tokenization
    • Improved reasoning parser detection for tool-call sections
    • Added TypeScript-style tool description formatting for chat templates

vschandramourya and others added 26 commits May 1, 2026 15:17
… argument types

Cherry-pick from mourya/engine_metrics_grpc_fixes (65f6c1c).

1. Register `raise_exception` as a Jinja global in chat template env
   so HuggingFace templates that use `{{ raise_exception("msg") }}`
   don't crash with "undefined function".

2. Add `coerce_tool_args_to_schema()` post-processor in processor.rs:
   after tool calls are parsed, walk the function schema's `properties`
   and coerce values to match declared types:
   - Number/Bool/Null → String when schema says "string"
   - String → Number/Integer when schema says "number"/"integer"
   - String → Bool when schema says "boolean"
   - Scalar → Array when schema says "array"

   Fixes fc-dash regressions where the model emits correct tokens but
   the parsed JSON has wrong types:
   - bfcl/simple_javascript_43: `error: null` → `error: "null"`
   - bfcl/simple_python_218: `patient_id: 546382` → `"546382"`
   - bfcl/parallel_117: `year: 2000` → `"2000"`

Signed-off-by: Connor Li <[email protected]>
Made-with: Cursor
Signed-off-by: Connor Li <[email protected]>
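
A minimal sketch of the coercion pass described in the commit above; the helper name matches the commit message, but the body here is illustrative (assuming serde_json and a schema `properties` object), not the actual implementation:

use serde_json::Value;

// Sketch only: walk the declared property types and coerce parsed argument
// values to match, per the rules listed in the commit message.
fn coerce_tool_args_to_schema(args: &mut Value, properties: &Value) {
    let (Some(props), Some(obj)) = (properties.as_object(), args.as_object_mut()) else {
        return;
    };
    for (key, val) in obj.iter_mut() {
        let declared = props
            .get(key)
            .and_then(|p| p.get("type"))
            .and_then(|t| t.as_str());
        let replacement = match (declared, &*val) {
            // Number/Bool/Null -> String when the schema says "string"
            (Some("string"), v) if !v.is_string() => Some(Value::String(
                if v.is_null() { "null".into() } else { v.to_string() },
            )),
            // String -> Integer / Number when the schema says so
            (Some("integer"), Value::String(s)) => s.parse::<i64>().ok().map(Value::from),
            (Some("number"), Value::String(s)) => s
                .parse::<f64>()
                .ok()
                .and_then(serde_json::Number::from_f64)
                .map(Value::Number),
            // String -> Bool when the schema says "boolean"
            (Some("boolean"), Value::String(s)) => s.parse::<bool>().ok().map(Value::Bool),
            // Scalar -> single-element Array when the schema says "array"
            (Some("array"), v) if !v.is_array() => Some(Value::Array(vec![v.clone()])),
            _ => None,
        };
        if let Some(new_val) = replacement {
            *val = new_val;
        }
    }
}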
…ol-call tokens

- Prioritize explicitly configured tool parser over JSON schema parsing
- Support alternative delimiters (<|func_start|>/<|func_end|>) in KimiK2 parser
- Prevent reasoning parser from consuming tool call markers
- Strip leaked chatml tokens only when a parser is explicitly configured
- Truncate trailing content after first valid JSON object for JSON response formats

Signed-off-by: ConnorLi96 <[email protected]>
Made-with: Cursor
- Fix tiktoken encoder to use encode_with_special_tokens for chat template output
- Prioritize configured tool parser over JSON schema with fallthrough
- Skip reasoning parsing when output is constrained (specific tool_choice / JSON response_format)
- Support alternative delimiters and hyphens in KimiK2 tool call parser
- Prevent reasoning parser from consuming tool call markers
- Keep tool call arguments as strings to preserve formatting
- Generate TypeScript-style tool declarations for Kimi K2.5
- Strip leaked chatml tokens conditionally and clean up JSON response format
- Add name field to Tool ChatMessage variant
- Graceful metrics server bind failure

Signed-off-by: ConnorLi96 <[email protected]>
Made-with: Cursor
…on_config.json

Models like Kimi-K2.5 define different EOS tokens in config.json ([EOS]=163585)
and generation_config.json (<|im_end|>=163586). The engine must stop at both.

SGLang handles this internally by merging both sources, but TRT-LLM's gRPC path
does not. This change:

- Adds load_eos_token_ids() to read and merge EOS IDs from both config files
- Exposes eos_token_ids() on the Tokenizer trait
- Passes merged IDs as stop_token_ids in TRT-LLM gRPC requests
- Removes the hardcoded <|im_end|> stop string workaround (bfab5ad)
- Strips [EOS]/[BOS] text from constrained decoding output

Result: SMG+TRT-LLM fc-dash 55/60 (was 52/60 with hack, 18/60 without)

Signed-off-by: ConnorLi96 <[email protected]>
Made-with: Cursor
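
A sketch of the merge this commit describes, assuming both config files are already parsed to serde_json::Value; the real helper is load_eos_token_ids() in the tokenizer crate and may read and validate differently:

use std::collections::BTreeSet;
use serde_json::Value;

// Collect eos_token_id from both config.json and generation_config.json;
// either file may store a single id or a list of ids.
fn merge_eos_token_ids(config: &Value, generation_config: &Value) -> Vec<u32> {
    let mut ids = BTreeSet::new();
    for cfg in [config, generation_config] {
        match cfg.get("eos_token_id") {
            Some(Value::Number(n)) => {
                if let Some(id) = n.as_u64() {
                    ids.insert(id as u32);
                }
            }
            Some(Value::Array(arr)) => {
                ids.extend(arr.iter().filter_map(Value::as_u64).map(|id| id as u32));
            }
            _ => {}
        }
    }
    // For Kimi-K2.5 this yields both 163585 ([EOS]) and 163586 (<|im_end|>).
    ids.into_iter().collect()
}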
… debug

During streaming detokenization, multi-byte characters (CJK, emoji) are
split across token boundaries. Decoding a partial token sequence produces
incomplete UTF-8, which is expected and handled correctly by the lossy
fallback + streaming buffer. This was logging at WARN level, flooding
production logs on every request containing non-ASCII output.

Made-with: Cursor
…g image_url

1. Add Anthropic-style `thinking` field to ChatCompletionRequest:
   `{"thinking": {"type": "enabled", "budget_tokens": N}}` or
   `{"thinking": {"type": "disabled"}}`. Normalized to
   chat_template_kwargs (both `thinking` and `enable_thinking` keys)
   so it works with Kimi-K2.5 and Qwen3 templates.

2. Support bare string image_url format: accept both
   `"image_url": "https://..."` and `"image_url": {"url": "https://..."}`
   via custom serde deserializer on ImageUrl.

Made-with: Cursor
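
A hedged sketch of how the Anthropic-style field could deserialize; variant and field names follow the commit message, not necessarily the real structs:

use serde::Deserialize;

// {"thinking": {"type": "enabled", "budget_tokens": N}} or {"type": "disabled"}
#[derive(Debug, Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
enum ThinkingConfig {
    Enabled { budget_tokens: Option<u64> },
    Disabled,
}

// Downstream, per the commit, this is normalized into chat_template_kwargs
// under both the `thinking` and `enable_thinking` keys so Kimi-K2.5 and
// Qwen3 templates both pick it up.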
… when thinking is active

When separate_reasoning is true the chat template injects <think> in the
prompt, so the model emits thinking content regardless of backend
constraints.  TRT-LLM does not have TGL's ReasonerGrammarObject that
defers JSON constraint enforcement until after </think>, so the raw
output contains <think>...</think> followed by the actual JSON or
tool-call tokens.

The existing output_is_constrained gate unconditionally skipped the
reasoning parser for response_format json_object/json_schema and
tool_choice Function, assuming constrained decoding produces pure JSON.
This left thinking content in the response, breaking structured output
parsing (0/3 on all 4 structured-output tests) and tool_choice streaming
(0/3).

Changes:
- Gate output_is_constrained on !separate_reasoning so the reasoning
  parser runs when thinking is expected (processor.rs + streaming.rs)
- Use native tool parser instead of JSON parser for streaming
  tool_choice Function when reasoning is active, since the backend may
  emit native tool-call tags rather than constrained JSON
- Strip markdown code fences (```json / ```) from streaming JSON
  responses, including residual language tags split across chunk
  boundaries

Tested: all 5 previously-failing tests now pass 3/3 (47/58 → 52/58).
Made-with: Cursor
The streaming path incorrectly gated reasoning_parser_available on
!has_constrained_output(), preventing the reasoning parser from running
when response_format (json_object/json_schema) was set — even when
separate_reasoning was true. This caused thinking content to leak into
the content field instead of reasoning_content during streaming.

The non-streaming path correctly handled this case by not including the
constrained output check in reasoning_parser_available. The downstream
output_is_constrained variable already correctly gates on
!separate_reasoning, so the upfront check was both redundant and wrong.
Two fixes for Kimi K2.5 fc-dash regressions:

1. Rewrite user-supplied tool_call IDs (e.g. "call_1") to the
   "functions.NAME:INDEX" format that the Kimi K2 chat template and
   parser expect. Without this, the model generates unparseable IDs
   and the tool parser returns raw special tokens in content.

2. Handle markdown code fences (```json) that split across streaming
   chunk boundaries. Add a cross-chunk buffer that holds partial fence
   suffixes and flushes/drops fence residue at end-of-stream.

Fixes: multi-turn/native-tag-leak, native-tag-leak-streaming,
structured-outputs/json-mode-streaming, call-feedback-classification-streaming

Signed-off-by: Connor Li <[email protected]>
…ogging

Replace the cached-flag-only health_generate with a real inference probe:
sends POST /v1/chat/completions (max_tokens=1, temperature=0, stream=false,
3s timeout) through the router's own API port (self-call via 127.0.0.1).

Adds structured INFO/WARN logs at smg::health target:
- "sending real inference probe" — before the request, with model + URL
- "probe succeeded" — on 200, with duration_ms + worker count
- "probe failed (backend error)" — on non-2xx, with status + duration
- "probe failed (transport error)" — on timeout/network error

Widens scope from External-runtime workers only to ALL workers, so
gRPC-connected backends (sglang, TRT-LLM, vLLM) are visible to K8s
readiness probes. Both changes are intentional K8s probe semantic changes.

Signed-off-by: Connor Li <[email protected]>
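
A simplified sketch of the self-call probe, assuming a reqwest client with the json feature; the 3s timeout matches this commit (a later review snippet shows 60s), and error handling is pared down:

use std::time::{Duration, Instant};

async fn probe_health(client: &reqwest::Client, api_port: u16, model: &str) -> bool {
    let url = format!("http://127.0.0.1:{api_port}/v1/chat/completions");
    let body = serde_json::json!({
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
        "temperature": 0,
        "stream": false
    });
    tracing::info!(target: "smg::health", %url, "sending real inference probe");
    let started = Instant::now();
    match client
        .post(&url)
        .json(&body)
        .timeout(Duration::from_secs(3))
        .send()
        .await
    {
        Ok(resp) if resp.status().is_success() => {
            tracing::info!(
                target: "smg::health",
                duration_ms = started.elapsed().as_millis() as u64,
                "probe succeeded"
            );
            true
        }
        Ok(resp) => {
            tracing::warn!(target: "smg::health", status = %resp.status(), "probe failed (backend error)");
            false
        }
        Err(err) => {
            tracing::warn!(target: "smg::health", error = %err, "probe failed (transport error)");
            false
        }
    }
}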
Add request_id field to ChatCompletionRequest, CompletionRequest, and
CreateMessageRequest protocol structs. When a user supplies request_id
in the request payload, SMG now passes it through to the engine as the
gRPC request ID instead of generating chatcmpl-{Uuid}. This enables
correlating engine logs with upstream request tracking.

All 4 API paths updated: Chat, Completions, Messages, and Responses
(Harmony). Each path logs an INFO message when a user-supplied
request_id is used.

Also fixes clippy uninlined_format_args and unnecessary qualification
warnings in server.rs (from cargo fmt reformatting).

Signed-off-by: Connor Li <[email protected]>
Signed-off-by: Connor Li <[email protected]>
…ruction

Add --enable-message-hash CLI flag (default: false) that computes
per-message SHA-256 hashes matching TRT-LLM's openai_server.py format:
sha256(role + "\x00" + content).hexdigest()[:12]

When enabled, hashes are emitted as structured INFO logs on the
smg::request target with the request_id, enabling session
reconstruction for traffic simulation across all engine runtimes
(TRT-LLM, sglang, vLLM).

Hash computation wired through RouterConfig -> pipeline constructors ->
stage constructors for all API paths. Python bindings updated with
enable_message_hash field and --enable-message-hash CLI argument.

Signed-off-by: Connor Li <[email protected]>
Signed-off-by: Connor Li <[email protected]>
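
A sketch of the hash itself, assuming the sha2 and hex crates; it mirrors the formula in the commit message:

use sha2::{Digest, Sha256};

// sha256(role + "\x00" + content).hexdigest()[:12], matching TRT-LLM's
// openai_server.py so hashes line up across runtimes.
fn message_hash(role: &str, content: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(role.as_bytes());
    hasher.update(b"\x00");
    hasher.update(content.as_bytes());
    hex::encode(hasher.finalize())[..12].to_string()
}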
…ripping

The fence_buffer approach from 06574da failed to strip the "json\n"
prefix when markdown code fences split across streaming chunks (e.g.
"```" in chunk N, "json\n{..." in chunk N+1). Replace with a simpler
fence_backticks_stripped state flag that tracks when backticks were
stripped, then removes leading language tags from the next chunk.

Signed-off-by: Connor Li <[email protected]>
Extract inline fence stripping logic into a standalone
strip_json_fence() function for testability. Add 7 unit tests covering:
- full fence in single chunk
- cross-chunk split (backticks then language tag)
- CRLF line endings
- standalone language tag drop
- no-fence passthrough
- uppercase JSON/json variants

No behavior change — pure refactor of commit eae26bc.

Signed-off-by: Connor Li <[email protected]>
Made-with: Cursor
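
A rough sketch of the stripping state machine these two commits describe (names illustrative, not the project's code); note the review threads below flag that a global replace() can eat backticks that are legitimate JSON content:

// `stripped_backticks` persists across chunks so a language tag arriving in
// the next chunk ("json\n{...") can still be dropped.
fn strip_json_fence(chunk: &str, stripped_backticks: &mut bool) -> String {
    let mut out = chunk.replace("```json", "").replace("```JSON", "");
    if out.contains("```") {
        out = out.replace("```", "");
        *stripped_backticks = true;
    } else if *stripped_backticks {
        for tag in ["json\n", "JSON\n", "json\r\n", "JSON\r\n"] {
            if let Some(rest) = out.strip_prefix(tag) {
                out = rest.to_string();
                break;
            }
        }
        *stripped_backticks = false;
    }
    out
}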
Add MessageHash message type and message_hashes field to GenerateRequest
proto. Wire computed hashes from request building stages through the gRPC
client to TRT-LLM backend, gated behind --enable-message-hash.

Signed-off-by: Connor Li <[email protected]>
Signed-off-by: Connor Li <[email protected]>
Prevents false-positive health check failures under load and during
graceful shutdown, which were causing unnecessary k8s pod restarts.
When response_format is JsonObject or JsonSchema, the model produces
JSON content, not reasoning tokens. The kimi_k25 reasoning parser was
consuming this JSON as reasoning_content, causing structured-outputs
fc-dash tests to fail (53/58 -> 58/58 with this fix).

Check response_format directly instead of gating on !separate_reasoning
(which was always false for Kimi, making the constrained-output check
a no-op). Applied to all three code paths: non-streaming single choice,
non-streaming batch, and streaming.

Also fix pre-existing clippy warnings: unnecessary qualifications for
ResponseFormat, unneeded binding pattern in monitor.rs, and missing
too_many_arguments expects in client.rs and trtllm_service.rs. Format
request_building.rs files.

Signed-off-by: Connor Li <[email protected]>
Mirrors TRT-LLM / TGL approach: if any token chunk was forwarded to a
client within the last 10 seconds, /health_generate returns 200
immediately without sending a synthetic inference request.

Under high load the probe adds backpressure and can itself time out
under k8s's 10s kubelet deadline, causing spurious restarts. The
timestamp fast-path eliminates that failure mode.

Implementation:
- AppContext gains last_token_time: Arc<AtomicU64> (unix seconds, 0 = none)
- StreamingProcessor (gRPC path) wraps the SSE channel with a map() that
  stores the current timestamp on every successful chunk via
  build_tracked_sse_response()
- OpenAI HTTP relay (chat.rs) stores the timestamp after each successful
  tx.send() in the bytes relay task
- health_generate checks last_token_time first; falls back to the real
  inference probe only when the server has been idle for >=10s

Signed-off-by: Connor Li <[email protected]>
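
A compact sketch of the fast-path check, with the store side shown as a comment; names are illustrative:

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// On every chunk successfully forwarded to a client, the streaming paths do:
//   last_token_time.store(now_unix_secs(), Ordering::Relaxed);
// health_generate then short-circuits when traffic is recent:
fn recently_served(last_token_time: &AtomicU64) -> bool {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    let last = last_token_time.load(Ordering::Relaxed); // 0 = no token seen yet
    last != 0 && now.saturating_sub(last) < 10
}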
Both TRT-LLM (KimiK25InputProcessor) and TGL/SGLang
(KimiGPUProcessorWrapper) expand a single <|media_pad|> placeholder
to N vision tokens server-side based on grid_thws.

The previous code pre-expanded to N tokens in SMG, causing double
expansion and a "More media placeholder tokens than media items"
400 error from the engine.

Switch from PromptReplacement::repeated(num_tokens) to
PromptReplacement::sequence(vec![1]) so SMG sends exactly 1
placeholder per image, matching the engine contract.

Signed-off-by: Connor Li <[email protected]>
Made-with: Cursor
SGLang uses N <|media_pad|> tokens per image directly for embedding
lookup, while TRT-LLM's KimiK25InputProcessor re-expands each single
placeholder to N tokens based on grid_thws.

The previous commit used 1 placeholder per image unconditionally,
which fixed TRT-LLM (no double-expansion) but broke SGLang (embedding
split_with_sizes mismatch: tensor=3588 vs split_sizes=[1]).

Now: prompt_replacements emits N tokens (SGLang-compatible), and
collapse_media_placeholders() in request_building collapses consecutive
runs to 1 for TRT-LLM only, gated on builder_client.is_trtllm().

Applied to both Chat and Messages API request building paths.

Signed-off-by: Connor Li <[email protected]>
Made-with: Cursor
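
A sketch of the TRT-LLM-only collapse (illustrative signature); one of the CodeRabbit comments below notes an adjacent-image caveat for this approach:

// Squeeze each consecutive run of the media-pad token down to one
// placeholder; TRT-LLM's KimiK25InputProcessor re-expands it server-side.
fn collapse_media_placeholders(token_ids: &[u32], media_pad_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(token_ids.len());
    let mut prev_was_pad = false;
    for &id in token_ids {
        let is_pad = id == media_pad_id;
        if !(is_pad && prev_was_pad) {
            out.push(id);
        }
        prev_was_pad = is_pad;
    }
    out
}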

coderabbitai Bot commented May 2, 2026

📝 Walkthrough

This PR introduces per-message SHA-256 hashing for session reconstruction logging, request ID passthrough across chat/completion/messages APIs, last-token-time tracking for optimized health probing, improved tokenizer EOS token handling, constrained output detection, tool-call argument coercion, Regex response format support across backends, KimiK2 tool-call delimiter variants, chat-template enhancements (TypeScript tool declarations, raise_exception), custom ImageUrl serde, thinking config support, and refactored health probing with intelligent fast-paths.

Changes

Unified Request Tracking, Constraint Handling, and Token Management

  • Data Shape & Protocol Updates (crates/protocols/src/*, crates/grpc_client/proto/*): ChatCompletionRequest/CompletionRequest/CreateMessageRequest gain optional request_id fields; ChatCompletionRequest adds optional thinking: Option<ThinkingConfig>; ResponseFormat adds Regex { pattern } variant; ImageUrl custom serde (conditional detail, accept string or object); GenerateRequest proto gains repeated MessageHash field; new MessageHash proto message added.
  • Core Tokenizer & Trait Updates (crates/tokenizer/src/traits.rs, crates/tokenizer/src/chat_template.rs, crates/tokenizer/src/{huggingface,mock,tiktoken}.rs): SpecialTokens gains eos_token_ids: Vec<TokenIdType> field; Tokenizer trait default eos_token_ids() method added; chat template build_environment registers raise_exception Jinja2 callable; SpecialTokens initialization refactored to use ..Default::default() for consistency.
  • App Context & Router Config (model_gateway/src/app_context.rs, model_gateway/src/config/{types,builder}.rs): AppContext gains last_token_time: Arc<AtomicU64>; RouterConfig gains enable_message_hash: bool field; RouterConfigBuilder adds enable_message_hash() fluent setter.
  • Request Building & Constraint Detection (model_gateway/src/routers/grpc/utils/parsers.rs, model_gateway/src/routers/grpc/common/stages/helpers.rs): New has_constrained_output(tool_choice, response_format) helper (detects specific-function, JSON/Schema, or Regex constraints; a hypothetical sketch follows this list); new message-hashing helpers compute_and_log_message_hashes and compute_and_log_input_message_hashes (SHA-256 truncated to 12 hex chars, logged for audit).
  • gRPC Client & Backend Builders (crates/grpc_client/src/{trtllm_service,sglang_scheduler,vllm_engine}.rs): All three backends extend build_*_request methods to accept and populate message_hashes: Option<Vec<(String, String)>>; TRT-LLM builder gains eos_token_ids parameter and derives stop_token_ids from it; Regex response format support added to build_constraint_for_chat and build_guided_decoding; proto MessageHash entries constructed when hashes provided.
  • Pipeline & Stage Construction (model_gateway/src/routers/grpc/{pipeline,pd_router,router}.rs): All RequestPipeline::new_* methods gain enable_message_hash: bool and last_token_time: Arc<AtomicU64> parameters; constructors thread these into request-building and streaming stages; PD router extracts enable_message_hash from ctx.router_config.
  • Request Building Stages (model_gateway/src/routers/grpc/regular/stages/{chat,messages,completion}/request_building.rs, model_gateway/src/routers/grpc/harmony/stages/request_building.rs): Chat/Messages stages gain enable_message_hash field and conditionally compute message_hashes; request IDs now prefer user-supplied *_request.request_id with info logging, else generate UUID-based defaults; Harmony stage similarly updated; multimodal handling now calls collapse_media_placeholders for TRT-LLM; eos_token_ids sourced from tokenizer.
  • Streaming & SSE Tracking (model_gateway/src/routers/grpc/common/responses/streaming.rs, model_gateway/src/routers/grpc/regular/streaming.rs): New build_tracked_sse_response records Unix epoch seconds into shared Arc<AtomicU64> on successful stream chunks; StreamingProcessor gains last_token_time field; chat/messages/completion streaming now use tracked SSE responses.
  • Streaming Logic Refinements (model_gateway/src/routers/grpc/regular/streaming.rs, model_gateway/src/routers/grpc/regular/processor.rs): Reasoning parsing gated on !has_constrained_output; tool-call parsing prefers configured parser when present and not constrained; ChatML token stripping gated by parser configuration; new strip_json_fence helper for JSON response sanitization; tool-argument parsing now coerces values to schema via coerce_tool_args_to_schema helper; Messages path updated similarly.
  • Tool Parsing & Response Handling (model_gateway/src/routers/grpc/utils/chat_utils.rs): process_tool_call_arguments refactored: validates JSON, canonicalizes tool-call IDs to functions.{name}:{index}, updates tool-role message tool_call_id via mapping; new tools_to_typescript generator (plus helpers) converts OpenAI Tool schemas to TypeScript function declarations; parse_json_schema_response sanitizes output by stripping chat/special tokens.
  • Special Token Tokenization (model_gateway/src/routers/grpc/regular/stages/{chat,messages}/preparation.rs): Tokenization now uses encode(..., true) to recognize special tokens (chat template embeddings, images, control tokens) as single token IDs instead of prior encode(..., false).
  • Multimodal & Tool Utilities (model_gateway/src/routers/grpc/multimodal.rs): New collapse_media_placeholders helper de-duplicates consecutive media placeholder token IDs for TRT-LLM (which re-expands them server-side) vs. SGLang (which uses repeated pad tokens).
  • Protocol Regex Support (crates/grpc_client/src/{sglang_scheduler,vllm_engine}.rs, model_gateway/src/routers/grpc/harmony/stages/preparation.rs): Both SGLang and vLLM engines add ResponseFormat::Regex { pattern } case in constraint building; Harmony preparation treats Regex as structured output (returns Ok(None) without Harmony tag generation).
  • Tool Parser Enhancements (crates/tool_parser/src/parsers/kimik2.rs): KimiK2 regex patterns extended to recognize the alternative <|func_start|>/<|func_end|> delimiters alongside the default tool-call markers.
  • Reasoning Parser Improvements (crates/reasoning_parser/src/parsers/base.rs): When the closing </think> is missing, detect <|tool_calls_section_begin|> and exit reasoning mode so tool-call markers are not consumed as reasoning content.
  • OpenAI Router Tracking (model_gateway/src/routers/openai/{context,router,chat}.rs): SharedComponents gains last_token_time: Arc<AtomicU64>; route_chat streaming now updates last_token_time with Unix epoch seconds on each successful chunk; retries also capture the timestamp handle.
  • Health Probing Refactor (model_gateway/src/server.rs, model_gateway/src/app_context.rs): AppState gains api_port: u16; /health_generate now inlined: fails 503 if no healthy workers, skips probe (returns 200) if last_token_time recent (<10s), else POSTs real /v1/chat/completions request to localhost with 60s timeout, returns response status or 503 with error.
  • CLI & Python Bindings (model_gateway/src/main.rs, bindings/python/src/lib.rs, bindings/python/src/smg/router_args.py): CLI flag --enable-message-hash added (default false, "Logging" group); Python Router.__new__ gains enable_message_hash = false parameter; RouterArgs dataclass gains enable_message_hash: bool field with CLI mapping via --{prefix}enable-message-hash flag.
  • Documentation & Comments (crates/multimodal/src/registry/kimi_k25.rs, crates/tokenizer/src/tiktoken.rs, various inline comments): Inline comments clarified for pad-token collapse behavior, test generalization for "N pad tokens", tiktoken decode fallback level changed from warn! to debug!.
  • Tests & Fixtures (model_gateway/tests/*, crates/ test helpers): Test fixtures updated to include request_id: None/api_port: 0 fields; health-probe tests added for recent-token fast-path, stale-token fallthrough, and no-workers cases; SSE tracking timestamp tests verify update-on-Ok and no-update-on-Err; has_constrained_output unit tests cover tool-choice/response-format combinations; stripped JSON fence unit tests cover LF/CRLF, single/multi-chunk, uppercase variants.
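
For orientation, a hypothetical sketch of the new helper using local stand-in types; the real signature in routers/grpc/utils/parsers.rs differs:

enum ToolChoice { Auto, None, Function { name: String } }
enum ResponseFormat { Text, JsonObject, JsonSchema, Regex }

// Constrained whenever a specific function is forced or the response format
// is JSON/Schema/Regex, per the table entry above.
fn has_constrained_output(
    tool_choice: Option<&ToolChoice>,
    response_format: Option<&ResponseFormat>,
) -> bool {
    matches!(tool_choice, Some(ToolChoice::Function { .. }))
        || matches!(
            response_format,
            Some(ResponseFormat::JsonObject | ResponseFormat::JsonSchema | ResponseFormat::Regex)
        )
}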

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant GrpcRouter
    participant RequestPipeline
    participant ChatRequestBuildingStage
    participant TrtllmServiceClient
    participant TensorRtLlm as TensorRT-LLM Backend

    Client->>GrpcRouter: POST /v1/chat/completions + optional request_id
    GrpcRouter->>RequestPipeline: Build pipeline with enable_message_hash=true, last_token_time
    RequestPipeline->>ChatRequestBuildingStage: Create with enable_message_hash=true
    
    Client->>GrpcRouter: Chat request
    GrpcRouter->>RequestPipeline: execute()
    RequestPipeline->>ChatRequestBuildingStage: execute(chat_request)
    
    alt User-supplied request_id present
        ChatRequestBuildingStage->>ChatRequestBuildingStage: Use request_id from chat_request (log info)
    else No user request_id
        ChatRequestBuildingStage->>ChatRequestBuildingStage: Generate chatcmpl-<uuid>
    end
    
    alt enable_message_hash=true
        ChatRequestBuildingStage->>ChatRequestBuildingStage: compute_and_log_message_hashes()
        ChatRequestBuildingStage->>ChatRequestBuildingStage: SHA-256(role + "\x00" + content), truncate to 12 hex chars
    else enable_message_hash=false
        ChatRequestBuildingStage->>ChatRequestBuildingStage: message_hashes = None
    end
    
    ChatRequestBuildingStage->>TrtllmServiceClient: build_chat_request(..., eos_token_ids, message_hashes)
    TrtllmServiceClient->>TrtllmServiceClient: Convert message_hashes to proto::MessageHash entries
    TrtllmServiceClient->>TensorRtLlm: GenerateRequest with message_hashes field
    TensorRtLlm->>TensorRtLlm: Process request with per-message hashes for audit logging
    TensorRtLlm->>TrtllmServiceClient: Stream response chunks
    
    TrtllmServiceClient->>RequestPipeline: Response stream
    RequestPipeline->>StreamingProcessor: Stream with last_token_time handle
    StreamingProcessor->>StreamingProcessor: build_tracked_sse_response()
    StreamingProcessor->>StreamingProcessor: For each chunk: store Unix epoch seconds in Arc<AtomicU64>
    StreamingProcessor->>Client: SSE stream (tracked timestamps enable health-probe fast-path)
sequenceDiagram
    participant HealthClient
    participant HealthEndpoint as /health_generate Endpoint
    participant WorkerSet
    participant TokenTracker as last_token_time Arc<AtomicU64>
    participant LocalBackend as http://127.0.0.1:api_port

    HealthClient->>HealthEndpoint: GET /health_generate
    HealthEndpoint->>WorkerSet: Get healthy workers
    
    alt No healthy workers
        HealthEndpoint->>HealthClient: 503 Service Unavailable
    else Healthy workers exist
        HealthEndpoint->>TokenTracker: Load last stored epoch seconds
        HealthEndpoint->>TokenTracker: Compare with now()
        
        alt Recent token (< 10 seconds ago)
            HealthEndpoint->>HealthEndpoint: Skip probe (active traffic detected)
            HealthEndpoint->>HealthClient: 200 OK + "probe skipped"
        else Stale or no recent token
            HealthEndpoint->>LocalBackend: POST /v1/chat/completions (ping request, 60s timeout)
            alt Backend responds with 2xx
                HealthEndpoint->>HealthClient: 200 OK + "workers healthy"
            else Backend error or timeout
                HealthEndpoint->>HealthClient: 503 Service Unavailable + error detail
            end
        end
    end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Rationale: This PR spans 50+ files with heterogeneous logic changes touching foundational systems (tokenizer traits, request building, streaming, constraint detection, tool parsing, health probing). Changes include dense logic (message hashing, tool-argument coercion, JSON fence stripping, timestamp tracking), multiple new public functions and struct fields, modified control flow in streaming and request building (reasoning/tool-call gating on constrained output), refactored tool-argument processing with ID canonicalization, custom serde implementations, and multiple concurrent features (request ID passthrough, message hashing, EOS token exposure, Regex format support, KimiK2 parser variants, thinking config, health-probe refactor). Each feature domain demands separate reasoning, and interactions between constrained output detection, tool parsing, and streaming logic are complex.

Possibly related PRs

  • The PR modifies model_gateway/src/routers/openai/chat.rs route_chat to track last_token_time for health-probe optimization, directly related to changes in OpenAI streaming context and routing.
  • The PR extends TrtllmServiceClient/GrpcClient build_generate_request_from_* methods to accept and populate message_hashes, directly related to previous message-building infrastructure PRs.
  • The PR exposes eos_token_ids via SpecialTokens and Tokenizer trait and uses them in request/stop-token derivation, directly related to tokenizer EOS token handling PRs.

Suggested labels

model-gateway, grpc, tokenizer, reasoning-parser, protocols, python-bindings, tests

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237

Poem

🐰 A rabbit's jubilation:

Hash tracking for messages in flight,
Request IDs glowing bright—
Tokens tracked, health probes sprint fast,
Constraints detected, schemas amassed,
The codebase hops to a brighter light! 🌟


@github-actions Bot added labels on May 2, 2026: tokenizer, python-bindings, grpc, benchmarks, tests, tool-parser, reasoning-parser, multimodal, protocols, model-gateway, openai

mergify Bot commented May 2, 2026

Hi @gongwei-130, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window

@gemini-code-assist Bot left a comment

Code Review

This pull request introduces several enhancements and fixes to the model gateway, including message hash logging for session reconstruction, support for user-supplied request IDs, and a fast path for health checks based on recent traffic. It also adds support for regex response formats, improves tool call parsing and argument coercion, and implements special token stripping and JSON fence removal in streaming responses. Feedback focuses on potential data corruption when stripping backticks from JSON responses, the need for global indexing of tool call IDs to ensure uniqueness across turns, and inconsistencies in reasoning parser logic between streaming and non-streaming paths. Additionally, improvements were suggested for JSON extraction to support arrays and for more robust token stripping in streaming contexts to handle cross-chunk splits.

Comment on lines +2789 to +2792
delta = delta
    .replace("```json", "")
    .replace("```JSON", "")
    .replace("```", "");

high

Using a global replace to strip triple backticks ( ``` ) from chunks when is_json_response is true is dangerous and can lead to data corruption. If the model-generated JSON content contains strings that include backticks (e.g., a description of code or a markdown snippet), those backticks will be incorrectly removed from the data. This logic should be more surgical, ideally only targeting fences at the very beginning or end of the stream.

References
  1. When stripping specific marker strings from data that might contain structured formats like JSON, avoid methods that can remove characters that are part of valid data, leading to data corruption.

let Some(args_str) = args.as_str() else {
    continue;
};
for (index, call) in tool_calls.iter_mut().enumerate() {

high

The index used for canonical tool call IDs is local to each message. In a multi-turn conversation with multiple assistant messages containing tool calls, this will result in duplicate IDs (e.g., functions.search:0 appearing in multiple turns). This violates the requirement for unique tool call IDs within a session and will likely cause parsing failures or model confusion when the history is re-processed. You should use a global counter that increments across all tool calls in the entire messages array.

        let mut global_index = 0;
        for msg in messages.iter_mut() {
            if let Some(tool_calls) = msg.get_mut("tool_calls").and_then(|v| v.as_array_mut()) {
                for call in tool_calls.iter_mut() {
                    // ... existing extraction of func_name and old_id ...
                    if let (Some(name), Some(old)) = (func_name, old_id) {
                        let canonical = format!("functions.{name}:{global_index}");
                        global_index += 1;
                        // ... existing id_rewrites bookkeeping using `old` and `canonical` ...
                    }
                }
            }
        }
References
  1. For protocol data structures that mirror an external API (e.g., OpenAI), prioritize alignment with the external specification over internal consistency.


if original_request.separate_reasoning
    && reasoning_parser_available
    && !has_structured_output

medium

The logic to skip reasoning parsing when has_structured_output is true is inconsistent with the streaming path (see streaming.rs line 209). When separate_reasoning is enabled, the chat template typically injects a marker, causing the model to emit reasoning content even if a structured output format is requested. Skipping the reasoning parser here will result in the reasoning content being treated as normal text or potentially interfering with JSON parsing.

Suggested change
-&& !has_structured_output
+&& reasoning_parser_available
References
  1. To maintain consistency, changes to a feature on one execution path (e.g., streaming) should align with its behavior on related paths (e.g., non-streaming).

}
}

if processed_text.starts_with('{') {

medium

The JSON extraction logic is too restrictive. It only handles JSON objects starting with { and ignores top-level JSON arrays starting with [. Additionally, it should probably trim() the processed_text before checking the start character to handle cases where the model outputs leading whitespace without markdown fences.

Suggested change
-if processed_text.starts_with('{') {
+let trimmed = processed_text.trim();
+if trimmed.starts_with('{') || trimmed.starts_with('[') {

Comment on lines +480 to +496
if self.configured_tool_parser.is_some()
    || self.configured_reasoning_parser.is_some()
{
    for token in [
        "<|im_end|>",
        "<|im_start|>",
        "<|im_user|>",
        "<|im_assistant|>",
        "<|im_system|>",
        "<|im_middle|>",
        "</think>",
        "[EOS]",
        "[BOS]",
    ] {
        delta = delta.replace(token, "");
    }
}

medium

Stripping leaked tokens using replace on individual chunks is fragile in a streaming context. If a token like <|im_end|> is split across two chunks (e.g., <|im_ and end|>), it will not be detected or removed. This can lead to special tokens leaking into the client-facing stream. Consider using a small buffer to handle potential cross-chunk token matches, similar to how you handle markdown fences.
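
One way to implement the suggested buffer (a sketch, not the project's code): hold back any chunk suffix that could begin a special token, and emit the rest. Tokens are assumed ASCII here.

fn strip_with_carry(chunk: &str, carry: &mut String, tokens: &[&str]) -> String {
    carry.push_str(chunk);
    let mut text = std::mem::take(carry);
    for tok in tokens {
        text = text.replace(tok, "");
    }
    // Hold back the longest text suffix that is a proper prefix of any token,
    // e.g. "<|im_" when tokens contains "<|im_end|>"; it is re-checked on the
    // next call once the rest of the token arrives.
    let hold = tokens
        .iter()
        .flat_map(|tok| (1..tok.len()).map(move |n| &tok[..n]))
        .filter(|prefix| text.ends_with(*prefix))
        .map(str::len)
        .max()
        .unwrap_or(0);
    let emit = text[..text.len() - hold].to_string();
    *carry = text[text.len() - hold..].to_string();
    emit
}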

Comment on lines +2793 to +2796
let trimmed = delta.trim();
if trimmed == "json" || trimmed == "JSON" {
    delta = String::new();
}

🔴 Important: This unconditional check drops any chunk whose trimmed content is exactly "json" or "JSON", even when it's not preceded by fence backticks. In constrained JSON streaming, the tokenizer can emit a chunk that is just json as part of a key or string value (e.g. {"json_key": 1} chunked as {", json, _key": 1}), which would corrupt the output.

This guard should only fire when fence_backticks_stripped is true — i.e., when a previous chunk already stripped the ``` and this json is the language tag following it. The cross-chunk handler below already covers "json\n", but this check catches the case where the newline hasn't arrived yet.

Suggested change
 let trimmed = delta.trim();
-if trimmed == "json" || trimmed == "JSON" {
+if *fence_backticks_stripped && (trimmed == "json" || trimmed == "JSON") {
     delta = String::new();
 }

Comment on lines +228 to +235
for token in [
    "<|im_end|>",
    "<|im_start|>",
    "<|im_user|>",
    "<|im_assistant|>",
    "<|im_system|>",
    "<|im_middle|>",
] {

🟡 Nit: The streaming path (streaming.rs:483-493) and parse_json_schema_response (chat_utils.rs) strip additional tokens "</think>", "[EOS]", "[BOS]" that are missing here. The non-streaming path could leak these tokens to the client when a model emits them in a non-streaming response. Consider adding the same set of tokens for consistency, or extracting the list to a shared constant to keep all paths in sync.

"stream": false,
"temperature": 0
});


🟡 Nit: The self-referential probe (http://127.0.0.1:{port}/v1/chat/completions) will itself be routed through the same server's middleware, concurrency limiter, etc. Under heavy load, the probe competes with real traffic for the concurrency queue slot. Consider adding a header (e.g. X-Health-Probe: true) so the concurrency limiter can exempt it, or using the gRPC client directly to bypass the HTTP layer.

Comment on lines +130 to +137
if let (Some(name), Some(old)) = (func_name, old_id) {
    let canonical = format!("functions.{name}:{index}");
    if old != canonical {
        id_rewrites.insert(old.to_string(), canonical.clone());
        if let Some(obj) = call.as_object_mut() {
            obj.insert("id".to_string(), Value::String(canonical));
        }
    }

🟡 Nit: When a conversation has multiple assistant messages each containing tool calls, index resets to 0 for each assistant message's tool_calls array. If two different assistant messages both call the same function at index 0, they'll generate the same canonical ID (functions.search:0), and the id_rewrites map will overwrite the first mapping. The second-pass rewrite of tool messages could then map an old ID to the wrong canonical ID.

For example:

  • Assistant msg 1: call_abc → functions.search:0
  • Assistant msg 2: call_xyz → functions.search:0 (overwrites call_abc → functions.search:0)

Both tool response messages will get rewritten to functions.search:0, which is technically correct for Kimi K2's format (it's per-turn), but the id_rewrites map collision means call_abc's mapping is lost. If a tool response references call_abc, it won't be found in the map and won't be rewritten, creating a mismatch.

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2d59ff4cf9


Comment on lines +241 to +244
.post(&probe_url)
.json(&probe_body)
.timeout(Duration::from_secs(60))
.send()

P1: Include auth when issuing health probe request

The new health_generate probe sends a POST to /v1/chat/completions without any Authorization header, even though that route is behind auth_middleware in this server. In deployments with api_key enabled, the self-probe will get 401/403 and health_generate will return SERVICE_UNAVAILABLE despite healthy workers, creating persistent false-negative health checks.


}

let model_id = healthy[0].model_id().to_string();
let probe_url = format!("http://127.0.0.1:{}/v1/chat/completions", state.api_port);

P1: Use correct scheme for loopback health probe URL

The probe URL is hardcoded to http://127.0.0.1:{port}, but this server can run with TLS (bind_rustls). In TLS-enabled deployments, the local port expects HTTPS, so this plain-HTTP probe fails at the transport layer and marks health_generate unhealthy regardless of backend health.


@lightseek-bot lightseek-bot changed the title Fix Kimi K2.5 function calling and structured outputs fix: Fix Kimi K2.5 function calling and structured outputs May 2, 2026
@coderabbitai Bot left a comment

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
model_gateway/tests/common/test_app.rs (1)

103-110: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't hard-code the shared test app's probe port to 0.

AppState.api_port is now a live input to the inference-probe URL built in model_gateway/src/server.rs, so Line 109 and Line 152 make any test that exercises that path probe port 0 instead of the bound test server. Please thread the real bound port into these helpers or explicitly disable that probe in test wiring.

Also applies to: 146-153

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/tests/common/test_app.rs` around lines 103 - 110, The tests
currently construct AppState with api_port: 0 which causes the inference-probe
URL in model_gateway::server.rs to point at port 0; change the test wiring to
pass the actual bound test-server port into AppState.api_port (e.g., obtain the
listener's local_addr().port() or test_server.uri().port()) when constructing
AppState (and any helper that builds it such as the shared test app builder used
at the other affected locations), or alternatively provide a flag to disable the
probe in the test-only helpers and set that flag in the test; update all places
that call AppState creation in these tests (references: AppState, api_port, the
shared test app helper used at those lines) so the probe uses the real bound
port or is explicitly disabled.
model_gateway/src/routers/grpc/utils/chat_utils.rs (1)

98-157: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep canonical Kimi tool-call IDs monotonic across the full history.

enumerate() resets to 0 for every assistant message, so a multi-turn transcript rewrites both turns to IDs like functions.foo:0. That diverges from generate_tool_call_id() below, which uses a history-wide index, and can make prior tool_call_id references ambiguous for Kimi prompts.

Suggested fix
-    for msg in messages.iter_mut() {
+    let mut next_tool_index = 0usize;
+    for msg in messages.iter_mut() {
         let role = msg.get("role").and_then(|v| v.as_str());
         if role != Some("assistant") {
             continue;
         }

         let Some(tool_calls) = msg.get_mut("tool_calls").and_then(|tc| tc.as_array_mut()) else {
             continue;
         };

-        for (index, call) in tool_calls.iter_mut().enumerate() {
+        for call in tool_calls.iter_mut() {
             // Validate arguments JSON
             if let Some(args_str) = call
                 .get("function")
                 .and_then(|f| f.get("arguments"))
                 .and_then(|a| a.as_str())
@@
             let old_id = call.get("id").and_then(|v| v.as_str());

             if let (Some(name), Some(old)) = (func_name, old_id) {
-                let canonical = format!("functions.{name}:{index}");
+                let canonical = format!("functions.{name}:{next_tool_index}");
                 if old != canonical {
                     id_rewrites.insert(old.to_string(), canonical.clone());
                     if let Some(obj) = call.as_object_mut() {
                         obj.insert("id".to_string(), Value::String(canonical));
                     }
                 }
             }
+
+            next_tool_index += 1;
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs` around lines 98 - 157,
The bug is that enumerate() restarts per assistant message so IDs like
"functions.name:0" repeat; replace the per-message enumerate with a single
history-wide counter to keep canonical Kimi tool-call IDs monotonic and match
generate_tool_call_id(). Concretely, add a mutable counter (e.g.,
tool_call_counter: usize = 0) outside the first for loop over messages,
increment it for each tool call instead of using enumerate(), use that counter
when building canonical = format!("functions.{name}:{tool_call_counter}"), and
increment tool_call_counter after assigning; ensure id_rewrites, the ID
replacement in call ("id") and the second-pass replacement of "tool_call_id"
remain unchanged.
model_gateway/src/routers/grpc/regular/streaming.rs (1)

427-451: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Configured tool parsers are still unreachable for constrained chat streaming.

On Line 428, is_specific_function already implies output_is_constrained, so this branch always routes specific-function streaming through process_specific_function_stream(). In the fallback branch, Line 448 derives effective_json_parser from separate_reasoning, but Line 209 already forced separate_reasoning = false whenever has_constrained_output() is true. So constrained chat tool calls still always pick the JSON/specific-function path and never honor an explicitly configured parser here, which regresses the parser-precedence goal. The Messages path on Lines 1879-1881 already uses the expected precedence.

Suggested direction
+ let prefer_configured_tool_parser =
+     self.configured_tool_parser.is_some() && tool_parser_available;
+
- let tool_chunks = if is_specific_function
-     && (output_is_constrained
-         || !(self.configured_tool_parser.is_some() && tool_parser_available))
+ let tool_chunks = if is_specific_function && !prefer_configured_tool_parser
 {
     Self::process_specific_function_stream(...)
 } else {
-    let effective_json_parser = used_json_schema
-        && !(separate_reasoning
-            && self.configured_tool_parser.is_some()
-            && tool_parser_available);
+    let effective_json_parser = used_json_schema && !prefer_configured_tool_parser;
     self.process_tool_calls_stream(..., effective_json_parser).await
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 427 - 451,
The current branch always sends specific-function/constrained output to
process_specific_function_stream because is_specific_function implies
output_is_constrained, which prevents honoring an explicitly configured tool
parser; change the branching so that process_specific_function_stream is only
used when no configured parser is present or available (i.e.,
is_specific_function && !(self.configured_tool_parser.is_some() &&
tool_parser_available)); likewise adjust the effective_json_parser logic to
consider self.configured_tool_parser and tool_parser_available even when
separate_reasoning is false so configured parsers take precedence; update
references in this block: process_specific_function_stream,
is_specific_function, output_is_constrained (or has_constrained_output()),
self.configured_tool_parser, tool_parser_available, effective_json_parser,
separate_reasoning, and used_json_schema.
model_gateway/src/routers/grpc/regular/responses/conversions.rs (1)

109-116: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

name is in scope but set to None — consider populating it.

In the FunctionToolCall branch, name (the function name) is destructured but then discarded. Setting name: Some(name.clone()) would make the tool message self-describing, which can help backends or middleware that inspect this field directly rather than correlating via tool_call_id.

♻️ Optional: populate name when available
 messages.push(ChatMessage::Tool {
     content: MessageContent::Text(output_text.clone()),
     tool_call_id: id.clone(),
-    name: None,
+    name: Some(name.clone()),
 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/responses/conversions.rs` around lines
109 - 116, In the FunctionToolCall branch, the tool message created as
ChatMessage::Tool currently sets name: None even though the function name was
destructured into name; change this to populate the field (e.g., name:
Some(name.clone())) so the ChatMessage::Tool includes the function name
alongside tool_call_id and output_text; update the code that builds the tool
message in the FunctionToolCall handling to use name.clone() for the name field.
model_gateway/src/routers/grpc/harmony/builder.rs (1)

1073-1099: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

LGTM — .. correctly accommodates the new name field.

Minor note: when tool_call_map lookup misses (line 1082), the fallback produces functions.<tool_call_id> instead of the actual function name. Now that name: Option<String> is available, it could serve as an intermediate fallback before tool_call_id:

♻️ Optional: use name as intermediate fallback
-    let function_name = tool_call_map
-        .get(tool_call_id)
-        .cloned()
-        .unwrap_or_else(|| tool_call_id.clone());
+    let function_name = tool_call_map
+        .get(tool_call_id)
+        .cloned()
+        .or_else(|| name.clone())
+        .unwrap_or_else(|| tool_call_id.clone());

All current callers in this PR set name: None, so behaviour is identical today — purely a defensive improvement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/harmony/builder.rs` around lines 1073 - 1099,
When handling ChatMessage::Tool in the builder (the match arm that constructs
harmony_msg), change the logic that computes function_name so it first tries
tool_call_map.get(tool_call_id).cloned(), then falls back to the ChatMessage's
name field if present, and only finally falls back to tool_call_id; reference
the ChatMessage::Tool pattern variables tool_call_id and name and the existing
tool_call_map lookup to implement this additional intermediate fallback so the
Author.name still becomes format!("functions.{function_name}") using the
resolved function_name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/protocols/src/common.rs`:
- Around line 198-258: The schema is wrong because ImageUrl derives
schemars::JsonSchema but the Deserialize impl accepts either a bare string or an
object; fix by aligning schema with deserialization—either convert ImageUrl into
an #[serde(untagged)] enum (like PromptVariable) with variants e.g.
Simple(String) and Object { url: String, detail: Option<String> } and update
usages to map to the original struct shape, or implement a custom impl
schemars::JsonSchema for ImageUrl that returns a schema using anyOf: [string
schema, object schema] so the generated OpenAPI correctly documents both
accepted wire formats (update/remove the derive and add the custom impl).

In `@crates/reasoning_parser/src/parsers/base.rs`:
- Around line 141-149: The parser only checks for the tool-call marker when
self.in_reasoning && self.config.stream_reasoning, which lets a model jump
directly from <think> to "<|tool_calls_section_begin|>" leave the parser stuck
in reasoning mode; change the logic so the
current_text.find("<|tool_calls_section_begin|>") branch runs whenever
self.in_reasoning (regardless of self.config.stream_reasoning), then extract
reasoning_text and normal_text, call self.buffer.clear(), set self.in_reasoning
= false, and return Ok(ParserResult::new(normal_text, reasoning_text))—this
ensures the parser exits reasoning mode and emits the tool-call section even
when stream_reasoning is false.

In `@crates/tokenizer/src/chat_template.rs`:
- Around line 600-607: Add a unit test that verifies the registered
raise_exception template function actually aborts rendering and propagates its
message: create a ChatTemplateState using a template string that calls "{{
raise_exception('unsupported model') }}", unwrap the parsed state, call
state.apply(...) with an empty messages slice and ChatTemplateParams::default(),
assert that apply() returns Err and that the error's string contains
"unsupported model"; this will exercise the env.add_function("raise_exception",
...) path and guard the behavior of raise_exception during rendering.

In `@crates/tool_parser/src/parsers/kimik2.rs`:
- Around line 117-118: The end_pattern used to build tool_call_end_pattern
("end_pattern" / "tool_call_end_pattern") lacks the DOTALL flag so "." won't
match newlines and multi-line JSON tool calls break find() and trigger the
fallback that clears the buffer; update end_pattern to include the (?s) flag
(e.g. prepend "(?s)" to the regex) so the Regex::new(...) for
tool_call_end_pattern can match multi-line content and avoid the erroneous
fallback.

In `@model_gateway/src/routers/grpc/common/responses/streaming.rs`:
- Around line 1061-1068: The current closure in
UnboundedReceiverStream::new(rx).map updates last_token_time for any successful
SSE chunk, which is too broad; remove the last_token_time.store(...) call from
the generic SSE wrapper (the map closure) and instead update last_token_time
only where actual model token deltas are produced (e.g., the code paths that
emit token-delta events or where token pieces are wrapped into responses),
leaving other events like response.created, in_progress, wrapper events, and
tool events unchanged; locate symbols such as last_token_time, the
UnboundedReceiverStream::new(rx).map closure, and the code that emits
token-delta events and move the timestamp update into those token-emission sites
so health fast-path reflects real token activity.

In `@model_gateway/src/routers/grpc/harmony/stages/request_building.rs`:
- Around line 114-125: The current logic computes message_hashes from the
original RequestType::Chat(req).messages, but when modified_request is present
the downstream TRT-LLM builder uses body; update the branches that set
message_hashes (both the block around RequestType::Chat at the shown location
and the similar block around lines 228-239) to resolve the effective chat body
first (use modified_request.body if Some, otherwise req.body/body equivalent)
and then call helpers::compute_and_log_message_hashes(&request_id,
&body.messages) so the hashes reflect the actual messages sent downstream; keep
the existing enable_message_hash guard and RequestType::Chat match but replace
req.messages with the resolved body.messages and preserve request_id and the
helper call.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 693-715: The function collapse_media_placeholders currently
collapses raw runs of im_token_id and can merge two adjacent image expansions
into one placeholder; fix it by changing collapse_media_placeholders to accept
image-boundary metadata (e.g., an additional parameter like image_starts:
&[bool] or a slice of ranges) and use that to reset prev_was_placeholder at each
image boundary so each original image block collapses to its own single
placeholder even when expansions are adjacent, and add a unit/regression test
that constructs two back-to-back image blocks (adjacent im_token_id runs) and
asserts they collapse to two placeholders rather than one; update any call sites
that invoke collapse_media_placeholders to pass the boundary info or a default
when not available.
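A sketch of the boundary-aware collapse described above; the image_starts parameter follows the review suggestion (the real boundary metadata may take another shape), and token IDs are simplified to u32.

/// Collapse each image's run of `im_token_id` to one placeholder, resetting
/// at every image boundary so adjacent images stay distinct.
/// `image_starts[i]` is true where token `i` begins a new image block;
/// the parameter comes from the review suggestion, not the current code.
fn collapse_media_placeholders(
    tokens: &[u32],
    im_token_id: u32,
    image_starts: &[bool],
) -> Vec<u32> {
    let mut result = Vec::with_capacity(tokens.len());
    let mut prev_was_placeholder = false;
    for (i, &tok) in tokens.iter().enumerate() {
        if image_starts.get(i).copied().unwrap_or(false) {
            // New image block: always emit one placeholder for it.
            prev_was_placeholder = false;
        }
        if tok == im_token_id {
            if !prev_was_placeholder {
                result.push(tok);
                prev_was_placeholder = true;
            }
        } else {
            result.push(tok);
            prev_was_placeholder = false;
        }
    }
    result
}

With this shape, two back-to-back image runs with image_starts set at each run's first index collapse to two placeholders instead of one, which is exactly the regression test the comment asks for.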

In `@model_gateway/src/routers/grpc/regular/processor.rs`:
- Around line 746-755: The code is calling parse_tool_calls with None for the
new tools parameter so coerce_tool_args_to_schema never sees the Messages tool
schemas; update the two call sites that pass None (the call in the block gated
by self.configured_tool_parser.is_some() && tool_parser_available and the other
at 771-779) to pass the Messages tool definitions from messages_request (or the
same place you source message schemas) instead of None, so
parse_tool_calls/coerce_tool_args_to_schema can coerce emitted tool args to the
Messages schemas; alternatively, if changing parse_tool_calls signature here is
undesirable, add a Messages-specific coercion step (invoking
coerce_tool_args_to_schema or equivalent) immediately after parse_tool_calls and
before building ContentBlock::ToolUse so tool_use.input has correct types.

In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs`:
- Around line 133-137: The tokenizer is re-applying special-token
post-processing when tokenizing the already-rendered chat template string;
change the call to tokenizer.encode(&processed_messages.text, true) in the chat
preparation step to use add_special_tokens=false (i.e., pass false as the second
argument) so the post-processor is not injected again—update the call site where
encoding is assigned (the variable encoding and the tokenizer.encode invocation)
to stop adding BOS/SEP/etc. on top of the rendered template.

In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 2788-2809: The strip_json_fence function currently uses
replace(...) which mutates every occurrence of fence markers inside the chunk;
instead only strip fences at chunk boundaries: check and remove a leading
language fence (e.g., starts_with "```json" / "```JSON" or "```json\n" variants)
and separately check and remove a trailing fence ("```" possibly
preceded/followed by newline) — do not call replace on the whole string. Update
strip_json_fence to use starts_with/ends_with and slice only the prefix/suffix
when present, preserve any internal occurrences, and keep/adjust the
fence_backticks_stripped boolean to reflect whether the chunk ended with
backticks removed so the next chunk should strip a leading language tag.
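A standalone sketch of boundary-only stripping; the real streaming version threads fence_backticks_stripped across chunks, which this simplified signature only returns.

/// Strip a leading ```json language fence and a trailing ``` fence only at
/// the chunk boundaries, leaving interior backticks untouched. Returns the
/// stripped text plus whether a trailing fence was removed, so the next
/// chunk knows to strip a leading language tag.
fn strip_json_fence(chunk: &str) -> (String, bool) {
    let mut s = chunk;
    for prefix in ["```json\n", "```JSON\n", "```json", "```JSON"] {
        if let Some(rest) = s.strip_prefix(prefix) {
            s = rest;
            break;
        }
    }
    let mut stripped_trailing = false;
    for suffix in ["\n```\n", "\n```", "```\n", "```"] {
        if let Some(rest) = s.strip_suffix(suffix) {
            s = rest;
            stripped_trailing = true;
            break;
        }
    }
    (s.to_string(), stripped_trailing)
}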

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 482-497: The current cleanup in chat_utils.rs mutates any
occurrence of framing tokens inside the JSON by calling clean.replace(...) on
processed_text; instead only remove framing/wrapping tokens that appear as
leading or trailing wrappers around the JSON payload. Change the logic around
processed_text / clean to iterate over the token list and repeatedly apply
strip_prefix and strip_suffix (or anchored regex matches) for tokens like
"<|im_end|>", "<|im_start|>", "<|im_user|>", "<|im_assistant|>",
"<|im_system|>", "<|im_middle|>", "</think>", "[EOS]", "[BOS]" so you only trim
those tokens at the ends (and then trim whitespace), rather than doing a global
replace which corrupts valid JSON fields; keep variable names processed_text and
clean and update the loop that currently calls clean.replace to use
prefix/suffix stripping instead.
- Around line 696-718: The generated TypeScript uses unquoted identifiers so
function names and object property keys like func.name and schema property names
produce invalid TS when they contain characters like hyphens; update
function_to_typescript to emit a quoted/backticked type name when func.name is
not a valid JS identifier (e.g., wrap in backticks) and update the schema-to-TS
generator (the code that builds property strings—look for the part variable and
the function schema_to_typescript / property rendering logic) to quote property
names that are not valid identifiers (wrap in double quotes) and preserve
optional markers and spacing; implement a small helper is_valid_identifier(name:
&str) and use it to decide between raw identifier or quoted/backticked form in
both places.

In `@model_gateway/src/routers/grpc/utils/parsers.rs`:
- Around line 183-192: Update the doc comment for has_constrained_output to
mention regex-based constraints as a form of "constrained decoding": state that
constrained output includes JSON formats (json_object/json_schema) and regex
patterns (ResponseFormat::Regex), and explain that when constrained decoding is
active (e.g., tool_choice required/allowed_tools with mode == "required",
response_format json_object/json_schema/regex, or a specific function), the
reasoning parser must be skipped so constrained content isn't captured as
reasoning_content; reference has_constrained_output and ResponseFormat::Regex in
the comment for clarity.

In `@model_gateway/src/server.rs`:
- Around line 177-186: The health handler is leaking internal worker details by
returning worker.url(), model_id(), and raw backend error bodies to public
callers; instead, change the branches that build and return info.join(...) and
any branch that returns upstream error bodies (also around the 268-295 region)
to log the full details locally (use your existing logging/tracing) and return a
sanitized message like "0/{workers.len()} workers healthy" or a generic
"upstream error" string with the appropriate StatusCode::SERVICE_UNAVAILABLE;
keep uses of workers.iter(), is_healthy(), model_id(), url(), and the StatusCode
branches intact but remove any inclusion of url(), model_id(), or raw body in
the HTTP response.

---

Outside diff comments:
In `@model_gateway/src/routers/grpc/harmony/builder.rs`:
- Around line 1073-1099: When handling ChatMessage::Tool in the builder (the
match arm that constructs harmony_msg), change the logic that computes
function_name so it first tries tool_call_map.get(tool_call_id).cloned(), then
falls back to the ChatMessage's name field if present, and only finally falls
back to tool_call_id; reference the ChatMessage::Tool pattern variables
tool_call_id and name and the existing tool_call_map lookup to implement this
additional intermediate fallback so the Author.name still becomes
format!("functions.{function_name}") using the resolved function_name.

In `@model_gateway/src/routers/grpc/regular/responses/conversions.rs`:
- Around line 109-116: In the FunctionToolCall branch, the tool message created
as ChatMessage::Tool currently sets name: None even though the function name was
destructured into name; change this to populate the field (e.g., name:
Some(name.clone())) so the ChatMessage::Tool includes the function name
alongside tool_call_id and output_text; update the code that builds the tool
message in the FunctionToolCall handling to use name.clone() for the name field.

In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 427-451: The current branch always sends
specific-function/constrained output to process_specific_function_stream because
is_specific_function implies output_is_constrained, which prevents honoring an
explicitly configured tool parser; change the branching so that
process_specific_function_stream is only used when no configured parser is
present or available (i.e., is_specific_function &&
!(self.configured_tool_parser.is_some() && tool_parser_available)); likewise
adjust the effective_json_parser logic to consider self.configured_tool_parser
and tool_parser_available even when separate_reasoning is false so configured
parsers take precedence; update references in this block:
process_specific_function_stream, is_specific_function, output_is_constrained
(or has_constrained_output()), self.configured_tool_parser,
tool_parser_available, effective_json_parser, separate_reasoning, and
used_json_schema.
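The adjusted gate, reduced to a pure function over the flags named above (a sketch, not the actual method):

// Only take the specific-function path when no configured parser can
// handle the stream; otherwise the configured parser takes precedence.
fn use_specific_function_stream(
    is_specific_function: bool,
    configured_tool_parser_set: bool,
    tool_parser_available: bool,
) -> bool {
    is_specific_function && !(configured_tool_parser_set && tool_parser_available)
}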

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 98-157: The bug is that enumerate() restarts per assistant message
so IDs like "functions.name:0" repeat; replace the per-message enumerate with a
single history-wide counter to keep canonical Kimi tool-call IDs monotonic and
match generate_tool_call_id(). Concretely, add a mutable counter (e.g.,
tool_call_counter: usize = 0) outside the first for loop over messages,
increment it for each tool call instead of using enumerate(), use that counter
when building canonical = format!("functions.{name}:{tool_call_counter}"), and
increment tool_call_counter after assigning; ensure id_rewrites, the ID
replacement in call ("id") and the second-pass replacement of "tool_call_id"
remain unchanged.
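A sketch with ToolCall as a stand-in type; the point is the single counter that survives across messages:

use std::collections::HashMap;

struct ToolCall { id: String, name: String } // stand-in for the real type

// Canonical Kimi IDs must be monotonic across the whole history, matching
// generate_tool_call_id(); a per-message enumerate() resets to 0 each turn.
fn build_id_rewrites(history: &[Vec<ToolCall>]) -> HashMap<String, String> {
    let mut id_rewrites = HashMap::new();
    let mut tool_call_counter: usize = 0;
    for calls in history {
        for call in calls {
            let canonical = format!("functions.{}:{}", call.name, tool_call_counter);
            id_rewrites.insert(call.id.clone(), canonical);
            tool_call_counter += 1;
        }
    }
    id_rewrites
}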

In `@model_gateway/tests/common/test_app.rs`:
- Around line 103-110: The tests currently construct AppState with api_port: 0
which causes the inference-probe URL in model_gateway::server.rs to point at
port 0; change the test wiring to pass the actual bound test-server port into
AppState.api_port (e.g., obtain the listener's local_addr().port() or
test_server.uri().port()) when constructing AppState (and any helper that builds
it such as the shared test app builder used at the other affected locations), or
alternatively provide a flag to disable the probe in the test-only helpers and
set that flag in the test; update all places that call AppState creation in
these tests (references: AppState, api_port, the shared test app helper used at
those lines) so the probe uses the real bound port or is explicitly disabled.
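A sketch of the bind-first approach; AppState wiring is omitted and only the port acquisition is shown.

use std::net::TcpListener;

// Bind to an ephemeral port and report the real one, instead of
// hard-coding api_port: 0 in AppState. In practice the listener itself
// should be handed to the test server to avoid a rebind race.
fn bound_test_port() -> u16 {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind test listener");
    listener.local_addr().expect("local addr").port()
}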

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ec3bc316-8dc4-4a73-9a22-cb585f9aca18

📥 Commits

Reviewing files that changed from the base of the PR and between a41aa6c and 2d59ff4.

📒 Files selected for processing (61)
  • bindings/python/src/lib.rs
  • bindings/python/src/smg/router_args.py
  • crates/grpc_client/proto/trtllm_service.proto
  • crates/grpc_client/src/sglang_scheduler.rs
  • crates/grpc_client/src/trtllm_service.rs
  • crates/grpc_client/src/vllm_engine.rs
  • crates/multimodal/src/registry/kimi_k25.rs
  • crates/multimodal/src/registry/mod.rs
  • crates/protocols/src/chat.rs
  • crates/protocols/src/common.rs
  • crates/protocols/src/completion.rs
  • crates/protocols/src/messages.rs
  • crates/reasoning_parser/src/parsers/base.rs
  • crates/tokenizer/src/chat_template.rs
  • crates/tokenizer/src/huggingface.rs
  • crates/tokenizer/src/mock.rs
  • crates/tokenizer/src/tiktoken.rs
  • crates/tokenizer/src/traits.rs
  • crates/tool_parser/src/parsers/kimik2.rs
  • model_gateway/benches/request_processing.rs
  • model_gateway/benches/wasm_middleware_latency.rs
  • model_gateway/src/app_context.rs
  • model_gateway/src/config/builder.rs
  • model_gateway/src/config/types.rs
  • model_gateway/src/main.rs
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/common/responses/mod.rs
  • model_gateway/src/routers/grpc/common/responses/streaming.rs
  • model_gateway/src/routers/grpc/common/stages/helpers.rs
  • model_gateway/src/routers/grpc/harmony/builder.rs
  • model_gateway/src/routers/grpc/harmony/stages/preparation.rs
  • model_gateway/src/routers/grpc/harmony/stages/request_building.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/pd_router.rs
  • model_gateway/src/routers/grpc/pipeline.rs
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/responses/conversions.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/generate/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/request_building.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs
  • model_gateway/src/routers/grpc/router.rs
  • model_gateway/src/routers/grpc/utils/chat_utils.rs
  • model_gateway/src/routers/grpc/utils/message_utils.rs
  • model_gateway/src/routers/grpc/utils/mod.rs
  • model_gateway/src/routers/grpc/utils/parsers.rs
  • model_gateway/src/routers/openai/chat.rs
  • model_gateway/src/routers/openai/context.rs
  • model_gateway/src/routers/openai/router.rs
  • model_gateway/src/server.rs
  • model_gateway/src/service_discovery.rs
  • model_gateway/src/worker/monitor.rs
  • model_gateway/tests/api/api_endpoints_test.rs
  • model_gateway/tests/common/test_app.rs
  • model_gateway/tests/routing/test_openai_routing.rs
  • model_gateway/tests/spec/chat_message.rs
  • model_gateway/tests/wasm_test.rs

Comment on lines +198 to +258
#[derive(Debug, Clone, PartialEq, schemars::JsonSchema)]
pub struct ImageUrl {
    pub url: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub detail: Option<String>, // "auto", "low", or "high"
}

impl Serialize for ImageUrl {
    fn serialize<S: serde::Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        use serde::ser::SerializeStruct;
        let field_count = 1 + self.detail.is_some() as usize;
        let mut state = serializer.serialize_struct("ImageUrl", field_count)?;
        state.serialize_field("url", &self.url)?;
        if let Some(ref detail) = self.detail {
            state.serialize_field("detail", detail)?;
        }
        state.end()
    }
}

impl<'de> Deserialize<'de> for ImageUrl {
    fn deserialize<D: serde::Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        use serde::de;

        struct ImageUrlVisitor;

        impl<'de> de::Visitor<'de> for ImageUrlVisitor {
            type Value = ImageUrl;

            fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
                formatter.write_str("a string URL or an object with url field")
            }

            fn visit_str<E: de::Error>(self, v: &str) -> Result<ImageUrl, E> {
                Ok(ImageUrl {
                    url: v.to_string(),
                    detail: None,
                })
            }

            fn visit_map<M: de::MapAccess<'de>>(self, mut map: M) -> Result<ImageUrl, M::Error> {
                let mut url = None;
                let mut detail = None;
                while let Some(key) = map.next_key::<String>()? {
                    match key.as_str() {
                        "url" => url = Some(map.next_value()?),
                        "detail" => detail = map.next_value()?,
                        _ => {
                            let _ = map.next_value::<de::IgnoredAny>()?;
                        }
                    }
                }
                Ok(ImageUrl {
                    url: url.ok_or_else(|| de::Error::missing_field("url"))?,
                    detail,
                })
            }
        }

        deserializer.deserialize_any(ImageUrlVisitor)
    }
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's verify the file exists and check the actual content at lines 198-258
head -n 258 crates/protocols/src/common.rs | tail -n 61

Repository: lightseekorg/smg

Length of output: 2224


🏁 Script executed:

# Check if there's a custom JsonSchema implementation for ImageUrl
rg "ImageUrl" crates/protocols/src/common.rs -A 5 -B 5

Repository: lightseekorg/smg

Length of output: 2519


🏁 Script executed:

# Look for any schema-related code or schemars usage with ImageUrl
rg "json_schema|JsonSchema" crates/protocols/src/common.rs -B 2 -A 2

Repository: lightseekorg/smg

Length of output: 7624


ImageUrl's generated schema does not match its accepted wire format.

The custom Deserialize impl accepts either a bare string or an object, but ImageUrl derives schemars::JsonSchema without a custom implementation. The derived schema documents only the object shape, causing OpenAPI/SDK validators to reject the bare string format even though deserialization accepts it.

Add a custom JsonSchema impl for ImageUrl that documents both accepted formats (e.g., anyOf: [string, object]), or switch to an #[serde(untagged)] enum pattern like PromptVariable to ensure the schema stays aligned with the actual API.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/protocols/src/common.rs` around lines 198 - 258, The schema is wrong
because ImageUrl derives schemars::JsonSchema but the Deserialize impl accepts
either a bare string or an object; fix by aligning schema with
deserialization—either convert ImageUrl into an #[serde(untagged)] enum (like
PromptVariable) with variants e.g. Simple(String) and Object { url: String,
detail: Option<String> } and update usages to map to the original struct shape,
or implement a custom impl schemars::JsonSchema for ImageUrl that returns a
schema using anyOf: [string schema, object schema] so the generated OpenAPI
correctly documents both accepted wire formats (update/remove the derive and add
the custom impl).
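
A sketch of the untagged-enum route, mirroring the PromptVariable pattern the comment mentions; ImageUrlInput and into_parts are illustrative names, not code in the PR.

use serde::{Deserialize, Serialize};

// Illustrative sketch: with #[serde(untagged)], the derived schemars
// schema (anyOf over both variants) stays aligned with the wire format.
#[derive(Debug, Clone, Serialize, Deserialize, schemars::JsonSchema)]
#[serde(untagged)]
pub enum ImageUrlInput {
    Simple(String),
    Object {
        url: String,
        #[serde(skip_serializing_if = "Option::is_none")]
        detail: Option<String>,
    },
}

impl ImageUrlInput {
    // Convert to the (url, detail) pair the existing struct carries.
    fn into_parts(self) -> (String, Option<String>) {
        match self {
            ImageUrlInput::Simple(url) => (url, None),
            ImageUrlInput::Object { url, detail } => (url, detail),
        }
    }
}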

Comment on lines 141 to +149
        if self.in_reasoning && self.config.stream_reasoning {
            // Some models skip </think> and go straight to tool calls
            if let Some(tool_pos) = current_text.find("<|tool_calls_section_begin|>") {
                let reasoning_text = current_text[..tool_pos].trim().to_string();
                let normal_text = current_text[tool_pos..].to_string();
                self.buffer.clear();
                self.in_reasoning = false;
                return Ok(ParserResult::new(normal_text, reasoning_text));
            }

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle the tool-call marker even when stream_reasoning is off.

Right now this exit path only runs when self.config.stream_reasoning is true. If a model jumps from <think> straight to <|tool_calls_section_begin|> while reasoning streaming is disabled, the parser never leaves reasoning mode here, so the tool-call section is buffered instead of emitted.

Suggested fix
-        // Continue with reasoning content
-        if self.in_reasoning && self.config.stream_reasoning {
-            // Some models skip </think> and go straight to tool calls
-            if let Some(tool_pos) = current_text.find("<|tool_calls_section_begin|>") {
-                let reasoning_text = current_text[..tool_pos].trim().to_string();
-                let normal_text = current_text[tool_pos..].to_string();
-                self.buffer.clear();
-                self.in_reasoning = false;
-                return Ok(ParserResult::new(normal_text, reasoning_text));
-            }
+        // Some models skip </think> and go straight to tool calls.
+        // This transition must work regardless of whether reasoning itself is streamed.
+        if self.in_reasoning {
+            if let Some(tool_pos) = current_text.find("<|tool_calls_section_begin|>") {
+                let reasoning_text = current_text[..tool_pos].trim().to_string();
+                let normal_text = current_text[tool_pos..].to_string();
+                self.buffer.clear();
+                self.in_reasoning = false;
+                return Ok(ParserResult::new(normal_text, reasoning_text));
+            }
+        }
+
+        // Continue with reasoning content
+        if self.in_reasoning && self.config.stream_reasoning {
             // Stream the content immediately
             let reasoning_text = current_text;
             self.buffer.clear();
             Ok(ParserResult::reasoning(reasoning_text))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/reasoning_parser/src/parsers/base.rs` around lines 141 - 149, The
parser only checks for the tool-call marker when self.in_reasoning &&
self.config.stream_reasoning, which lets a model jump directly from <think> to
"<|tool_calls_section_begin|>" leave the parser stuck in reasoning mode; change
the logic so the current_text.find("<|tool_calls_section_begin|>") branch runs
whenever self.in_reasoning (regardless of self.config.stream_reasoning), then
extract reasoning_text and normal_text, call self.buffer.clear(), set
self.in_reasoning = false, and return Ok(ParserResult::new(normal_text,
reasoning_text))—this ensures the parser exits reasoning mode and emits the
tool-call section even when stream_reasoning is false.

Comment on lines +600 to +607
    // HuggingFace's Jinja2 environment registers `raise_exception` as a global
    // callable so templates can do `{{ raise_exception("msg") }}`.
    env.add_function(
        "raise_exception",
        |msg: String| -> Result<String, minijinja::Error> {
            Err(MinijinjaError::new(ErrorKind::InvalidOperation, msg))
        },
    );

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider adding a unit test for raise_exception template invocation.

The implementation correctly mirrors HuggingFace's pattern — fn name(arg: String) -> Result<String, Error> registered via env.add_function is the canonical minijinja 2.x approach, and the error kind choice is appropriate. However, no test exercises this new path; a simple case would lock in the behavior and guard against regressions:

✅ Suggested test
#[test]
fn test_raise_exception_halts_rendering() {
    // Template that unconditionally calls raise_exception
    let result = ChatTemplateState::new(Some(
        "{{ raise_exception('unsupported model') }}".to_string(),
    ));
    // Template itself parses fine; the error surfaces at render time
    let state = result.unwrap();
    let err = state.apply(&[], ChatTemplateParams::default()).unwrap_err();
    assert!(
        err.to_string().contains("unsupported model"),
        "Error should propagate the raise_exception message, got: {err}"
    );
}

Comment on lines +117 to 118
        let end_pattern = r"<\|tool_call_begin\|>.*?(?:<\|tool_call_end\|>|<\|func_end\|>)";
        let tool_call_end_pattern = Regex::new(end_pattern).expect("Valid regex pattern");

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file at the specified lines and surrounding context
cat -n crates/tool_parser/src/parsers/kimik2.rs | sed -n '100,130p'

Repository: lightseekorg/smg

Length of output: 1980


🏁 Script executed:

# Now let's look at the full context where extractors are defined
cat -n crates/tool_parser/src/parsers/kimik2.rs | sed -n '100,150p'

Repository: lightseekorg/smg

Length of output: 2764


🏁 Script executed:

# Check if there are extractor patterns with (?s) mentioned in the review
rg '(?s)' crates/tool_parser/src/parsers/kimik2.rs -n -A 2 -B 2

Repository: lightseekorg/smg

Length of output: 18750


🏁 Script executed:

# Look for the buffer clearing logic mentioned
cat -n crates/tool_parser/src/parsers/kimik2.rs | sed -n '290,360p'

Repository: lightseekorg/smg

Length of output: 4299


Add (?s) to the completed-tool regex to support multi-line JSON.

The extractors at lines 110 and 113 now accept multi-line JSON via (?s), but end_pattern at line 117 omits this flag. When a tool call contains pretty-printed arguments (newlines), the stream extractor matches it as complete, but tool_call_end_pattern.find() at line 347 fails because . doesn't span newlines. The fallback at line 351 then clears the entire buffer, dropping any trailing content from the chunk.

Suggested fix
-        let end_pattern = r"<\|tool_call_begin\|>.*?(?:<\|tool_call_end\|>|<\|func_end\|>)";
+        let end_pattern =
+            r"(?s)<\|tool_call_begin\|>.*?(?:<\|tool_call_end\|>|<\|func_end\|>)";
📝 Committable suggestion


Suggested change
-        let end_pattern = r"<\|tool_call_begin\|>.*?(?:<\|tool_call_end\|>|<\|func_end\|>)";
+        let end_pattern =
+            r"(?s)<\|tool_call_begin\|>.*?(?:<\|tool_call_end\|>|<\|func_end\|>)";
         let tool_call_end_pattern = Regex::new(end_pattern).expect("Valid regex pattern");

Comment on lines +482 to +497
// Strip chatml / special tokens that may trail the constrained JSON output
let mut clean = processed_text.to_string();
for token in [
"<|im_end|>",
"<|im_start|>",
"<|im_user|>",
"<|im_assistant|>",
"<|im_system|>",
"<|im_middle|>",
"</think>",
"[EOS]",
"[BOS]",
] {
clean = clean.replace(token, "");
}
let clean = clean.trim();

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only trim framing tokens around the JSON payload.

This replaces markers everywhere in the response, so valid constrained output like {"text":"[EOS]"} or {"marker":"<|im_end|>"} gets silently mutated before parsing. The cleanup needs to strip wrappers/trailing tokens, not rewrite string content inside the JSON itself.
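
A sketch of end-anchored trimming over the same token list; the loop repeats until a pass removes nothing, so stacked wrappers still come off while interior occurrences survive.

const FRAMING_TOKENS: &[&str] = &[
    "<|im_end|>", "<|im_start|>", "<|im_user|>", "<|im_assistant|>",
    "<|im_system|>", "<|im_middle|>", "</think>", "[EOS]", "[BOS]",
];

// Trim framing tokens (and whitespace) only at the ends of the payload,
// never inside it, so JSON values like {"text":"[EOS]"} survive intact.
fn trim_framing_tokens(processed_text: &str) -> &str {
    let mut clean = processed_text.trim();
    loop {
        let before = clean;
        for token in FRAMING_TOKENS {
            if let Some(rest) = clean.strip_prefix(token) {
                clean = rest.trim_start();
            }
            if let Some(rest) = clean.strip_suffix(token) {
                clean = rest.trim_end();
            }
        }
        if clean == before {
            return clean;
        }
    }
}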


Comment on lines +696 to +718
fn function_to_typescript(func: &openai_protocol::common::Function) -> String {
    let mut out = String::new();

    // Description comment
    if let Some(ref desc) = func.description {
        for line in desc.lines() {
            if line.is_empty() {
                out.push('\n');
            } else {
                out.push_str(&format!("// {line}\n"));
            }
        }
    }

    // Parameters
    let params_str =
        if func.parameters.is_null() || func.parameters == Value::Object(Default::default()) {
            "{}".to_string()
        } else {
            schema_to_typescript(&func.parameters, "", &[])
        };

    out.push_str(&format!("type {} = (_: {params_str}) => any;", func.name));

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n model_gateway/src/routers/grpc/utils/chat_utils.rs | sed -n '690,725p'

Repository: lightseekorg/smg

Length of output: 1491


🏁 Script executed:

cat -n model_gateway/src/routers/grpc/utils/chat_utils.rs | sed -n '745,820p'

Repository: lightseekorg/smg

Length of output: 2978


🏁 Script executed:

cd model_gateway && grep -n "schema_to_typescript" src/routers/grpc/utils/chat_utils.rs | head -20

Repository: lightseekorg/smg

Length of output: 446


🏁 Script executed:

grep -n "function_to_typescript\|tools_ts_str" model_gateway/src/routers/grpc/utils/chat_utils.rs | head -20

Repository: lightseekorg/smg

Length of output: 444


🏁 Script executed:

grep -rn "execute-tool\|user-id" model_gateway/src/routers/grpc/utils/ --include="*.rs" | head -10

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

grep -n "quote\|escape\|identifier" model_gateway/src/routers/grpc/utils/chat_utils.rs

Repository: lightseekorg/smg

Length of output: 185


🏁 Script executed:

cat -n model_gateway/src/routers/grpc/utils/chat_utils.rs | sed -n '330,355p'

Repository: lightseekorg/smg

Length of output: 1379


🏁 Script executed:

cat -n model_gateway/src/routers/grpc/utils/chat_utils.rs | sed -n '675,695p'

Repository: lightseekorg/smg

Length of output: 720


🏁 Script executed:

# Check if there's any TypeScript quoting/escaping logic
grep -A 5 -B 5 "type.*=" model_gateway/src/routers/grpc/utils/chat_utils.rs | head -40

Repository: lightseekorg/smg

Length of output: 1563


🏁 Script executed:

grep -rn "execute-tool\|user-id\|-" model_gateway/src/routers/grpc/utils/chat_utils.rs | grep -E "(name|property|Function)" | head -20

Repository: lightseekorg/smg

Length of output: 252


🏁 Script executed:

# Look for any tests that might show what happens with invalid identifiers
find model_gateway -name "*.rs" -type f -exec grep -l "tools_to_typescript\|function_to_typescript" {} \;

Repository: lightseekorg/smg

Length of output: 111


🏁 Script executed:

# Check if TypeScript quoting is done anywhere in the codebase
rg "\".*\".*:" model_gateway/src/routers/grpc/utils/chat_utils.rs | head -10

Repository: lightseekorg/smg

Length of output: 729


🏁 Script executed:

# Search for actual tool definitions with hyphens to confirm this is a real issue
find model_gateway -name "*.rs" -o -name "*.json" | xargs grep -l "execute-tool\|user-id" 2>/dev/null | head -10

Repository: lightseekorg/smg

Length of output: 108


🏁 Script executed:

# Check OpenAI protocol to see if hyphens are allowed in function/property names
find model_gateway -path "*/openai_protocol*" -name "*.rs" | head -5

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# Let's verify what valid TypeScript looks like with property names
cat > /tmp/test_ts.ts << 'EOF'
// Invalid - hyphens in type name are not valid identifiers
// type execute-tool = (_: {}) => any;

// Valid - quoted property names with hyphens
type executeToolParams = {
  "execute-tool": string;
  "user-id": string;
};

// This is what the current code would generate for a function named "execute-tool":
// type execute-tool = (_: {}) => any;  // INVALID!

// This is what it SHOULD generate:
// type ExecuteTool = (_: {}) => any;  // or use backticks
// type `execute-tool` = (_: {}) => any;  // Valid in TypeScript!
EOF
cat /tmp/test_ts.ts

Repository: lightseekorg/smg

Length of output: 583


🏁 Script executed:

# Check if the code handles any special characters or quoting at all
grep -n "push_str\|format!" model_gateway/src/routers/grpc/utils/chat_utils.rs | grep -E "(name|property)" | head -20

Repository: lightseekorg/smg

Length of output: 378


Quote or backtick tool and field names that are not valid identifiers so the generated TypeScript stays well-formed.

func.name and object property names are inserted directly without quoting. Tool names like execute-tool or schema properties like user-id generate invalid TypeScript (e.g., type execute-tool = ...; and unquoted user-id: string). These are valid in OpenAI schemas but produce malformed declarations that break Kimi-style templates.

Example fixes

Line 718: Use backticks for type names containing non-identifier characters:

out.push_str(&format!("type `{}` = (_: {params_str}) => any;", func.name));

Line 811: Quote property names containing non-identifier characters:

part.push_str(&format!("{child_indent}\"{name}\"{opt_marker}: {type_str}"));
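
A sketch of the is_valid_identifier helper the comment calls for, plus the two quoting decisions; note that backticked type names are not strictly legal TypeScript, but they are an unambiguous convention for prompt-template rendering.

// Rough ASCII-only check for a valid JS identifier; deliberately stricter
// than the full ECMAScript grammar, so it errs on the side of quoting.
fn is_valid_identifier(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || c == '$' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '$')
}

// Type position: backtick-wrap non-identifiers (a template convention).
fn ts_type_name(name: &str) -> String {
    if is_valid_identifier(name) { name.to_string() } else { format!("`{name}`") }
}

// Property position: double-quote non-identifiers, as in plain TS.
fn ts_property_name(name: &str) -> String {
    if is_valid_identifier(name) { name.to_string() } else { format!("\"{name}\"") }
}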

Comment on lines +183 to +192
/// Returns `true` when constrained decoding is active, meaning the model output
/// is structured JSON rather than free-form text. In that case the reasoning
/// parser must be skipped — otherwise it captures the constrained JSON as
/// `reasoning_content` and leaves `content` empty.
///
/// Constrained decoding is triggered by:
/// - `tool_choice` = a specific function, `required`, or `allowed_tools` with
/// `mode == "required"`
/// - `response_format` = `json_object` or `json_schema`
pub(crate) fn has_constrained_output(

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Update doc comments to reflect regex-constrained output.

The helper now treats ResponseFormat::Regex as constrained (Line 208), but the docs still describe only JSON constraints. Keeping this comment accurate avoids future regressions.

Proposed doc-only diff
-/// Returns `true` when constrained decoding is active, meaning the model output
-/// is structured JSON rather than free-form text. In that case the reasoning
+/// Returns `true` when constrained decoding is active, meaning output is
+/// constrained by tool/format rules rather than free-form text. In that case the reasoning
 /// parser must be skipped — otherwise it captures the constrained JSON as
 /// `reasoning_content` and leaves `content` empty.
 ///
 /// Constrained decoding is triggered by:
 /// - `tool_choice` = a specific function, `required`, or `allowed_tools` with
 ///   `mode == "required"`
-/// - `response_format` = `json_object` or `json_schema`
+/// - `response_format` = `json_object`, `json_schema`, or `regex`
📝 Committable suggestion


Suggested change
-/// Returns `true` when constrained decoding is active, meaning the model output
-/// is structured JSON rather than free-form text. In that case the reasoning
+/// Returns `true` when constrained decoding is active, meaning output is
+/// constrained by tool/format rules rather than free-form text. In that case the reasoning
 /// parser must be skipped — otherwise it captures the constrained JSON as
 /// `reasoning_content` and leaves `content` empty.
 ///
 /// Constrained decoding is triggered by:
 /// - `tool_choice` = a specific function, `required`, or `allowed_tools` with
 ///   `mode == "required"`
-/// - `response_format` = `json_object` or `json_schema`
+/// - `response_format` = `json_object`, `json_schema`, or `regex`
 pub(crate) fn has_constrained_output(

Comment on lines +177 to +186
    let healthy: Vec<_> = workers.iter().filter(|w| w.is_healthy()).collect();
    if healthy.is_empty() {
        let info: Vec<_> = workers
            .iter()
            .map(|w| format!("{} ({})", w.model_id(), w.url()))
            .collect();
        return (
            StatusCode::SERVICE_UNAVAILABLE,
            format!("0/{} workers healthy: {}", workers.len(), info.join(", ")),
        )

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't leak internal worker details from a public health endpoint.

/health_generate is public, but these branches return worker URLs and raw backend error bodies to any caller. That exposes internal topology and upstream diagnostics. Keep the detailed data in logs and return a sanitized status message here.

Also applies to: 268-295
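
A sketch of the sanitized shape, with WorkerInfo standing in for the real worker handle and tracing assumed for the local log:

use axum::http::StatusCode;

// `WorkerInfo` is a stand-in for the real worker handle.
struct WorkerInfo { healthy: bool, model_id: String, url: String }

fn health_summary(workers: &[WorkerInfo]) -> (StatusCode, String) {
    let healthy = workers.iter().filter(|w| w.healthy).count();
    if healthy == 0 {
        // Keep the sensitive detail in local logs...
        for w in workers {
            tracing::warn!(model = %w.model_id, url = %w.url, "worker unhealthy");
        }
        // ...and return only a sanitized summary to the public caller.
        return (
            StatusCode::SERVICE_UNAVAILABLE,
            format!("0/{} workers healthy", workers.len()),
        );
    }
    (StatusCode::OK, format!("{healthy}/{} workers healthy", workers.len()))
}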


Comment on lines +219 to +245
    let model_id = healthy[0].model_id().to_string();
    let probe_url = format!("http://127.0.0.1:{}/v1/chat/completions", state.api_port);
    let probe_body = serde_json::json!({
        "model": model_id,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
        "stream": false,
        "temperature": 0
    });

    info!(
        target: "smg::health",
        model = %model_id,
        probe_url = %probe_url,
        max_tokens = 1,
        "health_generate: sending real inference probe"
    );

    let start = Instant::now();
    let probe_result = state
        .context
        .client
        .post(&probe_url)
        .json(&probe_body)
        .timeout(Duration::from_secs(60))
        .send()
        .await;

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The self-probe bypasses real server configuration.

This loopback request hits /v1/chat/completions without auth headers, hard-codes http://127.0.0.1 even when startup later binds TLS, and assumes the first healthy worker can answer chat. Otherwise-healthy deployments that enable auth, serve HTTPS, or host non-chat models will start failing /health_generate once the fast-path expires.
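
One possible direction, sketched with an illustrative ProbeConfig rather than the actual AppState fields: derive the scheme and timeout from configuration instead of hard-coding them.

use std::time::Duration;

// Illustrative config shape; the real AppState fields may differ.
struct ProbeConfig {
    tls_enabled: bool,
    api_port: u16,
    health_generate_timeout_secs: u64,
}

// Build the probe target from configuration so TLS deployments and the
// --health-generate-timeout-secs flag are honored.
fn probe_target(cfg: &ProbeConfig) -> (String, Duration) {
    let scheme = if cfg.tls_enabled { "https" } else { "http" };
    (
        format!("{scheme}://127.0.0.1:{}/v1/chat/completions", cfg.api_port),
        Duration::from_secs(cfg.health_generate_timeout_secs),
    )
}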

@gongwei-130 force-pushed the wei/kimi-k25-consolidate branch from fe53444 to 3c42c86 on May 2, 2026 at 01:28

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c42c86c2e


Comment on lines +746 to +750
        if self.configured_tool_parser.is_some() && tool_parser_available {
            (tool_calls, processed_text) = self
                .parse_tool_calls(
                    &processed_text,
                    &messages_request.model,

P1: Fall back to JSON-schema parsing for Messages tool choices

When configured_tool_parser is set, this branch always runs before the used_json_schema path for Messages API requests, so ToolChoice::Tool/ToolChoice::Any outputs that are constrained JSON are first fed to the structural-tag parser and can come back as tool_calls = None. Because this is an if/else if chain, the JSON-schema parser is then skipped, and the response is emitted as plain text instead of a tool call. This regresses constrained tool calling for Messages on models with configured parsers.


Comment on lines +704 to +707
        if tok == im_token_id {
            if !prev_was_placeholder {
                result.push(tok);
            }

P2: Preserve separate image placeholders when collapsing TRT tokens

This collapse logic removes every consecutive im_token_id after the first, which also merges placeholders from distinct adjacent images (for example, two image parts with no intervening text). In that case TRT receives a single placeholder for multiple images, so placeholder count no longer matches multimodal inputs and can misalign or fail image binding.


Comment on lines 746 to 781
@@ -655,12 +775,28 @@ impl ResponseProcessor {
utils::message_utils::get_history_tool_calls_count_messages(
&messages_request,
),
None,
)
.await;
}

🔴 Important: The Messages API path uses an if / else if / else if chain, but the Chat API path (lines 171–204 above) was refactored to use independent if blocks with tool_calls.is_none() guards. This means the Messages path is missing the fallback behavior.

When all three conditions are true (configured_tool_parser.is_some(), tool_parser_available, and used_json_schema), the first branch runs the configured parser. If it returns tool_calls = None (e.g., model emitted constrained JSON instead of native tool-call tokens), the else if used_json_schema branch is skipped because the first if already matched. The json_schema fallback never runs, and the tool call is silently lost.

The Chat API path handles this correctly with independent if blocks:

if self.configured_tool_parser.is_some() && tool_parser_available && !output_is_constrained {
    (tool_calls, processed_text) = self.parse_tool_calls(...).await;
}
if tool_calls.is_none() && used_json_schema {
    (tool_calls, processed_text) = utils::parse_json_schema_response(...);
}
if tool_calls.is_none() && tool_parser_available {
    (tool_calls, processed_text) = self.parse_tool_calls(...).await;
}

This Messages path should use the same pattern.

Comment on lines +586 to +592
ThinkingConfig::Enabled { .. } => {
    kwargs
        .entry("enable_thinking".to_string())
        .or_insert(Value::Bool(true));
    kwargs
        .entry("thinking".to_string())
        .or_insert(Value::Bool(true));

🟡 Nit: ThinkingConfig::Enabled { .. } silently drops the budget_tokens field. Users who send {"type": "enabled", "budget_tokens": 4096} will get thinking enabled but their budget constraint ignored. Consider passing budget_tokens through as a kwarg (e.g., "budget_tokens") so chat templates or downstream engines that support it can respect the value.

Suggested change
-ThinkingConfig::Enabled { .. } => {
+ThinkingConfig::Enabled { budget_tokens } => {
     kwargs
         .entry("enable_thinking".to_string())
         .or_insert(Value::Bool(true));
     kwargs
         .entry("thinking".to_string())
         .or_insert(Value::Bool(true));
+    kwargs
+        .entry("budget_tokens".to_string())
+        .or_insert(Value::Number((*budget_tokens).into()));

Comment on lines +73 to +84
fn chat_message_text_content(msg: &ChatMessage) -> String {
    match msg {
        ChatMessage::System { content, .. }
        | ChatMessage::User { content, .. }
        | ChatMessage::Tool { content, .. }
        | ChatMessage::Developer { content, .. } => content.to_simple_string(),
        ChatMessage::Assistant { content, .. } => content
            .as_ref()
            .map_or_else(String::new, |c| c.to_simple_string()),
        ChatMessage::Function { content, .. } => content.clone(),
    }
}

🟡 Nit: Assistant messages with tool_calls but no text content all hash to the same value (sha256("assistant\x00")), because tool_calls and reasoning_content are ignored by this extractor. In multi-turn conversations with several tool-calling turns, every assistant tool-call message produces an identical hash, making session reconstruction ambiguous for exactly the messages that matter most in agentic workflows.

If this needs to match TRT-LLM's openai_server.py format exactly, this is correct by spec — but worth a comment noting the limitation.


Labels

  • benchmarks: Benchmark changes
  • grpc: gRPC client and router changes
  • model-gateway: Model gateway crate changes
  • multimodal: Multimodal crate changes
  • openai: OpenAI router changes
  • protocols: Protocols crate changes
  • python-bindings: Python bindings changes
  • reasoning-parser: Reasoning parser changes
  • tests: Test changes
  • tokenizer: Tokenizer related changes
  • tool-parser: Tool/function call parser changes
