
feat(mlx-grpc): support string stop sequences for chat and completion #1447

Open

zach-li-sudo wants to merge 8 commits into lightseekorg:main from zach-li-sudo:mlx-grpc

Conversation

@zach-li-sudo zach-li-sudo commented May 5, 2026

Description

Follow-up to #1099, adding support for chat/completion requests that carry the stop field.

curl http://localhost:3000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
     "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
     "stop": ["6"],
     "stream": false,
     "max_tokens": 100
   }'

Problem

The MLX backend does not natively accept string stop sequences for chat and completion requests; its proto only carries stop token IDs.

Solution

Convert stop strings to stop token IDs in the gateway before passing the request to the MLX backend.
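
For reference, a minimal Rust sketch of that conversion, assuming a small tokenizer trait with an encode method (the real helper is stop_strings_to_token_ids in chat_utils.rs, whose exact signature differs):

pub trait Tokenizer {
    fn encode(&self, text: &str) -> Result<Vec<u32>, String>;
}

fn stop_strings_to_token_ids<'a>(
    stops: impl IntoIterator<Item = &'a str>,
    tokenizer: &dyn Tokenizer,
) -> Result<Vec<u32>, String> {
    let mut ids = Vec::new();
    for s in stops {
        match tokenizer.encode(s) {
            // Exactly one token: usable directly as an MLX stop_token_id.
            Ok(t) if t.len() == 1 => ids.push(t[0]),
            // Zero tokens: warn-and-skip, per the behavior discussed in review below.
            Ok(t) if t.is_empty() => {
                eprintln!("stop string {s:?} encodes to zero tokens; skipping")
            }
            // Encoder errors are likewise logged and skipped in this PR.
            Err(e) => eprintln!("failed to encode stop string {s:?}; skipping: {e}"),
            // Multi-token stop strings cannot be expressed as one stop_token_id.
            Ok(t) => {
                return Err(format!(
                    "stop string {s:?} encodes to {} tokens; MLX backend only supports single-token stop strings",
                    t.len()
                ));
            }
        }
    }
    Ok(ids)
}

Multi-token stop strings are rejected so the caller can surface an HTTP 400, matching the behavior shown in the test plan below.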

Changes

Test Plan

  1. Unit tests for newly added helper functions
  2. Deploy personal dev stack (SMG + mlx inference backend) with manual tests:
    See detailed curl command and real responses here
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • MLX backend now supports string stop sequences for chat and completion API requests
    • Completion streaming responses now include matched stop sequence information
  • Improvements

    • Enhanced stop sequence resolution for MLX backend with full request context awareness, ensuring accurate token identification and proper reporting of matched stops

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the tokenizer, dependencies, grpc, protocols, and model-gateway labels May 5, 2026
coderabbitai Bot commented May 5, 2026

📝 Walkthrough

This PR enables MLX stop-sequence support by tokenizing user-provided stop strings into token IDs during request building and resolving matched stop IDs back to user-facing values during response processing using request context.
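
A rough Rust sketch of the response-side half of that flow; the parallel-array bookkeeping and the function shape are assumptions, with resolve_mlx_matched_stop_json being the real context-aware helper:

use serde_json::Value;

fn resolve_matched_stop(
    matched_token_id: Option<u32>,
    stop_strings: &[String],
    stop_string_ids: &[u32], // parallel to stop_strings: the IDs they tokenized to
    explicit_stop_token_ids: &[u32],
) -> Option<Value> {
    let id = matched_token_id?;
    // If the matched ID came from a tokenized user stop string, report the string.
    if let Some(pos) = stop_string_ids.iter().position(|&x| x == id) {
        return stop_strings.get(pos).cloned().map(Value::String);
    }
    // If it was an explicit stop_token_id, report the raw integer
    // (as observed in manual test 3.1.4 below).
    if explicit_stop_token_ids.contains(&id) {
        return Some(Value::Number(id.into()));
    }
    None
}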

Changes

MLX Stop-Sequence Support

  • Response Data Model (crates/protocols/src/completion.rs): CompletionStreamChoice adds an optional matched_stop field with conditional serialization and derives Default to support populated matched-stop values in streaming responses.
  • MLX Request Validation (crates/grpc_client/src/mlx_engine.rs): Removes reject_stop_strings validation from chat and completion request builders, allowing string stops; the messages and generate pipelines continue to reject them. Documentation updated.
  • Test Tokenizer Infrastructure (crates/tokenizer/src/mock.rs): MockTokenizer gains a fail_encode: bool field and a failing() constructor to simulate tokenizer encode failures in tests.
  • Stop-Sequence Resolution Helpers (model_gateway/src/routers/grpc/utils/chat_utils.rs): Adds stop_strings_to_token_ids (encode stops to single-token IDs), resolve_mlx_matched_stop_json (map matched token IDs to user values), and resolve_mlx_stop_ids (validate and orchestrate with error handling).
  • Stop-Sequence Test Coverage (model_gateway/src/routers/grpc/utils/chat_utils.rs): Comprehensive test suite covering tokenization success/error cases, HTTP error responses, and matched-stop mapping precedence.
  • MLX Stop-Sequence Injection Helper (model_gateway/src/routers/grpc/common/stages/helpers.rs): Introduces apply_mlx_stop_sequences to validate the tokenizer, tokenize optional stop sequences, and extend MLX sampling parameters with the resolved token IDs.
  • Chat and Completion Request Building (model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs, model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs): Request-building stages invoke apply_mlx_stop_sequences after metadata injection to populate MLX sampling parameters before finalizing protobuf requests.
  • Proto Wrapper Context-Aware Resolution (model_gateway/src/routers/grpc/proto_wrapper.rs): Adds an internal mlx_matched_stop_token_id() helper, treats the MLX backend as unreachable in matched_stop_json(), and updates matched_stop_json_with_context() to use resolve_mlx_matched_stop_json() with request context.
  • Non-Streaming Response Processing (model_gateway/src/routers/grpc/regular/processor.rs): process_single_choice and process_non_streaming_completion_response use matched_stop_json_with_context(), passing the request stop, stop token IDs, and tokenizer for proper matched-stop resolution.
  • Streaming Response Processing (model_gateway/src/routers/grpc/regular/streaming.rs): Chat streaming uses the context-aware matched_stop_json_with_context() for Complete events; completion streaming refactors choice construction to use Default::default() and sets matched_stop on final chunks using context-aware resolution.
  • Utils Re-exports (model_gateway/src/routers/grpc/utils/mod.rs): Extends the gRPC utils re-export block to include the resolve_mlx_matched_stop_json and resolve_mlx_stop_ids helpers.
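
For orientation, an abbreviated Rust sketch of the updated streaming-choice shape; only fields named in this walkthrough and the excerpts later in the thread are shown, and the serde attribute is an assumption consistent with "conditional serialization":

use serde::Serialize;
use serde_json::Value;

/// Abbreviated sketch; the real struct in crates/protocols has more fields.
#[derive(Default, Serialize)]
struct CompletionStreamChoice {
    text: String,
    index: u32,
    #[serde(skip_serializing_if = "Option::is_none")]
    logprobs: Option<Value>,
    #[serde(skip_serializing_if = "Option::is_none")]
    finish_reason: Option<String>,
    // New in this PR: omitted from the JSON unless populated.
    #[serde(skip_serializing_if = "Option::is_none")]
    matched_stop: Option<Value>,
}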

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • lightseekorg/smg#592: Modifies matched_stop handling in model_gateway gRPC pipeline across parser/processor/streaming paths.
  • lightseekorg/smg#606: Makes related changes to stop-sequence handling and router-side tokenization of string vs pre-tokenized stops.
  • lightseekorg/smg#1034: Introduces MLX client/proto and gateway plumbing that this PR directly extends for stop-sequence/token-id handling.

Suggested labels

tests

Suggested reviewers

  • key4ng
  • slin1237
  • CatherineSue

🐰 Stop sequences now dance with MLX tokens,
One string maps to one ID spoken,
Request builds, response resolves with care,
Context flows through the pipeline's air! 🎯

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: ✅ Passed. The title 'feat(mlx-grpc): support string stop sequences for chat and completion' directly and clearly describes the main feature addition—enabling string stop sequences for MLX gRPC chat and completion requests, which is the core objective of the changeset.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b4ce20db09


Comment thread crates/grpc_client/src/mlx_engine.rs
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

Inline comments:
In `@model_gateway/src/routers/grpc/common/stages/helpers.rs`:
- Around line 83-99: The code calls resolve_mlx_stop_ids(stop, tokenizer) before
verifying the request is the MLX variant, which can tokenize unnecessarily and
produce spurious errors; change the order so you first match on
ProtoGenerateRequest::Mlx (e.g., if let ProtoGenerateRequest::Mlx(req) =
proto_request { ... }), bail early if not MLX, then check for Some(stop) and
only then call resolve_mlx_stop_ids(stop, tokenizer) and extend
sampling.stop_token_ids; ensure you still return Ok(()) when stop is None or
when sampling_params is missing.

In `@model_gateway/src/routers/grpc/proto_wrapper.rs`:
- Around line 741-747: The MLX variant's matched_stop_json() can return raw
integer token IDs; update the five unguarded call sites
(process_non_streaming_generate_response,
process_non_streaming_messages_response, process_non_streaming_chat_response
(harmony), the Harmony streaming response processing, and the Harmony streaming
variant) so they do not consume raw matched_stop_json() directly: either guard
with is_mlx() and call resolve_mlx_matched_stop_json() for Mlx, or always call
resolve_mlx_matched_stop_json() (which uses mlx_matched_stop_token_id()) before
using the value; ensure any code paths that previously read matched_stop_json()
now receive the resolved string form.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 422-453: Update the docstring for stop_strings_to_token_ids to
document that tokenizer.encode(...) errors are not propagated but are logged and
the corresponding stop string is skipped (i.e., both zero-token encodings and
encoder errors are warn-and-skipped), and clarify that only multi-token
encodings produce an Err result; reference the function name
stop_strings_to_token_ids and the call tokenizer.encode to make clear where this
behavior occurs. (A docstring sketch follows this list.)
- Around line 496-508: apply_mlx_stop_sequences currently extends
sampling.stop_token_ids with values converted by resolve_mlx_stop_ids without
checking if the request already set explicit stop_token_ids; add a validation
that rejects the request when both stop strings and explicit stop_token_ids are
provided. To fix, change apply_mlx_stop_sequences (or its caller) to receive the
original request's stop_token_ids (or a boolean flag) and if that vector is
non-empty and stop_strings is present, return a bad_request error; alternatively
perform this check in the caller before invoking apply_mlx_stop_sequences.
Ensure the error originates from the same error::bad_request pattern and
reference sampling.stop_token_ids, apply_mlx_stop_sequences, and
resolve_mlx_stop_ids in your change.
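
For illustration, the docstring update requested above for stop_strings_to_token_ids (the chat_utils.rs 422-453 comment) might read along these lines; a sketch in Rust doc-comment form, with the final wording left to the author:

/// Converts user-provided stop strings into MLX stop token IDs.
///
/// For each stop string, tokenizer.encode(...) is consulted:
/// - exactly one token: the ID is included in the result;
/// - zero tokens, or an encoder error: the stop string is warn-logged and
///   skipped inside stop_strings_to_token_ids; errors are NOT propagated;
/// - more than one token: the function returns Err, since the MLX backend
///   only supports single-token stop strings.
fn stop_strings_to_token_ids() { /* signature and body elided */ }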

📥 Commits

Reviewing files that changed from the base of the PR and between 829604f and b4ce20d.

📒 Files selected for processing (13)
  • Cargo.toml
  • crates/grpc_client/src/mlx_engine.rs
  • crates/protocols/src/completion.rs
  • crates/tokenizer/src/mock.rs
  • model_gateway/Cargo.toml
  • model_gateway/src/routers/grpc/common/stages/helpers.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs
  • model_gateway/src/routers/grpc/utils/chat_utils.rs
  • model_gateway/src/routers/grpc/utils/mod.rs

Comment thread model_gateway/src/routers/grpc/common/stages/helpers.rs
Comment thread model_gateway/src/routers/grpc/proto_wrapper.rs
Comment thread model_gateway/src/routers/grpc/utils/chat_utils.rs
Comment on lines +496 to +508
pub(crate) fn resolve_mlx_stop_ids(
    stop_strings: &StringOrArray,
    tokenizer: Option<&dyn Tokenizer>,
) -> Result<Vec<u32>, Response> {
    let tok = tokenizer.ok_or_else(|| {
        error::bad_request(
            "tokenizer_unavailable",
            "MLX backend requires a tokenizer to convert string stop sequences",
        )
    })?;
    stop_strings_to_token_ids(stop_strings.iter(), tok)
        .map_err(|e| error::bad_request("unsupported_stop_string", e))
}
@coderabbitai coderabbitai Bot May 5, 2026

⚠️ Potential issue | 🔴 Critical

The validation for rejecting simultaneous "stop strings + explicit stop_token_ids" is not present in apply_mlx_stop_sequences.

apply_mlx_stop_sequences (helpers.rs:78–100) receives only the stop strings from the request, not the explicit stop_token_ids field. The function blindly extends sampling.stop_token_ids (line 96) without checking whether stop_token_ids were already populated from the user request. No guard exists in this function to reject the case where both parameters are provided. If this validation is required per the PR test plan, it either needs to be added to apply_mlx_stop_sequences (with access to the original request's stop_token_ids) or implemented elsewhere in the request pipeline before this function is called.


Author

Processing simultaneous "stop strings + explicit stop token ids" is handled in apply_mlx_stop_sequences at helpers.rs. Commit 15c1227 ("fix double gated apply_mlx_stop_sequences") moved resolve_mlx_stop_ids inside the ProtoGenerateRequest::Mlx branch and added a comment clarifying the no-op for non-MLX backends.

If both "stop strings + explicit stop token ids" are received, the behaviours are:

  • combined: single-token stop + stop_token_ids => 200, returns with the matched string/ID
  • combined: multi-token stop + stop_token_ids => 400, since MLX does not support multi-token string stops

These 2 scenarios are covered by test cases 3.x.4 and 3.x.5 here

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "id": "chatcmpl-019e13c9-0ea1-77c0-b430-bf53fbcdede2",
    "object": "chat.completion",
    "created": 1778448404,
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "1  \n2  \n3  \n4  \n",
          "reasoning_content": null
        },
        "finish_reason": "stop",
        "matched_stop": 20
      }
    ],
    "usage": {
      "prompt_tokens": 21,
      "completion_tokens": 9,
      "total_tokens": 30
    },
    "system_fingerprint": "default"
  },
  "status": "HTTP 200"
}

and

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "error": {
      "type": "Bad Request",
      "code": "unsupported_stop_string",
      "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
      "param": null
    }
  },
  "status": "HTTP 400"
}


@zach-li-sudo, thank you for the detailed clarification and the manual test evidence! My concern was based on an incorrect assumption — I assumed the design should outright reject requests when both stop and stop_token_ids are provided simultaneously, but the actual design is more nuanced and correct:

  • Single-token stop + stop_token_ids → Both are honored together (merged into sampling.stop_token_ids), and the matched stop is resolved back to the correct user-facing form via resolve_mlx_matched_stop_json. ✅
  • Multi-token stop + stop_token_ids → Rejected with 400 because resolve_mlx_stop_ids (inside the Mlx branch) fails on multi-token encoding before anything is extended. ✅

The commit 15c1227 properly gates resolve_mlx_stop_ids inside ProtoGenerateRequest::Mlx, and the manual test results for cases 3.x.4 and 3.x.5 confirm the behavior is working as designed. My concern doesn't apply here — the implementation is correct.

( ˶ᵔ ᵕ ᵔ˶ )



@zach-li-sudo zach-li-sudo changed the title "feat(mlx-grpc): support string stop sequences for chat and completion…" to "feat(mlx-grpc): support string stop sequences for chat and completion" May 7, 2026
@key4ng
Collaborator

key4ng commented May 7, 2026

Thanks for the contribution, Zhuo! The feature itself is small and the tests are nice — but the diff feels heavier than it needs to be because backend-type branching is leaking into orchestration code. A few things to consider before merge:

1. is_mlx() branching is repeated 4× — push it into the wrapper

The same if complete.is_mlx() { resolve_mlx_matched_stop_json(...) } else { complete.matched_stop_json() } block appears in:

  • model_gateway/src/routers/grpc/regular/processor.rs:187 (chat)
  • model_gateway/src/routers/grpc/regular/processor.rs:834 (completion)
  • model_gateway/src/routers/grpc/regular/streaming.rs:495 (chat stream)
  • model_gateway/src/routers/grpc/regular/streaming.rs:2589 (completion stream)

proto_wrapper.rs already encapsulates backend differences (matched_stop_json(), is_mlx_streaming_finished(), etc.). Could you add a matched_stop_json_with_context(stop, stop_token_ids, tokenizer) method on ProtoGenerateComplete that internally dispatches on the variant? Each caller collapses to one line, and mlx_matched_stop_token_id() no longer needs to be pub. Composes better when the next backend hits the same quirk.
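
A minimal Rust sketch of that dispatch, with stub variants standing in for the real proto wrappers and resolve_mlx as a placeholder for resolve_mlx_matched_stop_json:

use serde_json::Value;

enum ProtoGenerateComplete {
    Mlx { matched_stop_token_id: Option<u32> },
    Other { matched_stop: Option<Value> }, // stands in for SGLang/vLLM/TRT-LLM
}

fn resolve_mlx(matched: Option<u32>, stop: &[String], stop_ids: &[u32]) -> Option<Value> {
    let id = matched?;
    // Map the token ID back to the user's stop string if it originated from
    // one; otherwise report the raw ID.
    match stop_ids.iter().position(|&x| x == id) {
        Some(i) => stop.get(i).cloned().map(Value::String),
        None => Some(Value::Number(id.into())),
    }
}

impl ProtoGenerateComplete {
    /// Each of the four call sites collapses to this one method call; non-MLX
    /// variants ignore the request context and use their native value.
    fn matched_stop_json_with_context(&self, stop: &[String], stop_ids: &[u32]) -> Option<Value> {
        match self {
            Self::Mlx { matched_stop_token_id } => {
                resolve_mlx(*matched_stop_token_id, stop, stop_ids)
            }
            Self::Other { matched_stop } => matched_stop.clone(),
        }
    }
}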

2. apply_mlx_stop_sequences is double-gated

model_gateway/src/routers/grpc/common/stages/helpers.rs:82 pattern-matches ProtoGenerateRequest::Mlx(req), but both callers (chat/request_building.rs:111, completion/request_building.rs:90) already gate on builder_client.is_mlx(). Pick one — either the helper is unconditional and no-ops on non-MLX, or callers gate and the helper assumes MLX.

3. Silent encode-error swallow is a footgun

chat_utils.rs:443: stop_strings_to_token_ids correctly returns Err (→ 400) for multi-token strings, but silently warn!s and drops when the tokenizer errors or returns zero tokens. The new MockTokenizer::failing() test at line 904 locks this in:

// "Tokenizer errors are silently skipped; the function returns Ok(empty)."
assert!(result.unwrap().is_empty());

Both are the same class of "we cannot honor your stop condition" failure — caller sets stop: ["foo"], encoder errors, request proceeds to max_tokens, user blames SMG. Recommend surfacing both as 400.
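
A Rust sketch of the stricter variant, where every unusable stop string becomes an Err that the caller can map to a 400 (hypothetical; not the code in this PR):

fn stop_strings_to_token_ids_strict<'a>(
    stops: impl IntoIterator<Item = &'a str>,
    encode: impl Fn(&str) -> Result<Vec<u32>, String>,
) -> Result<Vec<u32>, String> {
    stops
        .into_iter()
        .map(|s| match encode(s) {
            Ok(t) if t.len() == 1 => Ok(t[0]),
            // All "cannot honor this stop condition" cases become errors, so
            // callers return 400 instead of silently generating to max_tokens.
            Ok(t) => Err(format!(
                "stop string {s:?} encodes to {} tokens; expected exactly 1",
                t.len()
            )),
            Err(e) => Err(format!("failed to encode stop string {s:?}: {e}")),
        })
        .collect()
}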

4. Is test-case = "3.3.1" worth a new workspace dep?

It's introduced as a brand-new workspace dependency just for ~13 parameterized cases across 2 functions in chat_utils.rs. Not used anywhere else in the workspace today. A plain #[test] with a slice of (input, expected, name) tuples and a loop gets the same coverage without the dep — and keeps the workspace lean. Up to you, but worth weighing.
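
The dependency-free alternative would look roughly like this; the cases and the stand-in helper are illustrative, not the PR's actual tests:

// Stand-in for the real tokenizer-backed helper; returns the single token ID
// for a stop string, or Err when it does not encode to exactly one token.
fn encode_single_token(s: &str) -> Result<u32, ()> {
    match s {
        "6" => Ok(21), // Qwen3 tokenizer ID, per the token reference below
        _ if s.contains(' ') => Err(()), // multi-word -> multi-token
        _ => Ok(0),
    }
}

#[test]
fn stop_string_tokenization_cases() {
    // One plain #[test] looping over (input, expected, name) tuples.
    let cases: &[(&str, Result<u32, ()>, &str)] = &[
        ("6", Ok(21), "single-token stop maps to its ID"),
        ("hello world", Err(()), "multi-token stop is rejected"),
    ];
    for (input, expected, name) in cases {
        assert_eq!(&encode_single_token(input), expected, "case: {name}");
    }
}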

Smaller items

  • streaming.rs got 6 new matched_stop: None struct-literal additions (lines 2422, 2450, 2470, 2505, 2529, 2552). Not blocking, but a ..Default::default() or a constructor on CompletionStreamChoice would prevent this sprawl next time.
  • Could you run the curl from the PR description on your local Mac (SMG + MLX backend) and paste the actual response in the test plan? Stream + non-stream, single stop string + multi-token rejection. That's the strongest signal we'll get for a Mac-specific path.
  • mlx_engine.rs:245 comment update is good; retaining reject_stop_strings for messages/generate paths is correct.

After (1)–(3), the MLX surface area shrinks a lot and the "MLX branches everywhere" feel goes away.

@github-actions github-actions Bot removed the dependencies Dependency updates label May 10, 2026
Contributor

mergify Bot commented May 10, 2026

Hi @zach-li-sudo, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c7e1728cf1


        .matched_stop_token_id
        .map(|id| serde_json::Value::Number(id.into())),
    // MLX requires request context to resolve the token ID; use matched_stop_json_with_context.
    Self::Mlx(_) => unreachable!("matched_stop_json called for MLX backend"),

P1: Avoid panicking for MLX in matched_stop_json

Do not make the MLX branch unreachable here until every caller is migrated to matched_stop_json_with_context: several active paths still call matched_stop_json() directly (for example regular/processor.rs at lines 426 and 713, and regular/streaming.rs at line 2010). For any MLX completion in those flows (non-streaming /generate, non-streaming /messages, and streaming Messages), this now triggers a runtime panic instead of producing a normal response.
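
For illustration, a non-panicking alternative would degrade to the old raw-ID behavior for unmigrated callers instead of asserting unreachability (a sketch, not the PR's code):

use serde_json::Value;

// Fallback for the MLX arm: report the raw token ID (the pre-PR behavior)
// rather than unreachable!(), so callers not yet migrated to
// matched_stop_json_with_context degrade gracefully instead of panicking.
fn mlx_matched_stop_json_fallback(matched_stop_token_id: Option<u32>) -> Option<Value> {
    matched_stop_token_id.map(|id| Value::Number(id.into()))
}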

Author

Similar to the unreachable!() usage here for the MLX multi-modal feature. Every occurrence where a matched stop is needed when building the JSON response for MLX can only be resolved by matched_stop_json_with_context(), where all other backends (vLLM, SGLang, etc.) are no-ops, so code execution shouldn't reach here. The Self::Mlx(_) arm is kept because Rust requires the match to be exhaustive.

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
model_gateway/src/routers/grpc/common/stages/helpers.rs (1)

283-305: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Optimize by checking MLX variant before tokenization.

The current order (check stop, tokenize, then check MLX variant) can waste CPU tokenizing stops for non-MLX backends. Checking the MLX variant first avoids unnecessary tokenization:

♻️ Proposed reordering
 pub(crate) fn apply_mlx_stop_sequences(
     proto_request: &mut ProtoGenerateRequest,
     stop: Option<&StringOrArray>,
     tokenizer: Option<&dyn Tokenizer>,
 ) -> Result<(), Response> {
+    let ProtoGenerateRequest::Mlx(req) = proto_request else {
+        return Ok(());
+    };
     let Some(stop) = stop else {
         return Ok(());
     };
-
-    if let ProtoGenerateRequest::Mlx(req) = proto_request {
-        let token_ids = resolve_mlx_stop_ids(stop, tokenizer)?;
-        let sampling = req.sampling_params.as_mut().ok_or_else(|| {
-            error::internal_error(
-                "mlx_sampling_params_missing",
-                "MLX GenerateRequest has no sampling_params; cannot inject stop IDs",
-            )
-        })?;
-        sampling.stop_token_ids.extend(token_ids);
-    }
-
-    Ok(())
+    let token_ids = resolve_mlx_stop_ids(stop, tokenizer)?;
+    let sampling = req.sampling_params.as_mut().ok_or_else(|| {
+        error::internal_error(
+            "mlx_sampling_params_missing",
+            "MLX GenerateRequest has no sampling_params; cannot inject stop IDs",
+        )
+    })?;
+    sampling.stop_token_ids.extend(token_ids);
+    Ok(())
 }


📥 Commits

Reviewing files that changed from the base of the PR and between b4ce20d and c7e1728.

📒 Files selected for processing (11)
  • crates/grpc_client/src/mlx_engine.rs
  • crates/protocols/src/completion.rs
  • crates/tokenizer/src/mock.rs
  • model_gateway/src/routers/grpc/common/stages/helpers.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs
  • model_gateway/src/routers/grpc/utils/chat_utils.rs
  • model_gateway/src/routers/grpc/utils/mod.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fac846fbd6


Comment on lines 2456 to 2460
choices: vec![CompletionStreamChoice {
    text: String::new(),
    index,
    logprobs: None,
    finish_reason: Some("stop".to_string()),
    ..Default::default()
}],

P2: Include matched_stop in chunk-stopped completion streams

When the local stop decoder fires in the Chunk branch (stopped == true), the emitted terminal CompletionStreamChoice is built with ..Default::default() and never sets matched_stop. Because the index is then added to stopped_indices, the later Complete message is skipped, so the new matched_stop field is never populated for these streams. This affects streamed completions that terminate on gateway-detected stop sequences (for example user-provided stop), and clients lose the stop match metadata even though termination reason is stop.
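
A Rust sketch of the suggested fix, with the choice struct abbreviated; the matched value itself would come from the same context-aware resolution already used on the Complete path:

use serde_json::Value;

// Abbreviated choice struct; the real one lives in crates/protocols.
#[derive(Default)]
struct CompletionStreamChoice {
    text: String,
    index: u32,
    finish_reason: Option<String>,
    matched_stop: Option<Value>, // new field from this PR
}

// When the local stop decoder fires in the Chunk branch, build the terminal
// choice with matched_stop populated instead of leaving it at its Default of None.
fn terminal_chunk_choice(index: u32, matched: Option<Value>) -> CompletionStreamChoice {
    CompletionStreamChoice {
        index,
        finish_reason: Some("stop".to_string()),
        matched_stop: matched,
        ..Default::default()
    }
}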

Author

@zach-li-sudo zach-li-sudo May 12, 2026

Nice catch! I've also tested the vLLM backend:

curl http://localhost:3000/v1/completions -s -N \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"cmpl_019e1a56-4bfe-7db2-b42d-5af3e093ac4f","object":"text_completion","created":1778558323,"choices":[{"text":"5","index":0,"finish_reason":null}],"model":"Qwen/Qwen2.5-1.5B-Instruct","system_fingerprint":"default"}

data: {"id":"cmpl_019e1a56-4bfe-7db2-b42d-5af3e093ac4f","object":"text_completion","created":1778558323,"choices":[{"text":"\n","index":0,"finish_reason":null}],"model":"Qwen/Qwen2.5-1.5B-Instruct","system_fingerprint":"default"}

data: {"id":"cmpl_019e1a56-4bfe-7db2-b42d-5af3e093ac4f","object":"text_completion","created":1778558323,"choices":[{"text":"","index":0,"finish_reason":"stop"}],"model":"Qwen/Qwen2.5-1.5B-Instruct","system_fingerprint":"default"}

data: [DONE]

No matched_stop: "6" appeared in the final response chunk, so this impacts both the vLLM and MLX completion paths. I'll create a separate PR to add it.

@zach-li-sudo
Author

Manual testing results: stop sequence with MLX backend

1. SMG with MLX backend setup

  1. Build the gateway: cargo build
  2. Install the gRPC servicer (first time only):
     pip install -e grpc_servicer/
  3. Start the MLX backend:
     source .venv/bin/activate && python -m smg_grpc_servicer.mlx.server \
       --model mlx-community/Qwen3-4B-Instruct-2507-4bit --port 50051
  4. Start the SMG gateway server:
     ./target/debug/smg --worker-urls grpc://localhost:50051 --port 3000

2. Testing scenarios

This PR supports string stop arrays in the regular paths: chat completions and completions, both non-streaming and streaming.

Key differences from vLLM

| Feature | vLLM | MLX |
| --- | --- | --- |
| Single-token stop strings ("stop": ["6"]) | ✅ supported | ✅ tokenized to a token ID |
| Multi-token stop strings ("stop": ["hello world"]) | ✅ supported | ❌ returns HTTP 400 |
| stop_token_ids | ✅ supported | ✅ supported |

The gateway converts single-token stop strings to token IDs for MLX requests (the MLX proto has
no native string stop field). Multi-token stop strings are rejected with HTTP 400
(unsupported_stop_string), since MLX can only stop on a single token at a time.

Token ID reference

Token IDs used in the tests below assume the Qwen3 tokenizer, which shares vocabulary with Qwen2.5:

  • Token 20 → "5"
  • Token 21 → "6"

Verify with the tokenizer if results are unexpected:

from mlx_lm import load
_, tokenizer = load("mlx-community/Qwen3-4B-Instruct-2507-4bit")
print(tokenizer.encode("5 6"))   # check IDs for context-free "5" and "6"

3. Results

Test matrix: 4 paths × 5 stop modes = 20 cases.

  • Single-token stop string cases (3.x.1): PASS = 200, matched_stop = the stop string.
  • Multi-token stop string cases (3.x.2): PASS = 400 with an appropriate error message.
  • stop_token_ids cases (3.x.3): PASS = 200, matched_stop = the matched token ID integer.
  • Combined single-token stop string + stop_token_ids (3.x.4): PASS = 200, matched_stop = integer or string depending on which stop fires first.
  • Combined multi-token stop string + stop_token_ids (3.x.5): PASS = 400; the multi-token string is rejected regardless of stop_token_ids.

| # | Endpoint | stream | Stop mode | Expected | Result |
| --- | --- | --- | --- | --- | --- |
| 3.1.1 | /v1/chat/completions | false | stop string (single token) | pass | 200 ✅ |
| 3.1.2 | /v1/chat/completions | false | stop string (multi-token) | 400 | 400 ✅ |
| 3.1.3 | /v1/chat/completions | false | stop_token_ids | pass | 200 ✅ |
| 3.1.4 | /v1/chat/completions | false | combined: single-token stop + stop_token_ids | pass | 200 ✅ |
| 3.1.5 | /v1/chat/completions | false | combined: multi-token stop + stop_token_ids | 400 | 400 ✅ |
| 3.2.1 | /v1/chat/completions | true | stop string (single token) | pass | 200 ✅ |
| 3.2.2 | /v1/chat/completions | true | stop string (multi-token) | 400 | 400 ✅ |
| 3.2.3 | /v1/chat/completions | true | stop_token_ids | pass | 200 ✅ |
| 3.2.4 | /v1/chat/completions | true | combined: single-token stop + stop_token_ids | pass | 200 ✅ |
| 3.2.5 | /v1/chat/completions | true | combined: multi-token stop + stop_token_ids | 400 | 400 ✅ |
| 3.3.1 | /v1/completions | false | stop string (single token) | pass | 200 ✅ |
| 3.3.2 | /v1/completions | false | stop string (multi-token) | 400 | 400 ✅ |
| 3.3.3 | /v1/completions | false | stop_token_ids | pass | 200 ✅ |
| 3.3.4 | /v1/completions | false | combined: single-token stop + stop_token_ids | pass | 200 ✅ |
| 3.3.5 | /v1/completions | false | combined: multi-token stop + stop_token_ids | 400 | 400 ✅ |
| 3.4.1 | /v1/completions | true | stop string (single token) | pass | 200 ✅ |
| 3.4.2 | /v1/completions | true | stop string (multi-token) | 400 | 400 ✅ |
| 3.4.3 | /v1/completions | true | stop_token_ids | pass | 200 ✅ |
| 3.4.4 | /v1/completions | true | combined: single-token stop + stop_token_ids | pass | 200 ✅ |
| 3.4.5 | /v1/completions | true | combined: multi-token stop + stop_token_ids | 400 | 400 ✅ |

Reading the streaming output below: Only the first content chunk and the final (stop) chunk are
shown. Intermediate content chunks are collapsed to # ... (content: "...") ....


3.1 Chat, non-stream

3.1.1 — stop string, single token ("stop": ["6"]) — expect HTTP 200

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "id": "chatcmpl-019e12f3-af39-7bc1-a46c-4173bebe0cbc",
    "object": "chat.completion",
    "created": 1778434420,
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "1  \n2  \n3  \n4  \n5  \n",
          "reasoning_content": null
        },
        "finish_reason": "stop",
        "matched_stop": "6"
      }
    ],
    "usage": {
      "prompt_tokens": 21,
      "completion_tokens": 11,
      "total_tokens": 32
    },
    "system_fingerprint": "default"
  },
  "status": "HTTP 200"
}

3.1.2 — stop string, multi-token ("stop": ["hello world"]) — expect HTTP 400

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Say: hi there and hello world!"}],
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "error": {
      "type": "Bad Request",
      "code": "unsupported_stop_string",
      "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
      "param": null
    }
  },
  "status": "HTTP 400"
}

3.1.3 — stop_token_ids ([20, 21] = tokens for "5" and "6")

curl http://localhost:3000/v1/chat/completions -s \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq
{
  "id": "chatcmpl-019e12f3-b06a-7e30-9026-b277ef4cd022",
  "object": "chat.completion",
  "created": 1778434420,
  "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1  \n2  \n3  \n4  \n",
        "reasoning_content": null
      },
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "completion_tokens": 9,
    "total_tokens": 30
  },
  "system_fingerprint": "default"
}

3.1.4 — combined: single-token stop string + stop_token_ids ("stop": ["6"], "stop_token_ids": [20]) — expect HTTP 200

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "id": "chatcmpl-019e13c9-0ea1-77c0-b430-bf53fbcdede2",
    "object": "chat.completion",
    "created": 1778448404,
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "1  \n2  \n3  \n4  \n",
          "reasoning_content": null
        },
        "finish_reason": "stop",
        "matched_stop": 20
      }
    ],
    "usage": {
      "prompt_tokens": 21,
      "completion_tokens": 9,
      "total_tokens": 30
    },
    "system_fingerprint": "default"
  },
  "status": "HTTP 200"
}

3.1.5 — combined: multi-token stop string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20, 21]) — expect HTTP 400

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "error": {
      "type": "Bad Request",
      "code": "unsupported_stop_string",
      "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
      "param": null
    }
  },
  "status": "HTTP 400"
}

3.2 Chat, stream

3.2.1 — stop string, single token ("stop": ["6"]) — expect HTTP 200

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"chatcmpl-019e12f4-6c1f-7651-a730-137fffaab34b","object":"chat.completion.chunk","created":1778434468,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"role":"assistant","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-019e12f4-6c1f-7651-a730-137fffaab34b","object":"chat.completion.chunk","created":1778434468,...,"choices":[{"index":0,"delta":{"role":"assistant","content":"1  ","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

# ... (content: "1  \n2  \n3  \n4  \n5  \n") ...

data: {"id":"chatcmpl-019e12f4-6c1f-7651-a730-137fffaab34b","object":"chat.completion.chunk","created":1778434468,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"reasoning_content":null},"logprobs":null,"finish_reason":"stop","matched_stop":"6"}]}

data: [DONE]

HTTP 200

3.2.2 — stop string, multi-token ("stop": ["hello world"]) — expect HTTP 400

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 100
  }'
{"error":{"type":"Bad Request","code":"unsupported_stop_string","message":"stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings","param":null}}
HTTP 400

3.2.3 — stop_token_ids ([20, 21] = tokens for "5" and "6")

curl http://localhost:3000/v1/chat/completions -s -N \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"chatcmpl-019e12f4-6d47-7d00-beca-1339545bf2bd","object":"chat.completion.chunk","created":1778434469,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"role":"assistant","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-019e12f4-6d47-7d00-beca-1339545bf2bd","object":"chat.completion.chunk","created":1778434469,...,"choices":[{"index":0,"delta":{"role":"assistant","content":"1","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

# ... (content: "1  \n2  \n3  \n4  \n") ...

data: {"id":"chatcmpl-019e12f4-6d47-7d00-beca-1339545bf2bd","object":"chat.completion.chunk","created":1778434469,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"reasoning_content":null},"logprobs":null,"finish_reason":"stop","matched_stop":20}]}

data: [DONE]

3.2.4 — combined: single-token stop string + stop_token_ids ("stop": ["6"], "stop_token_ids": [20]) — expect HTTP 200

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"chatcmpl-019e13c9-3610-7a81-928c-7aa1e5f75f9d","object":"chat.completion.chunk","created":1778448414,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"role":"assistant","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-019e13c9-3610-7a81-928c-7aa1e5f75f9d","object":"chat.completion.chunk","created":1778448414,...,"choices":[{"index":0,"delta":{"role":"assistant","content":"1  ","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

# ... (content: "1  \n2  \n3  \n4  \n") ...

data: {"id":"chatcmpl-019e13c9-3610-7a81-928c-7aa1e5f75f9d","object":"chat.completion.chunk","created":1778448414,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"reasoning_content":null},"logprobs":null,"finish_reason":"stop","matched_stop":20}]}

data: [DONE]

HTTP 200

3.2.5 — combined: multi-token stop string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20, 21]) — expect HTTP 400

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }'
{"error":{"type":"Bad Request","code":"unsupported_stop_string","message":"stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings","param":null}}
HTTP 400

3.3 Completion, non-stream

3.3.1 — stop string, single token ("stop": ["6"]) — expect HTTP 200

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "id": "cmpl_019e12f4-9eda-71c2-bf57-ab5fca3036f6",
    "object": "text_completion",
    "created": 1778434481,
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "choices": [
      {
        "text": "5\n",
        "index": 0,
        "finish_reason": "stop",
        "matched_stop": "6"
      }
    ],
    "usage": {
      "prompt_tokens": 22,
      "completion_tokens": 3,
      "total_tokens": 25
    },
    "system_fingerprint": "default"
  },
  "status": "HTTP 200"
}

3.3.2 — stop string, multi-token ("stop": ["hello world"]) — expect HTTP 400

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "error": {
      "type": "Bad Request",
      "code": "unsupported_stop_string",
      "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
      "param": null
    }
  },
  "status": "HTTP 400"
}

3.3.3 — stop_token_ids ([20, 21] = tokens for "5" and "6")

curl http://localhost:3000/v1/completions -s \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq
{
  "id": "cmpl_019e12f4-9faf-7582-bcc0-bb7849cf6585",
  "object": "text_completion",
  "created": 1778434482,
  "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 1,
    "total_tokens": 23
  },
  "system_fingerprint": "default"
}

The text is empty because the stop token was the very next token generated after the prompt, so generation stopped immediately.

3.3.4 — combined: single-token stop string + stop_token_ids ("stop": ["6"], "stop_token_ids": [20]) — expect HTTP 200

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "id": "cmpl_019e13c9-6957-7f10-8038-53f7c215dd87",
    "object": "text_completion",
    "created": 1778448427,
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "choices": [
      {
        "text": "",
        "index": 0,
        "finish_reason": "stop",
        "matched_stop": 20
      }
    ],
    "usage": {
      "prompt_tokens": 22,
      "completion_tokens": 1,
      "total_tokens": 23
    },
    "system_fingerprint": "default"
  },
  "status": "HTTP 200"
}

3.3.5 — combined: multi-token stop string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20, 21]) — expect HTTP 400

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "error": {
      "type": "Bad Request",
      "code": "unsupported_stop_string",
      "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
      "param": null
    }
  },
  "status": "HTTP 400"
}

3.4 Completion, stream

3.4.1 — stop string, single token ("stop": ["6"]) — expect HTTP 200

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"cmpl_019e12f4-bb3b-7b80-b016-13a6be12605c","object":"text_completion","created":1778434489,"choices":[{"text":"5","index":0,"finish_reason":null}],"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default"}

# ... (content: "5\n") ...

data: {"id":"cmpl_019e12f4-bb3b-7b80-b016-13a6be12605c","object":"text_completion","created":1778434489,"choices":[{"text":"","index":0,"finish_reason":"stop"}],"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default"}

data: [DONE]

HTTP 200

3.4.2 — stop string, multi-token ("stop": ["hello world"]) — expect HTTP 400

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 100
  }'
{"error":{"type":"Bad Request","code":"unsupported_stop_string","message":"stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings","param":null}}
HTTP 400

3.4.3 — stop_token_ids ([20, 21] = tokens for "5" and "6")

curl http://localhost:3000/v1/completions -s -N \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"cmpl_019e12f4-bc18-7b71-b9a6-fb673df80244","object":"text_completion","created":1778434489,"choices":[{"text":"","index":0,"finish_reason":"stop"}],"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default"}

data: [DONE]

3.4.4 — combined: single-token stop string + stop_token_ids ("stop": ["6"], "stop_token_ids": [20]) — expect HTTP 200

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"cmpl_019e13c9-8d8d-75a0-bba2-d7da2fc33066","object":"text_completion","created":1778448436,"choices":[{"text":"","index":0,"finish_reason":"stop"}],"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default"}

data: [DONE]

HTTP 200

3.4.5 — combined: multi-token stop string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20, 21]) — expect HTTP 400

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }'
{"error":{"type":"Bad Request","code":"unsupported_stop_string","message":"stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings","param":null}}
HTTP 400

@zach-li-sudo
Author

Thanks for the contribution, Zhuo! The feature itself is small and the tests are nice — but the diff feels heavier than it needs to be because backend-type branching is leaking into orchestration code. A few things to consider before merge:

1. is_mlx() branching is repeated 4× — push it into the wrapper

The same if complete.is_mlx() { resolve_mlx_matched_stop_json(...) } else { complete.matched_stop_json() } block appears in:

  • model_gateway/src/routers/grpc/regular/processor.rs:187 (chat)
  • model_gateway/src/routers/grpc/regular/processor.rs:834 (completion)
  • model_gateway/src/routers/grpc/regular/streaming.rs:495 (chat stream)
  • model_gateway/src/routers/grpc/regular/streaming.rs:2589 (completion stream)

proto_wrapper.rs already encapsulates backend differences (matched_stop_json(), is_mlx_streaming_finished(), etc.). Could you add a matched_stop_json_with_context(stop, stop_token_ids, tokenizer) method on ProtoGenerateComplete that internally dispatches on the variant? Each caller collapses to one line, and mlx_matched_stop_token_id() no longer needs to be pub. Composes better when the next backend hits the same quirk.
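
Roughly this shape, as a minimal sketch with stand-in types rather than the actual proto_wrapper.rs definitions:

use serde_json::Value;

// Everything below is a stand-in for illustration; the real definitions
// live in proto_wrapper.rs and the tokenizer crate.
trait Tokenizer {}

struct MlxComplete;
struct StdComplete;

enum ProtoGenerateComplete {
    Mlx(MlxComplete),
    Std(StdComplete),
}

// Hypothetical resolver mirroring chat_utils::resolve_mlx_matched_stop_json;
// body elided for the sketch.
fn resolve_mlx_matched_stop_json(
    _complete: &MlxComplete,
    _stop: &[String],
    _stop_token_ids: &[u32],
    _tokenizer: &dyn Tokenizer,
) -> Option<Value> {
    None
}

impl StdComplete {
    // The real impl returns the backend-reported matched_stop verbatim.
    fn matched_stop_json(&self) -> Option<Value> {
        None
    }
}

impl ProtoGenerateComplete {
    // Each of the four call sites collapses to this single call; only the
    // wrapper knows which backend needs request context to resolve stops.
    fn matched_stop_json_with_context(
        &self,
        stop: &[String],
        stop_token_ids: &[u32],
        tokenizer: &dyn Tokenizer,
    ) -> Option<Value> {
        match self {
            Self::Mlx(c) => resolve_mlx_matched_stop_json(c, stop, stop_token_ids, tokenizer),
            Self::Std(c) => c.matched_stop_json(),
        }
    }
}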

2. apply_mlx_stop_sequences is double-gated

model_gateway/src/routers/grpc/common/stages/helpers.rs:82 pattern-matches ProtoGenerateRequest::Mlx(req), but both callers (chat/request_building.rs:111, completion/request_building.rs:90) already gate on builder_client.is_mlx(). Pick one — either the helper is unconditional and no-ops on non-MLX, or callers gate and the helper assumes MLX.
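
For the record, the first option would look something like this (request types are stand-ins, not the actual crate definitions):

// Stand-in request types for illustration only.
struct MlxRequest {
    stop_token_ids: Vec<u32>,
}
struct StdRequest;

enum ProtoGenerateRequest {
    Mlx(MlxRequest),
    Std(StdRequest),
}

// Unconditional from the caller's side: the helper itself decides whether
// the backend accepts stop token ids, so request_building.rs never gates.
fn apply_mlx_stop_sequences(req: &mut ProtoGenerateRequest, resolved_ids: &[u32]) {
    if let ProtoGenerateRequest::Mlx(mlx) = req {
        mlx.stop_token_ids.extend_from_slice(resolved_ids);
    }
    // Non-MLX variants: deliberate no-op.
}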

3. Silent encode-error swallow is a footgun

chat_utils.rs:443: stop_strings_to_token_ids correctly returns Err (→ 400) for multi-token strings, but it silently warn!s and drops the stop when the tokenizer errors or returns zero tokens. The new MockTokenizer::failing() test at line 904 locks this in:

// "Tokenizer errors are silently skipped; the function returns Ok(empty)."
assert!(result.unwrap().is_empty());

Both are the same class of "we cannot honor your stop condition" failure — caller sets stop: ["foo"], encoder errors, request proceeds to max_tokens, user blames SMG. Recommend surfacing both as 400.
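
Concretely, something in this direction (the error type and Tokenizer trait here are illustrative stand-ins, not the real crate API):

// Illustrative stand-ins; the real Tokenizer trait and error type differ.
trait Tokenizer {
    fn encode(&self, s: &str) -> Result<Vec<u32>, String>;
}

struct ApiError(String);
impl ApiError {
    fn bad_request(msg: String) -> Self {
        ApiError(msg)
    }
}

// Every failure to honor a stop condition surfaces as a 400, including
// tokenizer errors and zero-token encodings, instead of warn-and-drop.
fn stop_strings_to_token_ids(
    stops: &[String],
    tokenizer: &dyn Tokenizer,
) -> Result<Vec<u32>, ApiError> {
    let mut ids = Vec::with_capacity(stops.len());
    for s in stops {
        let toks = tokenizer.encode(s).map_err(|e| {
            ApiError::bad_request(format!("failed to tokenize stop string {s:?}: {e}"))
        })?;
        match toks.as_slice() {
            [id] => ids.push(*id),
            _ => {
                return Err(ApiError::bad_request(format!(
                    "stop string {s:?} encodes to {} tokens; MLX backend only supports single-token stop strings",
                    toks.len()
                )))
            }
        }
    }
    Ok(ids)
}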

4. Is test-case = "3.3.1" worth a new workspace dep?

It's introduced as a brand-new workspace dependency just for ~13 parameterized cases across 2 functions in chat_utils.rs. Not used anywhere else in the workspace today. A plain #[test] with a slice of (input, expected, name) tuples and a loop gets the same coverage without the dep — and keeps the workspace lean. Up to you, but worth weighing.
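
For comparison, the dependency-free pattern is just the following (encode_stop is a fake stand-in for the real tokenizer path, present only so the sketch compiles on its own):

// Fake stand-in for the real single-token encoding check.
fn encode_stop(stops: &[&str]) -> Result<Vec<u32>, String> {
    stops
        .iter()
        .map(|s| {
            if s.contains(' ') {
                Err(format!("stop string {s:?} encodes to multiple tokens"))
            } else {
                Ok(s.len() as u32)
            }
        })
        .collect()
}

#[test]
fn stop_string_cases() {
    // (name, input, should_pass) tuples; add a case by appending a row.
    let cases: &[(&str, &[&str], bool)] = &[
        ("single-token stop passes", &["6"], true),
        ("multi-token stop rejected", &["hello world"], false),
        ("empty stop list passes", &[], true),
    ];
    for (name, input, should_pass) in cases {
        assert_eq!(encode_stop(input).is_ok(), *should_pass, "case: {name}");
    }
}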

Smaller items

  • streaming.rs got 6 new matched_stop: None struct-literal additions (lines 2422, 2450, 2470, 2505, 2529, 2552). Not blocking, but a ..Default::default() or a constructor on CompletionStreamChoice would prevent this sprawl next time.
  • Could you run the curl from the PR description on your local Mac (SMG + MLX backend) and paste the actual response in the test plan? Stream + non-stream, single stop string + multi-token rejection. That's the strongest signal we'll get for a Mac-specific path.
  • mlx_engine.rs:245 comment update is good; retaining reject_stop_strings for messages/generate paths is correct.

After (1)–(3), the MLX surface area shrinks a lot and the "MLX branches everywhere" feel goes away.

Hi Keyang, thanks for your comments! The issues you've mentioned are resolved as follows:

  1. is_mlx() branching is repeated 4× — push it into the wrapper: fixed in commit #6f390899 by moving the MLX matched_stop resolution logic into the proto wrapper.
  2. apply_mlx_stop_sequences is double-gated: fixed in commit #15c12271; the helper is now unconditional and no-ops on non-MLX requests.
  3. Silent encode-error swallow is a footgun: fixed in commit #271cec93; zero-token encodings and tokenizer failures now return a 400 error.
  4. Is test-case = "3.3.1" worth a new workspace dep: the dev dependency is removed; the new unit tests are refactored into a plain #[test] with a slice of (input, expected, name) tuples and a loop.

Other items:

  • streaming.rs got 6 new matched_stop: None struct-literal additions (lines 2422, 2450, 2470, 2505, 2529, 2552). Not blocking, but a ..Default::default() or a constructor on CompletionStreamChoice would prevent this sprawl next time.

Solution: used ..Default::default() when constructing CompletionStreamChoice. See #94eef5a2.
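
For reference, the construction now looks roughly like this (the field set shown is trimmed for illustration):

use serde_json::Value;

// Deriving Default is what enables the spread syntax; fields trimmed here.
#[derive(Default)]
struct CompletionStreamChoice {
    text: String,
    index: u32,
    finish_reason: Option<String>,
    matched_stop: Option<Value>,
}

fn content_chunk_choice(text: String) -> CompletionStreamChoice {
    CompletionStreamChoice {
        text,
        index: 0,
        // finish_reason and matched_stop default to None, so the next
        // optional field added to the struct won't touch this call site.
        ..Default::default()
    }
}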

  • Could you run the curl from the PR description on your local Mac (SMG + MLX backend) and paste the actual response in the test plan? Stream + non-stream, single stop string + multi-token rejection. That's the strongest signal we'll get for a Mac-specific path.

Yes, the full test plan was executed on my local Mac with real responses. All test cases are summarized in the test matrix above, covering {stream, non-stream} x {chat, completion} x {stop_token_ids, stop strings, combined, ...}. Please check the linked comment.

I also see some other gaps in MLX backend support that aren't directly related to this PR, such as the harmony paths and the messages/generate endpoints. I'll investigate and follow up in a separate discussion.
