feat(mlx-grpc): support string stop sequences for chat and completion#1447
feat(mlx-grpc): support string stop sequences for chat and completion#1447zach-li-sudo wants to merge 8 commits into
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
📝 WalkthroughWalkthroughThis PR enables MLX stop-sequence support by tokenizing user-provided stop strings into token IDs during request building and resolving matched stop IDs back to user-facing values during response processing using request context. ChangesMLX Stop-Sequence Support
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b4ce20db09
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@model_gateway/src/routers/grpc/common/stages/helpers.rs`:
- Around line 83-99: The code calls resolve_mlx_stop_ids(stop, tokenizer) before
verifying the request is the MLX variant, which can tokenize unnecessarily and
produce spurious errors; change the order so you first match on
ProtoGenerateRequest::Mlx (e.g., if let ProtoGenerateRequest::Mlx(req) =
proto_request { ... }), bail early if not MLX, then check for Some(stop) and
only then call resolve_mlx_stop_ids(stop, tokenizer) and extend
sampling.stop_token_ids; ensure you still return Ok(()) when stop is None or
when sampling_params is missing.
In `@model_gateway/src/routers/grpc/proto_wrapper.rs`:
- Around line 741-747: The MLX variant's matched_stop_json() can return raw
integer token IDs; update the five unguarded call sites
(process_non_streaming_generate_response,
process_non_streaming_messages_response, process_non_streaming_chat_response
(harmony), the Harmony streaming response processing, and the Harmony streaming
variant) so they do not consume raw matched_stop_json() directly: either guard
with is_mlx() and call resolve_mlx_matched_stop_json() for Mlx, or always call
resolve_mlx_matched_stop_json() (which uses mlx_matched_stop_token_id()) before
using the value; ensure any code paths that previously read matched_stop_json()
now receive the resolved string form.
In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 422-453: Update the docstring for stop_strings_to_token_ids to
document that tokenizer.encode(...) errors are not propagated but are logged and
the corresponding stop string is skipped (i.e., both zero-token encodings and
encoder errors are warn-and-skipped), and clarify that only multi-token
encodings produce an Err result; reference the function name
stop_strings_to_token_ids and the call tokenizer.encode to make clear where this
behavior occurs.
- Around line 496-508: apply_mlx_stop_sequences currently extends
sampling.stop_token_ids with values converted by resolve_mlx_stop_ids without
checking if the request already set explicit stop_token_ids; add a validation
that rejects the request when both stop strings and explicit stop_token_ids are
provided. To fix, change apply_mlx_stop_sequences (or its caller) to receive the
original request's stop_token_ids (or a boolean flag) and if that vector is
non-empty and stop_strings is present, return a bad_request error; alternatively
perform this check in the caller before invoking apply_mlx_stop_sequences.
Ensure the error originates from the same error::bad_request pattern and
reference sampling.stop_token_ids, apply_mlx_stop_sequences, and
resolve_mlx_stop_ids in your change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8297696e-30a4-4e03-9896-ce3f2cb893d9
📒 Files selected for processing (13)
Cargo.tomlcrates/grpc_client/src/mlx_engine.rscrates/protocols/src/completion.rscrates/tokenizer/src/mock.rsmodel_gateway/Cargo.tomlmodel_gateway/src/routers/grpc/common/stages/helpers.rsmodel_gateway/src/routers/grpc/proto_wrapper.rsmodel_gateway/src/routers/grpc/regular/processor.rsmodel_gateway/src/routers/grpc/regular/stages/chat/request_building.rsmodel_gateway/src/routers/grpc/regular/stages/completion/request_building.rsmodel_gateway/src/routers/grpc/regular/streaming.rsmodel_gateway/src/routers/grpc/utils/chat_utils.rsmodel_gateway/src/routers/grpc/utils/mod.rs
| pub(crate) fn resolve_mlx_stop_ids( | ||
| stop_strings: &StringOrArray, | ||
| tokenizer: Option<&dyn Tokenizer>, | ||
| ) -> Result<Vec<u32>, Response> { | ||
| let tok = tokenizer.ok_or_else(|| { | ||
| error::bad_request( | ||
| "tokenizer_unavailable", | ||
| "MLX backend requires a tokenizer to convert string stop sequences", | ||
| ) | ||
| })?; | ||
| stop_strings_to_token_ids(stop_strings.iter(), tok) | ||
| .map_err(|e| error::bad_request("unsupported_stop_string", e)) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm that apply_mlx_stop_sequences (or its caller) validates the
# "stop strings AND explicit stop_token_ids present simultaneously" case.
rg -n -B5 -A30 "apply_mlx_stop_sequences" --type rustRepository: lightseekorg/smg
Length of output: 6747
🏁 Script executed:
#!/bin/bash
# Get the full implementation of apply_mlx_stop_sequences
cat -n model_gateway/src/routers/grpc/common/stages/helpers.rs | sed -n '73,110p'Repository: lightseekorg/smg
Length of output: 1141
🏁 Script executed:
#!/bin/bash
# Search for any validation that checks for both stop strings and stop_token_ids simultaneously
rg -n "stop_token_ids|stop_strings" model_gateway/src/routers/grpc/common/stages/helpers.rs -A3 -B3Repository: lightseekorg/smg
Length of output: 604
🏁 Script executed:
#!/bin/bash
# Check the proto_request structure and any validation in the pipeline before apply_mlx_stop_sequences
rg -n "stop_token_ids" model_gateway/src/routers/grpc --type rust -B2 -A2 | head -100Repository: lightseekorg/smg
Length of output: 8584
🏁 Script executed:
#!/bin/bash
# Search for validation that rejects both stop strings and stop_token_ids simultaneously
rg -n "stop_token_ids.*stop\|stop.*stop_token_ids" model_gateway/src/routers/grpc --type rust -B3 -A3 | grep -A5 -B5 "error\|invalid\|reject\|bad_request\|400" | head -60Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check the completion/chat request stages before apply_mlx_stop_sequences for validation
rg -n "apply_mlx_stop_sequences" -B50 model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs | grep -A5 -B5 "stop_token_ids\|error\|validation"Repository: lightseekorg/smg
Length of output: 1532
🏁 Script executed:
#!/bin/bash
# Search for PR test plan or any tests that verify the simultaneous stop+stop_token_ids rejection
fd -e rs -e toml --type f . | xargs rg -l "both.*stop\|simultaneous.*stop" | head -10Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Find the proto definition for MLX GenerateRequest to see stop_token_ids field
fd -e proto --type f . | xargs rg -l "MLX.*Generate\|GenerateRequest" | head -5Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Search for where proto_request.stop_token_ids gets populated initially
rg -n "stop_token_ids\s*=" model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs -B5 -A2Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check the proto structure - search for MLX proto message definitions
rg -n "message.*Mlx|stop_token_ids" model_gateway/src/proto --type proto -A3 -B3 2>/dev/null | head -80Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Find proto files
find model_gateway -name "*.proto" -type fRepository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check how ProtoGenerateRequest::Mlx is constructed - search in the request building stages
rg -n "ProtoGenerateRequest::Mlx\|MlxGenerateRequest" model_gateway/src --type rust -A10 -B2 | head -100Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Look at the full chat request building stage to see proto construction
cat -n model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs | head -150Repository: lightseekorg/smg
Length of output: 5254
🏁 Script executed:
#!/bin/bash
# Search for build_chat_request implementation to see if it validates stop_token_ids
rg -n "fn build_chat_request" model_gateway/src --type rust -A50 | head -100Repository: lightseekorg/smg
Length of output: 4459
🏁 Script executed:
#!/bin/bash
# Search for any validation that explicitly rejects both stop and stop_token_ids
rg -n "stop_token_ids.*is_some\|stop_token_ids.*is_some_and" model_gateway/src/routers/grpc --type rust -B5 -A5Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check if there's validation in preparation stage for MLX
rg -n "MLX.*stop\|is_mlx.*stop" model_gateway/src/routers/grpc/regular/stages -B3 -A3 --type rustRepository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Search for MLX client build_generate_request_from_chat implementation
rg -n "impl.*Mlx|build_generate_request_from_chat" model_gateway/src --type rust -A30 | grep -A30 "Mlx" | head -80Repository: lightseekorg/smg
Length of output: 2645
🏁 Script executed:
#!/bin/bash
# Look for where proto request stop_token_ids gets set from the input
rg -n "sampling_params\|stop_token_ids" model_gateway/src/routers/grpc/client.rs -B2 -A2 | head -60Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check if there's a validation stage before MLX processing
rg -n "stop_token_ids\s*=\|Some.*stop_token_ids" model_gateway/src/routers/grpc --type rust -B3 -A3 | grep -B5 -A5 "bad_request\|error\|reject" | head -40Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Find the MLX client implementation file
find model_gateway -name "*mlx*" -type f | grep -E "\.(rs|proto)$"Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Search for MlxClient or mlx_client struct/impl
rg -n "struct MlxClient\|impl.*MlxClient" model_gateway/src --type rust -A60 | head -120Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
/#/bash
# Look for where stop_token_ids from ChatCompletionRequest gets used/copied to proto
rg -n "body\.stop_token_ids\|\.stop_token_ids\s*:" model_gateway/src/routers/grpc --type rust -B3 -A3Repository: lightseekorg/smg
Length of output: 114
🏁 Script executed:
#!/bin/bash
# Search for all references to stop_token_ids in grpc/client context
rg -n "stop_token_ids" model_gateway/src/routers/grpc --type rust -B3 -A3 | grep -E "\.rs-|\.rs:" | head -100Repository: lightseekorg/smg
Length of output: 9658
🏁 Script executed:
#!/bin/bash
# Look for MlxClient struct definition and its build_generate_request_from_chat method
rg -n "struct MlxClient" model_gateway/src --type rust -A100 | head -150Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Search more broadly for where MLX proto is being constructed
rg -n "MlxGenerateRequest\|GenerateRequest::Mlx" model_gateway/src --type rust -B5 -A10 | head -100Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Search for files containing Mlx proto or client references
find model_gateway -type f -name "*.rs" | xargs grep -l "Mlx\|mlx" | grep -v test | head -20Repository: lightseekorg/smg
Length of output: 978
🏁 Script executed:
#!/bin/bash
# Look at files that might contain MLX proto generation
rg -n "stop_token_ids" model_gateway/src/routers/grpc/client/mlx* --type rust 2>/dev/null || echo "No mlx client file found"Repository: lightseekorg/smg
Length of output: 85
🏁 Script executed:
#!/bin/bash
# Search for proto definitions or where MlxGenerateRequest is created
rg -n "sampling_params\|MlxGenerateRequest\|struct.*Mlx" model_gateway/src --type rust | head -50Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check proto_wrapper.rs to see how MlxGenerateRequest is defined
cat -n model_gateway/src/routers/grpc/proto_wrapper.rs | head -200Repository: lightseekorg/smg
Length of output: 7912
🏁 Script executed:
#!/bin/bash
# Search for where the proto request gets stop_token_ids from user input
rg -n "stop_token_ids" model_gateway/src/routers/grpc/proto_wrapper.rs -B5 -A5Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check where MLX client build_generate_request_from_chat is implemented
rg -n "fn build_generate_request_from_chat" model_gateway/src --type rust -B3 -A50 | head -150Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Search for where stop_token_ids from the request body is actually handled in proto creation
rg -n "body\." model_gateway/src/routers/grpc/client.rs | grep -i stopRepository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Look at the actual client implementations - search for stop handling
rg -n "SglangClient\|VllmClient\|TrtllmClient\|MlxClient" model_gateway/src/routers/grpc/client.rs -A100 | head -300Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check if smg_grpc_client provides the client implementations
rg -n "use.*MlxClient\|from_path\|build_generate_request" model_gateway/src/routers/grpc/client.rs | head -30Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Look at the client.rs file structure
wc -l model_gateway/src/routers/grpc/client.rsRepository: lightseekorg/smg
Length of output: 105
🏁 Script executed:
#!/bin/bash
# Get the full context of how clients are structured
cat -n model_gateway/src/routers/grpc/client.rs | sed -n '1,100p'Repository: lightseekorg/smg
Length of output: 3689
🏁 Script executed:
#!/bin/bash
# Look for where body.stop_token_ids might be referenced or used
rg -n "\.stop_token_ids" model_gateway/src/routers/grpc/client.rs -B3 -A3Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check if there's validation logic that rejects stop_token_ids for MLX backends
rg -n "stop_token_ids" model_gateway/src/routers/grpc/regular/stages --type rust -B5 -A5 | head -100Repository: lightseekorg/smg
Length of output: 4784
🏁 Script executed:
#!/bin/bash
# Search for validation in preparation stage that might reject stop_token_ids for MLX
rg -n "is_mlx\|MLX" model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs -B3 -A3 | head -80Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Look for any error/validation related to stop_token_ids in the entire grpc module
rg -n "stop_token_ids.*reject\|stop_token_ids.*error\|stop_token_ids.*bad_request\|unsupported.*stop_token_ids" model_gateway/src/routers/grpc --type rustRepository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Let's check what the proto request structure looks like - check if stop_token_ids is pre-populated
cat -n model_gateway/src/routers/grpc/proto_wrapper.rs | sed -n '200,500p'Repository: lightseekorg/smg
Length of output: 12973
🏁 Script executed:
#!/bin/bash
# Search for ProtoGenerateRequest enum definition
rg -n "enum ProtoGenerateRequest\|enum.*Generate" model_gateway/src/routers/grpc/proto_wrapper.rs -A20Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check if there's any logic that copies stop_token_ids from ChatCompletionRequest to proto
rg -n "stop_token_ids" model_gateway/src/routers/grpc/client.rsRepository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Search for tests related to MLX stop sequences
find model_gateway -name "*.rs" -type f | xargs grep -l "mlx.*stop\|stop.*mlx" | head -10Repository: lightseekorg/smg
Length of output: 511
🏁 Script executed:
#!/bin/bash
# Look for any test files for chat_utils.rs
find model_gateway -name "*test*" -o -name "*tests*" | grep -i chatRepository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Search for PR test plan or documentation about stop_token_ids + stop strings rejection
rg -n "both.*stop\|simultaneously.*stop\|PR.*test.*plan" model_gateway --type rust -B2 -A2Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
/#/bash
# Check if there's a validation layer that validates both can't be present
rg -n "reject\|unsupported" model_gateway/src/routers/grpc/regular/stages/chat --type rust -B3 -A3 | grep -i stopRepository: lightseekorg/smg
Length of output: 114
🏁 Script executed:
#!/bin/bash
# Look for the actual client implementations that do build_generate_request_from_chat
# These are external, but let's check if there's any wrapper or validation
cd model_gateway && grep -r "build_generate_request_from_chat" --include="*.rs" -A5 -B5 | head -80Repository: lightseekorg/smg
Length of output: 6402
🏁 Script executed:
#!/bin/bash
# Check the actual ChatCompletionRequest structure to see if it has stop_token_ids
rg -n "struct ChatCompletionRequest\|stop_token_ids" model_gateway/src --type rust | grep -A10 "ChatCompletionRequest"Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
/#/bash
# Look for validation that rejects both stop and stop_token_ids on MLX
rg -n "stop_token_ids.*Some\|Some.*stop_token_ids" model_gateway/src/routers/grpc --type rust -B5 -A5 | head -100Repository: lightseekorg/smg
Length of output: 114
The validation for rejecting simultaneous "stop strings + explicit stop_token_ids" is not present in apply_mlx_stop_sequences.
apply_mlx_stop_sequences (helpers.rs:78–100) receives only the stop strings from the request, not the explicit stop_token_ids field. The function blindly extends sampling.stop_token_ids (line 96) without checking whether stop_token_ids were already populated from the user request. No guard exists in this function to reject the case where both parameters are provided. If this validation is required per the PR test plan, it either needs to be added to apply_mlx_stop_sequences (with access to the original request's stop_token_ids) or implemented elsewhere in the request pipeline before this function is called.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@model_gateway/src/routers/grpc/utils/chat_utils.rs` around lines 496 - 508,
apply_mlx_stop_sequences currently extends sampling.stop_token_ids with values
converted by resolve_mlx_stop_ids without checking if the request already set
explicit stop_token_ids; add a validation that rejects the request when both
stop strings and explicit stop_token_ids are provided. To fix, change
apply_mlx_stop_sequences (or its caller) to receive the original request's
stop_token_ids (or a boolean flag) and if that vector is non-empty and
stop_strings is present, return a bad_request error; alternatively perform this
check in the caller before invoking apply_mlx_stop_sequences. Ensure the error
originates from the same error::bad_request pattern and reference
sampling.stop_token_ids, apply_mlx_stop_sequences, and resolve_mlx_stop_ids in
your change.
There was a problem hiding this comment.
Processing simultaneous "stop strings + explicit stop token ids" is handled in apply_mlx_stop_sequences at helpers.rs. Commit 15c1227 ("fix double gated apply_mlx_stop_sequences") moved resolve_mlx_stop_ids inside the ProtoGenerateRequest::Mlx branch and added a comment clarifying the no-op for non-MLX backends.
If both "stop strings + explicit stop token ids" are received, the behaviours are:
- combined: single-token stop + stop_token_ids => 200, return with matched string/id
- combined: multi-token stop + stop_token_ids => 400, MLX not support multi-token string stop
These 2 scenarios are covered by test cases 3.x.4 and 3.x.5 here
curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stop_token_ids": [20],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'{
"response": {
"id": "chatcmpl-019e13c9-0ea1-77c0-b430-bf53fbcdede2",
"object": "chat.completion",
"created": 1778448404,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1 \n2 \n3 \n4 \n",
"reasoning_content": null
},
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 9,
"total_tokens": 30
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}and
curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}There was a problem hiding this comment.
@zach-li-sudo, thank you for the detailed clarification and the manual test evidence! My concern was based on an incorrect assumption — I assumed the design should outright reject requests when both stop and stop_token_ids are provided simultaneously, but the actual design is more nuanced and correct:
- Single-token stop +
stop_token_ids→ Both are honored together (merged intosampling.stop_token_ids), and the matched stop is resolved back to the correct user-facing form viaresolve_mlx_matched_stop_json. ✅ - Multi-token stop +
stop_token_ids→ Rejected with 400 becauseresolve_mlx_stop_ids(inside theMlxbranch) fails on multi-token encoding before anything is extended. ✅
The commit 15c1227 properly gates resolve_mlx_stop_ids inside ProtoGenerateRequest::Mlx, and the manual test results for cases 3.x.4 and 3.x.5 confirm the behavior is working as designed. My concern doesn't apply here — the implementation is correct.
( ˶ᵔ ᵕ ᵔ˶ )
🧠 Learnings used
Learnt from: slin1237
Repo: lightseekorg/smg PR: 447
File: model_gateway/src/routers/grpc/client.rs:312-328
Timestamp: 2026-02-17T20:30:27.647Z
Learning: Actionable guideline: In model_gateway gRPC metadata discovery (specifically in model_gateway/src/routers/grpc/...), verify how keys are handled for different proto sources. SGLang uses short-form keys (tp_size, dp_size, pp_size) via pick_prost_fields() without normalization, while vLLM/TRT-LLM use long-form keys (tensor_parallel_size, pipeline_parallel_size) that pass through flat_labels() and are normalized by normalize_grpc_keys() in discover_metadata.rs after model_info.to_labels() and device/server_info.to_labels(). Ensure reviewers check that the code paths correctly reflect these normalization rules and that tests cover both code paths.
Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 399
File: protocols/src/interactions.rs:505-509
Timestamp: 2026-02-19T03:08:50.192Z
Learning: In code reviews for Rust projects using the validator crate (v0.20.0), ensure that custom validation functions for numeric primitive types (e.g., f32, i32, u32, i16, etc.) accept the value by value, not by reference. Example: fn validate(value: f32) { ... }. The validator derive macro has a hardcoded list of numeric types that are passed by value, while all other types are passed by reference. Apply this guideline whenever validating numeric fields to align with the derive macro behavior.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/src/core/token_bucket.rs:58-63
Timestamp: 2026-02-21T02:30:51.443Z
Learning: For lint-only/Clippy enforcement PRs in this repository, avoid introducing behavioral changes (e.g., new input validation or logic changes). Treat such PRs as non-functional changes and plan a separate follow-up issue/PR for hardening or behavior changes. This applies broadly to Rust files across the repo; during review, focus on lint/style corrections and clearly note any intentional exceptions.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: protocols/src/responses.rs:928-931
Timestamp: 2026-02-21T02:36:00.882Z
Learning: In Rust code across the repository, use the marker INVARIANT: to document assumptions in safe code. Reserve SAFETY: for explaining why unsafe blocks are sound. This improves clarity of invariants and safety reasoning. Example reference: protocols/src/responses.rs near validate_tool_choice_with_tools().
Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: mesh/src/sync.rs:83-83
Timestamp: 2026-02-21T02:37:01.416Z
Learning: General Rust formatting rule: format! with implicit captures only supports simple identifiers, not full expressions like {state.model_id}. For cases where you want to interpolate a field or expression, bind the value first and interpolate the binding, e.g., let model_id = &state.model_id; and then use format!("policy:{}", model_id). In the specific file mesh/src/sync.rs, prefer format!("policy:{}", state.model_id) or bind to a local variable if you need named interpolation, to keep clarity and avoid unintended captures.
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 807
File: model_gateway/src/middleware.rs:61-81
Timestamp: 2026-03-18T21:32:00.041Z
Learning: In Rust code using the http crate, HeaderMap::get() is effectively case-insensitive because HeaderName normalizes keys to lowercase on insertion and lookup. Do not require or perform explicit .to_lowercase() before HeaderMap::get() calls. Mark as not a concern for case-sensitivity in lookups; only consider normalization when inserting or comparing via HeaderName, not in lookups.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 867
File: tui/src/app.rs:798-813
Timestamp: 2026-03-22T20:13:55.778Z
Learning: In this repo (lightseekorg/smg), treat the workspace `Cargo.toml`’s `package.rust-version` (MSRV) as the source of truth (e.g., `rust-version = "1.85"`). When reviewing Rust changes, do not flag usage of Rust language/library features that were stabilized on or before the MSRV (e.g., `Option::is_none_or`, stabilized in 1.82, is compatible with an MSRV of 1.85). Always verify the MSRV from the workspace `Cargo.toml` rather than relying on issue templates.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 937
File: model_gateway/src/core/worker.rs:0-0
Timestamp: 2026-03-27T03:20:19.917Z
Learning: When calling `worker.record_outcome(status_code: u16)` (the unified Circuit Breaker outcome recording API), it’s valid to pass *synthetic* HTTP status codes for transport/connection errors where no real HTTP response was received. For example, callers may pass `502` (send error), `504` (timeout), or other appropriate `502/503/504`-style synthetic codes to preserve CB feedback. Do not flag these calls as incorrect usage of `record_outcome`. Health checks should still handle reachability separately.
|
Thanks for the contribution, Zhuo! The feature itself is small and the tests are nice — but the diff feels heavier than it needs to be because backend-type branching is leaking into orchestration code. A few things to consider before merge: 1.
|
|
Hi @zach-li-sudo, the DCO sign-off check has failed. All commits must include a To fix existing commits: # Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-leaseTo sign off future commits automatically:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c7e1728cf1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| .matched_stop_token_id | ||
| .map(|id| serde_json::Value::Number(id.into())), | ||
| // MLX requires request context to resolve the token ID; use matched_stop_json_with_context. | ||
| Self::Mlx(_) => unreachable!("matched_stop_json called for MLX backend"), |
There was a problem hiding this comment.
Avoid panicking for MLX in
matched_stop_json
Do not make the MLX branch unreachable here until every caller is migrated to matched_stop_json_with_context: several active paths still call matched_stop_json() directly (for example regular/processor.rs at lines 426 and 713, and regular/streaming.rs at line 2010). For any MLX completion in those flows (non-streaming /generate, non-streaming /messages, and streaming Messages), this now triggers a runtime panic instead of producing a normal response.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Similar to unreachable!() usage here for MLX multi-modal feature. All occurrence, when matched stop is needed in building JSON response for MLX, can only be resolved by matched_stop_json_with_context() where all other backends (vLLM, SGLang, etc) are no-op. So code execution shouldn't reach here. This Self::Mlx(_) is kept because of the exhaustion match nature of Rust language.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
model_gateway/src/routers/grpc/common/stages/helpers.rs (1)
283-305: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winOptimize by checking MLX variant before tokenization.
The current order (check
stop, tokenize, then check MLX variant) can waste CPU tokenizing stops for non-MLX backends. Checking the MLX variant first avoids unnecessary tokenization:♻️ Proposed reordering
pub(crate) fn apply_mlx_stop_sequences( proto_request: &mut ProtoGenerateRequest, stop: Option<&StringOrArray>, tokenizer: Option<&dyn Tokenizer>, ) -> Result<(), Response> { + let ProtoGenerateRequest::Mlx(req) = proto_request else { + return Ok(()); + }; let Some(stop) = stop else { return Ok(()); }; - - if let ProtoGenerateRequest::Mlx(req) = proto_request { - let token_ids = resolve_mlx_stop_ids(stop, tokenizer)?; - let sampling = req.sampling_params.as_mut().ok_or_else(|| { - error::internal_error( - "mlx_sampling_params_missing", - "MLX GenerateRequest has no sampling_params; cannot inject stop IDs", - ) - })?; - sampling.stop_token_ids.extend(token_ids); - } - - Ok(()) + let token_ids = resolve_mlx_stop_ids(stop, tokenizer)?; + let sampling = req.sampling_params.as_mut().ok_or_else(|| { + error::internal_error( + "mlx_sampling_params_missing", + "MLX GenerateRequest has no sampling_params; cannot inject stop IDs", + ) + })?; + sampling.stop_token_ids.extend(token_ids); + Ok(()) }🤖 Prompt for AI Agents
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a1ca32dd-17bc-490f-9af2-dc65c5b82612
📒 Files selected for processing (11)
crates/grpc_client/src/mlx_engine.rs
crates/protocols/src/completion.rs
crates/tokenizer/src/mock.rs
model_gateway/src/routers/grpc/common/stages/helpers.rs
model_gateway/src/routers/grpc/proto_wrapper.rs
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
model_gateway/src/routers/grpc/regular/streaming.rs
model_gateway/src/routers/grpc/utils/chat_utils.rs
model_gateway/src/routers/grpc/utils/mod.rs
…lightseekorg#1099) Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
…d no-ops on non-MLX Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fac846fbd6
```rust
choices: vec![CompletionStreamChoice {
    text: String::new(),
    index,
    logprobs: None,
    finish_reason: Some("stop".to_string()),
    ..Default::default()
}],
```
Include matched_stop in chunk-stopped completion streams
When the local stop decoder fires in the Chunk branch (stopped == true), the emitted terminal CompletionStreamChoice is built with ..Default::default() and never sets matched_stop. Because the index is then added to stopped_indices, the later Complete message is skipped, so the new matched_stop field is never populated for these streams. This affects streamed completions that terminate on gateway-detected stop sequences (for example user-provided stop), and clients lose the stop match metadata even though termination reason is stop.
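A sketch of the suggested fix, under assumed, simplified types (the real CompletionStreamChoice in the protocols crate has more fields): instead of leaving `matched_stop` at its default when building the terminal chunk, thread the matched stop value through explicitly.

```rust
// Simplified stand-in for the protocols crate's stream choice type.
#[derive(Debug, Default, PartialEq)]
struct CompletionStreamChoice {
    text: String,
    index: u32,
    logprobs: Option<String>, // placeholder type for illustration
    finish_reason: Option<String>,
    matched_stop: Option<String>,
}

// Terminal choice emitted when the gateway-side stop decoder fires:
// carry the matched stop instead of losing it via `..Default::default()`.
fn terminal_choice(index: u32, matched_stop: Option<String>) -> CompletionStreamChoice {
    CompletionStreamChoice {
        text: String::new(),
        index,
        finish_reason: Some("stop".to_string()),
        matched_stop, // previously defaulted to None on this path
        ..Default::default()
    }
}
```

The key point is only that the chunk-stopped path must populate the field before the index is added to `stopped_indices`, since the later `Complete` message is skipped for those streams.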
Nice catch! I've also tested vLLM backend:
curl http://localhost:3000/v1/completions -s -N \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stream": true,
"max_tokens": 100
}'
data: {"id":"cmpl_019e1a56-4bfe-7db2-b42d-5af3e093ac4f","object":"text_completion","created":1778558323,"choices":[{"text":"5","index":0,"finish_reason":null}],"model":"Qwen/Qwen2.5-1.5B-Instruct","system_fingerprint":"default"}
data: {"id":"cmpl_019e1a56-4bfe-7db2-b42d-5af3e093ac4f","object":"text_completion","created":1778558323,"choices":[{"text":"\n","index":0,"finish_reason":null}],"model":"Qwen/Qwen2.5-1.5B-Instruct","system_fingerprint":"default"}
data: {"id":"cmpl_019e1a56-4bfe-7db2-b42d-5af3e093ac4f","object":"text_completion","created":1778558323,"choices":[{"text":"","index":0,"finish_reason":"stop"}],"model":"Qwen/Qwen2.5-1.5B-Instruct","system_fingerprint":"default"}
data: [DONE]
No matched_stop="6" appeared in the terminal response chunk. This impacts both the vLLM and MLX paths for streamed completions. Will create a separate PR to add it.
Manual testing results: stop sequence with MLX backend

1. SMG with MLX backend setup
pip install -e grpc_servicer/
source .venv/bin/activate && python -m smg_grpc_servicer.mlx.server \
--model mlx-community/Qwen3-4B-Instruct-2507-4bit --port 50051
./target/debug/smg --worker-urls grpc://localhost:50051 --port 3000

2. Testing scenarios

This PR is to support a string stop array in regular chat and completion requests.

Key differences from vLLM
The gateway converts single-token stop strings to token IDs for MLX requests (the MLX proto has only stop_token_ids, no string stops).

Token ID reference

Token IDs used in the tests below assume the Qwen3 tokenizer, which shares vocabulary with Qwen2.5:
Verify with the tokenizer if results are unexpected:

```python
from mlx_lm import load
_, tokenizer = load("mlx-community/Qwen3-4B-Instruct-2507-4bit")
print(tokenizer.encode("5 6"))  # check IDs for context-free "5" and "6"
```

3. Results

Test matrix: 4 paths × 5 stop modes = 20 cases.
3.1 Chat, non-stream

3.1.1 — stop string, single token

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "chatcmpl-019e12f3-af39-7bc1-a46c-4173bebe0cbc",
"object": "chat.completion",
"created": 1778434420,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1 \n2 \n3 \n4 \n5 \n",
"reasoning_content": null
},
"finish_reason": "stop",
"matched_stop": "6"
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 11,
"total_tokens": 32
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.1.2 — stop string, multi-token

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Say: hi there and hello world!"}],
"stop": ["hello world"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.1.3 — stop_token_ids

curl http://localhost:3000/v1/chat/completions -s \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq

{
"id": "chatcmpl-019e12f3-b06a-7e30-9026-b277ef4cd022",
"object": "chat.completion",
"created": 1778434420,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1 \n2 \n3 \n4 \n",
"reasoning_content": null
},
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 9,
"total_tokens": 30
},
"system_fingerprint": "default"
}

3.1.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stop_token_ids": [20],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "chatcmpl-019e13c9-0ea1-77c0-b430-bf53fbcdede2",
"object": "chat.completion",
"created": 1778448404,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1 \n2 \n3 \n4 \n",
"reasoning_content": null
},
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 9,
"total_tokens": 30
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.1.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.2 Chat, stream

3.2.1 — stop string, single token

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stream": true,
"max_tokens": 100
}'

3.2.2 — stop string, multi-token

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
"stop": ["hello world"],
"stream": true,
"max_tokens": 100
}'

3.2.3 — stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -N \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'

3.2.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stop_token_ids": [20],
"stream": true,
"max_tokens": 100
}'

3.2.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'

3.3 Completion, non-stream

3.3.1 — stop string, single token

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "cmpl_019e12f4-9eda-71c2-bf57-ab5fca3036f6",
"object": "text_completion",
"created": 1778434481,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"text": "5\n",
"index": 0,
"finish_reason": "stop",
"matched_stop": "6"
}
],
"usage": {
"prompt_tokens": 22,
"completion_tokens": 3,
"total_tokens": 25
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.3.2 — stop string, multi-token

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Repeat exactly: 1 2 3 hello world 4 5",
"stop": ["hello world"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.3.3 — stop_token_ids

curl http://localhost:3000/v1/completions -s \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq

{
"id": "cmpl_019e12f4-9faf-7582-bcc0-bb7849cf6585",
"object": "text_completion",
"created": 1778434482,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"text": "",
"index": 0,
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 22,
"completion_tokens": 1,
"total_tokens": 23
},
"system_fingerprint": "default"
}
3.3.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stop_token_ids": [20],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "cmpl_019e13c9-6957-7f10-8038-53f7c215dd87",
"object": "text_completion",
"created": 1778448427,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"text": "",
"index": 0,
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 22,
"completion_tokens": 1,
"total_tokens": 23
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.3.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.4 Completion, stream

3.4.1 — stop string, single token

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stream": true,
"max_tokens": 100
}'

3.4.2 — stop string, multi-token

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Repeat exactly: 1 2 3 hello world 4 5",
"stop": ["hello world"],
"stream": true,
"max_tokens": 100
}'

3.4.3 — stop_token_ids

curl http://localhost:3000/v1/completions -s -N \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'

3.4.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stop_token_ids": [20],
"stream": true,
"max_tokens": 100
}'

3.4.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'
Hi Keyang, thanks for your comments! The issues you've mentioned are resolved as follows:
Other items:
Solution: used
Yes, the overall test plan was executed on my local Mac with real responses. All test cases are summarized in the test matrix above. I also see some other gaps (not directly related to this PR) in MLX backend support, such as harmony paths and the messages/generate endpoints; will discuss these after further investigation.
Description
Follow-up on (#1099) to support chat/completion with the stop field.

Problem
support string stop sequences for chat and completion
Solution
convert stop strings to stop token ids before passing to mlx backend
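The conversion step can be sketched as follows. The function name mirrors resolve_mlx_stop_ids from the review thread, but the signature and tokenizer are simplified stand-ins for illustration; each stop string must encode to exactly one token, otherwise the gateway rejects the request (the unsupported_stop_string 400 seen in the test plan).

```rust
// Sketch only: `encode` stands in for the gateway's tokenizer, and the error
// message mirrors the `unsupported_stop_string` response observed in testing.
fn resolve_mlx_stop_ids(
    stops: &[&str],
    encode: impl Fn(&str) -> Vec<u32>,
) -> Result<Vec<u32>, String> {
    let mut ids = Vec::new();
    for stop in stops {
        let toks = encode(stop);
        // MLX only accepts stop token IDs, so multi-token stop strings
        // cannot be represented and are rejected up front.
        if toks.len() != 1 {
            return Err(format!(
                "stop string {:?} encodes to {} tokens; MLX backend only supports single-token stop strings",
                stop,
                toks.len()
            ));
        }
        ids.push(toks[0]);
    }
    Ok(ids)
}
```

The resolved IDs are then appended to the MLX request's stop_token_ids alongside any IDs the user supplied directly.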
Changes
Test Plan
See the detailed curl commands and real responses here
Checklist
cargo +nightly fmt passes
cargo clippy --all-targets --all-features -- -D warnings passes
New Features
Improvements