
feat(mlx-grpc): support string stop sequences for chat and completion #1447

Open

zach-li-sudo wants to merge 8 commits into lightseekorg:main from zach-li-sudo:mlx-grpc

Conversation

@zach-li-sudo zach-li-sudo commented May 5, 2026

Description

Follow-up to #1099, adding support for chat/completion requests that carry the stop field.

curl http://localhost:3000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
     "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
     "stop": ["6"],
     "stream": false,
     "max_tokens": 100
   }'

Problem

The MLX backend does not natively accept string stop sequences for chat and completion requests; its proto only carries stop token IDs.

Solution

Convert stop strings to stop token IDs in the gateway before passing the request to the MLX backend.
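
For reference, a minimal Rust sketch of that conversion, assuming a small tokenizer trait with an encode method (the real helper is stop_strings_to_token_ids in chat_utils.rs, whose exact signature differs):

pub trait Tokenizer {
    fn encode(&self, text: &str) -> Result<Vec<u32>, String>;
}

fn stop_strings_to_token_ids<'a>(
    stops: impl IntoIterator<Item = &'a str>,
    tokenizer: &dyn Tokenizer,
) -> Result<Vec<u32>, String> {
    let mut ids = Vec::new();
    for s in stops {
        match tokenizer.encode(s) {
            // Exactly one token: usable directly as an MLX stop_token_id.
            Ok(t) if t.len() == 1 => ids.push(t[0]),
            // Zero tokens: warn-and-skip, per the behavior discussed in review below.
            Ok(t) if t.is_empty() => {
                eprintln!("stop string {s:?} encodes to zero tokens; skipping")
            }
            // Encoder errors are likewise logged and skipped in this PR.
            Err(e) => eprintln!("failed to encode stop string {s:?}; skipping: {e}"),
            // Multi-token stop strings cannot be expressed as one stop_token_id.
            Ok(t) => {
                return Err(format!(
                    "stop string {s:?} encodes to {} tokens; MLX backend only supports single-token stop strings",
                    t.len()
                ));
            }
        }
    }
    Ok(ids)
}

Multi-token stop strings are rejected so the caller can surface an HTTP 400, matching the behavior shown in the test plan below.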

Changes

Test Plan

  1. Unit tests for newly added helper functions
  2. Deploy personal dev stack (SMG + mlx inference backend) with manual tests:
    See detailed curl command and real responses here
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • MLX backend now supports string stop sequences for chat and completion API requests
    • Completion streaming responses now include matched stop sequence information
  • Improvements

    • Enhanced stop sequence resolution for MLX backend with full request context awareness, ensuring accurate token identification and proper reporting of matched stops

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the tokenizer, dependencies, grpc, protocols, and model-gateway labels May 5, 2026
coderabbitai Bot commented May 5, 2026

📝 Walkthrough

This PR enables MLX stop-sequence support by tokenizing user-provided stop strings into token IDs during request building and resolving matched stop IDs back to user-facing values during response processing using request context.
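
A rough Rust sketch of the response-side half of that flow; the parallel-array bookkeeping and the function shape are assumptions, with resolve_mlx_matched_stop_json being the real context-aware helper:

use serde_json::Value;

fn resolve_matched_stop(
    matched_token_id: Option<u32>,
    stop_strings: &[String],
    stop_string_ids: &[u32], // parallel to stop_strings: the IDs they tokenized to
    explicit_stop_token_ids: &[u32],
) -> Option<Value> {
    let id = matched_token_id?;
    // If the matched ID came from a tokenized user stop string, report the string.
    if let Some(pos) = stop_string_ids.iter().position(|&x| x == id) {
        return stop_strings.get(pos).cloned().map(Value::String);
    }
    // If it was an explicit stop_token_id, report the raw integer
    // (as observed in manual test 3.1.4 below).
    if explicit_stop_token_ids.contains(&id) {
        return Some(Value::Number(id.into()));
    }
    None
}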

Changes

MLX Stop-Sequence Support

  • Response Data Model (crates/protocols/src/completion.rs): CompletionStreamChoice adds an optional matched_stop field with conditional serialization and derives Default to support populated matched-stop values in streaming responses.
  • MLX Request Validation (crates/grpc_client/src/mlx_engine.rs): Removes reject_stop_strings validation from chat and completion request builders, allowing string stops; the messages and generate pipelines continue to reject them. Documentation updated.
  • Test Tokenizer Infrastructure (crates/tokenizer/src/mock.rs): MockTokenizer gains a fail_encode: bool field and a failing() constructor to simulate tokenizer encode failures in tests.
  • Stop-Sequence Resolution Helpers (model_gateway/src/routers/grpc/utils/chat_utils.rs): Adds stop_strings_to_token_ids (encode stops to single-token IDs), resolve_mlx_matched_stop_json (map matched token IDs to user values), and resolve_mlx_stop_ids (validate and orchestrate with error handling).
  • Stop-Sequence Test Coverage (model_gateway/src/routers/grpc/utils/chat_utils.rs): Comprehensive test suite covering tokenization success/error cases, HTTP error responses, and matched-stop mapping precedence.
  • MLX Stop-Sequence Injection Helper (model_gateway/src/routers/grpc/common/stages/helpers.rs): Introduces apply_mlx_stop_sequences to validate the tokenizer, tokenize optional stop sequences, and extend MLX sampling parameters with the resolved token IDs.
  • Chat and Completion Request Building (model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs, model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs): Request-building stages invoke apply_mlx_stop_sequences after metadata injection to populate MLX sampling parameters before finalizing protobuf requests.
  • Proto Wrapper Context-Aware Resolution (model_gateway/src/routers/grpc/proto_wrapper.rs): Adds an internal mlx_matched_stop_token_id() helper, treats the MLX backend as unreachable in matched_stop_json(), and updates matched_stop_json_with_context() to use resolve_mlx_matched_stop_json() with request context.
  • Non-Streaming Response Processing (model_gateway/src/routers/grpc/regular/processor.rs): process_single_choice and process_non_streaming_completion_response use matched_stop_json_with_context(), passing the request stop, stop token IDs, and tokenizer for proper matched-stop resolution.
  • Streaming Response Processing (model_gateway/src/routers/grpc/regular/streaming.rs): Chat streaming uses the context-aware matched_stop_json_with_context() for Complete events; completion streaming refactors choice construction to use Default::default() and sets matched_stop on final chunks using context-aware resolution.
  • Utils Re-exports (model_gateway/src/routers/grpc/utils/mod.rs): Extends the gRPC utils re-export block to include the resolve_mlx_matched_stop_json and resolve_mlx_stop_ids helpers.
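
For orientation, an abbreviated Rust sketch of the updated streaming-choice shape; only fields named in this walkthrough and the excerpts later in the thread are shown, and the serde attribute is an assumption consistent with "conditional serialization":

use serde::Serialize;
use serde_json::Value;

/// Abbreviated sketch; the real struct in crates/protocols has more fields.
#[derive(Default, Serialize)]
struct CompletionStreamChoice {
    text: String,
    index: u32,
    #[serde(skip_serializing_if = "Option::is_none")]
    logprobs: Option<Value>,
    #[serde(skip_serializing_if = "Option::is_none")]
    finish_reason: Option<String>,
    // New in this PR: omitted from the JSON unless populated.
    #[serde(skip_serializing_if = "Option::is_none")]
    matched_stop: Option<Value>,
}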

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • lightseekorg/smg#592: Modifies matched_stop handling in model_gateway gRPC pipeline across parser/processor/streaming paths.
  • lightseekorg/smg#606: Makes related changes to stop-sequence handling and router-side tokenization of string vs pre-tokenized stops.
  • lightseekorg/smg#1034: Introduces MLX client/proto and gateway plumbing that this PR directly extends for stop-sequence/token-id handling.

Suggested labels

tests

Suggested reviewers

  • key4ng
  • slin1237
  • CatherineSue

🐰 Stop sequences now dance with MLX tokens,
One string maps to one ID spoken,
Request builds, response resolves with care,
Context flows through the pipeline's air! 🎯

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: ✅ Passed. The title 'feat(mlx-grpc): support string stop sequences for chat and completion' directly and clearly describes the main feature addition—enabling string stop sequences for MLX gRPC chat and completion requests, which is the core objective of the changeset.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b4ce20db09


Comment thread crates/grpc_client/src/mlx_engine.rs
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

Inline comments:
In `@model_gateway/src/routers/grpc/common/stages/helpers.rs`:
- Around line 83-99: The code calls resolve_mlx_stop_ids(stop, tokenizer) before
verifying the request is the MLX variant, which can tokenize unnecessarily and
produce spurious errors; change the order so you first match on
ProtoGenerateRequest::Mlx (e.g., if let ProtoGenerateRequest::Mlx(req) =
proto_request { ... }), bail early if not MLX, then check for Some(stop) and
only then call resolve_mlx_stop_ids(stop, tokenizer) and extend
sampling.stop_token_ids; ensure you still return Ok(()) when stop is None or
when sampling_params is missing.

In `@model_gateway/src/routers/grpc/proto_wrapper.rs`:
- Around line 741-747: The MLX variant's matched_stop_json() can return raw
integer token IDs; update the five unguarded call sites
(process_non_streaming_generate_response,
process_non_streaming_messages_response, process_non_streaming_chat_response
(harmony), the Harmony streaming response processing, and the Harmony streaming
variant) so they do not consume raw matched_stop_json() directly: either guard
with is_mlx() and call resolve_mlx_matched_stop_json() for Mlx, or always call
resolve_mlx_matched_stop_json() (which uses mlx_matched_stop_token_id()) before
using the value; ensure any code paths that previously read matched_stop_json()
now receive the resolved string form.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 422-453: Update the docstring for stop_strings_to_token_ids to
document that tokenizer.encode(...) errors are not propagated but are logged and
the corresponding stop string is skipped (i.e., both zero-token encodings and
encoder errors are warn-and-skipped), and clarify that only multi-token
encodings produce an Err result; reference the function name
stop_strings_to_token_ids and the call tokenizer.encode to make clear where this
behavior occurs. (A docstring sketch follows this list.)
- Around line 496-508: apply_mlx_stop_sequences currently extends
sampling.stop_token_ids with values converted by resolve_mlx_stop_ids without
checking if the request already set explicit stop_token_ids; add a validation
that rejects the request when both stop strings and explicit stop_token_ids are
provided. To fix, change apply_mlx_stop_sequences (or its caller) to receive the
original request's stop_token_ids (or a boolean flag) and if that vector is
non-empty and stop_strings is present, return a bad_request error; alternatively
perform this check in the caller before invoking apply_mlx_stop_sequences.
Ensure the error originates from the same error::bad_request pattern and
reference sampling.stop_token_ids, apply_mlx_stop_sequences, and
resolve_mlx_stop_ids in your change.
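
For illustration, the docstring update requested above for stop_strings_to_token_ids (the chat_utils.rs 422-453 comment) might read along these lines; a sketch in Rust doc-comment form, with the final wording left to the author:

/// Converts user-provided stop strings into MLX stop token IDs.
///
/// For each stop string, tokenizer.encode(...) is consulted:
/// - exactly one token: the ID is included in the result;
/// - zero tokens, or an encoder error: the stop string is warn-logged and
///   skipped inside stop_strings_to_token_ids; errors are NOT propagated;
/// - more than one token: the function returns Err, since the MLX backend
///   only supports single-token stop strings.
fn stop_strings_to_token_ids() { /* signature and body elided */ }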

📥 Commits

Reviewing files that changed from the base of the PR and between 829604f and b4ce20d.

📒 Files selected for processing (13)
  • Cargo.toml
  • crates/grpc_client/src/mlx_engine.rs
  • crates/protocols/src/completion.rs
  • crates/tokenizer/src/mock.rs
  • model_gateway/Cargo.toml
  • model_gateway/src/routers/grpc/common/stages/helpers.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs
  • model_gateway/src/routers/grpc/utils/chat_utils.rs
  • model_gateway/src/routers/grpc/utils/mod.rs

Comment thread model_gateway/src/routers/grpc/common/stages/helpers.rs
Comment thread model_gateway/src/routers/grpc/proto_wrapper.rs
Comment thread model_gateway/src/routers/grpc/utils/chat_utils.rs
Comment on lines +496 to +508
pub(crate) fn resolve_mlx_stop_ids(
    stop_strings: &StringOrArray,
    tokenizer: Option<&dyn Tokenizer>,
) -> Result<Vec<u32>, Response> {
    let tok = tokenizer.ok_or_else(|| {
        error::bad_request(
            "tokenizer_unavailable",
            "MLX backend requires a tokenizer to convert string stop sequences",
        )
    })?;
    stop_strings_to_token_ids(stop_strings.iter(), tok)
        .map_err(|e| error::bad_request("unsupported_stop_string", e))
}
@coderabbitai coderabbitai Bot May 5, 2026

⚠️ Potential issue | 🔴 Critical

The validation for rejecting simultaneous "stop strings + explicit stop_token_ids" is not present in apply_mlx_stop_sequences.

apply_mlx_stop_sequences (helpers.rs:78–100) receives only the stop strings from the request, not the explicit stop_token_ids field. The function blindly extends sampling.stop_token_ids (line 96) without checking whether stop_token_ids were already populated from the user request. No guard exists in this function to reject the case where both parameters are provided. If this validation is required per the PR test plan, it either needs to be added to apply_mlx_stop_sequences (with access to the original request's stop_token_ids) or implemented elsewhere in the request pipeline before this function is called.


Author

Processing simultaneous "stop strings + explicit stop token ids" is handled in apply_mlx_stop_sequences at helpers.rs. Commit 15c1227 ("fix double gated apply_mlx_stop_sequences") moved resolve_mlx_stop_ids inside the ProtoGenerateRequest::Mlx branch and added a comment clarifying the no-op for non-MLX backends.

If both "stop strings + explicit stop token ids" are received, the behaviours are:

  • combined: single-token stop + stop_token_ids => 200, returns with the matched string/ID
  • combined: multi-token stop + stop_token_ids => 400, since MLX does not support multi-token string stops

These 2 scenarios are covered by test cases 3.x.4 and 3.x.5 here

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "id": "chatcmpl-019e13c9-0ea1-77c0-b430-bf53fbcdede2",
    "object": "chat.completion",
    "created": 1778448404,
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "1  \n2  \n3  \n4  \n",
          "reasoning_content": null
        },
        "finish_reason": "stop",
        "matched_stop": 20
      }
    ],
    "usage": {
      "prompt_tokens": 21,
      "completion_tokens": 9,
      "total_tokens": 30
    },
    "system_fingerprint": "default"
  },
  "status": "HTTP 200"
}

and

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "error": {
      "type": "Bad Request",
      "code": "unsupported_stop_string",
      "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
      "param": null
    }
  },
  "status": "HTTP 400"
}


@zach-li-sudo, thank you for the detailed clarification and the manual test evidence! My concern was based on an incorrect assumption — I assumed the design should outright reject requests when both stop and stop_token_ids are provided simultaneously, but the actual design is more nuanced and correct:

  • Single-token stop + stop_token_ids → Both are honored together (merged into sampling.stop_token_ids), and the matched stop is resolved back to the correct user-facing form via resolve_mlx_matched_stop_json. ✅
  • Multi-token stop + stop_token_ids → Rejected with 400 because resolve_mlx_stop_ids (inside the Mlx branch) fails on multi-token encoding before anything is extended. ✅

The commit 15c1227 properly gates resolve_mlx_stop_ids inside ProtoGenerateRequest::Mlx, and the manual test results for cases 3.x.4 and 3.x.5 confirm the behavior is working as designed. My concern doesn't apply here — the implementation is correct.

( ˶ᵔ ᵕ ᵔ˶ )



@zach-li-sudo zach-li-sudo changed the title "feat(mlx-grpc): support string stop sequences for chat and completion…" to "feat(mlx-grpc): support string stop sequences for chat and completion" May 7, 2026
@key4ng
Collaborator

key4ng commented May 7, 2026

Thanks for the contribution, Zhuo! The feature itself is small and the tests are nice — but the diff feels heavier than it needs to be because backend-type branching is leaking into orchestration code. A few things to consider before merge:

1. is_mlx() branching is repeated 4× — push it into the wrapper

The same if complete.is_mlx() { resolve_mlx_matched_stop_json(...) } else { complete.matched_stop_json() } block appears in:

  • model_gateway/src/routers/grpc/regular/processor.rs:187 (chat)
  • model_gateway/src/routers/grpc/regular/processor.rs:834 (completion)
  • model_gateway/src/routers/grpc/regular/streaming.rs:495 (chat stream)
  • model_gateway/src/routers/grpc/regular/streaming.rs:2589 (completion stream)

proto_wrapper.rs already encapsulates backend differences (matched_stop_json(), is_mlx_streaming_finished(), etc.). Could you add a matched_stop_json_with_context(stop, stop_token_ids, tokenizer) method on ProtoGenerateComplete that internally dispatches on the variant? Each caller collapses to one line, and mlx_matched_stop_token_id() no longer needs to be pub. Composes better when the next backend hits the same quirk.
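
A minimal Rust sketch of that dispatch, with stub variants standing in for the real proto wrappers and resolve_mlx as a placeholder for resolve_mlx_matched_stop_json:

use serde_json::Value;

enum ProtoGenerateComplete {
    Mlx { matched_stop_token_id: Option<u32> },
    Other { matched_stop: Option<Value> }, // stands in for SGLang/vLLM/TRT-LLM
}

fn resolve_mlx(matched: Option<u32>, stop: &[String], stop_ids: &[u32]) -> Option<Value> {
    let id = matched?;
    // Map the token ID back to the user's stop string if it originated from
    // one; otherwise report the raw ID.
    match stop_ids.iter().position(|&x| x == id) {
        Some(i) => stop.get(i).cloned().map(Value::String),
        None => Some(Value::Number(id.into())),
    }
}

impl ProtoGenerateComplete {
    /// Each of the four call sites collapses to this one method call; non-MLX
    /// variants ignore the request context and use their native value.
    fn matched_stop_json_with_context(&self, stop: &[String], stop_ids: &[u32]) -> Option<Value> {
        match self {
            Self::Mlx { matched_stop_token_id } => {
                resolve_mlx(*matched_stop_token_id, stop, stop_ids)
            }
            Self::Other { matched_stop } => matched_stop.clone(),
        }
    }
}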

2. apply_mlx_stop_sequences is double-gated

model_gateway/src/routers/grpc/common/stages/helpers.rs:82 pattern-matches ProtoGenerateRequest::Mlx(req), but both callers (chat/request_building.rs:111, completion/request_building.rs:90) already gate on builder_client.is_mlx(). Pick one — either the helper is unconditional and no-ops on non-MLX, or callers gate and the helper assumes MLX.

3. Silent encode-error swallow is a footgun

chat_utils.rs:443: stop_strings_to_token_ids correctly returns Err (→ 400) for multi-token strings, but silently warn!s and drops when the tokenizer errors or returns zero tokens. The new MockTokenizer::failing() test at line 904 locks this in:

// "Tokenizer errors are silently skipped; the function returns Ok(empty)."
assert!(result.unwrap().is_empty());

Both are the same class of "we cannot honor your stop condition" failure — caller sets stop: ["foo"], encoder errors, request proceeds to max_tokens, user blames SMG. Recommend surfacing both as 400.
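
A Rust sketch of the stricter variant, where every unusable stop string becomes an Err that the caller can map to a 400 (hypothetical; not the code in this PR):

fn stop_strings_to_token_ids_strict<'a>(
    stops: impl IntoIterator<Item = &'a str>,
    encode: impl Fn(&str) -> Result<Vec<u32>, String>,
) -> Result<Vec<u32>, String> {
    stops
        .into_iter()
        .map(|s| match encode(s) {
            Ok(t) if t.len() == 1 => Ok(t[0]),
            // All "cannot honor this stop condition" cases become errors, so
            // callers return 400 instead of silently generating to max_tokens.
            Ok(t) => Err(format!(
                "stop string {s:?} encodes to {} tokens; expected exactly 1",
                t.len()
            )),
            Err(e) => Err(format!("failed to encode stop string {s:?}: {e}")),
        })
        .collect()
}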

4. Is test-case = "3.3.1" worth a new workspace dep?

It's introduced as a brand-new workspace dependency just for ~13 parameterized cases across 2 functions in chat_utils.rs. Not used anywhere else in the workspace today. A plain #[test] with a slice of (input, expected, name) tuples and a loop gets the same coverage without the dep — and keeps the workspace lean. Up to you, but worth weighing.
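
The dependency-free alternative would look roughly like this; the cases and the stand-in helper are illustrative, not the PR's actual tests:

// Stand-in for the real tokenizer-backed helper; returns the single token ID
// for a stop string, or Err when it does not encode to exactly one token.
fn encode_single_token(s: &str) -> Result<u32, ()> {
    match s {
        "6" => Ok(21), // Qwen3 tokenizer ID, per the token reference below
        _ if s.contains(' ') => Err(()), // multi-word -> multi-token
        _ => Ok(0),
    }
}

#[test]
fn stop_string_tokenization_cases() {
    // One plain #[test] looping over (input, expected, name) tuples.
    let cases: &[(&str, Result<u32, ()>, &str)] = &[
        ("6", Ok(21), "single-token stop maps to its ID"),
        ("hello world", Err(()), "multi-token stop is rejected"),
    ];
    for (input, expected, name) in cases {
        assert_eq!(&encode_single_token(input), expected, "case: {name}");
    }
}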

Smaller items

  • streaming.rs got 6 new matched_stop: None struct-literal additions (lines 2422, 2450, 2470, 2505, 2529, 2552). Not blocking, but a ..Default::default() or a constructor on CompletionStreamChoice would prevent this sprawl next time.
  • Could you run the curl from the PR description on your local Mac (SMG + MLX backend) and paste the actual response in the test plan? Stream + non-stream, single stop string + multi-token rejection. That's the strongest signal we'll get for a Mac-specific path.
  • mlx_engine.rs:245 comment update is good; retaining reject_stop_strings for messages/generate paths is correct.

After (1)–(3), the MLX surface area shrinks a lot and the "MLX branches everywhere" feel goes away.

@github-actions github-actions Bot removed the dependencies Dependency updates label May 10, 2026
Contributor

mergify Bot commented May 10, 2026

Hi @zach-li-sudo, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c7e1728cf1


        .matched_stop_token_id
        .map(|id| serde_json::Value::Number(id.into())),
    // MLX requires request context to resolve the token ID; use matched_stop_json_with_context.
    Self::Mlx(_) => unreachable!("matched_stop_json called for MLX backend"),

P1: Avoid panicking for MLX in matched_stop_json

Do not make the MLX branch unreachable here until every caller is migrated to matched_stop_json_with_context: several active paths still call matched_stop_json() directly (for example regular/processor.rs at lines 426 and 713, and regular/streaming.rs at line 2010). For any MLX completion in those flows (non-streaming /generate, non-streaming /messages, and streaming Messages), this now triggers a runtime panic instead of producing a normal response.
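
For illustration, a non-panicking alternative would degrade to the old raw-ID behavior for unmigrated callers instead of asserting unreachability (a sketch, not the PR's code):

use serde_json::Value;

// Fallback for the MLX arm: report the raw token ID (the pre-PR behavior)
// rather than unreachable!(), so callers not yet migrated to
// matched_stop_json_with_context degrade gracefully instead of panicking.
fn mlx_matched_stop_json_fallback(matched_stop_token_id: Option<u32>) -> Option<Value> {
    matched_stop_token_id.map(|id| Value::Number(id.into()))
}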

Author

Similar to the unreachable!() usage here for the MLX multi-modal feature. Every occurrence where a matched stop is needed when building the JSON response for MLX can only be resolved by matched_stop_json_with_context(), where all other backends (vLLM, SGLang, etc.) are no-ops, so code execution shouldn't reach here. The Self::Mlx(_) arm is kept because Rust requires the match to be exhaustive.

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
model_gateway/src/routers/grpc/common/stages/helpers.rs (1)

283-305: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Optimize by checking MLX variant before tokenization.

The current order (check stop, tokenize, then check MLX variant) can waste CPU tokenizing stops for non-MLX backends. Checking the MLX variant first avoids unnecessary tokenization:

♻️ Proposed reordering
 pub(crate) fn apply_mlx_stop_sequences(
     proto_request: &mut ProtoGenerateRequest,
     stop: Option<&StringOrArray>,
     tokenizer: Option<&dyn Tokenizer>,
 ) -> Result<(), Response> {
+    let ProtoGenerateRequest::Mlx(req) = proto_request else {
+        return Ok(());
+    };
     let Some(stop) = stop else {
         return Ok(());
     };
-
-    if let ProtoGenerateRequest::Mlx(req) = proto_request {
-        let token_ids = resolve_mlx_stop_ids(stop, tokenizer)?;
-        let sampling = req.sampling_params.as_mut().ok_or_else(|| {
-            error::internal_error(
-                "mlx_sampling_params_missing",
-                "MLX GenerateRequest has no sampling_params; cannot inject stop IDs",
-            )
-        })?;
-        sampling.stop_token_ids.extend(token_ids);
-    }
-
-    Ok(())
+    let token_ids = resolve_mlx_stop_ids(stop, tokenizer)?;
+    let sampling = req.sampling_params.as_mut().ok_or_else(|| {
+        error::internal_error(
+            "mlx_sampling_params_missing",
+            "MLX GenerateRequest has no sampling_params; cannot inject stop IDs",
+        )
+    })?;
+    sampling.stop_token_ids.extend(token_ids);
+    Ok(())
 }


📥 Commits

Reviewing files that changed from the base of the PR and between b4ce20d and c7e1728.

📒 Files selected for processing (11)
  • crates/grpc_client/src/mlx_engine.rs
  • crates/protocols/src/completion.rs
  • crates/tokenizer/src/mock.rs
  • model_gateway/src/routers/grpc/common/stages/helpers.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs
  • model_gateway/src/routers/grpc/utils/chat_utils.rs
  • model_gateway/src/routers/grpc/utils/mod.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fac846fbd6


Comment on lines 2456 to 2460
choices: vec![CompletionStreamChoice {
    text: String::new(),
    index,
    logprobs: None,
    finish_reason: Some("stop".to_string()),
    ..Default::default()
}],

P2: Include matched_stop in chunk-stopped completion streams

When the local stop decoder fires in the Chunk branch (stopped == true), the emitted terminal CompletionStreamChoice is built with ..Default::default() and never sets matched_stop. Because the index is then added to stopped_indices, the later Complete message is skipped, so the new matched_stop field is never populated for these streams. This affects streamed completions that terminate on gateway-detected stop sequences (for example user-provided stop), and clients lose the stop match metadata even though termination reason is stop.
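
A Rust sketch of the suggested fix, with the choice struct abbreviated; the matched value itself would come from the same context-aware resolution already used on the Complete path:

use serde_json::Value;

// Abbreviated choice struct; the real one lives in crates/protocols.
#[derive(Default)]
struct CompletionStreamChoice {
    text: String,
    index: u32,
    finish_reason: Option<String>,
    matched_stop: Option<Value>, // new field from this PR
}

// When the local stop decoder fires in the Chunk branch, build the terminal
// choice with matched_stop populated instead of leaving it at its Default of None.
fn terminal_chunk_choice(index: u32, matched: Option<Value>) -> CompletionStreamChoice {
    CompletionStreamChoice {
        index,
        finish_reason: Some("stop".to_string()),
        matched_stop: matched,
        ..Default::default()
    }
}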

Author

@zach-li-sudo zach-li-sudo May 12, 2026

Nice catch! I've also tested the vLLM backend:

curl http://localhost:3000/v1/completions -s -N \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"cmpl_019e1a56-4bfe-7db2-b42d-5af3e093ac4f","object":"text_completion","created":1778558323,"choices":[{"text":"5","index":0,"finish_reason":null}],"model":"Qwen/Qwen2.5-1.5B-Instruct","system_fingerprint":"default"}

data: {"id":"cmpl_019e1a56-4bfe-7db2-b42d-5af3e093ac4f","object":"text_completion","created":1778558323,"choices":[{"text":"\n","index":0,"finish_reason":null}],"model":"Qwen/Qwen2.5-1.5B-Instruct","system_fingerprint":"default"}

data: {"id":"cmpl_019e1a56-4bfe-7db2-b42d-5af3e093ac4f","object":"text_completion","created":1778558323,"choices":[{"text":"","index":0,"finish_reason":"stop"}],"model":"Qwen/Qwen2.5-1.5B-Instruct","system_fingerprint":"default"}

data: [DONE]

No matched_stop: "6" appeared in the final response chunk, so this impacts both the vLLM and MLX completion paths. I'll create a separate PR to add it.

@zach-li-sudo
Author

Manual testing results: stop sequence with MLX backend

1. SMG with MLX backend setup

  1. Build the gateway: cargo build
  2. Install the gRPC servicer (first time only):
     pip install -e grpc_servicer/
  3. Start the MLX backend:
     source .venv/bin/activate && python -m smg_grpc_servicer.mlx.server \
       --model mlx-community/Qwen3-4B-Instruct-2507-4bit --port 50051
  4. Start the SMG gateway server:
     ./target/debug/smg --worker-urls grpc://localhost:50051 --port 3000

2. Testing scenarios

This PR supports string stop arrays in the regular paths: chat completions and completions, both non-streaming and streaming.

Key differences from vLLM

| Feature | vLLM | MLX |
| --- | --- | --- |
| Single-token stop strings ("stop": ["6"]) | ✅ supported | ✅ tokenized to a token ID |
| Multi-token stop strings ("stop": ["hello world"]) | ✅ supported | ❌ returns HTTP 400 |
| stop_token_ids | ✅ supported | ✅ supported |

The gateway converts single-token stop strings to token IDs for MLX requests (the MLX proto has
no native string stop field). Multi-token stop strings are rejected with HTTP 400
(unsupported_stop_string), since MLX can only stop on a single token at a time.

Token ID reference

Token IDs used in the tests below assume the Qwen3 tokenizer, which shares vocabulary with Qwen2.5:

  • Token 20 → "5"
  • Token 21 → "6"

Verify with the tokenizer if results are unexpected:

from mlx_lm import load
_, tokenizer = load("mlx-community/Qwen3-4B-Instruct-2507-4bit")
print(tokenizer.encode("5 6"))   # check IDs for context-free "5" and "6"

3. Results

Test matrix: 4 paths × 5 stop modes = 20 cases.

  • Single-token stop string cases (3.x.1): PASS = 200, matched_stop = the stop string.
  • Multi-token stop string cases (3.x.2): PASS = 400 with an appropriate error message.
  • stop_token_ids cases (3.x.3): PASS = 200, matched_stop = the matched token ID integer.
  • Combined single-token stop string + stop_token_ids (3.x.4): PASS = 200, matched_stop = integer or string depending on which stop fires first.
  • Combined multi-token stop string + stop_token_ids (3.x.5): PASS = 400; the multi-token string is rejected regardless of stop_token_ids.

| # | Endpoint | stream | Stop mode | Expected | Result |
| --- | --- | --- | --- | --- | --- |
| 3.1.1 | /v1/chat/completions | false | stop string (single token) | pass | 200 ✅ |
| 3.1.2 | /v1/chat/completions | false | stop string (multi-token) | 400 | 400 ✅ |
| 3.1.3 | /v1/chat/completions | false | stop_token_ids | pass | 200 ✅ |
| 3.1.4 | /v1/chat/completions | false | combined: single-token stop + stop_token_ids | pass | 200 ✅ |
| 3.1.5 | /v1/chat/completions | false | combined: multi-token stop + stop_token_ids | 400 | 400 ✅ |
| 3.2.1 | /v1/chat/completions | true | stop string (single token) | pass | 200 ✅ |
| 3.2.2 | /v1/chat/completions | true | stop string (multi-token) | 400 | 400 ✅ |
| 3.2.3 | /v1/chat/completions | true | stop_token_ids | pass | 200 ✅ |
| 3.2.4 | /v1/chat/completions | true | combined: single-token stop + stop_token_ids | pass | 200 ✅ |
| 3.2.5 | /v1/chat/completions | true | combined: multi-token stop + stop_token_ids | 400 | 400 ✅ |
| 3.3.1 | /v1/completions | false | stop string (single token) | pass | 200 ✅ |
| 3.3.2 | /v1/completions | false | stop string (multi-token) | 400 | 400 ✅ |
| 3.3.3 | /v1/completions | false | stop_token_ids | pass | 200 ✅ |
| 3.3.4 | /v1/completions | false | combined: single-token stop + stop_token_ids | pass | 200 ✅ |
| 3.3.5 | /v1/completions | false | combined: multi-token stop + stop_token_ids | 400 | 400 ✅ |
| 3.4.1 | /v1/completions | true | stop string (single token) | pass | 200 ✅ |
| 3.4.2 | /v1/completions | true | stop string (multi-token) | 400 | 400 ✅ |
| 3.4.3 | /v1/completions | true | stop_token_ids | pass | 200 ✅ |
| 3.4.4 | /v1/completions | true | combined: single-token stop + stop_token_ids | pass | 200 ✅ |
| 3.4.5 | /v1/completions | true | combined: multi-token stop + stop_token_ids | 400 | 400 ✅ |

Reading the streaming output below: Only the first content chunk and the final (stop) chunk are
shown. Intermediate content chunks are collapsed to # ... (content: "...") ....


3.1 Chat, non-stream

3.1.1 — stop string, single token ("stop": ["6"]) — expect HTTP 200

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "id": "chatcmpl-019e12f3-af39-7bc1-a46c-4173bebe0cbc",
    "object": "chat.completion",
    "created": 1778434420,
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "1  \n2  \n3  \n4  \n5  \n",
          "reasoning_content": null
        },
        "finish_reason": "stop",
        "matched_stop": "6"
      }
    ],
    "usage": {
      "prompt_tokens": 21,
      "completion_tokens": 11,
      "total_tokens": 32
    },
    "system_fingerprint": "default"
  },
  "status": "HTTP 200"
}

3.1.2 — stop string, multi-token ("stop": ["hello world"]) — expect HTTP 400

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Say: hi there and hello world!"}],
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "error": {
      "type": "Bad Request",
      "code": "unsupported_stop_string",
      "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
      "param": null
    }
  },
  "status": "HTTP 400"
}

3.1.3 — stop_token_ids ([20, 21] = tokens for "5" and "6")

curl http://localhost:3000/v1/chat/completions -s \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq
{
  "id": "chatcmpl-019e12f3-b06a-7e30-9026-b277ef4cd022",
  "object": "chat.completion",
  "created": 1778434420,
  "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1  \n2  \n3  \n4  \n",
        "reasoning_content": null
      },
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "completion_tokens": 9,
    "total_tokens": 30
  },
  "system_fingerprint": "default"
}

3.1.4 — combined: single-token stop string + stop_token_ids ("stop": ["6"], "stop_token_ids": [20]) — expect HTTP 200

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "id": "chatcmpl-019e13c9-0ea1-77c0-b430-bf53fbcdede2",
    "object": "chat.completion",
    "created": 1778448404,
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "1  \n2  \n3  \n4  \n",
          "reasoning_content": null
        },
        "finish_reason": "stop",
        "matched_stop": 20
      }
    ],
    "usage": {
      "prompt_tokens": 21,
      "completion_tokens": 9,
      "total_tokens": 30
    },
    "system_fingerprint": "default"
  },
  "status": "HTTP 200"
}

3.1.5 — combined: multi-token stop string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20, 21]) — expect HTTP 400

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "error": {
      "type": "Bad Request",
      "code": "unsupported_stop_string",
      "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
      "param": null
    }
  },
  "status": "HTTP 400"
}

3.2 Chat, stream

3.2.1 — stop string, single token ("stop": ["6"]) — expect HTTP 200

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"chatcmpl-019e12f4-6c1f-7651-a730-137fffaab34b","object":"chat.completion.chunk","created":1778434468,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"role":"assistant","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-019e12f4-6c1f-7651-a730-137fffaab34b","object":"chat.completion.chunk","created":1778434468,...,"choices":[{"index":0,"delta":{"role":"assistant","content":"1  ","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

# ... (content: "1  \n2  \n3  \n4  \n5  \n") ...

data: {"id":"chatcmpl-019e12f4-6c1f-7651-a730-137fffaab34b","object":"chat.completion.chunk","created":1778434468,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"reasoning_content":null},"logprobs":null,"finish_reason":"stop","matched_stop":"6"}]}

data: [DONE]

HTTP 200

3.2.2 — stop string, multi-token ("stop": ["hello world"]) — expect HTTP 400

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 100
  }'
{"error":{"type":"Bad Request","code":"unsupported_stop_string","message":"stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings","param":null}}
HTTP 400

3.2.3 — stop_token_ids ([20, 21] = tokens for "5" and "6")

curl http://localhost:3000/v1/chat/completions -s -N \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"chatcmpl-019e12f4-6d47-7d00-beca-1339545bf2bd","object":"chat.completion.chunk","created":1778434469,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"role":"assistant","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-019e12f4-6d47-7d00-beca-1339545bf2bd","object":"chat.completion.chunk","created":1778434469,...,"choices":[{"index":0,"delta":{"role":"assistant","content":"1","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

# ... (content: "1  \n2  \n3  \n4  \n") ...

data: {"id":"chatcmpl-019e12f4-6d47-7d00-beca-1339545bf2bd","object":"chat.completion.chunk","created":1778434469,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"reasoning_content":null},"logprobs":null,"finish_reason":"stop","matched_stop":20}]}

data: [DONE]

3.2.4 — combined: single-token stop string + stop_token_ids ("stop": ["6"], "stop_token_ids": [20]) — expect HTTP 200

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"chatcmpl-019e13c9-3610-7a81-928c-7aa1e5f75f9d","object":"chat.completion.chunk","created":1778448414,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"role":"assistant","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-019e13c9-3610-7a81-928c-7aa1e5f75f9d","object":"chat.completion.chunk","created":1778448414,...,"choices":[{"index":0,"delta":{"role":"assistant","content":"1  ","reasoning_content":null},"logprobs":null,"finish_reason":null}]}

# ... (content: "1  \n2  \n3  \n4  \n") ...

data: {"id":"chatcmpl-019e13c9-3610-7a81-928c-7aa1e5f75f9d","object":"chat.completion.chunk","created":1778448414,"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default","choices":[{"index":0,"delta":{"reasoning_content":null},"logprobs":null,"finish_reason":"stop","matched_stop":20}]}

data: [DONE]

HTTP 200

3.2.5 — combined: multi-token stop string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20, 21]) — expect HTTP 400

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }'
{"error":{"type":"Bad Request","code":"unsupported_stop_string","message":"stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings","param":null}}
HTTP 400

3.3 Completion, non-stream

3.3.1 — stop string, single token ("stop": ["6"]) — expect HTTP 200

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "id": "cmpl_019e12f4-9eda-71c2-bf57-ab5fca3036f6",
    "object": "text_completion",
    "created": 1778434481,
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "choices": [
      {
        "text": "5\n",
        "index": 0,
        "finish_reason": "stop",
        "matched_stop": "6"
      }
    ],
    "usage": {
      "prompt_tokens": 22,
      "completion_tokens": 3,
      "total_tokens": 25
    },
    "system_fingerprint": "default"
  },
  "status": "HTTP 200"
}

3.3.2 — stop string, multi-token ("stop": ["hello world"]) — expect HTTP 400

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "error": {
      "type": "Bad Request",
      "code": "unsupported_stop_string",
      "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
      "param": null
    }
  },
  "status": "HTTP 400"
}

3.3.3 — stop_token_ids ([20, 21] = tokens for "5" and "6")

curl http://localhost:3000/v1/completions -s \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq
{
  "id": "cmpl_019e12f4-9faf-7582-bcc0-bb7849cf6585",
  "object": "text_completion",
  "created": 1778434482,
  "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 1,
    "total_tokens": 23
  },
  "system_fingerprint": "default"
}

The text is empty because the stop token was the very next token generated after the prompt, so generation stopped immediately.

3.3.4 — combined: single-token stop string + stop_token_ids ("stop": ["6"], "stop_token_ids": [20]) — expect HTTP 200

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "id": "cmpl_019e13c9-6957-7f10-8038-53f7c215dd87",
    "object": "text_completion",
    "created": 1778448427,
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "choices": [
      {
        "text": "",
        "index": 0,
        "finish_reason": "stop",
        "matched_stop": 20
      }
    ],
    "usage": {
      "prompt_tokens": 22,
      "completion_tokens": 1,
      "total_tokens": 23
    },
    "system_fingerprint": "default"
  },
  "status": "HTTP 200"
}

3.3.5 — combined: multi-token stop string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20, 21]) — expect HTTP 400

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'
{
  "response": {
    "error": {
      "type": "Bad Request",
      "code": "unsupported_stop_string",
      "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
      "param": null
    }
  },
  "status": "HTTP 400"
}

3.4 Completion, stream

3.4.1 — stop string, single token ("stop": ["6"]) — expect HTTP 200

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"cmpl_019e12f4-bb3b-7b80-b016-13a6be12605c","object":"text_completion","created":1778434489,"choices":[{"text":"5","index":0,"finish_reason":null}],"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default"}

# ... (content: "5\n") ...

data: {"id":"cmpl_019e12f4-bb3b-7b80-b016-13a6be12605c","object":"text_completion","created":1778434489,"choices":[{"text":"","index":0,"finish_reason":"stop"}],"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default"}

data: [DONE]

HTTP 200

3.4.2 — stop string, multi-token ("stop": ["hello world"]) — expect HTTP 400

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 100
  }'
{"error":{"type":"Bad Request","code":"unsupported_stop_string","message":"stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings","param":null}}
HTTP 400

3.4.3 — stop_token_ids ([20, 21] = tokens for "5" and "6")

curl http://localhost:3000/v1/completions -s -N \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"cmpl_019e12f4-bc18-7b71-b9a6-fb673df80244","object":"text_completion","created":1778434489,"choices":[{"text":"","index":0,"finish_reason":"stop"}],"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default"}

data: [DONE]

3.4.4 — combined: single-token stop string + stop_token_ids ("stop": ["6"], "stop_token_ids": [20]) — expect HTTP 200

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100
  }'
data: {"id":"cmpl_019e13c9-8d8d-75a0-bba2-d7da2fc33066","object":"text_completion","created":1778448436,"choices":[{"text":"","index":0,"finish_reason":"stop"}],"model":"mlx-community/Qwen3-4B-Instruct-2507-4bit","system_fingerprint":"default"}

data: [DONE]

HTTP 200

3.4.5 — combined: multi-token stop string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20, 21]) — expect HTTP 400

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }'
{"error":{"type":"Bad Request","code":"unsupported_stop_string","message":"stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings","param":null}}
HTTP 400

@zach-li-sudo
Author

Thanks for the contribution, Zhuo! The feature itself is small and the tests are nice — but the diff feels heavier than it needs to be because backend-type branching is leaking into orchestration code. A few things to consider before merge:

1. is_mlx() branching is repeated 4× — push it into the wrapper

The same if complete.is_mlx() { resolve_mlx_matched_stop_json(...) } else { complete.matched_stop_json() } block appears in:

  • model_gateway/src/routers/grpc/regular/processor.rs:187 (chat)
  • model_gateway/src/routers/grpc/regular/processor.rs:834 (completion)
  • model_gateway/src/routers/grpc/regular/streaming.rs:495 (chat stream)
  • model_gateway/src/routers/grpc/regular/streaming.rs:2589 (completion stream)

proto_wrapper.rs already encapsulates backend differences (matched_stop_json(), is_mlx_streaming_finished(), etc.). Could you add a matched_stop_json_with_context(stop, stop_token_ids, tokenizer) method on ProtoGenerateComplete that internally dispatches on the variant? Each caller collapses to one line, and mlx_matched_stop_token_id() no longer needs to be pub. Composes better when the next backend hits the same quirk.
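
Roughly this shape, as a minimal sketch with stand-in types rather than the actual proto_wrapper.rs definitions:

use serde_json::Value;

// Everything below is a stand-in for illustration; the real definitions
// live in proto_wrapper.rs and the tokenizer crate.
trait Tokenizer {}

struct MlxComplete;
struct StdComplete;

enum ProtoGenerateComplete {
    Mlx(MlxComplete),
    Std(StdComplete),
}

// Hypothetical resolver mirroring chat_utils::resolve_mlx_matched_stop_json;
// body elided for the sketch.
fn resolve_mlx_matched_stop_json(
    _complete: &MlxComplete,
    _stop: &[String],
    _stop_token_ids: &[u32],
    _tokenizer: &dyn Tokenizer,
) -> Option<Value> {
    None
}

impl StdComplete {
    // The real impl returns the backend-reported matched_stop verbatim.
    fn matched_stop_json(&self) -> Option<Value> {
        None
    }
}

impl ProtoGenerateComplete {
    // Each of the four call sites collapses to this single call; only the
    // wrapper knows which backend needs request context to resolve stops.
    fn matched_stop_json_with_context(
        &self,
        stop: &[String],
        stop_token_ids: &[u32],
        tokenizer: &dyn Tokenizer,
    ) -> Option<Value> {
        match self {
            Self::Mlx(c) => resolve_mlx_matched_stop_json(c, stop, stop_token_ids, tokenizer),
            Self::Std(c) => c.matched_stop_json(),
        }
    }
}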

2. apply_mlx_stop_sequences is double-gated

model_gateway/src/routers/grpc/common/stages/helpers.rs:82 pattern-matches ProtoGenerateRequest::Mlx(req), but both callers (chat/request_building.rs:111, completion/request_building.rs:90) already gate on builder_client.is_mlx(). Pick one — either the helper is unconditional and no-ops on non-MLX, or callers gate and the helper assumes MLX.
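
For the record, the first option would look something like this (request types are stand-ins, not the actual crate definitions):

// Stand-in request types for illustration only.
struct MlxRequest {
    stop_token_ids: Vec<u32>,
}
struct StdRequest;

enum ProtoGenerateRequest {
    Mlx(MlxRequest),
    Std(StdRequest),
}

// Unconditional from the caller's side: the helper itself decides whether
// the backend accepts stop token ids, so request_building.rs never gates.
fn apply_mlx_stop_sequences(req: &mut ProtoGenerateRequest, resolved_ids: &[u32]) {
    if let ProtoGenerateRequest::Mlx(mlx) = req {
        mlx.stop_token_ids.extend_from_slice(resolved_ids);
    }
    // Non-MLX variants: deliberate no-op.
}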

3. Silent encode-error swallow is a footgun

chat_utils.rs:443: stop_strings_to_token_ids correctly returns Err (→ 400) for multi-token strings, but it silently warn!s and drops the stop when the tokenizer errors or returns zero tokens. The new MockTokenizer::failing() test at line 904 locks this in:

// "Tokenizer errors are silently skipped; the function returns Ok(empty)."
assert!(result.unwrap().is_empty());

Both are the same class of "we cannot honor your stop condition" failure — caller sets stop: ["foo"], encoder errors, request proceeds to max_tokens, user blames SMG. Recommend surfacing both as 400.
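
Concretely, something in this direction (the error type and Tokenizer trait here are illustrative stand-ins, not the real crate API):

// Illustrative stand-ins; the real Tokenizer trait and error type differ.
trait Tokenizer {
    fn encode(&self, s: &str) -> Result<Vec<u32>, String>;
}

struct ApiError(String);
impl ApiError {
    fn bad_request(msg: String) -> Self {
        ApiError(msg)
    }
}

// Every failure to honor a stop condition surfaces as a 400, including
// tokenizer errors and zero-token encodings, instead of warn-and-drop.
fn stop_strings_to_token_ids(
    stops: &[String],
    tokenizer: &dyn Tokenizer,
) -> Result<Vec<u32>, ApiError> {
    let mut ids = Vec::with_capacity(stops.len());
    for s in stops {
        let toks = tokenizer.encode(s).map_err(|e| {
            ApiError::bad_request(format!("failed to tokenize stop string {s:?}: {e}"))
        })?;
        match toks.as_slice() {
            [id] => ids.push(*id),
            _ => {
                return Err(ApiError::bad_request(format!(
                    "stop string {s:?} encodes to {} tokens; MLX backend only supports single-token stop strings",
                    toks.len()
                )))
            }
        }
    }
    Ok(ids)
}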

4. Is test-case = "3.3.1" worth a new workspace dep?

It's introduced as a brand-new workspace dependency just for ~13 parameterized cases across 2 functions in chat_utils.rs. Not used anywhere else in the workspace today. A plain #[test] with a slice of (input, expected, name) tuples and a loop gets the same coverage without the dep — and keeps the workspace lean. Up to you, but worth weighing.
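
For comparison, the dependency-free pattern is just the following (encode_stop is a fake stand-in for the real tokenizer path, present only so the sketch compiles on its own):

// Fake stand-in for the real single-token encoding check.
fn encode_stop(stops: &[&str]) -> Result<Vec<u32>, String> {
    stops
        .iter()
        .map(|s| {
            if s.contains(' ') {
                Err(format!("stop string {s:?} encodes to multiple tokens"))
            } else {
                Ok(s.len() as u32)
            }
        })
        .collect()
}

#[test]
fn stop_string_cases() {
    // (name, input, should_pass) tuples; add a case by appending a row.
    let cases: &[(&str, &[&str], bool)] = &[
        ("single-token stop passes", &["6"], true),
        ("multi-token stop rejected", &["hello world"], false),
        ("empty stop list passes", &[], true),
    ];
    for (name, input, should_pass) in cases {
        assert_eq!(encode_stop(input).is_ok(), *should_pass, "case: {name}");
    }
}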

Smaller items

  • streaming.rs got 6 new matched_stop: None struct-literal additions (lines 2422, 2450, 2470, 2505, 2529, 2552). Not blocking, but a ..Default::default() or a constructor on CompletionStreamChoice would prevent this sprawl next time.
  • Could you run the curl from the PR description on your local Mac (SMG + MLX backend) and paste the actual response in the test plan? Stream + non-stream, single stop string + multi-token rejection. That's the strongest signal we'll get for a Mac-specific path.
  • mlx_engine.rs:245 comment update is good; retaining reject_stop_strings for messages/generate paths is correct.

After (1)–(3), the MLX surface area shrinks a lot and the "MLX branches everywhere" feel goes away.

Hi Keyang, thanks for your comments! The issues you've mentioned are resolved as follows:

  1. is_mlx() branching is repeated 4× — push it into the wrapper: fixed in commit #6f390899 by moving the MLX matched_stop resolution logic into the proto wrapper.
  2. apply_mlx_stop_sequences is double-gated: fixed in commit #15c12271; the helper is now unconditional and no-ops on non-MLX requests.
  3. Silent encode-error swallow is a footgun: fixed in commit #271cec93; zero-token encodings and tokenizer failures now return a 400 error.
  4. Is test-case = "3.3.1" worth a new workspace dep: the dev dependency is removed; the new unit tests are refactored into a plain #[test] with a slice of (input, expected, name) tuples and a loop.

Other items:

  • streaming.rs got 6 new matched_stop: None struct-literal additions (lines 2422, 2450, 2470, 2505, 2529, 2552). Not blocking, but a ..Default::default() or a constructor on CompletionStreamChoice would prevent this sprawl next time.

Solution: used ..Default::default() when constructing CompletionStreamChoice. See #94eef5a2.
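
For reference, the construction now looks roughly like this (the field set shown is trimmed for illustration):

use serde_json::Value;

// Deriving Default is what enables the spread syntax; fields trimmed here.
#[derive(Default)]
struct CompletionStreamChoice {
    text: String,
    index: u32,
    finish_reason: Option<String>,
    matched_stop: Option<Value>,
}

fn content_chunk_choice(text: String) -> CompletionStreamChoice {
    CompletionStreamChoice {
        text,
        index: 0,
        // finish_reason and matched_stop default to None, so the next
        // optional field added to the struct won't touch this call site.
        ..Default::default()
    }
}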

  • Could you run the curl from the PR description on your local Mac (SMG + MLX backend) and paste the actual response in the test plan? Stream + non-stream, single stop string + multi-token rejection. That's the strongest signal we'll get for a Mac-specific path.

Yes, the full test plan was executed on my local Mac with real responses. All test cases are summarized in the test matrix above, covering {stream, non-stream} x {chat, completion} x {stop_token_ids, stop strings, combined, ...}. Please check the linked comment.

I also see some other gaps in MLX backend support that aren't directly related to this PR, such as the harmony paths and the messages/generate endpoints. I'll investigate and follow up in a separate discussion.
