
ci(tokenspeed): add CI install + GPU e2e coverage (Part 3/3)#1465

Open
key4ng wants to merge 24 commits into main from feat/grpc-tokenspeed-ci-tests

Conversation

Collaborator

@key4ng key4ng commented May 8, 2026

Description

Problem

With the Rust router (#1351) and Python servicer (#1464) in place, TokenSpeed needs to be exercised in CI: source-installed at a pinned ref, launched as a worker through the e2e infrastructure, and covered by the function-calling and tool_choice suites.

Solution

This PR adds a composite GitHub Action plus an install script that builds and caches the TokenSpeed kernel + scheduler, a tokenspeed lane in the GPU e2e job, and e2e_test/infra updates that teach the suite about a tokenspeed runtime alongside sglang/vllm/trtllm.

3-PR Stack

This is part 3 of 3 splitting the original #1351:

  • PR1: Rust gRPC + protocol
  • PR2: Python servicer + unit tests
  • PR3 (this): CI workflows + e2e tests

Stacked on PR2. The e2e wiring exercises both the Rust router (PR1) and the Python servicer (PR2) against a live TokenSpeed worker.

Changes

  • .github/actions/setup-tokenspeed/action.yml — composite action wrapping scripts/ci_install_tokenspeed.sh
  • scripts/ci_install_tokenspeed.sh — source install of TokenSpeed kernel + scheduler at a pinned git ref, with a wheel cache so repeat runs skip the ~20 min compile
  • .github/workflows/e2e-gpu-job.yml — add a tokenspeed engine lane, gated on secret access so forked PRs skip cleanly
  • .github/workflows/pr-test-rust.yml — install proto deps so Rust-only changes touching crates/grpc_client still cover the new tokenspeed proto
  • e2e_test/infra/{constants,worker,model_specs}.py and e2e_test/fixtures/hooks.py — teach the suite about a tokenspeed runtime
  • Suite-wide @pytest.mark.engine(...) markers expand to include tokenspeed (see the sketch after this list)
  • Function-calling and tool_choice suites use Qwen/Qwen3-4B-Instruct-2507 (the Qwen3 family is what TokenSpeed's model registry currently supports)
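
A minimal sketch of how the expanded marker and the updated hooks.py skip handling fit together (illustrative only: the test class, client fixture, and --runtime option name are assumptions; only the marker names and the model id come from this PR):

import pytest

# Suite-wide engine marker now lists tokenspeed alongside the existing runtimes.
@pytest.mark.engine("sglang", "vllm", "trtllm", "tokenspeed")
@pytest.mark.gpu(1)
@pytest.mark.model("Qwen/Qwen3-4B-Instruct-2507")
class TestToolChoiceExample:
    def test_required_tool_call(self, client):  # "client" fixture is hypothetical
        ...


# hooks.py-style setup hook: iterate *all* skip_for_runtime marks instead of only
# the closest one, so markers stacked on a class and on a test are both honored.
def pytest_runtest_setup(item):
    runtime = item.config.getoption("--runtime", default="sglang")  # assumed option name
    for mark in item.iter_markers("skip_for_runtime"):
        if runtime in mark.args:
            pytest.skip(f"skipped on runtime {runtime!r}")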

Test Plan

  • YAML parse for all 3 workflow / action files (see the sketch after this list)
  • bash -n scripts/ci_install_tokenspeed.sh passes
  • pre-commit run --files <changed> passes (rustfmt / clippy / ruff / ruff format / codespell / yaml / toml all green)
  • GPU e2e runs once this PR's CI is enabled (function-calling matrix on Qwen/Qwen3-4B-Instruct-2507 against a live tokenspeed worker, plus the existing reference engines for parity)
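
For the YAML-parse item, a minimal local check along these lines works (assuming PyYAML is installed; the three paths are the workflow/action files from the Changes list above):

import yaml

for path in (
    ".github/actions/setup-tokenspeed/action.yml",
    ".github/workflows/e2e-gpu-job.yml",
    ".github/workflows/pr-test-rust.yml",
):
    with open(path) as f:
        yaml.safe_load(f)  # raises yaml.YAMLError on malformed YAML
    print(f"parsed OK: {path}")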

Review Threads from #1351

Addressed in this PR:

  • e2e_test/chat_completions/test_function_calling.py — verbose Qwen3-4B docstring on TestToolChoiceQwen removed; that context lives in this PR description instead.

Checklist
  • cargo +nightly fmt passes (no Rust changes)
  • cargo clippy --all-targets --all-features -- -D warnings passes (no Rust changes)
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • TokenSpeed added as a supported inference backend with gRPC-based generation, streaming, health/abort, model/server loads and metadata, sampling-parameter mapping, and integration into gateway and client paths.
    • New local Qwen model entry (Qwen3-4B-Instruct-2507) added to model specs.
  • Testing

    • E2E and unit test matrices expanded to exercise TokenSpeed; new unit tests cover TokenSpeed servicer and health behavior.
  • Chores

    • CI/workflows and CI install scripts updated to provision and run TokenSpeed in CI.

Review Change Stack

Adds TokenSpeed as a first-class GPU backend on the Rust side: a self-
contained `tokenspeed.grpc.scheduler.TokenSpeedScheduler` proto, the
`TokenSpeedSchedulerClient` wrapper that translates SGLang-shaped
request/response types at the boundary, and the model_gateway router
plumbing (client dispatch, runtime detection, harmony/regular request
builders, multimodal and embedding stages).

This is part 1 of 3 splitting #1351:
  - PR1 (this): Rust gRPC + protocol
  - PR2: Python servicer (grpc_servicer)
  - PR3: CI workflows + e2e tests

PR1 is functionally inert without PR2 — the router can dial a TokenSpeed
worker, but the worker process lives in the Python servicer landed by PR2.

Addresses CatherineSue's review on #1351:
  - shorten the TokenSpeed RuntimeType doc in protocols/worker.rs
  - drop the verbose TokenSpeed note in grpc_client/build.rs
  - restore the concise module doc in detect_backend.rs, just adding
    tokenspeed to the existing health-check ordering

Signed-off-by: key4ng <[email protected]>

coderabbitai Bot commented May 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR integrates TokenSpeed as a new gRPC inference backend across protocol definitions, Rust and Python clients/servicers, model gateway routing, E2E test infrastructure, and CI workflows. It adds a full TokenSpeed proto, shared sampling-params builders, a Rust TokenSpeed client, a Python TokenSpeed servicer/server/health, model-gateway plumbing and proto-wrapper variants, CI install/workflow steps, and broad E2E test coverage.

Changes

TokenSpeed Backend Support

  • Protocol contract and shared sampling parameter builders
    Files: crates/grpc_client/proto/tokenspeed_scheduler.proto, crates/grpc_client/build.rs, crates/grpc_client/python/smg_grpc_proto/__init__.py, crates/grpc_client/src/lib.rs, crates/grpc_client/src/sampling_params.rs
    Summary: TokenSpeed protobuf schema and build integration; Python re-exports; new Rust sampling_params converts Chat/Responses/Messages/Completion/plain generate inputs into gRPC SamplingParams with constraint normalization.
  • Refactor SGLang client to use shared sampling builders
    Files: crates/grpc_client/src/sglang_scheduler.rs
    Summary: SGLang client delegates sampling-params construction to crate::sampling_params, removing in-file helpers and updating unit tests.
  • Rust TokenSpeed gRPC client implementation
    Files: crates/grpc_client/src/tokenspeed_scheduler.rs
    Summary: Implements TokenSpeedSchedulerClient, AbortOnDropStream, streaming generate (returns abort-on-drop stream), unary RPCs (health, abort, model/server/loads), request builders translating router request shapes to TokenSpeed proto, and translation helpers.
  • Add TokenSpeed to RuntimeType enum
    Files: crates/protocols/src/worker.rs
    Summary: Adds RuntimeType::TokenSpeed with Display and FromStr parsing for "tokenspeed".
  • Python gRPC servicer implementation
    Files: grpc_servicer/smg_grpc_servicer/tokenspeed/
    Summary: Adds TokenSpeed servicer package: TokenSpeedSchedulerServicer (Generate/HealthCheck/Abort/GetModelInfo/GetServerInfo/GetLoads), TokenSpeedHealthServicer with stuck-scheduler checks, scheduler_launcher, server with warmup and graceful shutdown, and CLI entrypoint.
  • Python servicer unit tests
    Files: grpc_servicer/tests/conftest.py, grpc_servicer/tests/test_tokenspeed_health_servicer.py, grpc_servicer/tests/test_tokenspeed_servicer.py
    Summary: Async unit tests covering health transitions, stuck-scheduler detection, sampling-params conversion, Generate streaming/non-streaming behavior, abort/cancel semantics, GetModelInfo/GetServerInfo/GetLoads, and output-logprobs conversion.
  • Model gateway GrpcClient wrapper for TokenSpeed
    Files: model_gateway/src/routers/grpc/client.rs
    Summary: Adds GrpcClient::TokenSpeed variant, typed accessors, connect dispatch, TokenSpeed branches for health/get_model_info/get_loads/generate, and request-builder TokenSpeed paths (with multimodal rejection where applicable).
  • Proto wrapper unified types for TokenSpeed
    Files: model_gateway/src/routers/grpc/proto_wrapper.rs
    Summary: Extend unified enums (ProtoGenerateRequest/Response/StreamChunk/Complete/ProtoStream) with TokenSpeed variants and implement field accessors/conversions (token ids, finish reason, output logprobs semantics, matched-stop decoding).
  • Model gateway stages for TokenSpeed routing
    Files: model_gateway/src/routers/grpc/common/stages/helpers.rs, model_gateway/src/routers/grpc/common/stages/request_execution.rs, model_gateway/src/routers/grpc/harmony/stages/request_building.rs, model_gateway/src/routers/grpc/multimodal.rs, model_gateway/src/routers/grpc/regular/stages/embedding/request_building.rs
    Summary: Add TokenSpeed handling to sampling-defaults application, PD-disaggregated request execution, Harmony stop-token injection, embedding rejection, and multimodal assembly rejection.
  • Backend runtime detection for TokenSpeed
    Files: model_gateway/src/workflow/steps/local/detect_backend.rs, model_gateway/src/workflow/steps/util.rs
    Summary: Include TokenSpeed in gRPC health-check probes for local detection and parallel reachability checks.
  • E2E test infrastructure updates
    Files: e2e_test/infra/constants.py, e2e_test/infra/model_specs.py, e2e_test/infra/worker.py, e2e_test/fixtures/hooks.py
    Summary: Add Runtime.TOKENSPEED and is_tokenspeed(), insert Qwen3-4B-Instruct-2507 model spec, implement _build_tokenspeed_grpc_cmd() with TokenSpeed-specific flags, and update skip_for_runtime marker handling to iterate all marks (see the sketch after this list).
  • E2E test coverage expansion
    Files: e2e_test/chat_completions/*, e2e_test/completions/*, e2e_test/responses/*, e2e_test/router/*
    Summary: Add tokenspeed engine marker to numerous test classes and adjust parametrizations (e.g., conditionally skip logprobs=5 under TokenSpeed).
  • CI workflow and setup action
    Files: .github/actions/setup-tokenspeed/action.yml, .github/workflows/e2e-gpu-job.yml, .github/workflows/pr-test-rust.yml, grpc_servicer/pyproject.toml, scripts/ci_install_tokenspeed.sh
    Summary: New composite GitHub Action to prepare Python venv and install TokenSpeed from source; workflows updated to conditionally use TokenSpeed setup step; PR-test path-filter and GPU matrices extended; pyproject test extras and pytest marker added; CI install script bootstraps TokenSpeed (wheel caching, build-isolation flags, verification imports).
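
A minimal sketch of the constants.py additions named in the E2E test infrastructure entry (the Runtime.TOKENSPEED member and is_tokenspeed() helper come from the walkthrough; the other enum members and the exact signature are assumptions):

from enum import Enum

class Runtime(str, Enum):
    SGLANG = "sglang"
    VLLM = "vllm"
    TRTLLM = "trtllm"
    TOKENSPEED = "tokenspeed"

def is_tokenspeed(runtime) -> bool:
    # Accept either a Runtime member or its plain string value.
    value = runtime.value if isinstance(runtime, Runtime) else str(runtime)
    return value == Runtime.TOKENSPEED.value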

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • CatherineSue
  • slin1237
  • gongwei-130
  • XinyueZhang369

"🐰 I hopped through protos, clients, and test beds,
Built TokenSpeed homes where the gRPC thread treads.
From CI to gateway, with a carrot-shaped grin,
Streams and health checks now hum from within.
Hooray — a speedy backend for every small win!"


github-actions Bot added the ci (CI/CD configuration changes) and tests (Test changes) labels on May 8, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request integrates the TokenSpeed backend into the end-to-end testing suite. Key changes include the addition of a GitHub Action and a comprehensive installation script for TokenSpeed, infrastructure updates to support the new engine in the worker and constants, and the inclusion of TokenSpeed in numerous test suites. Additionally, the pytest_runtest_setup hook was improved to correctly handle multiple runtime skip markers. Feedback was provided to fix the caching logic in the installation script, as the current implementation fails to save the compiled kernel wheel, which would result in redundant 30-minute builds during CI.

Comment on lines +142 to +152
echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
uv pip install tokenspeed-kernel/python/ --no-build-isolation
# Cache the built wheel — uv stores wheels under its cache, copy out.
mkdir -p "$WHEEL_CACHE"
python3 -c "import tokenspeed_kernel, os, shutil, glob; \
d = os.path.dirname(tokenspeed_kernel.__file__); \
site = os.path.dirname(d); \
whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
print('kernel install dir:', whls)" || true
fi

medium: The caching logic for tokenspeed-kernel is non-functional. The else block builds the kernel from source but fails to save the resulting wheel to $WHEEL_CACHE. Additionally, the Python snippet on lines 147-151 only prints the installation path instead of copying the built artifacts. This means the expensive ~30-minute compilation will occur on every CI run. Using uv pip wheel to generate the wheel directly into the cache directory is a more robust approach and ensures consistency with the use of uv in CI scripts.

Suggested change

Current:
echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
uv pip install tokenspeed-kernel/python/ --no-build-isolation
# Cache the built wheel — uv stores wheels under its cache, copy out.
mkdir -p "$WHEEL_CACHE"
python3 -c "import tokenspeed_kernel, os, shutil, glob; \
d = os.path.dirname(tokenspeed_kernel.__file__); \
site = os.path.dirname(d); \
whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
print('kernel install dir:', whls)" || true
fi

Suggested:
echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
mkdir -p "$WHEEL_CACHE"
MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
uv pip wheel tokenspeed-kernel/python/ -w "$WHEEL_CACHE" --no-build-isolation
uv pip install "$WHEEL_CACHE"/tokenspeed_kernel-*.whl --no-build-isolation
fi
References
  1. When installing Python packages in CI scripts, use uv pip install for consistency and performance if uv is already in use for other installations.

Comment on lines +148 to +163
d = os.path.dirname(tokenspeed_kernel.__file__); \
site = os.path.dirname(d); \
whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
print('kernel install dir:', whls)" || true
fi

# Step 4: scheduler (scikit-build-core + nanobind + CMake).
echo "Building tokenspeed-scheduler..."
uv pip install tokenspeed-scheduler/

# Step 5: the Python runtime (pure-Python).
uv pip install "./python" --no-build-isolation

# ── Persist env to subsequent CI steps ─────────────────────────────────────
if [ -n "${GITHUB_ENV:-}" ]; then
echo "CUDA_HOME=$CUDA_HOME" >> "$GITHUB_ENV"

🟡 Nit: The wheel-cache write path is non-functional. After building from source, it creates $WHEEL_CACHE and runs a Python snippet that prints the dist-info directory but never copies a .whl file into $WHEEL_CACHE. The cache-read path (lines 142-146) looks for tokenspeed_kernel-*.whl in that directory, so it will never find anything — every run rebuilds from source (~30 min).

Either:

  • Actually locate and copy the built wheel into $WHEEL_CACHE, or
  • Remove the dead caching code and document that every run is a fresh build (the || true at the end already swallows failures, so this isn't a correctness bug, just a misleading comment + wasted mkdir).


cd "$TOKENSPEED_DIR"

# ── System dependencies (mirrors docker/Dockerfile) ─────────────────────────

🟡 Nit: This echo prints $TOKENSPEED_REPO which, after line 63, contains the embedded access token (https://x-access-token:<token>@github.com/...). GitHub Actions redacts known secret values in logs, so this is likely safe in practice, but it's still better hygiene to avoid echoing credential-bearing URLs. Consider printing the repo without the auth portion:

Suggested change
# ── System dependencies (mirrors docker/Dockerfile) ─────────────────────────
echo "Cloning TokenSpeed ${TOKENSPEED_REF} from ${TOKENSPEED_REPO%%@*}@..."

Comment on lines +1581 to +1583
@pytest.mark.engine("sglang", "vllm", "trtllm", "tokenspeed")
@pytest.mark.gpu(1)
@pytest.mark.model("Qwen/Qwen3-4B-Instruct-2507")

🟡 Nit: This changes TestMultiTurnToolCall from Qwen2.5-14B (2 GPUs) to Qwen3-4B (1 GPU) for all engines, not just TokenSpeed. The 14B model was presumably chosen for reliable multi-turn tool-calling behavior — a 4B model may be flakier at following the tool-call/tool-result conversation pattern.

If the intent is just to add TokenSpeed coverage (which needs a Qwen3 model), consider keeping the existing 14B config for sglang/vllm/trtllm and adding a separate TestMultiTurnToolCallTokenSpeed class with the 4B model, or at minimum confirming the 4B model passes the existing multi-turn tests reliably on the other engines before merging.


@claude claude Bot left a comment


Solid CI + e2e wiring for TokenSpeed. The iter_markers fix in hooks.py is a real bug fix — nice catch.

Summary: 0 🔴 Important · 3 🟡 Nit · 0 🟣 Pre-existing

Nits posted inline:

  1. Wheel cache is non-functional (ci_install_tokenspeed.sh): the cache-save path prints debug info but never writes a .whl to $WHEEL_CACHE, so every run rebuilds from source.
  2. Credential in echo (ci_install_tokenspeed.sh:123): echo prints $TOKENSPEED_REPO after the token has been embedded in it. GitHub Actions redacts secrets, but better to strip the auth portion.
  3. Model downgrade for all engines (test_function_calling.py): TestMultiTurnToolCall moved from Qwen2.5-14B (2 GPU) → Qwen3-4B (1 GPU) for all engines, not just TokenSpeed — worth confirming the smaller model is reliable for multi-turn tool calling on sglang/vllm/trtllm.

Also noted (couldn't post inline — line not in diff): the engine marker help text in hooks.py:28 still says (sglang, vllm, trtllm) without tokenspeed.

key4ng added 2 commits May 9, 2026 12:19
…dule

The 5 ``build_*_sampling_params_from_*`` helpers (chat / responses /
messages / completion / plain) plus their constraint helpers were
sitting on ``SglangSchedulerClient`` even though the OpenAI mapping is
backend-neutral. TokenSpeed's client was reaching across to call them
through ``SglangSchedulerClient::*`` which suggested SGLang owned the
OpenAI→sampling translation, when it doesn't.

Move them to ``crate::sampling_params`` as free functions. Both the
SGLang and TokenSpeed clients (and any future client that wants the
same mapping) now call ``crate::sampling_params::build_*`` directly.

The return type is still ``sglang::SamplingParams`` because that proto
happens to be the most permissive shape across our supported backends;
TokenSpeed translates to its own slimmer shape at the wire boundary.
When a backend grows a sampling field SGLang lacks, this is the place
to add it.

No behavior change. Tests stay green; the call sites in
``build_generate_request_from_*`` are mechanically updated.

Signed-off-by: key4ng <[email protected]>
…ation)

TokenSpeed previously rode the SGLang IR arms in proto_wrapper.rs:
``TokenSpeedSchedulerClient::generate()`` accepted ``sglang::GenerateRequest``,
the streaming response was translated back into ``sglang::GenerateResponse``
at the wire boundary, and the router dispatched through
``ProtoGenerateRequest::Sglang``. This let TokenSpeed reuse the existing
match arms but coupled it to SGLang's evolving schema — every SGLang field
addition forced a TokenSpeed translator stub (most recently
``default_sampling_params_json``).

Add native TokenSpeed arms to the router IR:

  - ``ProtoGenerateRequest::TokenSpeed(Box<tokenspeed::GenerateRequest>)``
  - ``ProtoGenerateResponse::TokenSpeed(Box<tokenspeed::GenerateResponse>)``
  - ``ProtoGenerateStreamChunk::TokenSpeed(tokenspeed::GenerateStreamChunk)``
  - ``ProtoGenerateComplete::TokenSpeed(tokenspeed::GenerateComplete)``

…with ``as_tokenspeed`` / ``as_tokenspeed_mut`` / ``is_tokenspeed`` accessors
mirroring the existing per-backend pattern (Mlx / Vllm / Trtllm) and
``TokenSpeed`` arms in every aggregator method (``token_ids``, ``index``,
``output_logprobs``, ``input_logprobs``, ``prompt_tokens``,
``completion_tokens``, ``cached_tokens``, ``finish_reason``,
``matched_stop_json``, ``output_ids``, ``set_stream``, ``request_id``,
``set_max_tokens_for_prefill``, ``clear_mm_inputs``,
``set_kv_transfer_params``, ``kv_transfer_params``).

Client-side:

  - ``TokenSpeedSchedulerClient::generate()`` now takes
    ``tokenspeed_proto::GenerateRequest``
  - ``AbortOnDropStream::Item`` is ``tokenspeed_proto::GenerateResponse``
  - The 5 ``build_*_request`` builders return native ``tokenspeed_proto``
    types
  - ``translate::generate_request`` / ``generate_response`` /
    ``stream_chunk`` / ``complete`` / ``output_logprobs`` are gone — the
    only translation kept is ``translate::sampling_params`` (a thin field
    map) plus the unary RPC adapters (``model_info`` / ``server_info`` /
    ``loads``), which still produce SGLang shapes because the router's
    ``ModelInfo`` / ``ServerInfo`` enums consume those — that's a
    separate cleanup.

Router-side:

  - ``client.rs::generate()`` dispatch arm now matches
    ``(Self::TokenSpeed(_), ProtoGenerateRequest::TokenSpeed(_))``.
  - The 5 ``build_*_request`` paths in ``GrpcClient`` wrap into
    ``ProtoGenerateRequest::TokenSpeed`` instead of ``::Sglang``.
  - ``harmony/stages/request_building.rs`` builds
    ``ProtoGenerateRequest::TokenSpeed`` and grew a ``TokenSpeed`` arm
    in the Harmony stop-token injection match.
  - ``common/stages/helpers.rs::apply_sampling_defaults_to_generate_request``
    early-returns for TokenSpeed (alongside Trtllm) since neither
    backend plumbs sampling defaults through today; the explicit arm
    keeps the match exhaustive.

PD-disagg paths (``response_collection.rs``) remain SGLang-keyed — the
``if let ProtoGenerateComplete::Sglang(...)`` checks simply won't match
TokenSpeed responses, which is the correct behavior since TokenSpeed
doesn't ship PD-disaggregation.

Verification:

  - ``cargo +nightly fmt --all -- --check`` passes
  - ``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`` passes
  - ``cargo check -p smg --bin smg`` passes

This addresses the architectural concern raised on #1351: SGLang's proto
shouldn't be the de-facto router IR. Each backend now has its own arm,
matching how vLLM / MLX / TRT-LLM are integrated.

Signed-off-by: key4ng <[email protected]>
key4ng force-pushed the feat/grpc-tokenspeed-servicer branch from 52cda9d to c3c3c02 on May 9, 2026 19:36
key4ng force-pushed the feat/grpc-tokenspeed-ci-tests branch 2 times, most recently from 0234f43 to 5297fbe on May 9, 2026 20:24
key4ng added 2 commits May 9, 2026 13:52
…ection

The chat-template tool-shape pre-processor was correct for Kimi-K2.5
(BFCL accuracy +6 pp on simple_python, +24 pp on parallel_multiple)
but breaks Mistral chat templates: their template at line 32 iterates
``tool.function.parameters.properties.items()``, which raises
``unknown method: undefined has no method named items`` once we unwrap
``{"type": "function", "function": {...}}`` into the bare inner dict.

The shape a chat template expects is template-dependent, not
engine-dependent. Reverting the unconditional unwrap; full rationale,
accuracy data, and proposed per-model fix in
docs/proposals/2026-05-09-deferred-chat-template-tools-strip.md.

Affected CI lane: e2e_test/chat_completions/test_function_calling.py::TestToolChoiceMistral
(20 tests failing with chat:32 render error).

Signed-off-by: key4ng <[email protected]>
Cuts ~125 lines of doc-comment / inline-rationale prose without losing
information. Hot spots:

- tokenspeed_scheduler.rs module doc (28 → 7 lines)
- tokenspeed_scheduler.proto service header + SamplingParams comment
  + per-field rationale (~40 lines)
- sampling_params.rs module doc (14 → 4 lines)
- translate::sampling_params explainer (12 → 3 lines)
- inline arm comments in client.rs / multimodal.rs / harmony/stages /
  common/stages that just re-stated what the code already shows

Behavior unchanged. ``cargo +nightly fmt --check`` passes;
``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features
-- -D warnings`` passes.

Signed-off-by: key4ng <[email protected]>
key4ng force-pushed the feat/grpc-tokenspeed-servicer branch from 21ef827 to 788933a on May 9, 2026 20:53
key4ng force-pushed the feat/grpc-tokenspeed-ci-tests branch from 5297fbe to 71a7883 on May 9, 2026 20:53
key4ng force-pushed the feat/grpc-tokenspeed-servicer branch from 788933a to d16d38f on May 11, 2026 03:42
key4ng force-pushed the feat/grpc-tokenspeed-ci-tests branch from 71a7883 to 9432064 on May 11, 2026 03:42
Tightens the comments introduced (or modified) by this PR to describe
behavior directly. No behavior change; literal type / path references
left intact.

Files touched:
- crates/grpc_client/build.rs
- crates/grpc_client/proto/tokenspeed_scheduler.proto
- crates/grpc_client/src/sampling_params.rs
- crates/grpc_client/src/tokenspeed_scheduler.rs
- model_gateway/src/routers/grpc/client.rs
- model_gateway/src/routers/grpc/proto_wrapper.rs

Signed-off-by: key4ng <[email protected]>
key4ng force-pushed the feat/grpc-tokenspeed-servicer branch from d16d38f to 3f8983a on May 11, 2026 03:44
key4ng force-pushed the feat/grpc-tokenspeed-ci-tests branch from 9432064 to ace5e6d on May 11, 2026 03:44
The IR refactor split TokenSpeed off the SGLang request arm, which
left two gaps in the sampling-defaults path:

1. `translate::model_info` populated the `preferred_sampling_params`
   label slot from TokenSpeed's response but left
   `default_sampling_params_json` empty. Worker discovery reads only
   the latter, so model-published defaults never reached the label
   map and were invisible to the request-stage injector.

2. `apply_sampling_defaults_to_generate_request` early-returned for
   `ProtoGenerateRequest::TokenSpeed(_)`. Even if a worker's labels
   carried defaults, the injector skipped the TokenSpeed arm — so a
   model publishing `temperature=0.7` in its generation config would
   apply to the other engines but TokenSpeed would fall through to
   the hardcoded 1.0 in the request builder.

Both fixed:

- `tokenspeed_scheduler.rs`: surface `preferred_sampling_params` in
  both `preferred_sampling_params` and `default_sampling_params_json`
  slots so the discovery path picks it up.
- `helpers.rs`: drop TokenSpeed from the early-return, add a
  `TokenSpeed(req)` match arm calling a new
  `apply_tokenspeed_sampling_defaults`. TokenSpeed's wire declares
  every sampling scalar as `optional`, so the helper writes
  `Some(value)` rather than the bare value — preserving the
  set-vs-unset distinction the servicer's `HasField()` checks rely on.

Verification: `cargo +nightly fmt --all -- --check` and `cargo clippy
-p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`
both pass.

Signed-off-by: key4ng <[email protected]>
key4ng force-pushed the feat/grpc-tokenspeed-servicer branch from 3f8983a to 2ecbbb9 on May 11, 2026 03:52
key4ng force-pushed the feat/grpc-tokenspeed-ci-tests branch from ace5e6d to 243c1b5 on May 11, 2026 03:53
key4ng added 2 commits May 10, 2026 21:01
- Collapse the duplicate 2-line crate-level docstring into a single
  neutral line covering all supported backends.
- Drop the 5-line TokenSpeed re-export rationale block; the design
  rationale already lives in `tokenspeed_scheduler.rs`'s module doc.

Signed-off-by: key4ng <[email protected]>
Adds the Python servicer that runs alongside a TokenSpeed scheduler
process and serves the gRPC protocol PR1 introduced. Includes:

  - the async scheduler servicer (Generate/HealthCheck/Abort/
    GetModelInfo/GetServerInfo/GetLoads), with cancellation handling
    for streaming, non-streaming, channel-close, and n>1 paths
  - a health-service bridge that flips SERVING/NOT_SERVING based on
    scheduler liveness (deep probe with bounded staleness)
  - a scheduler launcher that boots TokenSpeed's AsyncLLM in-process
  - the ``python -m smg_grpc_servicer.tokenspeed`` entrypoint
  - real ``GetLoads`` plumbing backed by ``AsyncLLM.get_load()`` so
    router-side load balancing reflects scheduler-side metrics
  - 57 unit tests covering the servicer, health service, proto
    conversion, finish reasons, sampling params, streaming/non-
    streaming generation, abort/cancel (incl. n>1), model/server
    info, and load metrics

This is part 2 of 3 splitting #1351:
  - PR1: Rust gRPC + protocol (merged first)
  - PR2 (this): Python servicer + unit tests
  - PR3: CI workflows + e2e tests

Stacked on PR1 — the servicer imports the proto stubs PR1 generates
from ``crates/grpc_client/proto/tokenspeed_scheduler.proto``.

Fixes a 🔴 critical from review on #1351:
  - FakeAsyncLLM.generate_request crashed with
    ``TypeError: unhashable type: 'list'`` when n>1, because
    ``_build_generate_req`` rewrites ``rid`` to a list of per-choice
    ids. The fake engine now registers state for each child rid, so
    ``test_cancel_aborts_all_n_children`` exercises the cancel sweep
    instead of dying at setup.

Signed-off-by: key4ng <[email protected]>
key4ng force-pushed the feat/grpc-tokenspeed-ci-tests branch from bf462f9 to d79af1d on May 12, 2026 06:42

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c57a3ec22


Comment thread e2e_test/infra/worker.py
Comment on lines +292 to +293
"--tensor-parallel-size",
str(tp_size),

P2: Use TokenSpeed's parallelism flag

When the TokenSpeed e2e worker is launched, this argument is passed through to prepare_server_args, but the pinned TokenSpeed ServerArgs.add_cli_args does not define --tensor-parallel-size; its parallelism flags are --world-size, --attn-tp-size, and --nprocs-per-node (and allow_abbrev is disabled). Any TokenSpeed test worker therefore exits during argument parsing before it can become healthy, so the new TokenSpeed GPU lanes never exercise the servicer.


description = "SMG gRPC servicer implementations for LLM inference engines (vLLM, SGLang, MLX, TokenSpeed)"
requires-python = ">=3.10"
dependencies = [
"smg-grpc-proto>=0.4.6",

🔴 Important: The proto version floor >=0.4.6 is too low. This PR adds tokenspeed_scheduler_pb2 / tokenspeed_scheduler_pb2_grpc to smg-grpc-proto, but the proto package version stays at 0.4.7 (same as main). If someone already has smg-grpc-proto==0.4.7 installed (the old build without TokenSpeed stubs), pip will see >=0.4.6 satisfied and skip the upgrade → from smg_grpc_proto.generated import tokenspeed_scheduler_pb2 fails at import time.

Fix: bump crates/grpc_client/python/pyproject.toml version to 0.4.8 and set this floor to >=0.4.8:

Suggested change
"smg-grpc-proto>=0.4.6",
"smg-grpc-proto>=0.4.8",


claude Bot commented May 12, 2026

Incremental Review Summary (synchronize)

Reviewed the full diff (origin/main...HEAD, ~4,500+ lines across 49 files). This is a follow-up review after new commits were pushed.

Open 🔴 Important issues (4)

All four were flagged in the prior review cycle and remain unaddressed in the latest push:

  1. model_gateway/src/routers/grpc/common/stages/helpers.rs:159-166 — apply_sampling_defaults_to_generate_request early-returns when TokenSpeed sampling_params is None, silently skipping worker defaults. Should get_or_insert_with(Default::default) instead.
  2. model_gateway/src/routers/grpc/harmony/stages/request_building.rs:369-377 — Harmony stop tokens (<|return|>, <|call|>) silently dropped when sampling_params is None. Same root cause as (1).
  3. model_gateway/src/routers/grpc/multimodal.rs:711-713 — unreachable!() is reachable. Pre-guards in chat/request_building.rs and messages/request_building.rs only check is_mlx(), not is_tokenspeed(), so multimodal requests to a TokenSpeed backend will panic.
  4. grpc_servicer/pyproject.toml:11 — Proto version floor >=0.4.6 is too low; needs >=0.4.8 (and proto package version bump) since TokenSpeed stubs are new in this PR. (Just posted inline comment with suggestion.)

Resolved since last review (1)

  • scripts/ci_install_tokenspeed.sh — Token-in-URL echo issue resolved by commit 2b15dc2 (repo URL is now public, no embedded PAT).

Open 🟡 Nit issues (3)

  • scripts/ci_install_tokenspeed.sh:116-131 — Wheel cache write path is non-functional (never actually copies a .whl).
  • grpc_servicer/smg_grpc_servicer/tokenspeed/server.py:169 — Warmup self-probe uses server_args.host directly, which fails on wildcard 0.0.0.0/::.
  • e2e_test/chat_completions/test_function_calling.py:1583 / test_openai_server.py:469 — TokenSpeed added to engine markers for tests that likely can't run on TokenSpeed yet.

Not approving — synchronize event, and 🔴 issues remain open.

key4ng force-pushed the feat/grpc-tokenspeed-ci-tests branch 2 times, most recently from 06fe339 to bb6e97e on May 12, 2026 14:21
key4ng added 9 commits May 12, 2026 07:54
… models

When tokenspeed runs with a reasoning parser that has an xgrammar
template (e.g. ``gpt-oss`` → ``harmony``), forwarding a raw JSON-schema
constraint causes xgrammar to fight the Harmony channel preamble
(``<|channel|>analysis<|message|>…``): the model either generates
garbage or stalls until ``max_tokens``, leaving ``content`` empty.

Mirror tokenspeed's HTTP entrypoint (``serving_chat.py``): when a
``reasoning_parser`` is configured, wrap the user's JSON schema via
``structural_tag_for_reasoning_json_schema()`` so the grammar only
activates inside the response channel. Parsers without an xgrammar
mapping fall back to the raw json_schema unchanged.

Plumbs ``reasoning_parser`` into ``_sampling_params_from_proto`` as a
keyword-only argument so the helper stays a static method and existing
tests keep passing without modification. The new import of
``tokenspeed.runtime.grammar.reasoning_structural_tag`` is wrapped in
``try/except ImportError`` so stale tokenspeed pins fall back to raw
json_schema rather than crashing.

Signed-off-by: key4ng <[email protected]>
Wires TokenSpeed into CI and the GPU e2e suite:

  - ``.github/actions/setup-tokenspeed`` composite action and
    ``scripts/ci_install_tokenspeed.sh`` to source-install TokenSpeed
    (kernel + scheduler) at a pinned ref, with a wheel cache lookup so
    repeat runs skip the ~20 min compile
  - e2e-gpu-job.yml: add a tokenspeed engine lane, gated on secret
    access so forked PRs skip cleanly
  - pr-test-rust.yml: install the same proto deps so Rust-only changes
    that touch ``crates/grpc_client`` still cover the tokenspeed proto
  - e2e_test infra: ``constants``, ``hooks``, ``worker``, and
    ``model_specs`` learn about a ``tokenspeed`` runtime alongside
    sglang/vllm/trtllm; ``worker.py`` adds the launch builder; the
    suite-wide ``@pytest.mark.engine(...)`` markers expand to include
    tokenspeed
  - Function-calling and tool_choice e2e suites swap to
    ``Qwen/Qwen3-4B-Instruct-2507`` for tool-call coverage (the Qwen3
    family is what TokenSpeed's model registry currently supports)

This is part 3 of 3 splitting #1351:
  - PR1: Rust gRPC + protocol
  - PR2: Python servicer + unit tests
  - PR3 (this): CI workflows + e2e tests

Stacked on PR2. e2e wiring exercises both the Rust router from PR1 and
the Python servicer from PR2 against a live TokenSpeed worker.

Addresses CatherineSue's review on #1351:
  - drop the verbose Qwen3-4B docstring on ``TestToolChoiceQwen`` —
    that context belongs in the PR description, not in the test file

Signed-off-by: key4ng <[email protected]>
…urce

lightseekorg/tokenspeed is now public, so the clone no longer needs
HTTPS basic auth and the fork-PR skip-on-missing-secret guard is no
longer necessary.

- scripts/ci_install_tokenspeed.sh: drop the
  ``TOKENSPEED_GITHUB_TOKEN`` rewrite block; the default
  ``https://github.com/...`` URL clones anonymously.
- .github/actions/setup-tokenspeed/action.yml: drop the
  ``github-token`` input and the env-forwarding step.
- .github/workflows/e2e-gpu-job.yml: drop the job-level ``if``
  guard that skipped tokenspeed on forked PRs, and the ``with:
  github-token: ${{ secrets.TOKENSPEED_GITHUB_TOKEN }}`` plumbing
  on the ``Setup TokenSpeed backend`` step.

TOKENSPEED_REF stays pinned to a tested SHA — bumping that is a
separate decision.

Signed-off-by: key4ng <[email protected]>
Switches the TokenSpeed e2e lane to run inside the upstream-published
``lightseekorg/tokenspeed-runner:cu130-torch-2.11.0`` image instead of
on the bare runner host. That image already ships CUDA 13.0, Torch
2.11.0, nanobind, cmake, libopenmpi-dev, and libssl-dev — everything
the install script previously apt-installed at job time.

Changes:

- ``.github/workflows/e2e-gpu-job.yml``: add a job-level ``container:``
  expression that evaluates to the runner image only when
  ``inputs.engine == 'tokenspeed'``, otherwise ``null`` (bare host, no
  change for the SGLang/vLLM/TRT-LLM/MLX lanes). Bind-mounts
  ``/models`` and ``/tmp/tokenspeed-wheel-cache`` so the existing model
  cache and kernel-wheel cache work unchanged inside the container.

- ``scripts/ci_install_tokenspeed.sh``: drop the CUDA-toolkit apt-get
  block (~45 lines), the ``libssl-dev``/``libopenmpi-dev``/``cmake``
  install, and the ``GITHUB_ENV``/``GITHUB_PATH`` persistence — all
  redundant now that the container baseline already exports
  ``CUDA_HOME``, ``LD_LIBRARY_PATH``, etc. Script shrinks from 189 →
  105 lines.

- Bump ``TOKENSPEED_REF`` to ``70030b29`` (latest
  lightseekorg/tokenspeed main, "Add KV cache events to scheduler").

The unavoidable ~30 min CUDA kernel compile still happens on first
run because upstream doesn't publish pre-built engine wheels yet —
the wheel cache at ``/tmp/tokenspeed-wheel-cache`` short-circuits it
on every subsequent run.

Signed-off-by: key4ng <[email protected]>
The previous commit (388cfc1) moved the TokenSpeed lane inside
``lightseekorg/tokenspeed-runner:cu130-torch-2.11.0`` via a job-level
``container:`` directive. On the k8s-runner-gpu pods this fails: the
runner image pull hits ``unexpected EOF`` mid-download (likely
ephemeral-storage pressure on the pod for a large CUDA image) and the
retry can't reach the docker daemon at all (it's not stably available
inside the pod).

Roll back the workflow ``container:`` block and restore the host-side
toolkit install in ``ci_install_tokenspeed.sh`` (CUDA-13 apt-get,
``libssl-dev``/``libopenmpi-dev``/``cmake``, ``GITHUB_ENV``/``GITHUB_PATH``
persistence). Keep the bumped ``TOKENSPEED_REF`` — that part is
independent of the container experiment.

The "use the upstream runner image" simplification is still the right
direction in principle; revisiting after the runner infra grows
stable container support, or after upstream publishes engine wheels
(which would remove the source-build step entirely).

Signed-off-by: key4ng <[email protected]>
…-path)

Upstream lightseekorg/tokenspeed (current main) renamed the
``--model-path`` argparse flag to ``--model``. The old name remains
only as a positional alias on the parser; passing ``--model-path``
fails the worker boot with:

  __main__.py: error: unrecognized arguments: --model-path

Switch ``_build_tokenspeed_grpc_cmd`` to the new flag form. The
servicer's ``server_args.model_path`` attribute is still populated
because upstream's ``prepare_server_args`` resolves both the positional
``model_path`` and ``--model`` sources into the same attribute.

Only the tokenspeed lane is affected; sglang / vllm / trtllm / mlx
each have their own argparse and continue to accept ``--model-path``.

Signed-off-by: key4ng <[email protected]>
The bumped ``TOKENSPEED_REF`` dropped ``Qwen3MoeForCausalLM`` from
its model registry, so the existing ``Qwen/Qwen3-30B-A3B`` worker
fails to start on the tokenspeed lane with:

  ValueError: Model architectures ['Qwen3MoeForCausalLM'] are not
  supported for now.

Switch ``TestEnableThinking`` to ``Qwen/Qwen3.5-27B``. The model:

- is ``Qwen3_5ForConditionalGeneration`` — in the current tokenspeed
  registry (the family upstream is investing in);
- has the ``enable_thinking`` chat-template toggle the test exercises;
- emits the same ``<think>...</think>`` reasoning markers, so the
  existing ``qwen3`` SMG reasoning parser handles it without changes;
- weighs ~57 GB (54 GB text + ~3 GB vision encoder), which fits on
  the 1×H100 80GB lane with ~20 GB headroom for KV cache and
  activations.

No SMG-side parser changes needed for this test. For future Qwen3.5
function-calling tests, the SMG tool-parser factory already
auto-maps ``Qwen/Qwen3.5*`` → ``qwen_xml`` at
``crates/tool_parser/src/factory.rs:353``.

Changes:

- ``e2e_test/infra/model_specs.py``: add a ``Qwen/Qwen3.5-27B`` spec
  entry (``tp: 1``, thinking + reasoning features) and point
  ``DEFAULT_ENABLE_THINKING_MODEL_PATH`` at it. The existing
  ``Qwen/Qwen3-30B-A3B`` entry stays for the nightly perf benchmark.
- ``e2e_test/chat_completions/test_enable_thinking.py``: update the
  ``@pytest.mark.model`` to ``Qwen/Qwen3.5-27B``.

Signed-off-by: key4ng <[email protected]>
After model load (~57 GB on 1×H100 80GB) the remaining ~16 GB of GPU
memory can't pre-allocate KV cache for the model's 256K native context.
vLLM fails the worker boot with:

  ValueError: To serve at least one request with the models's max seq
  len (262144), (16.17 GiB KV cache is needed, which is larger than
  the available KV cache memory (16.0 GiB).

The TestEnableThinking test sends short single-turn chats ("Hello"),
so cap the context at 16K — same value the ``Qwen/Qwen2.5-14B-Instruct``
spec uses in this file. Apply across all engines that pre-allocate
KV cache by max-seq-len (sglang, vllm, tokenspeed); TRT-LLM allocates
dynamically and keeps the existing ``free_gpu_memory_fraction: 0.8``
knob.

Also pass ``--enforce-eager`` on the engines that respect it — the
hybrid Mamba (Gated DeltaNet) + attention layout makes CUDA-graph
capture finicky and is unnecessary for a smoke test.

Signed-off-by: key4ng <[email protected]>
Dense Qwen3 (``Qwen3ForCausalLM``) is in the bumped tokenspeed pin's
model registry where ``Qwen3MoeForCausalLM`` (``Qwen3-30B-A3B``) and
``Qwen3_5ForConditionalGeneration`` (``Qwen3.5-27B``, which also
overflowed vLLM's CUDA-13 ``nixl_ep`` dependency) are not. Hybrid
Qwen3 keeps the ``enable_thinking`` chat-template toggle the test
needs and emits the same ``<think>...</think>`` markers the existing
``qwen3`` SMG reasoning parser handles.

4B fits trivially on the 1×H100 lane (~8 GB weights), so the
engine-specific KV cap and ``--enforce-eager`` args the Qwen3.5-27B
spec needed are no longer required.

Changes:

- ``e2e_test/infra/model_specs.py``: replace the ``Qwen/Qwen3.5-27B``
  entry with ``Qwen/Qwen3-4B``; point
  ``DEFAULT_ENABLE_THINKING_MODEL_PATH`` at it. Keep the
  ``Qwen/Qwen3-30B-A3B`` entry for the nightly perf benchmark.
- ``e2e_test/chat_completions/test_enable_thinking.py``: update
  ``@pytest.mark.model`` to ``Qwen/Qwen3-4B``.

Signed-off-by: key4ng <[email protected]>
key4ng force-pushed the feat/grpc-tokenspeed-ci-tests branch from bb6e97e to 779445a on May 12, 2026 15:12
github-actions Bot added the reasoning-parser (Reasoning parser changes) label on May 12, 2026
Contributor

mergify Bot commented May 12, 2026

Hi @key4ng, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ab36713b8


Comment on lines +17 to +18
tokenspeed_scheduler_pb2,
tokenspeed_scheduler_pb2_grpc,

P2: Bump the proto package version for TokenSpeed stubs

When the TokenSpeed servicer is installed from published packages rather than the CI editable checkout, these new tokenspeed_scheduler_pb2(_grpc) exports need a distinct smg-grpc-proto release to depend on; however crates/grpc_client/python/pyproject.toml still declares version 0.4.7, the same version that predates these stubs. Without bumping the proto package version, downstream installs cannot reliably require a build that contains these modules, so TokenSpeed startup can still fail at import time even if the servicer dependency is tightened.


grpc_servicer/pyproject.toml
if [ "${RELEASE_PROTO}" = "true" ]; then
# Matches both the core dep ("smg-grpc-proto>=0.4.6") and the
# mlx extra ("smg-grpc-proto>=0.4.7").

🟡 Nit: always() also runs on workflow cancellation. If a user cancels after one build job has already succeeded, the release job will still execute and publish artifacts. Using !cancelled() instead preserves the intended behavior (run when dependencies are skipped) while correctly skipping on cancellation.

Suggested change
# mlx extra ("smg-grpc-proto>=0.4.7").
!cancelled()


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 921a3d9f14




@pytest.mark.engine("sglang", "vllm")
@pytest.mark.engine("sglang", "vllm", "tokenspeed")

P2: Add TokenSpeed to non-chat e2e matrices

When this marker is added, these tests are only selected if a CI job runs with E2E_ENGINE=tokenspeed. I checked .github/workflows/pr-test-rust.yml: TokenSpeed was added only to the chat matrices, while the completions job still lists only sglang/vllm and the router/responses jobs likewise do not run TokenSpeed. As a result, the newly marked TokenSpeed completions/router/responses coverage is never exercised in PR CI; add TokenSpeed to the corresponding matrices or drop the marks.


Contributor

mergify Bot commented May 14, 2026

Hi @key4ng, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

mergify Bot added the needs-rebase (PR has merge conflicts that need to be resolved) label on May 14, 2026

Labels

  • ci (CI/CD configuration changes)
  • dependencies (Dependency updates)
  • grpc (gRPC client and router changes)
  • model-gateway (Model gateway crate changes)
  • needs-rebase (PR has merge conflicts that need to be resolved)
  • priority:high (High priority)
  • protocols (Protocols crate changes)
  • reasoning-parser (Reasoning parser changes)
  • tests (Test changes)
