ci(tokenspeed): add CI install + GPU e2e coverage (Part 3/3) #1465
key4ng wants to merge 24 commits into
Conversation
Adds TokenSpeed as a first-class GPU backend on the Rust side: a self-contained `tokenspeed.grpc.scheduler.TokenSpeedScheduler` proto, the `TokenSpeedSchedulerClient` wrapper that translates SGLang-shaped request/response types at the boundary, and the model_gateway router plumbing (client dispatch, runtime detection, harmony/regular request builders, multimodal and embedding stages).

This is part 1 of 3 splitting #1351:
- PR1 (this): Rust gRPC + protocol
- PR2: Python servicer (grpc_servicer)
- PR3: CI workflows + e2e tests

PR1 is functionally inert without PR2 — the router can dial a TokenSpeed worker, but the worker process lives in the Python servicer landed by PR2.

Addresses CatherineSue's review on #1351:
- shorten the TokenSpeed RuntimeType doc in protocols/worker.rs
- drop the verbose TokenSpeed note in grpc_client/build.rs
- restore the concise module doc in detect_backend.rs, just adding tokenspeed to the existing health-check ordering

Signed-off-by: key4ng <[email protected]>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the repository's CodeRabbit settings.
📝 Walkthrough

This PR integrates TokenSpeed as a new gRPC inference backend across protocol definitions, Rust and Python clients/servicers, model gateway routing, E2E test infrastructure, and CI workflows. It adds a full TokenSpeed proto, shared sampling-params builders, a Rust TokenSpeed client, a Python TokenSpeed servicer/server/health, model-gateway plumbing and proto-wrapper variants, CI install/workflow steps, and broad E2E test coverage.

Changes

TokenSpeed Backend Support
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
Suggested reviewers
Code Review
This pull request integrates the TokenSpeed backend into the end-to-end testing suite. Key changes include the addition of a GitHub Action and a comprehensive installation script for TokenSpeed, infrastructure updates to support the new engine in the worker and constants, and the inclusion of TokenSpeed in numerous test suites. Additionally, the pytest_runtest_setup hook was improved to correctly handle multiple runtime skip markers. Feedback was provided to fix the caching logic in the installation script, as the current implementation fails to save the compiled kernel wheel, which would result in redundant 30-minute builds during CI.
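The hook improvement mentioned above relies on collecting every stacked marker rather than only the closest one. A minimal sketch of that pattern, assuming the suite's `engine` marker and the `E2E_ENGINE` selector variable referenced later in this thread; the default engine and the skip message are illustrative, not the repo's actual hook:

```python
import os

import pytest


def pytest_runtest_setup(item):
    # Collect *all* stacked @pytest.mark.engine(...) marks; get_closest_marker()
    # returns only the innermost one, which is the bug class described above.
    allowed = set()
    for marker in item.iter_markers(name="engine"):
        allowed.update(marker.args)
    engine = os.environ.get("E2E_ENGINE", "sglang")
    if allowed and engine not in allowed:
        pytest.skip(f"test not marked for engine {engine!r}")
```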
| echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..." | ||
| MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \ | ||
| uv pip install tokenspeed-kernel/python/ --no-build-isolation | ||
| # Cache the built wheel — uv stores wheels under its cache, copy out. | ||
| mkdir -p "$WHEEL_CACHE" | ||
| python3 -c "import tokenspeed_kernel, os, shutil, glob; \ | ||
| d = os.path.dirname(tokenspeed_kernel.__file__); \ | ||
| site = os.path.dirname(d); \ | ||
| whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \ | ||
| print('kernel install dir:', whls)" || true | ||
| fi |
The caching logic for tokenspeed-kernel is non-functional. The else block builds the kernel from source but fails to save the resulting wheel to $WHEEL_CACHE. Additionally, the Python snippet on lines 147-151 only prints the installation path instead of copying the built artifacts. This means the expensive ~30-minute compilation will occur on every CI run. Using uv pip wheel to generate the wheel directly into the cache directory is a more robust approach and ensures consistency with the use of uv in CI scripts.
| echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..." | |
| MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \ | |
| uv pip install tokenspeed-kernel/python/ --no-build-isolation | |
| # Cache the built wheel — uv stores wheels under its cache, copy out. | |
| mkdir -p "$WHEEL_CACHE" | |
| python3 -c "import tokenspeed_kernel, os, shutil, glob; \ | |
| d = os.path.dirname(tokenspeed_kernel.__file__); \ | |
| site = os.path.dirname(d); \ | |
| whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \ | |
| print('kernel install dir:', whls)" || true | |
| fi | |
| echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..." | |
| mkdir -p "$WHEEL_CACHE" | |
| MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \ | |
| uv pip wheel tokenspeed-kernel/python/ -w "$WHEEL_CACHE" --no-build-isolation | |
| uv pip install "$WHEEL_CACHE"/tokenspeed_kernel-*.whl --no-build-isolation | |
| fi |
References
- When installing Python packages in CI scripts, use `uv pip install` for consistency and performance if `uv` is already in use for other installations.
```bash
d = os.path.dirname(tokenspeed_kernel.__file__); \
site = os.path.dirname(d); \
whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
print('kernel install dir:', whls)" || true
fi

# Step 4: scheduler (scikit-build-core + nanobind + CMake).
echo "Building tokenspeed-scheduler..."
uv pip install tokenspeed-scheduler/

# Step 5: the Python runtime (pure-Python).
uv pip install "./python" --no-build-isolation

# ── Persist env to subsequent CI steps ─────────────────────────────────────
if [ -n "${GITHUB_ENV:-}" ]; then
  echo "CUDA_HOME=$CUDA_HOME" >> "$GITHUB_ENV"
```
🟡 Nit: The wheel-cache write path is non-functional. After building from source, it creates $WHEEL_CACHE and runs a Python snippet that prints the dist-info directory but never copies a .whl file into $WHEEL_CACHE. The cache-read path (lines 142-146) looks for tokenspeed_kernel-*.whl in that directory, so it will never find anything — every run rebuilds from source (~30 min).
Either:
- Actually locate and copy the built wheel into `$WHEEL_CACHE`, or
- Remove the dead caching code and document that every run is a fresh build (the `|| true` at the end already swallows failures, so this isn't a correctness bug, just a misleading comment + wasted `mkdir`).
```bash
cd "$TOKENSPEED_DIR"

# ── System dependencies (mirrors docker/Dockerfile) ─────────────────────────
```
🟡 Nit: This echo prints $TOKENSPEED_REPO which, after line 63, contains the embedded access token (https://x-access-token:<token>@github.com/...). GitHub Actions redacts known secret values in logs, so this is likely safe in practice, but it's still better hygiene to avoid echoing credential-bearing URLs. Consider printing the repo without the auth portion:
Suggested change:

```diff
 # ── System dependencies (mirrors docker/Dockerfile) ─────────────────────────
+echo "Cloning TokenSpeed ${TOKENSPEED_REF} from ${TOKENSPEED_REPO%%@*}@..."
```
| @pytest.mark.engine("sglang", "vllm", "trtllm", "tokenspeed") | ||
| @pytest.mark.gpu(1) | ||
| @pytest.mark.model("Qwen/Qwen3-4B-Instruct-2507") |
🟡 Nit: This changes TestMultiTurnToolCall from Qwen2.5-14B (2 GPUs) to Qwen3-4B (1 GPU) for all engines, not just TokenSpeed. The 14B model was presumably chosen for reliable multi-turn tool-calling behavior — a 4B model may be flakier at following the tool-call/tool-result conversation pattern.
If the intent is just to add TokenSpeed coverage (which needs a Qwen3 model), consider keeping the existing 14B config for sglang/vllm/trtllm and adding a separate TestMultiTurnToolCallTokenSpeed class with the 4B model, or at minimum confirming the 4B model passes the existing multi-turn tests reliably on the other engines before merging.
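One way to implement the split the reviewer suggests, using the file's existing marker conventions; the shared-checks mixin and its placeholder test body are illustrative, not the suite's real structure:

```python
import pytest


class MultiTurnToolCallChecks:
    """Shared assertions (placeholder; the real tests live in the suite)."""

    def test_multi_turn_tool_call(self):
        ...


@pytest.mark.engine("sglang", "vllm", "trtllm")
@pytest.mark.gpu(2)
@pytest.mark.model("Qwen/Qwen2.5-14B-Instruct")
class TestMultiTurnToolCall(MultiTurnToolCallChecks):
    """Existing coverage, unchanged for the reference engines."""


@pytest.mark.engine("tokenspeed")
@pytest.mark.gpu(1)
@pytest.mark.model("Qwen/Qwen3-4B-Instruct-2507")
class TestMultiTurnToolCallTokenSpeed(MultiTurnToolCallChecks):
    """Same checks against the Qwen3 model TokenSpeed's registry supports."""
```

Keeping the markers off the shared mixin matters: pytest marks are inherited, so a marked subclass of a marked base would otherwise match both engine sets when the hook unions every `engine` marker.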
Solid CI + e2e wiring for TokenSpeed. The iter_markers fix in hooks.py is a real bug fix — nice catch.
Summary: 0 🔴 Important · 3 🟡 Nit · 0 🟣 Pre-existing
Nits posted inline:
- Wheel cache is non-functional (`ci_install_tokenspeed.sh`): the cache-save path prints debug info but never writes a `.whl` to `$WHEEL_CACHE`, so every run rebuilds from source.
- Credential in echo (`ci_install_tokenspeed.sh:123`): `echo` prints `$TOKENSPEED_REPO` after the token has been embedded in it. GitHub Actions redacts secrets, but better to strip the auth portion.
- Model downgrade for all engines (`test_function_calling.py`): `TestMultiTurnToolCall` moved from Qwen2.5-14B (2 GPU) → Qwen3-4B (1 GPU) for all engines, not just TokenSpeed — worth confirming the smaller model is reliable for multi-turn tool calling on sglang/vllm/trtllm.
Also noted (couldn't post inline — line not in diff): the engine marker help text in hooks.py:28 still says (sglang, vllm, trtllm) without tokenspeed.
…dule

The 5 ``build_*_sampling_params_from_*`` helpers (chat / responses / messages / completion / plain) plus their constraint helpers were sitting on ``SglangSchedulerClient`` even though the OpenAI mapping is backend-neutral. TokenSpeed's client was reaching across to call them through ``SglangSchedulerClient::*``, which suggested SGLang owned the OpenAI→sampling translation, when it doesn't.

Move them to ``crate::sampling_params`` as free functions. Both the SGLang and TokenSpeed clients (and any future client that wants the same mapping) now call ``crate::sampling_params::build_*`` directly.

The return type is still ``sglang::SamplingParams`` because that proto happens to be the most permissive shape across our supported backends; TokenSpeed translates to its own slimmer shape at the wire boundary. When a backend grows a sampling field SGLang lacks, this is the place to add it.

No behavior change. Tests stay green; the call sites in ``build_generate_request_from_*`` are mechanically updated.

Signed-off-by: key4ng <[email protected]>
…ation)
TokenSpeed previously rode the SGLang IR arms in proto_wrapper.rs:
``TokenSpeedSchedulerClient::generate()`` accepted ``sglang::GenerateRequest``,
the streaming response was translated back into ``sglang::GenerateResponse``
at the wire boundary, and the router dispatched through
``ProtoGenerateRequest::Sglang``. This let TokenSpeed reuse the existing
match arms but coupled it to SGLang's evolving schema — every SGLang field
addition forced a TokenSpeed translator stub (most recently
``default_sampling_params_json``).
Add native TokenSpeed arms to the router IR:
- ``ProtoGenerateRequest::TokenSpeed(Box<tokenspeed::GenerateRequest>)``
- ``ProtoGenerateResponse::TokenSpeed(Box<tokenspeed::GenerateResponse>)``
- ``ProtoGenerateStreamChunk::TokenSpeed(tokenspeed::GenerateStreamChunk)``
- ``ProtoGenerateComplete::TokenSpeed(tokenspeed::GenerateComplete)``
…with ``as_tokenspeed`` / ``as_tokenspeed_mut`` / ``is_tokenspeed`` accessors
mirroring the existing per-backend pattern (Mlx / Vllm / Trtllm) and
``TokenSpeed`` arms in every aggregator method (``token_ids``, ``index``,
``output_logprobs``, ``input_logprobs``, ``prompt_tokens``,
``completion_tokens``, ``cached_tokens``, ``finish_reason``,
``matched_stop_json``, ``output_ids``, ``set_stream``, ``request_id``,
``set_max_tokens_for_prefill``, ``clear_mm_inputs``,
``set_kv_transfer_params``, ``kv_transfer_params``).
Client-side:
- ``TokenSpeedSchedulerClient::generate()`` now takes
``tokenspeed_proto::GenerateRequest``
- ``AbortOnDropStream::Item`` is ``tokenspeed_proto::GenerateResponse``
- The 5 ``build_*_request`` builders return native ``tokenspeed_proto``
types
- ``translate::generate_request`` / ``generate_response`` /
``stream_chunk`` / ``complete`` / ``output_logprobs`` are gone — the
only translation kept is ``translate::sampling_params`` (a thin field
map) plus the unary RPC adapters (``model_info`` / ``server_info`` /
``loads``), which still produce SGLang shapes because the router's
``ModelInfo`` / ``ServerInfo`` enums consume those — that's a
separate cleanup.
Router-side:
- ``client.rs::generate()`` dispatch arm now matches
``(Self::TokenSpeed(_), ProtoGenerateRequest::TokenSpeed(_))``.
- The 5 ``build_*_request`` paths in ``GrpcClient`` wrap into
``ProtoGenerateRequest::TokenSpeed`` instead of ``::Sglang``.
- ``harmony/stages/request_building.rs`` builds
``ProtoGenerateRequest::TokenSpeed`` and grew a ``TokenSpeed`` arm
in the Harmony stop-token injection match.
- ``common/stages/helpers.rs::apply_sampling_defaults_to_generate_request``
early-returns for TokenSpeed (alongside Trtllm) since neither
backend plumbs sampling defaults through today; the explicit arm
keeps the match exhaustive.
PD-disagg paths (``response_collection.rs``) remain SGLang-keyed — the
``if let ProtoGenerateComplete::Sglang(...)`` checks simply won't match
TokenSpeed responses, which is the correct behavior since TokenSpeed
doesn't ship PD-disaggregation.
Verification:
- ``cargo +nightly fmt --all -- --check`` passes
- ``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`` passes
- ``cargo check -p smg --bin smg`` passes
This addresses the architectural concern raised on #1351: SGLang's proto
shouldn't be the de-facto router IR. Each backend now has its own arm,
matching how vLLM / MLX / TRT-LLM are integrated.
Signed-off-by: key4ng <[email protected]>
Force-pushed 52cda9d to c3c3c02.
Force-pushed 0234f43 to 5297fbe.
…ection
The chat-template tool-shape pre-processor was correct for Kimi-K2.5
(BFCL accuracy +6 pp on simple_python, +24 pp on parallel_multiple)
but breaks Mistral chat templates: their template at line 32 iterates
``tool.function.parameters.properties.items()``, which raises
``unknown method: undefined has no method named items`` once we unwrap
``{"type": "function", "function": {...}}`` into the bare inner dict.
The shape a chat template expects is template-dependent, not
engine-dependent. Reverting the unconditional unwrap; full rationale,
accuracy data, and proposed per-model fix in
docs/proposals/2026-05-09-deferred-chat-template-tools-strip.md.
Affected CI lane: e2e_test/chat_completions/test_function_calling.py::TestToolChoiceMistral
(20 tests failing with chat:32 render error).
Signed-off-by: key4ng <[email protected]>
Cuts ~125 lines of doc-comment / inline-rationale prose without losing information. Hot spots:

- tokenspeed_scheduler.rs module doc (28 → 7 lines)
- tokenspeed_scheduler.proto service header + SamplingParams comment + per-field rationale (~40 lines)
- sampling_params.rs module doc (14 → 4 lines)
- translate::sampling_params explainer (12 → 3 lines)
- inline arm comments in client.rs / multimodal.rs / harmony/stages / common/stages that just re-stated what the code already shows

Behavior unchanged. ``cargo +nightly fmt --check`` passes; ``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`` passes.

Signed-off-by: key4ng <[email protected]>
Force-pushed 21ef827 to 788933a.
Force-pushed 5297fbe to 71a7883.
Force-pushed 788933a to d16d38f.
Force-pushed 71a7883 to 9432064.
Tightens the comments introduced (or modified) by this PR to describe behavior directly. No behavior change; literal type / path references left intact.

Files touched:
- crates/grpc_client/build.rs
- crates/grpc_client/proto/tokenspeed_scheduler.proto
- crates/grpc_client/src/sampling_params.rs
- crates/grpc_client/src/tokenspeed_scheduler.rs
- model_gateway/src/routers/grpc/client.rs
- model_gateway/src/routers/grpc/proto_wrapper.rs

Signed-off-by: key4ng <[email protected]>
Force-pushed d16d38f to 3f8983a.
Force-pushed 9432064 to ace5e6d.
The IR refactor split TokenSpeed off the SGLang request arm, which left two gaps in the sampling-defaults path:

1. `translate::model_info` populated the `preferred_sampling_params` label slot from TokenSpeed's response but left `default_sampling_params_json` empty. Worker discovery reads only the latter, so model-published defaults never reached the label map and were invisible to the request-stage injector.
2. `apply_sampling_defaults_to_generate_request` early-returned for `ProtoGenerateRequest::TokenSpeed(_)`. Even if a worker's labels carried defaults, the injector skipped the TokenSpeed arm — so a model publishing `temperature=0.7` in its generation config would apply to the other engines but TokenSpeed would fall through to the hardcoded 1.0 in the request builder.

Both fixed:
- `tokenspeed_scheduler.rs`: surface `preferred_sampling_params` in both `preferred_sampling_params` and `default_sampling_params_json` slots so the discovery path picks it up.
- `helpers.rs`: drop TokenSpeed from the early-return, add a `TokenSpeed(req)` match arm calling a new `apply_tokenspeed_sampling_defaults`. TokenSpeed's wire declares every sampling scalar as `optional`, so the helper writes `Some(value)` rather than the bare value — preserving the set-vs-unset distinction the servicer's `HasField()` checks rely on.

Verification: `cargo +nightly fmt --all -- --check` and `cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings` both pass.

Signed-off-by: key4ng <[email protected]>
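A small demonstration of the set-vs-unset distinction this commit preserves, assuming the generated Python stubs (the import path matches the one quoted elsewhere in this thread; the exact message and field names are illustrative):

```python
# With a proto3 field declared `optional float temperature`, HasField()
# distinguishes "router injected a default" from "field never written".
from smg_grpc_proto.generated import tokenspeed_scheduler_pb2 as pb2

params = pb2.SamplingParams()
assert not params.HasField("temperature")  # unset: servicer may apply its own default

params.temperature = 0.7  # router wrote Some(0.7), the model-published default
assert params.HasField("temperature")      # set: servicer honors the value as-is
```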
Force-pushed 3f8983a to 2ecbbb9.
Force-pushed ace5e6d to 243c1b5.
- Collapse the duplicate 2-line crate-level docstring into a single neutral line covering all supported backends.
- Drop the 5-line TokenSpeed re-export rationale block; the design rationale already lives in `tokenspeed_scheduler.rs`'s module doc.

Signed-off-by: key4ng <[email protected]>
Adds the Python servicer that runs alongside a TokenSpeed scheduler
process and serves the gRPC protocol PR1 introduced. Includes:
- the async scheduler servicer (Generate/HealthCheck/Abort/
GetModelInfo/GetServerInfo/GetLoads), with cancellation handling
for streaming, non-streaming, channel-close, and n>1 paths
- a health-service bridge that flips SERVING/NOT_SERVING based on
scheduler liveness (deep probe with bounded staleness)
- a scheduler launcher that boots TokenSpeed's AsyncLLM in-process
- the ``python -m smg_grpc_servicer.tokenspeed`` entrypoint
- real ``GetLoads`` plumbing backed by ``AsyncLLM.get_load()`` so
router-side load balancing reflects scheduler-side metrics
- 57 unit tests covering the servicer, health service, proto
conversion, finish reasons, sampling params, streaming/non-
streaming generation, abort/cancel (incl. n>1), model/server
info, and load metrics
This is part 2 of 3 splitting #1351:
- PR1: Rust gRPC + protocol (merged first)
- PR2 (this): Python servicer + unit tests
- PR3: CI workflows + e2e tests
Stacked on PR1 — the servicer imports the proto stubs PR1 generates
from ``crates/grpc_client/proto/tokenspeed_scheduler.proto``.
Fixes a 🔴 critical from review on #1351:
- FakeAsyncLLM.generate_request crashed with
``TypeError: unhashable type: 'list'`` when n>1, because
``_build_generate_req`` rewrites ``rid`` to a list of per-choice
ids. The fake engine now registers state for each child rid, so
``test_cancel_aborts_all_n_children`` exercises the cancel sweep
instead of dying at setup.
Signed-off-by: key4ng <[email protected]>
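A minimal sketch of the health-bridge idea this commit describes, built on the standard `grpcio-health-checking` package; the class shape, probe hook, and staleness bound are assumptions, not the servicer's actual implementation:

```python
import time

from grpc_health.v1 import health, health_pb2


class SchedulerHealthBridge:
    """Flip gRPC health status off scheduler liveness, with bounded staleness."""

    def __init__(self, servicer: health.HealthServicer, max_staleness_s: float = 5.0):
        self._servicer = servicer
        self._max_staleness_s = max_staleness_s
        self._last_probe_ok = float("-inf")

    def record_probe_ok(self) -> None:
        # Called after each successful deep probe of the scheduler.
        self._last_probe_ok = time.monotonic()

    def refresh(self) -> None:
        # SERVING only while the last successful probe is fresh enough.
        fresh = (time.monotonic() - self._last_probe_ok) < self._max_staleness_s
        self._servicer.set(
            "",  # empty service name = overall server health
            health_pb2.HealthCheckResponse.SERVING
            if fresh
            else health_pb2.HealthCheckResponse.NOT_SERVING,
        )
```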
Force-pushed bf462f9 to d79af1d.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c57a3ec22
| "--tensor-parallel-size", | ||
| str(tp_size), |
Use TokenSpeed's parallelism flag
When the TokenSpeed e2e worker is launched, this argument is passed through to prepare_server_args, but the pinned TokenSpeed ServerArgs.add_cli_args does not define --tensor-parallel-size; its parallelism flags are --world-size, --attn-tp-size, and --nprocs-per-node (and allow_abbrev is disabled). Any TokenSpeed test worker therefore exits during argument parsing before it can become healthy, so the new TokenSpeed GPU lanes never exercise the servicer.
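A hypothetical version of the fix Codex is pointing at, mapping the suite's `tp_size` onto the flag TokenSpeed's parser defines; the function name and the `--port` flag are assumptions, while `--model` and the module entrypoint come from commits in this thread:

```python
def build_tokenspeed_grpc_cmd(model_path: str, port: int, tp_size: int) -> list[str]:
    # TokenSpeed's ServerArgs.add_cli_args defines --world-size (not
    # --tensor-parallel-size), and upstream renamed --model-path to --model.
    return [
        "python",
        "-m",
        "smg_grpc_servicer.tokenspeed",
        "--model",
        model_path,
        "--port",
        str(port),
        "--world-size",
        str(tp_size),
    ]
```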
| description = "SMG gRPC servicer implementations for LLM inference engines (vLLM, SGLang, MLX, TokenSpeed)" | ||
| requires-python = ">=3.10" | ||
| dependencies = [ | ||
| "smg-grpc-proto>=0.4.6", |
🔴 Important: The proto version floor >=0.4.6 is too low. This PR adds tokenspeed_scheduler_pb2 / tokenspeed_scheduler_pb2_grpc to smg-grpc-proto, but the proto package version stays at 0.4.7 (same as main). If someone already has smg-grpc-proto==0.4.7 installed (the old build without TokenSpeed stubs), pip will see >=0.4.6 satisfied and skip the upgrade → from smg_grpc_proto.generated import tokenspeed_scheduler_pb2 fails at import time.
Fix: bump crates/grpc_client/python/pyproject.toml version to 0.4.8 and set this floor to >=0.4.8:
| "smg-grpc-proto>=0.4.6", | |
| "smg-grpc-proto>=0.4.8", |
Incremental Review Summary (synchronize)

Reviewed the full diff.

Open 🔴 Important issues (4): all four were flagged in the prior review cycle and remain unaddressed in the latest push.
Resolved since last review (1)
Open 🟡 Nit issues (3)
Not approving — synchronize event, and 🔴 issues remain open.
Force-pushed 06fe339 to bb6e97e.
… models

When tokenspeed runs with a reasoning parser that has an xgrammar template (e.g. ``gpt-oss`` → ``harmony``), forwarding a raw JSON-schema constraint causes xgrammar to fight the Harmony channel preamble (``<|channel|>analysis<|message|>…``): the model either generates garbage or stalls until ``max_tokens``, leaving ``content`` empty.

Mirror tokenspeed's HTTP entrypoint (``serving_chat.py``): when a ``reasoning_parser`` is configured, wrap the user's JSON schema via ``structural_tag_for_reasoning_json_schema()`` so the grammar only activates inside the response channel. Parsers without an xgrammar mapping fall back to the raw json_schema unchanged.

Plumbs ``reasoning_parser`` into ``_sampling_params_from_proto`` as a keyword-only argument so the helper stays a static method and existing tests keep passing without modification. The new import of ``tokenspeed.runtime.grammar.reasoning_structural_tag`` is wrapped in ``try/except ImportError`` so stale tokenspeed pins fall back to raw json_schema rather than crashing.

Signed-off-by: key4ng <[email protected]>
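A condensed sketch of the fallback this commit describes; the module and function names come from the commit text, while the wrapper signature and the constraint-dict shape are assumptions:

```python
try:
    from tokenspeed.runtime.grammar.reasoning_structural_tag import (
        structural_tag_for_reasoning_json_schema,
    )
except ImportError:
    # Stale tokenspeed pin: fall back to forwarding the raw schema.
    structural_tag_for_reasoning_json_schema = None


def json_constraint(json_schema: str, reasoning_parser: str | None) -> dict:
    if reasoning_parser and structural_tag_for_reasoning_json_schema is not None:
        tag = structural_tag_for_reasoning_json_schema(reasoning_parser, json_schema)
        if tag is not None:
            # Grammar activates only inside the response channel, so it
            # no longer fights the reasoning-channel preamble.
            return {"structural_tag": tag}
    # Parser without an xgrammar mapping, or old pin: raw schema unchanged.
    return {"json_schema": json_schema}
```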
Wires TokenSpeed into CI and the GPU e2e suite:
- ``.github/actions/setup-tokenspeed`` composite action and
``scripts/ci_install_tokenspeed.sh`` to source-install TokenSpeed
(kernel + scheduler) at a pinned ref, with a wheel cache lookup so
repeat runs skip the ~20 min compile
- e2e-gpu-job.yml: add a tokenspeed engine lane, gated on secret
access so forked PRs skip cleanly
- pr-test-rust.yml: install the same proto deps so Rust-only changes
that touch ``crates/grpc_client`` still cover the tokenspeed proto
- e2e_test infra: ``constants``, ``hooks``, ``worker``, and
``model_specs`` learn about a ``tokenspeed`` runtime alongside
sglang/vllm/trtllm; ``worker.py`` adds the launch builder; the
suite-wide ``@pytest.mark.engine(...)`` markers expand to include
tokenspeed
- Function-calling and tool_choice e2e suites swap to
``Qwen/Qwen3-4B-Instruct-2507`` for tool-call coverage (the Qwen3
family is what TokenSpeed's model registry currently supports)
This is part 3 of 3 splitting #1351:
- PR1: Rust gRPC + protocol
- PR2: Python servicer + unit tests
- PR3 (this): CI workflows + e2e tests
Stacked on PR2. e2e wiring exercises both the Rust router from PR1 and
the Python servicer from PR2 against a live TokenSpeed worker.
Addresses CatherineSue's review on #1351:
- drop the verbose Qwen3-4B docstring on ``TestToolChoiceQwen`` —
that context belongs in the PR description, not in the test file
Signed-off-by: key4ng <[email protected]>
…urce

lightseekorg/tokenspeed is now public, so the clone no longer needs HTTPS basic auth and the fork-PR skip-on-missing-secret guard is no longer necessary.

- scripts/ci_install_tokenspeed.sh: drop the ``TOKENSPEED_GITHUB_TOKEN`` rewrite block; the default ``https://github.com/...`` URL clones anonymously.
- .github/actions/setup-tokenspeed/action.yml: drop the ``github-token`` input and the env-forwarding step.
- .github/workflows/e2e-gpu-job.yml: drop the job-level ``if`` guard that skipped tokenspeed on forked PRs, and the ``with: github-token: ${{ secrets.TOKENSPEED_GITHUB_TOKEN }}`` plumbing on the ``Setup TokenSpeed backend`` step.

TOKENSPEED_REF stays pinned to a tested SHA — bumping that is a separate decision.

Signed-off-by: key4ng <[email protected]>
Switches the TokenSpeed e2e lane to run inside the upstream-published ``lightseekorg/tokenspeed-runner:cu130-torch-2.11.0`` image instead of on the bare runner host. That image already ships CUDA 13.0, Torch 2.11.0, nanobind, cmake, libopenmpi-dev, and libssl-dev — everything the install script previously apt-installed at job time.

Changes:
- ``.github/workflows/e2e-gpu-job.yml``: add a job-level ``container:`` expression that evaluates to the runner image only when ``inputs.engine == 'tokenspeed'``, otherwise ``null`` (bare host, no change for the SGLang/vLLM/TRT-LLM/MLX lanes). Bind-mounts ``/models`` and ``/tmp/tokenspeed-wheel-cache`` so the existing model cache and kernel-wheel cache work unchanged inside the container.
- ``scripts/ci_install_tokenspeed.sh``: drop the CUDA-toolkit apt-get block (~45 lines), the ``libssl-dev``/``libopenmpi-dev``/``cmake`` install, and the ``GITHUB_ENV``/``GITHUB_PATH`` persistence — all redundant now that the container baseline already exports ``CUDA_HOME``, ``LD_LIBRARY_PATH``, etc. Script shrinks from 189 → 105 lines.
- Bump ``TOKENSPEED_REF`` to ``70030b29`` (latest lightseekorg/tokenspeed main, "Add KV cache events to scheduler").

The unavoidable ~30 min CUDA kernel compile still happens on first run because upstream doesn't publish pre-built engine wheels yet — the wheel cache at ``/tmp/tokenspeed-wheel-cache`` short-circuits it on every subsequent run.

Signed-off-by: key4ng <[email protected]>
The previous commit (388cfc1) moved the TokenSpeed lane inside ``lightseekorg/tokenspeed-runner:cu130-torch-2.11.0`` via a job-level ``container:`` directive. On the k8s-runner-gpu pods this fails: the runner image pull hits ``unexpected EOF`` mid-download (likely ephemeral-storage pressure on the pod for a large CUDA image) and the retry can't reach the docker daemon at all (it's not stably available inside the pod).

Roll back the workflow ``container:`` block and restore the host-side toolkit install in ``ci_install_tokenspeed.sh`` (CUDA-13 apt-get, ``libssl-dev``/``libopenmpi-dev``/``cmake``, ``GITHUB_ENV``/``GITHUB_PATH`` persistence). Keep the bumped ``TOKENSPEED_REF`` — that part is independent of the container experiment.

The "use the upstream runner image" simplification is still the right direction in principle; revisit after the runner infra grows stable container support, or after upstream publishes engine wheels (which would remove the source-build step entirely).

Signed-off-by: key4ng <[email protected]>
…-path)

Upstream lightseekorg/tokenspeed (current main) renamed the ``--model-path`` argparse flag to ``--model``. The old name remains only as a positional alias on the parser; passing ``--model-path`` fails the worker boot with:

    __main__.py: error: unrecognized arguments: --model-path

Switch ``_build_tokenspeed_grpc_cmd`` to the new flag form. The servicer's ``server_args.model_path`` attribute is still populated because upstream's ``prepare_server_args`` resolves both the positional ``model_path`` and ``--model`` sources into the same attribute.

Only the tokenspeed lane is affected; sglang / vllm / trtllm / mlx each have their own argparse and continue to accept ``--model-path``.

Signed-off-by: key4ng <[email protected]>
The bumped ``TOKENSPEED_REF`` dropped ``Qwen3MoeForCausalLM`` from its model registry, so the existing ``Qwen/Qwen3-30B-A3B`` worker fails to start on the tokenspeed lane with:

    ValueError: Model architectures ['Qwen3MoeForCausalLM'] are not supported for now.

Switch ``TestEnableThinking`` to ``Qwen/Qwen3.5-27B``. The model:

- is ``Qwen3_5ForConditionalGeneration`` — in the current tokenspeed registry (the family upstream is investing in);
- has the ``enable_thinking`` chat-template toggle the test exercises;
- emits the same ``<think>...</think>`` reasoning markers, so the existing ``qwen3`` SMG reasoning parser handles it without changes;
- weighs ~57 GB (54 GB text + ~3 GB vision encoder), which fits on the 1×H100 80GB lane with ~20 GB headroom for KV cache and activations.

No SMG-side parser changes needed for this test. For future Qwen3.5 function-calling tests, the SMG tool-parser factory already auto-maps ``Qwen/Qwen3.5*`` → ``qwen_xml`` at ``crates/tool_parser/src/factory.rs:353``.

Changes:
- ``e2e_test/infra/model_specs.py``: add a ``Qwen/Qwen3.5-27B`` spec entry (``tp: 1``, thinking + reasoning features) and point ``DEFAULT_ENABLE_THINKING_MODEL_PATH`` at it. The existing ``Qwen/Qwen3-30B-A3B`` entry stays for the nightly perf benchmark.
- ``e2e_test/chat_completions/test_enable_thinking.py``: update the ``@pytest.mark.model`` to ``Qwen/Qwen3.5-27B``.

Signed-off-by: key4ng <[email protected]>
After model load (~57 GB on 1×H100 80GB) the remaining ~16 GB of GPU
memory can't pre-allocate KV cache for the model's 256K native context.
vLLM fails the worker boot with:
ValueError: To serve at least one request with the models's max seq
len (262144), (16.17 GiB KV cache is needed, which is larger than
the available KV cache memory (16.0 GiB).
The TestEnableThinking test sends short single-turn chats ("Hello"),
so cap the context at 16K — same value the ``Qwen/Qwen2.5-14B-Instruct``
spec uses in this file. Apply across all engines that pre-allocate
KV cache by max-seq-len (sglang, vllm, tokenspeed); TRT-LLM allocates
dynamically and keeps the existing ``free_gpu_memory_fraction: 0.8``
knob.
Also pass ``--enforce-eager`` on the engines that respect it — the
hybrid Mamba (Gated DeltaNet) + attention layout makes CUDA-graph
capture finicky and is unnecessary for a smoke test.
Signed-off-by: key4ng <[email protected]>
Dense Qwen3 (``Qwen3ForCausalLM``) is in the bumped tokenspeed pin's model registry, where ``Qwen3MoeForCausalLM`` (``Qwen3-30B-A3B``) and ``Qwen3_5ForConditionalGeneration`` (``Qwen3.5-27B``, which also overflowed vLLM's CUDA-13 ``nixl_ep`` dependency) are not. Hybrid Qwen3 keeps the ``enable_thinking`` chat-template toggle the test needs and emits the same ``<think>...</think>`` markers the existing ``qwen3`` SMG reasoning parser handles.

4B fits trivially on the 1×H100 lane (~8 GB weights), so the engine-specific KV cap and ``--enforce-eager`` args the Qwen3.5-27B spec needed are no longer required.

Changes:
- ``e2e_test/infra/model_specs.py``: replace the ``Qwen/Qwen3.5-27B`` entry with ``Qwen/Qwen3-4B``; point ``DEFAULT_ENABLE_THINKING_MODEL_PATH`` at it. Keep the ``Qwen/Qwen3-30B-A3B`` entry for the nightly perf benchmark.
- ``e2e_test/chat_completions/test_enable_thinking.py``: update ``@pytest.mark.model`` to ``Qwen/Qwen3-4B``.

Signed-off-by: key4ng <[email protected]>
Force-pushed bb6e97e to 779445a.
Hi @key4ng, the DCO sign-off check has failed. All commits must include a `Signed-off-by` line. To fix existing commits:

```
# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease
```

To sign off future commits automatically, commit with `git commit -s`.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ab36713b8
```python
    tokenspeed_scheduler_pb2,
    tokenspeed_scheduler_pb2_grpc,
```
Bump the proto package version for TokenSpeed stubs
When the TokenSpeed servicer is installed from published packages rather than the CI editable checkout, these new tokenspeed_scheduler_pb2(_grpc) exports need a distinct smg-grpc-proto release to depend on; however crates/grpc_client/python/pyproject.toml still declares version 0.4.7, the same version that predates these stubs. Without bumping the proto package version, downstream installs cannot reliably require a build that contains these modules, so TokenSpeed startup can still fail at import time even if the servicer dependency is tightened.
Signed-off-by: key4ng <[email protected]>
Signed-off-by: zhyncs <[email protected]>
grpc_servicer/pyproject.toml

```bash
if [ "${RELEASE_PROTO}" = "true" ]; then
  # Matches both the core dep ("smg-grpc-proto>=0.4.6") and the
  # mlx extra ("smg-grpc-proto>=0.4.7").
```
🟡 Nit: always() also runs on workflow cancellation. If a user cancels after one build job has already succeeded, the release job will still execute and publish artifacts. Using !cancelled() instead preserves the intended behavior (run when dependencies are skipped) while correctly skipping on cancellation.
Suggested change:

```diff
 # mlx extra ("smg-grpc-proto>=0.4.7").
+!cancelled()
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 921a3d9f14
| @pytest.mark.engine("sglang", "vllm") | ||
| @pytest.mark.engine("sglang", "vllm", "tokenspeed") |
Add TokenSpeed to non-chat e2e matrices
When this marker is added, these tests are only selected if a CI job runs with E2E_ENGINE=tokenspeed. I checked .github/workflows/pr-test-rust.yml: TokenSpeed was added only to the chat matrices, while the completions job still lists only sglang/vllm and the router/responses jobs likewise do not run TokenSpeed. As a result, the newly marked TokenSpeed completions/router/responses coverage is never exercised in PR CI; add TokenSpeed to the corresponding matrices or drop the marks.
Hi @key4ng, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

```
git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease
```
Description
Problem
With the Rust router (#1351) and Python servicer (#1464) in place, TokenSpeed needs to be exercised in CI: source-installed at a pinned ref, launched as a worker through the e2e infrastructure, and covered by the function-calling and tool_choice suites.
Solution
A composite GitHub Action plus install script that builds and caches the TokenSpeed kernel + scheduler, a `tokenspeed` lane in the GPU e2e job, and `e2e_test/infra` updates that teach the suite about a `tokenspeed` runtime alongside sglang/vllm/trtllm.

3-PR Stack
This is part 3 of 3 splitting the original #1351:
- PR1 → main
- PR2 → feat/grpc-servicer-tokenspeed
- PR3 → feat/grpc-tokenspeed-servicer (= PR2)

Changes
- `.github/actions/setup-tokenspeed/action.yml` — composite action wrapping `scripts/ci_install_tokenspeed.sh`
- `scripts/ci_install_tokenspeed.sh` — source install of TokenSpeed kernel + scheduler at a pinned git ref, with a wheel cache so repeat runs skip the ~20 min compile
- `.github/workflows/e2e-gpu-job.yml` — add a `tokenspeed` engine lane, gated on secret access so forked PRs skip cleanly
- `.github/workflows/pr-test-rust.yml` — install proto deps so Rust-only changes touching `crates/grpc_client` still cover the new tokenspeed proto
- `e2e_test/infra/{constants,worker,model_specs}.py` and `e2e_test/fixtures/hooks.py` — teach the suite about a `tokenspeed` runtime
- `@pytest.mark.engine(...)` markers expand to include `tokenspeed`
- function-calling and tool_choice suites swap to `Qwen/Qwen3-4B-Instruct-2507` (the Qwen3 family is what TokenSpeed's model registry currently supports)

Test Plan
- `bash -n scripts/ci_install_tokenspeed.sh` passes
- `pre-commit run --files <changed>` passes (rustfmt / clippy / ruff / ruff format / codespell / yaml / toml all green)
- GPU e2e suites run with `Qwen/Qwen3-4B-Instruct-2507` against a live tokenspeed worker, plus the existing reference engines for parity

Review Threads from #1351
Addressed in this PR:
- `e2e_test/chat_completions/test_function_calling.py` — verbose Qwen3-4B docstring on `TestToolChoiceQwen` removed; that context lives in this PR description instead.
- `cargo +nightly fmt` passes (no Rust changes)
- `cargo clippy --all-targets --all-features -- -D warnings` passes (no Rust changes)

Summary by CodeRabbit
New Features
Testing
Chores