
feat(grpc): add TokenSpeed gRPC client and router wiring (Part 1/3) #1351

Open
yetone wants to merge 8 commits into main from feat/grpc-servicer-tokenspeed

Conversation

@yetone
Collaborator

@yetone yetone commented Apr 23, 2026

Description

Problem

SMG has no first-class TokenSpeed backend. Routing to a TokenSpeed worker requires the gateway to speak TokenSpeed's dedicated gRPC protocol, plumb it through the router's per-backend IR, and pick TokenSpeed up correctly during runtime detection.

Solution

This PR (Part 1 of 3) adds the Rust side of TokenSpeed support: a self-contained tokenspeed.grpc.scheduler.TokenSpeedScheduler proto, a TokenSpeedSchedulerClient that produces and consumes native tokenspeed_proto types end-to-end, dedicated ProtoGenerateRequest::TokenSpeed / ProtoGenerateStreamChunk::TokenSpeed / ProtoGenerateComplete::TokenSpeed IR arms in the router (same shape as the existing per-backend arms), and the RuntimeType::TokenSpeed variant.

PR1 is functionally inert without PR2 — the router can dial a TokenSpeed worker, but the worker process is in PR2's Python servicer.

3-PR Stack

Changes

  • New crates/grpc_client/proto/tokenspeed_scheduler.proto — fully self-contained wire definition (package tokenspeed.grpc.scheduler)
  • New crates/grpc_client/src/tokenspeed_scheduler.rs — TokenSpeedSchedulerClient returning native tokenspeed_proto::* types
  • crates/protocols/src/worker.rs — RuntimeType::TokenSpeed variant
  • model_gateway/src/routers/grpc/proto_wrapper.rs — dedicated TokenSpeed arms across the IR enums + per-arm accessors
  • model_gateway/src/routers/grpc/client.rs — dispatch routes the new variant through TokenSpeedSchedulerClient
  • model_gateway/src/workflow/steps/local/detect_backend.rs — health-check ordering extended with tokenspeed
  • Stage updates: harmony/regular request building, multimodal, embedding

Test Plan

  • cargo +nightly fmt --all -- --check passes
  • cargo clippy -p smg-grpc-client --all-targets --all-features -- -D warnings passes
  • cargo clippy -p smg --all-targets --all-features -- -D warnings passes
  • cargo check -p smg --bin smg succeeds against current main

End-to-end coverage runs in PR3 once both the Rust router and the Python servicer (PR2) are in place.

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds TokenSpeed as a supported runtime across tests, worker wiring, CI, gRPC servicer (launcher, server, health, servicer), model gateway clients/workflow, Rust/Cargo proto build, and new unit tests and packaging extras.

Changes

  • E2E Tests & Runtime — e2e_test/chat_completions/test_function_calling.py, e2e_test/infra/constants.py, e2e_test/infra/model_specs.py
    Introduce tokenspeed runtime, predicate helper, include in LOCAL_RUNTIMES/labels, update model specs, and gate/xfail tests for TokenSpeed limitations.
  • Worker Integration — e2e_test/infra/worker.py
    Add TokenSpeed branch that enforces gRPC mode and builds TokenSpeed server start command (supports tokenspeed_args).
  • Packaging & Pytest Config — grpc_servicer/pyproject.toml, grpc_servicer/tests/conftest.py
    Document TokenSpeed, add tokenspeed and test extras, pytest asyncio/testpaths/markers, and test conftest to adjust sys.path and register tokenspeed marker.
  • TokenSpeed Package Entrypoints — grpc_servicer/smg_grpc_servicer/tokenspeed/__init__.py, .../__main__.py
    Expose servicer API and add CLI entrypoint that prepares server args, sets uvloop policy, and runs serve_grpc.
  • Server, Launcher & Health — grpc_servicer/smg_grpc_servicer/tokenspeed/server.py, .../scheduler_launcher.py, .../health_servicer.py
    Add serve_grpc coroutine with warmup/polling, reflection, signal handling and shutdown; add launch_engine delegating to TokenSpeed entrypoint; add health servicer with dual-scope and stuck-scheduler checks.
  • TokenSpeed gRPC Servicer — grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
    Implement TokenSpeedSchedulerServicer (Generate/Embed/HealthCheck/Abort/GetModelInfo/GetServerInfo/GetLoads/GetTokenizer/SubscribeKvEvents) and related helpers and lifecycle handling.
  • Unit Tests — grpc_servicer/tests/test_tokenspeed_servicer.py, grpc_servicer/tests/test_tokenspeed_health_servicer.py
    Add comprehensive tests for health servicer, servicer RPCs, helpers, sampling-param mapping, streaming/cancellation, abort semantics, and tokenizer streaming.
  • CI / Workflows / Scripts — .github/actions/setup-tokenspeed/action.yml, .github/workflows/e2e-gpu-job.yml, .github/workflows/pr-test-rust.yml, scripts/ci_install_tokenspeed.sh
    Add composite action and CI install script for TokenSpeed, conditional workflow step and engine input, extend PR test matrix and path filters for TokenSpeed artifacts.
  • Protocols & Gateway Wiring — crates/protocols/src/worker.rs, model_gateway/src/routers/grpc/client.rs, model_gateway/src/routers/grpc/...
    Add tokenspeed runtime enum, TokenSpeed gRPC client module and routing, adapt request-building stages and multimodal/embedding paths to support TokenSpeed, mark PD-disallowed where applicable.
  • Rust gRPC Client & Proto — crates/grpc_client/build.rs, crates/grpc_client/proto/tokenspeed_scheduler.proto, crates/grpc_client/src/tokenspeed_scheduler.rs, crates/grpc_client/src/lib.rs, crates/grpc_client/python/smg_grpc_proto/__init__.py
    Add TokenSpeed proto, build.rs generation pass, Rust client wrapper, re-exports, and Python proto re-exports.
  • Internal refactors & helpers — crates/grpc_client/src/sglang_scheduler.rs, model_gateway/src/.../request_building.rs, model_gateway/src/.../classify.rs, model_gateway/src/.../detect_backend.rs, model_gateway/src/.../util.rs
    Refactor abort-on-drop to callback dispatcher, expose helpers, and extend gateway detection/building stages to handle TokenSpeed and adjust runtime probing order.

Sequence Diagram(s)

sequenceDiagram
    participant Client as gRPC Client
    participant Server as TokenSpeed gRPC Server
    participant Launcher as Scheduler Launcher
    participant AsyncLLM as AsyncLLM Engine
    participant Health as Health Servicer

    Client->>Server: Generate / Embed / HealthCheck
    Server->>Launcher: launch_engine(server_args)
    Launcher->>AsyncLLM: _launch_subprocesses()
    AsyncLLM-->>Launcher: AsyncLLM instance + scheduler_info
    Launcher-->>Server: (async_llm, scheduler_info)
    Server->>Health: instantiate TokenSpeedHealthServicer
    par Warmup
        Server->>AsyncLLM: probe (one-token Generate/Embed)
        AsyncLLM-->>Server: probe response
        Server->>Health: set_serving()
    end
    Server->>AsyncLLM: dispatch Generate/Embed request
    AsyncLLM-->>Server: stream/generate results
    Server-->>Client: stream/return RPC responses
    Client->>Server: Abort (or cancel)
    Server->>AsyncLLM: abort_request for rid(s)
    AsyncLLM-->>Server: acknowledge
    Server-->>Client: finalize/error

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

python-bindings

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237

Poem

🐰 I hopped to wire a faster stream,

Tokens raced down a new bright beam.
Health kept watch as warmups sang,
Scheduler woke and bells went clang.
gRPC now hums in TokenSpeed gleam.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning — Docstring coverage is 57.89%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Linked Issues check — ✅ Passed — Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed — Check skipped because no linked issues were found for this pull request.
  • Title check — ✅ Passed — The title 'feat(grpc): add TokenSpeed gRPC client and router wiring (Part 1/3)' accurately describes the main changes in this pull request—adding TokenSpeed gRPC support to the client and routing layers—and is part of a larger multi-part effort.
  • Description Check — ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for the TokenSpeed inference engine by implementing a gRPC servicer that mirrors the SGLang protocol, allowing the existing Rust router to interface with TokenSpeed workers. The changes encompass a new gRPC server implementation, health checks, subprocess management, and infrastructure updates for end-to-end testing. A critical issue was identified in the streaming generation logic where client disconnections during multi-sample requests (n > 1) fail to abort all derived backend requests, potentially leading to resource leaks.

except asyncio.CancelledError:
    # Client disconnected — tell the scheduler to drop the request.
    aborted = True
    self.async_llm.abort_request(rid)

Severity: high

When a client disconnects from a streaming Generate request with n > 1, not all underlying requests are being aborted on the backend. The asyncio.CancelledError exception handler calls self.async_llm.abort_request(rid), where rid is the original request.request_id. However, for requests with n > 1, the servicer creates multiple backend requests with derived request IDs (e.g., f"{rid}-n{i}"), which are stored in expanded_rid. The engine is only aware of these derived IDs. This can lead to orphaned requests continuing to run on the backend, consuming resources. The fix is to iterate over expanded_rid and abort each request individually to ensure all acquired resources are tracked and released consistently to prevent leaks.

Suggested change
-self.async_llm.abort_request(rid)
+rids_to_abort = (
+    list(expanded_rid)
+    if isinstance(expanded_rid, list)
+    else ([expanded_rid] if isinstance(expanded_rid, str) else [])
+)
+for r in rids_to_abort:
+    self.async_llm.abort_request(r)
References
  1. When managing resources that are acquired and need to be released, ensure that all acquired resources are tracked immediately and consistently to prevent resource leaks, especially when exceptions might occur during subsequent operations.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 34e647141e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +639 to +640
if params.temperature:
    out["temperature"] = params.temperature

P1: Preserve explicit zero temperature in sampling params

The truthy check drops temperature=0.0, so requests that explicitly ask for greedy/deterministic decoding lose that setting before reaching TokenSpeed. The Rust gRPC builder can emit 0.0 for user-specified temperature, but this branch treats it as "unset" and omits it, which can silently fall back to backend defaults and change generation behavior.
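A minimal sketch of the failure mode and a presence-aware alternative, using a plain dataclass as a stand-in for the generated proto message (with a real proto3 message, presence would instead be checked through an optional field and HasField):

from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeSamplingParams:
    # None stands in for "field not set"; a generated proto would use an
    # optional scalar plus HasField for the same distinction.
    temperature: Optional[float] = None


def map_truthy(params: FakeSamplingParams) -> dict:
    out = {}
    if params.temperature:  # buggy: 0.0 is falsy, so greedy decoding is dropped
        out["temperature"] = params.temperature
    return out


def map_presence(params: FakeSamplingParams) -> dict:
    out = {}
    if params.temperature is not None:  # presence check keeps an explicit 0.0
        out["temperature"] = params.temperature
    return out


assert map_truthy(FakeSamplingParams(temperature=0.0)) == {}
assert map_presence(FakeSamplingParams(temperature=0.0)) == {"temperature": 0.0}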


Comment on lines +201 to +205
    logger.warning("TokenSpeed warmup failed: %s", e)
finally:
    channel.close()

health_servicer.set_serving()

P1: Set health status to SERVING only after warmup succeeds

Warmup failures in Generate/Embed are logged but still followed by set_serving(), so readiness can report healthy even when the model path is not actually able to serve inference. This can route production traffic to a broken worker right after startup instead of keeping it out of rotation.
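A hedged sketch of the suggested ordering, not the actual server.py; set_not_serving is assumed here as a counterpart to the set_serving call shown above. The point is to gate readiness on the probe outcome so a broken worker stays out of rotation:

import logging

logger = logging.getLogger("tokenspeed.warmup")


async def warmup_then_mark_ready(probe, health_servicer) -> bool:
    """Run the one-token warmup probe and only advertise readiness if it succeeds."""
    try:
        await probe()  # e.g. the one-token Generate/Embed warmup call
    except Exception as e:
        logger.warning("TokenSpeed warmup failed: %s", e)
        health_servicer.set_not_serving()  # assumed API; keep the worker out of rotation
        return False
    health_servicer.set_serving()  # only now report SERVING to the router
    return True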


Comment on lines +198 to +202
except asyncio.CancelledError:
    # Client disconnected — tell the scheduler to drop the request.
    aborted = True
    self.async_llm.abort_request(rid)
    raise

P2: Abort all expanded request IDs on client cancellation

On cancellation, the handler aborts only the original request_id, but _build_generate_req expands rid into per-choice IDs when n>1. Since aborted=True also skips the finally cleanup path, those expanded child requests can continue running after the client disconnects, wasting scheduler/GPU work until they finish naturally.
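A sketch under the assumptions stated above (per-choice children named {rid}-n{i} and an engine exposing abort_request); it sweeps every expanded id on cancellation and, by not setting an aborted flag, leaves the finally cleanup path intact:

import asyncio


async def stream_with_abort_on_cancel(async_llm, rid, expanded_rid, chunks):
    try:
        async for chunk in chunks:
            yield chunk
    except asyncio.CancelledError:
        # Client disconnected: abort the original rid and every per-choice child.
        rids = expanded_rid if isinstance(expanded_rid, list) else [rid]
        for r in rids:
            async_llm.abort_request(r)
        raise
    finally:
        # Deliberately not short-circuited by an `aborted` flag, so the
        # defensive cleanup described above still runs.
        pass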



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
e2e_test/infra/constants.py (1)

57-57: ⚠️ Potential issue | 🟡 Minor

Stale ENV_RUNTIME comment.

The inline comment still enumerates only "sglang", "vllm", or "trtllm" even though tokenspeed is now a supported runtime (and openai, xai, etc. are also accepted via the enum). Consider updating to avoid drift.

-ENV_RUNTIME = "E2E_RUNTIME"  # Runtime for gRPC tests: "sglang", "vllm", or "trtllm"
+ENV_RUNTIME = "E2E_RUNTIME"  # Runtime for gRPC tests: "sglang", "vllm", "trtllm", or "tokenspeed"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/infra/constants.py` at line 57, Update the stale inline comment for
the ENV_RUNTIME constant: modify the comment on ENV_RUNTIME to list the
currently supported runtimes (e.g., include "tokenspeed" and other accepted
values such as "openai", "xai", etc.) or rephrase it to reference the
authoritative enum so it won't drift; locate ENV_RUNTIME in
e2e_test/infra/constants.py and replace the old enumeration string with the
updated list or a short note pointing to the enum.
e2e_test/infra/worker.py (1)

37-37: ⚠️ Potential issue | 🟡 Minor

Update engine docstring to include tokenspeed.

The field comment still lists only three engines even though tokenspeed is now dispatched in _build_cmd.

-    engine: str  # "sglang", "vllm", or "trtllm"
+    engine: str  # "sglang", "vllm", "trtllm", or "tokenspeed"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/infra/worker.py` at line 37, Update the inline comment for the
engine field to reflect all supported engine values; change the comment on the
'engine: str' field to include "tokenspeed" alongside "sglang", "vllm", and
"trtllm", and ensure any mention in _build_cmd's dispatch logic remains
consistent with that docstring so reviewers see the available options match the
actual dispatch cases in _build_cmd.
e2e_test/infra/model_specs.py (1)

55-86: ⚠️ Potential issue | 🔴 Critical

Duplicate dict key silently drops function_calling/tool_choice from Qwen/Qwen3-30B-A3B — breaks this PR's primary purpose.

"Qwen/Qwen3-30B-A3B" is defined twice in MODEL_SPECS (here at 55–59 and again at 80–86). Python dict literals keep the last value for duplicate keys, so the new entry with features: ["chat","streaming","function_calling","tool_choice"] is overwritten by the pre-existing entry whose features are ["chat","streaming","thinking","reasoning"].

Downstream impact:

  • get_models_with_feature("function_calling") and get_models_with_feature("tool_choice") will not include Qwen/Qwen3-30B-A3B.
  • TestOpenAIServerFunctionCalling (which this PR switches to Qwen/Qwen3-30B-A3B) will run against a spec that doesn't advertise the features it needs, and vllm_args=["--enforce-eager"] / trtllm_extra_config from the second entry remain in effect.

Merge the two entries into one so features union and the vLLM/TRT-LLM config is preserved:

🛠 Proposed fix: merge duplicate entries
-    # Qwen3 small chat model — used as a TokenSpeed-supported substitute
-    # while TokenSpeed's model registry does not cover LlamaForCausalLM.
-    "Qwen/Qwen3-4B": {
-        "model": _resolve_model_path("Qwen/Qwen3-4B"),
-        "tp": 1,
-        "features": ["chat", "streaming", "function_calling", "tool_choice"],
-    },
-    # Larger Qwen3 MoE — much stronger at function calling than 4B dense.
-    # Arch: Qwen3MoeForCausalLM (in TokenSpeed's registry); 30B total /
-    # 3B active, so inference speed stays close to the 4B variant.
-    "Qwen/Qwen3-30B-A3B": {
-        "model": _resolve_model_path("Qwen/Qwen3-30B-A3B"),
-        "tp": 1,
-        "features": ["chat", "streaming", "function_calling", "tool_choice"],
-    },
+    # Qwen3 small chat model — used as a TokenSpeed-supported substitute
+    # while TokenSpeed's model registry does not cover LlamaForCausalLM.
+    "Qwen/Qwen3-4B": {
+        "model": _resolve_model_path("Qwen/Qwen3-4B"),
+        "tp": 1,
+        "features": ["chat", "streaming", "function_calling", "tool_choice"],
+    },
     # Function calling specialist
@@
-    # Thinking/reasoning model (larger)
-    "Qwen/Qwen3-30B-A3B": {
-        "model": _resolve_model_path("Qwen/Qwen3-30B-A3B"),
-        "tp": 1,
-        "features": ["chat", "streaming", "thinking", "reasoning"],
-        "vllm_args": [] if _is_nightly else ["--enforce-eager"],
-        "trtllm_extra_config": {"kv_cache_config": {"free_gpu_memory_fraction": 0.8}},
-    },
+    # Thinking/reasoning model (larger) — also used as TokenSpeed fn-calling
+    # substitute while TokenSpeed's registry lacks LlamaForCausalLM.
+    # Arch: Qwen3MoeForCausalLM; 30B total / 3B active.
+    "Qwen/Qwen3-30B-A3B": {
+        "model": _resolve_model_path("Qwen/Qwen3-30B-A3B"),
+        "tp": 1,
+        "features": [
+            "chat", "streaming", "thinking", "reasoning",
+            "function_calling", "tool_choice",
+        ],
+        "vllm_args": [] if _is_nightly else ["--enforce-eager"],
+        "trtllm_extra_config": {"kv_cache_config": {"free_gpu_memory_fraction": 0.8}},
+    },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/infra/model_specs.py` around lines 55 - 86, The dict key
"Qwen/Qwen3-30B-A3B" is defined twice in MODEL_SPECS which causes the first
entry (with function_calling/tool_choice) to be overwritten; merge the two
entries into a single "Qwen/Qwen3-30B-A3B" entry in MODEL_SPECS by combining the
features (union of ["chat","streaming","function_calling","tool_choice"] and
["chat","streaming","thinking","reasoning"]) and preserving the vllm_args and
trtllm_extra_config from the second entry as well as the appropriate tp and
model path; remove the duplicate dict literal so
get_models_with_feature("function_calling") / ("tool_choice") will include
Qwen/Qwen3-30B-A3B and the vllm_args/trtllm_extra_config remain in effect.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e_test/chat_completions/test_function_calling.py`:
- Around line 103-115: This test was switched from Llama to Qwen3 but still
contains Llama-specific labels and will OOM on a single 80GB GPU; update the
test to either use a quantized Qwen3 variant or require more GPUs and rename
Llama-specific text: change the pytest model marker value (e.g.,
"Qwen/Qwen3-30B-A3B" → a Q4/quantized variant like "Qwen/Qwen3-30B-A3B-q4" or
change `@pytest.mark.gpu`(1) → `@pytest.mark.gpu`(4)), update the module/class
docstring that mentions "Llama tool parser" to "Qwen tool parser", replace
occurrences of LLAMA_SYSTEM_MESSAGE with a Qwen-appropriate system message
constant or remove it (references around the test helper and setup), and edit
comments that reference “Llama-3.2-1B is flaky” to refer to Qwen3 or remove;
locate these in the test file by symbols TestToolChoiceLlama,
LLAMA_SYSTEM_MESSAGE, and the top-level module docstring and adjust markers and
messages accordingly.

In `@e2e_test/infra/worker.py`:
- Around line 271-298: The tokenspeed gRPC command builder
(_build_tokenspeed_grpc_cmd) is fine, but the start_workers function's engine
docstring still lists only three engines; update that docstring to include the
new "tokenspeed" engine name so documentation matches the implemented backend
and keep symmetry with other builders (also check the "engine" mention near
start_workers for any enumerations or examples to include "tokenspeed").

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py`:
- Around line 34-38: The comment about priming env vars is misplaced: it refers
to the env/resource setup that actually happens in prepare_server_args(argv),
but it sits above asyncio.set_event_loop_policy() and
asyncio.run(serve_grpc(server_args)); either move the comment to directly above
the prepare_server_args(argv) call (or inside prepare_server_args) so it
documents the env priming, or remove the comment entirely if it's stale; adjust
text to reference prepare_server_args, asyncio.set_event_loop_policy, and
asyncio.run/serve_grpc to ensure clarity.
- Around line 24-38: The uvloop event loop policy is being set after grpc/aio
internals (via the module-level serve_grpc import) are already initialized; move
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) to the very start of
main() before any grpc-related initialization by invoking it before
prepare_server_args(argv) and before any use of serve_grpc/other grpc-related
code so the policy is applied prior to gRPC internals being
imported/initialized.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/server.py`:
- Around line 157-203: The code relies on an implicit invariant that `response`
is set when `connected` is True; to harden this, initialize `response` (e.g.,
response = None) before the GetModelInfo loop and/or capture the successful
model info into an explicit variable like `model_info_response` when you set
`connected = True`, then use that variable in the warmup path instead of the
possibly-unbound `response`; update references in the warmup block (the
is_generation check and any uses of response) to read from the explicit variable
so future refactors of the GetModelInfo loop can't leave `response` undefined.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 156-222: The CancelledError handler currently aborts only the
original proto rid and sets aborted=True, which prevents the finally block from
cleaning up per-choice expanded rids for n>1; change the asyncio.CancelledError
except block to (1) compute the same expanded rids from req_obj.rid (use the
existing logic that builds rids_to_check), (2) call
self.async_llm.abort_request(r) for each expanded rid, and (3) do NOT set
aborted=True so the finally block still executes its defensive cleanup; also add
an n>1 unit test in test_tokenspeed_servicer.py mirroring
test_cancel_calls_abort_request to assert abort_request is invoked for each
expanded per-choice rid.
- Around line 535-546: In SubscribeKvEvents the current return appears before
the required yield which is confusing; keep the abort call but ensure the
function is still recognized as an async generator by moving the yield into an
unreachable branch (e.g. use "if False: yield" or "if False: yield
common_pb2.KvEventBatch()") after the abort so the function retains an
async-generator signature while aborting via context.abort; update the
SubscribeKvEvents body to call context.abort(...) and then have an unreachable
yield branch to satisfy the async-generator contract.
- Around line 559-617: The code currently does obj.bootstrap_port =
p.bootstrap_port or None which treats proto3 default 0 as "unset" and loses the
ability to distinguish a real 0 value; update handling so presence is explicit:
if the DisaggregatedParams proto declares bootstrap_port as optional, check
p.HasField("bootstrap_port") and set obj.bootstrap_port = p.bootstrap_port when
present (else None); if you cannot change the proto, add documentation/comments
about the asymmetry or introduce an explicit presence flag (e.g.
bootstrap_port_present) and use that to decide whether to set
obj.bootstrap_port. Ensure you modify the logic around p.bootstrap_port and
obj.bootstrap_port (and related proto definition) rather than changing unrelated
code.
- Around line 619-688: The code in _sampling_params_from_proto drops explicit
zero values by using truthy checks (e.g., if params.min_new_tokens:) and
similarly the non-optional proto scalar logprob_start_len will always appear
"present" but may be mishandled; change these checks to use HasField for fields
that must distinguish "unset" from explicit zero (e.g., replace truthy guards
for min_new_tokens and logprob_start_len with params.HasField("min_new_tokens")
/ params.HasField("logprob_start_len") and forward the raw proto value), and if
the proto currently defines those scalars as non-optional, update the proto
definitions to make them optional so HasField works.

In `@grpc_servicer/tests/conftest.py`:
- Around line 1-22: Update the module docstring to accurately describe what this
file does: remove or reword the claim about "declares an asyncio-mode default"
since asyncio_mode is configured in pyproject.toml, and instead state that the
file adds the package root to sys.path and registers the pytest marker via
pytest_configure (the function registering the "tokenspeed" marker); keep the
rest of the file unchanged.

In `@grpc_servicer/tests/test_tokenspeed_health_servicer.py`:
- Line 7: The import SimpleNamespace is unused in the test file; remove the
unused import statement "from types import SimpleNamespace" from
grpc_servicer/tests/test_tokenspeed_health_servicer.py (locate the top-level
imports in that file) to clean up the module namespace and update any imports if
necessary, then run the test suite to ensure nothing else depended on it.

In `@grpc_servicer/tests/test_tokenspeed_servicer.py`:
- Around line 529-555: Add a new async test similar to
test_cancel_calls_abort_request but set the request sampling params to n>1 (use
_make_generate_request and adjust req.sampling_params.n = 3 or desired count),
keep fake_engine.generate_fn = never_finish, start servicer.Generate(req, ctx)
and run it via asyncio.create_task(_drain(gen)), cancel the task and await the
CancelledError, then assert that each expanded rid string (e.g., "rid-1-n0",
"rid-1-n1", ..., matching the count) is present in fake_engine.aborted_rids;
this targets the expanded-rid path in Generate and uses the same helpers
(_drain, _make_generate_request, fake_engine.aborted_rids) so the new test
follows the existing pattern.

---

Outside diff comments:
In `@e2e_test/infra/constants.py`:
- Line 57: Update the stale inline comment for the ENV_RUNTIME constant: modify
the comment on ENV_RUNTIME to list the currently supported runtimes (e.g.,
include "tokenspeed" and other accepted values such as "openai", "xai", etc.) or
rephrase it to reference the authoritative enum so it won't drift; locate
ENV_RUNTIME in e2e_test/infra/constants.py and replace the old enumeration
string with the updated list or a short note pointing to the enum.

In `@e2e_test/infra/model_specs.py`:
- Around line 55-86: The dict key "Qwen/Qwen3-30B-A3B" is defined twice in
MODEL_SPECS which causes the first entry (with function_calling/tool_choice) to
be overwritten; merge the two entries into a single "Qwen/Qwen3-30B-A3B" entry
in MODEL_SPECS by combining the features (union of
["chat","streaming","function_calling","tool_choice"] and
["chat","streaming","thinking","reasoning"]) and preserving the vllm_args and
trtllm_extra_config from the second entry as well as the appropriate tp and
model path; remove the duplicate dict literal so
get_models_with_feature("function_calling") / ("tool_choice") will include
Qwen/Qwen3-30B-A3B and the vllm_args/trtllm_extra_config remain in effect.

In `@e2e_test/infra/worker.py`:
- Line 37: Update the inline comment for the engine field to reflect all
supported engine values; change the comment on the 'engine: str' field to
include "tokenspeed" alongside "sglang", "vllm", and "trtllm", and ensure any
mention in _build_cmd's dispatch logic remains consistent with that docstring so
reviewers see the available options match the actual dispatch cases in
_build_cmd.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fc6bc210-4187-4c99-bc26-b7c2ce2eb33a

📥 Commits

Reviewing files that changed from the base of the PR and between f394f60 and 34e6471.

📒 Files selected for processing (15)
  • e2e_test/chat_completions/test_function_calling.py
  • e2e_test/infra/constants.py
  • e2e_test/infra/model_specs.py
  • e2e_test/infra/worker.py
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__init__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • grpc_servicer/tests/__init__.py
  • grpc_servicer/tests/conftest.py
  • grpc_servicer/tests/test_tokenspeed_health_servicer.py
  • grpc_servicer/tests/test_tokenspeed_servicer.py

Comment thread e2e_test/chat_completions/test_function_calling.py Outdated
Comment thread e2e_test/infra/worker.py Outdated
Comment on lines +24 to +38
def main(argv: list[str] | None = None) -> None:
    if argv is None:
        argv = sys.argv[1:]

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(name)s] %(levelname)s %(message)s",
    )

    server_args = prepare_server_args(argv)
    # The scheduler processes will read these env vars; make sure we ran
    # through TokenSpeed's shared env/resource setup path instead of
    # duplicating it here.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    asyncio.run(serve_grpc(server_args))


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check that prepare_server_args isn't eagerly creating a loop (sanity check on TokenSpeed's API shape).
rg -nP -C3 'def prepare_server_args'
rg -nP 'asyncio\.(new_event_loop|get_event_loop|set_event_loop)\b' -C2

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# First, let's find the file mentioned in the review comment
find . -type f -name "__main__.py" | head -20

Repository: lightseekorg/smg

Length of output: 154


🏁 Script executed:

# Search for prepare_server_args function definition across the codebase
rg "def prepare_server_args" -A 20

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# Check for asyncio event loop operations in the entire codebase
rg "asyncio\.(new_event_loop|get_event_loop|set_event_loop)" -B 2 -A 2

Repository: lightseekorg/smg

Length of output: 532


🏁 Script executed:

# Look for uvloop imports
rg "uvloop" -B 2 -A 2

Repository: lightseekorg/smg

Length of output: 1235


🏁 Script executed:

# Read the full __main__.py file to understand imports and structure
cat -n grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py

Repository: lightseekorg/smg

Length of output: 1508


🏁 Script executed:

# Search for prepare_server_args in the tokenspeed module
find . -path "*/tokenspeed/runtime/server_args*" -type f

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# Look for any asyncio.get_event_loop or asyncio.new_event_loop in tokenspeed modules
rg "asyncio\." grpc_servicer/smg_grpc_servicer/tokenspeed/ -A 2

Repository: lightseekorg/smg

Length of output: 1646


🏁 Script executed:

# Search for grpc imports in the codebase
rg "import.*grpc" grpc_servicer/smg_grpc_servicer/tokenspeed/ -B 2 -A 2

Repository: lightseekorg/smg

Length of output: 2361


🏁 Script executed:

# Check the serve_grpc implementation and its imports
cat -n grpc_servicer/smg_grpc_servicer/tokenspeed/server.py | head -50

Repository: lightseekorg/smg

Length of output: 2356


Move asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) to the beginning of main() before importing grpc.aio internals.

The serve_grpc function is imported at the module level (line 21), which indirectly imports the grpc module in server.py (line 23). This happens before asyncio.set_event_loop_policy() is called at line 37. The uvloop policy must be set before gRPC internals are initialized. Move the policy setting to the start of main(), before prepare_server_args() is called.
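One possible shape of that fix, sketched under the assumption that deferring the serve_grpc (and, here, prepare_server_args) imports into main() is acceptable; this is not the module's actual code:

import asyncio
import sys

import uvloop


def main(argv: list[str] | None = None) -> None:
    if argv is None:
        argv = sys.argv[1:]

    # Apply the uvloop policy before anything that imports grpc.aio.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

    # Lazy imports so gRPC internals initialize under the uvloop policy.
    from smg_grpc_servicer.tokenspeed.server import serve_grpc
    from tokenspeed.runtime.server_args import prepare_server_args

    asyncio.run(serve_grpc(prepare_server_args(argv)))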

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 24 - 38,
The uvloop event loop policy is being set after grpc/aio internals (via the
module-level serve_grpc import) are already initialized; move
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) to the very start of
main() before any grpc-related initialization by invoking it before
prepare_server_args(argv) and before any use of serve_grpc/other grpc-related
code so the policy is applied prior to gRPC internals being
imported/initialized.

Comment on lines +34 to +38
    # The scheduler processes will read these env vars; make sure we ran
    # through TokenSpeed's shared env/resource setup path instead of
    # duplicating it here.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    asyncio.run(serve_grpc(server_args))


⚠️ Potential issue | 🟡 Minor

Comment doesn't match the code below it.

The comment on lines 34-36 says "The scheduler processes will read these env vars; make sure we ran through TokenSpeed's shared env/resource setup path instead of duplicating it here" — but the immediately following lines set the uvloop event loop policy and call asyncio.run. No env vars are set or referenced here. The comment appears to be misplaced (or a leftover from an earlier iteration that set env vars between prepare_server_args and asyncio.run). Either move it above prepare_server_args(argv) where the env priming actually occurs (inside that call), or drop it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 34 - 38,
The comment about priming env vars is misplaced: it refers to the env/resource
setup that actually happens in prepare_server_args(argv), but it sits above
asyncio.set_event_loop_policy() and asyncio.run(serve_grpc(server_args)); either
move the comment to directly above the prepare_server_args(argv) call (or inside
prepare_server_args) so it documents the env priming, or remove the comment
entirely if it's stale; adjust text to reference prepare_server_args,
asyncio.set_event_loop_policy, and asyncio.run/serve_grpc to ensure clarity.

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/server.py Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated
Comment thread grpc_servicer/tests/conftest.py Outdated
Comment on lines +1 to +22
"""Pytest configuration for smg-grpc-servicer unit tests.

Adds the parent directory to ``sys.path`` so editable installs work
without needing ``pip install -e``, and declares an asyncio-mode default.
"""

from __future__ import annotations

import pathlib
import sys

import pytest

_HERE = pathlib.Path(__file__).resolve().parent
_PKG_ROOT = _HERE.parent

if str(_PKG_ROOT) not in sys.path:
    sys.path.insert(0, str(_PKG_ROOT))


def pytest_configure(config: pytest.Config) -> None:
    config.addinivalue_line("markers", "tokenspeed: tests that require TokenSpeed")


⚠️ Potential issue | 🟡 Minor

Minor: docstring mentions an asyncio-mode default that isn't declared in this file.

The module docstring says it "declares an asyncio-mode default," but asyncio_mode = "auto" is actually set in grpc_servicer/pyproject.toml under [tool.pytest.ini_options]. Only the tokenspeed marker is registered here. Consider tightening the docstring to reflect the current behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/tests/conftest.py` around lines 1 - 22, Update the module
docstring to accurately describe what this file does: remove or reword the claim
about "declares an asyncio-mode default" since asyncio_mode is configured in
pyproject.toml, and instead state that the file adds the package root to
sys.path and registers the pytest marker via pytest_configure (the function
registering the "tokenspeed" marker); keep the rest of the file unchanged.

Comment thread grpc_servicer/tests/test_tokenspeed_health_servicer.py Outdated
Comment thread grpc_servicer/tests/test_tokenspeed_servicer.py Outdated
@github-actions github-actions Bot added dependencies Dependency updates grpc gRPC client and router changes tests Test changes labels Apr 23, 2026
Comment on lines +251 to +276
try:
    embedding: list[float] | None = None
    prompt_tokens = 0
    async for output in self.async_llm.generate_request(obj):
        # EmbeddingReqInput is non-streaming: the loop yields exactly
        # one dict at finish, carrying the embedding vector.
        embedding = output.get("embedding")
        prompt_tokens = output.get("meta_info", {}).get("prompt_tokens", 0)

    if embedding is None:
        await context.abort(grpc.StatusCode.INTERNAL, "Empty embedding result")
        return

    return sglang_scheduler_pb2.EmbedResponse(
        embedding=list(embedding),
        prompt_tokens=prompt_tokens,
        embedding_dim=len(embedding),
    )
except grpc.aio.AbortError:
    raise
except ValueError as e:
    logger.warning("Embed invalid request %s: %s", rid, e)
    await context.abort(grpc.StatusCode.INVALID_ARGUMENT, str(e))
except Exception as e:
    logger.exception("Embed failed for request %s", rid)
    await context.abort(grpc.StatusCode.INTERNAL, str(e))


🟡 Nit: Embed is missing asyncio.CancelledError handling and finally cleanup, unlike Generate (lines 198–221). If a client disconnects or times out while the async for loop is awaiting the scheduler, CancelledError propagates unhandled and the rid leaks in rid_to_state. For embeddings this is unlikely to matter in practice (single forward pass, fast), but the inconsistency could bite once longer-running embedding models are onboarded.

Minimal fix — mirror the Generate pattern:

        try:
            ...
        except asyncio.CancelledError:
            self.async_llm.abort_request(rid)
            raise
        except grpc.aio.AbortError:
            raise
        ...
        finally:
            state = self.async_llm.rid_to_state.get(rid)
            if state is not None and not getattr(state, "finished", False):
                self.async_llm.abort_request(rid)

@yetone yetone force-pushed the feat/grpc-servicer-tokenspeed branch from 34e6471 to 9056d9e Compare April 23, 2026 15:26
Comment thread e2e_test/infra/model_specs.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9056d9ef25

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +363 to +366
known = rid in self.async_llm.rid_to_state
try:
    self.async_llm.abort_request(rid)
    return sglang_scheduler_pb2.AbortResponse(


P1: Abort all child RIDs for n>1 requests

When sampling_params.n > 1, _build_generate_req rewrites the request into per-choice IDs (<request_id>-n*), but Abort still checks and aborts only the original request_id. In that case known becomes false and the engine never receives aborts for the active child requests, so explicit aborts (for example router timeout/cleanup paths) report "not found" and generation continues consuming GPU work until natural completion.
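A hedged sketch of that sweep, assuming an AsyncLLM-like engine exposing rid_to_state and abort_request plus the "{rid}-n{i}" child naming described above; this is not the servicer's actual Abort body:

def abort_request_and_children(async_llm, request_id: str) -> bool:
    """Abort the original rid plus any per-choice children minted for n > 1."""
    matching = [
        rid
        for rid in list(async_llm.rid_to_state)
        if rid == request_id or rid.startswith(f"{request_id}-n")
    ]
    for rid in matching:
        async_llm.abort_request(rid)
    return bool(matching)  # mirrors the `known` flag reported in AbortResponse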


@yetone yetone force-pushed the feat/grpc-servicer-tokenspeed branch from 9056d9e to 7aca171 Compare April 23, 2026 16:28
@yetone yetone requested a review from gongwei-130 as a code owner April 23, 2026 16:28
@github-actions github-actions Bot added the ci CI/CD configuration changes label Apr 23, 2026
@yetone
Collaborator Author

yetone commented Apr 23, 2026

Review addressed — pushed 7aca1716 (force-push).

Fixed (from CodeRabbit review)

  • 🔴 model_specs.py duplicate key (lines 55-59 + 80-86). I had added a second Qwen/Qwen3-30B-A3B entry instead of merging, so Python kept the later (thinking/reasoning) one and silently dropped the function_calling + tool_choice features. This was also the python-lint / pre-commit CI failure (ruff F601). Now merged into the existing entry with the union of features.
  • 🟠 P1: Abort sweep for n>1 child rids. Previously Abort only hit the original request_id; for n>1 requests Generate minted {rid}-n{i} children that kept consuming GPU work on client-driven abort. Abort now sweeps the original and every {rid}-n* child.
  • P1: Cancel path also sweeps n>1 children. Same class of bug in the except asyncio.CancelledError branch in Generate — fixed alongside a new test_cancel_aborts_all_n_children unit test.
  • 🟡 Embed missing CancelledError / finally cleanup. Now mirrors Generate's cleanup pattern so a client disconnect during an embed call doesn't leak the rid in rid_to_state.
  • Stale docstrings (constants.py, worker.py, test_function_calling.py class-level comment/docstring, "Llama-3.2-1B is flaky" comment).
  • Unused SimpleNamespace import in test_tokenspeed_health_servicer.py.
  • Ruff UP038 issues (PEP 604 isinstance union style) in servicer.py.
  • Ruff format applied to all touched files.

Fixed (from e2e-1gpu-chat CI failures)

  • Model swap → OOM on 80 GB H100 CI: reviewer correctly predicted Qwen/Qwen3-30B-A3B (~66 GB bf16 weights alone) wouldn't fit. Switched TestOpenAIServerFunctionCalling to Qwen/Qwen3-4B (dense, still Qwen3ForCausalLM — TokenSpeed-supported). This also unbreaks the sglang / vllm / trtllm variants of the same test class that the 30B swap had inadvertently broken.
  • Kept Qwen/Qwen3-30B-A3B available in model_specs.py (for the thinking/reasoning test classes that already used it) — the duplicate was the only bug.

Added — TokenSpeed in GitHub CI

Missing from the first push (my bad). Wired up end-to-end:

  • scripts/ci_install_tokenspeed.sh — clones TokenSpeed, builds kernel (CUDA) + scheduler (C++/CMake/nanobind) + Python runtime, caches the built kernel wheel at /tmp/tokenspeed-wheel-cache/ to skip the ~30-min CUDA compile on subsequent runs. Mirrors the install-from-source + wheel-cache pattern already used for ci_install_trtllm.sh.
  • .github/actions/setup-tokenspeed/action.yml — composite action wiring the install script alongside the venv setup.
  • .github/workflows/e2e-gpu-job.yml — new Setup TokenSpeed backend step gated on inputs.engine == 'tokenspeed', plus a docstring update.
  • .github/workflows/pr-test-rust.yml:
    • e2e-1gpu-chat matrix: added tokenspeed (60-min timeout, matches TRT-LLM's pattern for build-from-source engines).
    • detect-changes paths-filter: added scripts/ci_install_tokenspeed.sh to common and chat-completions filters so edits to the install script re-trigger the test.

Servicer fallback dropped

Servicer / scheduler_launcher no longer dual-mode import. Target: only the post-Phase-D.3 AsyncLLM (standalone class, not the legacy TokenizerManager wrapper). This matches the e2e evidence from the prior asciinema (https://asciinema.org/a/3pl8PKYo92DnSA68) where AsyncLLM.__bases__ = (EngineClient, TokenizerCommunicatorMixin).

What I did NOT change (out of scope for this PR)

  • TestToolChoiceLlama still uses meta-llama/Llama-3.2-1B-Instruct + llama parser — baseline Llama regression coverage preserved for sglang/vllm/trtllm. This class is NOT marked to run on tokenspeed because TokenSpeed's registry doesn't include LlamaForCausalLM; flip that marker back in a follow-up once TokenSpeed registers it.
  • smg gateway's tool_choice="required" / tool_choice="specific" constraint translation — vLLM fails these same tests identically on the same model, so the failure is upstream of the engine and belongs in a separate smg-gateway PR.

async def generate_request(self, obj):
    # Record the request so tests can assert on what was forwarded.
    rid = getattr(obj, "rid", None) or "no-rid"
    self.rid_to_state[rid] = _FakeState()


🔴 Important: test_cancel_aborts_all_n_children will crash here with TypeError: unhashable type: 'list' before it ever reaches the cancel path.

When sampling_params.n = 3, _build_generate_req sets obj.rid = ["rid-1-n0", "rid-1-n1", "rid-1-n2"]. This line then tries to use that list as a dict key (self.rid_to_state[rid]), which is illegal. The resulting TypeError propagates into Generate's except Exception handler → context.abort(INTERNAL, ...) → the test gets _FakeAbortError instead of CancelledError, and the assertion fails.

Fix: handle list rids the same way the real AsyncLLM does — register each child rid individually:

Suggested change
-self.rid_to_state[rid] = _FakeState()
+rid = getattr(obj, "rid", None) or "no-rid"
+if isinstance(rid, list):
+    for r in rid:
+        self.rid_to_state[r] = _FakeState()
+else:
+    self.rid_to_state[rid] = _FakeState()

Comment thread scripts/ci_install_tokenspeed.sh Outdated
Comment on lines +83 to +89
# Cache the built wheel — uv stores wheels under its cache, copy out.
mkdir -p "$WHEEL_CACHE"
python3 -c "import tokenspeed_kernel, os, shutil, glob; \
d = os.path.dirname(tokenspeed_kernel.__file__); \
site = os.path.dirname(d); \
whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
print('kernel install dir:', whls)" || true


🟡 Nit: The wheel cache write is a no-op — this block creates $WHEEL_CACHE and prints the dist-info path, but never actually copies a .whl file into the cache directory. Every CI run will fall through to the 30-minute source build because find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" on line 75 will always come up empty.

uv pip install from a local path doesn't produce a standalone wheel in the site-packages. You'd need to build the wheel explicitly first (e.g. uv pip wheel tokenspeed-kernel/python/ -w "$WHEEL_CACHE" --no-build-isolation) and then install from the cached artifact, or use uv's own cache directory (uv cache dir) to locate the built wheel and copy it out.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7aca171673

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread scripts/ci_install_tokenspeed.sh Outdated
# Pin to a tested TokenSpeed ref so CI is reproducible. Bump explicitly when
# we want a newer runtime; keeping it pinned avoids surprise breakage when
# TokenSpeed main moves ahead of what our gRPC servicer was verified against.
TOKENSPEED_REF="${TOKENSPEED_REF:-main}"


P2: Pin TokenSpeed install to an immutable git ref

Defaulting TOKENSPEED_REF to main makes CI non-reproducible: the same SMG commit can build against different TokenSpeed revisions on different days, which can introduce unrelated breakages or behavior drift in the gRPC servicer tests. Because this script is used in PR and E2E setup, a moving default branch undermines bisectability and makes failures hard to attribute to SMG changes.


Comment thread scripts/ci_install_tokenspeed.sh Outdated
Comment on lines +89 to +95
print('kernel install dir:', whls)" || true
fi

# Step 4: scheduler (scikit-build-core + nanobind + CMake).
echo "Building tokenspeed-scheduler..."
uv pip install tokenspeed-scheduler/



P2: Write the built kernel wheel into the configured cache

The cache read path looks for tokenspeed_kernel-*.whl in WHEEL_CACHE, but after a source build this branch only prints install metadata and never copies a wheel into that directory. As a result, later runs miss the cache and rebuild the CUDA kernel every time, which materially increases CI runtime and can push the TokenSpeed lane into timeout territory.


@yetone yetone force-pushed the feat/grpc-servicer-tokenspeed branch from 7aca171 to d264a5b Compare April 23, 2026 16:38

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
e2e_test/chat_completions/test_function_calling.py (1)

25-37: ⚠️ Potential issue | 🟡 Minor

Remove unused LLAMA_SYSTEM_MESSAGE constant and stale comment.

LLAMA_SYSTEM_MESSAGE is defined at line 28 but never referenced anywhere in the codebase. The comment claiming it is "Used by TestToolChoiceLlama below" is inaccurate—TestToolChoiceLlama (extending _TestToolChoiceBase) uses get_test_messages() instead, which does not reference this constant. Delete the constant and its associated comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/chat_completions/test_function_calling.py` around lines 25 - 37,
Delete the unused LLAMA_SYSTEM_MESSAGE constant and its preceding stale comment
block from the test file; specifically remove the LLAMA_SYSTEM_MESSAGE
definition and the comment claiming it's "Used by TestToolChoiceLlama", and
verify there are no remaining references to LLAMA_SYSTEM_MESSAGE anywhere
(confirm TestToolChoiceLlama/_TestToolChoiceBase and get_test_messages() are
unchanged and do not rely on this constant).
♻️ Duplicate comments (4)
e2e_test/infra/worker.py (1)

441-442: ⚠️ Potential issue | 🟡 Minor

Docstring still enumerates only three engines.

start_workers docstring lists engines as "sglang", "vllm", "trtllm"; please include "tokenspeed" to match the _build_cmd dispatch and the Worker.engine field comment on line 37.

📝 Suggested fix
-        engine: Runtime engine ("sglang", "vllm", "trtllm").
+        engine: Runtime engine ("sglang", "vllm", "trtllm", or "tokenspeed").
                 If None, auto-detected from E2E_RUNTIME env var.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/infra/worker.py` around lines 441 - 442, The start_workers docstring
enumerates only "sglang", "vllm", "trtllm" but omits "tokenspeed", causing a
mismatch with the _build_cmd dispatch and the Worker.engine field; update the
start_workers docstring to include "tokenspeed" among the supported Runtime
engine options so it matches the _build_cmd function's cases and the
Worker.engine comment (see start_workers docstring, _build_cmd, Worker.engine).
grpc_servicer/tests/conftest.py (1)

1-5: ⚠️ Potential issue | 🟡 Minor

Docstring still references an asyncio-mode default not declared in this file.

asyncio_mode is configured in grpc_servicer/pyproject.toml under [tool.pytest.ini_options], not here. This file only prepends the package root to sys.path and registers the tokenspeed marker.

📝 Suggested fix
 """Pytest configuration for smg-grpc-servicer unit tests.

-Adds the parent directory to ``sys.path`` so editable installs work
-without needing ``pip install -e``, and declares an asyncio-mode default.
+Adds the parent directory to ``sys.path`` so tests work without an
+editable install, and registers custom pytest markers (e.g. ``tokenspeed``).
 """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/tests/conftest.py` around lines 1 - 5, The module docstring
incorrectly states it "declares an asyncio-mode default"; update the docstring
to accurately describe what this file does: mention it prepends the package root
to sys.path to support editable installs and registers the "tokenspeed" pytest
marker, and remove the claim about configuring asyncio_mode (which is set in
pyproject.toml under [tool.pytest.ini_options]). Keep the description concise
and consistent with the actual behavior (sys.path modification and marker
registration).
grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py (2)

33-38: ⚠️ Potential issue | 🟡 Minor

Move or remove the env/resource setup comment.

Lines 34-36 document the side effects of prepare_server_args(argv), but they sit above uvloop setup and asyncio.run(...). Put the comment directly above line 33 or drop it.

Proposed fix
-    server_args = prepare_server_args(argv)
     # The scheduler processes will read these env vars; make sure we ran
     # through TokenSpeed's shared env/resource setup path instead of
     # duplicating it here.
+    server_args = prepare_server_args(argv)
     asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 33 - 38,
The comment about env/resource setup belongs with prepare_server_args(argv) but
is currently above asyncio setup; move that comment so it immediately precedes
the prepare_server_args(argv) call (or remove it entirely if redundant) to avoid
implying it documents the uvloop/asyncio.run calls; update or relocate the
comment near the prepare_server_args symbol and leave
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) and
asyncio.run(serve_grpc(server_args)) untouched.

18-21: ⚠️ Potential issue | 🟠 Major

Set the uvloop policy before importing the gRPC server module.

serve_grpc is imported at module load, which imports server.py and grpc before line 37 applies the uvloop policy. Make serve_grpc a lazy import after asyncio.set_event_loop_policy(...); moving line 37 alone is not sufficient.

Proposed fix
 import uvloop
 from tokenspeed.runtime.server_args import prepare_server_args
 
-from smg_grpc_servicer.tokenspeed.server import serve_grpc
-
 
 def main(argv: list[str] | None = None) -> None:
     if argv is None:
         argv = sys.argv[1:]
@@
 
     server_args = prepare_server_args(argv)
-    # The scheduler processes will read these env vars; make sure we ran
-    # through TokenSpeed's shared env/resource setup path instead of
-    # duplicating it here.
     asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
+    from smg_grpc_servicer.tokenspeed.server import serve_grpc
+
     asyncio.run(serve_grpc(server_args))

Verification:

#!/bin/bash
# Expect: no module-level import of serve_grpc before set_event_loop_policy.
rg -nP 'from smg_grpc_servicer\.tokenspeed\.server import serve_grpc|asyncio\.set_event_loop_policy|import grpc' \
  grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py \
  grpc_servicer/smg_grpc_servicer/tokenspeed/server.py

Also applies to: 33-38

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 18 - 21,
The module currently imports serve_grpc from smg_grpc_servicer.tokenspeed.server
at import time which causes grpc (and its event loop behavior) to be imported
before the uvloop policy is set; change this to a lazy import by removing the
top-level "from smg_grpc_servicer.tokenspeed.server import serve_grpc" and
instead import serve_grpc inside the main entrypoint after calling
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) (ensure uvloop is set
via uvloop.install() or asyncio.set_event_loop_policy(...) first), so that
serve_grpc and any grpc imports occur only after the uvloop policy is applied;
verify no other module-level imports of serve_grpc or grpc remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e_test/infra/model_specs.py`:
- Around line 71-75: The comment above the "Qwen/Qwen3-30B-A3B" model spec is
stale and should be updated to reflect that TestOpenAIServerFunctionCalling now
routes through "Qwen/Qwen3-4B"; edit the comment near the "Qwen/Qwen3-30B-A3B"
entry in e2e_test/infra/model_specs.py to remove or replace the reference to
TokenSpeed using this 30B model for TestOpenAIServerFunctionCalling and instead
note that the test uses "Qwen/Qwen3-4B" so future maintainers won't be directed
to the OOM-prone 30B model.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/server.py`:
- Around line 162-205: The health flag is being set unconditionally; change the
flow so health_servicer.set_serving() is only called when the warmup actually
succeeded: for Generate ensure final is not None and final.HasField("complete"),
for Embed ensure resp.embedding_dim > 0; on failure (exceptions, missing
Complete, or empty embedding) log the error/warning and do not call
health_servicer.set_serving(). Keep the existing try/except/finally structure
and still close the channel in finally, but move the
health_servicer.set_serving() into the success branches (or gate it behind a
boolean like warmup_succeeded set true only on those success conditions).

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 386-404: The current child_prefix logic can match unrelated IDs
and iterates the live state map; fix by snapshotting the keys (e.g., keys =
list(self.async_llm.rid_to_state)) and computing targets as those equal to rid or
that start with child_prefix AND have only digits after the prefix (validate the
suffix with isdigit), then call self.async_llm.abort_request(r) for each target;
this ensures only true `{rid}-n<int>` children are aborted and avoids
concurrent-mutation issues when iterating state_map. A sketch of this
snapshot-and-filter approach follows at the end of this inline-comments list.

In `@grpc_servicer/tests/test_tokenspeed_servicer.py`:
- Around line 167-179: The test fails because generate_request treats obj.rid as
a hashable key while _build_generate_req may expand obj.rid into a list when
sampling_params.n > 1; update generate_request (and the corresponding later
block around the same logic) to detect if rid is a list/iterable and register a
_FakeState for each child rid in self.rid_to_state (e.g., iterate over obj.rid
when it's a list and assign self.rid_to_state[child_rid] = _FakeState()), yield
outputs the same way, and set each child's .finished = True at the end so
cancellation/never_finish logic sees the per-child started/finished states.
Ensure you reference generate_request and self.rid_to_state when making the
change.

In `@scripts/ci_install_tokenspeed.sh`:
- Around line 24-30: The script currently defaults TOKENSPEED_REF to "main",
which breaks reproducibility; change the logic in ci_install_tokenspeed.sh so
TOKENSPEED_REF is not defaulted to a moving branch but required: remove the
default assignment for TOKENSPEED_REF="${TOKENSPEED_REF:-main}" and add an
explicit validation that exits with a non‑zero status if TOKENSPEED_REF is empty
or equals "main", prompting the caller to supply a pinned commit SHA or tag;
update any surrounding comment to state that a pinned ref is required.
- Around line 74-90: The build branch never writes a wheel into WHEEL_CACHE so
subsequent CI runs rebuild; change the build/install flow so the kernel wheel is
created into the cache and then installed from that cached file: when building
tokenspeed-kernel (the branch that currently runs uv pip install
tokenspeed-kernel/python/ and the python3 inspection that prints kernel install
dir), run pip wheel --wheel-dir "$WHEEL_CACHE" (or otherwise copy the produced
wheel into $WHEEL_CACHE) so a tokenspeed_kernel-*.whl appears in $WHEEL_CACHE,
then set CACHED_KERNEL_WHEEL to that file and install it (uv pip install
"$CACHED_KERNEL_WHEEL" or pip install "$CACHED_KERNEL_WHEEL") instead of leaving
the artifact only in the uv cache; ensure the existing CACHED_KERNEL_WHEEL
lookup, MAX_JOBS, FLASHINFER_CUDA_ARCH_LIST and tokenspeed-kernel reference
remain consistent.

---

Outside diff comments:
In `@e2e_test/chat_completions/test_function_calling.py`:
- Around line 25-37: Delete the unused LLAMA_SYSTEM_MESSAGE constant and its
preceding stale comment block from the test file; specifically remove the
LLAMA_SYSTEM_MESSAGE definition and the comment claiming it's "Used by
TestToolChoiceLlama", and verify there are no remaining references to
LLAMA_SYSTEM_MESSAGE anywhere (confirm TestToolChoiceLlama/_TestToolChoiceBase
and get_test_messages() are unchanged and do not rely on this constant).

---

Duplicate comments:
In `@e2e_test/infra/worker.py`:
- Around line 441-442: The start_workers docstring enumerates only "sglang",
"vllm", "trtllm" but omits "tokenspeed", causing a mismatch with the _build_cmd
dispatch and the Worker.engine field; update the start_workers docstring to
include "tokenspeed" among the supported Runtime engine options so it matches
the _build_cmd function's cases and the Worker.engine comment (see start_workers
docstring, _build_cmd, Worker.engine).

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py`:
- Around line 33-38: The comment about env/resource setup belongs with
prepare_server_args(argv) but is currently above asyncio setup; move that
comment so it immediately precedes the prepare_server_args(argv) call (or remove
it entirely if redundant) to avoid implying it documents the uvloop/asyncio.run
calls; update or relocate the comment near the prepare_server_args symbol and
leave asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) and
asyncio.run(serve_grpc(server_args)) untouched.
- Around line 18-21: The module currently imports serve_grpc from
smg_grpc_servicer.tokenspeed.server at import time which causes grpc (and its
event loop behavior) to be imported before the uvloop policy is set; change this
to a lazy import by removing the top-level "from
smg_grpc_servicer.tokenspeed.server import serve_grpc" and instead import
serve_grpc inside the main entrypoint after calling
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) (ensure uvloop is set
via uvloop.install() or asyncio.set_event_loop_policy(...) first), so that
serve_grpc and any grpc imports occur only after the uvloop policy is applied;
verify no other module-level imports of serve_grpc or grpc remain.

In `@grpc_servicer/tests/conftest.py`:
- Around line 1-5: The module docstring incorrectly states it "declares an
asyncio-mode default"; update the docstring to accurately describe what this
file does: mention it prepends the package root to sys.path to support editable
installs and registers the "tokenspeed" pytest marker, and remove the claim
about configuring asyncio_mode (which is set in pyproject.toml under
[tool.pytest.ini_options]). Keep the description concise and consistent with the
actual behavior (sys.path modification and marker registration).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9bc812b8-569c-44c4-9d4e-7bad5a192d00

📥 Commits

Reviewing files that changed from the base of the PR and between 34e6471 and 7aca171.

📒 Files selected for processing (19)
  • .github/actions/setup-tokenspeed/action.yml
  • .github/workflows/e2e-gpu-job.yml
  • .github/workflows/pr-test-rust.yml
  • e2e_test/chat_completions/test_function_calling.py
  • e2e_test/infra/constants.py
  • e2e_test/infra/model_specs.py
  • e2e_test/infra/worker.py
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__init__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • grpc_servicer/tests/__init__.py
  • grpc_servicer/tests/conftest.py
  • grpc_servicer/tests/test_tokenspeed_health_servicer.py
  • grpc_servicer/tests/test_tokenspeed_servicer.py
  • scripts/ci_install_tokenspeed.sh

Comment thread e2e_test/infra/model_specs.py Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/server.py Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated
Comment on lines +167 to +179
async def generate_request(self, obj):
# Record the request so tests can assert on what was forwarded.
rid = getattr(obj, "rid", None) or "no-rid"
self.rid_to_state[rid] = _FakeState()
if self.generate_fn is not None:
async for out in self.generate_fn(obj):
self.last_receive_tstamp = 9999.0 # anything > tic
yield out
return
for out in self.outputs:
self.last_receive_tstamp = 9999.0
yield out
self.rid_to_state[rid].finished = True

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Expect: fake engine normalizes list rids before using them as dict keys.
sed -n '167,180p' grpc_servicer/tests/test_tokenspeed_servicer.py
sed -n '561,592p' grpc_servicer/tests/test_tokenspeed_servicer.py

Repository: lightseekorg/smg

Length of output: 1876


🏁 Script executed:

# Find _build_generate_req definition to confirm it expands rid to list
rg "_build_generate_req" grpc_servicer/tests/test_tokenspeed_servicer.py -A 20 -B 2

Repository: lightseekorg/smg

Length of output: 2304


🏁 Script executed:

# Check servicer.Generate to see how it calls fake_engine.generate_request
rg "class TokenSpeedSchedulerServicer" grpc_servicer -A 50 | head -80

Repository: lightseekorg/smg

Length of output: 4936


🏁 Script executed:

# Look for the actual Generate method implementation
fd "servicer.py" grpc_servicer --type f | xargs rg "def Generate" -A 30 | head -100

Repository: lightseekorg/smg

Length of output: 8735


🏁 Script executed:

# Find _build_generate_req implementation
rg "def _build_generate_req" grpc_servicer/tests/test_tokenspeed_servicer.py -A 50

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# Also check if there's _build_generate_req in the actual servicer
fd "servicer.py" grpc_servicer/smg_grpc_servicer/tokenspeed --type f -exec rg "def _build_generate_req" {} -A 50 \;

Repository: lightseekorg/smg

Length of output: 290


🏁 Script executed:

# Check if rid is ever used as dict key elsewhere in the test to understand how it's meant to work
rg "rid_to_state\[" grpc_servicer/tests/test_tokenspeed_servicer.py

Repository: lightseekorg/smg

Length of output: 439


🏁 Script executed:

# Search for _build_generate_req more carefully
rg "_build_generate_req" grpc_servicer/tests/test_tokenspeed_servicer.py -B 2 -A 30

Repository: lightseekorg/smg

Length of output: 2606


🏁 Script executed:

# Also search in the actual servicer implementation
fd "servicer.py" grpc_servicer/smg_grpc_servicer/tokenspeed --type f | head -1 | xargs cat | grep -n "_build_generate_req" -A 30

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# Look for where rid is expanded/created as a list
rg "rid.*\[.*n0" grpc_servicer/tests/test_tokenspeed_servicer.py

Repository: lightseekorg/smg

Length of output: 136


🏁 Script executed:

# Look for _build_generate_req in the actual servicer file
rg "def _build_generate_req" grpc_servicer/smg_grpc_servicer/tokenspeed/ -A 80

Repository: lightseekorg/smg

Length of output: 8350


🏁 Script executed:

# Check what the test says about how rid gets converted
rg "rid-1-n" grpc_servicer/tests/test_tokenspeed_servicer.py -B 5 -A 5

Repository: lightseekorg/smg

Length of output: 1300


Teach the fake engine to handle expanded rid lists.

When sampling_params.n > 1, _build_generate_req expands obj.rid to a list (e.g., ["rid-1-n0", "rid-1-n1", "rid-1-n2"]), but line 170 attempts to use it as a dict key. This raises TypeError: unhashable type: 'list' before never_finish() signals started, causing test_cancel_aborts_all_n_children to fail immediately instead of exercising cancellation logic.

Proposed fix
     async def generate_request(self, obj):
         # Record the request so tests can assert on what was forwarded.
         rid = getattr(obj, "rid", None) or "no-rid"
+        rids = rid if isinstance(rid, list) else [rid]
+        for one_rid in rids:
+            self.rid_to_state[one_rid] = _FakeState()
-        self.rid_to_state[rid] = _FakeState()
         if self.generate_fn is not None:
             async for out in self.generate_fn(obj):
                 self.last_receive_tstamp = 9999.0  # anything > tic
                 yield out
             return
         for out in self.outputs:
             self.last_receive_tstamp = 9999.0
             yield out
+        for one_rid in rids:
+            if one_rid in self.rid_to_state:
+                self.rid_to_state[one_rid].finished = True
-        self.rid_to_state[rid].finished = True

Also applies to: 561-592 (corresponding update needed at line 181 for consistency).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/tests/test_tokenspeed_servicer.py` around lines 167 - 179, The
test fails because generate_request treats obj.rid as a hashable key while
_build_generate_req may expand obj.rid into a list when sampling_params.n > 1;
update generate_request (and the corresponding later block around the same
logic) to detect if rid is a list/iterable and register a _FakeState for each
child rid in self.rid_to_state (e.g., iterate over obj.rid when it's a list and
assign self.rid_to_state[child_rid] = _FakeState()), yield outputs the same way,
and set each child's .finished = True at the end so cancellation/never_finish
logic sees the per-child started/finished states. Ensure you reference
generate_request and self.rid_to_state when making the change.

Comment thread scripts/ci_install_tokenspeed.sh Outdated
Comment on lines +24 to +30
# Pin to a tested TokenSpeed ref so CI is reproducible. Bump explicitly when
# we want a newer runtime; keeping it pinned avoids surprise breakage when
# TokenSpeed main moves ahead of what our gRPC servicer was verified against.
TOKENSPEED_REF="${TOKENSPEED_REF:-main}"
TOKENSPEED_REPO="${TOKENSPEED_REPO:-https://github.com/lightseekorg/tokenspeed.git}"
TOKENSPEED_DIR="${TOKENSPEED_DIR:-/tmp/tokenspeed-src}"
WHEEL_CACHE="${TOKENSPEED_WHEEL_CACHE:-/tmp/tokenspeed-wheel-cache}"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat scripts/ci_install_tokenspeed.sh

Repository: lightseekorg/smg

Length of output: 5294


Defaulting TOKENSPEED_REF to "main" contradicts the reproducibility goal.

Line 27 defaults to a moving branch despite the comment stating this is pinned for reproducibility. Upstream TokenSpeed changes can break CI or silently alter runtime behavior without any changes to this repository.

Require explicit pinning to a commit SHA or tag instead:

Proposed fix
-# Pin to a tested TokenSpeed ref so CI is reproducible. Bump explicitly when
-# we want a newer runtime; keeping it pinned avoids surprise breakage when
-# TokenSpeed main moves ahead of what our gRPC servicer was verified against.
-TOKENSPEED_REF="${TOKENSPEED_REF:-main}"
+# Pin to a tested TokenSpeed commit/tag so CI is reproducible. Bump explicitly
+# when we want a newer runtime.
+: "${TOKENSPEED_REF:?Set TOKENSPEED_REF to the tested TokenSpeed commit SHA or tag}"
+if [ "$TOKENSPEED_REF" = "main" ]; then
+    echo "TOKENSPEED_REF must be a pinned commit SHA or tag, not main" >&2
+    exit 1
+fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Pin to a tested TokenSpeed ref so CI is reproducible. Bump explicitly when
# we want a newer runtime; keeping it pinned avoids surprise breakage when
# TokenSpeed main moves ahead of what our gRPC servicer was verified against.
TOKENSPEED_REF="${TOKENSPEED_REF:-main}"
TOKENSPEED_REPO="${TOKENSPEED_REPO:-https://github.com/lightseekorg/tokenspeed.git}"
TOKENSPEED_DIR="${TOKENSPEED_DIR:-/tmp/tokenspeed-src}"
WHEEL_CACHE="${TOKENSPEED_WHEEL_CACHE:-/tmp/tokenspeed-wheel-cache}"
# Pin to a tested TokenSpeed commit/tag so CI is reproducible. Bump explicitly
# when we want a newer runtime.
: "${TOKENSPEED_REF:?Set TOKENSPEED_REF to the tested TokenSpeed commit SHA or tag}"
if [ "$TOKENSPEED_REF" = "main" ]; then
echo "TOKENSPEED_REF must be a pinned commit SHA or tag, not main" >&2
exit 1
fi
TOKENSPEED_REPO="${TOKENSPEED_REPO:-https://github.com/lightseekorg/tokenspeed.git}"
TOKENSPEED_DIR="${TOKENSPEED_DIR:-/tmp/tokenspeed-src}"
WHEEL_CACHE="${TOKENSPEED_WHEEL_CACHE:-/tmp/tokenspeed-wheel-cache}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/ci_install_tokenspeed.sh` around lines 24 - 30, The script currently
defaults TOKENSPEED_REF to "main", which breaks reproducibility; change the
logic in ci_install_tokenspeed.sh so TOKENSPEED_REF is not defaulted to a moving
branch but required: remove the default assignment for
TOKENSPEED_REF="${TOKENSPEED_REF:-main}" and add an explicit validation that
exits with a non‑zero status if TOKENSPEED_REF is empty or equals "main",
prompting the caller to supply a pinned commit SHA or tag; update any
surrounding comment to state that a pinned ref is required.

Comment thread scripts/ci_install_tokenspeed.sh Outdated
Comment on lines +74 to +90
# Step 3: kernel (CUDA compile — the expensive one). Try the cached wheel first.
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" 2>/dev/null | head -1 || true)
if [ -n "$CACHED_KERNEL_WHEEL" ] && [ -f "$CACHED_KERNEL_WHEEL" ]; then
echo "Installing cached tokenspeed-kernel wheel: $CACHED_KERNEL_WHEEL"
uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
else
echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
uv pip install tokenspeed-kernel/python/ --no-build-isolation
# Cache the built wheel — uv stores wheels under its cache, copy out.
mkdir -p "$WHEEL_CACHE"
python3 -c "import tokenspeed_kernel, os, shutil, glob; \
d = os.path.dirname(tokenspeed_kernel.__file__); \
site = os.path.dirname(d); \
whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
print('kernel install dir:', whls)" || true
fi

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Expect: the build branch creates/copies a tokenspeed_kernel wheel into WHEEL_CACHE.
rg -n 'WHEEL_CACHE|tokenspeed_kernel-\*\.whl|pip wheel|uv pip install tokenspeed-kernel' scripts/ci_install_tokenspeed.sh

Repository: lightseekorg/smg

Length of output: 420


🏁 Script executed:

sed -n '74,95p' scripts/ci_install_tokenspeed.sh

Repository: lightseekorg/smg

Length of output: 1226


Fix cache population: the build branch never writes wheels to $WHEEL_CACHE.

The Python script on lines 85–89 only prints the kernel install directory but doesn't copy the wheel into the cache. This means the first CI run builds from source and wastes the build; the second CI run rebuilds again (defeating the intended speedup).

Replace uv pip install with pip wheel --wheel-dir "$WHEEL_CACHE" to build the wheel directly into the cache, then install from the cached file:

Proposed fix
 else
     echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
-    MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
-        uv pip install tokenspeed-kernel/python/ --no-build-isolation
-    # Cache the built wheel — uv stores wheels under its cache, copy out.
     mkdir -p "$WHEEL_CACHE"
-    python3 -c "import tokenspeed_kernel, os, shutil, glob; \
-        d = os.path.dirname(tokenspeed_kernel.__file__); \
-        site = os.path.dirname(d); \
-        whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
-        print('kernel install dir:', whls)" || true
+    MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
+        python3 -m pip wheel tokenspeed-kernel/python/ \
+            --no-build-isolation \
+            --wheel-dir "$WHEEL_CACHE"
+    CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" | head -1)
+    uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Step 3: kernel (CUDA compile — the expensive one). Try the cached wheel first.
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" 2>/dev/null | head -1 || true)
if [ -n "$CACHED_KERNEL_WHEEL" ] && [ -f "$CACHED_KERNEL_WHEEL" ]; then
echo "Installing cached tokenspeed-kernel wheel: $CACHED_KERNEL_WHEEL"
uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
else
echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
uv pip install tokenspeed-kernel/python/ --no-build-isolation
# Cache the built wheel — uv stores wheels under its cache, copy out.
mkdir -p "$WHEEL_CACHE"
python3 -c "import tokenspeed_kernel, os, shutil, glob; \
d = os.path.dirname(tokenspeed_kernel.__file__); \
site = os.path.dirname(d); \
whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
print('kernel install dir:', whls)" || true
fi
# Step 3: kernel (CUDA compile — the expensive one). Try the cached wheel first.
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" 2>/dev/null | head -1 || true)
if [ -n "$CACHED_KERNEL_WHEEL" ] && [ -f "$CACHED_KERNEL_WHEEL" ]; then
echo "Installing cached tokenspeed-kernel wheel: $CACHED_KERNEL_WHEEL"
uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
else
echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
mkdir -p "$WHEEL_CACHE"
MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
python3 -m pip wheel tokenspeed-kernel/python/ \
--no-build-isolation \
--wheel-dir "$WHEEL_CACHE"
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" | head -1)
uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/ci_install_tokenspeed.sh` around lines 74 - 90, The build branch
never writes a wheel into WHEEL_CACHE so subsequent CI runs rebuild; change the
build/install flow so the kernel wheel is created into the cache and then
installed from that cached file: when building tokenspeed-kernel (the branch
that currently runs uv pip install tokenspeed-kernel/python/ and the python3
inspection that prints kernel install dir), run pip wheel --wheel-dir
"$WHEEL_CACHE" (or otherwise copy the produced wheel into $WHEEL_CACHE) so a
tokenspeed_kernel-*.whl appears in $WHEEL_CACHE, then set CACHED_KERNEL_WHEEL to
that file and install it (uv pip install "$CACHED_KERNEL_WHEEL" or pip install
"$CACHED_KERNEL_WHEEL") instead of leaving the artifact only in the uv cache;
ensure the existing CACHED_KERNEL_WHEEL lookup, MAX_JOBS,
FLASHINFER_CUDA_ARCH_LIST and tokenspeed-kernel reference remain consistent.

Comment thread e2e_test/infra/model_specs.py Outdated
Comment on lines +72 to +74
# Also used by TestOpenAIServerFunctionCalling as a TokenSpeed-supported
# substitute while TokenSpeed's model registry does not cover plain
# LlamaForCausalLM (only MoE / Eagle3 variants). Arch: Qwen3MoeForCausalLM.

🟡 Nit: This comment is stale — TestOpenAIServerFunctionCalling still targets meta-llama/Llama-3.2-1B-Instruct (line 118 of test_function_calling.py), and the new TokenSpeed subclass uses Qwen/Qwen3-4B, not Qwen3-30B-A3B. The comment describes an earlier design that was superseded when the tests switched to the 4B model to fit on a single 80GB GPU.

Suggested change
# Also used by TestOpenAIServerFunctionCalling as a TokenSpeed-supported
# substitute while TokenSpeed's model registry does not cover plain
# LlamaForCausalLM (only MoE / Eagle3 variants). Arch: Qwen3MoeForCausalLM.
# Thinking/reasoning model (larger). Arch: Qwen3MoeForCausalLM.
# function_calling / tool_choice features are listed for completeness —
# active TokenSpeed function-calling tests use Qwen3-4B (see
# TestOpenAIServerFunctionCallingTokenSpeed in test_function_calling.py).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d264a5b8fa

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread scripts/ci_install_tokenspeed.sh Outdated
# ── Clone TokenSpeed ────────────────────────────────────────────────────────
if [ ! -d "$TOKENSPEED_DIR" ]; then
echo "Cloning TokenSpeed ${TOKENSPEED_REF} from ${TOKENSPEED_REPO}..."
git clone --depth 1 --branch "$TOKENSPEED_REF" "$TOKENSPEED_REPO" "$TOKENSPEED_DIR"

P2: Allow TOKENSPEED_REF to be a commit SHA

git clone --branch accepts branch/tag names (see git clone -h: “checkout instead of the remote’s HEAD”), so setting TOKENSPEED_REF to an immutable commit SHA makes first-run installs fail with “Remote branch ... not found”. That blocks the pin-by-SHA workflow this installer is supposed to support on clean CI runners.

Comment thread scripts/ci_install_tokenspeed.sh Outdated
git clone --depth 1 --branch "$TOKENSPEED_REF" "$TOKENSPEED_REPO" "$TOKENSPEED_DIR"
else
echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
(cd "$TOKENSPEED_DIR" && git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF")

P2: Fast-forward reused clone to fetched TokenSpeed ref

In the reuse path, git fetch ... origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF" does not advance an existing local branch to the fetched remote tip. When TOKENSPEED_REF=main, warm runners can keep building an old local main while fresh runners build newer code, causing inconsistent CI behavior across nodes.

@yetone yetone force-pushed the feat/grpc-servicer-tokenspeed branch from d264a5b to be83fd2 Compare April 23, 2026 17:09

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be83fd282f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread scripts/ci_install_tokenspeed.sh Outdated
--no-build-isolation

# Step 3: kernel (CUDA compile — the expensive one). Try the cached wheel first.
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" 2>/dev/null | head -1 || true)

P2: Scope cached kernel wheel to current TokenSpeed ref

The cache lookup installs the first tokenspeed_kernel-*.whl found without checking TOKENSPEED_REF, so warm runners can reuse a wheel built from an older TokenSpeed revision while the scheduler/runtime code is checked out at a different ref. That mismatch can produce nondeterministic CI behavior (or runtime import/ABI failures) when the kernel package changes across refs; the cache keying needs to include the selected ref (and ideally interpreter/ABI) rather than using an unqualified glob.

Comment on lines +169 to +170
rid = getattr(obj, "rid", None) or "no-rid"
self.rid_to_state[rid] = _FakeState()

P2: Handle list RIDs in fake AsyncLLM state tracking

For n>1 requests the servicer sets obj.rid to a list of child IDs, but this test double stores rid directly as a dict key; when rid is a list, self.rid_to_state[rid] = ... raises TypeError: unhashable type: 'list'. That makes the n>1 cancellation path fail before assertions run, so the newly added abort-all-children behavior is not actually testable in this fixture.

@yetone
Collaborator Author

yetone commented Apr 23, 2026

Dug into the 4 remaining tool_choice="auto" failures — root-caused + fixed

Reproduction

Full-stack e2e on a single H100: tokenspeed gRPC worker + smg gateway + openai client against meta-llama/Llama-3.2-1B-Instruct. Setup matches CI (ci_install_tokenspeed.sh + TestOpenAIServerFunctionCalling).

Root cause

Instrumented the servicer to log what flows out:

[SRV-AFTER] completion=22 raw=27
first10=[128009, 128006, 78191, 128007, 271, 5018, 609, 794, 330, 723]
last5  =[65, 794, 330, 20, 32075]
+ trailing 128008 (<|eom_id|>)

TokenSpeed's AsyncLLM prepends the Llama-3 assistant-role header tokens — [<|eot_id|>, <|start_header_id|>, "assistant", <|end_header_id|>, "\n\n"] — to every output_ids, plus includes the matched stop token. When the smg gateway detokenizes with skip_special_tokens=True, the 128xxx control tokens get stripped, but "assistant" (78191) and "\n\n" (271) are regular word tokens so they survive. The decoded text arriving at the llama parser looks like:

assistant\n\n{"name": "add", "parameters": {"a": "3", "b": "5"}}

The llama parser's parse_complete does text.trim_start().starts_with('{') which returns false because of the assistant\n\n prefix, falls through to "no JSON structure found", and emits an empty tool-call list.

Why this didn't hit Qwen3-30B: the qwen parser uses a regex to extract <tool_call>...</tool_call> tags and is tolerant of both leading prefix and trailing special tokens.

Why this didn't hit sglang/vllm/trtllm on the same model: their respective gRPC servicers/backends emit token_ids containing only the model's generated content — no header echo, no stop token included. SGLang's on-wire shape is the contract the gateway was built for; this PR's tokenspeed servicer was forwarding output_ids verbatim, inheriting TokenSpeed-AsyncLLM-internal details.

Fix

Added _generated_output_ids(output, reason_dict) to the tokenspeed servicer. It:

  1. Slices raw[-meta.completion_tokens:] to drop the header prefix.
  2. If the finish reason is "stop" and the trailing id matches reason.matched, drops that too.

matched_token_id still rides in the separate proto field. meta_info.completion_tokens passes through unchanged (SGLang's convention — it reports generation accounting, not wire-payload length).
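
For reviewers who do not want to open the diff, here is a minimal sketch of that
trimming. It assumes dict-style access to output_ids, meta_info.completion_tokens,
and a finish-reason dict with type / matched fields; the real servicer's types may
differ.

# Sketch only: field access and shapes are assumptions, but the two steps are
# the ones described above.
def _generated_output_ids(output, reason_dict):
    raw = list(output["output_ids"])
    completion = output["meta_info"]["completion_tokens"]
    # Step 1: keep only the last `completion_tokens` ids, dropping the
    # assistant-role header echo that AsyncLLM prepends.
    ids = raw[-completion:] if completion else []
    # Step 2: if generation finished on a stop match and it is the trailing id,
    # drop it from the wire payload; it still travels as matched_token_id.
    if reason_dict and reason_dict.get("type") == "stop":
        matched = reason_dict.get("matched")
        if isinstance(matched, int) and ids and ids[-1] == matched:
            ids = ids[:-1]
    return ids

Under these assumptions, the instrumented 27-token payload above (completion=22,
trailing <|eom_id|>) reduces to 21 content tokens on the wire.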

Updated unit tests:

  • test_non_streaming_emits_complete now asserts the stripped wire shape ([10, 11] with matched_token_id=12).
  • New test_strips_chat_template_prefix — reproduces the real 27-token payload shape and asserts the servicer forwards only the 21 content tokens.

Result

TestOpenAIServerFunctionCalling against unsloth/Llama-3.2-1B-Instruct + tokenspeed (ungated mirror of the official repo; identical weights; local test environment had no HF gated-repo token):

Run          passed  failed  skipped
Before fix        6      10        2
After fix        10       6        2

The 6 remaining failures come from the pre-existing smg-gateway tool_choice=required/specific constraint-translation bug (the same failure set appears on vLLM for the same model, so it is unrelated to this PR). See the earlier reference runs in the PR description.

Commit: 7718843

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
e2e_test/infra/constants.py (1)

67-72: ⚠️ Potential issue | 🟡 Minor

Stale docstring in get_runtime.

The docstring still says "sglang or vllm" but the function now returns any of sglang, vllm, trtllm, tokenspeed. Update for clarity.

🔧 Proposed fix
 def get_runtime() -> str:
-    """Get the current test runtime (sglang or vllm).
+    """Get the current test runtime (sglang, vllm, trtllm, or tokenspeed).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/infra/constants.py` around lines 67 - 72, Update the stale docstring
for get_runtime to list the current allowed runtime values (sglang, vllm,
trtllm, tokenspeed) and clarify that it reads the E2E_RUNTIME environment
variable with a default of "sglang"; modify the docstring text in the
get_runtime function to reflect these exact runtime names and the default
behavior so the documentation matches the implementation.
♻️ Duplicate comments (1)
grpc_servicer/tests/test_tokenspeed_servicer.py (1)

167-179: ⚠️ Potential issue | 🔴 Critical

Fake engine still breaks test_cancel_aborts_all_n_children for list rids.

When sampling_params.n > 1, _build_generate_req sets obj.rid = ["rid-1-n0", "rid-1-n1", "rid-1-n2"]. The line self.rid_to_state[rid] = _FakeState() then raises TypeError: unhashable type: 'list' before never_finish() runs, so started.set() is never called and the newly-added test_cancel_aborts_all_n_children will hang on await started.wait() (or fail via the servicer's outer except Exception branch converting the TypeError into an INVALID/INTERNAL abort, not a CancelledError).

Please teach the fake engine to iterate list rids so the n>1 cancel test actually exercises the expanded-rid path it's meant to guard.

🛠 Proposed fix
     async def generate_request(self, obj):
         # Record the request so tests can assert on what was forwarded.
         rid = getattr(obj, "rid", None) or "no-rid"
-        self.rid_to_state[rid] = _FakeState()
+        rids = rid if isinstance(rid, list) else [rid]
+        for one_rid in rids:
+            self.rid_to_state[one_rid] = _FakeState()
         if self.generate_fn is not None:
             async for out in self.generate_fn(obj):
                 self.last_receive_tstamp = 9999.0  # anything > tic
                 yield out
             return
         for out in self.outputs:
             self.last_receive_tstamp = 9999.0
             yield out
-        self.rid_to_state[rid].finished = True
+        for one_rid in rids:
+            if one_rid in self.rid_to_state:
+                self.rid_to_state[one_rid].finished = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/tests/test_tokenspeed_servicer.py` around lines 167 - 179, The
fake engine's generate_request currently treats obj.rid as a single hashable key
and does self.rid_to_state[rid] = _FakeState(), which fails when
_build_generate_req sets obj.rid to a list for sampling_params.n > 1; update
generate_request to detect if rid is a list/iterable and iterate over each
element, creating an entry self.rid_to_state[child_rid] = _FakeState() for each
child_rid (and mark each .finished appropriately) so
started.set()/never_finish() can run and the test_cancel_aborts_all_n_children
path exercises the expanded-rid behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py`:
- Around line 57-67: The set_serving and set_not_serving methods mutate the
shared _serving_status dict from a non-event-loop thread while Check() reads it
on the asyncio loop; make this thread-safe by marshaling updates onto the event
loop (use the servicer's asyncio loop and call loop.call_soon_threadsafe to
schedule the dict assignments and logger.info), or alternatively protect
_serving_status with a threading.Lock and use it in set_serving, set_not_serving
and Check; locate the methods set_serving, set_not_serving and the Check method
to implement the chosen approach and ensure the warmup thread in server.py
(where set_serving is invoked) remains unchanged.
- Around line 77-78: The Check handler currently returns NOT_SERVING whenever
self.async_llm.gracefully_exit is true, which masks SERVICE_UNKNOWN for
unrecognized service names; in the health servicer's Check method (or the class
containing it), move the unknown-service validation logic (the branch that
returns SERVICE_UNKNOWN / NOT_FOUND for bogus service names) to run before the
gracefully_exit check so that unknown services return SERVICE_UNKNOWN even
during shutdown, preserving parity with the SGLang behavior.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py`:
- Around line 35-53: The docstring Raises section is inaccurate: update the
Raises text in the launch_engine wrapper to state that RuntimeError is raised
when async_llm is None (i.e., the current node is not rank 0 in a multi-node
deployment or scheduler died during init) and separately note that actual rank-0
scheduler initialization failures are propagated by _launch_subprocesses (not
wrapped by this RuntimeError); reference _launch_subprocesses, async_llm, and
the RuntimeError guard in the body when editing the docstring.

In `@grpc_servicer/tests/test_tokenspeed_health_servicer.py`:
- Around line 12-15: Remove the redundant noqa comments on the top-level import
lines: delete the "# noqa: E402" trailing each import of health_pb2 and
TokenSpeedHealthServicer in the test module so the imports read normally (refer
to the import statements importing grpc_health.v1.health_pb2 and
smg_grpc_servicer.tokenspeed.health_servicer.TokenSpeedHealthServicer).

In `@scripts/ci_install_tokenspeed.sh`:
- Around line 60-63: The reuse branch fetch/checkout path can leave the working
tree on a stale local branch; after fetching origin "$TOKENSPEED_REF" in the
TOKENSPEED_DIR reuse block, explicitly reset the working tree to the fetched ref
(e.g., run a hard reset to FETCH_HEAD) so the clone always matches the remote
ref; update the commands in the reuse branch that reference TOKENSPEED_DIR and
TOKENSPEED_REF to perform git fetch --depth 1 origin "$TOKENSPEED_REF" followed
by a git reset --hard FETCH_HEAD (and optionally git clean -fd if you need to
remove untracked files) to ensure the working tree is exactly the fetched
commit.

---

Outside diff comments:
In `@e2e_test/infra/constants.py`:
- Around line 67-72: Update the stale docstring for get_runtime to list the
current allowed runtime values (sglang, vllm, trtllm, tokenspeed) and clarify
that it reads the E2E_RUNTIME environment variable with a default of "sglang";
modify the docstring text in the get_runtime function to reflect these exact
runtime names and the default behavior so the documentation matches the
implementation.

---

Duplicate comments:
In `@grpc_servicer/tests/test_tokenspeed_servicer.py`:
- Around line 167-179: The fake engine's generate_request currently treats
obj.rid as a single hashable key and does self.rid_to_state[rid] = _FakeState(),
which fails when _build_generate_req sets obj.rid to a list for
sampling_params.n > 1; update generate_request to detect if rid is a
list/iterable and iterate over each element, creating an entry
self.rid_to_state[child_rid] = _FakeState() for each child_rid (and mark each
.finished appropriately) so started.set()/never_finish() can run and the
test_cancel_aborts_all_n_children path exercises the expanded-rid behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8c93b593-996b-44c3-8529-c06dbfd6418f

📥 Commits

Reviewing files that changed from the base of the PR and between 7aca171 and 7718843.

📒 Files selected for processing (19)
  • .github/actions/setup-tokenspeed/action.yml
  • .github/workflows/e2e-gpu-job.yml
  • .github/workflows/pr-test-rust.yml
  • e2e_test/chat_completions/test_function_calling.py
  • e2e_test/infra/constants.py
  • e2e_test/infra/model_specs.py
  • e2e_test/infra/worker.py
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__init__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • grpc_servicer/tests/__init__.py
  • grpc_servicer/tests/conftest.py
  • grpc_servicer/tests/test_tokenspeed_health_servicer.py
  • grpc_servicer/tests/test_tokenspeed_servicer.py
  • scripts/ci_install_tokenspeed.sh

Comment on lines +57 to +67
def set_serving(self) -> None:
"""Flip both services to SERVING (call after successful warmup)."""
self._serving_status[self.OVERALL_SERVER] = health_pb2.HealthCheckResponse.SERVING
self._serving_status[self.SGLANG_SERVICE] = health_pb2.HealthCheckResponse.SERVING
logger.info("TokenSpeed gRPC health status -> SERVING")

def set_not_serving(self) -> None:
"""Flip both services to NOT_SERVING (call on shutdown)."""
self._serving_status[self.OVERALL_SERVER] = health_pb2.HealthCheckResponse.NOT_SERVING
self._serving_status[self.SGLANG_SERVICE] = health_pb2.HealthCheckResponse.NOT_SERVING
logger.info("TokenSpeed gRPC health status -> NOT_SERVING")

🧹 Nitpick | 🔵 Trivial

set_serving/set_not_serving are called from a non-loop thread.

server.py invokes health_servicer.set_serving() from the warmup daemon thread (line 205 there), while Check() reads _serving_status on the asyncio event loop. CPython dict item assignment is GIL-atomic, so this is practically safe today, but it could turn into a subtle, reader-visible bug if the implementation grows (e.g., multi-field transitions, logging ordering). Consider either documenting the "called from any thread" contract here or marshaling via loop.call_soon_threadsafe.
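
A minimal sketch of the call_soon_threadsafe option, reusing the names from the snippet above; the self._loop attribute (the loop serve_grpc runs on, captured at construction) is an assumption, not something the current class shows.

def set_serving(self) -> None:
    """Flip both services to SERVING; safe to call from any thread."""

    def _apply() -> None:
        self._serving_status[self.OVERALL_SERVER] = health_pb2.HealthCheckResponse.SERVING
        self._serving_status[self.SGLANG_SERVICE] = health_pb2.HealthCheckResponse.SERVING
        logger.info("TokenSpeed gRPC health status -> SERVING")

    # set_serving() is called from the warmup daemon thread; hop onto the
    # asyncio loop (self._loop is an assumed attribute) so the status flip and
    # the log line are ordered with respect to Check() reads.
    self._loop.call_soon_threadsafe(_apply)

set_not_serving() would marshal the same way; the alternative is simply documenting that callers may invoke these from any thread and relying on the GIL-atomic dict writes.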

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py` around lines
57 - 67, The set_serving and set_not_serving methods mutate the shared
_serving_status dict from a non-event-loop thread while Check() reads it on the
asyncio loop; make this thread-safe by marshaling updates onto the event loop
(use the servicer's asyncio loop and call loop.call_soon_threadsafe to schedule
the dict assignments and logger.info), or alternatively protect _serving_status
with a threading.Lock and use it in set_serving, set_not_serving and Check;
locate the methods set_serving, set_not_serving and the Check method to
implement the chosen approach and ensure the warmup thread in server.py (where
set_serving is invoked) remains unchanged.

Comment on lines +77 to +78
if self.async_llm.gracefully_exit:
return health_pb2.HealthCheckResponse(status=health_pb2.HealthCheckResponse.NOT_SERVING)

⚠️ Potential issue | 🟡 Minor

gracefully_exit short-circuit returns NOT_SERVING for unknown services.

During shutdown, a Check for a bogus service name will return NOT_SERVING rather than SERVICE_UNKNOWN / NOT_FOUND. Minor — most clients won't observe it, and the response is still "don't route here" — but if parity with the SGLang health servicer matters, consider reordering the unknown-service check before the gracefully_exit gate.
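
A sketch of that reordering, assuming _serving_status is keyed by the two registered service names (so membership doubles as the unknown-service test) and the rest of Check() stays as-is; whether the real code returns SERVICE_UNKNOWN or aborts with NOT_FOUND, only the ordering matters here.

async def Check(self, request, context):
    # Validate the service name first, so bogus names still get
    # SERVICE_UNKNOWN even while the engine is shutting down.
    # (Membership in _serving_status as the known-service test is an assumption.)
    if request.service not in self._serving_status:
        return health_pb2.HealthCheckResponse(
            status=health_pb2.HealthCheckResponse.SERVICE_UNKNOWN
        )
    # Only then consult the shutdown flag.
    if self.async_llm.gracefully_exit:
        return health_pb2.HealthCheckResponse(
            status=health_pb2.HealthCheckResponse.NOT_SERVING
        )
    return health_pb2.HealthCheckResponse(status=self._serving_status[request.service])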

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py` around lines
77 - 78, The Check handler currently returns NOT_SERVING whenever
self.async_llm.gracefully_exit is true, which masks SERVICE_UNKNOWN for
unrecognized service names; in the health servicer's Check method (or the class
containing it), move the unknown-service validation logic (the branch that
returns SERVICE_UNKNOWN / NOT_FOUND for bogus service names) to run before the
gracefully_exit check so that unknown services return SERVICE_UNKNOWN even
during shutdown, preserving parity with the SGLang behavior.

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py Outdated
Comment on lines +12 to +15
from grpc_health.v1 import health_pb2 # noqa: E402
from smg_grpc_servicer.tokenspeed.health_servicer import ( # noqa: E402
TokenSpeedHealthServicer,
)

🧹 Nitpick | 🔵 Trivial

Unnecessary # noqa: E402.

These imports are at the top of the module (no preceding non-import code), so E402 won't fire. The noqa directives are redundant.

🔧 Proposed fix
-from grpc_health.v1 import health_pb2  # noqa: E402
-from smg_grpc_servicer.tokenspeed.health_servicer import (  # noqa: E402
+from grpc_health.v1 import health_pb2
+from smg_grpc_servicer.tokenspeed.health_servicer import (
     TokenSpeedHealthServicer,
 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from grpc_health.v1 import health_pb2 # noqa: E402
from smg_grpc_servicer.tokenspeed.health_servicer import ( # noqa: E402
TokenSpeedHealthServicer,
)
from grpc_health.v1 import health_pb2
from smg_grpc_servicer.tokenspeed.health_servicer import (
TokenSpeedHealthServicer,
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/tests/test_tokenspeed_health_servicer.py` around lines 12 - 15,
Remove the redundant noqa comments on the top-level import lines: delete the "#
noqa: E402" trailing each import of health_pb2 and TokenSpeedHealthServicer in
the test module so the imports read normally (refer to the import statements
importing grpc_health.v1.health_pb2 and
smg_grpc_servicer.tokenspeed.health_servicer.TokenSpeedHealthServicer).

Comment thread scripts/ci_install_tokenspeed.sh Outdated
Comment on lines +60 to +63
else
echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
(cd "$TOKENSPEED_DIR" && git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF")
fi

⚠️ Potential issue | 🟠 Major

Reuse path does not advance to the latest ref.

When the clone directory already exists, git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF" leaves the working tree on the previously checked-out commit if the local branch already exists and diverges from the remote — checkout won't fast-forward. On a long-lived self-hosted runner, CI can silently build a stale TokenSpeed. Reset explicitly to FETCH_HEAD:

🔧 Proposed fix
-else
-    echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
-    (cd "$TOKENSPEED_DIR" && git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF")
-fi
+else
+    echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
+    (cd "$TOKENSPEED_DIR" \
+        && git fetch --depth 1 origin "$TOKENSPEED_REF" \
+        && git checkout -B "$TOKENSPEED_REF" FETCH_HEAD)
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/ci_install_tokenspeed.sh` around lines 60 - 63, The reuse branch
fetch/checkout path can leave the working tree on a stale local branch; after
fetching origin "$TOKENSPEED_REF" in the TOKENSPEED_DIR reuse block, explicitly
reset the working tree to the fetched ref (e.g., run a hard reset to FETCH_HEAD)
so the clone always matches the remote ref; update the commands in the reuse
branch that reference TOKENSPEED_DIR and TOKENSPEED_REF to perform git fetch
--depth 1 origin "$TOKENSPEED_REF" followed by a git reset --hard FETCH_HEAD
(and optionally git clean -fd if you need to remove untracked files) to ensure
the working tree is exactly the fetched commit.

@key4ng
Collaborator

key4ng commented May 8, 2026

Reorganized into a 3-PR stack

@CatherineSue per your review, split this into three reviewable PRs:

  • #1351 (this): Rust gRPC + protocol — proto, generated stubs, TokenSpeedSchedulerClient, router wiring, RuntimeType::TokenSpeed (base: main)
  • #1464: Python servicer + unit tests (base: feat/grpc-servicer-tokenspeed)
  • #1465: CI workflows + e2e tests + install scripts (base: feat/grpc-tokenspeed-servicer)

PRs #1464 and #1465 are open as drafts until this PR is in shape to merge.

What changed in this PR vs. the previous tip

Verification (this PR)

  • cargo +nightly fmt --all -- --check
  • cargo clippy -p smg-grpc-client --all-targets --all-features -- -D warnings
  • cargo clippy -p smg --all-targets --all-features -- -D warnings

Other reviewer notes

  • The 🔴 critical bot finding (FakeAsyncLLM crashing on an n>1 list rid) is fixed in PR #1464 (feat(grpc_servicer): add TokenSpeed servicer, Part 2/3).
  • The duplicate Qwen/Qwen3-30B-A3B dict key flagged earlier is not present on the current branch (resolved in an earlier yetone commit; verified with grep -n that only one entry remains).
  • Lower-priority bot findings (P1/P2 on warmup-then-SERVING ordering, temperature=0 preservation, bootstrap_port=0 ambiguity, etc.) are deliberately out of scope for this restructuring; happy to address them in follow-up PRs once this stack lands.

cc @yetone @slin1237 @XinyueZhang369 @gongwei-130

key4ng added 2 commits May 9, 2026 12:19
…dule

The 5 ``build_*_sampling_params_from_*`` helpers (chat / responses /
messages / completion / plain) plus their constraint helpers were
sitting on ``SglangSchedulerClient`` even though the OpenAI mapping is
backend-neutral. TokenSpeed's client was reaching across to call them
through ``SglangSchedulerClient::*`` which suggested SGLang owned the
OpenAI→sampling translation, when it doesn't.

Move them to ``crate::sampling_params`` as free functions. Both the
SGLang and TokenSpeed clients (and any future client that wants the
same mapping) now call ``crate::sampling_params::build_*`` directly.

The return type is still ``sglang::SamplingParams`` because that proto
happens to be the most permissive shape across our supported backends;
TokenSpeed translates to its own slimmer shape at the wire boundary.
When a backend grows a sampling field SGLang lacks, this is the place
to add it.

No behavior change. Tests stay green; the call sites in
``build_generate_request_from_*`` are mechanically updated.
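
A minimal, illustrative sketch of the resulting call shape (stand-in types
only; the real free functions live in ``crate::sampling_params`` and return
the generated SGLang params message):

```rust
// Simplified stand-ins for the generated sglang::SamplingParams proto and an
// OpenAI-style chat request; the real types come from the proto build.
#[derive(Debug)]
pub struct SamplingParams {
    pub temperature: Option<f32>,
    pub top_p: Option<f32>,
    pub max_new_tokens: Option<u32>,
}

pub struct ChatRequest {
    pub temperature: Option<f32>,
    pub top_p: Option<f32>,
    pub max_tokens: Option<u32>,
}

pub mod sampling_params {
    use super::{ChatRequest, SamplingParams};

    // The backend-neutral OpenAI -> sampling mapping as a free function, no
    // longer a method on SglangSchedulerClient.
    pub fn build_chat_sampling_params(req: &ChatRequest) -> SamplingParams {
        SamplingParams {
            temperature: req.temperature,
            top_p: req.top_p,
            max_new_tokens: req.max_tokens,
        }
    }
}

fn main() {
    let req = ChatRequest { temperature: Some(0.2), top_p: None, max_tokens: Some(128) };
    // Both the SGLang and TokenSpeed clients call the same free function.
    let params = sampling_params::build_chat_sampling_params(&req);
    println!("{params:?}");
}
```

The only point of the sketch is ownership: the mapping is a free function any
backend client can call, so no single client implies it owns the translation.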

Signed-off-by: key4ng <rukeyang@gmail.com>
…ation)

TokenSpeed previously rode the SGLang IR arms in proto_wrapper.rs:
``TokenSpeedSchedulerClient::generate()`` accepted ``sglang::GenerateRequest``,
the streaming response was translated back into ``sglang::GenerateResponse``
at the wire boundary, and the router dispatched through
``ProtoGenerateRequest::Sglang``. This let TokenSpeed reuse the existing
match arms but coupled it to SGLang's evolving schema — every SGLang field
addition forced a TokenSpeed translator stub (most recently
``default_sampling_params_json``).

Add native TokenSpeed arms to the router IR:

  - ``ProtoGenerateRequest::TokenSpeed(Box<tokenspeed::GenerateRequest>)``
  - ``ProtoGenerateResponse::TokenSpeed(Box<tokenspeed::GenerateResponse>)``
  - ``ProtoGenerateStreamChunk::TokenSpeed(tokenspeed::GenerateStreamChunk)``
  - ``ProtoGenerateComplete::TokenSpeed(tokenspeed::GenerateComplete)``

…with ``as_tokenspeed`` / ``as_tokenspeed_mut`` / ``is_tokenspeed`` accessors
mirroring the existing per-backend pattern (Mlx / Vllm / Trtllm) and
``TokenSpeed`` arms in every aggregator method (``token_ids``, ``index``,
``output_logprobs``, ``input_logprobs``, ``prompt_tokens``,
``completion_tokens``, ``cached_tokens``, ``finish_reason``,
``matched_stop_json``, ``output_ids``, ``set_stream``, ``request_id``,
``set_max_tokens_for_prefill``, ``clear_mm_inputs``,
``set_kv_transfer_params``, ``kv_transfer_params``).
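
For orientation, a hypothetical, much-reduced sketch of that per-backend IR
shape (two arms, one accessor, one aggregator method; the real enum and its
full accessor/aggregator surface live in ``proto_wrapper.rs``):

```rust
// Stand-in payload types; the real ones are the generated proto messages.
struct SglangGenerateRequest {
    rid: String,
}
struct TokenSpeedGenerateRequest {
    rid: String,
}

enum ProtoGenerateRequest {
    Sglang(Box<SglangGenerateRequest>),
    TokenSpeed(Box<TokenSpeedGenerateRequest>),
}

impl ProtoGenerateRequest {
    fn is_tokenspeed(&self) -> bool {
        matches!(self, Self::TokenSpeed(_))
    }

    fn as_tokenspeed(&self) -> Option<&TokenSpeedGenerateRequest> {
        match self {
            Self::TokenSpeed(req) => Some(req),
            _ => None,
        }
    }

    // Aggregator-style accessor: every backend arm answers the same question.
    fn request_id(&self) -> &str {
        match self {
            Self::Sglang(req) => &req.rid,
            Self::TokenSpeed(req) => &req.rid,
        }
    }
}

fn main() {
    let req = ProtoGenerateRequest::TokenSpeed(Box::new(TokenSpeedGenerateRequest {
        rid: "req-1".into(),
    }));
    assert!(req.is_tokenspeed());
    assert_eq!(req.as_tokenspeed().map(|r| r.rid.as_str()), Some("req-1"));
    println!("{}", req.request_id());
}
```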

Client-side:

  - ``TokenSpeedSchedulerClient::generate()`` now takes
    ``tokenspeed_proto::GenerateRequest``
  - ``AbortOnDropStream::Item`` is ``tokenspeed_proto::GenerateResponse``
  - The 5 ``build_*_request`` builders return native ``tokenspeed_proto``
    types
  - ``translate::generate_request`` / ``generate_response`` /
    ``stream_chunk`` / ``complete`` / ``output_logprobs`` are gone. The
    only translation kept is ``translate::sampling_params`` (a thin
    field map) plus the unary RPC adapters (``model_info`` /
    ``server_info`` / ``loads``), which still produce SGLang shapes
    because the router's ``ModelInfo`` / ``ServerInfo`` enums consume
    those; migrating them is a separate cleanup.

Router-side:

  - ``client.rs::generate()`` dispatch arm now matches
    ``(Self::TokenSpeed(_), ProtoGenerateRequest::TokenSpeed(_))`` (see
    the sketch after this list).
  - The 5 ``build_*_request`` paths in ``GrpcClient`` wrap into
    ``ProtoGenerateRequest::TokenSpeed`` instead of ``::Sglang``.
  - ``harmony/stages/request_building.rs`` builds
    ``ProtoGenerateRequest::TokenSpeed`` and grew a ``TokenSpeed`` arm
    in the Harmony stop-token injection match.
  - ``common/stages/helpers.rs::apply_sampling_defaults_to_generate_request``
    early-returns for TokenSpeed (alongside Trtllm) since neither
    backend plumbs sampling defaults through today; the explicit arm
    keeps the match exhaustive.
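
A hypothetical sketch of that double-match dispatch, with simplified
stand-ins for the client and request types (not the real ``GrpcClient``
definitions):

```rust
// Simplified stand-ins for the per-backend scheduler clients and request IR.
struct SglangSchedulerClient;
struct TokenSpeedSchedulerClient;

enum GrpcClient {
    Sglang(SglangSchedulerClient),
    TokenSpeed(TokenSpeedSchedulerClient),
}

struct SglangGenerateRequest;
struct TokenSpeedGenerateRequest;

enum ProtoGenerateRequest {
    Sglang(Box<SglangGenerateRequest>),
    TokenSpeed(Box<TokenSpeedGenerateRequest>),
}

impl GrpcClient {
    fn generate(&self, request: ProtoGenerateRequest) -> Result<&'static str, &'static str> {
        match (self, request) {
            // Each client only accepts its own backend's native request arm.
            (Self::Sglang(_client), ProtoGenerateRequest::Sglang(_req)) => Ok("sglang"),
            (Self::TokenSpeed(_client), ProtoGenerateRequest::TokenSpeed(_req)) => {
                Ok("tokenspeed")
            }
            // A mismatched client/request pair is a routing bug, not a silent fallback.
            _ => Err("client/request backend mismatch"),
        }
    }
}

fn main() {
    let client = GrpcClient::TokenSpeed(TokenSpeedSchedulerClient);
    let request = ProtoGenerateRequest::TokenSpeed(Box::new(TokenSpeedGenerateRequest));
    assert_eq!(client.generate(request), Ok("tokenspeed"));
}
```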

PD-disagg paths (``response_collection.rs``) remain SGLang-keyed — the
``if let ProtoGenerateComplete::Sglang(...)`` checks simply won't match
TokenSpeed responses, which is the correct behavior since TokenSpeed
doesn't ship PD-disaggregation.

Verification:

  - ``cargo +nightly fmt --all -- --check`` passes
  - ``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`` passes
  - ``cargo check -p smg --bin smg`` passes

This addresses the architectural concern raised on #1351: SGLang's proto
shouldn't be the de-facto router IR. Each backend now has its own arm,
matching how vLLM / MLX / TRT-LLM are integrated.

Signed-off-by: key4ng <rukeyang@gmail.com>
key4ng added a commit that referenced this pull request May 9, 2026
Adds the Python servicer that runs alongside a TokenSpeed scheduler
process and serves the gRPC protocol PR1 introduced. Includes:

  - the async scheduler servicer (Generate/HealthCheck/Abort/
    GetModelInfo/GetServerInfo/GetLoads), with cancellation handling
    for streaming, non-streaming, channel-close, and n>1 paths
  - a health-service bridge that flips SERVING/NOT_SERVING based on
    scheduler liveness (deep probe with bounded staleness)
  - a scheduler launcher that boots TokenSpeed's AsyncLLM in-process
  - the ``python -m smg_grpc_servicer.tokenspeed`` entrypoint
  - real ``GetLoads`` plumbing backed by ``AsyncLLM.get_load()`` so
    router-side load balancing reflects scheduler-side metrics
  - 57 unit tests covering the servicer, health service, proto
    conversion, finish reasons, sampling params, streaming/non-
    streaming generation, abort/cancel (incl. n>1), model/server
    info, and load metrics

This is part 2 of 3 splitting #1351:
  - PR1: Rust gRPC + protocol (merged first)
  - PR2 (this): Python servicer + unit tests
  - PR3: CI workflows + e2e tests

Stacked on PR1 — the servicer imports the proto stubs PR1 generates
from ``crates/grpc_client/proto/tokenspeed_scheduler.proto``.

Fixes a 🔴 critical from review on #1351:
  - FakeAsyncLLM.generate_request crashed with
    ``TypeError: unhashable type: 'list'`` when n>1, because
    ``_build_generate_req`` rewrites ``rid`` to a list of per-choice
    ids. The fake engine now registers state for each child rid, so
    ``test_cancel_aborts_all_n_children`` exercises the cancel sweep
    instead of dying at setup.

Signed-off-by: key4ng <rukeyang@gmail.com>
key4ng added a commit that referenced this pull request May 9, 2026
Wires TokenSpeed into CI and the GPU e2e suite:

  - ``.github/actions/setup-tokenspeed`` composite action and
    ``scripts/ci_install_tokenspeed.sh`` to source-install TokenSpeed
    (kernel + scheduler) at a pinned ref, with a wheel cache lookup so
    repeat runs skip the ~20 min compile
  - e2e-gpu-job.yml: add a tokenspeed engine lane, gated on secret
    access so forked PRs skip cleanly
  - pr-test-rust.yml: install the same proto deps so Rust-only changes
    that touch ``crates/grpc_client`` still cover the tokenspeed proto
  - e2e_test infra: ``constants``, ``hooks``, ``worker``, and
    ``model_specs`` learn about a ``tokenspeed`` runtime alongside
    sglang/vllm/trtllm; ``worker.py`` adds the launch builder; the
    suite-wide ``@pytest.mark.engine(...)`` markers expand to include
    tokenspeed
  - Function-calling and tool_choice e2e suites swap to
    ``Qwen/Qwen3-4B-Instruct-2507`` for tool-call coverage (the Qwen3
    family is what TokenSpeed's model registry currently supports)

This is part 3 of 3 splitting #1351:
  - PR1: Rust gRPC + protocol
  - PR2: Python servicer + unit tests
  - PR3 (this): CI workflows + e2e tests

Stacked on PR2. e2e wiring exercises both the Rust router from PR1 and
the Python servicer from PR2 against a live TokenSpeed worker.

Addresses CatherineSue's review on #1351:
  - drop the verbose Qwen3-4B docstring on ``TestToolChoiceQwen`` —
    that context belongs in the PR description, not in the test file

Signed-off-by: key4ng <rukeyang@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76bf5720e6


-if matches!(request, ProtoGenerateRequest::Trtllm(_)) {
+if matches!(
+    request,
+    ProtoGenerateRequest::Trtllm(_) | ProtoGenerateRequest::TokenSpeed(_)

P2: Apply sampling defaults to TokenSpeed requests

When a selected TokenSpeed worker has a default_sampling_params_json label and the client omits fields like temperature or top_p, this early return skips the default-injection path entirely. The TokenSpeed request builders already populate fallback values before dispatch, so the model/worker-preferred defaults that are applied for SGLang/vLLM/MLX are silently ignored for TokenSpeed, changing generation behavior for workers that rely on those labels.


Collaborator

Fixed in c478f1ac. Dropped TokenSpeed from the early-return, added a
ProtoGenerateRequest::TokenSpeed(req) match arm calling a new
apply_tokenspeed_sampling_defaults helper. Defaults now flow
end-to-end: translate::model_info populates
default_sampling_params_json (sibling thread), discovery flattens it
into the label map, and the injector applies the masked fields with
Some(value) per TokenSpeed's optional scalar shape.

Comment on lines 122 to 127
if matches!(
request,
ProtoGenerateRequest::Trtllm(_) | ProtoGenerateRequest::TokenSpeed(_)
) {
return;
}

🟡 Nit: This early-return is a behavioral regression from the previous code. Before this push, TokenSpeed requests were wrapped as ProtoGenerateRequest::Sglang, so they fell into the SGLang arm below and received model-discovered sampling defaults (e.g., temperature=0.7 from generation_config.json). Now that TokenSpeed has its own variant, defaults are skipped entirely.

Concretely: if a model publishes temperature=0.7 in its generation config and a user sends a chat request without specifying temperature, SGLang/vLLM/MLX workers will use 0.7, but TokenSpeed workers will use the hardcoded 1.0 from the sampling-params builder.

This is straightforward to fix (write an apply_tokenspeed_sampling_defaults similar to the existing per-backend functions), but since TokenSpeed is still in development I understand deferring it. Worth tracking explicitly so it doesn't get lost.

Collaborator

Fixed in c478f1ac rather than deferred — good catch on the
regression. apply_sampling_defaults_to_generate_request no longer
early-returns for TokenSpeed; there's a dedicated
apply_tokenspeed_sampling_defaults mirroring the SGLang/vLLM/MLX
helpers but writing Some(value) because every TokenSpeed
SamplingParams scalar is optional on the wire (the servicer's
HasField() checks rely on that set-vs-unset distinction).

Paired with the translate::model_info fix in the same commit so
default_sampling_params_json carries the JSON the discovery path
expects.

key4ng added 2 commits May 9, 2026 13:52
…ection

The chat-template tool-shape pre-processor is correct for Kimi-K2.5
(BFCL accuracy +6 pp on simple_python, +24 pp on parallel_multiple)
but breaks Mistral chat templates: their template at line 32 iterates
``tool.function.parameters.properties.items()``, which raises
``unknown method: undefined has no method named items`` once we unwrap
``{"type": "function", "function": {...}}`` into the bare inner dict.

The shape a chat template expects is template-dependent, not
engine-dependent. Reverting the unconditional unwrap; full rationale,
accuracy data, and proposed per-model fix in
docs/proposals/2026-05-09-deferred-chat-template-tools-strip.md.

Affected CI lane: e2e_test/chat_completions/test_function_calling.py::TestToolChoiceMistral
(20 tests failing with chat:32 render error).

Signed-off-by: key4ng <rukeyang@gmail.com>
Cuts ~125 lines of doc-comment / inline-rationale prose without losing
information. Hot spots:

- tokenspeed_scheduler.rs module doc (28 → 7 lines)
- tokenspeed_scheduler.proto service header + SamplingParams comment
  + per-field rationale (~40 lines)
- sampling_params.rs module doc (14 → 4 lines)
- translate::sampling_params explainer (12 → 3 lines)
- inline arm comments in client.rs / multimodal.rs / harmony/stages /
  common/stages that just re-stated what the code already shows

Behavior unchanged. ``cargo +nightly fmt --check`` passes;
``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features
-- -D warnings`` passes.

Signed-off-by: key4ng <rukeyang@gmail.com>
@github-actions github-actions Bot removed the tokenizer Tokenizer related changes label May 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8478a636ba


Comment thread crates/grpc_client/src/tokenspeed_scheduler.rs Outdated
Tightens the comments introduced (or modified) by this PR to describe
behavior directly. No behavior change; literal type / path references
left intact.

Files touched:
- crates/grpc_client/build.rs
- crates/grpc_client/proto/tokenspeed_scheduler.proto
- crates/grpc_client/src/sampling_params.rs
- crates/grpc_client/src/tokenspeed_scheduler.rs
- model_gateway/src/routers/grpc/client.rs
- model_gateway/src/routers/grpc/proto_wrapper.rs

Signed-off-by: key4ng <rukeyang@gmail.com>
@key4ng key4ng force-pushed the feat/grpc-servicer-tokenspeed branch from 8057d10 to 656f1c2 Compare May 11, 2026 03:44
Comment thread crates/grpc_client/src/lib.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 656f1c2e98


Comment thread crates/grpc_client/proto/tokenspeed_scheduler.proto
Comment thread crates/grpc_client/src/lib.rs
The IR refactor split TokenSpeed off the SGLang request arm, which
left two gaps in the sampling-defaults path:

1. `translate::model_info` populated the `preferred_sampling_params`
   label slot from TokenSpeed's response but left
   `default_sampling_params_json` empty. Worker discovery reads only
   the latter, so model-published defaults never reached the label
   map and were invisible to the request-stage injector.

2. `apply_sampling_defaults_to_generate_request` early-returned for
   `ProtoGenerateRequest::TokenSpeed(_)`. Even if a worker's labels
   carried defaults, the injector skipped the TokenSpeed arm — so a
   model publishing `temperature=0.7` in its generation config would
   apply to the other engines but TokenSpeed would fall through to
   the hardcoded 1.0 in the request builder.

Both fixed:

- `tokenspeed_scheduler.rs`: surface `preferred_sampling_params` in
  both `preferred_sampling_params` and `default_sampling_params_json`
  slots so the discovery path picks it up.
- `helpers.rs`: drop TokenSpeed from the early-return, add a
  `TokenSpeed(req)` match arm calling a new
  `apply_tokenspeed_sampling_defaults`. TokenSpeed's wire declares
  every sampling scalar as `optional`, so the helper writes
  `Some(value)` rather than the bare value — preserving the
  set-vs-unset distinction the servicer's `HasField()` checks rely on.
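
A hypothetical sketch of that set-vs-unset behavior (field names and the
defaults source are simplified stand-ins for the real label-map plumbing):

```rust
// Every TokenSpeed sampling scalar is optional on the wire: None means the
// client never set the field.
#[derive(Debug, PartialEq)]
struct TokenSpeedSamplingParams {
    temperature: Option<f32>,
    top_p: Option<f32>,
}

// Defaults discovered from the worker's default_sampling_params_json label.
struct WorkerSamplingDefaults {
    temperature: Option<f32>,
    top_p: Option<f32>,
}

// Fill only fields the client left unset; injected values stay wrapped in
// Some(..) so HasField-style checks on the servicer side still see them as set.
fn apply_tokenspeed_sampling_defaults(
    params: &mut TokenSpeedSamplingParams,
    defaults: &WorkerSamplingDefaults,
) {
    if params.temperature.is_none() {
        params.temperature = defaults.temperature;
    }
    if params.top_p.is_none() {
        params.top_p = defaults.top_p;
    }
}

fn main() {
    // The model publishes temperature=0.7; the client only set top_p.
    let defaults = WorkerSamplingDefaults { temperature: Some(0.7), top_p: Some(0.9) };
    let mut params = TokenSpeedSamplingParams { temperature: None, top_p: Some(0.5) };

    apply_tokenspeed_sampling_defaults(&mut params, &defaults);

    assert_eq!(params.temperature, Some(0.7)); // worker default injected
    assert_eq!(params.top_p, Some(0.5)); // client value preserved
    println!("{params:?}");
}
```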

Verification: `cargo +nightly fmt --all -- --check` and `cargo clippy
-p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`
both pass.

Signed-off-by: key4ng <rukeyang@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c478f1acd7


Comment thread crates/grpc_client/python/smg_grpc_proto/__init__.py
- Collapse the duplicate 2-line crate-level docstring into a single
  neutral line covering all supported backends.
- Drop the 5-line TokenSpeed re-export rationale block; the design
  rationale already lives in `tokenspeed_scheduler.rs`'s module doc.

Signed-off-by: key4ng <rukeyang@gmail.com>

Labels

  • grpc: gRPC client and router changes
  • model-gateway: Model gateway crate changes
  • protocols: Protocols crate changes
