
feat(grpc): add TokenSpeed gRPC client and router wiring (Part 1/3) #1351

Open
yetone wants to merge 8 commits into main from feat/grpc-servicer-tokenspeed

Conversation

@yetone
Collaborator

@yetone yetone commented Apr 23, 2026

Description

Problem

SMG has no first-class TokenSpeed backend. Routing to a TokenSpeed worker requires the gateway to speak TokenSpeed's dedicated gRPC protocol, plumb it through the router's per-backend IR, and pick TokenSpeed up correctly during runtime detection.

Solution

This PR (Part 1 of 3) adds the Rust side of TokenSpeed support: a self-contained tokenspeed.grpc.scheduler.TokenSpeedScheduler proto, a TokenSpeedSchedulerClient that produces and consumes native tokenspeed_proto types end-to-end, dedicated ProtoGenerateRequest::TokenSpeed / ProtoGenerateStreamChunk::TokenSpeed / ProtoGenerateComplete::TokenSpeed IR arms in the router (same shape as the existing per-backend arms), and the RuntimeType::TokenSpeed variant.

PR1 is functionally inert without PR2 — the router can dial a TokenSpeed worker, but the worker process is in PR2's Python servicer.

3-PR Stack

Changes

  • New crates/grpc_client/proto/tokenspeed_scheduler.proto — fully self-contained wire definition (package tokenspeed.grpc.scheduler)
  • New crates/grpc_client/src/tokenspeed_scheduler.rs — TokenSpeedSchedulerClient returning native tokenspeed_proto::* types
  • crates/protocols/src/worker.rs — RuntimeType::TokenSpeed variant
  • model_gateway/src/routers/grpc/proto_wrapper.rs — dedicated TokenSpeed arms across the IR enums + per-arm accessors
  • model_gateway/src/routers/grpc/client.rs — dispatch routes the new variant through TokenSpeedSchedulerClient
  • model_gateway/src/workflow/steps/local/detect_backend.rs — health-check ordering extended with tokenspeed
  • Stage updates: harmony/regular request building, multimodal, embedding

Test Plan

  • cargo +nightly fmt --all -- --check passes
  • cargo clippy -p smg-grpc-client --all-targets --all-features -- -D warnings passes
  • cargo clippy -p smg --all-targets --all-features -- -D warnings passes
  • cargo check -p smg --bin smg succeeds against current main

End-to-end coverage runs in PR3 once both the Rust router and the Python servicer (PR2) are in place.

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds TokenSpeed as a supported runtime across tests, worker wiring, CI, gRPC servicer (launcher, server, health, servicer), model gateway clients/workflow, Rust/Cargo proto build, and new unit tests and packaging extras.

Changes

  • E2E Tests & Runtime — e2e_test/chat_completions/test_function_calling.py, e2e_test/infra/constants.py, e2e_test/infra/model_specs.py
    Introduce tokenspeed runtime, predicate helper, include in LOCAL_RUNTIMES/labels, update model specs, and gate/xfail tests for TokenSpeed limitations.
  • Worker Integration — e2e_test/infra/worker.py
    Add TokenSpeed branch that enforces gRPC mode and builds TokenSpeed server start command (supports tokenspeed_args).
  • Packaging & Pytest Config — grpc_servicer/pyproject.toml, grpc_servicer/tests/conftest.py
    Document TokenSpeed, add tokenspeed and test extras, pytest asyncio/testpaths/markers, and test conftest to adjust sys.path and register tokenspeed marker.
  • TokenSpeed Package Entrypoints — grpc_servicer/smg_grpc_servicer/tokenspeed/__init__.py, .../__main__.py
    Expose servicer API and add CLI entrypoint that prepares server args, sets uvloop policy, and runs serve_grpc.
  • Server, Launcher & Health — grpc_servicer/smg_grpc_servicer/tokenspeed/server.py, .../scheduler_launcher.py, .../health_servicer.py
    Add serve_grpc coroutine with warmup/polling, reflection, signal handling and shutdown; add launch_engine delegating to TokenSpeed entrypoint; add health servicer with dual-scope and stuck-scheduler checks.
  • TokenSpeed gRPC Servicer — grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
    Implement TokenSpeedSchedulerServicer (Generate/Embed/HealthCheck/Abort/GetModelInfo/GetServerInfo/GetLoads/GetTokenizer/SubscribeKvEvents) and related helpers and lifecycle handling.
  • Unit Tests — grpc_servicer/tests/test_tokenspeed_servicer.py, grpc_servicer/tests/test_tokenspeed_health_servicer.py
    Add comprehensive tests for health servicer, servicer RPCs, helpers, sampling-param mapping, streaming/cancellation, abort semantics, and tokenizer streaming.
  • CI / Workflows / Scripts — .github/actions/setup-tokenspeed/action.yml, .github/workflows/e2e-gpu-job.yml, .github/workflows/pr-test-rust.yml, scripts/ci_install_tokenspeed.sh
    Add composite action and CI install script for TokenSpeed, conditional workflow step and engine input, extend PR test matrix and path filters for TokenSpeed artifacts.
  • Protocols & Gateway Wiring — crates/protocols/src/worker.rs, model_gateway/src/routers/grpc/client.rs, model_gateway/src/routers/grpc/...
    Add tokenspeed runtime enum, TokenSpeed gRPC client module and routing, adapt request-building stages and multimodal/embedding paths to support TokenSpeed, mark PD-disallowed where applicable.
  • Rust gRPC Client & Proto — crates/grpc_client/build.rs, crates/grpc_client/proto/tokenspeed_scheduler.proto, crates/grpc_client/src/tokenspeed_scheduler.rs, crates/grpc_client/src/lib.rs, crates/grpc_client/python/smg_grpc_proto/__init__.py
    Add TokenSpeed proto, build.rs generation pass, Rust client wrapper, re-exports, and Python proto re-exports.
  • Internal refactors & helpers — crates/grpc_client/src/sglang_scheduler.rs, model_gateway/src/.../request_building.rs, model_gateway/src/.../classify.rs, model_gateway/src/.../detect_backend.rs, model_gateway/src/.../util.rs
    Refactor abort-on-drop to callback dispatcher, expose helpers, and extend gateway detection/building stages to handle TokenSpeed and adjust runtime probing order.

Sequence Diagram(s)

sequenceDiagram
    participant Client as gRPC Client
    participant Server as TokenSpeed gRPC Server
    participant Launcher as Scheduler Launcher
    participant AsyncLLM as AsyncLLM Engine
    participant Health as Health Servicer

    Client->>Server: Generate / Embed / HealthCheck
    Server->>Launcher: launch_engine(server_args)
    Launcher->>AsyncLLM: _launch_subprocesses()
    AsyncLLM-->>Launcher: AsyncLLM instance + scheduler_info
    Launcher-->>Server: (async_llm, scheduler_info)
    Server->>Health: instantiate TokenSpeedHealthServicer
    par Warmup
        Server->>AsyncLLM: probe (one-token Generate/Embed)
        AsyncLLM-->>Server: probe response
        Server->>Health: set_serving()
    end
    Server->>AsyncLLM: dispatch Generate/Embed request
    AsyncLLM-->>Server: stream/generate results
    Server-->>Client: stream/return RPC responses
    Client->>Server: Abort (or cancel)
    Server->>AsyncLLM: abort_request for rid(s)
    AsyncLLM-->>Server: acknowledge
    Server-->>Client: finalize/error

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

python-bindings

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237

Poem

🐰 I hopped to wire a faster stream,

Tokens raced down a new bright beam.
Health kept watch as warmups sang,
Scheduler woke and bells went clang.
gRPC now hums in TokenSpeed gleam.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning — Docstring coverage is 57.89%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Linked Issues check — ✅ Passed — Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed — Check skipped because no linked issues were found for this pull request.
  • Title check — ✅ Passed — The title 'feat(grpc): add TokenSpeed gRPC client and router wiring (Part 1/3)' accurately describes the main changes in this pull request—adding TokenSpeed gRPC support to the client and routing layers—and is part of a larger multi-part effort.
  • Description Check — ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for the TokenSpeed inference engine by implementing a gRPC servicer that mirrors the SGLang protocol, allowing the existing Rust router to interface with TokenSpeed workers. The changes encompass a new gRPC server implementation, health checks, subprocess management, and infrastructure updates for end-to-end testing. A critical issue was identified in the streaming generation logic where client disconnections during multi-sample requests (n > 1) fail to abort all derived backend requests, potentially leading to resource leaks.

except asyncio.CancelledError:
    # Client disconnected — tell the scheduler to drop the request.
    aborted = True
    self.async_llm.abort_request(rid)

Severity: high

When a client disconnects from a streaming Generate request with n > 1, not all underlying requests are being aborted on the backend. The asyncio.CancelledError exception handler calls self.async_llm.abort_request(rid), where rid is the original request.request_id. However, for requests with n > 1, the servicer creates multiple backend requests with derived request IDs (e.g., f"{rid}-n{i}"), which are stored in expanded_rid. The engine is only aware of these derived IDs. This can lead to orphaned requests continuing to run on the backend, consuming resources. The fix is to iterate over expanded_rid and abort each request individually to ensure all acquired resources are tracked and released consistently to prevent leaks.

Suggested change
-self.async_llm.abort_request(rid)
+rids_to_abort = (
+    list(expanded_rid)
+    if isinstance(expanded_rid, list)
+    else ([expanded_rid] if isinstance(expanded_rid, str) else [])
+)
+for r in rids_to_abort:
+    self.async_llm.abort_request(r)
References
  1. When managing resources that are acquired and need to be released, ensure that all acquired resources are tracked immediately and consistently to prevent resource leaks, especially when exceptions might occur during subsequent operations.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 34e647141e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +639 to +640
if params.temperature:
    out["temperature"] = params.temperature

P1: Preserve explicit zero temperature in sampling params

The truthy check drops temperature=0.0, so requests that explicitly ask for greedy/deterministic decoding lose that setting before reaching TokenSpeed. The Rust gRPC builder can emit 0.0 for user-specified temperature, but this branch treats it as "unset" and omits it, which can silently fall back to backend defaults and change generation behavior.
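A minimal sketch of the failure mode and a presence-aware alternative, using a plain dataclass as a stand-in for the generated proto message (with a real proto3 message, presence would instead be checked through an optional field and HasField):

from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeSamplingParams:
    # None stands in for "field not set"; a generated proto would use an
    # optional scalar plus HasField for the same distinction.
    temperature: Optional[float] = None


def map_truthy(params: FakeSamplingParams) -> dict:
    out = {}
    if params.temperature:  # buggy: 0.0 is falsy, so greedy decoding is dropped
        out["temperature"] = params.temperature
    return out


def map_presence(params: FakeSamplingParams) -> dict:
    out = {}
    if params.temperature is not None:  # presence check keeps an explicit 0.0
        out["temperature"] = params.temperature
    return out


assert map_truthy(FakeSamplingParams(temperature=0.0)) == {}
assert map_presence(FakeSamplingParams(temperature=0.0)) == {"temperature": 0.0}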


Comment on lines +201 to +205
    logger.warning("TokenSpeed warmup failed: %s", e)
finally:
    channel.close()

health_servicer.set_serving()

P1: Set health status to SERVING only after warmup succeeds

Warmup failures in Generate/Embed are logged but still followed by set_serving(), so readiness can report healthy even when the model path is not actually able to serve inference. This can route production traffic to a broken worker right after startup instead of keeping it out of rotation.
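A hedged sketch of the suggested ordering, not the actual server.py; set_not_serving is assumed here as a counterpart to the set_serving call shown above. The point is to gate readiness on the probe outcome so a broken worker stays out of rotation:

import logging

logger = logging.getLogger("tokenspeed.warmup")


async def warmup_then_mark_ready(probe, health_servicer) -> bool:
    """Run the one-token warmup probe and only advertise readiness if it succeeds."""
    try:
        await probe()  # e.g. the one-token Generate/Embed warmup call
    except Exception as e:
        logger.warning("TokenSpeed warmup failed: %s", e)
        health_servicer.set_not_serving()  # assumed API; keep the worker out of rotation
        return False
    health_servicer.set_serving()  # only now report SERVING to the router
    return True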


Comment on lines +198 to +202
except asyncio.CancelledError:
    # Client disconnected — tell the scheduler to drop the request.
    aborted = True
    self.async_llm.abort_request(rid)
    raise

P2: Abort all expanded request IDs on client cancellation

On cancellation, the handler aborts only the original request_id, but _build_generate_req expands rid into per-choice IDs when n>1. Since aborted=True also skips the finally cleanup path, those expanded child requests can continue running after the client disconnects, wasting scheduler/GPU work until they finish naturally.
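A sketch under the assumptions stated above (per-choice children named {rid}-n{i} and an engine exposing abort_request); it sweeps every expanded id on cancellation and, by not setting an aborted flag, leaves the finally cleanup path intact:

import asyncio


async def stream_with_abort_on_cancel(async_llm, rid, expanded_rid, chunks):
    try:
        async for chunk in chunks:
            yield chunk
    except asyncio.CancelledError:
        # Client disconnected: abort the original rid and every per-choice child.
        rids = expanded_rid if isinstance(expanded_rid, list) else [rid]
        for r in rids:
            async_llm.abort_request(r)
        raise
    finally:
        # Deliberately not short-circuited by an `aborted` flag, so the
        # defensive cleanup described above still runs.
        pass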



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
e2e_test/infra/constants.py (1)

57-57: ⚠️ Potential issue | 🟡 Minor

Stale ENV_RUNTIME comment.

The inline comment still enumerates only "sglang", "vllm", or "trtllm" even though tokenspeed is now a supported runtime (and openai, xai, etc. are also accepted via the enum). Consider updating to avoid drift.

-ENV_RUNTIME = "E2E_RUNTIME"  # Runtime for gRPC tests: "sglang", "vllm", or "trtllm"
+ENV_RUNTIME = "E2E_RUNTIME"  # Runtime for gRPC tests: "sglang", "vllm", "trtllm", or "tokenspeed"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/infra/constants.py` at line 57, Update the stale inline comment for
the ENV_RUNTIME constant: modify the comment on ENV_RUNTIME to list the
currently supported runtimes (e.g., include "tokenspeed" and other accepted
values such as "openai", "xai", etc.) or rephrase it to reference the
authoritative enum so it won't drift; locate ENV_RUNTIME in
e2e_test/infra/constants.py and replace the old enumeration string with the
updated list or a short note pointing to the enum.
e2e_test/infra/worker.py (1)

37-37: ⚠️ Potential issue | 🟡 Minor

Update engine docstring to include tokenspeed.

The field comment still lists only three engines even though tokenspeed is now dispatched in _build_cmd.

-    engine: str  # "sglang", "vllm", or "trtllm"
+    engine: str  # "sglang", "vllm", "trtllm", or "tokenspeed"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/infra/worker.py` at line 37, Update the inline comment for the
engine field to reflect all supported engine values; change the comment on the
'engine: str' field to include "tokenspeed" alongside "sglang", "vllm", and
"trtllm", and ensure any mention in _build_cmd's dispatch logic remains
consistent with that docstring so reviewers see the available options match the
actual dispatch cases in _build_cmd.
e2e_test/infra/model_specs.py (1)

55-86: ⚠️ Potential issue | 🔴 Critical

Duplicate dict key silently drops function_calling/tool_choice from Qwen/Qwen3-30B-A3B — breaks this PR's primary purpose.

"Qwen/Qwen3-30B-A3B" is defined twice in MODEL_SPECS (here at 55–59 and again at 80–86). Python dict literals keep the last value for duplicate keys, so the new entry with features: ["chat","streaming","function_calling","tool_choice"] is overwritten by the pre-existing entry whose features are ["chat","streaming","thinking","reasoning"].

Downstream impact:

  • get_models_with_feature("function_calling") and get_models_with_feature("tool_choice") will not include Qwen/Qwen3-30B-A3B.
  • TestOpenAIServerFunctionCalling (which this PR switches to Qwen/Qwen3-30B-A3B) will run against a spec that doesn't advertise the features it needs, and vllm_args=["--enforce-eager"] / trtllm_extra_config from the second entry remain in effect.

Merge the two entries into one so features union and the vLLM/TRT-LLM config is preserved:

🛠 Proposed fix: merge duplicate entries
-    # Qwen3 small chat model — used as a TokenSpeed-supported substitute
-    # while TokenSpeed's model registry does not cover LlamaForCausalLM.
-    "Qwen/Qwen3-4B": {
-        "model": _resolve_model_path("Qwen/Qwen3-4B"),
-        "tp": 1,
-        "features": ["chat", "streaming", "function_calling", "tool_choice"],
-    },
-    # Larger Qwen3 MoE — much stronger at function calling than 4B dense.
-    # Arch: Qwen3MoeForCausalLM (in TokenSpeed's registry); 30B total /
-    # 3B active, so inference speed stays close to the 4B variant.
-    "Qwen/Qwen3-30B-A3B": {
-        "model": _resolve_model_path("Qwen/Qwen3-30B-A3B"),
-        "tp": 1,
-        "features": ["chat", "streaming", "function_calling", "tool_choice"],
-    },
+    # Qwen3 small chat model — used as a TokenSpeed-supported substitute
+    # while TokenSpeed's model registry does not cover LlamaForCausalLM.
+    "Qwen/Qwen3-4B": {
+        "model": _resolve_model_path("Qwen/Qwen3-4B"),
+        "tp": 1,
+        "features": ["chat", "streaming", "function_calling", "tool_choice"],
+    },
     # Function calling specialist
@@
-    # Thinking/reasoning model (larger)
-    "Qwen/Qwen3-30B-A3B": {
-        "model": _resolve_model_path("Qwen/Qwen3-30B-A3B"),
-        "tp": 1,
-        "features": ["chat", "streaming", "thinking", "reasoning"],
-        "vllm_args": [] if _is_nightly else ["--enforce-eager"],
-        "trtllm_extra_config": {"kv_cache_config": {"free_gpu_memory_fraction": 0.8}},
-    },
+    # Thinking/reasoning model (larger) — also used as TokenSpeed fn-calling
+    # substitute while TokenSpeed's registry lacks LlamaForCausalLM.
+    # Arch: Qwen3MoeForCausalLM; 30B total / 3B active.
+    "Qwen/Qwen3-30B-A3B": {
+        "model": _resolve_model_path("Qwen/Qwen3-30B-A3B"),
+        "tp": 1,
+        "features": [
+            "chat", "streaming", "thinking", "reasoning",
+            "function_calling", "tool_choice",
+        ],
+        "vllm_args": [] if _is_nightly else ["--enforce-eager"],
+        "trtllm_extra_config": {"kv_cache_config": {"free_gpu_memory_fraction": 0.8}},
+    },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/infra/model_specs.py` around lines 55 - 86, The dict key
"Qwen/Qwen3-30B-A3B" is defined twice in MODEL_SPECS which causes the first
entry (with function_calling/tool_choice) to be overwritten; merge the two
entries into a single "Qwen/Qwen3-30B-A3B" entry in MODEL_SPECS by combining the
features (union of ["chat","streaming","function_calling","tool_choice"] and
["chat","streaming","thinking","reasoning"]) and preserving the vllm_args and
trtllm_extra_config from the second entry as well as the appropriate tp and
model path; remove the duplicate dict literal so
get_models_with_feature("function_calling") / ("tool_choice") will include
Qwen/Qwen3-30B-A3B and the vllm_args/trtllm_extra_config remain in effect.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e_test/chat_completions/test_function_calling.py`:
- Around line 103-115: This test was switched from Llama to Qwen3 but still
contains Llama-specific labels and will OOM on a single 80GB GPU; update the
test to either use a quantized Qwen3 variant or require more GPUs and rename
Llama-specific text: change the pytest model marker value (e.g.,
"Qwen/Qwen3-30B-A3B" → a Q4/quantized variant like "Qwen/Qwen3-30B-A3B-q4" or
change `@pytest.mark.gpu`(1) → `@pytest.mark.gpu`(4)), update the module/class
docstring that mentions "Llama tool parser" to "Qwen tool parser", replace
occurrences of LLAMA_SYSTEM_MESSAGE with a Qwen-appropriate system message
constant or remove it (references around the test helper and setup), and edit
comments that reference “Llama-3.2-1B is flaky” to refer to Qwen3 or remove;
locate these in the test file by symbols TestToolChoiceLlama,
LLAMA_SYSTEM_MESSAGE, and the top-level module docstring and adjust markers and
messages accordingly.

In `@e2e_test/infra/worker.py`:
- Around line 271-298: The tokenspeed gRPC command builder
(_build_tokenspeed_grpc_cmd) is fine, but the start_workers function's engine
docstring still lists only three engines; update that docstring to include the
new "tokenspeed" engine name so documentation matches the implemented backend
and keep symmetry with other builders (also check the "engine" mention near
start_workers for any enumerations or examples to include "tokenspeed").

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py`:
- Around line 34-38: The comment about priming env vars is misplaced: it refers
to the env/resource setup that actually happens in prepare_server_args(argv),
but it sits above asyncio.set_event_loop_policy() and
asyncio.run(serve_grpc(server_args)); either move the comment to directly above
the prepare_server_args(argv) call (or inside prepare_server_args) so it
documents the env priming, or remove the comment entirely if it's stale; adjust
text to reference prepare_server_args, asyncio.set_event_loop_policy, and
asyncio.run/serve_grpc to ensure clarity.
- Around line 24-38: The uvloop event loop policy is being set after grpc/aio
internals (via the module-level serve_grpc import) are already initialized; move
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) to the very start of
main() before any grpc-related initialization by invoking it before
prepare_server_args(argv) and before any use of serve_grpc/other grpc-related
code so the policy is applied prior to gRPC internals being
imported/initialized.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/server.py`:
- Around line 157-203: The code relies on an implicit invariant that `response`
is set when `connected` is True; to harden this, initialize `response` (e.g.,
response = None) before the GetModelInfo loop and/or capture the successful
model info into an explicit variable like `model_info_response` when you set
`connected = True`, then use that variable in the warmup path instead of the
possibly-unbound `response`; update references in the warmup block (the
is_generation check and any uses of response) to read from the explicit variable
so future refactors of the GetModelInfo loop can't leave `response` undefined.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 156-222: The CancelledError handler currently aborts only the
original proto rid and sets aborted=True, which prevents the finally block from
cleaning up per-choice expanded rids for n>1; change the asyncio.CancelledError
except block to (1) compute the same expanded rids from req_obj.rid (use the
existing logic that builds rids_to_check), (2) call
self.async_llm.abort_request(r) for each expanded rid, and (3) do NOT set
aborted=True so the finally block still executes its defensive cleanup; also add
an n>1 unit test in test_tokenspeed_servicer.py mirroring
test_cancel_calls_abort_request to assert abort_request is invoked for each
expanded per-choice rid.
- Around line 535-546: In SubscribeKvEvents the current return appears before
the required yield which is confusing; keep the abort call but ensure the
function is still recognized as an async generator by moving the yield into an
unreachable branch (e.g. use "if False: yield" or "if False: yield
common_pb2.KvEventBatch()") after the abort so the function retains an
async-generator signature while aborting via context.abort; update the
SubscribeKvEvents body to call context.abort(...) and then have an unreachable
yield branch to satisfy the async-generator contract.
- Around line 559-617: The code currently does obj.bootstrap_port =
p.bootstrap_port or None which treats proto3 default 0 as "unset" and loses the
ability to distinguish a real 0 value; update handling so presence is explicit:
if the DisaggregatedParams proto declares bootstrap_port as optional, check
p.HasField("bootstrap_port") and set obj.bootstrap_port = p.bootstrap_port when
present (else None); if you cannot change the proto, add documentation/comments
about the asymmetry or introduce an explicit presence flag (e.g.
bootstrap_port_present) and use that to decide whether to set
obj.bootstrap_port. Ensure you modify the logic around p.bootstrap_port and
obj.bootstrap_port (and related proto definition) rather than changing unrelated
code.
- Around line 619-688: The code in _sampling_params_from_proto drops explicit
zero values by using truthy checks (e.g., if params.min_new_tokens:) and
similarly the non-optional proto scalar logprob_start_len will always appear
"present" but may be mishandled; change these checks to use HasField for fields
that must distinguish "unset" from explicit zero (e.g., replace truthy guards
for min_new_tokens and logprob_start_len with params.HasField("min_new_tokens")
/ params.HasField("logprob_start_len") and forward the raw proto value), and if
the proto currently defines those scalars as non-optional, update the proto
definitions to make them optional so HasField works.

In `@grpc_servicer/tests/conftest.py`:
- Around line 1-22: Update the module docstring to accurately describe what this
file does: remove or reword the claim about "declares an asyncio-mode default"
since asyncio_mode is configured in pyproject.toml, and instead state that the
file adds the package root to sys.path and registers the pytest marker via
pytest_configure (the function registering the "tokenspeed" marker); keep the
rest of the file unchanged.

In `@grpc_servicer/tests/test_tokenspeed_health_servicer.py`:
- Line 7: The import SimpleNamespace is unused in the test file; remove the
unused import statement "from types import SimpleNamespace" from
grpc_servicer/tests/test_tokenspeed_health_servicer.py (locate the top-level
imports in that file) to clean up the module namespace and update any imports if
necessary, then run the test suite to ensure nothing else depended on it.

In `@grpc_servicer/tests/test_tokenspeed_servicer.py`:
- Around line 529-555: Add a new async test similar to
test_cancel_calls_abort_request but set the request sampling params to n>1 (use
_make_generate_request and adjust req.sampling_params.n = 3 or desired count),
keep fake_engine.generate_fn = never_finish, start servicer.Generate(req, ctx)
and run it via asyncio.create_task(_drain(gen)), cancel the task and await the
CancelledError, then assert that each expanded rid string (e.g., "rid-1-n0",
"rid-1-n1", ..., matching the count) is present in fake_engine.aborted_rids;
this targets the expanded-rid path in Generate and uses the same helpers
(_drain, _make_generate_request, fake_engine.aborted_rids) so the new test
follows the existing pattern.

---

Outside diff comments:
In `@e2e_test/infra/constants.py`:
- Line 57: Update the stale inline comment for the ENV_RUNTIME constant: modify
the comment on ENV_RUNTIME to list the currently supported runtimes (e.g.,
include "tokenspeed" and other accepted values such as "openai", "xai", etc.) or
rephrase it to reference the authoritative enum so it won't drift; locate
ENV_RUNTIME in e2e_test/infra/constants.py and replace the old enumeration
string with the updated list or a short note pointing to the enum.

In `@e2e_test/infra/model_specs.py`:
- Around line 55-86: The dict key "Qwen/Qwen3-30B-A3B" is defined twice in
MODEL_SPECS which causes the first entry (with function_calling/tool_choice) to
be overwritten; merge the two entries into a single "Qwen/Qwen3-30B-A3B" entry
in MODEL_SPECS by combining the features (union of
["chat","streaming","function_calling","tool_choice"] and
["chat","streaming","thinking","reasoning"]) and preserving the vllm_args and
trtllm_extra_config from the second entry as well as the appropriate tp and
model path; remove the duplicate dict literal so
get_models_with_feature("function_calling") / ("tool_choice") will include
Qwen/Qwen3-30B-A3B and the vllm_args/trtllm_extra_config remain in effect.

In `@e2e_test/infra/worker.py`:
- Line 37: Update the inline comment for the engine field to reflect all
supported engine values; change the comment on the 'engine: str' field to
include "tokenspeed" alongside "sglang", "vllm", and "trtllm", and ensure any
mention in _build_cmd's dispatch logic remains consistent with that docstring so
reviewers see the available options match the actual dispatch cases in
_build_cmd.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fc6bc210-4187-4c99-bc26-b7c2ce2eb33a

📥 Commits

Reviewing files that changed from the base of the PR and between f394f60 and 34e6471.

📒 Files selected for processing (15)
  • e2e_test/chat_completions/test_function_calling.py
  • e2e_test/infra/constants.py
  • e2e_test/infra/model_specs.py
  • e2e_test/infra/worker.py
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__init__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • grpc_servicer/tests/__init__.py
  • grpc_servicer/tests/conftest.py
  • grpc_servicer/tests/test_tokenspeed_health_servicer.py
  • grpc_servicer/tests/test_tokenspeed_servicer.py

Comment thread e2e_test/chat_completions/test_function_calling.py Outdated
Comment thread e2e_test/infra/worker.py Outdated
Comment on lines +24 to +38
def main(argv: list[str] | None = None) -> None:
    if argv is None:
        argv = sys.argv[1:]

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(name)s] %(levelname)s %(message)s",
    )

    server_args = prepare_server_args(argv)
    # The scheduler processes will read these env vars; make sure we ran
    # through TokenSpeed's shared env/resource setup path instead of
    # duplicating it here.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    asyncio.run(serve_grpc(server_args))


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check that prepare_server_args isn't eagerly creating a loop (sanity check on TokenSpeed's API shape).
rg -nP -C3 'def prepare_server_args'
rg -nP 'asyncio\.(new_event_loop|get_event_loop|set_event_loop)\b' -C2

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# First, let's find the file mentioned in the review comment
find . -type f -name "__main__.py" | head -20

Repository: lightseekorg/smg

Length of output: 154


🏁 Script executed:

# Search for prepare_server_args function definition across the codebase
rg "def prepare_server_args" -A 20

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# Check for asyncio event loop operations in the entire codebase
rg "asyncio\.(new_event_loop|get_event_loop|set_event_loop)" -B 2 -A 2

Repository: lightseekorg/smg

Length of output: 532


🏁 Script executed:

# Look for uvloop imports
rg "uvloop" -B 2 -A 2

Repository: lightseekorg/smg

Length of output: 1235


🏁 Script executed:

# Read the full __main__.py file to understand imports and structure
cat -n grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py

Repository: lightseekorg/smg

Length of output: 1508


🏁 Script executed:

# Search for prepare_server_args in the tokenspeed module
find . -path "*/tokenspeed/runtime/server_args*" -type f

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# Look for any asyncio.get_event_loop or asyncio.new_event_loop in tokenspeed modules
rg "asyncio\." grpc_servicer/smg_grpc_servicer/tokenspeed/ -A 2

Repository: lightseekorg/smg

Length of output: 1646


🏁 Script executed:

# Search for grpc imports in the codebase
rg "import.*grpc" grpc_servicer/smg_grpc_servicer/tokenspeed/ -B 2 -A 2

Repository: lightseekorg/smg

Length of output: 2361


🏁 Script executed:

# Check the serve_grpc implementation and its imports
cat -n grpc_servicer/smg_grpc_servicer/tokenspeed/server.py | head -50

Repository: lightseekorg/smg

Length of output: 2356


Move asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) to the beginning of main() before importing grpc.aio internals.

The serve_grpc function is imported at the module level (line 21), which indirectly imports the grpc module in server.py (line 23). This happens before asyncio.set_event_loop_policy() is called at line 37. The uvloop policy must be set before gRPC internals are initialized. Move the policy setting to the start of main(), before prepare_server_args() is called.
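One possible shape of that fix, sketched under the assumption that deferring the serve_grpc (and, here, prepare_server_args) imports into main() is acceptable; this is not the module's actual code:

import asyncio
import sys

import uvloop


def main(argv: list[str] | None = None) -> None:
    if argv is None:
        argv = sys.argv[1:]

    # Apply the uvloop policy before anything that imports grpc.aio.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

    # Lazy imports so gRPC internals initialize under the uvloop policy.
    from smg_grpc_servicer.tokenspeed.server import serve_grpc
    from tokenspeed.runtime.server_args import prepare_server_args

    asyncio.run(serve_grpc(prepare_server_args(argv)))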

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 24 - 38,
The uvloop event loop policy is being set after grpc/aio internals (via the
module-level serve_grpc import) are already initialized; move
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) to the very start of
main() before any grpc-related initialization by invoking it before
prepare_server_args(argv) and before any use of serve_grpc/other grpc-related
code so the policy is applied prior to gRPC internals being
imported/initialized.

Comment on lines +34 to +38
    # The scheduler processes will read these env vars; make sure we ran
    # through TokenSpeed's shared env/resource setup path instead of
    # duplicating it here.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    asyncio.run(serve_grpc(server_args))


⚠️ Potential issue | 🟡 Minor

Comment doesn't match the code below it.

The comment on lines 34-36 says "The scheduler processes will read these env vars; make sure we ran through TokenSpeed's shared env/resource setup path instead of duplicating it here" — but the immediately following lines set the uvloop event loop policy and call asyncio.run. No env vars are set or referenced here. The comment appears to be misplaced (or a leftover from an earlier iteration that set env vars between prepare_server_args and asyncio.run). Either move it above prepare_server_args(argv) where the env priming actually occurs (inside that call), or drop it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 34 - 38,
The comment about priming env vars is misplaced: it refers to the env/resource
setup that actually happens in prepare_server_args(argv), but it sits above
asyncio.set_event_loop_policy() and asyncio.run(serve_grpc(server_args)); either
move the comment to directly above the prepare_server_args(argv) call (or inside
prepare_server_args) so it documents the env priming, or remove the comment
entirely if it's stale; adjust text to reference prepare_server_args,
asyncio.set_event_loop_policy, and asyncio.run/serve_grpc to ensure clarity.

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/server.py Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated
Comment thread grpc_servicer/tests/conftest.py Outdated
Comment on lines +1 to +22
"""Pytest configuration for smg-grpc-servicer unit tests.

Adds the parent directory to ``sys.path`` so editable installs work
without needing ``pip install -e``, and declares an asyncio-mode default.
"""

from __future__ import annotations

import pathlib
import sys

import pytest

_HERE = pathlib.Path(__file__).resolve().parent
_PKG_ROOT = _HERE.parent

if str(_PKG_ROOT) not in sys.path:
    sys.path.insert(0, str(_PKG_ROOT))


def pytest_configure(config: pytest.Config) -> None:
    config.addinivalue_line("markers", "tokenspeed: tests that require TokenSpeed")


⚠️ Potential issue | 🟡 Minor

Minor: docstring mentions an asyncio-mode default that isn't declared in this file.

The module docstring says it "declares an asyncio-mode default," but asyncio_mode = "auto" is actually set in grpc_servicer/pyproject.toml under [tool.pytest.ini_options]. Only the tokenspeed marker is registered here. Consider tightening the docstring to reflect the current behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/tests/conftest.py` around lines 1 - 22, Update the module
docstring to accurately describe what this file does: remove or reword the claim
about "declares an asyncio-mode default" since asyncio_mode is configured in
pyproject.toml, and instead state that the file adds the package root to
sys.path and registers the pytest marker via pytest_configure (the function
registering the "tokenspeed" marker); keep the rest of the file unchanged.

Comment thread grpc_servicer/tests/test_tokenspeed_health_servicer.py Outdated
Comment thread grpc_servicer/tests/test_tokenspeed_servicer.py Outdated
@github-actions github-actions Bot added dependencies Dependency updates grpc gRPC client and router changes tests Test changes labels Apr 23, 2026
Comment on lines +251 to +276
try:
    embedding: list[float] | None = None
    prompt_tokens = 0
    async for output in self.async_llm.generate_request(obj):
        # EmbeddingReqInput is non-streaming: the loop yields exactly
        # one dict at finish, carrying the embedding vector.
        embedding = output.get("embedding")
        prompt_tokens = output.get("meta_info", {}).get("prompt_tokens", 0)

    if embedding is None:
        await context.abort(grpc.StatusCode.INTERNAL, "Empty embedding result")
        return

    return sglang_scheduler_pb2.EmbedResponse(
        embedding=list(embedding),
        prompt_tokens=prompt_tokens,
        embedding_dim=len(embedding),
    )
except grpc.aio.AbortError:
    raise
except ValueError as e:
    logger.warning("Embed invalid request %s: %s", rid, e)
    await context.abort(grpc.StatusCode.INVALID_ARGUMENT, str(e))
except Exception as e:
    logger.exception("Embed failed for request %s", rid)
    await context.abort(grpc.StatusCode.INTERNAL, str(e))


🟡 Nit: Embed is missing asyncio.CancelledError handling and finally cleanup, unlike Generate (lines 198–221). If a client disconnects or times out while the async for loop is awaiting the scheduler, CancelledError propagates unhandled and the rid leaks in rid_to_state. For embeddings this is unlikely to matter in practice (single forward pass, fast), but the inconsistency could bite once longer-running embedding models are onboarded.

Minimal fix — mirror the Generate pattern:

        try:
            ...
        except asyncio.CancelledError:
            self.async_llm.abort_request(rid)
            raise
        except grpc.aio.AbortError:
            raise
        ...
        finally:
            state = self.async_llm.rid_to_state.get(rid)
            if state is not None and not getattr(state, "finished", False):
                self.async_llm.abort_request(rid)

@yetone yetone force-pushed the feat/grpc-servicer-tokenspeed branch from 34e6471 to 9056d9e Compare April 23, 2026 15:26
Comment thread e2e_test/infra/model_specs.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9056d9ef25

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +363 to +366
known = rid in self.async_llm.rid_to_state
try:
    self.async_llm.abort_request(rid)
    return sglang_scheduler_pb2.AbortResponse(


P1: Abort all child RIDs for n>1 requests

When sampling_params.n > 1, _build_generate_req rewrites the request into per-choice IDs (<request_id>-n*), but Abort still checks and aborts only the original request_id. In that case known becomes false and the engine never receives aborts for the active child requests, so explicit aborts (for example router timeout/cleanup paths) report "not found" and generation continues consuming GPU work until natural completion.
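A hedged sketch of that sweep, assuming an AsyncLLM-like engine exposing rid_to_state and abort_request plus the "{rid}-n{i}" child naming described above; this is not the servicer's actual Abort body:

def abort_request_and_children(async_llm, request_id: str) -> bool:
    """Abort the original rid plus any per-choice children minted for n > 1."""
    matching = [
        rid
        for rid in list(async_llm.rid_to_state)
        if rid == request_id or rid.startswith(f"{request_id}-n")
    ]
    for rid in matching:
        async_llm.abort_request(rid)
    return bool(matching)  # mirrors the `known` flag reported in AbortResponse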


@yetone yetone force-pushed the feat/grpc-servicer-tokenspeed branch from 9056d9e to 7aca171 Compare April 23, 2026 16:28
@yetone yetone requested a review from gongwei-130 as a code owner April 23, 2026 16:28
@github-actions github-actions Bot added the ci CI/CD configuration changes label Apr 23, 2026
@yetone
Collaborator Author

yetone commented Apr 23, 2026

Review addressed — pushed 7aca1716 (force-push).

Fixed (from CodeRabbit review)

  • 🔴 model_specs.py duplicate key (lines 55-59 + 80-86). I had added a second Qwen/Qwen3-30B-A3B entry instead of merging, so Python kept the later (thinking/reasoning) one and silently dropped the function_calling + tool_choice features. This was also the python-lint / pre-commit CI failure (ruff F601). Now merged into the existing entry with the union of features.
  • 🟠 P1: Abort sweep for n>1 child rids. Previously Abort only hit the original request_id; for n>1 requests Generate minted {rid}-n{i} children that kept consuming GPU work on client-driven abort. Abort now sweeps the original and every {rid}-n* child.
  • P1: Cancel path also sweeps n>1 children. Same class of bug in the except asyncio.CancelledError branch in Generate — fixed alongside a new test_cancel_aborts_all_n_children unit test.
  • 🟡 Embed missing CancelledError / finally cleanup. Now mirrors Generate's cleanup pattern so a client disconnect during an embed call doesn't leak the rid in rid_to_state.
  • Stale docstrings (constants.py, worker.py, test_function_calling.py class-level comment/docstring, "Llama-3.2-1B is flaky" comment).
  • Unused SimpleNamespace import in test_tokenspeed_health_servicer.py.
  • Ruff UP038 issues (PEP 604 isinstance union style) in servicer.py.
  • Ruff format applied to all touched files.

Fixed (from e2e-1gpu-chat CI failures)

  • Model swap → OOM on 80 GB H100 CI: reviewer correctly predicted Qwen/Qwen3-30B-A3B (~66 GB bf16 weights alone) wouldn't fit. Switched TestOpenAIServerFunctionCalling to Qwen/Qwen3-4B (dense, still Qwen3ForCausalLM — TokenSpeed-supported). This also unbreaks the sglang / vllm / trtllm variants of the same test class that the 30B swap had inadvertently broken.
  • Kept Qwen/Qwen3-30B-A3B available in model_specs.py (for the thinking/reasoning test classes that already used it) — the duplicate was the only bug.

Added — TokenSpeed in GitHub CI

Missing from the first push (my bad). Wired up end-to-end:

  • scripts/ci_install_tokenspeed.sh — clones TokenSpeed, builds kernel (CUDA) + scheduler (C++/CMake/nanobind) + Python runtime, caches the built kernel wheel at /tmp/tokenspeed-wheel-cache/ to skip the ~30-min CUDA compile on subsequent runs. Mirrors the install-from-source + wheel-cache pattern already used for ci_install_trtllm.sh.
  • .github/actions/setup-tokenspeed/action.yml — composite action wiring the install script alongside the venv setup.
  • .github/workflows/e2e-gpu-job.yml — new Setup TokenSpeed backend step gated on inputs.engine == 'tokenspeed', plus a docstring update.
  • .github/workflows/pr-test-rust.yml:
    • e2e-1gpu-chat matrix: added tokenspeed (60-min timeout, matches TRT-LLM's pattern for build-from-source engines).
    • detect-changes paths-filter: added scripts/ci_install_tokenspeed.sh to common and chat-completions filters so edits to the install script re-trigger the test.

Servicer fallback dropped

Servicer / scheduler_launcher no longer dual-mode import. Target: only the post-Phase-D.3 AsyncLLM (standalone class, not the legacy TokenizerManager wrapper). This matches the e2e evidence from the prior asciinema (https://asciinema.org/a/3pl8PKYo92DnSA68) where AsyncLLM.__bases__ = (EngineClient, TokenizerCommunicatorMixin).

What I did NOT change (out of scope for this PR)

  • TestToolChoiceLlama still uses meta-llama/Llama-3.2-1B-Instruct + llama parser — baseline Llama regression coverage preserved for sglang/vllm/trtllm. This class is NOT marked to run on tokenspeed because TokenSpeed's registry doesn't include LlamaForCausalLM; flip that marker back in a follow-up once TokenSpeed registers it.
  • smg gateway's tool_choice="required" / tool_choice="specific" constraint translation — vLLM fails these same tests identically on the same model, so the failure is upstream of the engine and belongs in a separate smg-gateway PR.

async def generate_request(self, obj):
    # Record the request so tests can assert on what was forwarded.
    rid = getattr(obj, "rid", None) or "no-rid"
    self.rid_to_state[rid] = _FakeState()


🔴 Important: test_cancel_aborts_all_n_children will crash here with TypeError: unhashable type: 'list' before it ever reaches the cancel path.

When sampling_params.n = 3, _build_generate_req sets obj.rid = ["rid-1-n0", "rid-1-n1", "rid-1-n2"]. This line then tries to use that list as a dict key (self.rid_to_state[rid]), which is illegal. The resulting TypeError propagates into Generate's except Exception handler → context.abort(INTERNAL, ...) → the test gets _FakeAbortError instead of CancelledError, and the assertion fails.

Fix: handle list rids the same way the real AsyncLLM does — register each child rid individually:

Suggested change
-self.rid_to_state[rid] = _FakeState()
+rid = getattr(obj, "rid", None) or "no-rid"
+if isinstance(rid, list):
+    for r in rid:
+        self.rid_to_state[r] = _FakeState()
+else:
+    self.rid_to_state[rid] = _FakeState()

Comment thread scripts/ci_install_tokenspeed.sh Outdated
Comment on lines +83 to +89
# Cache the built wheel — uv stores wheels under its cache, copy out.
mkdir -p "$WHEEL_CACHE"
python3 -c "import tokenspeed_kernel, os, shutil, glob; \
d = os.path.dirname(tokenspeed_kernel.__file__); \
site = os.path.dirname(d); \
whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
print('kernel install dir:', whls)" || true


🟡 Nit: The wheel cache write is a no-op — this block creates $WHEEL_CACHE and prints the dist-info path, but never actually copies a .whl file into the cache directory. Every CI run will fall through to the 30-minute source build because find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" on line 75 will always come up empty.

uv pip install from a local path doesn't produce a standalone wheel in the site-packages. You'd need to build the wheel explicitly first (e.g. uv pip wheel tokenspeed-kernel/python/ -w "$WHEEL_CACHE" --no-build-isolation) and then install from the cached artifact, or use uv's own cache directory (uv cache dir) to locate the built wheel and copy it out.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7aca171673

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread scripts/ci_install_tokenspeed.sh Outdated
# Pin to a tested TokenSpeed ref so CI is reproducible. Bump explicitly when
# we want a newer runtime; keeping it pinned avoids surprise breakage when
# TokenSpeed main moves ahead of what our gRPC servicer was verified against.
TOKENSPEED_REF="${TOKENSPEED_REF:-main}"


P2: Pin TokenSpeed install to an immutable git ref

Defaulting TOKENSPEED_REF to main makes CI non-reproducible: the same SMG commit can build against different TokenSpeed revisions on different days, which can introduce unrelated breakages or behavior drift in the gRPC servicer tests. Because this script is used in PR and E2E setup, a moving default branch undermines bisectability and makes failures hard to attribute to SMG changes.


Comment thread scripts/ci_install_tokenspeed.sh Outdated
Comment on lines +89 to +95
print('kernel install dir:', whls)" || true
fi

# Step 4: scheduler (scikit-build-core + nanobind + CMake).
echo "Building tokenspeed-scheduler..."
uv pip install tokenspeed-scheduler/



P2: Write the built kernel wheel into the configured cache

The cache read path looks for tokenspeed_kernel-*.whl in WHEEL_CACHE, but after a source build this branch only prints install metadata and never copies a wheel into that directory. As a result, later runs miss the cache and rebuild the CUDA kernel every time, which materially increases CI runtime and can push the TokenSpeed lane into timeout territory.


@yetone yetone force-pushed the feat/grpc-servicer-tokenspeed branch from 7aca171 to d264a5b Compare April 23, 2026 16:38

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
e2e_test/chat_completions/test_function_calling.py (1)

25-37: ⚠️ Potential issue | 🟡 Minor

Remove unused LLAMA_SYSTEM_MESSAGE constant and stale comment.

LLAMA_SYSTEM_MESSAGE is defined at line 28 but never referenced anywhere in the codebase. The comment claiming it is "Used by TestToolChoiceLlama below" is inaccurate—TestToolChoiceLlama (extending _TestToolChoiceBase) uses get_test_messages() instead, which does not reference this constant. Delete the constant and its associated comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/chat_completions/test_function_calling.py` around lines 25 - 37,
Delete the unused LLAMA_SYSTEM_MESSAGE constant and its preceding stale comment
block from the test file; specifically remove the LLAMA_SYSTEM_MESSAGE
definition and the comment claiming it's "Used by TestToolChoiceLlama", and
verify there are no remaining references to LLAMA_SYSTEM_MESSAGE anywhere
(confirm TestToolChoiceLlama/_TestToolChoiceBase and get_test_messages() are
unchanged and do not rely on this constant).
♻️ Duplicate comments (4)
e2e_test/infra/worker.py (1)

441-442: ⚠️ Potential issue | 🟡 Minor

Docstring still enumerates only three engines.

start_workers docstring lists engines as "sglang", "vllm", "trtllm"; please include "tokenspeed" to match the _build_cmd dispatch and the Worker.engine field comment on line 37.

📝 Suggested fix
-        engine: Runtime engine ("sglang", "vllm", "trtllm").
+        engine: Runtime engine ("sglang", "vllm", "trtllm", or "tokenspeed").
                 If None, auto-detected from E2E_RUNTIME env var.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/infra/worker.py` around lines 441 - 442, The start_workers docstring
enumerates only "sglang", "vllm", "trtllm" but omits "tokenspeed", causing a
mismatch with the _build_cmd dispatch and the Worker.engine field; update the
start_workers docstring to include "tokenspeed" among the supported Runtime
engine options so it matches the _build_cmd function's cases and the
Worker.engine comment (see start_workers docstring, _build_cmd, Worker.engine).
grpc_servicer/tests/conftest.py (1)

1-5: ⚠️ Potential issue | 🟡 Minor

Docstring still references an asyncio-mode default not declared in this file.

asyncio_mode is configured in grpc_servicer/pyproject.toml under [tool.pytest.ini_options], not here. This file only prepends the package root to sys.path and registers the tokenspeed marker.

📝 Suggested fix
 """Pytest configuration for smg-grpc-servicer unit tests.

-Adds the parent directory to ``sys.path`` so editable installs work
-without needing ``pip install -e``, and declares an asyncio-mode default.
+Adds the parent directory to ``sys.path`` so tests work without an
+editable install, and registers custom pytest markers (e.g. ``tokenspeed``).
 """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/tests/conftest.py` around lines 1 - 5, The module docstring
incorrectly states it "declares an asyncio-mode default"; update the docstring
to accurately describe what this file does: mention it prepends the package root
to sys.path to support editable installs and registers the "tokenspeed" pytest
marker, and remove the claim about configuring asyncio_mode (which is set in
pyproject.toml under [tool.pytest.ini_options]). Keep the description concise
and consistent with the actual behavior (sys.path modification and marker
registration).
grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py (2)

33-38: ⚠️ Potential issue | 🟡 Minor

Move or remove the env/resource setup comment.

Lines 34-36 document the side effects of prepare_server_args(argv), but they sit above uvloop setup and asyncio.run(...). Put the comment directly above line 33 or drop it.

Proposed fix
-    server_args = prepare_server_args(argv)
     # The scheduler processes will read these env vars; make sure we ran
     # through TokenSpeed's shared env/resource setup path instead of
     # duplicating it here.
+    server_args = prepare_server_args(argv)
     asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 33 - 38,
The comment about env/resource setup belongs with prepare_server_args(argv) but
is currently above asyncio setup; move that comment so it immediately precedes
the prepare_server_args(argv) call (or remove it entirely if redundant) to avoid
implying it documents the uvloop/asyncio.run calls; update or relocate the
comment near the prepare_server_args symbol and leave
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) and
asyncio.run(serve_grpc(server_args)) untouched.

18-21: ⚠️ Potential issue | 🟠 Major

Set the uvloop policy before importing the gRPC server module.

serve_grpc is imported at module load, which imports server.py and grpc before line 37 applies the uvloop policy. Make serve_grpc a lazy import after asyncio.set_event_loop_policy(...); moving line 37 alone is not sufficient.

Proposed fix
 import uvloop
 from tokenspeed.runtime.server_args import prepare_server_args
 
-from smg_grpc_servicer.tokenspeed.server import serve_grpc
-
 
 def main(argv: list[str] | None = None) -> None:
     if argv is None:
         argv = sys.argv[1:]
@@
 
     server_args = prepare_server_args(argv)
-    # The scheduler processes will read these env vars; make sure we ran
-    # through TokenSpeed's shared env/resource setup path instead of
-    # duplicating it here.
     asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
+    from smg_grpc_servicer.tokenspeed.server import serve_grpc
+
     asyncio.run(serve_grpc(server_args))

Verification:

#!/bin/bash
# Expect: no module-level import of serve_grpc before set_event_loop_policy.
rg -nP 'from smg_grpc_servicer\.tokenspeed\.server import serve_grpc|asyncio\.set_event_loop_policy|import grpc' \
  grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py \
  grpc_servicer/smg_grpc_servicer/tokenspeed/server.py

Also applies to: 33-38

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 18 - 21,
The module currently imports serve_grpc from smg_grpc_servicer.tokenspeed.server
at import time which causes grpc (and its event loop behavior) to be imported
before the uvloop policy is set; change this to a lazy import by removing the
top-level "from smg_grpc_servicer.tokenspeed.server import serve_grpc" and
instead import serve_grpc inside the main entrypoint after calling
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) (ensure uvloop is set
via uvloop.install() or asyncio.set_event_loop_policy(...) first), so that
serve_grpc and any grpc imports occur only after the uvloop policy is applied;
verify no other module-level imports of serve_grpc or grpc remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e_test/infra/model_specs.py`:
- Around line 71-75: The comment above the "Qwen/Qwen3-30B-A3B" model spec is
stale and should be updated to reflect that TestOpenAIServerFunctionCalling now
routes through "Qwen/Qwen3-4B"; edit the comment near the "Qwen/Qwen3-30B-A3B"
entry in e2e_test/infra/model_specs.py to remove or replace the reference to
TokenSpeed using this 30B model for TestOpenAIServerFunctionCalling and instead
note that the test uses "Qwen/Qwen3-4B" so future maintainers won't be directed
to the OOM-prone 30B model.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/server.py`:
- Around line 162-205: The health flag is being set unconditionally; change the
flow so health_servicer.set_serving() is only called when the warmup actually
succeeded: for Generate ensure final is not None and final.HasField("complete"),
for Embed ensure resp.embedding_dim > 0; on failure (exceptions, missing
Complete, or empty embedding) log the error/warning and do not call
health_servicer.set_serving(). Keep the existing try/except/finally structure
and still close the channel in finally, but move the
health_servicer.set_serving() into the success branches (or gate it behind a
boolean like warmup_succeeded set true only on those success conditions).

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 386-404: The current child_prefix logic can match unrelated IDs
and iterates the live state map; fix by snapshotting the keys (e.g., keys =
list(self.async_llm.rid_to_state)) and computing targets as those equal to rid or
that start with child_prefix AND have only digits after the prefix (validate the
suffix with isdigit), then call self.async_llm.abort_request(r) for each target;
this ensures only true `{rid}-n<int>` children are aborted and avoids
concurrent-mutation issues when iterating state_map. A sketch of this
snapshot-and-filter approach follows at the end of this inline-comments list.

In `@grpc_servicer/tests/test_tokenspeed_servicer.py`:
- Around line 167-179: The test fails because generate_request treats obj.rid as
a hashable key while _build_generate_req may expand obj.rid into a list when
sampling_params.n > 1; update generate_request (and the corresponding later
block around the same logic) to detect if rid is a list/iterable and register a
_FakeState for each child rid in self.rid_to_state (e.g., iterate over obj.rid
when it's a list and assign self.rid_to_state[child_rid] = _FakeState()), yield
outputs the same way, and set each child's .finished = True at the end so
cancellation/never_finish logic sees the per-child started/finished states.
Ensure you reference generate_request and self.rid_to_state when making the
change.

In `@scripts/ci_install_tokenspeed.sh`:
- Around line 24-30: The script currently defaults TOKENSPEED_REF to "main",
which breaks reproducibility; change the logic in ci_install_tokenspeed.sh so
TOKENSPEED_REF is not defaulted to a moving branch but required: remove the
default assignment for TOKENSPEED_REF="${TOKENSPEED_REF:-main}" and add an
explicit validation that exits with a non‑zero status if TOKENSPEED_REF is empty
or equals "main", prompting the caller to supply a pinned commit SHA or tag;
update any surrounding comment to state that a pinned ref is required.
- Around line 74-90: The build branch never writes a wheel into WHEEL_CACHE so
subsequent CI runs rebuild; change the build/install flow so the kernel wheel is
created into the cache and then installed from that cached file: when building
tokenspeed-kernel (the branch that currently runs uv pip install
tokenspeed-kernel/python/ and the python3 inspection that prints kernel install
dir), run pip wheel --wheel-dir "$WHEEL_CACHE" (or otherwise copy the produced
wheel into $WHEEL_CACHE) so a tokenspeed_kernel-*.whl appears in $WHEEL_CACHE,
then set CACHED_KERNEL_WHEEL to that file and install it (uv pip install
"$CACHED_KERNEL_WHEEL" or pip install "$CACHED_KERNEL_WHEEL") instead of leaving
the artifact only in the uv cache; ensure the existing CACHED_KERNEL_WHEEL
lookup, MAX_JOBS, FLASHINFER_CUDA_ARCH_LIST and tokenspeed-kernel reference
remain consistent.

---

Outside diff comments:
In `@e2e_test/chat_completions/test_function_calling.py`:
- Around line 25-37: Delete the unused LLAMA_SYSTEM_MESSAGE constant and its
preceding stale comment block from the test file; specifically remove the
LLAMA_SYSTEM_MESSAGE definition and the comment claiming it's "Used by
TestToolChoiceLlama", and verify there are no remaining references to
LLAMA_SYSTEM_MESSAGE anywhere (confirm TestToolChoiceLlama/_TestToolChoiceBase
and get_test_messages() are unchanged and do not rely on this constant).

---

Duplicate comments:
In `@e2e_test/infra/worker.py`:
- Around line 441-442: The start_workers docstring enumerates only "sglang",
"vllm", "trtllm" but omits "tokenspeed", causing a mismatch with the _build_cmd
dispatch and the Worker.engine field; update the start_workers docstring to
include "tokenspeed" among the supported Runtime engine options so it matches
the _build_cmd function's cases and the Worker.engine comment (see start_workers
docstring, _build_cmd, Worker.engine).

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py`:
- Around line 33-38: The comment about env/resource setup belongs with
prepare_server_args(argv) but is currently above asyncio setup; move that
comment so it immediately precedes the prepare_server_args(argv) call (or remove
it entirely if redundant) to avoid implying it documents the uvloop/asyncio.run
calls; update or relocate the comment near the prepare_server_args symbol and
leave asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) and
asyncio.run(serve_grpc(server_args)) untouched.
- Around line 18-21: The module currently imports serve_grpc from
smg_grpc_servicer.tokenspeed.server at import time which causes grpc (and its
event loop behavior) to be imported before the uvloop policy is set; change this
to a lazy import by removing the top-level "from
smg_grpc_servicer.tokenspeed.server import serve_grpc" and instead import
serve_grpc inside the main entrypoint after calling
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) (ensure uvloop is set
via uvloop.install() or asyncio.set_event_loop_policy(...) first), so that
serve_grpc and any grpc imports occur only after the uvloop policy is applied;
verify no other module-level imports of serve_grpc or grpc remain.

In `@grpc_servicer/tests/conftest.py`:
- Around line 1-5: The module docstring incorrectly states it "declares an
asyncio-mode default"; update the docstring to accurately describe what this
file does: mention it prepends the package root to sys.path to support editable
installs and registers the "tokenspeed" pytest marker, and remove the claim
about configuring asyncio_mode (which is set in pyproject.toml under
[tool.pytest.ini_options]). Keep the description concise and consistent with the
actual behavior (sys.path modification and marker registration).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9bc812b8-569c-44c4-9d4e-7bad5a192d00

📥 Commits

Reviewing files that changed from the base of the PR and between 34e6471 and 7aca171.

📒 Files selected for processing (19)
  • .github/actions/setup-tokenspeed/action.yml
  • .github/workflows/e2e-gpu-job.yml
  • .github/workflows/pr-test-rust.yml
  • e2e_test/chat_completions/test_function_calling.py
  • e2e_test/infra/constants.py
  • e2e_test/infra/model_specs.py
  • e2e_test/infra/worker.py
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__init__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • grpc_servicer/tests/__init__.py
  • grpc_servicer/tests/conftest.py
  • grpc_servicer/tests/test_tokenspeed_health_servicer.py
  • grpc_servicer/tests/test_tokenspeed_servicer.py
  • scripts/ci_install_tokenspeed.sh

Comment thread e2e_test/infra/model_specs.py Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/server.py Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated
Comment on lines +167 to +179
async def generate_request(self, obj):
# Record the request so tests can assert on what was forwarded.
rid = getattr(obj, "rid", None) or "no-rid"
self.rid_to_state[rid] = _FakeState()
if self.generate_fn is not None:
async for out in self.generate_fn(obj):
self.last_receive_tstamp = 9999.0 # anything > tic
yield out
return
for out in self.outputs:
self.last_receive_tstamp = 9999.0
yield out
self.rid_to_state[rid].finished = True

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Expect: fake engine normalizes list rids before using them as dict keys.
sed -n '167,180p' grpc_servicer/tests/test_tokenspeed_servicer.py
sed -n '561,592p' grpc_servicer/tests/test_tokenspeed_servicer.py

Repository: lightseekorg/smg

Length of output: 1876


🏁 Script executed:

# Find _build_generate_req definition to confirm it expands rid to list
rg "_build_generate_req" grpc_servicer/tests/test_tokenspeed_servicer.py -A 20 -B 2

Repository: lightseekorg/smg

Length of output: 2304


🏁 Script executed:

# Check servicer.Generate to see how it calls fake_engine.generate_request
rg "class TokenSpeedSchedulerServicer" grpc_servicer -A 50 | head -80

Repository: lightseekorg/smg

Length of output: 4936


🏁 Script executed:

# Look for the actual Generate method implementation
fd "servicer.py" grpc_servicer --type f | xargs rg "def Generate" -A 30 | head -100

Repository: lightseekorg/smg

Length of output: 8735


🏁 Script executed:

# Find _build_generate_req implementation
rg "def _build_generate_req" grpc_servicer/tests/test_tokenspeed_servicer.py -A 50

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# Also check if there's _build_generate_req in the actual servicer
fd "servicer.py" grpc_servicer/smg_grpc_servicer/tokenspeed --type f -exec rg "def _build_generate_req" {} -A 50 \;

Repository: lightseekorg/smg

Length of output: 290


🏁 Script executed:

# Check if rid is ever used as dict key elsewhere in the test to understand how it's meant to work
rg "rid_to_state\[" grpc_servicer/tests/test_tokenspeed_servicer.py

Repository: lightseekorg/smg

Length of output: 439


🏁 Script executed:

# Search for _build_generate_req more carefully
rg "_build_generate_req" grpc_servicer/tests/test_tokenspeed_servicer.py -B 2 -A 30

Repository: lightseekorg/smg

Length of output: 2606


🏁 Script executed:

# Also search in the actual servicer implementation
fd "servicer.py" grpc_servicer/smg_grpc_servicer/tokenspeed --type f | head -1 | xargs cat | grep -n "_build_generate_req" -A 30

Repository: lightseekorg/smg

Length of output: 42


🏁 Script executed:

# Look for where rid is expanded/created as a list
rg "rid.*\[.*n0" grpc_servicer/tests/test_tokenspeed_servicer.py

Repository: lightseekorg/smg

Length of output: 136


🏁 Script executed:

# Look for _build_generate_req in the actual servicer file
rg "def _build_generate_req" grpc_servicer/smg_grpc_servicer/tokenspeed/ -A 80

Repository: lightseekorg/smg

Length of output: 8350


🏁 Script executed:

# Check what the test says about how rid gets converted
rg "rid-1-n" grpc_servicer/tests/test_tokenspeed_servicer.py -B 5 -A 5

Repository: lightseekorg/smg

Length of output: 1300


Teach the fake engine to handle expanded rid lists.

When sampling_params.n > 1, _build_generate_req expands obj.rid to a list (e.g., ["rid-1-n0", "rid-1-n1", "rid-1-n2"]), but line 170 attempts to use it as a dict key. This raises TypeError: unhashable type: 'list' before never_finish() signals started, causing test_cancel_aborts_all_n_children to fail immediately instead of exercising cancellation logic.

Proposed fix
     async def generate_request(self, obj):
         # Record the request so tests can assert on what was forwarded.
         rid = getattr(obj, "rid", None) or "no-rid"
+        rids = rid if isinstance(rid, list) else [rid]
+        for one_rid in rids:
+            self.rid_to_state[one_rid] = _FakeState()
-        self.rid_to_state[rid] = _FakeState()
         if self.generate_fn is not None:
             async for out in self.generate_fn(obj):
                 self.last_receive_tstamp = 9999.0  # anything > tic
                 yield out
             return
         for out in self.outputs:
             self.last_receive_tstamp = 9999.0
             yield out
+        for one_rid in rids:
+            if one_rid in self.rid_to_state:
+                self.rid_to_state[one_rid].finished = True
-        self.rid_to_state[rid].finished = True

Also applies to: 561-592 (corresponding update needed at line 181 for consistency).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/tests/test_tokenspeed_servicer.py` around lines 167 - 179, The
test fails because generate_request treats obj.rid as a hashable key while
_build_generate_req may expand obj.rid into a list when sampling_params.n > 1;
update generate_request (and the corresponding later block around the same
logic) to detect if rid is a list/iterable and register a _FakeState for each
child rid in self.rid_to_state (e.g., iterate over obj.rid when it's a list and
assign self.rid_to_state[child_rid] = _FakeState()), yield outputs the same way,
and set each child's .finished = True at the end so cancellation/never_finish
logic sees the per-child started/finished states. Ensure you reference
generate_request and self.rid_to_state when making the change.

Comment thread scripts/ci_install_tokenspeed.sh Outdated
Comment on lines +24 to +30
# Pin to a tested TokenSpeed ref so CI is reproducible. Bump explicitly when
# we want a newer runtime; keeping it pinned avoids surprise breakage when
# TokenSpeed main moves ahead of what our gRPC servicer was verified against.
TOKENSPEED_REF="${TOKENSPEED_REF:-main}"
TOKENSPEED_REPO="${TOKENSPEED_REPO:-https://github.com/lightseekorg/tokenspeed.git}"
TOKENSPEED_DIR="${TOKENSPEED_DIR:-/tmp/tokenspeed-src}"
WHEEL_CACHE="${TOKENSPEED_WHEEL_CACHE:-/tmp/tokenspeed-wheel-cache}"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat scripts/ci_install_tokenspeed.sh

Repository: lightseekorg/smg

Length of output: 5294


Defaulting TOKENSPEED_REF to "main" contradicts the reproducibility goal.

Line 27 defaults to a moving branch despite the comment stating this is pinned for reproducibility. Upstream TokenSpeed changes can break CI or silently alter runtime behavior without any changes to this repository.

Require explicit pinning to a commit SHA or tag instead:

Proposed fix
-# Pin to a tested TokenSpeed ref so CI is reproducible. Bump explicitly when
-# we want a newer runtime; keeping it pinned avoids surprise breakage when
-# TokenSpeed main moves ahead of what our gRPC servicer was verified against.
-TOKENSPEED_REF="${TOKENSPEED_REF:-main}"
+# Pin to a tested TokenSpeed commit/tag so CI is reproducible. Bump explicitly
+# when we want a newer runtime.
+: "${TOKENSPEED_REF:?Set TOKENSPEED_REF to the tested TokenSpeed commit SHA or tag}"
+if [ "$TOKENSPEED_REF" = "main" ]; then
+    echo "TOKENSPEED_REF must be a pinned commit SHA or tag, not main" >&2
+    exit 1
+fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Pin to a tested TokenSpeed ref so CI is reproducible. Bump explicitly when
# we want a newer runtime; keeping it pinned avoids surprise breakage when
# TokenSpeed main moves ahead of what our gRPC servicer was verified against.
TOKENSPEED_REF="${TOKENSPEED_REF:-main}"
TOKENSPEED_REPO="${TOKENSPEED_REPO:-https://github.com/lightseekorg/tokenspeed.git}"
TOKENSPEED_DIR="${TOKENSPEED_DIR:-/tmp/tokenspeed-src}"
WHEEL_CACHE="${TOKENSPEED_WHEEL_CACHE:-/tmp/tokenspeed-wheel-cache}"
# Pin to a tested TokenSpeed commit/tag so CI is reproducible. Bump explicitly
# when we want a newer runtime.
: "${TOKENSPEED_REF:?Set TOKENSPEED_REF to the tested TokenSpeed commit SHA or tag}"
if [ "$TOKENSPEED_REF" = "main" ]; then
echo "TOKENSPEED_REF must be a pinned commit SHA or tag, not main" >&2
exit 1
fi
TOKENSPEED_REPO="${TOKENSPEED_REPO:-https://github.com/lightseekorg/tokenspeed.git}"
TOKENSPEED_DIR="${TOKENSPEED_DIR:-/tmp/tokenspeed-src}"
WHEEL_CACHE="${TOKENSPEED_WHEEL_CACHE:-/tmp/tokenspeed-wheel-cache}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/ci_install_tokenspeed.sh` around lines 24 - 30, The script currently
defaults TOKENSPEED_REF to "main", which breaks reproducibility; change the
logic in ci_install_tokenspeed.sh so TOKENSPEED_REF is not defaulted to a moving
branch but required: remove the default assignment for
TOKENSPEED_REF="${TOKENSPEED_REF:-main}" and add an explicit validation that
exits with a non‑zero status if TOKENSPEED_REF is empty or equals "main",
prompting the caller to supply a pinned commit SHA or tag; update any
surrounding comment to state that a pinned ref is required.

Comment thread scripts/ci_install_tokenspeed.sh Outdated
Comment on lines +74 to +90
# Step 3: kernel (CUDA compile — the expensive one). Try the cached wheel first.
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" 2>/dev/null | head -1 || true)
if [ -n "$CACHED_KERNEL_WHEEL" ] && [ -f "$CACHED_KERNEL_WHEEL" ]; then
echo "Installing cached tokenspeed-kernel wheel: $CACHED_KERNEL_WHEEL"
uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
else
echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
uv pip install tokenspeed-kernel/python/ --no-build-isolation
# Cache the built wheel — uv stores wheels under its cache, copy out.
mkdir -p "$WHEEL_CACHE"
python3 -c "import tokenspeed_kernel, os, shutil, glob; \
d = os.path.dirname(tokenspeed_kernel.__file__); \
site = os.path.dirname(d); \
whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
print('kernel install dir:', whls)" || true
fi

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Expect: the build branch creates/copies a tokenspeed_kernel wheel into WHEEL_CACHE.
rg -n 'WHEEL_CACHE|tokenspeed_kernel-\*\.whl|pip wheel|uv pip install tokenspeed-kernel' scripts/ci_install_tokenspeed.sh

Repository: lightseekorg/smg

Length of output: 420


🏁 Script executed:

sed -n '74,95p' scripts/ci_install_tokenspeed.sh

Repository: lightseekorg/smg

Length of output: 1226


Fix cache population: the build branch never writes wheels to $WHEEL_CACHE.

The Python script on lines 85–89 only prints the kernel install directory but doesn't copy the wheel into the cache. This means the first CI run builds from source and wastes the build; the second CI run rebuilds again (defeating the intended speedup).

Replace uv pip install with pip wheel --wheel-dir "$WHEEL_CACHE" to build the wheel directly into the cache, then install from the cached file:

Proposed fix
 else
     echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
-    MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
-        uv pip install tokenspeed-kernel/python/ --no-build-isolation
-    # Cache the built wheel — uv stores wheels under its cache, copy out.
     mkdir -p "$WHEEL_CACHE"
-    python3 -c "import tokenspeed_kernel, os, shutil, glob; \
-        d = os.path.dirname(tokenspeed_kernel.__file__); \
-        site = os.path.dirname(d); \
-        whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
-        print('kernel install dir:', whls)" || true
+    MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
+        python3 -m pip wheel tokenspeed-kernel/python/ \
+            --no-build-isolation \
+            --wheel-dir "$WHEEL_CACHE"
+    CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" | head -1)
+    uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Step 3: kernel (CUDA compile — the expensive one). Try the cached wheel first.
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" 2>/dev/null | head -1 || true)
if [ -n "$CACHED_KERNEL_WHEEL" ] && [ -f "$CACHED_KERNEL_WHEEL" ]; then
echo "Installing cached tokenspeed-kernel wheel: $CACHED_KERNEL_WHEEL"
uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
else
echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
uv pip install tokenspeed-kernel/python/ --no-build-isolation
# Cache the built wheel — uv stores wheels under its cache, copy out.
mkdir -p "$WHEEL_CACHE"
python3 -c "import tokenspeed_kernel, os, shutil, glob; \
d = os.path.dirname(tokenspeed_kernel.__file__); \
site = os.path.dirname(d); \
whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
print('kernel install dir:', whls)" || true
fi
# Step 3: kernel (CUDA compile — the expensive one). Try the cached wheel first.
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" 2>/dev/null | head -1 || true)
if [ -n "$CACHED_KERNEL_WHEEL" ] && [ -f "$CACHED_KERNEL_WHEEL" ]; then
echo "Installing cached tokenspeed-kernel wheel: $CACHED_KERNEL_WHEEL"
uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
else
echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
mkdir -p "$WHEEL_CACHE"
MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
python3 -m pip wheel tokenspeed-kernel/python/ \
--no-build-isolation \
--wheel-dir "$WHEEL_CACHE"
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" | head -1)
uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/ci_install_tokenspeed.sh` around lines 74 - 90, The build branch
never writes a wheel into WHEEL_CACHE so subsequent CI runs rebuild; change the
build/install flow so the kernel wheel is created into the cache and then
installed from that cached file: when building tokenspeed-kernel (the branch
that currently runs uv pip install tokenspeed-kernel/python/ and the python3
inspection that prints kernel install dir), run pip wheel --wheel-dir
"$WHEEL_CACHE" (or otherwise copy the produced wheel into $WHEEL_CACHE) so a
tokenspeed_kernel-*.whl appears in $WHEEL_CACHE, then set CACHED_KERNEL_WHEEL to
that file and install it (uv pip install "$CACHED_KERNEL_WHEEL" or pip install
"$CACHED_KERNEL_WHEEL") instead of leaving the artifact only in the uv cache;
ensure the existing CACHED_KERNEL_WHEEL lookup, MAX_JOBS,
FLASHINFER_CUDA_ARCH_LIST and tokenspeed-kernel reference remain consistent.

Comment thread e2e_test/infra/model_specs.py Outdated
Comment on lines +72 to +74
# Also used by TestOpenAIServerFunctionCalling as a TokenSpeed-supported
# substitute while TokenSpeed's model registry does not cover plain
# LlamaForCausalLM (only MoE / Eagle3 variants). Arch: Qwen3MoeForCausalLM.

🟡 Nit: This comment is stale — TestOpenAIServerFunctionCalling still targets meta-llama/Llama-3.2-1B-Instruct (line 118 of test_function_calling.py), and the new TokenSpeed subclass uses Qwen/Qwen3-4B, not Qwen3-30B-A3B. The comment describes an earlier design that was superseded when the tests switched to the 4B model to fit on a single 80GB GPU.

Suggested change
# Also used by TestOpenAIServerFunctionCalling as a TokenSpeed-supported
# substitute while TokenSpeed's model registry does not cover plain
# LlamaForCausalLM (only MoE / Eagle3 variants). Arch: Qwen3MoeForCausalLM.
# Thinking/reasoning model (larger). Arch: Qwen3MoeForCausalLM.
# function_calling / tool_choice features are listed for completeness —
# active TokenSpeed function-calling tests use Qwen3-4B (see
# TestOpenAIServerFunctionCallingTokenSpeed in test_function_calling.py).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d264a5b8fa

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread scripts/ci_install_tokenspeed.sh Outdated
# ── Clone TokenSpeed ────────────────────────────────────────────────────────
if [ ! -d "$TOKENSPEED_DIR" ]; then
echo "Cloning TokenSpeed ${TOKENSPEED_REF} from ${TOKENSPEED_REPO}..."
git clone --depth 1 --branch "$TOKENSPEED_REF" "$TOKENSPEED_REPO" "$TOKENSPEED_DIR"

P2: Allow TOKENSPEED_REF to be a commit SHA

git clone --branch accepts branch/tag names (see git clone -h: “checkout instead of the remote’s HEAD”), so setting TOKENSPEED_REF to an immutable commit SHA makes first-run installs fail with “Remote branch ... not found”. That blocks the pin-by-SHA workflow this installer is supposed to support on clean CI runners.

Comment thread scripts/ci_install_tokenspeed.sh Outdated
git clone --depth 1 --branch "$TOKENSPEED_REF" "$TOKENSPEED_REPO" "$TOKENSPEED_DIR"
else
echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
(cd "$TOKENSPEED_DIR" && git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF")

P2: Fast-forward reused clone to fetched TokenSpeed ref

In the reuse path, git fetch ... origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF" does not advance an existing local branch to the fetched remote tip. When TOKENSPEED_REF=main, warm runners can keep building an old local main while fresh runners build newer code, causing inconsistent CI behavior across nodes.

@yetone yetone force-pushed the feat/grpc-servicer-tokenspeed branch from d264a5b to be83fd2 Compare April 23, 2026 17:09

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be83fd282f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread scripts/ci_install_tokenspeed.sh Outdated
--no-build-isolation

# Step 3: kernel (CUDA compile — the expensive one). Try the cached wheel first.
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" 2>/dev/null | head -1 || true)

P2: Scope cached kernel wheel to current TokenSpeed ref

The cache lookup installs the first tokenspeed_kernel-*.whl found without checking TOKENSPEED_REF, so warm runners can reuse a wheel built from an older TokenSpeed revision while the scheduler/runtime code is checked out at a different ref. That mismatch can produce nondeterministic CI behavior (or runtime import/ABI failures) when the kernel package changes across refs; the cache keying needs to include the selected ref (and ideally interpreter/ABI) rather than using an unqualified glob.

Comment on lines +169 to +170
rid = getattr(obj, "rid", None) or "no-rid"
self.rid_to_state[rid] = _FakeState()

P2: Handle list RIDs in fake AsyncLLM state tracking

For n>1 requests the servicer sets obj.rid to a list of child IDs, but this test double stores rid directly as a dict key; when rid is a list, self.rid_to_state[rid] = ... raises TypeError: unhashable type: 'list'. That makes the n>1 cancellation path fail before assertions run, so the newly added abort-all-children behavior is not actually testable in this fixture.

@yetone
Collaborator Author

yetone commented Apr 23, 2026

Dug into the 4 remaining tool_choice="auto" failures — root-caused + fixed

Reproduction

Full-stack e2e on a single H100: tokenspeed gRPC worker + smg gateway + openai client against meta-llama/Llama-3.2-1B-Instruct. Setup matches CI (ci_install_tokenspeed.sh + TestOpenAIServerFunctionCalling).

Root cause

Instrumented the servicer to log what flows out:

[SRV-AFTER] completion=22 raw=27
first10=[128009, 128006, 78191, 128007, 271, 5018, 609, 794, 330, 723]
last5  =[65, 794, 330, 20, 32075]
+ trailing 128008 (<|eom_id|>)

TokenSpeed's AsyncLLM prepends the Llama-3 assistant-role header tokens — [<|eot_id|>, <|start_header_id|>, "assistant", <|end_header_id|>, "\n\n"] — to every output_ids, plus includes the matched stop token. When the smg gateway detokenizes with skip_special_tokens=True, the 128xxx control tokens get stripped, but "assistant" (78191) and "\n\n" (271) are regular word tokens so they survive. The decoded text arriving at the llama parser looks like:

assistant\n\n{"name": "add", "parameters": {"a": "3", "b": "5"}}

The llama parser's parse_complete does text.trim_start().starts_with('{') which returns false because of the assistant\n\n prefix, falls through to "no JSON structure found", and emits an empty tool-call list.

Why this didn't hit Qwen3-30B: the qwen parser uses a regex to extract <tool_call>...</tool_call> tags and is tolerant of both leading prefix and trailing special tokens.

Why this didn't hit sglang/vllm/trtllm on the same model: their respective gRPC servicers/backends emit token_ids containing only the model's generated content — no header echo, no stop token included. SGLang's on-wire shape is the contract the gateway was built for; this PR's tokenspeed servicer was forwarding output_ids verbatim, inheriting TokenSpeed-AsyncLLM-internal details.

Fix

Added _generated_output_ids(output, reason_dict) to the tokenspeed servicer. It:

  1. Slices raw[-meta.completion_tokens:] to drop the header prefix.
  2. If the finish reason is "stop" and the trailing id matches reason.matched, drops that too.

matched_token_id still rides in the separate proto field. meta_info.completion_tokens passes through unchanged (SGLang's convention — it reports generation accounting, not wire-payload length).
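
For reviewers who do not want to open the diff, here is a minimal sketch of that
trimming. It assumes dict-style access to output_ids, meta_info.completion_tokens,
and a finish-reason dict with type / matched fields; the real servicer's types may
differ.

# Sketch only: field access and shapes are assumptions, but the two steps are
# the ones described above.
def _generated_output_ids(output, reason_dict):
    raw = list(output["output_ids"])
    completion = output["meta_info"]["completion_tokens"]
    # Step 1: keep only the last `completion_tokens` ids, dropping the
    # assistant-role header echo that AsyncLLM prepends.
    ids = raw[-completion:] if completion else []
    # Step 2: if generation finished on a stop match and it is the trailing id,
    # drop it from the wire payload; it still travels as matched_token_id.
    if reason_dict and reason_dict.get("type") == "stop":
        matched = reason_dict.get("matched")
        if isinstance(matched, int) and ids and ids[-1] == matched:
            ids = ids[:-1]
    return ids

Under these assumptions, the instrumented 27-token payload above (completion=22,
trailing <|eom_id|>) reduces to 21 content tokens on the wire.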

Updated unit tests:

  • test_non_streaming_emits_complete now asserts the stripped wire shape ([10, 11] with matched_token_id=12).
  • New test_strips_chat_template_prefix — reproduces the real 27-token payload shape and asserts the servicer forwards only the 21 content tokens.

Result

TestOpenAIServerFunctionCalling against unsloth/Llama-3.2-1B-Instruct + tokenspeed (ungated mirror of the official repo; identical weights; local test environment had no HF gated-repo token):

Run          passed  failed  skipped
Before fix        6      10        2
After fix        10       6        2

The 6 remaining failures come from the pre-existing smg-gateway tool_choice=required/specific constraint-translation bug (the same failure set appears on vLLM for the same model, so it is unrelated to this PR). See the earlier reference runs in the PR description.

Commit: 7718843

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
e2e_test/infra/constants.py (1)

67-72: ⚠️ Potential issue | 🟡 Minor

Stale docstring in get_runtime.

The docstring still says "sglang or vllm" but the function now returns any of sglang, vllm, trtllm, tokenspeed. Update for clarity.

🔧 Proposed fix
 def get_runtime() -> str:
-    """Get the current test runtime (sglang or vllm).
+    """Get the current test runtime (sglang, vllm, trtllm, or tokenspeed).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e_test/infra/constants.py` around lines 67 - 72, Update the stale docstring
for get_runtime to list the current allowed runtime values (sglang, vllm,
trtllm, tokenspeed) and clarify that it reads the E2E_RUNTIME environment
variable with a default of "sglang"; modify the docstring text in the
get_runtime function to reflect these exact runtime names and the default
behavior so the documentation matches the implementation.
♻️ Duplicate comments (1)
grpc_servicer/tests/test_tokenspeed_servicer.py (1)

167-179: ⚠️ Potential issue | 🔴 Critical

Fake engine still breaks test_cancel_aborts_all_n_children for list rids.

When sampling_params.n > 1, _build_generate_req sets obj.rid = ["rid-1-n0", "rid-1-n1", "rid-1-n2"]. The line self.rid_to_state[rid] = _FakeState() then raises TypeError: unhashable type: 'list' before never_finish() runs, so started.set() is never called and the newly-added test_cancel_aborts_all_n_children will hang on await started.wait() (or fail via the servicer's outer except Exception branch converting the TypeError into an INVALID/INTERNAL abort, not a CancelledError).

Please teach the fake engine to iterate list rids so the n>1 cancel test actually exercises the expanded-rid path it's meant to guard.

🛠 Proposed fix
     async def generate_request(self, obj):
         # Record the request so tests can assert on what was forwarded.
         rid = getattr(obj, "rid", None) or "no-rid"
-        self.rid_to_state[rid] = _FakeState()
+        rids = rid if isinstance(rid, list) else [rid]
+        for one_rid in rids:
+            self.rid_to_state[one_rid] = _FakeState()
         if self.generate_fn is not None:
             async for out in self.generate_fn(obj):
                 self.last_receive_tstamp = 9999.0  # anything > tic
                 yield out
             return
         for out in self.outputs:
             self.last_receive_tstamp = 9999.0
             yield out
-        self.rid_to_state[rid].finished = True
+        for one_rid in rids:
+            if one_rid in self.rid_to_state:
+                self.rid_to_state[one_rid].finished = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/tests/test_tokenspeed_servicer.py` around lines 167 - 179, The
fake engine's generate_request currently treats obj.rid as a single hashable key
and does self.rid_to_state[rid] = _FakeState(), which fails when
_build_generate_req sets obj.rid to a list for sampling_params.n > 1; update
generate_request to detect if rid is a list/iterable and iterate over each
element, creating an entry self.rid_to_state[child_rid] = _FakeState() for each
child_rid (and mark each .finished appropriately) so
started.set()/never_finish() can run and the test_cancel_aborts_all_n_children
path exercises the expanded-rid behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py`:
- Around line 57-67: The set_serving and set_not_serving methods mutate the
shared _serving_status dict from a non-event-loop thread while Check() reads it
on the asyncio loop; make this thread-safe by marshaling updates onto the event
loop (use the servicer's asyncio loop and call loop.call_soon_threadsafe to
schedule the dict assignments and logger.info), or alternatively protect
_serving_status with a threading.Lock and use it in set_serving, set_not_serving
and Check; locate the methods set_serving, set_not_serving and the Check method
to implement the chosen approach and ensure the warmup thread in server.py
(where set_serving is invoked) remains unchanged.
- Around line 77-78: The Check handler currently returns NOT_SERVING whenever
self.async_llm.gracefully_exit is true, which masks SERVICE_UNKNOWN for
unrecognized service names; in the health servicer's Check method (or the class
containing it), move the unknown-service validation logic (the branch that
returns SERVICE_UNKNOWN / NOT_FOUND for bogus service names) to run before the
gracefully_exit check so that unknown services return SERVICE_UNKNOWN even
during shutdown, preserving parity with the SGLang behavior.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py`:
- Around line 35-53: The docstring Raises section is inaccurate: update the
Raises text in the launch_engine wrapper to state that RuntimeError is raised
when async_llm is None (i.e., the current node is not rank 0 in a multi-node
deployment or scheduler died during init) and separately note that actual rank-0
scheduler initialization failures are propagated by _launch_subprocesses (not
wrapped by this RuntimeError); reference _launch_subprocesses, async_llm, and
the RuntimeError guard in the body when editing the docstring.

In `@grpc_servicer/tests/test_tokenspeed_health_servicer.py`:
- Around line 12-15: Remove the redundant noqa comments on the top-level import
lines: delete the "# noqa: E402" trailing each import of health_pb2 and
TokenSpeedHealthServicer in the test module so the imports read normally (refer
to the import statements importing grpc_health.v1.health_pb2 and
smg_grpc_servicer.tokenspeed.health_servicer.TokenSpeedHealthServicer).

In `@scripts/ci_install_tokenspeed.sh`:
- Around line 60-63: The reuse branch fetch/checkout path can leave the working
tree on a stale local branch; after fetching origin "$TOKENSPEED_REF" in the
TOKENSPEED_DIR reuse block, explicitly reset the working tree to the fetched ref
(e.g., run a hard reset to FETCH_HEAD) so the clone always matches the remote
ref; update the commands in the reuse branch that reference TOKENSPEED_DIR and
TOKENSPEED_REF to perform git fetch --depth 1 origin "$TOKENSPEED_REF" followed
by a git reset --hard FETCH_HEAD (and optionally git clean -fd if you need to
remove untracked files) to ensure the working tree is exactly the fetched
commit.

---

Outside diff comments:
In `@e2e_test/infra/constants.py`:
- Around line 67-72: Update the stale docstring for get_runtime to list the
current allowed runtime values (sglang, vllm, trtllm, tokenspeed) and clarify
that it reads the E2E_RUNTIME environment variable with a default of "sglang";
modify the docstring text in the get_runtime function to reflect these exact
runtime names and the default behavior so the documentation matches the
implementation.

---

Duplicate comments:
In `@grpc_servicer/tests/test_tokenspeed_servicer.py`:
- Around line 167-179: The fake engine's generate_request currently treats
obj.rid as a single hashable key and does self.rid_to_state[rid] = _FakeState(),
which fails when _build_generate_req sets obj.rid to a list for
sampling_params.n > 1; update generate_request to detect if rid is a
list/iterable and iterate over each element, creating an entry
self.rid_to_state[child_rid] = _FakeState() for each child_rid (and mark each
.finished appropriately) so started.set()/never_finish() can run and the
test_cancel_aborts_all_n_children path exercises the expanded-rid behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8c93b593-996b-44c3-8529-c06dbfd6418f

📥 Commits

Reviewing files that changed from the base of the PR and between 7aca171 and 7718843.

📒 Files selected for processing (19)
  • .github/actions/setup-tokenspeed/action.yml
  • .github/workflows/e2e-gpu-job.yml
  • .github/workflows/pr-test-rust.yml
  • e2e_test/chat_completions/test_function_calling.py
  • e2e_test/infra/constants.py
  • e2e_test/infra/model_specs.py
  • e2e_test/infra/worker.py
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__init__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • grpc_servicer/tests/__init__.py
  • grpc_servicer/tests/conftest.py
  • grpc_servicer/tests/test_tokenspeed_health_servicer.py
  • grpc_servicer/tests/test_tokenspeed_servicer.py
  • scripts/ci_install_tokenspeed.sh

Comment on lines +57 to +67
def set_serving(self) -> None:
"""Flip both services to SERVING (call after successful warmup)."""
self._serving_status[self.OVERALL_SERVER] = health_pb2.HealthCheckResponse.SERVING
self._serving_status[self.SGLANG_SERVICE] = health_pb2.HealthCheckResponse.SERVING
logger.info("TokenSpeed gRPC health status -> SERVING")

def set_not_serving(self) -> None:
"""Flip both services to NOT_SERVING (call on shutdown)."""
self._serving_status[self.OVERALL_SERVER] = health_pb2.HealthCheckResponse.NOT_SERVING
self._serving_status[self.SGLANG_SERVICE] = health_pb2.HealthCheckResponse.NOT_SERVING
logger.info("TokenSpeed gRPC health status -> NOT_SERVING")

🧹 Nitpick | 🔵 Trivial

set_serving/set_not_serving are called from a non-loop thread.

server.py invokes health_servicer.set_serving() from the warmup daemon thread (line 205 there), while Check() reads _serving_status on the asyncio event loop. CPython dict item assignment is GIL-atomic, so this is practically safe today, but it could turn into a subtle, reader-visible bug if the implementation grows (e.g., multi-field transitions, logging ordering). Consider either documenting the "called from any thread" contract here or marshaling via loop.call_soon_threadsafe.
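
A minimal sketch of the call_soon_threadsafe option, reusing the names from the snippet above; the self._loop attribute (the loop serve_grpc runs on, captured at construction) is an assumption, not something the current class shows.

def set_serving(self) -> None:
    """Flip both services to SERVING; safe to call from any thread."""

    def _apply() -> None:
        self._serving_status[self.OVERALL_SERVER] = health_pb2.HealthCheckResponse.SERVING
        self._serving_status[self.SGLANG_SERVICE] = health_pb2.HealthCheckResponse.SERVING
        logger.info("TokenSpeed gRPC health status -> SERVING")

    # set_serving() is called from the warmup daemon thread; hop onto the
    # asyncio loop (self._loop is an assumed attribute) so the status flip and
    # the log line are ordered with respect to Check() reads.
    self._loop.call_soon_threadsafe(_apply)

set_not_serving() would marshal the same way; the alternative is simply documenting that callers may invoke these from any thread and relying on the GIL-atomic dict writes.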

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py` around lines
57 - 67, The set_serving and set_not_serving methods mutate the shared
_serving_status dict from a non-event-loop thread while Check() reads it on the
asyncio loop; make this thread-safe by marshaling updates onto the event loop
(use the servicer's asyncio loop and call loop.call_soon_threadsafe to schedule
the dict assignments and logger.info), or alternatively protect _serving_status
with a threading.Lock and use it in set_serving, set_not_serving and Check;
locate the methods set_serving, set_not_serving and the Check method to
implement the chosen approach and ensure the warmup thread in server.py (where
set_serving is invoked) remains unchanged.

Comment on lines +77 to +78
if self.async_llm.gracefully_exit:
return health_pb2.HealthCheckResponse(status=health_pb2.HealthCheckResponse.NOT_SERVING)

⚠️ Potential issue | 🟡 Minor

gracefully_exit short-circuit returns NOT_SERVING for unknown services.

During shutdown, a Check for a bogus service name will return NOT_SERVING rather than SERVICE_UNKNOWN / NOT_FOUND. Minor — most clients won't observe it, and the response is still "don't route here" — but if parity with the SGLang health servicer matters, consider reordering the unknown-service check before the gracefully_exit gate.
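
A sketch of that reordering, assuming _serving_status is keyed by the two registered service names (so membership doubles as the unknown-service test) and the rest of Check() stays as-is; whether the real code returns SERVICE_UNKNOWN or aborts with NOT_FOUND, only the ordering matters here.

async def Check(self, request, context):
    # Validate the service name first, so bogus names still get
    # SERVICE_UNKNOWN even while the engine is shutting down.
    # (Membership in _serving_status as the known-service test is an assumption.)
    if request.service not in self._serving_status:
        return health_pb2.HealthCheckResponse(
            status=health_pb2.HealthCheckResponse.SERVICE_UNKNOWN
        )
    # Only then consult the shutdown flag.
    if self.async_llm.gracefully_exit:
        return health_pb2.HealthCheckResponse(
            status=health_pb2.HealthCheckResponse.NOT_SERVING
        )
    return health_pb2.HealthCheckResponse(status=self._serving_status[request.service])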

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py` around lines
77 - 78, The Check handler currently returns NOT_SERVING whenever
self.async_llm.gracefully_exit is true, which masks SERVICE_UNKNOWN for
unrecognized service names; in the health servicer's Check method (or the class
containing it), move the unknown-service validation logic (the branch that
returns SERVICE_UNKNOWN / NOT_FOUND for bogus service names) to run before the
gracefully_exit check so that unknown services return SERVICE_UNKNOWN even
during shutdown, preserving parity with the SGLang behavior.

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py Outdated
Comment on lines +12 to +15
from grpc_health.v1 import health_pb2 # noqa: E402
from smg_grpc_servicer.tokenspeed.health_servicer import ( # noqa: E402
TokenSpeedHealthServicer,
)

🧹 Nitpick | 🔵 Trivial

Unnecessary # noqa: E402.

These imports are at the top of the module (no preceding non-import code), so E402 won't fire. The noqa directives are redundant.

🔧 Proposed fix
-from grpc_health.v1 import health_pb2  # noqa: E402
-from smg_grpc_servicer.tokenspeed.health_servicer import (  # noqa: E402
+from grpc_health.v1 import health_pb2
+from smg_grpc_servicer.tokenspeed.health_servicer import (
     TokenSpeedHealthServicer,
 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from grpc_health.v1 import health_pb2 # noqa: E402
from smg_grpc_servicer.tokenspeed.health_servicer import ( # noqa: E402
TokenSpeedHealthServicer,
)
from grpc_health.v1 import health_pb2
from smg_grpc_servicer.tokenspeed.health_servicer import (
TokenSpeedHealthServicer,
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/tests/test_tokenspeed_health_servicer.py` around lines 12 - 15,
Remove the redundant noqa comments on the top-level import lines: delete the "#
noqa: E402" trailing each import of health_pb2 and TokenSpeedHealthServicer in
the test module so the imports read normally (refer to the import statements
importing grpc_health.v1.health_pb2 and
smg_grpc_servicer.tokenspeed.health_servicer.TokenSpeedHealthServicer).

Comment thread scripts/ci_install_tokenspeed.sh Outdated
Comment on lines +60 to +63
else
echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
(cd "$TOKENSPEED_DIR" && git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF")
fi

⚠️ Potential issue | 🟠 Major

Reuse path does not advance to the latest ref.

When the clone directory already exists, git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF" leaves the working tree on the previously checked-out commit if the local branch already exists and diverges from the remote — checkout won't fast-forward. On a long-lived self-hosted runner, CI can silently build a stale TokenSpeed. Reset explicitly to FETCH_HEAD:

🔧 Proposed fix
-else
-    echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
-    (cd "$TOKENSPEED_DIR" && git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF")
-fi
+else
+    echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
+    (cd "$TOKENSPEED_DIR" \
+        && git fetch --depth 1 origin "$TOKENSPEED_REF" \
+        && git checkout -B "$TOKENSPEED_REF" FETCH_HEAD)
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/ci_install_tokenspeed.sh` around lines 60 - 63, The reuse branch
fetch/checkout path can leave the working tree on a stale local branch; after
fetching origin "$TOKENSPEED_REF" in the TOKENSPEED_DIR reuse block, explicitly
reset the working tree to the fetched ref (e.g., run a hard reset to FETCH_HEAD)
so the clone always matches the remote ref; update the commands in the reuse
branch that reference TOKENSPEED_DIR and TOKENSPEED_REF to perform git fetch
--depth 1 origin "$TOKENSPEED_REF" followed by a git reset --hard FETCH_HEAD
(and optionally git clean -fd if you need to remove untracked files) to ensure
the working tree is exactly the fetched commit.

@key4ng
Collaborator

key4ng commented May 8, 2026

Reorganized into a 3-PR stack

@CatherineSue per your review, split this into three reviewable PRs:

  • #1351 (this): Rust gRPC + protocol — proto, generated stubs, TokenSpeedSchedulerClient, router wiring, RuntimeType::TokenSpeed (base: main)
  • #1464: Python servicer + unit tests (base: feat/grpc-servicer-tokenspeed)
  • #1465: CI workflows + e2e tests + install scripts (base: feat/grpc-tokenspeed-servicer)

PRs #1464 and #1465 are open as drafts until this PR is in shape to merge.

What changed in this PR vs. the previous tip

Verification (this PR)

  • cargo +nightly fmt --all -- --check
  • cargo clippy -p smg-grpc-client --all-targets --all-features -- -D warnings
  • cargo clippy -p smg --all-targets --all-features -- -D warnings

Other reviewer notes

  • The 🔴 critical bot finding (FakeAsyncLLM crashing on an n>1 list rid) is fixed in PR #1464 (feat(grpc_servicer): add TokenSpeed servicer, Part 2/3).
  • The duplicate Qwen/Qwen3-30B-A3B dict key flagged earlier is not present on the current branch (resolved in an earlier yetone commit; verified with grep -n that only one entry remains).
  • Lower-priority bot findings (P1/P2 on warmup-then-SERVING ordering, temperature=0 preservation, bootstrap_port=0 ambiguity, etc.) are deliberately out of scope for this restructuring; happy to address them in follow-up PRs once this stack lands.

cc @yetone @slin1237 @XinyueZhang369 @gongwei-130

key4ng added 2 commits May 9, 2026 12:19
…dule

The 5 ``build_*_sampling_params_from_*`` helpers (chat / responses /
messages / completion / plain) plus their constraint helpers were
sitting on ``SglangSchedulerClient`` even though the OpenAI mapping is
backend-neutral. TokenSpeed's client was reaching across to call them
through ``SglangSchedulerClient::*`` which suggested SGLang owned the
OpenAI→sampling translation, when it doesn't.

Move them to ``crate::sampling_params`` as free functions. Both the
SGLang and TokenSpeed clients (and any future client that wants the
same mapping) now call ``crate::sampling_params::build_*`` directly.

The return type is still ``sglang::SamplingParams`` because that proto
happens to be the most permissive shape across our supported backends;
TokenSpeed translates to its own slimmer shape at the wire boundary.
When a backend grows a sampling field SGLang lacks, this is the place
to add it.

No behavior change. Tests stay green; the call sites in
``build_generate_request_from_*`` are mechanically updated.
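
A minimal, illustrative sketch of the resulting call shape (stand-in types
only; the real free functions live in ``crate::sampling_params`` and return
the generated SGLang params message):

```rust
// Simplified stand-ins for the generated sglang::SamplingParams proto and an
// OpenAI-style chat request; the real types come from the proto build.
#[derive(Debug)]
pub struct SamplingParams {
    pub temperature: Option<f32>,
    pub top_p: Option<f32>,
    pub max_new_tokens: Option<u32>,
}

pub struct ChatRequest {
    pub temperature: Option<f32>,
    pub top_p: Option<f32>,
    pub max_tokens: Option<u32>,
}

pub mod sampling_params {
    use super::{ChatRequest, SamplingParams};

    // The backend-neutral OpenAI -> sampling mapping as a free function, no
    // longer a method on SglangSchedulerClient.
    pub fn build_chat_sampling_params(req: &ChatRequest) -> SamplingParams {
        SamplingParams {
            temperature: req.temperature,
            top_p: req.top_p,
            max_new_tokens: req.max_tokens,
        }
    }
}

fn main() {
    let req = ChatRequest { temperature: Some(0.2), top_p: None, max_tokens: Some(128) };
    // Both the SGLang and TokenSpeed clients call the same free function.
    let params = sampling_params::build_chat_sampling_params(&req);
    println!("{params:?}");
}
```

The only point of the sketch is ownership: the mapping is a free function any
backend client can call, so no single client implies it owns the translation.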

Signed-off-by: key4ng <rukeyang@gmail.com>
…ation)

TokenSpeed previously rode the SGLang IR arms in proto_wrapper.rs:
``TokenSpeedSchedulerClient::generate()`` accepted ``sglang::GenerateRequest``,
the streaming response was translated back into ``sglang::GenerateResponse``
at the wire boundary, and the router dispatched through
``ProtoGenerateRequest::Sglang``. This let TokenSpeed reuse the existing
match arms but coupled it to SGLang's evolving schema — every SGLang field
addition forced a TokenSpeed translator stub (most recently
``default_sampling_params_json``).

Add native TokenSpeed arms to the router IR:

  - ``ProtoGenerateRequest::TokenSpeed(Box<tokenspeed::GenerateRequest>)``
  - ``ProtoGenerateResponse::TokenSpeed(Box<tokenspeed::GenerateResponse>)``
  - ``ProtoGenerateStreamChunk::TokenSpeed(tokenspeed::GenerateStreamChunk)``
  - ``ProtoGenerateComplete::TokenSpeed(tokenspeed::GenerateComplete)``

…with ``as_tokenspeed`` / ``as_tokenspeed_mut`` / ``is_tokenspeed`` accessors
mirroring the existing per-backend pattern (Mlx / Vllm / Trtllm) and
``TokenSpeed`` arms in every aggregator method (``token_ids``, ``index``,
``output_logprobs``, ``input_logprobs``, ``prompt_tokens``,
``completion_tokens``, ``cached_tokens``, ``finish_reason``,
``matched_stop_json``, ``output_ids``, ``set_stream``, ``request_id``,
``set_max_tokens_for_prefill``, ``clear_mm_inputs``,
``set_kv_transfer_params``, ``kv_transfer_params``).
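
For orientation, a hypothetical, much-reduced sketch of that per-backend IR
shape (two arms, one accessor, one aggregator method; the real enum and its
full accessor/aggregator surface live in ``proto_wrapper.rs``):

```rust
// Stand-in payload types; the real ones are the generated proto messages.
struct SglangGenerateRequest {
    rid: String,
}
struct TokenSpeedGenerateRequest {
    rid: String,
}

enum ProtoGenerateRequest {
    Sglang(Box<SglangGenerateRequest>),
    TokenSpeed(Box<TokenSpeedGenerateRequest>),
}

impl ProtoGenerateRequest {
    fn is_tokenspeed(&self) -> bool {
        matches!(self, Self::TokenSpeed(_))
    }

    fn as_tokenspeed(&self) -> Option<&TokenSpeedGenerateRequest> {
        match self {
            Self::TokenSpeed(req) => Some(req),
            _ => None,
        }
    }

    // Aggregator-style accessor: every backend arm answers the same question.
    fn request_id(&self) -> &str {
        match self {
            Self::Sglang(req) => &req.rid,
            Self::TokenSpeed(req) => &req.rid,
        }
    }
}

fn main() {
    let req = ProtoGenerateRequest::TokenSpeed(Box::new(TokenSpeedGenerateRequest {
        rid: "req-1".into(),
    }));
    assert!(req.is_tokenspeed());
    assert_eq!(req.as_tokenspeed().map(|r| r.rid.as_str()), Some("req-1"));
    println!("{}", req.request_id());
}
```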

Client-side:

  - ``TokenSpeedSchedulerClient::generate()`` now takes
    ``tokenspeed_proto::GenerateRequest``
  - ``AbortOnDropStream::Item`` is ``tokenspeed_proto::GenerateResponse``
  - The 5 ``build_*_request`` builders return native ``tokenspeed_proto``
    types
  - ``translate::generate_request`` / ``generate_response`` /
    ``stream_chunk`` / ``complete`` / ``output_logprobs`` are gone. The
    only translation kept is ``translate::sampling_params`` (a thin
    field map) plus the unary RPC adapters (``model_info`` /
    ``server_info`` / ``loads``), which still produce SGLang shapes
    because the router's ``ModelInfo`` / ``ServerInfo`` enums consume
    those; migrating them is a separate cleanup.

Router-side:

  - ``client.rs::generate()`` dispatch arm now matches
    ``(Self::TokenSpeed(_), ProtoGenerateRequest::TokenSpeed(_))`` (see
    the sketch after this list).
  - The 5 ``build_*_request`` paths in ``GrpcClient`` wrap into
    ``ProtoGenerateRequest::TokenSpeed`` instead of ``::Sglang``.
  - ``harmony/stages/request_building.rs`` builds
    ``ProtoGenerateRequest::TokenSpeed`` and grew a ``TokenSpeed`` arm
    in the Harmony stop-token injection match.
  - ``common/stages/helpers.rs::apply_sampling_defaults_to_generate_request``
    early-returns for TokenSpeed (alongside Trtllm) since neither
    backend plumbs sampling defaults through today; the explicit arm
    keeps the match exhaustive.
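
A hypothetical sketch of that double-match dispatch, with simplified
stand-ins for the client and request types (not the real ``GrpcClient``
definitions):

```rust
// Simplified stand-ins for the per-backend scheduler clients and request IR.
struct SglangSchedulerClient;
struct TokenSpeedSchedulerClient;

enum GrpcClient {
    Sglang(SglangSchedulerClient),
    TokenSpeed(TokenSpeedSchedulerClient),
}

struct SglangGenerateRequest;
struct TokenSpeedGenerateRequest;

enum ProtoGenerateRequest {
    Sglang(Box<SglangGenerateRequest>),
    TokenSpeed(Box<TokenSpeedGenerateRequest>),
}

impl GrpcClient {
    fn generate(&self, request: ProtoGenerateRequest) -> Result<&'static str, &'static str> {
        match (self, request) {
            // Each client only accepts its own backend's native request arm.
            (Self::Sglang(_client), ProtoGenerateRequest::Sglang(_req)) => Ok("sglang"),
            (Self::TokenSpeed(_client), ProtoGenerateRequest::TokenSpeed(_req)) => {
                Ok("tokenspeed")
            }
            // A mismatched client/request pair is a routing bug, not a silent fallback.
            _ => Err("client/request backend mismatch"),
        }
    }
}

fn main() {
    let client = GrpcClient::TokenSpeed(TokenSpeedSchedulerClient);
    let request = ProtoGenerateRequest::TokenSpeed(Box::new(TokenSpeedGenerateRequest));
    assert_eq!(client.generate(request), Ok("tokenspeed"));
}
```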

PD-disagg paths (``response_collection.rs``) remain SGLang-keyed — the
``if let ProtoGenerateComplete::Sglang(...)`` checks simply won't match
TokenSpeed responses, which is the correct behavior since TokenSpeed
doesn't ship PD-disaggregation.

Verification:

  - ``cargo +nightly fmt --all -- --check`` passes
  - ``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`` passes
  - ``cargo check -p smg --bin smg`` passes

This addresses the architectural concern raised on #1351: SGLang's proto
shouldn't be the de-facto router IR. Each backend now has its own arm,
matching how vLLM / MLX / TRT-LLM are integrated.

Signed-off-by: key4ng <rukeyang@gmail.com>
key4ng added a commit that referenced this pull request May 9, 2026
Adds the Python servicer that runs alongside a TokenSpeed scheduler
process and serves the gRPC protocol PR1 introduced. Includes:

  - the async scheduler servicer (Generate/HealthCheck/Abort/
    GetModelInfo/GetServerInfo/GetLoads), with cancellation handling
    for streaming, non-streaming, channel-close, and n>1 paths
  - a health-service bridge that flips SERVING/NOT_SERVING based on
    scheduler liveness (deep probe with bounded staleness)
  - a scheduler launcher that boots TokenSpeed's AsyncLLM in-process
  - the ``python -m smg_grpc_servicer.tokenspeed`` entrypoint
  - real ``GetLoads`` plumbing backed by ``AsyncLLM.get_load()`` so
    router-side load balancing reflects scheduler-side metrics
  - 57 unit tests covering the servicer, health service, proto
    conversion, finish reasons, sampling params, streaming/non-
    streaming generation, abort/cancel (incl. n>1), model/server
    info, and load metrics

This is part 2 of 3 splitting #1351:
  - PR1: Rust gRPC + protocol (merged first)
  - PR2 (this): Python servicer + unit tests
  - PR3: CI workflows + e2e tests

Stacked on PR1 — the servicer imports the proto stubs PR1 generates
from ``crates/grpc_client/proto/tokenspeed_scheduler.proto``.

Fixes a 🔴 critical from review on #1351:
  - FakeAsyncLLM.generate_request crashed with
    ``TypeError: unhashable type: 'list'`` when n>1, because
    ``_build_generate_req`` rewrites ``rid`` to a list of per-choice
    ids. The fake engine now registers state for each child rid, so
    ``test_cancel_aborts_all_n_children`` exercises the cancel sweep
    instead of dying at setup.

Signed-off-by: key4ng <rukeyang@gmail.com>
key4ng added a commit that referenced this pull request May 9, 2026
Wires TokenSpeed into CI and the GPU e2e suite:

  - ``.github/actions/setup-tokenspeed`` composite action and
    ``scripts/ci_install_tokenspeed.sh`` to source-install TokenSpeed
    (kernel + scheduler) at a pinned ref, with a wheel cache lookup so
    repeat runs skip the ~20 min compile
  - e2e-gpu-job.yml: add a tokenspeed engine lane, gated on secret
    access so forked PRs skip cleanly
  - pr-test-rust.yml: install the same proto deps so Rust-only changes
    that touch ``crates/grpc_client`` still cover the tokenspeed proto
  - e2e_test infra: ``constants``, ``hooks``, ``worker``, and
    ``model_specs`` learn about a ``tokenspeed`` runtime alongside
    sglang/vllm/trtllm; ``worker.py`` adds the launch builder; the
    suite-wide ``@pytest.mark.engine(...)`` markers expand to include
    tokenspeed
  - Function-calling and tool_choice e2e suites swap to
    ``Qwen/Qwen3-4B-Instruct-2507`` for tool-call coverage (the Qwen3
    family is what TokenSpeed's model registry currently supports)

This is part 3 of 3 splitting #1351:
  - PR1: Rust gRPC + protocol
  - PR2: Python servicer + unit tests
  - PR3 (this): CI workflows + e2e tests

Stacked on PR2. e2e wiring exercises both the Rust router from PR1 and
the Python servicer from PR2 against a live TokenSpeed worker.

Addresses CatherineSue's review on #1351:
  - drop the verbose Qwen3-4B docstring on ``TestToolChoiceQwen`` —
    that context belongs in the PR description, not in the test file

Signed-off-by: key4ng <rukeyang@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76bf5720e6


-if matches!(request, ProtoGenerateRequest::Trtllm(_)) {
+if matches!(
+    request,
+    ProtoGenerateRequest::Trtllm(_) | ProtoGenerateRequest::TokenSpeed(_)

P2: Apply sampling defaults to TokenSpeed requests

When a selected TokenSpeed worker has a default_sampling_params_json label and the client omits fields like temperature or top_p, this early return skips the default-injection path entirely. The TokenSpeed request builders already populate fallback values before dispatch, so the model/worker-preferred defaults that are applied for SGLang/vLLM/MLX are silently ignored for TokenSpeed, changing generation behavior for workers that rely on those labels.


Collaborator

Fixed in c478f1ac. Dropped TokenSpeed from the early-return, added a
ProtoGenerateRequest::TokenSpeed(req) match arm calling a new
apply_tokenspeed_sampling_defaults helper. Defaults now flow
end-to-end: translate::model_info populates
default_sampling_params_json (sibling thread), discovery flattens it
into the label map, and the injector applies the masked fields with
Some(value) per TokenSpeed's optional scalar shape.

Comment on lines 122 to 127
if matches!(
request,
ProtoGenerateRequest::Trtllm(_) | ProtoGenerateRequest::TokenSpeed(_)
) {
return;
}

🟡 Nit: This early-return is a behavioral regression from the previous code. Before this push, TokenSpeed requests were wrapped as ProtoGenerateRequest::Sglang, so they fell into the SGLang arm below and received model-discovered sampling defaults (e.g., temperature=0.7 from generation_config.json). Now that TokenSpeed has its own variant, defaults are skipped entirely.

Concretely: if a model publishes temperature=0.7 in its generation config and a user sends a chat request without specifying temperature, SGLang/vLLM/MLX workers will use 0.7, but TokenSpeed workers will use the hardcoded 1.0 from the sampling-params builder.

This is straightforward to fix (write an apply_tokenspeed_sampling_defaults similar to the existing per-backend functions), but since TokenSpeed is still in development I understand deferring it. Worth tracking explicitly so it doesn't get lost.

Collaborator

Fixed in c478f1ac rather than deferred — good catch on the
regression. apply_sampling_defaults_to_generate_request no longer
early-returns for TokenSpeed; there's a dedicated
apply_tokenspeed_sampling_defaults mirroring the SGLang/vLLM/MLX
helpers but writing Some(value) because every TokenSpeed
SamplingParams scalar is optional on the wire (the servicer's
HasField() checks rely on that set-vs-unset distinction).

Paired with the translate::model_info fix in the same commit so
default_sampling_params_json carries the JSON the discovery path
expects.

key4ng added 2 commits May 9, 2026 13:52
…ection

The chat-template tool-shape pre-processor is correct for Kimi-K2.5
(BFCL accuracy +6 pp on simple_python, +24 pp on parallel_multiple)
but breaks Mistral chat templates: their template at line 32 iterates
``tool.function.parameters.properties.items()``, which raises
``unknown method: undefined has no method named items`` once we unwrap
``{"type": "function", "function": {...}}`` into the bare inner dict.

The shape a chat template expects is template-dependent, not
engine-dependent. Reverting the unconditional unwrap; full rationale,
accuracy data, and proposed per-model fix in
docs/proposals/2026-05-09-deferred-chat-template-tools-strip.md.

Affected CI lane: e2e_test/chat_completions/test_function_calling.py::TestToolChoiceMistral
(20 tests failing with chat:32 render error).

Signed-off-by: key4ng <rukeyang@gmail.com>
Cuts ~125 lines of doc-comment / inline-rationale prose without losing
information. Hot spots:

- tokenspeed_scheduler.rs module doc (28 → 7 lines)
- tokenspeed_scheduler.proto service header + SamplingParams comment
  + per-field rationale (~40 lines)
- sampling_params.rs module doc (14 → 4 lines)
- translate::sampling_params explainer (12 → 3 lines)
- inline arm comments in client.rs / multimodal.rs / harmony/stages /
  common/stages that just re-stated what the code already shows

Behavior unchanged. ``cargo +nightly fmt --check`` passes;
``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features
-- -D warnings`` passes.

Signed-off-by: key4ng <rukeyang@gmail.com>
@github-actions github-actions Bot removed the tokenizer Tokenizer related changes label May 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8478a636ba


Comment thread crates/grpc_client/src/tokenspeed_scheduler.rs Outdated
Tightens the comments introduced (or modified) by this PR to describe
behavior directly. No behavior change; literal type / path references
left intact.

Files touched:
- crates/grpc_client/build.rs
- crates/grpc_client/proto/tokenspeed_scheduler.proto
- crates/grpc_client/src/sampling_params.rs
- crates/grpc_client/src/tokenspeed_scheduler.rs
- model_gateway/src/routers/grpc/client.rs
- model_gateway/src/routers/grpc/proto_wrapper.rs

Signed-off-by: key4ng <rukeyang@gmail.com>
@key4ng key4ng force-pushed the feat/grpc-servicer-tokenspeed branch from 8057d10 to 656f1c2 Compare May 11, 2026 03:44
Comment thread crates/grpc_client/src/lib.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 656f1c2e98


Comment thread crates/grpc_client/proto/tokenspeed_scheduler.proto
Comment thread crates/grpc_client/src/lib.rs
The IR refactor split TokenSpeed off the SGLang request arm, which
left two gaps in the sampling-defaults path:

1. `translate::model_info` populated the `preferred_sampling_params`
   label slot from TokenSpeed's response but left
   `default_sampling_params_json` empty. Worker discovery reads only
   the latter, so model-published defaults never reached the label
   map and were invisible to the request-stage injector.

2. `apply_sampling_defaults_to_generate_request` early-returned for
   `ProtoGenerateRequest::TokenSpeed(_)`. Even if a worker's labels
   carried defaults, the injector skipped the TokenSpeed arm — so a
   model publishing `temperature=0.7` in its generation config would
   apply to the other engines but TokenSpeed would fall through to
   the hardcoded 1.0 in the request builder.

Both fixed:

- `tokenspeed_scheduler.rs`: surface `preferred_sampling_params` in
  both `preferred_sampling_params` and `default_sampling_params_json`
  slots so the discovery path picks it up.
- `helpers.rs`: drop TokenSpeed from the early-return, add a
  `TokenSpeed(req)` match arm calling a new
  `apply_tokenspeed_sampling_defaults`. TokenSpeed's wire declares
  every sampling scalar as `optional`, so the helper writes
  `Some(value)` rather than the bare value — preserving the
  set-vs-unset distinction the servicer's `HasField()` checks rely on.
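
A hypothetical sketch of that set-vs-unset behavior (field names and the
defaults source are simplified stand-ins for the real label-map plumbing):

```rust
// Every TokenSpeed sampling scalar is optional on the wire: None means the
// client never set the field.
#[derive(Debug, PartialEq)]
struct TokenSpeedSamplingParams {
    temperature: Option<f32>,
    top_p: Option<f32>,
}

// Defaults discovered from the worker's default_sampling_params_json label.
struct WorkerSamplingDefaults {
    temperature: Option<f32>,
    top_p: Option<f32>,
}

// Fill only fields the client left unset; injected values stay wrapped in
// Some(..) so HasField-style checks on the servicer side still see them as set.
fn apply_tokenspeed_sampling_defaults(
    params: &mut TokenSpeedSamplingParams,
    defaults: &WorkerSamplingDefaults,
) {
    if params.temperature.is_none() {
        params.temperature = defaults.temperature;
    }
    if params.top_p.is_none() {
        params.top_p = defaults.top_p;
    }
}

fn main() {
    // The model publishes temperature=0.7; the client only set top_p.
    let defaults = WorkerSamplingDefaults { temperature: Some(0.7), top_p: Some(0.9) };
    let mut params = TokenSpeedSamplingParams { temperature: None, top_p: Some(0.5) };

    apply_tokenspeed_sampling_defaults(&mut params, &defaults);

    assert_eq!(params.temperature, Some(0.7)); // worker default injected
    assert_eq!(params.top_p, Some(0.5)); // client value preserved
    println!("{params:?}");
}
```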

Verification: `cargo +nightly fmt --all -- --check` and `cargo clippy
-p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`
both pass.

Signed-off-by: key4ng <rukeyang@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c478f1acd7


Comment thread crates/grpc_client/python/smg_grpc_proto/__init__.py
- Collapse the duplicate 2-line crate-level docstring into a single
  neutral line covering all supported backends.
- Drop the 5-line TokenSpeed re-export rationale block; the design
  rationale already lives in `tokenspeed_scheduler.rs`'s module doc.

Signed-off-by: key4ng <rukeyang@gmail.com>

Labels

  • grpc: gRPC client and router changes
  • model-gateway: Model gateway crate changes
  • protocols: Protocols crate changes
