feat(grpc): add TokenSpeed gRPC client and router wiring (Part 1/3) #1351
yetone wants to merge 8 commits into
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings, or use the review commands to manage reviews manually.
📝 Walkthrough

Adds TokenSpeed as a supported runtime across tests, worker wiring, CI, the gRPC servicer (launcher, server, health, servicer), model gateway clients/workflow, the Rust/Cargo proto build, plus new unit tests and packaging extras.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client as gRPC Client
participant Server as TokenSpeed gRPC Server
participant Launcher as Scheduler Launcher
participant AsyncLLM as AsyncLLM Engine
participant Health as Health Servicer
Client->>Server: Generate / Embed / HealthCheck
Server->>Launcher: launch_engine(server_args)
Launcher->>AsyncLLM: _launch_subprocesses()
AsyncLLM-->>Launcher: AsyncLLM instance + scheduler_info
Launcher-->>Server: (async_llm, scheduler_info)
Server->>Health: instantiate TokenSpeedHealthServicer
par Warmup
Server->>AsyncLLM: probe (one-token Generate/Embed)
AsyncLLM-->>Server: probe response
Server->>Health: set_serving()
end
Server->>AsyncLLM: dispatch Generate/Embed request
AsyncLLM-->>Server: stream/generate results
Server-->>Client: stream/return RPC responses
Client->>Server: Abort (or cancel)
Server->>AsyncLLM: abort_request for rid(s)
AsyncLLM-->>Server: acknowledge
Server-->>Client: finalize/error
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request introduces support for the TokenSpeed inference engine by implementing a gRPC servicer that mirrors the SGLang protocol, allowing the existing Rust router to interface with TokenSpeed workers. The changes encompass a new gRPC server implementation, health checks, subprocess management, and infrastructure updates for end-to-end testing. A critical issue was identified in the streaming generation logic where client disconnections during multi-sample requests (n > 1) fail to abort all derived backend requests, potentially leading to resource leaks.
```python
except asyncio.CancelledError:
    # Client disconnected — tell the scheduler to drop the request.
    aborted = True
    self.async_llm.abort_request(rid)
```
When a client disconnects from a streaming Generate request with n > 1, not all underlying requests are being aborted on the backend. The asyncio.CancelledError exception handler calls self.async_llm.abort_request(rid), where rid is the original request.request_id. However, for requests with n > 1, the servicer creates multiple backend requests with derived request IDs (e.g., f"{rid}-n{i}"), which are stored in expanded_rid. The engine is only aware of these derived IDs. This can lead to orphaned requests continuing to run on the backend, consuming resources. The fix is to iterate over expanded_rid and abort each request individually to ensure all acquired resources are tracked and released consistently to prevent leaks.
```diff
-self.async_llm.abort_request(rid)
+rids_to_abort = (
+    list(expanded_rid)
+    if isinstance(expanded_rid, list)
+    else ([expanded_rid] if isinstance(expanded_rid, str) else [])
+)
+for r in rids_to_abort:
+    self.async_llm.abort_request(r)
```
References
- When managing resources that are acquired and need to be released, ensure that all acquired resources are tracked immediately and consistently to prevent resource leaks, especially when exceptions might occur during subsequent operations.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 34e647141e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
if params.temperature:
    out["temperature"] = params.temperature
```
Preserve explicit zero temperature in sampling params
The truthy check drops temperature=0.0, so requests that explicitly ask for greedy/deterministic decoding lose that setting before reaching TokenSpeed. The Rust gRPC builder can emit 0.0 for user-specified temperature, but this branch treats it as "unset" and omits it, which can silently fall back to backend defaults and change generation behavior.
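A presence-aware check avoids the truthiness pitfall. Below is a minimal sketch, assuming the proto declares `temperature` as an `optional` scalar so that `HasField()` is available (for non-optional proto3 scalars it raises `ValueError`); `params` and `out` follow the names in the snippet above:

```python
def map_temperature(params, out: dict) -> None:
    # Forward the field only when the client explicitly set it, preserving
    # an explicit 0.0 (greedy/deterministic decoding).
    if params.HasField("temperature"):
        out["temperature"] = params.temperature
```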
| logger.warning("TokenSpeed warmup failed: %s", e) | ||
| finally: | ||
| channel.close() | ||
|
|
||
| health_servicer.set_serving() |
Set health status to SERVING only after warmup succeeds
Warmup failures in Generate/Embed are logged but still followed by set_serving(), so readiness can report healthy even when the model path is not actually able to serve inference. This can route production traffic to a broken worker right after startup instead of keeping it out of rotation.
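One way to gate readiness on warmup success, as a hedged sketch: `probe`, `channel`, and `health_servicer` stand in for the objects used in `server.py`, and only the control flow is the point here (the un-awaited `channel.close()` mirrors the snippet above).

```python
import logging

logger = logging.getLogger(__name__)


async def warmup_then_serve(probe, channel, health_servicer) -> None:
    warmup_ok = False
    try:
        await probe()  # one-token Generate/Embed against the engine
        warmup_ok = True
    except Exception as e:
        logger.warning("TokenSpeed warmup failed: %s", e)
    finally:
        channel.close()
    if warmup_ok:
        # Only advertise SERVING once the model actually answered a probe.
        health_servicer.set_serving()
```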
```python
except asyncio.CancelledError:
    # Client disconnected — tell the scheduler to drop the request.
    aborted = True
    self.async_llm.abort_request(rid)
    raise
```
Abort all expanded request IDs on client cancellation
On cancellation, the handler aborts only the original request_id, but _build_generate_req expands rid into per-choice IDs when n>1. Since aborted=True also skips the finally cleanup path, those expanded child requests can continue running after the client disconnects, wasting scheduler/GPU work until they finish naturally.
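A sketch of a cancellation handler that covers the expanded IDs; `engine` stands in for `self.async_llm`, the `f"{rid}-n{i}"` child naming follows the discussion above, and the exact signatures are assumptions:

```python
import asyncio


async def stream_generate(engine, req_obj, expanded_rid):
    try:
        async for out in engine.generate_request(req_obj):
            yield out
    except asyncio.CancelledError:
        # Abort every backend request derived from this RPC, not just one.
        rids = expanded_rid if isinstance(expanded_rid, list) else [expanded_rid]
        for r in rids:
            engine.abort_request(r)
        raise
```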
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
e2e_test/infra/constants.py (1)
Line 57: ⚠️ Potential issue | 🟡 Minor

Stale ENV_RUNTIME comment.

The inline comment still enumerates only "sglang", "vllm", or "trtllm" even though tokenspeed is now a supported runtime (and openai, xai, etc. are also accepted via the enum). Consider updating to avoid drift.

```diff
-ENV_RUNTIME = "E2E_RUNTIME"  # Runtime for gRPC tests: "sglang", "vllm", or "trtllm"
+ENV_RUNTIME = "E2E_RUNTIME"  # Runtime for gRPC tests: "sglang", "vllm", "trtllm", or "tokenspeed"
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@e2e_test/infra/constants.py` at line 57, update the stale inline comment for the ENV_RUNTIME constant: modify the comment on ENV_RUNTIME to list the currently supported runtimes (e.g., include "tokenspeed" and other accepted values such as "openai", "xai", etc.) or rephrase it to reference the authoritative enum so it won't drift; locate ENV_RUNTIME in e2e_test/infra/constants.py and replace the old enumeration string with the updated list or a short note pointing to the enum.

e2e_test/infra/worker.py (1)
Line 37: ⚠️ Potential issue | 🟡 Minor

Update engine docstring to include tokenspeed.

The field comment still lists only three engines even though tokenspeed is now dispatched in _build_cmd.

```diff
-    engine: str  # "sglang", "vllm", or "trtllm"
+    engine: str  # "sglang", "vllm", "trtllm", or "tokenspeed"
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@e2e_test/infra/worker.py` at line 37, update the inline comment for the engine field to reflect all supported engine values; change the comment on the 'engine: str' field to include "tokenspeed" alongside "sglang", "vllm", and "trtllm", and ensure any mention in _build_cmd's dispatch logic remains consistent with that docstring so reviewers see the available options match the actual dispatch cases in _build_cmd.

e2e_test/infra/model_specs.py (1)
Lines 55-86: ⚠️ Potential issue | 🔴 Critical

Duplicate dict key silently drops function_calling/tool_choice from Qwen/Qwen3-30B-A3B — breaks this PR's primary purpose.

"Qwen/Qwen3-30B-A3B" is defined twice in MODEL_SPECS (here at 55–59 and again at 80–86). Python dict literals keep the last value for duplicate keys, so the new entry with features ["chat", "streaming", "function_calling", "tool_choice"] is overwritten by the pre-existing entry whose features are ["chat", "streaming", "thinking", "reasoning"].

Downstream impact: get_models_with_feature("function_calling") and get_models_with_feature("tool_choice") will not include Qwen/Qwen3-30B-A3B. TestOpenAIServerFunctionCalling (which this PR switches to Qwen/Qwen3-30B-A3B) will run against a spec that doesn't advertise the features it needs, and vllm_args=["--enforce-eager"] / trtllm_extra_config from the second entry remain in effect.

Merge the two entries into one so the features are unioned and the vLLM/TRT-LLM config is preserved:

🛠 Proposed fix: merge duplicate entries

```diff
-    # Qwen3 small chat model — used as a TokenSpeed-supported substitute
-    # while TokenSpeed's model registry does not cover LlamaForCausalLM.
-    "Qwen/Qwen3-4B": {
-        "model": _resolve_model_path("Qwen/Qwen3-4B"),
-        "tp": 1,
-        "features": ["chat", "streaming", "function_calling", "tool_choice"],
-    },
-    # Larger Qwen3 MoE — much stronger at function calling than 4B dense.
-    # Arch: Qwen3MoeForCausalLM (in TokenSpeed's registry); 30B total /
-    # 3B active, so inference speed stays close to the 4B variant.
-    "Qwen/Qwen3-30B-A3B": {
-        "model": _resolve_model_path("Qwen/Qwen3-30B-A3B"),
-        "tp": 1,
-        "features": ["chat", "streaming", "function_calling", "tool_choice"],
-    },
+    # Qwen3 small chat model — used as a TokenSpeed-supported substitute
+    # while TokenSpeed's model registry does not cover LlamaForCausalLM.
+    "Qwen/Qwen3-4B": {
+        "model": _resolve_model_path("Qwen/Qwen3-4B"),
+        "tp": 1,
+        "features": ["chat", "streaming", "function_calling", "tool_choice"],
+    },
     # Function calling specialist
@@
-    # Thinking/reasoning model (larger)
-    "Qwen/Qwen3-30B-A3B": {
-        "model": _resolve_model_path("Qwen/Qwen3-30B-A3B"),
-        "tp": 1,
-        "features": ["chat", "streaming", "thinking", "reasoning"],
-        "vllm_args": [] if _is_nightly else ["--enforce-eager"],
-        "trtllm_extra_config": {"kv_cache_config": {"free_gpu_memory_fraction": 0.8}},
-    },
+    # Thinking/reasoning model (larger) — also used as TokenSpeed fn-calling
+    # substitute while TokenSpeed's registry lacks LlamaForCausalLM.
+    # Arch: Qwen3MoeForCausalLM; 30B total / 3B active.
+    "Qwen/Qwen3-30B-A3B": {
+        "model": _resolve_model_path("Qwen/Qwen3-30B-A3B"),
+        "tp": 1,
+        "features": [
+            "chat", "streaming", "thinking", "reasoning",
+            "function_calling", "tool_choice",
+        ],
+        "vllm_args": [] if _is_nightly else ["--enforce-eager"],
+        "trtllm_extra_config": {"kv_cache_config": {"free_gpu_memory_fraction": 0.8}},
+    },
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@e2e_test/infra/model_specs.py` around lines 55 - 86, the dict key "Qwen/Qwen3-30B-A3B" is defined twice in MODEL_SPECS, which causes the first entry (with function_calling/tool_choice) to be overwritten; merge the two entries into a single "Qwen/Qwen3-30B-A3B" entry in MODEL_SPECS by combining the features (union of ["chat","streaming","function_calling","tool_choice"] and ["chat","streaming","thinking","reasoning"]) and preserving the vllm_args and trtllm_extra_config from the second entry as well as the appropriate tp and model path; remove the duplicate dict literal so get_models_with_feature("function_calling") / ("tool_choice") will include Qwen/Qwen3-30B-A3B and the vllm_args/trtllm_extra_config remain in effect.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@e2e_test/chat_completions/test_function_calling.py`:
- Around line 103-115: This test was switched from Llama to Qwen3 but still
contains Llama-specific labels and will OOM on a single 80GB GPU; update the
test to either use a quantized Qwen3 variant or require more GPUs and rename
Llama-specific text: change the pytest model marker value (e.g.,
"Qwen/Qwen3-30B-A3B" → a Q4/quantized variant like "Qwen/Qwen3-30B-A3B-q4" or
change `@pytest.mark.gpu`(1) → `@pytest.mark.gpu`(4)), update the module/class
docstring that mentions "Llama tool parser" to "Qwen tool parser", replace
occurrences of LLAMA_SYSTEM_MESSAGE with a Qwen-appropriate system message
constant or remove it (references around the test helper and setup), and edit
comments that reference “Llama-3.2-1B is flaky” to refer to Qwen3 or remove;
locate these in the test file by symbols TestToolChoiceLlama,
LLAMA_SYSTEM_MESSAGE, and the top-level module docstring and adjust markers and
messages accordingly.
In `@e2e_test/infra/worker.py`:
- Around line 271-298: The tokenspeed gRPC command builder
(_build_tokenspeed_grpc_cmd) is fine, but the start_workers function's engine
docstring still lists only three engines; update that docstring to include the
new "tokenspeed" engine name so documentation matches the implemented backend
and keep symmetry with other builders (also check the "engine" mention near
start_workers for any enumerations or examples to include "tokenspeed").
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py`:
- Around line 34-38: The comment about priming env vars is misplaced: it refers
to the env/resource setup that actually happens in prepare_server_args(argv),
but it sits above asyncio.set_event_loop_policy() and
asyncio.run(serve_grpc(server_args)); either move the comment to directly above
the prepare_server_args(argv) call (or inside prepare_server_args) so it
documents the env priming, or remove the comment entirely if it's stale; adjust
text to reference prepare_server_args, asyncio.set_event_loop_policy, and
asyncio.run/serve_grpc to ensure clarity.
- Around line 24-38: The uvloop event loop policy is being set after grpc/aio
internals (via the module-level serve_grpc import) are already initialized; move
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) to the very start of
main() before any grpc-related initialization by invoking it before
prepare_server_args(argv) and before any use of serve_grpc/other grpc-related
code so the policy is applied prior to gRPC internals being
imported/initialized.
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/server.py`:
- Around line 157-203: The code relies on an implicit invariant that `response`
is set when `connected` is True; to harden this, initialize `response` (e.g.,
response = None) before the GetModelInfo loop and/or capture the successful
model info into an explicit variable like `model_info_response` when you set
`connected = True`, then use that variable in the warmup path instead of the
possibly-unbound `response`; update references in the warmup block (the
is_generation check and any uses of response) to read from the explicit variable
so future refactors of the GetModelInfo loop can't leave `response` undefined.
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 156-222: The CancelledError handler currently aborts only the
original proto rid and sets aborted=True, which prevents the finally block from
cleaning up per-choice expanded rids for n>1; change the asyncio.CancelledError
except block to (1) compute the same expanded rids from req_obj.rid (use the
existing logic that builds rids_to_check), (2) call
self.async_llm.abort_request(r) for each expanded rid, and (3) do NOT set
aborted=True so the finally block still executes its defensive cleanup; also add
an n>1 unit test in test_tokenspeed_servicer.py mirroring
test_cancel_calls_abort_request to assert abort_request is invoked for each
expanded per-choice rid.
- Around line 535-546: In SubscribeKvEvents the current return appears before
the required yield which is confusing; keep the abort call but ensure the
function is still recognized as an async generator by moving the yield into an
unreachable branch (e.g. use "if False: yield" or "if False: yield
common_pb2.KvEventBatch()") after the abort so the function retains an
async-generator signature while aborting via context.abort; update the
SubscribeKvEvents body to call context.abort(...) and then have an unreachable
yield branch to satisfy the async-generator contract.
- Around line 559-617: The code currently does obj.bootstrap_port =
p.bootstrap_port or None which treats proto3 default 0 as "unset" and loses the
ability to distinguish a real 0 value; update handling so presence is explicit:
if the DisaggregatedParams proto declares bootstrap_port as optional, check
p.HasField("bootstrap_port") and set obj.bootstrap_port = p.bootstrap_port when
present (else None); if you cannot change the proto, add documentation/comments
about the asymmetry or introduce an explicit presence flag (e.g.
bootstrap_port_present) and use that to decide whether to set
obj.bootstrap_port. Ensure you modify the logic around p.bootstrap_port and
obj.bootstrap_port (and related proto definition) rather than changing unrelated
code.
- Around line 619-688: The code in _sampling_params_from_proto drops explicit
zero values by using truthy checks (e.g., if params.min_new_tokens:) and
similarly the non-optional proto scalar logprob_start_len will always appear
"present" but may be mishandled; change these checks to use HasField for fields
that must distinguish "unset" from explicit zero (e.g., replace truthy guards
for min_new_tokens and logprob_start_len with params.HasField("min_new_tokens")
/ params.HasField("logprob_start_len") and forward the raw proto value), and if
the proto currently defines those scalars as non-optional, update the proto
definitions to make them optional so HasField works.
In `@grpc_servicer/tests/conftest.py`:
- Around line 1-22: Update the module docstring to accurately describe what this
file does: remove or reword the claim about "declares an asyncio-mode default"
since asyncio_mode is configured in pyproject.toml, and instead state that the
file adds the package root to sys.path and registers the pytest marker via
pytest_configure (the function registering the "tokenspeed" marker); keep the
rest of the file unchanged.
In `@grpc_servicer/tests/test_tokenspeed_health_servicer.py`:
- Line 7: The import SimpleNamespace is unused in the test file; remove the
unused import statement "from types import SimpleNamespace" from
grpc_servicer/tests/test_tokenspeed_health_servicer.py (locate the top-level
imports in that file) to clean up the module namespace and update any imports if
necessary, then run the test suite to ensure nothing else depended on it.
In `@grpc_servicer/tests/test_tokenspeed_servicer.py`:
- Around line 529-555: Add a new async test similar to
test_cancel_calls_abort_request but set the request sampling params to n>1 (use
_make_generate_request and adjust req.sampling_params.n = 3 or desired count),
keep fake_engine.generate_fn = never_finish, start servicer.Generate(req, ctx)
and run it via asyncio.create_task(_drain(gen)), cancel the task and await the
CancelledError, then assert that each expanded rid string (e.g., "rid-1-n0",
"rid-1-n1", ..., matching the count) is present in fake_engine.aborted_rids;
this targets the expanded-rid path in Generate and uses the same helpers
(_drain, _make_generate_request, fake_engine.aborted_rids) so the new test
follows the existing pattern.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: fc6bc210-4187-4c99-bc26-b7c2ce2eb33a
📒 Files selected for processing (15)
- e2e_test/chat_completions/test_function_calling.py
- e2e_test/infra/constants.py
- e2e_test/infra/model_specs.py
- e2e_test/infra/worker.py
- grpc_servicer/pyproject.toml
- grpc_servicer/smg_grpc_servicer/tokenspeed/__init__.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
- grpc_servicer/tests/__init__.py
- grpc_servicer/tests/conftest.py
- grpc_servicer/tests/test_tokenspeed_health_servicer.py
- grpc_servicer/tests/test_tokenspeed_servicer.py
```python
def main(argv: list[str] | None = None) -> None:
    if argv is None:
        argv = sys.argv[1:]

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(name)s] %(levelname)s %(message)s",
    )

    server_args = prepare_server_args(argv)
    # The scheduler processes will read these env vars; make sure we ran
    # through TokenSpeed's shared env/resource setup path instead of
    # duplicating it here.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    asyncio.run(serve_grpc(server_args))
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Check that prepare_server_args isn't eagerly creating a loop (sanity check on TokenSpeed's API shape).
rg -nP -C3 'def prepare_server_args'
rg -nP 'asyncio\.(new_event_loop|get_event_loop|set_event_loop)\b' -C2
```

Repository: lightseekorg/smg | Length of output: 42

🏁 Script executed:

```bash
# First, let's find the file mentioned in the review comment
find . -type f -name "__main__.py" | head -20
```

Repository: lightseekorg/smg | Length of output: 154

🏁 Script executed:

```bash
# Search for prepare_server_args function definition across the codebase
rg "def prepare_server_args" -A 20
```

Repository: lightseekorg/smg | Length of output: 42

🏁 Script executed:

```bash
# Check for asyncio event loop operations in the entire codebase
rg "asyncio\.(new_event_loop|get_event_loop|set_event_loop)" -B 2 -A 2
```

Repository: lightseekorg/smg | Length of output: 532

🏁 Script executed:

```bash
# Look for uvloop imports
rg "uvloop" -B 2 -A 2
```

Repository: lightseekorg/smg | Length of output: 1235

🏁 Script executed:

```bash
# Read the full __main__.py file to understand imports and structure
cat -n grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py
```

Repository: lightseekorg/smg | Length of output: 1508

🏁 Script executed:

```bash
# Search for prepare_server_args in the tokenspeed module
find . -path "*/tokenspeed/runtime/server_args*" -type f
```

Repository: lightseekorg/smg | Length of output: 42

🏁 Script executed:

```bash
# Look for any asyncio.get_event_loop or asyncio.new_event_loop in tokenspeed modules
rg "asyncio\." grpc_servicer/smg_grpc_servicer/tokenspeed/ -A 2
```

Repository: lightseekorg/smg | Length of output: 1646

🏁 Script executed:

```bash
# Search for grpc imports in the codebase
rg "import.*grpc" grpc_servicer/smg_grpc_servicer/tokenspeed/ -B 2 -A 2
```

Repository: lightseekorg/smg | Length of output: 2361

🏁 Script executed:

```bash
# Check the serve_grpc implementation and its imports
cat -n grpc_servicer/smg_grpc_servicer/tokenspeed/server.py | head -50
```

Repository: lightseekorg/smg | Length of output: 2356
Move asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) to the beginning of main() before importing grpc.aio internals.
The serve_grpc function is imported at the module level (line 21), which indirectly imports the grpc module in server.py (line 23). This happens before asyncio.set_event_loop_policy() is called at line 37. The uvloop policy must be set before gRPC internals are initialized. Move the policy setting to the start of main(), before prepare_server_args() is called.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 24 - 38,
The uvloop event loop policy is being set after grpc/aio internals (via the
module-level serve_grpc import) are already initialized; move
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) to the very start of
main() before any grpc-related initialization by invoking it before
prepare_server_args(argv) and before any use of serve_grpc/other grpc-related
code so the policy is applied prior to gRPC internals being
imported/initialized.
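A sketch of the suggested ordering, assuming `prepare_server_args` itself does not touch asyncio: the gRPC-dependent import is deferred until after the loop policy is installed. Module paths follow this PR; the exact split is illustrative.

```python
import asyncio
import sys

import uvloop


def main(argv: list[str] | None = None) -> None:
    if argv is None:
        argv = sys.argv[1:]

    # Install the uvloop policy before anything gRPC-related is imported.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

    # Deferred imports: these pull in grpc.aio only after the policy is set.
    from tokenspeed.runtime.server_args import prepare_server_args
    from smg_grpc_servicer.tokenspeed.server import serve_grpc

    server_args = prepare_server_args(argv)
    asyncio.run(serve_grpc(server_args))
```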
```python
    # The scheduler processes will read these env vars; make sure we ran
    # through TokenSpeed's shared env/resource setup path instead of
    # duplicating it here.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    asyncio.run(serve_grpc(server_args))
```
Comment doesn't match the code below it.
The comment on lines 34-36 says "The scheduler processes will read these env vars; make sure we ran through TokenSpeed's shared env/resource setup path instead of duplicating it here" — but the immediately following lines set the uvloop event loop policy and call asyncio.run. No env vars are set or referenced here. The comment appears to be misplaced (or a leftover from an earlier iteration that set env vars between prepare_server_args and asyncio.run). Either move it above prepare_server_args(argv) where the env priming actually occurs (inside that call), or drop it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 34 - 38,
The comment about priming env vars is misplaced: it refers to the env/resource
setup that actually happens in prepare_server_args(argv), but it sits above
asyncio.set_event_loop_policy() and asyncio.run(serve_grpc(server_args)); either
move the comment to directly above the prepare_server_args(argv) call (or inside
prepare_server_args) so it documents the env priming, or remove the comment
entirely if it's stale; adjust text to reference prepare_server_args,
asyncio.set_event_loop_policy, and asyncio.run/serve_grpc to ensure clarity.
| """Pytest configuration for smg-grpc-servicer unit tests. | ||
|
|
||
| Adds the parent directory to ``sys.path`` so editable installs work | ||
| without needing ``pip install -e``, and declares an asyncio-mode default. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import pathlib | ||
| import sys | ||
|
|
||
| import pytest | ||
|
|
||
| _HERE = pathlib.Path(__file__).resolve().parent | ||
| _PKG_ROOT = _HERE.parent | ||
|
|
||
| if str(_PKG_ROOT) not in sys.path: | ||
| sys.path.insert(0, str(_PKG_ROOT)) | ||
|
|
||
|
|
||
| def pytest_configure(config: pytest.Config) -> None: | ||
| config.addinivalue_line("markers", "tokenspeed: tests that require TokenSpeed") |
Minor: docstring mentions an asyncio-mode default that isn't declared in this file.
The module docstring says it "declares an asyncio-mode default," but asyncio_mode = "auto" is actually set in grpc_servicer/pyproject.toml under [tool.pytest.ini_options]. Only the tokenspeed marker is registered here. Consider tightening the docstring to reflect the current behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/tests/conftest.py` around lines 1 - 22, Update the module
docstring to accurately describe what this file does: remove or reword the claim
about "declares an asyncio-mode default" since asyncio_mode is configured in
pyproject.toml, and instead state that the file adds the package root to
sys.path and registers the pytest marker via pytest_configure (the function
registering the "tokenspeed" marker); keep the rest of the file unchanged.
```python
        try:
            embedding: list[float] | None = None
            prompt_tokens = 0
            async for output in self.async_llm.generate_request(obj):
                # EmbeddingReqInput is non-streaming: the loop yields exactly
                # one dict at finish, carrying the embedding vector.
                embedding = output.get("embedding")
                prompt_tokens = output.get("meta_info", {}).get("prompt_tokens", 0)

            if embedding is None:
                await context.abort(grpc.StatusCode.INTERNAL, "Empty embedding result")
                return

            return sglang_scheduler_pb2.EmbedResponse(
                embedding=list(embedding),
                prompt_tokens=prompt_tokens,
                embedding_dim=len(embedding),
            )
        except grpc.aio.AbortError:
            raise
        except ValueError as e:
            logger.warning("Embed invalid request %s: %s", rid, e)
            await context.abort(grpc.StatusCode.INVALID_ARGUMENT, str(e))
        except Exception as e:
            logger.exception("Embed failed for request %s", rid)
            await context.abort(grpc.StatusCode.INTERNAL, str(e))
```
🟡 Nit: Embed is missing asyncio.CancelledError handling and finally cleanup, unlike Generate (lines 198–221). If a client disconnects or times out while the async for loop is awaiting the scheduler, CancelledError propagates unhandled and the rid leaks in rid_to_state. For embeddings this is unlikely to matter in practice (single forward pass, fast), but the inconsistency could bite once longer-running embedding models are onboarded.
Minimal fix — mirror the Generate pattern:
```python
try:
    ...
except asyncio.CancelledError:
    self.async_llm.abort_request(rid)
    raise
except grpc.aio.AbortError:
    raise
...
finally:
    state = self.async_llm.rid_to_state.get(rid)
    if state is not None and not getattr(state, "finished", False):
        self.async_llm.abort_request(rid)
```

34e6471 to 9056d9e (Compare)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9056d9ef25
```python
        known = rid in self.async_llm.rid_to_state
        try:
            self.async_llm.abort_request(rid)
            return sglang_scheduler_pb2.AbortResponse(
```
Abort all child RIDs for n>1 requests
When sampling_params.n > 1, _build_generate_req rewrites the request into per-choice IDs (<request_id>-n*), but Abort still checks and aborts only the original request_id. In that case known becomes false and the engine never receives aborts for the active child requests, so explicit aborts (for example router timeout/cleanup paths) report "not found" and generation continues consuming GPU work until natural completion.
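A hedged sketch of an n>1-aware lookup for Abort: child IDs follow the f"{rid}-n{i}" scheme described in this thread, `state_map` stands in for `async_llm.rid_to_state`, and the exact names are assumptions.

```python
def rids_to_abort(state_map: dict, rid: str) -> list[str]:
    # Snapshot the keys: the live map may mutate while we iterate.
    child_prefix = f"{rid}-n"
    return [
        r
        for r in list(state_map)
        if r == rid
        or (r.startswith(child_prefix) and r[len(child_prefix):].isdigit())
    ]
```

Abort can then compute `targets = rids_to_abort(...)`, report `known = bool(targets)`, and call `abort_request` once per target instead of only on the parent ID.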
9056d9e to 7aca171 (Compare)
Review addressed — pushed. Fixed (from CodeRabbit review):
Fixed (from
```python
    async def generate_request(self, obj):
        # Record the request so tests can assert on what was forwarded.
        rid = getattr(obj, "rid", None) or "no-rid"
        self.rid_to_state[rid] = _FakeState()
```
🔴 Important: test_cancel_aborts_all_n_children will crash here with TypeError: unhashable type: 'list' before it ever reaches the cancel path.
When sampling_params.n = 3, _build_generate_req sets obj.rid = ["rid-1-n0", "rid-1-n1", "rid-1-n2"]. This line then tries to use that list as a dict key (self.rid_to_state[rid]), which is illegal. The resulting TypeError propagates into Generate's except Exception handler → context.abort(INTERNAL, ...) → the test gets _FakeAbortError instead of CancelledError, and the assertion fails.
Fix: handle list rids the same way the real AsyncLLM does — register each child rid individually:
```diff
-        self.rid_to_state[rid] = _FakeState()
+        rid = getattr(obj, "rid", None) or "no-rid"
+        if isinstance(rid, list):
+            for r in rid:
+                self.rid_to_state[r] = _FakeState()
+        else:
+            self.rid_to_state[rid] = _FakeState()
```
```bash
    # Cache the built wheel — uv stores wheels under its cache, copy out.
    mkdir -p "$WHEEL_CACHE"
    python3 -c "import tokenspeed_kernel, os, shutil, glob; \
      d = os.path.dirname(tokenspeed_kernel.__file__); \
      site = os.path.dirname(d); \
      whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
      print('kernel install dir:', whls)" || true
```
🟡 Nit: The wheel cache write is a no-op — this block creates $WHEEL_CACHE and prints the dist-info path, but never actually copies a .whl file into the cache directory. Every CI run will fall through to the 30-minute source build because find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" on line 75 will always come up empty.
uv pip install from a local path doesn't produce a standalone wheel in the site-packages. You'd need to build the wheel explicitly first (e.g. uv pip wheel tokenspeed-kernel/python/ -w "$WHEEL_CACHE" --no-build-isolation) and then install from the cached artifact, or use uv's own cache directory (uv cache dir) to locate the built wheel and copy it out.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7aca171673
```bash
# Pin to a tested TokenSpeed ref so CI is reproducible. Bump explicitly when
# we want a newer runtime; keeping it pinned avoids surprise breakage when
# TokenSpeed main moves ahead of what our gRPC servicer was verified against.
TOKENSPEED_REF="${TOKENSPEED_REF:-main}"
```
Pin TokenSpeed install to immutable git ref
Defaulting TOKENSPEED_REF to main makes CI non-reproducible: the same SMG commit can build against different TokenSpeed revisions on different days, which can introduce unrelated breakages or behavior drift in the gRPC servicer tests. Because this script is used in PR and E2E setup, a moving default branch undermines bisectability and makes failures hard to attribute to SMG changes.
```bash
      print('kernel install dir:', whls)" || true
fi

# Step 4: scheduler (scikit-build-core + nanobind + CMake).
echo "Building tokenspeed-scheduler..."
uv pip install tokenspeed-scheduler/
```
Write built kernel wheel into configured cache
The cache read path looks for tokenspeed_kernel-*.whl in WHEEL_CACHE, but after a source build this branch only prints install metadata and never copies a wheel into that directory. As a result, later runs miss the cache and rebuild the CUDA kernel every time, which materially increases CI runtime and can push the TokenSpeed lane into timeout territory.
7aca171 to d264a5b (Compare)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
e2e_test/chat_completions/test_function_calling.py (1)
Lines 25-37: ⚠️ Potential issue | 🟡 Minor

Remove unused LLAMA_SYSTEM_MESSAGE constant and stale comment.

LLAMA_SYSTEM_MESSAGE is defined at line 28 but never referenced anywhere in the codebase. The comment claiming it is "Used by TestToolChoiceLlama below" is inaccurate — TestToolChoiceLlama (extending _TestToolChoiceBase) uses get_test_messages() instead, which does not reference this constant. Delete the constant and its associated comment.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@e2e_test/chat_completions/test_function_calling.py` around lines 25 - 37, delete the unused LLAMA_SYSTEM_MESSAGE constant and its preceding stale comment block from the test file; specifically remove the LLAMA_SYSTEM_MESSAGE definition and the comment claiming it's "Used by TestToolChoiceLlama", and verify there are no remaining references to LLAMA_SYSTEM_MESSAGE anywhere (confirm TestToolChoiceLlama/_TestToolChoiceBase and get_test_messages() are unchanged and do not rely on this constant).
♻️ Duplicate comments (4)
e2e_test/infra/worker.py (1)
Lines 441-442: ⚠️ Potential issue | 🟡 Minor

Docstring still enumerates only three engines.

The start_workers docstring lists engines as "sglang", "vllm", "trtllm"; please include "tokenspeed" to match the _build_cmd dispatch and the Worker.engine field comment on line 37.

📝 Suggested fix

```diff
-        engine: Runtime engine ("sglang", "vllm", "trtllm").
+        engine: Runtime engine ("sglang", "vllm", "trtllm", or "tokenspeed").
             If None, auto-detected from E2E_RUNTIME env var.
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@e2e_test/infra/worker.py` around lines 441 - 442, the start_workers docstring enumerates only "sglang", "vllm", "trtllm" but omits "tokenspeed", causing a mismatch with the _build_cmd dispatch and the Worker.engine field; update the start_workers docstring to include "tokenspeed" among the supported Runtime engine options so it matches the _build_cmd function's cases and the Worker.engine comment (see start_workers docstring, _build_cmd, Worker.engine).

grpc_servicer/tests/conftest.py (1)
Lines 1-5: ⚠️ Potential issue | 🟡 Minor

Docstring still references an asyncio-mode default not declared in this file.

asyncio_mode is configured in grpc_servicer/pyproject.toml under [tool.pytest.ini_options], not here. This file only prepends the package root to sys.path and registers the tokenspeed marker.

📝 Suggested fix

```diff
 """Pytest configuration for smg-grpc-servicer unit tests.

-Adds the parent directory to ``sys.path`` so editable installs work
-without needing ``pip install -e``, and declares an asyncio-mode default.
+Adds the parent directory to ``sys.path`` so tests work without an
+editable install, and registers custom pytest markers (e.g. ``tokenspeed``).
 """
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@grpc_servicer/tests/conftest.py` around lines 1 - 5, the module docstring incorrectly states it "declares an asyncio-mode default"; update the docstring to accurately describe what this file does: mention it prepends the package root to sys.path to support editable installs and registers the "tokenspeed" pytest marker, and remove the claim about configuring asyncio_mode (which is set in pyproject.toml under [tool.pytest.ini_options]). Keep the description concise and consistent with the actual behavior (sys.path modification and marker registration).

grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py (2)
Lines 33-38: ⚠️ Potential issue | 🟡 Minor

Move or remove the env/resource setup comment.

Lines 34-36 document the side effects of prepare_server_args(argv), but they sit above uvloop setup and asyncio.run(...). Put the comment directly above line 33 or drop it.

Proposed fix

```diff
-    server_args = prepare_server_args(argv)
     # The scheduler processes will read these env vars; make sure we ran
     # through TokenSpeed's shared env/resource setup path instead of
     # duplicating it here.
+    server_args = prepare_server_args(argv)
     asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 33 - 38, The comment about env/resource setup belongs with prepare_server_args(argv) but is currently above asyncio setup; move that comment so it immediately precedes the prepare_server_args(argv) call (or remove it entirely if redundant) to avoid implying it documents the uvloop/asyncio.run calls; update or relocate the comment near the prepare_server_args symbol and leave asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) and asyncio.run(serve_grpc(server_args)) untouched.
Lines 18-21: ⚠️ Potential issue | 🟠 Major

Set the uvloop policy before importing the gRPC server module.

serve_grpc is imported at module load, which imports server.py and grpc before line 37 applies the uvloop policy. Make serve_grpc a lazy import after asyncio.set_event_loop_policy(...); moving line 37 alone is not sufficient.

Proposed fix

```diff
 import uvloop
 from tokenspeed.runtime.server_args import prepare_server_args

-from smg_grpc_servicer.tokenspeed.server import serve_grpc
-
 def main(argv: list[str] | None = None) -> None:
     if argv is None:
         argv = sys.argv[1:]
@@
     server_args = prepare_server_args(argv)
-    # The scheduler processes will read these env vars; make sure we ran
-    # through TokenSpeed's shared env/resource setup path instead of
-    # duplicating it here.
     asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
+    from smg_grpc_servicer.tokenspeed.server import serve_grpc
+
     asyncio.run(serve_grpc(server_args))
```

Verification:

```bash
#!/bin/bash
# Expect: no module-level import of serve_grpc before set_event_loop_policy.
rg -nP 'from smg_grpc_servicer\.tokenspeed\.server import serve_grpc|asyncio\.set_event_loop_policy|import grpc' \
  grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py \
  grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
```

Also applies to: 33-38
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py` around lines 18 - 21, The module currently imports serve_grpc from smg_grpc_servicer.tokenspeed.server at import time which causes grpc (and its event loop behavior) to be imported before the uvloop policy is set; change this to a lazy import by removing the top-level "from smg_grpc_servicer.tokenspeed.server import serve_grpc" and instead import serve_grpc inside the main entrypoint after calling asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) (ensure uvloop is set via uvloop.install() or asyncio.set_event_loop_policy(...) first), so that serve_grpc and any grpc imports occur only after the uvloop policy is applied; verify no other module-level imports of serve_grpc or grpc remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@e2e_test/infra/model_specs.py`:
- Around line 71-75: The comment above the "Qwen/Qwen3-30B-A3B" model spec is
stale and should be updated to reflect that TestOpenAIServerFunctionCalling now
routes through "Qwen/Qwen3-4B"; edit the comment near the "Qwen/Qwen3-30B-A3B"
entry in e2e_test/infra/model_specs.py to remove or replace the reference to
TokenSpeed using this 30B model for TestOpenAIServerFunctionCalling and instead
note that the test uses "Qwen/Qwen3-4B" so future maintainers won't be directed
to the OOM-prone 30B model.
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/server.py`:
- Around line 162-205: The health flag is being set unconditionally; change the
flow so health_servicer.set_serving() is only called when the warmup actually
succeeded: for Generate ensure final is not None and final.HasField("complete"),
for Embed ensure resp.embedding_dim > 0; on failure (exceptions, missing
Complete, or empty embedding) log the error/warning and do not call
health_servicer.set_serving(). Keep the existing try/except/finally structure
and still close the channel in finally, but move the
health_servicer.set_serving() into the success branches (or gate it behind a
boolean like warmup_succeeded set true only on those success conditions).
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 386-404: The current child_prefix logic can match unrelated IDs
and iterates the live state map; fix by snapshotting the keys (e.g., keys =
list(self.async_llm.rid_to_state)) and compute targets as those equal to rid or
that start with child_prefix AND have only digits after the prefix (validate
suffix with isdigit), then call self.async_llm.abort_request(r) for each target;
this ensures only true `{rid}-n<int>` children are aborted and avoids
concurrent-mutation issues when iterating state_map.
In `@grpc_servicer/tests/test_tokenspeed_servicer.py`:
- Around line 167-179: The test fails because generate_request treats obj.rid as
a hashable key while _build_generate_req may expand obj.rid into a list when
sampling_params.n > 1; update generate_request (and the corresponding later
block around the same logic) to detect if rid is a list/iterable and register a
_FakeState for each child rid in self.rid_to_state (e.g., iterate over obj.rid
when it's a list and assign self.rid_to_state[child_rid] = _FakeState()), yield
outputs the same way, and set each child's .finished = True at the end so
cancellation/never_finish logic sees the per-child started/finished states.
Ensure you reference generate_request and self.rid_to_state when making the
change.
In `@scripts/ci_install_tokenspeed.sh`:
- Around line 24-30: The script currently defaults TOKENSPEED_REF to "main",
which breaks reproducibility; change the logic in ci_install_tokenspeed.sh so
TOKENSPEED_REF is not defaulted to a moving branch but required: remove the
default assignment for TOKENSPEED_REF="${TOKENSPEED_REF:-main}" and add an
explicit validation that exits with a non‑zero status if TOKENSPEED_REF is empty
or equals "main", prompting the caller to supply a pinned commit SHA or tag;
update any surrounding comment to state that a pinned ref is required.
- Around line 74-90: The build branch never writes a wheel into WHEEL_CACHE so
subsequent CI runs rebuild; change the build/install flow so the kernel wheel is
created into the cache and then installed from that cached file: when building
tokenspeed-kernel (the branch that currently runs uv pip install
tokenspeed-kernel/python/ and the python3 inspection that prints kernel install
dir), run pip wheel --wheel-dir "$WHEEL_CACHE" (or otherwise copy the produced
wheel into $WHEEL_CACHE) so a tokenspeed_kernel-*.whl appears in $WHEEL_CACHE,
then set CACHED_KERNEL_WHEEL to that file and install it (uv pip install
"$CACHED_KERNEL_WHEEL" or pip install "$CACHED_KERNEL_WHEEL") instead of leaving
the artifact only in the uv cache; ensure the existing CACHED_KERNEL_WHEEL
lookup, MAX_JOBS, FLASHINFER_CUDA_ARCH_LIST and tokenspeed-kernel reference
remain consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9bc812b8-569c-44c4-9d4e-7bad5a192d00
📒 Files selected for processing (19)
- .github/actions/setup-tokenspeed/action.yml
- .github/workflows/e2e-gpu-job.yml
- .github/workflows/pr-test-rust.yml
- e2e_test/chat_completions/test_function_calling.py
- e2e_test/infra/constants.py
- e2e_test/infra/model_specs.py
- e2e_test/infra/worker.py
- grpc_servicer/pyproject.toml
- grpc_servicer/smg_grpc_servicer/tokenspeed/__init__.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
- grpc_servicer/tests/__init__.py
- grpc_servicer/tests/conftest.py
- grpc_servicer/tests/test_tokenspeed_health_servicer.py
- grpc_servicer/tests/test_tokenspeed_servicer.py
- scripts/ci_install_tokenspeed.sh
```python
    async def generate_request(self, obj):
        # Record the request so tests can assert on what was forwarded.
        rid = getattr(obj, "rid", None) or "no-rid"
        self.rid_to_state[rid] = _FakeState()
        if self.generate_fn is not None:
            async for out in self.generate_fn(obj):
                self.last_receive_tstamp = 9999.0  # anything > tic
                yield out
            return
        for out in self.outputs:
            self.last_receive_tstamp = 9999.0
            yield out
        self.rid_to_state[rid].finished = True
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Expect: fake engine normalizes list rids before using them as dict keys.
sed -n '167,180p' grpc_servicer/tests/test_tokenspeed_servicer.py
sed -n '561,592p' grpc_servicer/tests/test_tokenspeed_servicer.py
```

Repository: lightseekorg/smg | Length of output: 1876

🏁 Script executed:

```bash
# Find _build_generate_req definition to confirm it expands rid to list
rg "_build_generate_req" grpc_servicer/tests/test_tokenspeed_servicer.py -A 20 -B 2
```

Repository: lightseekorg/smg | Length of output: 2304

🏁 Script executed:

```bash
# Check servicer.Generate to see how it calls fake_engine.generate_request
rg "class TokenSpeedSchedulerServicer" grpc_servicer -A 50 | head -80
```

Repository: lightseekorg/smg | Length of output: 4936

🏁 Script executed:

```bash
# Look for the actual Generate method implementation
fd "servicer.py" grpc_servicer --type f | xargs rg "def Generate" -A 30 | head -100
```

Repository: lightseekorg/smg | Length of output: 8735

🏁 Script executed:

```bash
# Find _build_generate_req implementation
rg "def _build_generate_req" grpc_servicer/tests/test_tokenspeed_servicer.py -A 50
```

Repository: lightseekorg/smg | Length of output: 42

🏁 Script executed:

```bash
# Also check if there's _build_generate_req in the actual servicer
fd "servicer.py" grpc_servicer/smg_grpc_servicer/tokenspeed --type f -exec rg "def _build_generate_req" {} -A 50 \;
```

Repository: lightseekorg/smg | Length of output: 290

🏁 Script executed:

```bash
# Check if rid is ever used as dict key elsewhere in the test to understand how it's meant to work
rg "rid_to_state\[" grpc_servicer/tests/test_tokenspeed_servicer.py
```

Repository: lightseekorg/smg | Length of output: 439

🏁 Script executed:

```bash
# Search for _build_generate_req more carefully
rg "_build_generate_req" grpc_servicer/tests/test_tokenspeed_servicer.py -B 2 -A 30
```

Repository: lightseekorg/smg | Length of output: 2606

🏁 Script executed:

```bash
# Also search in the actual servicer implementation
fd "servicer.py" grpc_servicer/smg_grpc_servicer/tokenspeed --type f | head -1 | xargs cat | grep -n "_build_generate_req" -A 30
```

Repository: lightseekorg/smg | Length of output: 42

🏁 Script executed:

```bash
# Look for where rid is expanded/created as a list
rg "rid.*\[.*n0" grpc_servicer/tests/test_tokenspeed_servicer.py
```

Repository: lightseekorg/smg | Length of output: 136

🏁 Script executed:

```bash
# Look for _build_generate_req in the actual servicer file
rg "def _build_generate_req" grpc_servicer/smg_grpc_servicer/tokenspeed/ -A 80
```

Repository: lightseekorg/smg | Length of output: 8350

🏁 Script executed:

```bash
# Check what the test says about how rid gets converted
rg "rid-1-n" grpc_servicer/tests/test_tokenspeed_servicer.py -B 5 -A 5
```

Repository: lightseekorg/smg | Length of output: 1300
Teach the fake engine to handle expanded rid lists.
When sampling_params.n > 1, _build_generate_req expands obj.rid to a list (e.g., ["rid-1-n0", "rid-1-n1", "rid-1-n2"]), but line 170 attempts to use it as a dict key. This raises TypeError: unhashable type: 'list' before never_finish() signals started, causing test_cancel_aborts_all_n_children to fail immediately instead of exercising cancellation logic.
Proposed fix

```diff
 async def generate_request(self, obj):
     # Record the request so tests can assert on what was forwarded.
     rid = getattr(obj, "rid", None) or "no-rid"
-    self.rid_to_state[rid] = _FakeState()
+    rids = rid if isinstance(rid, list) else [rid]
+    for one_rid in rids:
+        self.rid_to_state[one_rid] = _FakeState()
     if self.generate_fn is not None:
         async for out in self.generate_fn(obj):
             self.last_receive_tstamp = 9999.0  # anything > tic
             yield out
         return
     for out in self.outputs:
         self.last_receive_tstamp = 9999.0
         yield out
-    self.rid_to_state[rid].finished = True
+    for one_rid in rids:
+        if one_rid in self.rid_to_state:
+            self.rid_to_state[one_rid].finished = True
```

Also applies to: 561-592 (corresponding update needed at line 181 for consistency).
```bash
# Pin to a tested TokenSpeed ref so CI is reproducible. Bump explicitly when
# we want a newer runtime; keeping it pinned avoids surprise breakage when
# TokenSpeed main moves ahead of what our gRPC servicer was verified against.
TOKENSPEED_REF="${TOKENSPEED_REF:-main}"
TOKENSPEED_REPO="${TOKENSPEED_REPO:-https://github.com/lightseekorg/tokenspeed.git}"
TOKENSPEED_DIR="${TOKENSPEED_DIR:-/tmp/tokenspeed-src}"
WHEEL_CACHE="${TOKENSPEED_WHEEL_CACHE:-/tmp/tokenspeed-wheel-cache}"
```
🧩 Analysis chain
🏁 Script executed:

```bash
cat scripts/ci_install_tokenspeed.sh
```

Repository: lightseekorg/smg
Length of output: 5294
Defaulting TOKENSPEED_REF to "main" contradicts the reproducibility goal.
Line 27 defaults to a moving branch despite the comment stating this is pinned for reproducibility. Upstream TokenSpeed changes can break CI or silently alter runtime behavior without any changes to this repository.
Require explicit pinning to a commit SHA or tag instead:
Proposed fix
```diff
-# Pin to a tested TokenSpeed ref so CI is reproducible. Bump explicitly when
-# we want a newer runtime; keeping it pinned avoids surprise breakage when
-# TokenSpeed main moves ahead of what our gRPC servicer was verified against.
-TOKENSPEED_REF="${TOKENSPEED_REF:-main}"
+# Pin to a tested TokenSpeed commit/tag so CI is reproducible. Bump explicitly
+# when we want a newer runtime.
+: "${TOKENSPEED_REF:?Set TOKENSPEED_REF to the tested TokenSpeed commit SHA or tag}"
+if [ "$TOKENSPEED_REF" = "main" ]; then
+  echo "TOKENSPEED_REF must be a pinned commit SHA or tag, not main" >&2
+  exit 1
+fi
```
```bash
# Step 3: kernel (CUDA compile — the expensive one). Try the cached wheel first.
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" 2>/dev/null | head -1 || true)
if [ -n "$CACHED_KERNEL_WHEEL" ] && [ -f "$CACHED_KERNEL_WHEEL" ]; then
  echo "Installing cached tokenspeed-kernel wheel: $CACHED_KERNEL_WHEEL"
  uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
else
  echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
  MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
    uv pip install tokenspeed-kernel/python/ --no-build-isolation
  # Cache the built wheel — uv stores wheels under its cache, copy out.
  mkdir -p "$WHEEL_CACHE"
  python3 -c "import tokenspeed_kernel, os, shutil, glob; \
    d = os.path.dirname(tokenspeed_kernel.__file__); \
    site = os.path.dirname(d); \
    whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
    print('kernel install dir:', whls)" || true
fi
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Expect: the build branch creates/copies a tokenspeed_kernel wheel into WHEEL_CACHE.
rg -n 'WHEEL_CACHE|tokenspeed_kernel-\*\.whl|pip wheel|uv pip install tokenspeed-kernel' scripts/ci_install_tokenspeed.sh
```

Repository: lightseekorg/smg
Length of output: 420

🏁 Script executed:

```bash
sed -n '74,95p' scripts/ci_install_tokenspeed.sh
```

Repository: lightseekorg/smg
Length of output: 1226
Fix cache population: the build branch never writes wheels to $WHEEL_CACHE.
The Python script on lines 85–89 only prints the kernel install directory but doesn't copy the wheel into the cache. This means the first CI run builds from source and wastes the build; the second CI run rebuilds again (defeating the intended speedup).
Replace uv pip install with pip wheel --wheel-dir "$WHEEL_CACHE" to build the wheel directly into the cache, then install from the cached file:
Proposed fix
```diff
 else
   echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..."
-  MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
-    uv pip install tokenspeed-kernel/python/ --no-build-isolation
-  # Cache the built wheel — uv stores wheels under its cache, copy out.
   mkdir -p "$WHEEL_CACHE"
-  python3 -c "import tokenspeed_kernel, os, shutil, glob; \
-    d = os.path.dirname(tokenspeed_kernel.__file__); \
-    site = os.path.dirname(d); \
-    whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
-    print('kernel install dir:', whls)" || true
+  MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \
+    python3 -m pip wheel tokenspeed-kernel/python/ \
+      --no-build-isolation \
+      --wheel-dir "$WHEEL_CACHE"
+  CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" | head -1)
+  uv pip install "$CACHED_KERNEL_WHEEL" --no-build-isolation
 fi
```
```python
# Also used by TestOpenAIServerFunctionCalling as a TokenSpeed-supported
# substitute while TokenSpeed's model registry does not cover plain
# LlamaForCausalLM (only MoE / Eagle3 variants). Arch: Qwen3MoeForCausalLM.
```
🟡 Nit: This comment is stale — TestOpenAIServerFunctionCalling still targets meta-llama/Llama-3.2-1B-Instruct (line 118 of test_function_calling.py), and the new TokenSpeed subclass uses Qwen/Qwen3-4B, not Qwen3-30B-A3B. The comment describes an earlier design that was superseded when the tests switched to the 4B model to fit on a single 80GB GPU.
Suggested change:

```diff
-# Also used by TestOpenAIServerFunctionCalling as a TokenSpeed-supported
-# substitute while TokenSpeed's model registry does not cover plain
-# LlamaForCausalLM (only MoE / Eagle3 variants). Arch: Qwen3MoeForCausalLM.
+# Thinking/reasoning model (larger). Arch: Qwen3MoeForCausalLM.
+# function_calling / tool_choice features are listed for completeness —
+# active TokenSpeed function-calling tests use Qwen3-4B (see
+# TestOpenAIServerFunctionCallingTokenSpeed in test_function_calling.py).
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d264a5b8fa
```bash
# ── Clone TokenSpeed ────────────────────────────────────────────────────────
if [ ! -d "$TOKENSPEED_DIR" ]; then
  echo "Cloning TokenSpeed ${TOKENSPEED_REF} from ${TOKENSPEED_REPO}..."
  git clone --depth 1 --branch "$TOKENSPEED_REF" "$TOKENSPEED_REPO" "$TOKENSPEED_DIR"
```
Allow TOKENSPEED_REF to be a commit SHA
git clone --branch accepts only branch and tag names (per git clone -h: "checkout <branch> instead of the remote's HEAD"), so setting TOKENSPEED_REF to an immutable commit SHA makes first-run installs fail with "Remote branch ... not found". That blocks the pin-by-SHA workflow this installer is supposed to support on clean CI runners.
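One shape that accepts branches, tags, and commit SHAs alike — a sketch, not the script's current code; note that fetching an arbitrary SHA also requires the server to permit it (GitHub does, other hosts may gate it behind uploadpack.allowReachableSHA1InWant):

```bash
# init + fetch-by-ref instead of clone --branch, so raw SHAs work too
git init "$TOKENSPEED_DIR"
git -C "$TOKENSPEED_DIR" remote add origin "$TOKENSPEED_REPO"
git -C "$TOKENSPEED_DIR" fetch --depth 1 origin "$TOKENSPEED_REF"
git -C "$TOKENSPEED_DIR" checkout --detach FETCH_HEAD
```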
```bash
  git clone --depth 1 --branch "$TOKENSPEED_REF" "$TOKENSPEED_REPO" "$TOKENSPEED_DIR"
else
  echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
  (cd "$TOKENSPEED_DIR" && git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF")
```
Fast-forward reused clone to fetched TokenSpeed ref
In the reuse path, git fetch ... origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF" does not advance an existing local branch to the fetched remote tip. When TOKENSPEED_REF=main, warm runners can keep building an old local main while fresh runners build newer code, causing inconsistent CI behavior across nodes.
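CodeRabbit's comment further down proposes a `checkout -B` fix for this same path; an equivalent minimal form, as a sketch:

```bash
# Pin the reused clone to exactly what was fetched, regardless of local state.
(cd "$TOKENSPEED_DIR" \
  && git fetch --depth 1 origin "$TOKENSPEED_REF" \
  && git reset --hard FETCH_HEAD)
```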
Force-pushed from d264a5b to be83fd2.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be83fd282f
```bash
    --no-build-isolation

# Step 3: kernel (CUDA compile — the expensive one). Try the cached wheel first.
CACHED_KERNEL_WHEEL=$(find "$WHEEL_CACHE" -name "tokenspeed_kernel-*.whl" 2>/dev/null | head -1 || true)
```
Scope cached kernel wheel to current TokenSpeed ref
The cache lookup installs the first tokenspeed_kernel-*.whl found without checking TOKENSPEED_REF, so warm runners can reuse a wheel built from an older TokenSpeed revision while the scheduler/runtime code is checked out at a different ref. That mismatch can produce nondeterministic CI behavior (or runtime import/ABI failures) when the kernel package changes across refs; the cache keying needs to include the selected ref (and ideally interpreter/ABI) rather than using an unqualified glob.
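One way to scope the cache, as a sketch (variable names follow the script; the SOABI tag is one possible interpreter/ABI discriminator):

```bash
# Key the cache by TokenSpeed ref and Python ABI so a warm runner never
# reuses a wheel built from a different revision or interpreter.
PY_ABI=$(python3 -c "import sysconfig; print(sysconfig.get_config_var('SOABI'))")
WHEEL_CACHE="${TOKENSPEED_WHEEL_CACHE:-/tmp/tokenspeed-wheel-cache}/${TOKENSPEED_REF}/${PY_ABI}"
```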
```python
rid = getattr(obj, "rid", None) or "no-rid"
self.rid_to_state[rid] = _FakeState()
```
Handle list RIDs in fake AsyncLLM state tracking
For n>1 requests the servicer sets obj.rid to a list of child IDs, but this test double stores rid directly as a dict key; when rid is a list, self.rid_to_state[rid] = ... raises TypeError: unhashable type: 'list'. That makes the n>1 cancellation path fail before assertions run, so the newly added abort-all-children behavior is not actually testable in this fixture.
Dug into the 4 remaining
| Run | passed | failed | skipped |
|---|---|---|---|
| Before fix | 6 | 10 | 2 |
| After fix | 10 | 6 | 2 |
The 6 remaining failures are the pre-existing smg-gateway tool_choice=required/specific constraint-translation bug (same failure set on vLLM for the same model — unrelated to this PR). See the earlier reference runs in the PR description.
Commit: 7718843
Actionable comments posted: 5
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
e2e_test/infra/constants.py (1)
67-72: ⚠️ Potential issue | 🟡 Minor — Stale docstring in get_runtime.

The docstring still says "sglang or vllm" but the function now returns any of sglang, vllm, trtllm, tokenspeed. Update for clarity.

🔧 Proposed fix

```diff
 def get_runtime() -> str:
-    """Get the current test runtime (sglang or vllm).
+    """Get the current test runtime (sglang, vllm, trtllm, or tokenspeed).
```
♻️ Duplicate comments (1)
grpc_servicer/tests/test_tokenspeed_servicer.py (1)
167-179: ⚠️ Potential issue | 🔴 Critical — Fake engine still breaks test_cancel_aborts_all_n_children for list rids.

When sampling_params.n > 1, _build_generate_req sets obj.rid = ["rid-1-n0", "rid-1-n1", "rid-1-n2"]. The line self.rid_to_state[rid] = _FakeState() then raises TypeError: unhashable type: 'list' before never_finish() runs, so started.set() is never called and the newly-added test_cancel_aborts_all_n_children will hang on await started.wait() (or fail via the servicer's outer except Exception branch converting the TypeError into an INVALID/INTERNAL abort, not a CancelledError).

Please teach the fake engine to iterate list rids so the n>1 cancel test actually exercises the expanded-rid path it's meant to guard.

🛠 Proposed fix

```diff
 async def generate_request(self, obj):
     # Record the request so tests can assert on what was forwarded.
     rid = getattr(obj, "rid", None) or "no-rid"
-    self.rid_to_state[rid] = _FakeState()
+    rids = rid if isinstance(rid, list) else [rid]
+    for one_rid in rids:
+        self.rid_to_state[one_rid] = _FakeState()
     if self.generate_fn is not None:
         async for out in self.generate_fn(obj):
             self.last_receive_tstamp = 9999.0  # anything > tic
             yield out
         return
     for out in self.outputs:
         self.last_receive_tstamp = 9999.0
         yield out
-    self.rid_to_state[rid].finished = True
+    for one_rid in rids:
+        if one_rid in self.rid_to_state:
+            self.rid_to_state[one_rid].finished = True
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py`:
- Around line 57-67: The set_serving and set_not_serving methods mutate the
shared _serving_status dict from a non-event-loop thread while Check() reads it
on the asyncio loop; make this thread-safe by marshaling updates onto the event
loop (use the servicer's asyncio loop and call loop.call_soon_threadsafe to
schedule the dict assignments and logger.info), or alternatively protect
_serving_status with a threading.Lock and use it in set_serving, set_not_serving
and Check; locate the methods set_serving, set_not_serving and the Check method
to implement the chosen approach and ensure the warmup thread in server.py
(where set_serving is invoked) remains unchanged.
- Around line 77-78: The Check handler currently returns NOT_SERVING whenever
self.async_llm.gracefully_exit is true, which masks SERVICE_UNKNOWN for
unrecognized service names; in the health servicer's Check method (or the class
containing it), move the unknown-service validation logic (the branch that
returns SERVICE_UNKNOWN / NOT_FOUND for bogus service names) to run before the
gracefully_exit check so that unknown services return SERVICE_UNKNOWN even
during shutdown, preserving parity with the SGLang behavior.
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py`:
- Around line 35-53: The docstring Raises section is inaccurate: update the
Raises text in the launch_engine wrapper to state that RuntimeError is raised
when async_llm is None (i.e., the current node is not rank 0 in a multi-node
deployment or scheduler died during init) and separately note that actual rank-0
scheduler initialization failures are propagated by _launch_subprocesses (not
wrapped by this RuntimeError); reference _launch_subprocesses, async_llm, and
the RuntimeError guard in the body when editing the docstring.
In `@grpc_servicer/tests/test_tokenspeed_health_servicer.py`:
- Around line 12-15: Remove the redundant noqa comments on the top-level import
lines: delete the "# noqa: E402" trailing each import of health_pb2 and
TokenSpeedHealthServicer in the test module so the imports read normally (refer
to the import statements importing grpc_health.v1.health_pb2 and
smg_grpc_servicer.tokenspeed.health_servicer.TokenSpeedHealthServicer).
In `@scripts/ci_install_tokenspeed.sh`:
- Around line 60-63: The reuse branch fetch/checkout path can leave the working
tree on a stale local branch; after fetching origin "$TOKENSPEED_REF" in the
TOKENSPEED_DIR reuse block, explicitly reset the working tree to the fetched ref
(e.g., run a hard reset to FETCH_HEAD) so the clone always matches the remote
ref; update the commands in the reuse branch that reference TOKENSPEED_DIR and
TOKENSPEED_REF to perform git fetch --depth 1 origin "$TOKENSPEED_REF" followed
by a git reset --hard FETCH_HEAD (and optionally git clean -fd if you need to
remove untracked files) to ensure the working tree is exactly the fetched
commit.
---
Outside diff comments:
In `@e2e_test/infra/constants.py`:
- Around line 67-72: Update the stale docstring for get_runtime to list the
current allowed runtime values (sglang, vllm, trtllm, tokenspeed) and clarify
that it reads the E2E_RUNTIME environment variable with a default of "sglang";
modify the docstring text in the get_runtime function to reflect these exact
runtime names and the default behavior so the documentation matches the
implementation.
---
Duplicate comments:
In `@grpc_servicer/tests/test_tokenspeed_servicer.py`:
- Around line 167-179: The fake engine's generate_request currently treats
obj.rid as a single hashable key and does self.rid_to_state[rid] = _FakeState(),
which fails when _build_generate_req sets obj.rid to a list for
sampling_params.n > 1; update generate_request to detect if rid is a
list/iterable and iterate over each element, creating an entry
self.rid_to_state[child_rid] = _FakeState() for each child_rid (and mark each
.finished appropriately) so started.set()/never_finish() can run and the
test_cancel_aborts_all_n_children path exercises the expanded-rid behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8c93b593-996b-44c3-8529-c06dbfd6418f
📒 Files selected for processing (19)
- .github/actions/setup-tokenspeed/action.yml
- .github/workflows/e2e-gpu-job.yml
- .github/workflows/pr-test-rust.yml
- e2e_test/chat_completions/test_function_calling.py
- e2e_test/infra/constants.py
- e2e_test/infra/model_specs.py
- e2e_test/infra/worker.py
- grpc_servicer/pyproject.toml
- grpc_servicer/smg_grpc_servicer/tokenspeed/__init__.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/__main__.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
- grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
- grpc_servicer/tests/__init__.py
- grpc_servicer/tests/conftest.py
- grpc_servicer/tests/test_tokenspeed_health_servicer.py
- grpc_servicer/tests/test_tokenspeed_servicer.py
- scripts/ci_install_tokenspeed.sh
```python
def set_serving(self) -> None:
    """Flip both services to SERVING (call after successful warmup)."""
    self._serving_status[self.OVERALL_SERVER] = health_pb2.HealthCheckResponse.SERVING
    self._serving_status[self.SGLANG_SERVICE] = health_pb2.HealthCheckResponse.SERVING
    logger.info("TokenSpeed gRPC health status -> SERVING")

def set_not_serving(self) -> None:
    """Flip both services to NOT_SERVING (call on shutdown)."""
    self._serving_status[self.OVERALL_SERVER] = health_pb2.HealthCheckResponse.NOT_SERVING
    self._serving_status[self.SGLANG_SERVICE] = health_pb2.HealthCheckResponse.NOT_SERVING
    logger.info("TokenSpeed gRPC health status -> NOT_SERVING")
```
🧹 Nitpick | 🔵 Trivial
set_serving/set_not_serving are called from a non-loop thread.
server.py invokes health_servicer.set_serving() from the warmup daemon thread (line 205 there), while Check() reads _serving_status on the asyncio event loop. CPython dict item assignment is GIL-atomic, so this is practically safe today, but it becomes a subtle bug the moment the implementation grows (e.g., multi-field transitions, logging ordering). Consider either documenting the "called from any thread" contract here or marshaling updates via loop.call_soon_threadsafe.
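A sketch of the marshaling option — it assumes the servicer captured its running loop as self._loop at startup; the other names mirror the excerpt above:

```python
def set_serving(self) -> None:
    """Thread-safe variant: schedule the status flip on the servicer's loop."""

    def _flip() -> None:
        self._serving_status[self.OVERALL_SERVER] = health_pb2.HealthCheckResponse.SERVING
        self._serving_status[self.SGLANG_SERVICE] = health_pb2.HealthCheckResponse.SERVING
        logger.info("TokenSpeed gRPC health status -> SERVING")

    # call_soon_threadsafe may be called from any thread; _flip then runs on
    # the event-loop thread, so Check() never races the update.
    self._loop.call_soon_threadsafe(_flip)
```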
```python
if self.async_llm.gracefully_exit:
    return health_pb2.HealthCheckResponse(status=health_pb2.HealthCheckResponse.NOT_SERVING)
```
gracefully_exit short-circuit returns NOT_SERVING for unknown services.
During shutdown, a Check for a bogus service name will return NOT_SERVING rather than SERVICE_UNKNOWN / NOT_FOUND. Minor — most clients won't observe it, and the response is still "don't route here" — but if parity with the SGLang health servicer matters, consider reordering the unknown-service check before the gracefully_exit gate.
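A sketch of the reorder (method body only; _serving_status and async_llm mirror the excerpts above, and the grpc.aio context API is standard):

```python
import grpc
from grpc_health.v1 import health_pb2

async def Check(self, request, context):
    # Validate the service name first so a bogus name gets NOT_FOUND even
    # while the server is draining.
    if request.service not in self._serving_status:
        await context.abort(grpc.StatusCode.NOT_FOUND, f"unknown service: {request.service!r}")
    if self.async_llm.gracefully_exit:
        return health_pb2.HealthCheckResponse(status=health_pb2.HealthCheckResponse.NOT_SERVING)
    return health_pb2.HealthCheckResponse(status=self._serving_status[request.service])
```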
```python
from grpc_health.v1 import health_pb2  # noqa: E402
from smg_grpc_servicer.tokenspeed.health_servicer import (  # noqa: E402
    TokenSpeedHealthServicer,
)
```
🧹 Nitpick | 🔵 Trivial
Unnecessary # noqa: E402.
These imports are at the top of the module (no preceding non-import code), so E402 won't fire. The noqa directives are redundant.
🔧 Proposed fix
```diff
-from grpc_health.v1 import health_pb2  # noqa: E402
-from smg_grpc_servicer.tokenspeed.health_servicer import (  # noqa: E402
+from grpc_health.v1 import health_pb2
+from smg_grpc_servicer.tokenspeed.health_servicer import (
     TokenSpeedHealthServicer,
 )
```
```bash
else
  echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
  (cd "$TOKENSPEED_DIR" && git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF")
fi
```
Reuse path does not advance to the latest ref.
When the clone directory already exists, git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF" leaves the working tree on the previously checked-out commit if the local branch already exists and diverges from the remote — checkout won't fast-forward. On a long-lived self-hosted runner, CI can silently build a stale TokenSpeed. Reset explicitly to FETCH_HEAD:
🔧 Proposed fix
```diff
-else
-  echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
-  (cd "$TOKENSPEED_DIR" && git fetch --depth 1 origin "$TOKENSPEED_REF" && git checkout "$TOKENSPEED_REF")
-fi
+else
+  echo "TokenSpeed clone exists at $TOKENSPEED_DIR, reusing"
+  (cd "$TOKENSPEED_DIR" \
+    && git fetch --depth 1 origin "$TOKENSPEED_REF" \
+    && git checkout -B "$TOKENSPEED_REF" FETCH_HEAD)
+fi
```
Reorganized into a 3-PR stack

@CatherineSue per your review, split this into three reviewable PRs:

PR #1464 and #1465 are open as draft until this PR is in shape to merge.

What changed in this PR vs. the previous tip

Verification (this PR)

Other reviewer notes
…dule

The 5 ``build_*_sampling_params_from_*`` helpers (chat / responses / messages / completion / plain) plus their constraint helpers were sitting on ``SglangSchedulerClient`` even though the OpenAI mapping is backend-neutral. TokenSpeed's client was reaching across to call them through ``SglangSchedulerClient::*``, which suggested SGLang owned the OpenAI→sampling translation, when it doesn't.

Move them to ``crate::sampling_params`` as free functions. Both the SGLang and TokenSpeed clients (and any future client that wants the same mapping) now call ``crate::sampling_params::build_*`` directly.

The return type is still ``sglang::SamplingParams`` because that proto happens to be the most permissive shape across our supported backends; TokenSpeed translates to its own slimmer shape at the wire boundary. When a backend grows a sampling field SGLang lacks, this is the place to add it.

No behavior change. Tests stay green; the call sites in ``build_generate_request_from_*`` are mechanically updated.

Signed-off-by: key4ng <rukeyang@gmail.com>
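For illustration, the free-function shape this commit describes might look like the following — a sketch with assumed type and field names, not the actual smg code:

```rust
// Lives in crate::sampling_params; any backend client can call it directly.
pub fn build_chat_sampling_params_from_request(
    req: &ChatCompletionRequest,
) -> sglang::SamplingParams {
    sglang::SamplingParams {
        temperature: req.temperature.unwrap_or(1.0),
        top_p: req.top_p.unwrap_or(1.0),
        // ...remaining OpenAI fields mapped the same way...
        ..Default::default()
    }
}
```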
…ation)
TokenSpeed previously rode the SGLang IR arms in proto_wrapper.rs:
``TokenSpeedSchedulerClient::generate()`` accepted ``sglang::GenerateRequest``,
the streaming response was translated back into ``sglang::GenerateResponse``
at the wire boundary, and the router dispatched through
``ProtoGenerateRequest::Sglang``. This let TokenSpeed reuse the existing
match arms but coupled it to SGLang's evolving schema — every SGLang field
addition forced a TokenSpeed translator stub (most recently
``default_sampling_params_json``).
Add native TokenSpeed arms to the router IR:
- ``ProtoGenerateRequest::TokenSpeed(Box<tokenspeed::GenerateRequest>)``
- ``ProtoGenerateResponse::TokenSpeed(Box<tokenspeed::GenerateResponse>)``
- ``ProtoGenerateStreamChunk::TokenSpeed(tokenspeed::GenerateStreamChunk)``
- ``ProtoGenerateComplete::TokenSpeed(tokenspeed::GenerateComplete)``
…with ``as_tokenspeed`` / ``as_tokenspeed_mut`` / ``is_tokenspeed`` accessors
mirroring the existing per-backend pattern (Mlx / Vllm / Trtllm) and
``TokenSpeed`` arms in every aggregator method (``token_ids``, ``index``,
``output_logprobs``, ``input_logprobs``, ``prompt_tokens``,
``completion_tokens``, ``cached_tokens``, ``finish_reason``,
``matched_stop_json``, ``output_ids``, ``set_stream``, ``request_id``,
``set_max_tokens_for_prefill``, ``clear_mm_inputs``,
``set_kv_transfer_params``, ``kv_transfer_params``).
Client-side:
- ``TokenSpeedSchedulerClient::generate()`` now takes
``tokenspeed_proto::GenerateRequest``
- ``AbortOnDropStream::Item`` is ``tokenspeed_proto::GenerateResponse``
- The 5 ``build_*_request`` builders return native ``tokenspeed_proto``
types
- ``translate::generate_request`` / ``generate_response`` /
``stream_chunk`` / ``complete`` / ``output_logprobs`` are gone — the
only translation kept is ``translate::sampling_params`` (a thin field
map) plus the unary RPC adapters (``model_info`` / ``server_info`` /
``loads``), which still produce SGLang shapes because the router's
``ModelInfo`` / ``ServerInfo`` enums consume those — that's a
separate cleanup.
Router-side:
- ``client.rs::generate()`` dispatch arm now matches
``(Self::TokenSpeed(_), ProtoGenerateRequest::TokenSpeed(_))``.
- The 5 ``build_*_request`` paths in ``GrpcClient`` wrap into
``ProtoGenerateRequest::TokenSpeed`` instead of ``::Sglang``.
- ``harmony/stages/request_building.rs`` builds
``ProtoGenerateRequest::TokenSpeed`` and grew a ``TokenSpeed`` arm
in the Harmony stop-token injection match.
- ``common/stages/helpers.rs::apply_sampling_defaults_to_generate_request``
early-returns for TokenSpeed (alongside Trtllm) since neither
backend plumbs sampling defaults through today; the explicit arm
keeps the match exhaustive.
PD-disagg paths (``response_collection.rs``) remain SGLang-keyed — the
``if let ProtoGenerateComplete::Sglang(...)`` checks simply won't match
TokenSpeed responses, which is the correct behavior since TokenSpeed
doesn't ship PD-disaggregation.
Verification:
- ``cargo +nightly fmt --all -- --check`` passes
- ``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`` passes
- ``cargo check -p smg --bin smg`` passes
This addresses the architectural concern raised on #1351: SGLang's proto
shouldn't be the de-facto router IR. Each backend now has its own arm,
matching how vLLM / MLX / TRT-LLM are integrated.
Signed-off-by: key4ng <rukeyang@gmail.com>
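A sketch of the per-backend IR shape this commit describes (variant payload types assumed from the message above, not copied from the source):

```rust
enum ProtoGenerateRequest {
    Sglang(Box<sglang::GenerateRequest>),
    Vllm(Box<vllm::GenerateRequest>),
    Trtllm(Box<trtllm::GenerateRequest>),
    Mlx(Box<mlx::GenerateRequest>),
    TokenSpeed(Box<tokenspeed::GenerateRequest>),
}

impl ProtoGenerateRequest {
    // Accessors mirroring the existing per-backend pattern.
    fn as_tokenspeed(&self) -> Option<&tokenspeed::GenerateRequest> {
        match self {
            Self::TokenSpeed(req) => Some(req.as_ref()),
            _ => None,
        }
    }

    fn is_tokenspeed(&self) -> bool {
        matches!(self, Self::TokenSpeed(_))
    }
}
```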
Adds the Python servicer that runs alongside a TokenSpeed scheduler
process and serves the gRPC protocol PR1 introduced. Includes:
- the async scheduler servicer (Generate/HealthCheck/Abort/
GetModelInfo/GetServerInfo/GetLoads), with cancellation handling
for streaming, non-streaming, channel-close, and n>1 paths
- a health-service bridge that flips SERVING/NOT_SERVING based on
scheduler liveness (deep probe with bounded staleness)
- a scheduler launcher that boots TokenSpeed's AsyncLLM in-process
- the ``python -m smg_grpc_servicer.tokenspeed`` entrypoint
- real ``GetLoads`` plumbing backed by ``AsyncLLM.get_load()`` so
router-side load balancing reflects scheduler-side metrics
- 57 unit tests covering the servicer, health service, proto
conversion, finish reasons, sampling params, streaming/non-
streaming generation, abort/cancel (incl. n>1), model/server
info, and load metrics
This is part 2 of 3 splitting #1351:
- PR1: Rust gRPC + protocol (merged first)
- PR2 (this): Python servicer + unit tests
- PR3: CI workflows + e2e tests
Stacked on PR1 — the servicer imports the proto stubs PR1 generates
from ``crates/grpc_client/proto/tokenspeed_scheduler.proto``.
Fixes a 🔴 critical from review on #1351:
- FakeAsyncLLM.generate_request crashed with
``TypeError: unhashable type: 'list'`` when n>1, because
``_build_generate_req`` rewrites ``rid`` to a list of per-choice
ids. The fake engine now registers state for each child rid, so
``test_cancel_aborts_all_n_children`` exercises the cancel sweep
instead of dying at setup.
Signed-off-by: key4ng <rukeyang@gmail.com>
Wires TokenSpeed into CI and the GPU e2e suite:
- ``.github/actions/setup-tokenspeed`` composite action and
``scripts/ci_install_tokenspeed.sh`` to source-install TokenSpeed
(kernel + scheduler) at a pinned ref, with a wheel cache lookup so
repeat runs skip the ~20 min compile
- e2e-gpu-job.yml: add a tokenspeed engine lane, gated on secret
access so forked PRs skip cleanly
- pr-test-rust.yml: install the same proto deps so Rust-only changes
that touch ``crates/grpc_client`` still cover the tokenspeed proto
- e2e_test infra: ``constants``, ``hooks``, ``worker``, and
``model_specs`` learn about a ``tokenspeed`` runtime alongside
sglang/vllm/trtllm; ``worker.py`` adds the launch builder; the
suite-wide ``@pytest.mark.engine(...)`` markers expand to include
tokenspeed
- Function-calling and tool_choice e2e suites swap to
``Qwen/Qwen3-4B-Instruct-2507`` for tool-call coverage (the Qwen3
family is what TokenSpeed's model registry currently supports)
This is part 3 of 3 splitting #1351:
- PR1: Rust gRPC + protocol
- PR2: Python servicer + unit tests
- PR3 (this): CI workflows + e2e tests
Stacked on PR2. e2e wiring exercises both the Rust router from PR1 and
the Python servicer from PR2 against a live TokenSpeed worker.
Addresses CatherineSue's review on #1351:
- drop the verbose Qwen3-4B docstring on ``TestToolChoiceQwen`` —
that context belongs in the PR description, not in the test file
Signed-off-by: key4ng <rukeyang@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 76bf5720e6
```diff
-if matches!(request, ProtoGenerateRequest::Trtllm(_)) {
+if matches!(
+    request,
+    ProtoGenerateRequest::Trtllm(_) | ProtoGenerateRequest::TokenSpeed(_)
```
Apply sampling defaults to TokenSpeed requests
When a selected TokenSpeed worker has a default_sampling_params_json label and the client omits fields like temperature or top_p, this early return skips the default-injection path entirely. The TokenSpeed request builders already populate fallback values before dispatch, so the model/worker-preferred defaults that are applied for SGLang/vLLM/MLX are silently ignored for TokenSpeed, changing generation behavior for workers that rely on those labels.
Fixed in c478f1ac. Dropped TokenSpeed from the early-return, added a
ProtoGenerateRequest::TokenSpeed(req) match arm calling a new
apply_tokenspeed_sampling_defaults helper. Defaults now flow
end-to-end: translate::model_info populates
default_sampling_params_json (sibling thread), discovery flattens it
into the label map, and the injector applies the masked fields with
Some(value) per TokenSpeed's optional scalar shape.
```rust
if matches!(
    request,
    ProtoGenerateRequest::Trtllm(_) | ProtoGenerateRequest::TokenSpeed(_)
) {
    return;
}
```
🟡 Nit: This early-return is a behavioral regression from the previous code. Before this push, TokenSpeed requests were wrapped as ProtoGenerateRequest::Sglang, so they fell into the SGLang arm below and received model-discovered sampling defaults (e.g., temperature=0.7 from generation_config.json). Now that TokenSpeed has its own variant, defaults are skipped entirely.
Concretely: if a model publishes temperature=0.7 in its generation config and a user sends a chat request without specifying temperature, SGLang/vLLM/MLX workers will use 0.7, but TokenSpeed workers will use the hardcoded 1.0 from the sampling-params builder.
This is straightforward to fix (write an apply_tokenspeed_sampling_defaults similar to the existing per-backend functions), but since TokenSpeed is still in development I understand deferring it. Worth tracking explicitly so it doesn't get lost.
Fixed in c478f1ac rather than deferred — good catch on the
regression. apply_sampling_defaults_to_generate_request no longer
early-returns for TokenSpeed; there's a dedicated
apply_tokenspeed_sampling_defaults mirroring the SGLang/vLLM/MLX
helpers but writing Some(value) because every TokenSpeed
SamplingParams scalar is optional on the wire (the servicer's
HasField() checks rely on that set-vs-unset distinction).
Paired with the translate::model_info fix in the same commit so
default_sampling_params_json carries the JSON the discovery path
expects.
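A sketch of the injector described in this thread — field and type names are assumed from the discussion, not copied from the smg source:

```rust
fn apply_tokenspeed_sampling_defaults(
    req: &mut tokenspeed::GenerateRequest,
    defaults: &serde_json::Map<String, serde_json::Value>,
) {
    let Some(params) = req.sampling_params.as_mut() else { return };
    // Every TokenSpeed sampling scalar is `optional` on the wire, so None
    // means "client left it unset"; writing Some(v) preserves the
    // set-vs-unset distinction the servicer's HasField() checks rely on.
    if params.temperature.is_none() {
        if let Some(v) = defaults.get("temperature").and_then(|v| v.as_f64()) {
            params.temperature = Some(v as f32);
        }
    }
    if params.top_p.is_none() {
        if let Some(v) = defaults.get("top_p").and_then(|v| v.as_f64()) {
            params.top_p = Some(v as f32);
        }
    }
}
```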
…ection
The chat-template tool-shape pre-processor was correct for Kimi-K2.5
(BFCL accuracy +6 pp on simple_python, +24 pp on parallel_multiple)
but breaks Mistral chat templates: their template at line 32 iterates
``tool.function.parameters.properties.items()``, which raises
``unknown method: undefined has no method named items`` once we unwrap
``{"type": "function", "function": {...}}`` into the bare inner dict.
The shape a chat template expects is template-dependent, not
engine-dependent. Reverting the unconditional unwrap; full rationale,
accuracy data, and proposed per-model fix in
docs/proposals/2026-05-09-deferred-chat-template-tools-strip.md.
Affected CI lane: e2e_test/chat_completions/test_function_calling.py::TestToolChoiceMistral
(20 tests failing with chat:32 render error).
Signed-off-by: key4ng <rukeyang@gmail.com>
Cuts ~125 lines of doc-comment / inline-rationale prose without losing information. Hot spots:
- tokenspeed_scheduler.rs module doc (28 → 7 lines)
- tokenspeed_scheduler.proto service header + SamplingParams comment + per-field rationale (~40 lines)
- sampling_params.rs module doc (14 → 4 lines)
- translate::sampling_params explainer (12 → 3 lines)
- inline arm comments in client.rs / multimodal.rs / harmony/stages / common/stages that just re-stated what the code already shows

Behavior unchanged. ``cargo +nightly fmt --check`` passes; ``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`` passes.

Signed-off-by: key4ng <rukeyang@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8478a636ba
Tightens the comments introduced (or modified) by this PR to describe behavior directly. No behavior change; literal type / path references left intact.

Files touched:
- crates/grpc_client/build.rs
- crates/grpc_client/proto/tokenspeed_scheduler.proto
- crates/grpc_client/src/sampling_params.rs
- crates/grpc_client/src/tokenspeed_scheduler.rs
- model_gateway/src/routers/grpc/client.rs
- model_gateway/src/routers/grpc/proto_wrapper.rs

Signed-off-by: key4ng <rukeyang@gmail.com>
Force-pushed from 8057d10 to 656f1c2.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 656f1c2e98
The IR refactor split TokenSpeed off the SGLang request arm, which left two gaps in the sampling-defaults path:

1. `translate::model_info` populated the `preferred_sampling_params` label slot from TokenSpeed's response but left `default_sampling_params_json` empty. Worker discovery reads only the latter, so model-published defaults never reached the label map and were invisible to the request-stage injector.
2. `apply_sampling_defaults_to_generate_request` early-returned for `ProtoGenerateRequest::TokenSpeed(_)`. Even if a worker's labels carried defaults, the injector skipped the TokenSpeed arm — so a model publishing `temperature=0.7` in its generation config would apply to the other engines but TokenSpeed would fall through to the hardcoded 1.0 in the request builder.

Both fixed:
- `tokenspeed_scheduler.rs`: surface `preferred_sampling_params` in both `preferred_sampling_params` and `default_sampling_params_json` slots so the discovery path picks it up.
- `helpers.rs`: drop TokenSpeed from the early-return, add a `TokenSpeed(req)` match arm calling a new `apply_tokenspeed_sampling_defaults`. TokenSpeed's wire declares every sampling scalar as `optional`, so the helper writes `Some(value)` rather than the bare value — preserving the set-vs-unset distinction the servicer's `HasField()` checks rely on (sketched below).

Verification: `cargo +nightly fmt --all -- --check` and `cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings` both pass.

Signed-off-by: key4ng <rukeyang@gmail.com>
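As a minimal sketch of that set-vs-unset rule (hypothetical trimmed-down types; the real helper is `apply_tokenspeed_sampling_defaults` operating on the generated proto structs):

```rust
// Hypothetical trimmed-down shape; TokenSpeed's wire marks every sampling
// scalar `optional`, so unset fields arrive as None.
struct SamplingParams {
    temperature: Option<f32>,
    top_p: Option<f32>,
}

// Fill a slot from a model-published default only if the client left it
// unset, writing Some(v) so set-vs-unset survives for HasField()-style checks.
fn apply_default(slot: &mut Option<f32>, default: Option<f32>) {
    if slot.is_none() {
        *slot = default;
    }
}

fn main() {
    let mut params = SamplingParams { temperature: None, top_p: Some(0.9) };
    apply_default(&mut params.temperature, Some(0.7)); // injected default
    apply_default(&mut params.top_p, Some(0.5));       // client value wins
    assert_eq!(params.temperature, Some(0.7));
    assert_eq!(params.top_p, Some(0.9));
}
```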
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c478f1acd7
- Collapse the duplicate 2-line crate-level docstring into a single neutral line covering all supported backends.
- Drop the 5-line TokenSpeed re-export rationale block; the design rationale already lives in `tokenspeed_scheduler.rs`'s module doc.

Signed-off-by: key4ng <rukeyang@gmail.com>
Description
Problem
SMG has no first-class TokenSpeed backend. Routing to a TokenSpeed worker requires the gateway to speak TokenSpeed's dedicated gRPC protocol, plumb it through the router's per-backend IR, and pick TokenSpeed up correctly during runtime detection.
Solution
This PR (Part 1 of 3) adds the Rust side of TokenSpeed support: a self-contained `tokenspeed.grpc.scheduler.TokenSpeedScheduler` proto, a `TokenSpeedSchedulerClient` that produces and consumes native `tokenspeed_proto` types end-to-end, dedicated `ProtoGenerateRequest::TokenSpeed` / `ProtoGenerateStreamChunk::TokenSpeed` / `ProtoGenerateComplete::TokenSpeed` IR arms in the router (same shape as the existing per-backend arms), and the `RuntimeType::TokenSpeed` variant.

PR1 is functionally inert without PR2 — the router can dial a TokenSpeed worker, but the worker process is in PR2's Python servicer.
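As a rough illustration of the IR shape described above (hypothetical, heavily trimmed types, not the real definitions in `proto_wrapper.rs`):

```rust
// Hypothetical, heavily trimmed stand-ins for the native per-backend types.
struct SglangGenerateRequest;
struct TokenSpeedGenerateRequest;

// Each backend gets its own IR arm, so native proto types flow end-to-end
// without cross-backend conversion.
enum ProtoGenerateRequest {
    Sglang(SglangGenerateRequest),
    TokenSpeed(TokenSpeedGenerateRequest),
}

// Dispatch matches on the arm and hands the request to the matching
// scheduler client.
fn dispatch(req: ProtoGenerateRequest) -> &'static str {
    match req {
        ProtoGenerateRequest::Sglang(_) => "SglangSchedulerClient",
        ProtoGenerateRequest::TokenSpeed(_) => "TokenSpeedSchedulerClient",
    }
}

fn main() {
    let req = ProtoGenerateRequest::TokenSpeed(TokenSpeedGenerateRequest);
    assert_eq!(dispatch(req), "TokenSpeedSchedulerClient");
}
```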
3-PR Stack
- PR1 (this): Rust gRPC + protocol — base: `main`
- PR2: Python servicer (`grpc_servicer/smg_grpc_servicer/tokenspeed/`) + unit tests — base: `feat/grpc-servicer-tokenspeed`
- PR3: CI workflows + e2e tests — base: `feat/grpc-tokenspeed-servicer`

Changes
- `crates/grpc_client/proto/tokenspeed_scheduler.proto` — fully self-contained wire definition (package `tokenspeed.grpc.scheduler`)
- `crates/grpc_client/src/tokenspeed_scheduler.rs` — `TokenSpeedSchedulerClient` returning native `tokenspeed_proto::*` types
- `crates/protocols/src/worker.rs` — `RuntimeType::TokenSpeed` variant
- `model_gateway/src/routers/grpc/proto_wrapper.rs` — dedicated `TokenSpeed` arms across the IR enums + per-arm accessors
- `model_gateway/src/routers/grpc/client.rs` — dispatch routes the new variant through `TokenSpeedSchedulerClient`
- `model_gateway/src/workflow/steps/local/detect_backend.rs` — health-check ordering extended with tokenspeed (sketched below)
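A rough sketch of the detection idea (hypothetical simplified `detect_backend`; the probe order shown is illustrative, not necessarily the actual ordering):

```rust
#[derive(Debug, PartialEq)]
enum RuntimeType { Sglang, Vllm, Trtllm, TokenSpeed }

// Probe candidates in order; the first runtime whose health check answers
// determines the detected backend.
fn detect_backend(probe: impl Fn(&RuntimeType) -> bool) -> Option<RuntimeType> {
    [
        RuntimeType::Sglang,
        RuntimeType::Vllm,
        RuntimeType::Trtllm,
        RuntimeType::TokenSpeed,
    ]
    .into_iter()
    .find(|rt| probe(rt))
}

fn main() {
    // Pretend only a TokenSpeed worker answers its health check.
    let detected = detect_backend(|rt| matches!(rt, RuntimeType::TokenSpeed));
    assert_eq!(detected, Some(RuntimeType::TokenSpeed));
}
```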
Test Plan
- `cargo +nightly fmt --all -- --check` passes
- `cargo clippy -p smg-grpc-client --all-targets --all-features -- -D warnings` passes
- `cargo clippy -p smg --all-targets --all-features -- -D warnings` passes
- `cargo check -p smg --bin smg` succeeds against current `main`

End-to-end coverage runs in PR3 once both the Rust router and the Python servicer (PR2) are in place.
Checklist
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes