feat(grpc_servicer): add TokenSpeed servicer (Part 2/3) #1464
Conversation
Clean, thorough implementation. Reviewed all 11 files: servicer (Generate streaming/non-streaming/n>1, HealthCheck, Abort, GetModelInfo, GetServerInfo, GetLoads), health servicer, server lifecycle, scheduler launcher, CLI entrypoint, and 57 unit tests.
Key things verified:
- n>1 cancel sweep: CancelledError handler and Abort RPC both correctly walk the expanded `{rid}-n{i}` children, preventing orphaned GPU work
- Chat-template prefix strip: `_generated_output_ids` correctly slices to the last `completion_tokens` tokens, removing the Llama-3 assistant header that broke tool-call parsing
- Stop-token trim + no_stop_trim: Trailing matched stop is stripped from `output_ids` unless `no_stop_trim` is set; `matched_token_id` still rides in the proto field
- Logprob alignment: Cumulative-to-delta slicing in `_convert_output_logprobs_to_proto` correctly handles streaming chunks and stop-token-stripped frames
- HasField for optional scalars: `temperature=0` (greedy) is correctly forwarded via presence tracking rather than truthy checks
- Warmup lifecycle: Synchronous gRPC client on a daemon thread with proper channel cleanup; health stays NOT_SERVING until a complete frame is received
- Graceful shutdown: Drain loop with timeout, then `kill_process_tree(include_parent=False)` to reap scheduler children without self-terminating
No bugs, no security concerns, no silent fallbacks. LGTM.
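To illustrate the presence-tracking point in the list above, a minimal sketch, assuming a proto3 message with `optional` scalar fields (the `sampling_kwargs` helper and the field set here are hypothetical, not the PR's actual code):

```python
# Sketch: proto3 `optional` scalars expose HasField, so presence is checked
# explicitly and temperature=0.0 (greedy) is forwarded rather than dropped
# by a truthiness test like `if params.temperature:`.
def sampling_kwargs(params) -> dict:
    out = {}
    if params.HasField("temperature"):
        out["temperature"] = params.temperature  # 0.0 survives
    if params.HasField("top_p"):
        out["top_p"] = params.top_p
    return out
```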
Code Review
This pull request implements a gRPC servicer for the TokenSpeed inference engine, including health monitoring, subprocess management, and request handling for generation and metadata. Feedback identifies a need to handle zero completion tokens to prevent chat template prefix leakage and recommends updating the health servicer's Watch method to support server-streaming for full gRPC protocol compliance.
```python
if isinstance(completion, int) and 0 < completion <= len(raw):
    token_ids = raw[-completion:]
else:
    token_ids = raw
```
When completion_tokens is 0, the current logic falls back to returning the entire raw list of token IDs. Since raw often contains chat template prefix tokens, this fallback will leak those prefix tokens into the response. If completion_tokens is 0, an empty list should be returned. Additionally, ensure that in streaming token generation, the completion_tokens count is reported cumulatively for the entire request to ensure accurate progress reporting.
```diff
-if isinstance(completion, int) and 0 < completion <= len(raw):
-    token_ids = raw[-completion:]
+if isinstance(completion, int):
+    token_ids = raw[-completion:] if completion > 0 else []
 else:
     token_ids = raw
```
References
- In a streaming token generation API, response chunks should report a cumulative count of completion_tokens for the entire request, not just the tokens in the current chunk, to ensure accurate progress reporting.
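A sketch of what cumulative reporting could look like in a streaming loop (every name below is hypothetical, for illustration only):

```python
# Hypothetical streaming loop: completion_tokens is cumulative per request,
# while token_ids carries only the current chunk's delta.
total_completion = 0
async for chunk in engine_stream:
    total_completion += len(chunk.new_token_ids)
    yield GenerateResponse(
        token_ids=chunk.new_token_ids,
        completion_tokens=total_completion,  # running total, not chunk size
    )
```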
```python
async def Watch(
    self,
    request: health_pb2.HealthCheckRequest,
    context: grpc.aio.ServicerContext,
) -> AsyncIterator[health_pb2.HealthCheckResponse]:
    # K8s probes use Check, not Watch — we emit the current status once.
    yield await self.Check(request, context)
```
The Watch method implementation does not comply with the gRPC Health Checking Protocol (v1). The protocol requires Watch to be a server-streaming RPC that stays open and yields the current status whenever it changes. The current implementation yields once and then terminates the stream, which may cause issues with clients (like service meshes or load balancers) that rely on the streaming behavior of Watch to track backend health in real-time.
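For reference, grpcio's grpc-health-checking package ships a `HealthServicer` (including an asyncio variant) with a compliant `Watch`. If keeping a hand-rolled servicer, a minimal sketch of the streaming behavior could look like this, assuming the servicer keeps a `_statuses` dict and notifies an `asyncio.Condition` named `_status_changed` on every health transition (both names are hypothetical):

```python
# Sketch only: send the current status, then push an update on every change,
# keeping the stream open as the protocol requires.
async def Watch(
    self,
    request: health_pb2.HealthCheckRequest,
    context: grpc.aio.ServicerContext,
) -> AsyncIterator[health_pb2.HealthCheckResponse]:
    last_sent = None
    while not context.cancelled():
        status = self._statuses.get(
            request.service, health_pb2.HealthCheckResponse.SERVICE_UNKNOWN
        )
        if status != last_sent:
            yield health_pb2.HealthCheckResponse(status=status)
            last_sent = status
        async with self._status_changed:       # hypothetical asyncio.Condition
            await self._status_changed.wait()  # block until the next transition
```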
Force-pushed 52cda9d to c3c3c02
```python
finish_reason = "stop"
matched_kwargs: dict[str, Any] = {}
if reason_dict:
    kind = reason_dict.get("type")
    if kind == "length":
        finish_reason = "length"
    elif kind == "abort":
        finish_reason = "abort"
```
🟡 Nit: The finish_reason mapping is an incomplete allowlist — only "length" and "abort" are recognized; every other type (including any future TokenSpeed additions like "cancelled") silently falls back to "stop". This means the gRPC path would silently misreport a new finish reason while the HTTP path handles it correctly, creating a subtle divergence between the two serving paths.
Consider logging a warning for unrecognized types so this doesn't fail silently:
```diff
 finish_reason = "stop"
 matched_kwargs: dict[str, Any] = {}
 if reason_dict:
     kind = reason_dict.get("type")
     if kind == "length":
         finish_reason = "length"
     elif kind == "abort":
         finish_reason = "abort"
+    elif kind and kind != "stop":
+        logger.warning("Unrecognized finish_reason type %r; defaulting to 'stop'", kind)
```
```python
    return reason
to_json = getattr(reason, "to_json", None)
if callable(to_json):
    result = to_json()
```
🟡 Nit: Removing the try/except wrapper around to_json() changes error-routing semantics. The caller at line 191 catches ValueError and maps it to StatusCode.INVALID_ARGUMENT (user input error). If to_json() internally raises a ValueError, it will now be misclassified as bad user input rather than an internal server error. The previous code deliberately wrapped all to_json() failures in TypeError to guarantee they'd fall through to except Exception → StatusCode.INTERNAL.
The risk is low (a well-behaved to_json() shouldn't raise ValueError), but the original wrapper existed specifically to defend against this mismatch.
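A sketch of the wrapper being described, assuming the goal is to keep internal `to_json()` failures out of the caller's `except ValueError` / INVALID_ARGUMENT path:

```python
# Re-wrap any to_json() failure as TypeError so it can never be mistaken
# for a user-input ValueError (which the caller maps to INVALID_ARGUMENT);
# TypeError falls through to `except Exception` -> StatusCode.INTERNAL.
to_json = getattr(reason, "to_json", None)
if callable(to_json):
    try:
        result = to_json()
    except Exception as exc:
        raise TypeError(f"to_json() failed on {type(reason).__name__}") from exc
```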
Force-pushed 788933a to d16d38f
Force-pushed 8057d10 to 656f1c2
Force-pushed d16d38f to 3f8983a
```python
Mirrors smg_grpc_servicer.vllm / smg_grpc_servicer.sglang. Wraps TokenSpeed's
AsyncLLM (main-process async frontend) behind the SGLang gRPC service so the
existing Rust router (which auto-detects the SGLang proto) can route traffic
to TokenSpeed without needing a new client.
"""
```
🟡 Nit: This docstring is stale — it describes the opposite of what the implementation does. The servicer does NOT wrap behind "the SGLang gRPC service"; it uses its own tokenspeed.grpc.scheduler.TokenSpeedScheduler proto. The Rust router does NOT "auto-detect the SGLang proto"; DetectBackendStep identifies TokenSpeed natively from the service name. And there IS a new Rust client (TokenSpeedSchedulerClient).
```diff
-Mirrors smg_grpc_servicer.vllm / smg_grpc_servicer.sglang. Wraps TokenSpeed's
-AsyncLLM (main-process async frontend) behind the SGLang gRPC service so the
-existing Rust router (which auto-detects the SGLang proto) can route traffic
-to TokenSpeed without needing a new client.
-"""
+"""TokenSpeed gRPC servicer implementation.
+
+Exposes TokenSpeed's AsyncLLM over the dedicated
+``tokenspeed.grpc.scheduler.TokenSpeedScheduler`` gRPC service.
+The Rust gateway's ``DetectBackendStep`` identifies TokenSpeed workers
+natively from the service name.
+"""
```
Force-pushed 3f8983a to 2ecbbb9
Adds the Python servicer that runs alongside a TokenSpeed scheduler
process and serves the gRPC protocol PR1 introduced. Includes:
- the async scheduler servicer (Generate/HealthCheck/Abort/
GetModelInfo/GetServerInfo/GetLoads), with cancellation handling
for streaming, non-streaming, channel-close, and n>1 paths
- a health-service bridge that flips SERVING/NOT_SERVING based on
scheduler liveness (deep probe with bounded staleness)
- a scheduler launcher that boots TokenSpeed's AsyncLLM in-process
- the ``python -m smg_grpc_servicer.tokenspeed`` entrypoint
- real ``GetLoads`` plumbing backed by ``AsyncLLM.get_load()`` so
router-side load balancing reflects scheduler-side metrics
- 57 unit tests covering the servicer, health service, proto
conversion, finish reasons, sampling params, streaming/non-
streaming generation, abort/cancel (incl. n>1), model/server
info, and load metrics
This is part 2 of 3 splitting #1351:
- PR1: Rust gRPC + protocol (merged first)
- PR2 (this): Python servicer + unit tests
- PR3: CI workflows + e2e tests
Stacked on PR1 — the servicer imports the proto stubs PR1 generates
from ``crates/grpc_client/proto/tokenspeed_scheduler.proto``.
Fixes a 🔴 critical from review on #1351:
- FakeAsyncLLM.generate_request crashed with
``TypeError: unhashable type: 'list'`` when n>1, because
``_build_generate_req`` rewrites ``rid`` to a list of per-choice
ids. The fake engine now registers state for each child rid, so
``test_cancel_aborts_all_n_children`` exercises the cancel sweep
instead of dying at setup.
Signed-off-by: key4ng <[email protected]>
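A sketch of the fake-engine fix described above (the `_states` dict and the method body are hypothetical illustrations of the approach, not the test's literal code):

```python
# FakeAsyncLLM.generate_request: rid arrives as a list of per-choice ids
# when n>1, so register state under each child rid instead of using the
# list (unhashable) as a dict key.
def generate_request(self, req):
    rids = req.rid if isinstance(req.rid, list) else [req.rid]
    for child_rid in rids:            # one {rid}-n{i} entry per choice
        self._states[child_rid] = "RUNNING"
```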
Trim ~13 lines: collapse the early-returns into a single conditional, drop
the inner ``try/except`` around ``to_json()`` (propagating the original
exception is more useful than wrapping it), and shorten the docstring.
Behavior is unchanged — the same shapes accepted, the same TypeError raised
on unknown shapes.

Signed-off-by: key4ng <[email protected]>
Force-pushed 2ecbbb9 to 6bb18d2
```python
    load_outputs = await asyncio.wait_for(
        self.async_llm.get_load(), timeout=HEALTH_CHECK_TIMEOUT
    )
except TimeoutError:
```
🔴 Important: except TimeoutError catches builtins.TimeoutError (subclass of OSError), but asyncio.wait_for raises asyncio.TimeoutError which on Python 3.10 is a separate class inheriting from Exception, not from builtins.TimeoutError. Since pyproject.toml declares requires-python = ">=3.10", this handler is dead code on 3.10 — the timeout falls through to the except Exception block below and reports StatusCode.INTERNAL instead of DEADLINE_EXCEEDED.
asyncio.TimeoutError became an alias of builtins.TimeoutError only in Python 3.11 (bpo-45098).
```diff
-except TimeoutError:
+except (TimeoutError, asyncio.TimeoutError):
```
This catches both the builtin and the asyncio variant, working correctly on 3.10+. On 3.11+ it's redundant but harmless.
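A quick self-contained check of the 3.10 vs 3.11 behavior described above:

```python
import asyncio
import sys

# asyncio.TimeoutError became an alias of builtins.TimeoutError in 3.11 (bpo-45098).
if sys.version_info >= (3, 11):
    assert asyncio.TimeoutError is TimeoutError
else:
    assert asyncio.TimeoutError is not TimeoutError  # distinct class on 3.10
```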
Upstream tokenspeed renamed the launch flag from ``--model-path`` to
``--model``. Update the docstring example so copy-paste still works.

Signed-off-by: key4ng <[email protected]>
Upstream lightseekorg/tokenspeed renamed the model + tokenizer
``ServerArgs`` fields alongside the matching CLI flag renames:
- ``ServerArgs.model_path`` → ``ServerArgs.model``
- ``ServerArgs.tokenizer_path`` → ``ServerArgs.tokenizer``
Both are sources of fields in ``GetModelInfo``, so post-bump that RPC
fails with:
AttributeError: 'ServerArgs' object has no attribute 'model_path'
AttributeError: 'ServerArgs' object has no attribute 'tokenizer_path'
Pick whichever attribute is populated so the servicer works against
both old and new tokenspeed pins:
model_path = getattr(self.server_args, "model", None) or getattr(
self.server_args, "model_path", ""
)
tokenizer_path = getattr(self.server_args, "tokenizer", None) or getattr(
self.server_args, "tokenizer_path", ""
)
The proto fields stay named ``model_path`` / ``tokenizer_path`` because
those are the on-wire contracts the router consumes. 57/57 unit tests
still pass.
Signed-off-by: key4ng <[email protected]>
Force-pushed 8583d04 to a812f5c
… models

When tokenspeed runs with a reasoning parser that has an xgrammar template
(e.g. ``gpt-oss`` → ``harmony``), forwarding a raw JSON-schema constraint
causes xgrammar to fight the Harmony channel preamble
(``<|channel|>analysis<|message|>…``): the model either generates garbage or
stalls until ``max_tokens``, leaving ``content`` empty.

Mirror tokenspeed's HTTP entrypoint (``serving_chat.py``): when a
``reasoning_parser`` is configured, wrap the user's JSON schema via
``structural_tag_for_reasoning_json_schema()`` so the grammar only activates
inside the response channel. Parsers without an xgrammar mapping fall back
to the raw json_schema unchanged.

Plumbs ``reasoning_parser`` into ``_sampling_params_from_proto`` as a
keyword-only argument so the helper stays a static method and existing tests
keep passing without modification. The new import of
``tokenspeed.runtime.grammar.reasoning_structural_tag`` is wrapped in
``try/except ImportError`` so stale tokenspeed pins fall back to raw
json_schema rather than crashing.

Signed-off-by: key4ng <[email protected]>
```python
    wrapped = structural_tag_for_reasoning_json_schema(
        reasoning_parser, json.loads(params.json_schema)
    )
except ImportError:
    wrapped = None
```
🟡 Nit: json.loads(params.json_schema) can raise json.JSONDecodeError (a ValueError subclass) if the client sends a malformed schema string, but the except only catches ImportError. This means malformed JSON blows up here when a reasoning parser is configured, while without a parser the same bad string silently passes through as out["json_schema"].
The inconsistency is minor (the caller's except ValueError handler would produce a reasonable INVALID_ARGUMENT gRPC status), but catching JSONDecodeError alongside ImportError would make the fallback path uniform:
```diff
     wrapped = structural_tag_for_reasoning_json_schema(
         reasoning_parser, json.loads(params.json_schema)
     )
-except ImportError:
+except (ImportError, json.JSONDecodeError):
     wrapped = None
```
Description
Problem
PR #1351's Rust router (Part 1, on `feat/grpc-servicer-tokenspeed`) can dial a TokenSpeed worker over the gRPC protocol it defines, but no worker speaks that protocol. We need a Python servicer that runs alongside a TokenSpeed scheduler process and serves the wire types defined in Part 1.

Solution
A self-contained TokenSpeed servicer module under `grpc_servicer/smg_grpc_servicer/tokenspeed/`, with cancellation handling for streaming/non-streaming, channel-close, and `n>1` paths, plus 57 unit tests.

3-PR Stack
This is part 2 of 3 splitting the original #1351:

`main` → `feat/grpc-servicer-tokenspeed` (= PR1) → `feat/grpc-tokenspeed-servicer`

Changes
- `grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` — async scheduler servicer (Generate / HealthCheck / Abort / GetModelInfo / GetServerInfo / GetLoads), with cancellation that sweeps every `{rid}-n{i}` child rid expanded by `n>1`
- `grpc_servicer/smg_grpc_servicer/tokenspeed/health_servicer.py` — health-service bridge that flips SERVING / NOT_SERVING based on bounded-staleness scheduler liveness probes
- `grpc_servicer/smg_grpc_servicer/tokenspeed/scheduler_launcher.py` — boots TokenSpeed `AsyncLLM` in-process
- `grpc_servicer/smg_grpc_servicer/tokenspeed/server.py` and `__main__.py` — `python -m smg_grpc_servicer.tokenspeed` entrypoint
- `GetLoads` returns real `AsyncLLM.get_load()` metrics (was a stub returning zeros)
- `grpc_servicer/tests/test_tokenspeed_*.py` — 57 unit tests covering proto conversion, finish reasons, sampling params, streaming/non-streaming, abort/cancel (incl. `n>1`), model/server info, and load metrics

Test Plan
- `pytest grpc_servicer/tests/ -v` → 57 passed in 1.47s, including:
  - `test_cancel_calls_abort_request` (n=1 cancel path)
  - `test_cancel_aborts_all_n_children` (n>1 cancel sweep)
  - `test_abort_sweeps_n_children` (Abort RPC anchored regex)

Review Threads from #1351
Addressed in this PR:
- `grpc_servicer/tests/conftest.py` — kept tests with the code under test rather than moving to a follow-up; if you'd still prefer the tests in a separate PR, happy to peel them out.
- `tests/test_tokenspeed_servicer.py` — `FakeAsyncLLM.generate_request` previously crashed with `TypeError: unhashable type: 'list'` for `n>1` because `_build_generate_req` rewrites `rid` to a list of per-choice ids. The fake engine now registers state for each child rid, so `test_cancel_aborts_all_n_children` actually exercises the cancel sweep.

Checklist
- `cargo +nightly fmt` passes (no Rust changes)
- `cargo clippy --all-targets --all-features -- -D warnings` passes (no Rust changes)