ci(tokenspeed): add CI install + GPU e2e coverage (Part 3/3) #1465
key4ng wants to merge 24 commits into
Conversation
Adds TokenSpeed as a first-class GPU backend on the Rust side: a self-contained `tokenspeed.grpc.scheduler.TokenSpeedScheduler` proto, the `TokenSpeedSchedulerClient` wrapper that translates SGLang-shaped request/response types at the boundary, and the model_gateway router plumbing (client dispatch, runtime detection, harmony/regular request builders, multimodal and embedding stages).

This is part 1 of 3 splitting #1351:
- PR1 (this): Rust gRPC + protocol
- PR2: Python servicer (grpc_servicer)
- PR3: CI workflows + e2e tests

PR1 is functionally inert without PR2 — the router can dial a TokenSpeed worker, but the worker process lives in the Python servicer landed by PR2.

Addresses CatherineSue's review on #1351:
- shorten the TokenSpeed RuntimeType doc in protocols/worker.rs
- drop the verbose TokenSpeed note in grpc_client/build.rs
- restore the concise module doc in detect_backend.rs, just adding tokenspeed to the existing health-check ordering

Signed-off-by: key4ng <[email protected]>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the repository's CodeRabbit settings.
📝 Walkthrough

This PR integrates TokenSpeed as a new gRPC inference backend across protocol definitions, Rust and Python clients/servicers, model gateway routing, E2E test infrastructure, and CI workflows. It adds a full TokenSpeed proto, shared sampling-params builders, a Rust TokenSpeed client, a Python TokenSpeed servicer/server/health, model-gateway plumbing and proto-wrapper variants, CI install/workflow steps, and broad E2E test coverage.

Changes

TokenSpeed Backend Support
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
Suggested reviewers
Code Review
This pull request integrates the TokenSpeed backend into the end-to-end testing suite. Key changes include the addition of a GitHub Action and a comprehensive installation script for TokenSpeed, infrastructure updates to support the new engine in the worker and constants, and the inclusion of TokenSpeed in numerous test suites. Additionally, the pytest_runtest_setup hook was improved to correctly handle multiple runtime skip markers. Feedback was provided to fix the caching logic in the installation script, as the current implementation fails to save the compiled kernel wheel, which would result in redundant 30-minute builds during CI.
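The hook improvement mentioned above relies on collecting every stacked marker rather than only the closest one. A minimal sketch of that pattern, assuming the suite's `engine` marker and the `E2E_ENGINE` selector variable referenced later in this thread; the default engine and the skip message are illustrative, not the repo's actual hook:

```python
import os

import pytest


def pytest_runtest_setup(item):
    # Collect *all* stacked @pytest.mark.engine(...) marks; get_closest_marker()
    # returns only the innermost one, which is the bug class described above.
    allowed = set()
    for marker in item.iter_markers(name="engine"):
        allowed.update(marker.args)
    engine = os.environ.get("E2E_ENGINE", "sglang")
    if allowed and engine not in allowed:
        pytest.skip(f"test not marked for engine {engine!r}")
```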
| echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..." | ||
| MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \ | ||
| uv pip install tokenspeed-kernel/python/ --no-build-isolation | ||
| # Cache the built wheel — uv stores wheels under its cache, copy out. | ||
| mkdir -p "$WHEEL_CACHE" | ||
| python3 -c "import tokenspeed_kernel, os, shutil, glob; \ | ||
| d = os.path.dirname(tokenspeed_kernel.__file__); \ | ||
| site = os.path.dirname(d); \ | ||
| whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \ | ||
| print('kernel install dir:', whls)" || true | ||
| fi |
The caching logic for tokenspeed-kernel is non-functional. The else block builds the kernel from source but fails to save the resulting wheel to $WHEEL_CACHE. Additionally, the Python snippet on lines 147-151 only prints the installation path instead of copying the built artifacts. This means the expensive ~30-minute compilation will occur on every CI run. Using uv pip wheel to generate the wheel directly into the cache directory is a more robust approach and ensures consistency with the use of uv in CI scripts.
| echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..." | |
| MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \ | |
| uv pip install tokenspeed-kernel/python/ --no-build-isolation | |
| # Cache the built wheel — uv stores wheels under its cache, copy out. | |
| mkdir -p "$WHEEL_CACHE" | |
| python3 -c "import tokenspeed_kernel, os, shutil, glob; \ | |
| d = os.path.dirname(tokenspeed_kernel.__file__); \ | |
| site = os.path.dirname(d); \ | |
| whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \ | |
| print('kernel install dir:', whls)" || true | |
| fi | |
| echo "Building tokenspeed-kernel from source (this takes ~30 min the first time)..." | |
| mkdir -p "$WHEEL_CACHE" | |
| MAX_JOBS="${MAX_JOBS:-16}" FLASHINFER_CUDA_ARCH_LIST="9.0a 10.0a" \ | |
| uv pip wheel tokenspeed-kernel/python/ -w "$WHEEL_CACHE" --no-build-isolation | |
| uv pip install "$WHEEL_CACHE"/tokenspeed_kernel-*.whl --no-build-isolation | |
| fi |
References
- When installing Python packages in CI scripts, use `uv pip install` for consistency and performance if `uv` is already in use for other installations.
```bash
d = os.path.dirname(tokenspeed_kernel.__file__); \
site = os.path.dirname(d); \
whls = glob.glob(os.path.join(site, 'tokenspeed_kernel-*.dist-info')); \
print('kernel install dir:', whls)" || true
fi

# Step 4: scheduler (scikit-build-core + nanobind + CMake).
echo "Building tokenspeed-scheduler..."
uv pip install tokenspeed-scheduler/

# Step 5: the Python runtime (pure-Python).
uv pip install "./python" --no-build-isolation

# ── Persist env to subsequent CI steps ─────────────────────────────────────
if [ -n "${GITHUB_ENV:-}" ]; then
  echo "CUDA_HOME=$CUDA_HOME" >> "$GITHUB_ENV"
```
🟡 Nit: The wheel-cache write path is non-functional. After building from source, it creates $WHEEL_CACHE and runs a Python snippet that prints the dist-info directory but never copies a .whl file into $WHEEL_CACHE. The cache-read path (lines 142-146) looks for tokenspeed_kernel-*.whl in that directory, so it will never find anything — every run rebuilds from source (~30 min).
Either:
- Actually locate and copy the built wheel into `$WHEEL_CACHE`, or
- Remove the dead caching code and document that every run is a fresh build (the `|| true` at the end already swallows failures, so this isn't a correctness bug, just a misleading comment + wasted `mkdir`).
```bash
cd "$TOKENSPEED_DIR"

# ── System dependencies (mirrors docker/Dockerfile) ─────────────────────────
```
🟡 Nit: This echo prints $TOKENSPEED_REPO which, after line 63, contains the embedded access token (https://x-access-token:<token>@github.com/...). GitHub Actions redacts known secret values in logs, so this is likely safe in practice, but it's still better hygiene to avoid echoing credential-bearing URLs. Consider printing the repo without the auth portion:
Suggested change:

```diff
 # ── System dependencies (mirrors docker/Dockerfile) ─────────────────────────
+echo "Cloning TokenSpeed ${TOKENSPEED_REF} from ${TOKENSPEED_REPO%%@*}@..."
```
| @pytest.mark.engine("sglang", "vllm", "trtllm", "tokenspeed") | ||
| @pytest.mark.gpu(1) | ||
| @pytest.mark.model("Qwen/Qwen3-4B-Instruct-2507") |
🟡 Nit: This changes TestMultiTurnToolCall from Qwen2.5-14B (2 GPUs) to Qwen3-4B (1 GPU) for all engines, not just TokenSpeed. The 14B model was presumably chosen for reliable multi-turn tool-calling behavior — a 4B model may be flakier at following the tool-call/tool-result conversation pattern.
If the intent is just to add TokenSpeed coverage (which needs a Qwen3 model), consider keeping the existing 14B config for sglang/vllm/trtllm and adding a separate TestMultiTurnToolCallTokenSpeed class with the 4B model, or at minimum confirming the 4B model passes the existing multi-turn tests reliably on the other engines before merging.
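One way to implement the split the reviewer suggests, using the file's existing marker conventions; the shared-checks mixin and its placeholder test body are illustrative, not the suite's real structure:

```python
import pytest


class MultiTurnToolCallChecks:
    """Shared assertions (placeholder; the real tests live in the suite)."""

    def test_multi_turn_tool_call(self):
        ...


@pytest.mark.engine("sglang", "vllm", "trtllm")
@pytest.mark.gpu(2)
@pytest.mark.model("Qwen/Qwen2.5-14B-Instruct")
class TestMultiTurnToolCall(MultiTurnToolCallChecks):
    """Existing coverage, unchanged for the reference engines."""


@pytest.mark.engine("tokenspeed")
@pytest.mark.gpu(1)
@pytest.mark.model("Qwen/Qwen3-4B-Instruct-2507")
class TestMultiTurnToolCallTokenSpeed(MultiTurnToolCallChecks):
    """Same checks against the Qwen3 model TokenSpeed's registry supports."""
```

Keeping the markers off the shared mixin matters: pytest marks are inherited, so a marked subclass of a marked base would otherwise match both engine sets when the hook unions every `engine` marker.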
Solid CI + e2e wiring for TokenSpeed. The iter_markers fix in hooks.py is a real bug fix — nice catch.
Summary: 0 🔴 Important · 3 🟡 Nit · 0 🟣 Pre-existing
Nits posted inline:
- Wheel cache is non-functional (`ci_install_tokenspeed.sh`): the cache-save path prints debug info but never writes a `.whl` to `$WHEEL_CACHE`, so every run rebuilds from source.
- Credential in echo (`ci_install_tokenspeed.sh:123`): `echo` prints `$TOKENSPEED_REPO` after the token has been embedded in it. GitHub Actions redacts secrets, but better to strip the auth portion.
- Model downgrade for all engines (`test_function_calling.py`): `TestMultiTurnToolCall` moved from Qwen2.5-14B (2 GPU) → Qwen3-4B (1 GPU) for all engines, not just TokenSpeed — worth confirming the smaller model is reliable for multi-turn tool calling on sglang/vllm/trtllm.
Also noted (couldn't post inline — line not in diff): the engine marker help text in hooks.py:28 still says (sglang, vllm, trtllm) without tokenspeed.
…dule

The 5 ``build_*_sampling_params_from_*`` helpers (chat / responses / messages / completion / plain) plus their constraint helpers were sitting on ``SglangSchedulerClient`` even though the OpenAI mapping is backend-neutral. TokenSpeed's client was reaching across to call them through ``SglangSchedulerClient::*``, which suggested SGLang owned the OpenAI→sampling translation, when it doesn't.

Move them to ``crate::sampling_params`` as free functions. Both the SGLang and TokenSpeed clients (and any future client that wants the same mapping) now call ``crate::sampling_params::build_*`` directly.

The return type is still ``sglang::SamplingParams`` because that proto happens to be the most permissive shape across our supported backends; TokenSpeed translates to its own slimmer shape at the wire boundary. When a backend grows a sampling field SGLang lacks, this is the place to add it.

No behavior change. Tests stay green; the call sites in ``build_generate_request_from_*`` are mechanically updated.

Signed-off-by: key4ng <[email protected]>
…ation)
TokenSpeed previously rode the SGLang IR arms in proto_wrapper.rs:
``TokenSpeedSchedulerClient::generate()`` accepted ``sglang::GenerateRequest``,
the streaming response was translated back into ``sglang::GenerateResponse``
at the wire boundary, and the router dispatched through
``ProtoGenerateRequest::Sglang``. This let TokenSpeed reuse the existing
match arms but coupled it to SGLang's evolving schema — every SGLang field
addition forced a TokenSpeed translator stub (most recently
``default_sampling_params_json``).
Add native TokenSpeed arms to the router IR:
- ``ProtoGenerateRequest::TokenSpeed(Box<tokenspeed::GenerateRequest>)``
- ``ProtoGenerateResponse::TokenSpeed(Box<tokenspeed::GenerateResponse>)``
- ``ProtoGenerateStreamChunk::TokenSpeed(tokenspeed::GenerateStreamChunk)``
- ``ProtoGenerateComplete::TokenSpeed(tokenspeed::GenerateComplete)``
…with ``as_tokenspeed`` / ``as_tokenspeed_mut`` / ``is_tokenspeed`` accessors
mirroring the existing per-backend pattern (Mlx / Vllm / Trtllm) and
``TokenSpeed`` arms in every aggregator method (``token_ids``, ``index``,
``output_logprobs``, ``input_logprobs``, ``prompt_tokens``,
``completion_tokens``, ``cached_tokens``, ``finish_reason``,
``matched_stop_json``, ``output_ids``, ``set_stream``, ``request_id``,
``set_max_tokens_for_prefill``, ``clear_mm_inputs``,
``set_kv_transfer_params``, ``kv_transfer_params``).
Client-side:
- ``TokenSpeedSchedulerClient::generate()`` now takes
``tokenspeed_proto::GenerateRequest``
- ``AbortOnDropStream::Item`` is ``tokenspeed_proto::GenerateResponse``
- The 5 ``build_*_request`` builders return native ``tokenspeed_proto``
types
- ``translate::generate_request`` / ``generate_response`` /
``stream_chunk`` / ``complete`` / ``output_logprobs`` are gone — the
only translation kept is ``translate::sampling_params`` (a thin field
map) plus the unary RPC adapters (``model_info`` / ``server_info`` /
``loads``), which still produce SGLang shapes because the router's
``ModelInfo`` / ``ServerInfo`` enums consume those — that's a
separate cleanup.
Router-side:
- ``client.rs::generate()`` dispatch arm now matches
``(Self::TokenSpeed(_), ProtoGenerateRequest::TokenSpeed(_))``.
- The 5 ``build_*_request`` paths in ``GrpcClient`` wrap into
``ProtoGenerateRequest::TokenSpeed`` instead of ``::Sglang``.
- ``harmony/stages/request_building.rs`` builds
``ProtoGenerateRequest::TokenSpeed`` and grew a ``TokenSpeed`` arm
in the Harmony stop-token injection match.
- ``common/stages/helpers.rs::apply_sampling_defaults_to_generate_request``
early-returns for TokenSpeed (alongside Trtllm) since neither
backend plumbs sampling defaults through today; the explicit arm
keeps the match exhaustive.
PD-disagg paths (``response_collection.rs``) remain SGLang-keyed — the
``if let ProtoGenerateComplete::Sglang(...)`` checks simply won't match
TokenSpeed responses, which is the correct behavior since TokenSpeed
doesn't ship PD-disaggregation.
Verification:
- ``cargo +nightly fmt --all -- --check`` passes
- ``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`` passes
- ``cargo check -p smg --bin smg`` passes
This addresses the architectural concern raised on #1351: SGLang's proto
shouldn't be the de-facto router IR. Each backend now has its own arm,
matching how vLLM / MLX / TRT-LLM are integrated.
Signed-off-by: key4ng <[email protected]>
Force-pushed 52cda9d to c3c3c02.
Force-pushed 0234f43 to 5297fbe.
…ection
The chat-template tool-shape pre-processor was correct for Kimi-K2.5
(BFCL accuracy +6 pp on simple_python, +24 pp on parallel_multiple)
but breaks Mistral chat templates: their template at line 32 iterates
``tool.function.parameters.properties.items()``, which raises
``unknown method: undefined has no method named items`` once we unwrap
``{"type": "function", "function": {...}}`` into the bare inner dict.
The shape a chat template expects is template-dependent, not
engine-dependent. Reverting the unconditional unwrap; full rationale,
accuracy data, and proposed per-model fix in
docs/proposals/2026-05-09-deferred-chat-template-tools-strip.md.
Affected CI lane: e2e_test/chat_completions/test_function_calling.py::TestToolChoiceMistral
(20 tests failing with chat:32 render error).
Signed-off-by: key4ng <[email protected]>
Cuts ~125 lines of doc-comment / inline-rationale prose without losing information. Hot spots:

- tokenspeed_scheduler.rs module doc (28 → 7 lines)
- tokenspeed_scheduler.proto service header + SamplingParams comment + per-field rationale (~40 lines)
- sampling_params.rs module doc (14 → 4 lines)
- translate::sampling_params explainer (12 → 3 lines)
- inline arm comments in client.rs / multimodal.rs / harmony/stages / common/stages that just re-stated what the code already shows

Behavior unchanged. ``cargo +nightly fmt --check`` passes; ``cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings`` passes.

Signed-off-by: key4ng <[email protected]>
Force-pushed 21ef827 to 788933a.
Force-pushed 5297fbe to 71a7883.
Force-pushed 788933a to d16d38f.
Force-pushed 71a7883 to 9432064.
Tightens the comments introduced (or modified) by this PR to describe behavior directly. No behavior change; literal type / path references left intact.

Files touched:
- crates/grpc_client/build.rs
- crates/grpc_client/proto/tokenspeed_scheduler.proto
- crates/grpc_client/src/sampling_params.rs
- crates/grpc_client/src/tokenspeed_scheduler.rs
- model_gateway/src/routers/grpc/client.rs
- model_gateway/src/routers/grpc/proto_wrapper.rs

Signed-off-by: key4ng <[email protected]>
Force-pushed d16d38f to 3f8983a.
Force-pushed 9432064 to ace5e6d.
The IR refactor split TokenSpeed off the SGLang request arm, which left two gaps in the sampling-defaults path:

1. `translate::model_info` populated the `preferred_sampling_params` label slot from TokenSpeed's response but left `default_sampling_params_json` empty. Worker discovery reads only the latter, so model-published defaults never reached the label map and were invisible to the request-stage injector.
2. `apply_sampling_defaults_to_generate_request` early-returned for `ProtoGenerateRequest::TokenSpeed(_)`. Even if a worker's labels carried defaults, the injector skipped the TokenSpeed arm — so a model publishing `temperature=0.7` in its generation config would apply to the other engines but TokenSpeed would fall through to the hardcoded 1.0 in the request builder.

Both fixed:
- `tokenspeed_scheduler.rs`: surface `preferred_sampling_params` in both `preferred_sampling_params` and `default_sampling_params_json` slots so the discovery path picks it up.
- `helpers.rs`: drop TokenSpeed from the early-return, add a `TokenSpeed(req)` match arm calling a new `apply_tokenspeed_sampling_defaults`. TokenSpeed's wire declares every sampling scalar as `optional`, so the helper writes `Some(value)` rather than the bare value — preserving the set-vs-unset distinction the servicer's `HasField()` checks rely on.

Verification: `cargo +nightly fmt --all -- --check` and `cargo clippy -p smg-grpc-client -p smg --all-targets --all-features -- -D warnings` both pass.

Signed-off-by: key4ng <[email protected]>
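A small demonstration of the set-vs-unset distinction this commit preserves, assuming the generated Python stubs (the import path matches the one quoted elsewhere in this thread; the exact message and field names are illustrative):

```python
# With a proto3 field declared `optional float temperature`, HasField()
# distinguishes "router injected a default" from "field never written".
from smg_grpc_proto.generated import tokenspeed_scheduler_pb2 as pb2

params = pb2.SamplingParams()
assert not params.HasField("temperature")  # unset: servicer may apply its own default

params.temperature = 0.7  # router wrote Some(0.7), the model-published default
assert params.HasField("temperature")      # set: servicer honors the value as-is
```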
Force-pushed 3f8983a to 2ecbbb9.
Force-pushed ace5e6d to 243c1b5.
- Collapse the duplicate 2-line crate-level docstring into a single neutral line covering all supported backends.
- Drop the 5-line TokenSpeed re-export rationale block; the design rationale already lives in `tokenspeed_scheduler.rs`'s module doc.

Signed-off-by: key4ng <[email protected]>
Adds the Python servicer that runs alongside a TokenSpeed scheduler
process and serves the gRPC protocol PR1 introduced. Includes:
- the async scheduler servicer (Generate/HealthCheck/Abort/
GetModelInfo/GetServerInfo/GetLoads), with cancellation handling
for streaming, non-streaming, channel-close, and n>1 paths
- a health-service bridge that flips SERVING/NOT_SERVING based on
scheduler liveness (deep probe with bounded staleness)
- a scheduler launcher that boots TokenSpeed's AsyncLLM in-process
- the ``python -m smg_grpc_servicer.tokenspeed`` entrypoint
- real ``GetLoads`` plumbing backed by ``AsyncLLM.get_load()`` so
router-side load balancing reflects scheduler-side metrics
- 57 unit tests covering the servicer, health service, proto
conversion, finish reasons, sampling params, streaming/non-
streaming generation, abort/cancel (incl. n>1), model/server
info, and load metrics
This is part 2 of 3 splitting #1351:
- PR1: Rust gRPC + protocol (merged first)
- PR2 (this): Python servicer + unit tests
- PR3: CI workflows + e2e tests
Stacked on PR1 — the servicer imports the proto stubs PR1 generates
from ``crates/grpc_client/proto/tokenspeed_scheduler.proto``.
Fixes a 🔴 critical from review on #1351:
- FakeAsyncLLM.generate_request crashed with
``TypeError: unhashable type: 'list'`` when n>1, because
``_build_generate_req`` rewrites ``rid`` to a list of per-choice
ids. The fake engine now registers state for each child rid, so
``test_cancel_aborts_all_n_children`` exercises the cancel sweep
instead of dying at setup.
Signed-off-by: key4ng <[email protected]>
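A minimal sketch of the health-bridge idea this commit describes, built on the standard `grpcio-health-checking` package; the class shape, probe hook, and staleness bound are assumptions, not the servicer's actual implementation:

```python
import time

from grpc_health.v1 import health, health_pb2


class SchedulerHealthBridge:
    """Flip gRPC health status off scheduler liveness, with bounded staleness."""

    def __init__(self, servicer: health.HealthServicer, max_staleness_s: float = 5.0):
        self._servicer = servicer
        self._max_staleness_s = max_staleness_s
        self._last_probe_ok = float("-inf")

    def record_probe_ok(self) -> None:
        # Called after each successful deep probe of the scheduler.
        self._last_probe_ok = time.monotonic()

    def refresh(self) -> None:
        # SERVING only while the last successful probe is fresh enough.
        fresh = (time.monotonic() - self._last_probe_ok) < self._max_staleness_s
        self._servicer.set(
            "",  # empty service name = overall server health
            health_pb2.HealthCheckResponse.SERVING
            if fresh
            else health_pb2.HealthCheckResponse.NOT_SERVING,
        )
```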
Force-pushed bf462f9 to d79af1d.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c57a3ec22
| "--tensor-parallel-size", | ||
| str(tp_size), |
Use TokenSpeed's parallelism flag
When the TokenSpeed e2e worker is launched, this argument is passed through to prepare_server_args, but the pinned TokenSpeed ServerArgs.add_cli_args does not define --tensor-parallel-size; its parallelism flags are --world-size, --attn-tp-size, and --nprocs-per-node (and allow_abbrev is disabled). Any TokenSpeed test worker therefore exits during argument parsing before it can become healthy, so the new TokenSpeed GPU lanes never exercise the servicer.
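A hypothetical version of the fix Codex is pointing at, mapping the suite's `tp_size` onto the flag TokenSpeed's parser defines; the function name and the `--port` flag are assumptions, while `--model` and the module entrypoint come from commits in this thread:

```python
def build_tokenspeed_grpc_cmd(model_path: str, port: int, tp_size: int) -> list[str]:
    # TokenSpeed's ServerArgs.add_cli_args defines --world-size (not
    # --tensor-parallel-size), and upstream renamed --model-path to --model.
    return [
        "python",
        "-m",
        "smg_grpc_servicer.tokenspeed",
        "--model",
        model_path,
        "--port",
        str(port),
        "--world-size",
        str(tp_size),
    ]
```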
| description = "SMG gRPC servicer implementations for LLM inference engines (vLLM, SGLang, MLX, TokenSpeed)" | ||
| requires-python = ">=3.10" | ||
| dependencies = [ | ||
| "smg-grpc-proto>=0.4.6", |
🔴 Important: The proto version floor >=0.4.6 is too low. This PR adds tokenspeed_scheduler_pb2 / tokenspeed_scheduler_pb2_grpc to smg-grpc-proto, but the proto package version stays at 0.4.7 (same as main). If someone already has smg-grpc-proto==0.4.7 installed (the old build without TokenSpeed stubs), pip will see >=0.4.6 satisfied and skip the upgrade → from smg_grpc_proto.generated import tokenspeed_scheduler_pb2 fails at import time.
Fix: bump crates/grpc_client/python/pyproject.toml version to 0.4.8 and set this floor to >=0.4.8:
| "smg-grpc-proto>=0.4.6", | |
| "smg-grpc-proto>=0.4.8", |
Incremental Review Summary (synchronize)

Reviewed the full diff.

Open 🔴 Important issues (4): all four were flagged in the prior review cycle and remain unaddressed in the latest push.
Resolved since last review (1)
Open 🟡 Nit issues (3)
Not approving — synchronize event, and 🔴 issues remain open.
Force-pushed 06fe339 to bb6e97e.
… models

When tokenspeed runs with a reasoning parser that has an xgrammar template (e.g. ``gpt-oss`` → ``harmony``), forwarding a raw JSON-schema constraint causes xgrammar to fight the Harmony channel preamble (``<|channel|>analysis<|message|>…``): the model either generates garbage or stalls until ``max_tokens``, leaving ``content`` empty.

Mirror tokenspeed's HTTP entrypoint (``serving_chat.py``): when a ``reasoning_parser`` is configured, wrap the user's JSON schema via ``structural_tag_for_reasoning_json_schema()`` so the grammar only activates inside the response channel. Parsers without an xgrammar mapping fall back to the raw json_schema unchanged.

Plumbs ``reasoning_parser`` into ``_sampling_params_from_proto`` as a keyword-only argument so the helper stays a static method and existing tests keep passing without modification. The new import of ``tokenspeed.runtime.grammar.reasoning_structural_tag`` is wrapped in ``try/except ImportError`` so stale tokenspeed pins fall back to raw json_schema rather than crashing.

Signed-off-by: key4ng <[email protected]>
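A condensed sketch of the fallback this commit describes; the module and function names come from the commit text, while the wrapper signature and the constraint-dict shape are assumptions:

```python
try:
    from tokenspeed.runtime.grammar.reasoning_structural_tag import (
        structural_tag_for_reasoning_json_schema,
    )
except ImportError:
    # Stale tokenspeed pin: fall back to forwarding the raw schema.
    structural_tag_for_reasoning_json_schema = None


def json_constraint(json_schema: str, reasoning_parser: str | None) -> dict:
    if reasoning_parser and structural_tag_for_reasoning_json_schema is not None:
        tag = structural_tag_for_reasoning_json_schema(reasoning_parser, json_schema)
        if tag is not None:
            # Grammar activates only inside the response channel, so it
            # no longer fights the reasoning-channel preamble.
            return {"structural_tag": tag}
    # Parser without an xgrammar mapping, or old pin: raw schema unchanged.
    return {"json_schema": json_schema}
```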
Wires TokenSpeed into CI and the GPU e2e suite:
- ``.github/actions/setup-tokenspeed`` composite action and
``scripts/ci_install_tokenspeed.sh`` to source-install TokenSpeed
(kernel + scheduler) at a pinned ref, with a wheel cache lookup so
repeat runs skip the ~20 min compile
- e2e-gpu-job.yml: add a tokenspeed engine lane, gated on secret
access so forked PRs skip cleanly
- pr-test-rust.yml: install the same proto deps so Rust-only changes
that touch ``crates/grpc_client`` still cover the tokenspeed proto
- e2e_test infra: ``constants``, ``hooks``, ``worker``, and
``model_specs`` learn about a ``tokenspeed`` runtime alongside
sglang/vllm/trtllm; ``worker.py`` adds the launch builder; the
suite-wide ``@pytest.mark.engine(...)`` markers expand to include
tokenspeed
- Function-calling and tool_choice e2e suites swap to
``Qwen/Qwen3-4B-Instruct-2507`` for tool-call coverage (the Qwen3
family is what TokenSpeed's model registry currently supports)
This is part 3 of 3 splitting #1351:
- PR1: Rust gRPC + protocol
- PR2: Python servicer + unit tests
- PR3 (this): CI workflows + e2e tests
Stacked on PR2. e2e wiring exercises both the Rust router from PR1 and
the Python servicer from PR2 against a live TokenSpeed worker.
Addresses CatherineSue's review on #1351:
- drop the verbose Qwen3-4B docstring on ``TestToolChoiceQwen`` —
that context belongs in the PR description, not in the test file
Signed-off-by: key4ng <[email protected]>
…urce

lightseekorg/tokenspeed is now public, so the clone no longer needs HTTPS basic auth and the fork-PR skip-on-missing-secret guard is no longer necessary.

- scripts/ci_install_tokenspeed.sh: drop the ``TOKENSPEED_GITHUB_TOKEN`` rewrite block; the default ``https://github.com/...`` URL clones anonymously.
- .github/actions/setup-tokenspeed/action.yml: drop the ``github-token`` input and the env-forwarding step.
- .github/workflows/e2e-gpu-job.yml: drop the job-level ``if`` guard that skipped tokenspeed on forked PRs, and the ``with: github-token: ${{ secrets.TOKENSPEED_GITHUB_TOKEN }}`` plumbing on the ``Setup TokenSpeed backend`` step.

TOKENSPEED_REF stays pinned to a tested SHA — bumping that is a separate decision.

Signed-off-by: key4ng <[email protected]>
Switches the TokenSpeed e2e lane to run inside the upstream-published ``lightseekorg/tokenspeed-runner:cu130-torch-2.11.0`` image instead of on the bare runner host. That image already ships CUDA 13.0, Torch 2.11.0, nanobind, cmake, libopenmpi-dev, and libssl-dev — everything the install script previously apt-installed at job time.

Changes:
- ``.github/workflows/e2e-gpu-job.yml``: add a job-level ``container:`` expression that evaluates to the runner image only when ``inputs.engine == 'tokenspeed'``, otherwise ``null`` (bare host, no change for the SGLang/vLLM/TRT-LLM/MLX lanes). Bind-mounts ``/models`` and ``/tmp/tokenspeed-wheel-cache`` so the existing model cache and kernel-wheel cache work unchanged inside the container.
- ``scripts/ci_install_tokenspeed.sh``: drop the CUDA-toolkit apt-get block (~45 lines), the ``libssl-dev``/``libopenmpi-dev``/``cmake`` install, and the ``GITHUB_ENV``/``GITHUB_PATH`` persistence — all redundant now that the container baseline already exports ``CUDA_HOME``, ``LD_LIBRARY_PATH``, etc. Script shrinks from 189 → 105 lines.
- Bump ``TOKENSPEED_REF`` to ``70030b29`` (latest lightseekorg/tokenspeed main, "Add KV cache events to scheduler").

The unavoidable ~30 min CUDA kernel compile still happens on first run because upstream doesn't publish pre-built engine wheels yet — the wheel cache at ``/tmp/tokenspeed-wheel-cache`` short-circuits it on every subsequent run.

Signed-off-by: key4ng <[email protected]>
The previous commit (388cfc1) moved the TokenSpeed lane inside ``lightseekorg/tokenspeed-runner:cu130-torch-2.11.0`` via a job-level ``container:`` directive. On the k8s-runner-gpu pods this fails: the runner image pull hits ``unexpected EOF`` mid-download (likely ephemeral-storage pressure on the pod for a large CUDA image) and the retry can't reach the docker daemon at all (it's not stably available inside the pod).

Roll back the workflow ``container:`` block and restore the host-side toolkit install in ``ci_install_tokenspeed.sh`` (CUDA-13 apt-get, ``libssl-dev``/``libopenmpi-dev``/``cmake``, ``GITHUB_ENV``/``GITHUB_PATH`` persistence). Keep the bumped ``TOKENSPEED_REF`` — that part is independent of the container experiment.

The "use the upstream runner image" simplification is still the right direction in principle; revisit after the runner infra grows stable container support, or after upstream publishes engine wheels (which would remove the source-build step entirely).

Signed-off-by: key4ng <[email protected]>
…-path)

Upstream lightseekorg/tokenspeed (current main) renamed the ``--model-path`` argparse flag to ``--model``. The old name remains only as a positional alias on the parser; passing ``--model-path`` fails the worker boot with:

    __main__.py: error: unrecognized arguments: --model-path

Switch ``_build_tokenspeed_grpc_cmd`` to the new flag form. The servicer's ``server_args.model_path`` attribute is still populated because upstream's ``prepare_server_args`` resolves both the positional ``model_path`` and ``--model`` sources into the same attribute.

Only the tokenspeed lane is affected; sglang / vllm / trtllm / mlx each have their own argparse and continue to accept ``--model-path``.

Signed-off-by: key4ng <[email protected]>
The bumped ``TOKENSPEED_REF`` dropped ``Qwen3MoeForCausalLM`` from its model registry, so the existing ``Qwen/Qwen3-30B-A3B`` worker fails to start on the tokenspeed lane with:

    ValueError: Model architectures ['Qwen3MoeForCausalLM'] are not supported for now.

Switch ``TestEnableThinking`` to ``Qwen/Qwen3.5-27B``. The model:

- is ``Qwen3_5ForConditionalGeneration`` — in the current tokenspeed registry (the family upstream is investing in);
- has the ``enable_thinking`` chat-template toggle the test exercises;
- emits the same ``<think>...</think>`` reasoning markers, so the existing ``qwen3`` SMG reasoning parser handles it without changes;
- weighs ~57 GB (54 GB text + ~3 GB vision encoder), which fits on the 1×H100 80GB lane with ~20 GB headroom for KV cache and activations.

No SMG-side parser changes needed for this test. For future Qwen3.5 function-calling tests, the SMG tool-parser factory already auto-maps ``Qwen/Qwen3.5*`` → ``qwen_xml`` at ``crates/tool_parser/src/factory.rs:353``.

Changes:
- ``e2e_test/infra/model_specs.py``: add a ``Qwen/Qwen3.5-27B`` spec entry (``tp: 1``, thinking + reasoning features) and point ``DEFAULT_ENABLE_THINKING_MODEL_PATH`` at it. The existing ``Qwen/Qwen3-30B-A3B`` entry stays for the nightly perf benchmark.
- ``e2e_test/chat_completions/test_enable_thinking.py``: update the ``@pytest.mark.model`` to ``Qwen/Qwen3.5-27B``.

Signed-off-by: key4ng <[email protected]>
After model load (~57 GB on 1×H100 80GB) the remaining ~16 GB of GPU
memory can't pre-allocate KV cache for the model's 256K native context.
vLLM fails the worker boot with:
ValueError: To serve at least one request with the models's max seq
len (262144), (16.17 GiB KV cache is needed, which is larger than
the available KV cache memory (16.0 GiB).
The TestEnableThinking test sends short single-turn chats ("Hello"),
so cap the context at 16K — same value the ``Qwen/Qwen2.5-14B-Instruct``
spec uses in this file. Apply across all engines that pre-allocate
KV cache by max-seq-len (sglang, vllm, tokenspeed); TRT-LLM allocates
dynamically and keeps the existing ``free_gpu_memory_fraction: 0.8``
knob.
Also pass ``--enforce-eager`` on the engines that respect it — the
hybrid Mamba (Gated DeltaNet) + attention layout makes CUDA-graph
capture finicky and is unnecessary for a smoke test.
Signed-off-by: key4ng <[email protected]>
Dense Qwen3 (``Qwen3ForCausalLM``) is in the bumped tokenspeed pin's model registry, where ``Qwen3MoeForCausalLM`` (``Qwen3-30B-A3B``) and ``Qwen3_5ForConditionalGeneration`` (``Qwen3.5-27B``, which also overflowed vLLM's CUDA-13 ``nixl_ep`` dependency) are not. Hybrid Qwen3 keeps the ``enable_thinking`` chat-template toggle the test needs and emits the same ``<think>...</think>`` markers the existing ``qwen3`` SMG reasoning parser handles.

4B fits trivially on the 1×H100 lane (~8 GB weights), so the engine-specific KV cap and ``--enforce-eager`` args the Qwen3.5-27B spec needed are no longer required.

Changes:
- ``e2e_test/infra/model_specs.py``: replace the ``Qwen/Qwen3.5-27B`` entry with ``Qwen/Qwen3-4B``; point ``DEFAULT_ENABLE_THINKING_MODEL_PATH`` at it. Keep the ``Qwen/Qwen3-30B-A3B`` entry for the nightly perf benchmark.
- ``e2e_test/chat_completions/test_enable_thinking.py``: update ``@pytest.mark.model`` to ``Qwen/Qwen3-4B``.

Signed-off-by: key4ng <[email protected]>
Force-pushed bb6e97e to 779445a.
Hi @key4ng, the DCO sign-off check has failed. All commits must include a `Signed-off-by` line. To fix existing commits:

```
# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease
```

To sign off future commits automatically, commit with `git commit -s`.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ab36713b8
```python
    tokenspeed_scheduler_pb2,
    tokenspeed_scheduler_pb2_grpc,
```
Bump the proto package version for TokenSpeed stubs
When the TokenSpeed servicer is installed from published packages rather than the CI editable checkout, these new tokenspeed_scheduler_pb2(_grpc) exports need a distinct smg-grpc-proto release to depend on; however crates/grpc_client/python/pyproject.toml still declares version 0.4.7, the same version that predates these stubs. Without bumping the proto package version, downstream installs cannot reliably require a build that contains these modules, so TokenSpeed startup can still fail at import time even if the servicer dependency is tightened.
Signed-off-by: key4ng <[email protected]>
Signed-off-by: zhyncs <[email protected]>
grpc_servicer/pyproject.toml

```bash
if [ "${RELEASE_PROTO}" = "true" ]; then
  # Matches both the core dep ("smg-grpc-proto>=0.4.6") and the
  # mlx extra ("smg-grpc-proto>=0.4.7").
```
🟡 Nit: always() also runs on workflow cancellation. If a user cancels after one build job has already succeeded, the release job will still execute and publish artifacts. Using !cancelled() instead preserves the intended behavior (run when dependencies are skipped) while correctly skipping on cancellation.
Suggested change:

```diff
 # mlx extra ("smg-grpc-proto>=0.4.7").
+!cancelled()
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 921a3d9f14
| @pytest.mark.engine("sglang", "vllm") | ||
| @pytest.mark.engine("sglang", "vllm", "tokenspeed") |
Add TokenSpeed to non-chat e2e matrices
When this marker is added, these tests are only selected if a CI job runs with E2E_ENGINE=tokenspeed. I checked .github/workflows/pr-test-rust.yml: TokenSpeed was added only to the chat matrices, while the completions job still lists only sglang/vllm and the router/responses jobs likewise do not run TokenSpeed. As a result, the newly marked TokenSpeed completions/router/responses coverage is never exercised in PR CI; add TokenSpeed to the corresponding matrices or drop the marks.
Hi @key4ng, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

```
git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease
```
Description
Problem
With the Rust router (#1351) and Python servicer (#1464) in place, TokenSpeed needs to be exercised in CI: source-installed at a pinned ref, launched as a worker through the e2e infrastructure, and covered by the function-calling and tool_choice suites.
Solution
A composite GitHub Action plus install script that builds and caches the TokenSpeed kernel + scheduler, a `tokenspeed` lane in the GPU e2e job, and `e2e_test/infra` updates that teach the suite about a `tokenspeed` runtime alongside sglang/vllm/trtllm.

3-PR Stack
This is part 3 of 3 splitting the original #1351:
- PR1 → main
- PR2 → feat/grpc-servicer-tokenspeed
- PR3 → feat/grpc-tokenspeed-servicer (= PR2)

Changes
- `.github/actions/setup-tokenspeed/action.yml` — composite action wrapping `scripts/ci_install_tokenspeed.sh`
- `scripts/ci_install_tokenspeed.sh` — source install of TokenSpeed kernel + scheduler at a pinned git ref, with a wheel cache so repeat runs skip the ~20 min compile
- `.github/workflows/e2e-gpu-job.yml` — add a `tokenspeed` engine lane, gated on secret access so forked PRs skip cleanly
- `.github/workflows/pr-test-rust.yml` — install proto deps so Rust-only changes touching `crates/grpc_client` still cover the new tokenspeed proto
- `e2e_test/infra/{constants,worker,model_specs}.py` and `e2e_test/fixtures/hooks.py` — teach the suite about a `tokenspeed` runtime
- `@pytest.mark.engine(...)` markers expand to include `tokenspeed`
- function-calling and tool_choice suites swap to `Qwen/Qwen3-4B-Instruct-2507` (the Qwen3 family is what TokenSpeed's model registry currently supports)

Test Plan
- `bash -n scripts/ci_install_tokenspeed.sh` passes
- `pre-commit run --files <changed>` passes (rustfmt / clippy / ruff / ruff format / codespell / yaml / toml all green)
- GPU e2e suites run with `Qwen/Qwen3-4B-Instruct-2507` against a live tokenspeed worker, plus the existing reference engines for parity

Review Threads from #1351
Addressed in this PR:
- `e2e_test/chat_completions/test_function_calling.py` — verbose Qwen3-4B docstring on `TestToolChoiceQwen` removed; that context lives in this PR description instead.
- `cargo +nightly fmt` passes (no Rust changes)
- `cargo clippy --all-targets --all-features -- -D warnings` passes (no Rust changes)

Summary by CodeRabbit
New Features
Testing
Chores