feat(mlx-bench): nightly benchmark — MLX direct HTTP vs Router+gRPC #1399

key4ng wants to merge 20 commits into
Conversation
📝 Walkthrough

Adds a nightly/manual macOS GitHub Actions workflow and a benchmark suite that compares MLX direct HTTP vs Router+gRPC: workflow config, a Bash orchestrator to run two benchmark phases, a Python aggregator to summarize results, and README documentation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CI as GitHub Actions
    participant Runner as run.sh Orchestrator
    participant Bench as genai-bench
    participant HTTPSrv as MLX HTTP Server
    participant GRPCSvc as MLX gRPC Servicer
    participant Router as SMG Router
    participant FS as Filesystem
    CI->>Runner: Start with MODEL, SCENARIOS, CONCURRENCIES, DURATION
    rect rgba(100, 150, 200, 0.5)
        Note over Runner,FS: Phase 1 — Direct HTTP Benchmark
        Runner->>HTTPSrv: Start mlx_lm.server (HTTP)
        Runner->>HTTPSrv: Poll /v1/models for readiness
        Runner->>Bench: Launch genai-bench per scenario×concurrency
        Bench->>HTTPSrv: Send requests
        HTTPSrv->>Bench: Return responses
        Bench->>FS: Write results to bench-results/http_*
        Runner->>HTTPSrv: Stop HTTP server
    end
    rect rgba(150, 200, 100, 0.5)
        Note over Runner,FS: Phase 2 — gRPC via Router Benchmark
        Runner->>GRPCSvc: Start MLX gRPC servicer
        Runner->>Router: Start SMG router (forward to gRPC)
        Runner->>Router: Poll /v1/models for readiness
        Runner->>Bench: Launch genai-bench per scenario×concurrency
        Bench->>Router: Send requests
        Router->>GRPCSvc: Forward via gRPC
        GRPCSvc->>Router: Return responses
        Router->>Bench: Return responses
        Bench->>FS: Write results to bench-results/grpc_*
        Runner->>Router: Stop router
        Runner->>GRPCSvc: Stop servicer
    end
    Runner->>FS: Aggregate bench-results into SUMMARY.md
    CI->>FS: Upload artifacts and summary
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
```yaml
jobs:
  bench:
    name: MLX HTTP vs Router+gRPC
    runs-on: macos-latest
```
🔴 Important: macos-latest GitHub-hosted runners do not have Docker pre-installed. The docker pull (line 89) and the docker run calls in run.sh will fail.
Options:

- Add a step to install Docker (e.g. via colima or the `docker/setup-docker-action`).
- Install `genai-bench` as a native binary/pip package instead of using the container image.
- Run on a self-hosted macOS runner that has Docker available.
Fixed in 1f2953e. Went with option 1 (Colima): `brew install colima docker && colima start`. Adds ~1-2 min to setup but keeps the existing genai-bench docker pattern intact, matching e2e_test/benchmarks/conftest.py.
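For reference, the Colima setup described above would land in the workflow as a step along these lines (the step name, resource flags, and the `docker info` sanity check are assumptions based on this reply, not the exact committed YAML):

```yaml
- name: Set up Docker via Colima
  run: |
    brew install colima docker
    colima start --cpu 2 --memory 4 --disk 20
    # Fail fast if the daemon is not reachable before the benchmark steps run
    docker info
```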
```bash
# ──────────────────────────────────────────────────────────────────────────
# Phase 2: SMG router + MLX gRPC servicer
# ──────────────────────────────────────────────────────────────────────────
```
🟡 Nit: This pattern-substitution on an array doesn't actually remove the element — it replaces the matching value with an empty string, leaving a ghost entry in PIDS.
```bash
PIDS=("${PIDS[@]/$MLX_HTTP_PID}")
```
Not harmful here because the cleanup trap guards with `kill -0`, but the idiomatic fix is:

```bash
local new_pids=()
for p in "${PIDS[@]}"; do
  [[ "$p" != "$MLX_HTTP_PID" ]] && new_pids+=("$p")
done
PIDS=("${new_pids[@]}")
```

Or since this is the only place you remove a PID, you could just skip the removal entirely — the trap already handles dead PIDs gracefully.
Same fix as the coderabbit thread — addressed in 1f2953e by rebuilding the array instead of substituting.
```bash
benchmark \
  --api-backend openai \
```
🔴 Important: Even if Docker is installed on the macOS runner, --network host does not work the same way on macOS as on Linux. On Docker for Mac (Docker Desktop, colima, etc.), the Docker engine runs inside a Linux VM, so --network host gives the container the VM's network — not the macOS host's. The --api-base URL pointing to 127.0.0.1 will fail to reach the mlx-lm/smg servers running on the Mac host.
On macOS you'd need to either:

- Replace `127.0.0.1` with `host.docker.internal` in the `--api-base` URL passed to genai-bench, or
- Drop `--network host` and use `-p` port publishing instead.
Since this script also needs to work on macOS for local dev (per the README), consider detecting the OS:
```bash
if [[ "$(uname)" == "Darwin" ]]; then
  BENCH_HOST="host.docker.internal"
  DOCKER_NET_ARGS=()
else
  BENCH_HOST="127.0.0.1"
  DOCKER_NET_ARGS=(--network host)
fi
```
Fixed in 1f2953e. Dropped `--network host`, added `--add-host=host.docker.internal:host-gateway`, and rewrote the base URL inside run_bench so the bench container reaches the macOS host's inference servers via `host.docker.internal` instead of trying to share the VM's network namespace.
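A sketch of what such a rewrite could look like (the variable name and port are hypothetical; the actual run_bench implementation isn't shown in this thread):

```shell
#!/usr/bin/env bash
# On Docker for Mac / Colima the engine runs in a Linux VM, so a container
# cannot reach the host via 127.0.0.1; host.docker.internal resolves to the
# host instead (paired with --add-host=host.docker.internal:host-gateway).
api_base="http://127.0.0.1:30000/v1"   # port is illustrative
if [[ "$(uname)" == "Darwin" ]]; then
  api_base="${api_base/127.0.0.1/host.docker.internal}"
fi
echo "$api_base"
```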
```bash
  --num-concurrency "$concurrency" \
  --traffic-scenario "$scenario" \
  --max-requests-per-run "$MAX_REQUESTS" \
  --max-time-per-run "$DURATION_MIN" \
```
🔴 Important: `--max-time-per-run` takes seconds, not minutes. See e2e_test/benchmarks/test_nightly_perf.py:30:

```python
_MAX_TIME_PER_RUN = 10  # seconds per scenario×concurrency combo
```

With the current code, DURATION_MIN=10 passes 10 to `--max-time-per-run`, meaning each cell runs for 10 seconds instead of the intended 10 minutes. The total runtime would be ~160 seconds, not ~2.7 hours, producing statistically meaningless results.
Fix: convert minutes to seconds:

```diff
-  --max-time-per-run "$DURATION_MIN" \
+  --max-time-per-run "$((DURATION_MIN * 60))" \
```
Critical catch — fixed in 1f2953e. run.sh now multiplies DURATION_MIN by 60 before passing it as --max-time-per-run. Verified against the comment in e2e_test/benchmarks/test_nightly_perf.py:30. Without this fix the entire 'nightly' bench would have been 16 × 10 seconds = 160 seconds total, not 2.7 hours.
Reverted the previous 'fix' in 16769d5 after empirically catching it on run #24978016312. genai-bench takes --max-time-per-run in minutes (it multiplies by 60 internally at genai_bench/cli/cli.py:326). The comment in e2e_test/benchmarks/test_nightly_perf.py:30 is misleading — the existing tests pass 10 and get 10 minutes, not 10 seconds. With DURATION_MIN=2 and the seconds conversion I added, each cell was scheduled for 7200 seconds = 2 hours, which matches the 74-minute c=1 cell observed in the cancelled run.
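The arithmetic behind the revert, spelled out (the numbers come from the comment above):

```shell
#!/usr/bin/env bash
DURATION_MIN=2
converted=$((DURATION_MIN * 60))  # the reverted fix passed 120 as --max-time-per-run
scheduled=$((converted * 60))     # genai-bench multiplies by 60 internally (minutes to seconds)
echo "$scheduled"                 # 7200 seconds = 2 hours per cell, not 2 minutes
```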
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 15a26171e8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```yaml
jobs:
  bench:
    name: MLX HTTP vs Router+gRPC
    runs-on: macos-latest
```
Move Docker benchmark job off macOS runner
This job is pinned to runs-on: macos-latest but the workflow later relies on docker pull and docker run (via run.sh) without provisioning a container runtime. On GitHub-hosted runners, macos-latest maps to the arm64 macOS image, and the published macos-15-arm64 image software manifest does not include Docker, so the benchmark fails before collecting any results. Please run this benchmark on a Linux runner or add explicit Docker runtime setup/startup for macOS.
Fixed in 1f2953e. Added a Colima install step to the workflow before any docker call: `brew install colima docker && colima start --cpu 2 --memory 4 --disk 20`. The macOS runner now has a working Docker daemon.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/nightly-mlx-bench.yml:
- Around line 29-31: The concurrency group currently uses a dynamic name
`nightly-mlx-bench-${{ github.event_name }}` which separates scheduled and
manual runs; change it to a fixed group name like `nightly-mlx-bench` so all
workflows share a single concurrency group and only one instance runs at a
time—update the `concurrency.group` value to the fixed string and keep
`cancel-in-progress` as desired to ensure proper single-instance behavior.
- Around line 88-92: The workflow currently pulls and runs a Docker image on
macos-latest without installing Docker and the benchmark script run.sh uses
"--network host" which does not behave the same on macOS; either switch the job
to run on ubuntu-latest or add a step to install/configure Docker on macOS
(e.g., use actions/setup-docker or install Docker Desktop CLI) before the "Pull
genai-bench image" and "Run benchmark (this is the long step)" steps, and modify
run.sh to avoid or conditionalize the "--network host" flag on macOS (detect
platform in run.sh and skip or use alternative networking) so the benchmark runs
reliably across runners.
In `@benchmarks/mlx_grpc_vs_http/aggregate.py`:
- Around line 136-139: The code currently writes output via
Path(args.out).write_text(table) without ensuring the parent directory exists;
update the logic in aggregate.py right before the write_text call (near where
collect and build_table are used) to create the parent directory if missing (use
Path(args.out).parent.mkdir(parents=True, exist_ok=True)) and then perform
Path(args.out).write_text(table) so standalone runs won't fail when the target
directory doesn't exist.
- Around line 27-45: parse_experiment currently picks json_files[0] arbitrarily
when multiple JSONs exist; make selection deterministic by sorting json_files
before choosing one (e.g., sort by modification time using Path.stat().st_mtime
or by filename) and then read the chosen file; update the logic in
parse_experiment (variable json_files and the selection of json_files[0]) to
pick the latest/appropriately ordered file so repeated or partial reruns yield a
predictable result.
In `@benchmarks/mlx_grpc_vs_http/run.sh`:
- Line 147: The current removal using PIDS=("${PIDS[@]/$MLX_HTTP_PID}") does
substring replacement and can corrupt entries; replace it by rebuilding the
array to exclude the exact PID: iterate over PIDS (use the PIDS variable) and
append only entries not equal to MLX_HTTP_PID into a NEW_PIDS array, then assign
PIDS=("${NEW_PIDS[@]}"); update the code in run.sh where PIDS and MLX_HTTP_PID
are handled (or alternatively rely on the existing cleanup trap that checks kill
-0), but do not use the substring-replacement form.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cf4091d4-f882-44f5-9a62-62fda3e76faf
📒 Files selected for processing (4)
- .github/workflows/nightly-mlx-bench.yml
- benchmarks/mlx_grpc_vs_http/README.md
- benchmarks/mlx_grpc_vs_http/aggregate.py
- benchmarks/mlx_grpc_vs_http/run.sh
```yaml
concurrency:
  group: nightly-mlx-bench-${{ github.event_name }}
  cancel-in-progress: false
```
🧹 Nitpick | 🔵 Trivial
Concurrency group may cause unexpected cancellations.
The concurrency group nightly-mlx-bench-${{ github.event_name }} means scheduled runs and manual dispatches use different groups (nightly-mlx-bench-schedule vs nightly-mlx-bench-workflow_dispatch). This allows a manual run to execute concurrently with a scheduled run, which could compete for resources or cause confusion. Consider using a fixed group name if only one instance should ever run.
💡 Suggested fix for single-instance concurrency
```diff
 concurrency:
-  group: nightly-mlx-bench-${{ github.event_name }}
+  group: nightly-mlx-bench
   cancel-in-progress: false
```
Already mitigated — the workflow has cancel-in-progress: false, so even though ${{ github.event_name }} produces different group keys for schedule vs workflow_dispatch, neither cancels the other. Keeping as-is intentionally so a manual dispatch never kills an in-progress nightly.
@key4ng, thanks for the clarification — that's a valid and deliberate design. Using ${{ github.event_name }}-scoped groups with cancel-in-progress: false is the correct way to let a manual dispatch and a nightly coexist independently: each group serializes its own event type without interfering with the other. I've noted this for future reviews.
✏️ Learnings added
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1399
File: .github/workflows/nightly-mlx-bench.yml:39-41
Timestamp: 2026-04-27T04:50:22.772Z
Learning: In repo lightseekorg/smg, `.github/workflows/nightly-mlx-bench.yml`: The concurrency group `nightly-mlx-bench-${{ github.event_name }}` is intentional. It creates separate groups for `schedule` and `workflow_dispatch` events so that a manual dispatch run and a nightly scheduled run can coexist without blocking or cancelling each other. `cancel-in-progress: false` ensures neither a new nightly nor a new manual dispatch kills an already-running instance of the same event type. Do not flag this as a bug or suggest collapsing to a single fixed group name.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 635
File: .github/workflows/pr-test-rust.yml:621-621
Timestamp: 2026-03-04T23:20:28.110Z
Learning: In lightseekorg/smg, keeping `-s` (no capture) unconditionally in pytest invocations for parallel E2E jobs in `.github/workflows/pr-test-rust.yml` is intentional. Removing `-s` for parallel jobs was explicitly rejected because: (1) captured output is discarded on CI timeout/hang, which is unacceptable for debugging, and (2) it removes real-time log visibility. Log interleaving from parallel runs is instead addressed by a `_TestNameFormatter` in `e2e_test/conftest.py` that prefixes every log line with `[test_name]`, allowing users to grep for a specific test name to follow its output through interleaved logs.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/benches/wasm_middleware_latency.rs:88-91
Timestamp: 2026-02-21T02:37:02.009Z
Learning: Repo: lightseekorg/smg — For clippy-only/enforcement PRs (e.g., PR `#489`), even micro-optimizations (like replacing an async closure with std::future::ready in benches such as model_gateway/benches/wasm_middleware_latency.rs) should be deferred to a follow-up PR rather than included inline.
Learnt from: pallasathena92
Repo: lightseekorg/smg PR: 687
File: model_gateway/src/routers/openai/realtime/webrtc.rs:238-289
Timestamp: 2026-03-11T01:29:56.655Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/openai/realtime/webrtc.rs: `Metrics::record_router_request` is already emitted in router.rs (around line 1152) before `handle_realtime_webrtc` is called, and `Metrics::record_router_error` is emitted inside `handle_realtime_webrtc` for the no-workers case. The missing instrumentation is success/duration recording after `setup_and_spawn_bridge` returns — this is a metrics improvement deferred to a follow-up PR, not a correctness gap. Do not flag missing success/duration metrics as a blocking issue for PR `#687`.
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1250
File: model_gateway/tests/api/responses_api_test.rs:643-653
Timestamp: 2026-04-24T20:01:12.229Z
Learning: In repo lightseekorg/smg, PR `#1250` (`model_gateway/tests/api/responses_api_test.rs`): The router-side rejection path for `previous_response_id + mcp_approval_response + no MCP tools` (i.e., a resumed approval continuation where the follow-up request carries zero MCP tool bindings) is intentionally not covered by integration tests in this PR. The existing continuation fixtures all inherit `req1` which carries an MCP binding. A dedicated test for this rejection branch is deferred to a follow-up PR. Do not re-flag the missing coverage in PR `#1250` or similar approval-continuation PRs.
Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 723
File: model_gateway/tests/api/interactions_api_test.rs:316-346
Timestamp: 2026-03-11T05:32:45.536Z
Learning: In repo lightseekorg/smg, `test_interactions_multiple_workers` in `model_gateway/tests/api/interactions_api_test.rs` intentionally only verifies that requests succeed with multiple Gemini backends (connectivity), not load distribution. Load distribution across workers is covered separately in `tests/routing/load_balancing_test.rs`. Do not flag the absence of per-worker request count assertions in this test as a gap — distribution verification is deferred to a follow-up PR once the Gemini router implementation matures.
Learnt from: RohanSogani
Repo: lightseekorg/smg PR: 1333
File: model_gateway/tests/api/responses_api_test.rs:325-373
Timestamp: 2026-04-23T18:59:11.719Z
Learning: In repo lightseekorg/smg, PR `#1333` adds `extract_forwardable_request_headers` and threads forwarded headers through `McpToolSession` → `McpRequestContext` → `HandlerRequestContext`. This PR's scope is strictly internal request-context preservation (preserving request-scoped headers so they are accessible in MCP execution context), NOT outbound MCP header propagation to dynamic MCP servers. The integration test `test_non_streaming_mcp_e2e_accepts_forwardable_request_headers` in `model_gateway/tests/api/responses_api_test.rs` is intentionally a smoke test (verifies MCP still executes when forwardable headers are present); wire-level header assertions via `MockMCPServer::captured_headers()` belong in the follow-up PR covering config-driven outbound MCP header propagation. Do not flag the absence of `captured_headers()` assertions in this test as a gap for PR `#1333`.
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 1305
File: model_gateway/src/routers/grpc/harmony/builder.rs:726-728
Timestamp: 2026-04-22T20:29:00.426Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/harmony/builder.rs`: The warn message "Approval item reached Harmony conversion" in `parse_response_item_to_harmony_message` is pre-existing code. The `ComputerCall { .. }` and `ComputerCallOutput { .. }` match arms added in PR `#1305` are forced-cascade additions to keep the exhaustive match valid within a protocol-scope PR. Rewording the warning to a neutral message (e.g. "Unsupported input item reached Harmony conversion") is intentionally deferred to a follow-up router task. Do not re-flag this stale log message in protocol-scope PRs.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 936
File: model_gateway/src/routers/gemini/router.rs:82-90
Timestamp: 2026-03-30T18:52:19.689Z
Learning: In repo lightseekorg/smg, the Gemini router's (`model_gateway/src/routers/gemini/router.rs`) use of `model_id` for per-model retry config lookup in `route_interactions` has a pre-existing limitation: when the request carries only an agent identifier, `get_retry_config(agent_id)` finds no match because agent→model resolution happens later inside `driver::execute` (in `worker_selection.rs`). This same limitation affects policy lookup and worker selection in the Gemini router. It is NOT introduced by PR `#936` and is a known accepted gap. The Gemini router is not actively used currently. Do not flag this as a bug introduced by per-model retry config changes to the Gemini router.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1099
File: .github/workflows/pr-test-mlx.yml:43-43
Timestamp: 2026-04-13T02:02:07.209Z
Learning: In this repo, when reviewing GitHub Actions workflow YAML files, treat `runs-on: macos-latest` as acceptable for Apple Silicon: GitHub’s standard `macos-latest` runner resolves to an arm64 macOS runner (Intel is being deprecated). Do not flag `macos-latest` as potentially non-arm64/Intel for Apple Silicon + MLX workloads.
Actionable comments posted: 1
♻️ Duplicate comments (1)
.github/workflows/nightly-mlx-bench.yml (1)
100-104: ⚠️ Potential issue | 🟠 Major

Docker runtime is assumed but not provisioned on the macOS runner.

Line 101 and Line 104 depend on a working Docker daemon, but this job has no Docker setup/start step. Please provision Docker explicitly on macOS (or remove the container dependency for `genai-bench`) before these steps.

Run this verification to confirm the current macOS runner image contents:
```bash
#!/bin/bash
set -euo pipefail
# Check whether Docker appears in the official macOS arm64 runner image manifest.
curl -fsSL https://raw.githubusercontent.com/actions/runner-images/main/images/macos/macos-15-arm64-Readme.md \
  | rg -n -i '\bdocker\b' || true
# Expected result:
# - No Docker tool entry found in the manifest (or no matching lines),
#   which indicates the workflow must install/start Docker itself.
```
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 282d6bcd-1201-4a4b-9f0a-f1578d5112c4
📒 Files selected for processing (1)
.github/workflows/nightly-mlx-bench.yml
```yaml
# TEMPORARY: enable PR trigger so this workflow can be smoke-tested before
# merge. Path-filtered so unrelated PRs don't fire it, and the matrix is
# reduced via the env block below to ~12 minutes. REMOVE this block before
# merging — once the harness is validated, only schedule + workflow_dispatch
# should remain.
pull_request:
  branches: [main]
  paths:
    - "benchmarks/mlx_grpc_vs_http/**"
    - ".github/workflows/nightly-mlx-bench.yml"
```
Remove the temporary pull_request trigger before merge.
This block is explicitly marked temporary; merging it as-is keeps PR smoke runs enabled beyond the validation phase and expands CI spend/surface area unintentionally.
✂️ Proposed cleanup before final merge
```diff
-# TEMPORARY: enable PR trigger so this workflow can be smoke-tested before
-# merge. Path-filtered so unrelated PRs don't fire it, and the matrix is
-# reduced via the env block below to ~12 minutes. REMOVE this block before
-# merging — once the harness is validated, only schedule + workflow_dispatch
-# should remain.
-pull_request:
-  branches: [main]
-  paths:
-    - "benchmarks/mlx_grpc_vs_http/**"
-    - ".github/workflows/nightly-mlx-bench.yml"
```
Deliberately deferred. The PR trigger smoke-tests the harness on each push while the bench is still being validated; without it I can't see whether changes break the workflow until manual dispatch. Will remove the trigger and the surrounding TEMPORARY comment in the final pre-merge commit. The path filter (benchmarks/mlx_grpc_vs_http/** + the workflow file itself) keeps unrelated PRs from firing it, and the env-block matrix reduction (concurrencies=1 4, duration=4 min) caps PR-trigger spend at ~25 min vs the full ~2 h scheduled run.
@key4ng, understood — that's a reasonable validation strategy. The path filter + reduced matrix (≈25 min cap) makes the temporary trigger cost-conscious, and having it fire on every push is genuinely more convenient than repeated manual dispatches during harness development.
I've noted your commitment to remove the pull_request block and its surrounding TEMPORARY comment in the final pre-merge commit. I'll keep this comment open as a pre-merge reminder.
✏️ Learnings added
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1399
File: .github/workflows/nightly-mlx-bench.yml:29-38
Timestamp: 2026-05-01T00:13:32.765Z
Learning: In repo lightseekorg/smg, `.github/workflows/nightly-mlx-bench.yml`: The `pull_request:` trigger block (with `branches: [main]` and `paths: ["benchmarks/mlx_grpc_vs_http/**", ".github/workflows/nightly-mlx-bench.yml"]`) is explicitly marked TEMPORARY and is intentionally kept during the harness validation phase so the author can smoke-test the workflow on each push without needing manual `workflow_dispatch`. The env-block matrix reduction caps PR-trigger spend at ~25 min. The block is committed to be removed in the final pre-merge commit, leaving only `schedule` and `workflow_dispatch` triggers. Do NOT re-flag this temporary trigger as a blocking issue during the validation period, but do flag it if it is still present at merge time.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1399
File: .github/workflows/nightly-mlx-bench.yml:39-41
Timestamp: 2026-04-27T04:50:22.772Z
Learning: In `.github/workflows/nightly-mlx-bench.yml`, do not flag or recommend changing the `concurrency.group` value `nightly-mlx-bench-${{ github.event_name }}`. It is intentionally designed to separate concurrency between `schedule` and `workflow_dispatch` so a manual run can coexist with (rather than block/cancel) a scheduled nightly run. Keep `cancel-in-progress: false` as-is, since it intentionally prevents new runs of the same event type from cancelling an already-running instance.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1398
File: .github/workflows/pr-test-mlx.yml:66-119
Timestamp: 2026-04-30T06:05:20.312Z
Learning: In repo lightseekorg/smg, `.github/workflows/pr-test-mlx.yml`: Do NOT suggest adding a Hugging Face model cache step for the `mlx-community/Qwen3-0.6B-4bit` model (~400 MB). The model is intentionally left uncached because it is small enough that the network download cost is acceptable and caching it wastes GitHub Actions cache storage.
Learnt from: VadivelanU
Repo: lightseekorg/smg PR: 1407
File: model_gateway/src/routers/openai/responses/streaming.rs:1061-1173
Timestamp: 2026-04-29T13:26:42.244Z
Learning: In repo lightseekorg/smg, `handle_streaming_with_tool_interception` in `model_gateway/src/routers/openai/responses/streaming.rs`: The inner `break` on `tool_calls_detected` inside the `for normalized_event in normalized_events` loop (after `StreamAction::ExecuteTools` sets the flag) is a pre-existing behavior preserved intentionally from prior iterations. It can theoretically drop later events in the same normalized batch, but fixing it (moving the break outside the inner loop so all normalized events are processed before acting on the flag) is deferred to a follow-up PR. Do not re-flag this early break as a blocking issue in PR `#1407` or similar streaming-refactor PRs.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 635
File: .github/workflows/pr-test-rust.yml:621-621
Timestamp: 2026-03-04T23:20:28.110Z
Learning: In lightseekorg/smg, keeping `-s` (no capture) unconditionally in pytest invocations for parallel E2E jobs in `.github/workflows/pr-test-rust.yml` is intentional. Removing `-s` for parallel jobs was explicitly rejected because: (1) captured output is discarded on CI timeout/hang, which is unacceptable for debugging, and (2) it removes real-time log visibility. Log interleaving from parallel runs is instead addressed by a `_TestNameFormatter` in `e2e_test/conftest.py` that prefixes every log line with `[test_name]`, allowing users to grep for a specific test name to follow its output through interleaved logs.
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1402
File: model_gateway/src/routers/grpc/harmony/streaming.rs:509-513
Timestamp: 2026-04-27T23:48:21.853Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/harmony/streaming.rs`: The unconditional `prefill_stream.mark_completed()` call after `process_decode_stream(...)` (in both `process_responses_dual_stream` and `process_dual_stream`) is a pre-existing pattern that predates PR `#1402`. PR `#1402` only relocated the function; no logic was changed. The fix — guarding `mark_completed()` behind `result.is_ok()` so that decode/parser errors still trigger abort-on-drop cleanup for the prefill RPC — is intentionally deferred to a follow-up PR covering both dual-stream paths. Do not re-flag this as a regression in PR `#1402` or similar relocation PRs.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/benches/wasm_middleware_latency.rs:88-91
Timestamp: 2026-02-21T02:37:02.009Z
Learning: Repo: lightseekorg/smg — For clippy-only/enforcement PRs (e.g., PR `#489`), even micro-optimizations (like replacing an async closure with std::future::ready in benches such as model_gateway/benches/wasm_middleware_latency.rs) should be deferred to a follow-up PR rather than included inline.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 1305
File: model_gateway/src/routers/grpc/harmony/builder.rs:726-728
Timestamp: 2026-04-22T20:29:00.426Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/harmony/builder.rs`: The warn message "Approval item reached Harmony conversion" in `parse_response_item_to_harmony_message` is pre-existing code. The `ComputerCall { .. }` and `ComputerCallOutput { .. }` match arms added in PR `#1305` are forced-cascade additions to keep the exhaustive match valid within a protocol-scope PR. Rewording the warning to a neutral message (e.g. "Unsupported input item reached Harmony conversion") is intentionally deferred to a follow-up router task. Do not re-flag this stale log message in protocol-scope PRs.
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1250
File: model_gateway/tests/api/responses_api_test.rs:1347-1355
Timestamp: 2026-04-24T20:01:23.917Z
Learning: In repo lightseekorg/smg, the response transformer in the non-streaming OpenAI path rewrites synthesized `function_call_output` items into `mcp_call` items before they reach the final client-visible output. Consequently, the assertion in `test_final_response_hides_internal_mcp_trace_items` (model_gateway/tests/api/responses_api_test.rs) does not need to explicitly block `"function_call_output"` — any such leak would first require the transformer to fail, which would already be caught by the `mcp_call` guard. Adding `"function_call_output"` to that assertion is defense-in-depth only, not a fix for a live gap; it is intentionally deferred to a follow-up PR.
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1250
File: model_gateway/tests/api/responses_api_test.rs:643-653
Timestamp: 2026-04-24T20:01:18.002Z
Learning: In repo lightseekorg/smg, PR `#1250` (`model_gateway/tests/api/responses_api_test.rs`): The router-side rejection path for `previous_response_id + mcp_approval_response + no MCP tools` (i.e., a resumed approval continuation where the follow-up request carries zero MCP tool bindings) is intentionally not covered by integration tests in this PR. The existing continuation fixtures all inherit `req1` which carries an MCP binding. A dedicated test for this rejection branch is deferred to a follow-up PR. Do not re-flag the missing coverage in PR `#1250` or similar approval-continuation PRs.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 1090
File: crates/tool_parser/src/parsers/mistral.rs:48-75
Timestamp: 2026-04-13T23:31:17.654Z
Learning: In repo lightseekorg/smg, `crates/tool_parser/src/parsers/mistral.rs` `MistralParser::build_structural_tag`: The `triggered_tags` format intentionally constrains only single tool calls (one `begin`/`end` tag pair per tool, trigger `"[TOOL_CALLS]"`). Supporting parallel Mistral tool calls (array with multiple items) requires `tags_with_separator` (as sglang implements) and is deferred to a follow-up PR. Do not flag the absence of multi-call/continuation tags in `build_structural_tag` as a bug; the primary target use case is `tool_choice: function` (single call).
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 990
File: crates/tokenizer/benches/stop_decoder.rs:23-24
Timestamp: 2026-04-01T08:02:32.590Z
Learning: In `lightseekorg/smg` (`crates/tokenizer/benches/stop_decoder.rs`), benchmarks intentionally use `MockTokenizer` (not a real HuggingFace tokenizer) to keep them hermetic and avoid network/filesystem dependencies for model downloads. The HuggingFace native `step_decode_stream` fast-path is covered by the integration test `test_sequence_operations` in `crates/tokenizer/tests/tokenizer_integration.rs`, which uses a real HF tokenizer. A fixture-backed HF benchmark is a known follow-up item but is not required in this PR. Do not flag the absence of HF tokenizer usage in the bench file as a gap.
Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 723
File: model_gateway/tests/api/interactions_api_test.rs:316-346
Timestamp: 2026-03-11T05:32:45.536Z
Learning: In repo lightseekorg/smg, `test_interactions_multiple_workers` in `model_gateway/tests/api/interactions_api_test.rs` intentionally only verifies that requests succeed with multiple Gemini backends (connectivity), not load distribution. Load distribution across workers is covered separately in `tests/routing/load_balancing_test.rs`. Do not flag the absence of per-worker request count assertions in this test as a gap — distribution verification is deferred to a follow-up PR once the Gemini router implementation matures.
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1174
File: crates/mcp/src/core/orchestrator.rs:1097-1131
Timestamp: 2026-04-17T06:48:38.553Z
Learning: In repo lightseekorg/smg, `execute_tool_entry_result` in `crates/mcp/src/core/orchestrator.rs`: `record_call_end` is intentionally called for both `ToolExecutionResult::Executed` and `ToolExecutionResult::PendingApproval` to preserve pre-refactor metric shape (pending approval closes out the current tool-handling attempt). For `PendingApproval`, `succeeded = false` was a bug fixed in PR `#1174` so it no longer inflates error rates, but a dedicated `record_call_pending_approval` counter and full lifecycle split (only recording completion when approval is resolved and execution runs) are deferred to a follow-up PR. Do not flag the absence of a dedicated pending-approval metric as a blocking issue until that follow-up work is done.
Learnt from: pallasathena92
Repo: lightseekorg/smg PR: 687
File: model_gateway/src/routers/openai/realtime/webrtc.rs:238-289
Timestamp: 2026-03-11T01:29:56.655Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/openai/realtime/webrtc.rs: `Metrics::record_router_request` is already emitted in router.rs (around line 1152) before `handle_realtime_webrtc` is called, and `Metrics::record_router_error` is emitted inside `handle_realtime_webrtc` for the no-workers case. The missing instrumentation is success/duration recording after `setup_and_spawn_bridge` returns — this is a metrics improvement deferred to a follow-up PR, not a correctness gap. Do not flag missing success/duration metrics as a blocking issue for PR `#687`.
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.
Learnt from: shenoyvvarun
Repo: lightseekorg/smg PR: 1397
File: model_gateway/src/routers/http/pd_router.rs:104-116
Timestamp: 2026-04-29T15:14:12.061Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/http/pd_router.rs` and `router.rs`: When `enforce_token_rate_limit` (or its equivalent in the regular HTTP router) returns a 429/413 early exit, do NOT emit the existing worker-dispatch metrics (`Metrics::record_router_request`, `record_router_error`, etc.) for throttled traffic. Those metrics are semantically tied to requests dispatched to backend workers. Instead, a dedicated metric such as `Metrics::record_request_throttled(router_type, backend, model_id, tenant_key)` should be added to track admission-control rejections separately. Do not re-flag the absence of worker-dispatch metrics on throttled paths as a bug; flag only the absence of a dedicated throttle counter if one is not present.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 936
File: model_gateway/src/routers/gemini/router.rs:82-90
Timestamp: 2026-03-30T18:52:19.689Z
Learning: In repo lightseekorg/smg, the Gemini router's (`model_gateway/src/routers/gemini/router.rs`) use of `model_id` for per-model retry config lookup in `route_interactions` has a pre-existing limitation: when the request carries only an agent identifier, `get_retry_config(agent_id)` finds no match because agent→model resolution happens later inside `driver::execute` (in `worker_selection.rs`). This same limitation affects policy lookup and worker selection in the Gemini router. It is NOT introduced by PR `#936` and is a known accepted gap. The Gemini router is not actively used currently. Do not flag this as a bug introduced by per-model retry config changes to the Gemini router.
Learnt from: RohanSogani
Repo: lightseekorg/smg PR: 1333
File: model_gateway/tests/api/responses_api_test.rs:325-373
Timestamp: 2026-04-23T18:59:15.656Z
Learning: In repo lightseekorg/smg, PR `#1333` adds `extract_forwardable_request_headers` and threads forwarded headers through `McpToolSession` → `McpRequestContext` → `HandlerRequestContext`. This PR's scope is strictly internal request-context preservation (preserving request-scoped headers so they are accessible in MCP execution context), NOT outbound MCP header propagation to dynamic MCP servers. The integration test `test_non_streaming_mcp_e2e_accepts_forwardable_request_headers` in `model_gateway/tests/api/responses_api_test.rs` is intentionally a smoke test (verifies MCP still executes when forwardable headers are present); wire-level header assertions via `MockMCPServer::captured_headers()` belong in the follow-up PR covering config-driven outbound MCP header propagation. Do not flag the absence of `captured_headers()` assertions in this test as a gap for PR `#1333`.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1099
File: .github/workflows/pr-test-mlx.yml:43-43
Timestamp: 2026-04-13T02:02:07.209Z
Learning: In this repo, when reviewing GitHub Actions workflow YAML files, treat `runs-on: macos-latest` as acceptable for Apple Silicon: GitHub’s standard `macos-latest` runner resolves to an arm64 macOS runner (Intel is being deprecated). Do not flag `macos-latest` as potentially non-arm64/Intel for Apple Silicon + MLX workloads.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b068350219
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```bash
    --num-concurrency "$concurrency" \
    --traffic-scenario "$scenario" \
    --max-requests-per-run "$MAX_REQUESTS" \
    --max-time-per-run "$DURATION_MIN" \
```
Convert max-time-per-run from minutes to seconds
run.sh treats DURATION_MIN as minutes, but passes it directly to genai-bench --max-time-per-run, which is defined in seconds (the upstream manage_run_time loop increments once per second, and this repo’s own benchmark harness documents _MAX_TIME_PER_RUN = 10 # seconds in e2e_test/benchmarks/test_nightly_perf.py). With the current default DURATION_MIN=10, each cell runs for ~10 seconds instead of 10 minutes, so nightly results will be statistically noisy and can mislead HTTP vs gRPC comparisons.
Actionable comments posted: 1
♻️ Duplicate comments (1)
.github/workflows/nightly-mlx-bench.yml (1)
25-34: ⚠️ Potential issue | 🟡 Minor

Remove the temporary `pull_request` trigger before merging.

This block is explicitly marked temporary; keeping it in the final workflow expands CI scope/cost beyond the intended nightly + manual modes.

Suggested cleanup

```diff
-  # TEMPORARY: enable PR trigger so this workflow can be smoke-tested before
-  # merge. Path-filtered so unrelated PRs don't fire it, and the matrix is
-  # reduced via the env block below to ~12 minutes. REMOVE this block before
-  # merging — once the harness is validated, only schedule + workflow_dispatch
-  # should remain.
-  pull_request:
-    branches: [main]
-    paths:
-      - "benchmarks/mlx_grpc_vs_http/**"
-      - ".github/workflows/nightly-mlx-bench.yml"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/nightly-mlx-bench.yml around lines 25 - 34, Remove the temporary pull_request trigger block from the workflow so it only runs on the intended schedule and manual dispatch; specifically delete the entire pull_request: block (including the branches: [main] and paths: entries that reference "benchmarks/mlx_grpc_vs_http/**" and ".github/workflows/nightly-mlx-bench.yml") before merging so CI scope reverts to nightly + workflow_dispatch only.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/mlx_grpc_vs_http/aggregate.py`:
- Around line 49-52: The collect function currently calls results_dir.iterdir()
which raises if the directory doesn't exist; update collect (function name:
collect) to handle a missing results_dir by early-returning the empty out dict
or by wrapping the iterdir() call in a try/except FileNotFoundError (or checking
results_dir.exists() / results_dir.is_dir()) so the function emits the intended
empty-results summary instead of crashing; ensure you still return the same out
variable shape when the directory is absent.
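A minimal sketch of that guard; the `collect` signature and return shape here are assumed from the prompt above, not copied from the actual aggregator:

```python
from pathlib import Path

def collect(results_dir: Path) -> dict:
    """Gather experiment dirs; tolerate a results dir that was never created."""
    out: dict = {}
    # Early return preserves the empty-summary shape instead of letting
    # iterdir() raise FileNotFoundError when no phase produced output.
    if not results_dir.is_dir():
        return out
    for exp_dir in sorted(results_dir.iterdir()):
        if exp_dir.is_dir():
            out[exp_dir.name] = exp_dir  # stand-in for real per-cell parsing
    return out
```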
---
Duplicate comments:
In @.github/workflows/nightly-mlx-bench.yml:
- Around line 25-34: Remove the temporary pull_request trigger block from the
workflow so it only runs on the intended schedule and manual dispatch;
specifically delete the entire pull_request: block (including the branches:
[main] and paths: entries that reference "benchmarks/mlx_grpc_vs_http/**" and
".github/workflows/nightly-mlx-bench.yml") before merging so CI scope reverts to
nightly + workflow_dispatch only.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6e389c04-0c6b-402c-967e-dd4f8060b3e5
📒 Files selected for processing (3)
- .github/workflows/nightly-mlx-bench.yml
- benchmarks/mlx_grpc_vs_http/aggregate.py
- benchmarks/mlx_grpc_vs_http/run.sh
```bash
local container_url="${base_url/127.0.0.1/host.docker.internal}"
container_url="${container_url/localhost/host.docker.internal}"

# genai-bench's --max-time-per-run is in SECONDS (despite the name).
local duration_s=$(( DURATION_MIN * 60 ))

docker run --rm \
  --add-host=host.docker.internal:host-gateway \
```
🔴 Important: The servers started in Phase 1 and Phase 2 all bind to 127.0.0.1 (lines 141, 171, 181). With Colima, host.docker.internal resolves to the Mac host's IP on the virtual network (e.g. 192.168.106.1), not to 127.0.0.1. Since the servers only listen on the loopback interface, connections from the container to host.docker.internal:<port> will be refused.
Docker Desktop for Mac has a built-in proxy that forwards host.docker.internal traffic to the host's loopback, but Colima (Lima-based) does not — it relies on standard IP routing.
Fix: change the three server invocations to bind to 0.0.0.0:
```bash
# Line 141 (Phase 1 — mlx-lm)
mlx_lm.server --model "$MODEL" --host 0.0.0.0 --port "$HTTP_PORT" ...

# Line 171 (Phase 2 — gRPC servicer)
python3 -m smg_grpc_servicer.mlx.server --model "$MODEL" --host 0.0.0.0 --port "$GRPC_PORT" ...

# Line 181 (Phase 2 — SMG router)
"$SMG_BIN" launch --host 0.0.0.0 --port "$ROUTER_PORT" ...
```

This is safe on a CI runner. The `wait_for_port` and `wait_for_openai` helpers should also be updated to check 127.0.0.1 (or just localhost) since they're run from the host.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e8203fc57c
```bash
ROUTER_PORT="${ROUTER_PORT:-30000}"
SMG_BIN="${SMG_BIN:-target/release/smg}"

mkdir -p "$RESULTS_DIR"
```
Clear RESULTS_DIR before starting a new benchmark run
run.sh only creates $RESULTS_DIR and never removes prior contents, but the aggregator later walks every experiment subfolder under that directory. In practice this mixes old and new runs (for example, after a smoke run with a reduced matrix and then a full run), so SUMMARY.md can include stale cells that were not executed in the current benchmark and mislead HTTP vs gRPC comparisons.
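In run.sh the fix is a one-liner (`rm -rf "$RESULTS_DIR"` before the `mkdir -p`). The same reset expressed as a Python sketch, with the helper name invented here:

```python
import shutil
from pathlib import Path

def reset_results_dir(results_dir: Path) -> None:
    # Drop any stale experiment folders from earlier runs before the new
    # run starts, so the aggregator only sees cells produced by this run.
    shutil.rmtree(results_dir, ignore_errors=True)
    results_dir.mkdir(parents=True, exist_ok=True)
```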
```python
        return None
    # Each cell produces one summary JSON; sorted() makes the choice
    # deterministic if a partial rerun left a stale file.
    data = json.loads(json_files[0].read_text())
```
Read latest metrics file instead of first sorted JSON
parse_experiment currently loads json_files[0] after sorting, which selects the lexicographically earliest file in the folder. If a cell is rerun and multiple result JSON files remain in the same experiment directory, the summary can report stale metrics from an older run rather than the latest output, silently corrupting the final table.
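A sketch of mtime-based selection instead; the helper name is illustrative, and the real aggregator may prefer to key on filename conventions:

```python
from pathlib import Path
from typing import Optional

def latest_summary(exp_dir: Path) -> Optional[Path]:
    # Pick the most recently written JSON so a rerun's metrics win over
    # a stale file left behind by an earlier partial attempt.
    json_files = list(exp_dir.glob("*.json"))
    if not json_files:
        return None
    return max(json_files, key=lambda p: p.stat().st_mtime)
```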
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/mlx_grpc_vs_http/run.sh`:
- Around line 157-160: The fixed "sleep 5" after starting the gRPC servicer is
brittle; replace it with a deterministic readiness gate instead of sleep by
adding/using a wait-for-ready helper (e.g., a new wait_for_generator_ready or
reuse wait_for_port) that polls a concrete readiness signal (health HTTP/gRPC
health endpoint, a socket response on "$GRPC_PORT", or a PID-ready file written
by the gen thread) with a timeout and backoff, then call that helper in place of
"sleep 5" so the script proceeds only when the generator is actually warmed up
(reference symbols: wait_for_port, GRPC_PORT, GRPC_PID, and the current "sleep
5" location).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: f7c9e65f-8711-485f-bcad-9de3dcf6ebe2
📒 Files selected for processing (3)
- .github/workflows/nightly-mlx-bench.yml
- benchmarks/mlx_grpc_vs_http/README.md
- benchmarks/mlx_grpc_vs_http/run.sh
♻️ Duplicate comments (1)
benchmarks/mlx_grpc_vs_http/run.sh (1)
159-162: ⚠️ Potential issue | 🟠 Major

Replace fixed warmup sleep with a deterministic readiness gate.
A hardcoded delay is still brittle for CI variability and can skew/flake the first measured runs.
🔧 Proposed hardening
```diff
+wait_for_log_line() {
+  local file="$1"
+  local pattern="$2"
+  local timeout="${3:-300}"
+  local start=$SECONDS
+  while ! grep -q "$pattern" "$file" 2>/dev/null; do
+    if (( SECONDS - start > timeout )); then
+      log "Timeout waiting for '$pattern' in $file"
+      return 1
+    fi
+    sleep 1
+  done
+}
+
 wait_for_port "$GRPC_PORT" 300
+wait_for_log_line "$RESULTS_DIR/mlx-grpc.log" "gRPC server listening on" 300
 log "MLX gRPC servicer up on :$GRPC_PORT (pid=$GRPC_PID)"
-# Give the gen thread a beat to finish warmup before starting the router.
-sleep 5
```
Verify each finding against the current code and only fix it if needed. In `@benchmarks/mlx_grpc_vs_http/run.sh` around lines 159 - 162, Replace the brittle fixed "sleep 5" warmup with a deterministic readiness gate: implement/use a readiness check that polls the generator's health (e.g., a gen-ready HTTP/IPC/UDP/flag endpoint or log marker) and block until it reports ready before starting the router; in this script replace the sleep 5 with a call to that gate (e.g., wait_for_generation_ready or similar) invoked after wait_for_port "$GRPC_PORT" and before starting the router so the script relies on an explicit readiness signal tied to the generator process (identified by GRPC_PID/GRPC_PORT) rather than a fixed timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@benchmarks/mlx_grpc_vs_http/run.sh`:
- Around line 159-162: Replace the brittle fixed "sleep 5" warmup with a
deterministic readiness gate: implement/use a readiness check that polls the
generator's health (e.g., a gen-ready HTTP/IPC/UDP/flag endpoint or log marker)
and block until it reports ready before starting the router; in this script
replace the sleep 5 with a call to that gate (e.g., wait_for_generation_ready or
similar) invoked after wait_for_port "$GRPC_PORT" and before starting the router
so the script relies on an explicit readiness signal tied to the generator
process (identified by GRPC_PID/GRPC_PORT) rather than a fixed timeout.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 94b6b10a-473f-4fcc-bd6e-f339ba3ff658
📒 Files selected for processing (1)
benchmarks/mlx_grpc_vs_http/run.sh
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b4820432bc
```bash
  local base="$1"
  local timeout="${2:-180}"
  local start=$SECONDS
  while ! curl -fsS "$base/v1/models" >/dev/null 2>&1; do
```
Add request timeouts to readiness curl probe
The readiness loop is meant to stop after timeout seconds, but curl -fsS has no --connect-timeout/--max-time, so if the endpoint accepts a socket and then stalls, curl can block indefinitely and the timeout branch is never evaluated. In that startup-failure mode the benchmark hangs until the job-level timeout instead of failing fast.
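The shell fix is adding per-request bounds to the probe, e.g. `curl -fsS --connect-timeout 2 --max-time 5 ...`. The same bounded-probe idea as a Python sketch; the helper name and timeout values are illustrative:

```python
import time
import urllib.error
import urllib.request

def wait_for_openai(base: str, timeout: float = 180.0) -> bool:
    # Each probe carries its own deadline; without it, a server that
    # accepts the socket but never replies would block forever and the
    # outer loop's timeout branch would never be evaluated.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base}/v1/models", timeout=5):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)
    return False
```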
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5aed138a9d
```yaml
      - name: Install genai-bench
        # Native pip install instead of the docker image — avoids the
        # Colima-on-macos-latest VM hop, which is unstable in CI.
        run: pip install "genai-bench>=0.0.4"
```
Pin genai-bench version for reproducible nightly runs
Installing genai-bench with >=0.0.4 makes nightly results non-reproducible over time because the benchmark harness can silently change behavior, defaults, or metric calculations as newer releases are published. This job is meant to compare performance trends, so tool-version drift can corrupt those comparisons even when router code is unchanged; use an exact pinned version (or lock file) to keep runs comparable.
```python
scenarios = sorted({k[1] for k in results})
concurrencies = sorted({k[2] for k in results})
```
Preserve full benchmark matrix when cells fail
The table header and row dimensions are inferred only from successful result keys, so any scenario/concurrency where both HTTP and gRPC produced no JSON is dropped entirely instead of being shown as missing. In failure-heavy runs (for example, highest concurrency crashing both paths), this hides failed cells and makes the summary look like a smaller intended matrix, which can mislead comparisons.
Intentional — committed in 4580bbc with rationale: the aggregator now infers the rendered matrix from observed data so a CI run with PHASES=mlx grpc produces a two-row table, not a three-row table with a column of —. The same logic applies cell-by-cell. Failed cells are not silent: run.sh writes a .failed marker per failed cell and CI uploads bench-results/ as an artifact, so the failure record is preserved even when the row collapses. The header that previously showed every requested concurrency could imply intent the run did not satisfy — preferring "show what ran" matches how the rest of the SMG nightly suite renders its summary tables. Rendering attempted-but-failed cells from .failed markers is a reasonable follow-up if/when CI starts swallowing whole concurrencies regularly, but the current data does not show that pattern.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fdd9fe6559
```python
if content:
    if first_token_t is None:
        first_token_t = time.monotonic()
    tokens += 1
```
Measure generated tokens instead of SSE chunks
This increments tokens once per streamed chunk, but OpenAI-style SSE chunks are transport frames, not token boundaries, so chunk coalescing/splitting differences between mlx_lm.server and the router path will skew total_output_tokens, output_throughput, and tpot_ms even when model speed is unchanged. Because this benchmark’s purpose is comparing those paths, the current counting method can produce misleading regressions/improvements unrelated to actual token generation.
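One way to decouple the count from chunk framing is to accumulate the streamed text and tokenize it once at end of stream (a sketch, not this PR's code; `tokenize` should be the serving model's tokenizer, e.g. one loaded via `transformers.AutoTokenizer`, which is an assumption here — the whitespace split below is only a stand-in so the sketch is self-contained):

```python
def count_output_tokens(chunk_texts, tokenize=None):
    """Count generated tokens from the accumulated text, not per SSE chunk.

    Chunk coalescing/splitting between transports then cannot change the
    count; only the actual generated text does.
    """
    # Stand-in tokenizer (assumption): replace with the model's tokenizer.
    tokenize = tokenize or (lambda text: text.split())
    return len(tokenize("".join(chunk_texts)))
```

Where the server can report usage in the final streamed chunk (OpenAI-style `stream_options: {"include_usage": true}`, if supported), preferring that reported `completion_tokens` count avoids client-side tokenization entirely.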
```python
ds = load_dataset(
    "anon8231489123/ShareGPT_Vicuna_unfiltered",
    data_files="ShareGPT_V3_unfiltered_cleaned_split.json",
    split="train",
```
Pin dataset revisions for stable nightly comparisons
The benchmark loads Hugging Face datasets without a revision, so each run implicitly tracks the latest dataset state; upstream edits (new/removed rows or content changes) will change prompt-length distributions and can shift latency/throughput trends even when router code is unchanged. This undermines the nightly benchmark’s comparability over time; pin immutable dataset revisions (for both chat and agent loaders) to keep trend lines meaningful.
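`datasets.load_dataset` accepts a `revision` argument for exactly this; a small sketch of building the pinned call's keyword arguments (the helper is illustrative, and any SHA passed in must come from the dataset repo on the Hub — the values in the test are placeholders, not real commits):

```python
def pinned_dataset_kwargs(repo, data_files, revision):
    """Kwargs for datasets.load_dataset pinned to an immutable Hub commit.

    Pass a full commit SHA as `revision`; a branch name like "main" still
    drifts as the dataset is edited upstream.
    """
    return {
        "path": repo,
        "data_files": data_files,
        "split": "train",
        "revision": revision,  # freezes prompt-length distributions across runs
    }
```

Usage would then be `ds = load_dataset(**pinned_dataset_kwargs(...))` for both the chat and agent loaders.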
```yaml
          # downloading artifacts.
          cat "$RESULTS_DIR/SUMMARY.md" >> "$GITHUB_STEP_SUMMARY"

      - name: Upload raw genai-bench output
```
🟡 Nit: Stale step name — genai-bench was replaced by bench_client.py in this push.
Suggested change:

```diff
-      - name: Upload raw genai-bench output
+      - name: Upload raw benchmark output
```
```yaml
    # 4 concurrencies × 2 scenarios × 2 setups × 10 min ≈ 2.7 h, plus build
    # and model download. Cap at 5h45m for safety.
```
🟡 Nit: This comment still references the old "10 min per cell" duration model. The new approach uses prompt-count-based cells (README says ~30 min for a full run), so the 345-min timeout is ~10x the expected runtime. Consider updating the comment and reducing the timeout.
Suggested change:

```diff
-    # 4 concurrencies × 2 scenarios × 2 setups × 10 min ≈ 2.7 h, plus build
-    # and model download. Cap at 5h45m for safety.
+    # Custom bench client uses prompt-count-based cells (not time-based).
+    # Full run ~30 min; cap generously for model download + build overhead.
```
```
"request_throughput": 3.24,
"output_throughput": 832.1,
"total_input_tokens_est": 312000,
"total_output_tokens": 31750,
"ttft_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...},
"tpot_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...}
}
```
🟡 Nit: The example JSON includes total_input_tokens_est but _summarize() doesn't produce that field. Either add it to the output or remove it from the docstring to avoid confusing consumers.
Suggested change:

```diff
-"request_throughput": 3.24,
-"output_throughput": 832.1,
-"total_input_tokens_est": 312000,
-"total_output_tokens": 31750,
-"ttft_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...},
-"tpot_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...}
-}
+{
+"scenario": "agent",
+"concurrency": 16,
+"num_prompts_requested": 128,
+"completed_requests": 124,
+"error_count": 4,
+"duration_s": 38.2,
+"request_throughput": 3.24,
+"output_throughput": 832.1,
+"total_output_tokens": 31750,
+"ttft_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...},
+"tpot_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...}
+}
```
💡 Codex Review (reviewed commit 406e0a4f55)
```bash
then
    log "[$exp_name] FAILED — recording marker, continuing"
    date -u +"%Y-%m-%dT%H:%M:%SZ" >"$exp_dir/.failed"
    return 0
```
Propagate per-cell benchmark failures to step status
This branch converts any non-zero exit from bench_client.py into success by returning 0, so a phase can finish green even when every cell fails (e.g., invalid workflow_dispatch input, dataset/API breakage, or systemic server errors). In that case the nightly job reports success with empty/placeholder results, which hides that the benchmark is actually broken and can silently drop regression signal.
```python
"p95": xs[min(int(len(xs) * 0.95), len(xs) - 1)],
"p99": xs[min(int(len(xs) * 0.99), len(xs) - 1)],
```
Use correct percentile indexing for latency stats
The percentile calculation uses int(len * q) as a zero-based index, which biases p95/p99 upward for the small sample sizes used here (for example, with 30 samples p95 resolves to the 29th element, effectively ~96.7th percentile). That systematically distorts tail latency comparisons between HTTP and gRPC cells and can mislead nightly trend analysis.
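A linearly interpolated quantile (the same shape as NumPy's default `linear` method; a sketch, not the aggregator's actual code) removes the upward bias on small samples:

```python
def percentile(xs, q):
    """Linear-interpolated percentile over a sorted copy of the sample."""
    if not xs:
        raise ValueError("empty sample")
    ys = sorted(xs)
    pos = q * (len(ys) - 1)          # fractional rank into the sorted sample
    lo = int(pos)
    hi = min(lo + 1, len(ys) - 1)
    frac = pos - lo
    # Interpolate between the two neighbouring order statistics.
    return ys[lo] * (1 - frac) + ys[hi] * frac
```

With 30 samples this puts p95 between the 28th and 29th order statistics instead of pinning it to the top of the sample, so HTTP and gRPC tails are compared on the same footing.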
Force-pushed: 406e0a4 → 8abd73a
💡 Codex Review (reviewed commit 8abd73a1b2)
```python
"model": model,
"messages": [{"role": "user", "content": prompt}],
"max_tokens": max_tokens,
"temperature": 1.0,
```
Use deterministic decoding for benchmark traffic
The benchmark requests are sent with temperature set to 1.0 and no fixed seed, so each run samples different continuations and output lengths even when code and prompts are unchanged. Since this harness computes output_throughput and TPOT from generated-token counts, that sampling noise can shift nightly metrics independently of HTTP vs gRPC transport changes and create false regressions/improvements. For comparability, send deterministic generation settings for benchmark runs (for example temperature=0 and a fixed seed when supported).
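A deterministic request body might look like the sketch below (assumption: the servers honor OpenAI-style `temperature` and `seed` fields; `seed` support varies by backend, but `temperature=0` alone makes decoding greedy and removes sampling noise):

```python
def bench_payload(model, prompt, max_tokens, seed=1234):
    """Request body for benchmark traffic with sampling noise removed."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decode: identical continuations per run
        "seed": seed,        # honored by servers that support it; ignored otherwise
        "stream": True,
    }
```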
💡 Codex Review (reviewed commit a6f1afee4b)
```python
result = folder / "result.json"
if not result.exists():
    return None
```
Ignore failed experiment folders when building summary
run.sh records a .failed marker for any non-zero bench cell (benchmarks/mlx_grpc_vs_http/run.sh lines 129-132), but parse_experiment only checks whether result.json exists and then loads it. That means failed cells can still be included in SUMMARY.md when result.json is stale or partially written, which can silently skew HTTP vs gRPC comparisons instead of rendering the cell as missing as intended.
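A guard for the marker could sit at the top of `parse_experiment` (a sketch against the folder layout described above; it assumes run.sh's `.failed` marker filename and a `result.json` per cell):

```python
import json
from pathlib import Path


def parse_experiment(folder: Path):
    """Return the cell's metrics, or None if it failed or produced no JSON."""
    if (folder / ".failed").exists():
        return None  # run.sh marked the cell failed; render it as missing
    result = folder / "result.json"
    if not result.exists():
        return None
    return json.loads(result.read_text())
```

The marker check runs first, so a stale or partially written `result.json` in a failed cell can no longer leak into SUMMARY.md.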
Force-pushed: 2d9cfc1 → 6570420
💡 Codex Review (reviewed commit 65704207bb)
```yaml
        run: pip install -e "./grpc_servicer[mlx]"

      - name: Install mlx-lm CLI (for the direct-HTTP baseline)
        run: pip install "mlx-lm>=0.22.0"
```
Pin mlx-lm version for reproducible benchmark trends
This workflow installs mlx-lm as >=0.22.0, so each nightly run can pick up a different upstream runtime/server implementation. Because this job is used to compare performance over time, that version drift can change latency/throughput independently of any SMG changes and produce misleading regressions or improvements; use an exact version (or lockfile) for stable baselines.
Resolved in f7f465e: pinned to mlx-lm==0.31.3. Each nightly will now use the same mlx-lm runtime, so trend changes in the summary diffs reflect SMG-side changes instead of upstream mlx-lm release timing. Future bumps happen in their own commits so breaks are explicit.
```bash
# Python deps in the main env
pip install -e ./crates/grpc_client/python
pip install -e "./grpc_servicer[mlx]"
pip install "mlx-lm>=0.22.0" genai-bench
```
There was a problem hiding this comment.
Include oci_openai in documented local install command
The local setup instructions install only genai-bench, but this same commit’s workflow notes that genai-bench eagerly imports oci_openai and can crash on a clean install without it. As written, developers following the README can fail before the benchmark even starts, so the documented local workflow is not reliably runnable; add oci_openai (or pin a version where it is guaranteed present).
Resolved in f7f465e: README's local install line now matches the CI workflow — pip install "mlx-lm==0.31.3" genai-bench oci_openai, with a short note explaining why oci_openai is needed (genai-bench eagerly imports it; pip doesn't resolve it on a fresh wheel install).
```diff
@@ -0,0 +1,245 @@
+name: Nightly MLX Three-Way Benchmark
```
🟡 Nit: The top-level workflow name still says "Three-Way" but the job name (line 51) was updated to "two-way" in this push. They should match now that vllm is dropped from the CI default.
Suggested change:

```diff
-name: Nightly MLX Three-Way Benchmark
+name: Nightly MLX Two-Way Benchmark
```
Resolved in f7f465e: workflow name: is now Nightly MLX Two-Way Benchmark matching the job name.
Three issues flagged by bot review on commit 6570420 / 57e8888:

1. Workflow name said "Nightly MLX Three-Way Benchmark" but the job name was already "MLX two-way (mlx-lm vs SMG+gRPC; vllm-metal local-only)" since vllm was dropped from the CI default. The two were inconsistent; renamed the top-level workflow to "Nightly MLX Two-Way Benchmark" so the GitHub Actions UI tells the truth at every level.

2. `pip install "mlx-lm>=0.22.0"` let each nightly run pull whatever the latest mlx-lm release was at install time. For a benchmark meant to track SMG-side performance trends over time, that drift is noise — a faster decode loop in mlx-lm 0.31.4 would look like an SMG improvement. Pinned to exact version `mlx-lm==0.31.3` (what the failing run actually resolved to and what works with the gemma-3-4b-it-qat-4bit checkpoint). Future bumps now happen in their own commits, making trend breaks explicit in the bench summary diffs rather than implicit in HF/PyPI release timing.

3. Local README install instructions said `pip install "mlx-lm>=0.22.0" genai-bench` but the workflow knows `oci_openai` is a transitive dep that genai-bench imports eagerly — without it, `genai-bench --help` crashes on a clean wheel install. The CI workflow already pins it explicitly; the README didn't, so a developer following local instructions would fail before the bench even started. Updated the README to include `oci_openai` and the same pinned mlx-lm version as CI, with a brief note explaining why both are needed.

The temporary `pull_request` trigger flagged by CodeRabbit (#1399 review on b068350) is deliberately kept until just before merge — we need PR-trigger smoke runs while the harness is still being validated. Will be removed in the final pre-merge commit.

Signed-off-by: key4ng <[email protected]>
💡 Codex Review (reviewed commit 1a267fc039)
```yaml
  pull_request:
    branches: [main]
    paths:
      - "benchmarks/mlx_grpc_vs_http/**"
      - ".github/workflows/nightly-mlx-bench.yml"
```
Remove temporary PR trigger from nightly benchmark workflow
This workflow is documented as nightly/manual, and the in-file note says the pull_request trigger is temporary and should be removed before merge, but it is still enabled here. Keeping it will run this long benchmark on every matching PR update, which can add hours of CI latency and consume runner capacity/costs for non-nightly validation paths. If this check is required, it can also slow or block normal PR iteration unexpectedly.
Acknowledged but keeping for now. The pull_request trigger is path-filtered (only fires on benchmarks/mlx_grpc_vs_http/** or this workflow), the matrix is reduced under pull_request (concurrencies 1 4, duration 4 min), and the bench harness itself is what is being iterated on in this PR — without the trigger, every code change requires a manual workflow_dispatch, which has been the bottleneck during validation. The TEMPORARY note in the YAML is the standing commitment to remove the trigger before merging to main; that removal will be the final commit on this branch.
💡 Codex Review (reviewed commit f7f465e837)
```python
def collect(results_dir: Path) -> dict[tuple[str, str, int], dict[str, Any]]:
    out: dict[tuple[str, str, int], dict[str, Any]] = {}
    for sub in sorted(results_dir.iterdir()):
```
Guard missing results directory before iterating
collect() unconditionally calls results_dir.iterdir(), which raises FileNotFoundError when the benchmark directory was never created. In this workflow the aggregate step is if: always(), so early setup failures (or a run where all benchmark phases are skipped) will crash aggregation instead of emitting a graceful "no results" summary and continuing artifact/log upload steps.
Fixed in 81e97b7. Added if not results_dir.is_dir(): return out at the top of collect(). Aggregate runs under if: always(), so this is the safe path when no phase ever created the dir (setup failure, empty PHASES).
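The fix reduces to an early return before the directory walk (a sketch; the per-folder parsing is elided):

```python
from pathlib import Path
from typing import Any


def collect(results_dir: Path) -> dict[tuple[str, str, int], dict[str, Any]]:
    out: dict[tuple[str, str, int], dict[str, Any]] = {}
    if not results_dir.is_dir():
        return out  # no phase ever created the dir; emit an empty summary
    for sub in sorted(results_dir.iterdir()):
        ...  # parse each experiment folder as before
    return out
```

With the guard in place, the `if: always()` aggregate step writes a "no results" summary and the artifact/log upload steps still run.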
💡 Codex Review (reviewed commit 4580bbcf35)
```yaml
  RESULTS_DIR: bench-results
  # Match the path that vllm-metal's install.sh uses when run via
  # curl | bash (it hard-codes $HOME/.venv-vllm-metal in that branch).
  VLLM_VENV: /Users/runner/.venv-vllm-metal
```
Derive VLLM_VENV from runner home directory
Avoid hard-coding VLLM_VENV to /Users/runner/.venv-vllm-metal: the vllm-metal installer writes into ~/.venv-vllm-metal, so on self-hosted macOS runners where $HOME is not /Users/runner, the workflow looks in the wrong path, reports vllm as missing, and silently skips the vLLM phase even when phases includes vllm. This makes the advertised three-way self-hosted run degrade to two-way without a hard failure.
Fixed in 81e97b7 by removing the VLLM_VENV env entirely. run.sh already defaults to ${VLLM_VENV:-$HOME/.venv-vllm-metal} (line 43), and the vllm-metal installer writes there too — letting the default resolve at shell-time gets the correct path on both GitHub-hosted (/Users/runner/...) and self-hosted runners.
Compares two paths into the same MLX inference engine:

1. mlx-lm.server (direct HTTP/OpenAI)
2. SMG router + SMG MLX gRPC servicer

Both end at the same BatchGenerator; the difference is the transport and the work the router does (Rust tokenization, chat templating, JSON↔proto conversion).

Files:

- benchmarks/mlx_grpc_vs_http/run.sh — bash orchestrator. Spins up mlx-lm HTTP, runs genai-bench across the matrix, tears down, then starts SMG MLX servicer + SMG router and runs the same matrix against http://router.
- benchmarks/mlx_grpc_vs_http/aggregate.py — parses genai-bench JSON output (aggregated_metrics.stats.*) and emits a side-by-side markdown table.
- benchmarks/mlx_grpc_vs_http/README.md — local-run instructions and what the benchmark measures.
- .github/workflows/nightly-mlx-bench.yml — runs on macos-latest (Apple Silicon) on a nightly schedule and via workflow_dispatch with overridable inputs (model, concurrencies, scenarios, duration). Mirrors the SUMMARY.md to GITHUB_STEP_SUMMARY so the table is visible without downloading artifacts.

Defaults match the user's request:

- Model: mlx-community/gemma-3-4b-it-qat-4bit (~3 GB, fits 16 GB Mac)
- Concurrencies: 1, 4, 16, 64
- Scenarios: D(100,256) and D(2000,128)
- 10 min/cell ≈ 2.7 hours total per run

Reuses the genai-bench docker image already used by the existing nightly-benchmark workflow (same patterns from e2e_test/benchmarks/conftest.py).

Signed-off-by: key4ng <[email protected]>
… test PR pushes now run a reduced matrix (2 concurrencies × 1 scenario × 2 min/cell ≈ 12 min) to validate the orchestrator + aggregator end to end before merge. Path-filtered so only changes to the benchmark files or workflow itself trigger it.

REMOVE the pull_request: block before merging — production triggers should only be schedule + workflow_dispatch.

Signed-off-by: key4ng <[email protected]>
Critical fixes:

- macos-latest runners don't ship Docker. Install Colima in the workflow so docker pull/run actually work. (claude P1, codex P1, coderabbit major).
- --max-time-per-run is in SECONDS, not minutes. Multiply DURATION_MIN by 60 in run.sh before passing it to genai-bench, otherwise DURATION_MIN=10 produces 10-second cells, not 10-minute cells. Confirmed against e2e_test/benchmarks/test_nightly_perf.py:30 (claude P1).
- --network host doesn't share the host network on Docker for Mac / Colima — containers run inside a Linux VM. Drop --network host and rewrite 127.0.0.1/localhost in the bench client's base URL to host.docker.internal, with --add-host=host.docker.internal:host-gateway to make the lookup work (claude P1, coderabbit major).

Minor fixes:

- PIDS array element 'removal' was using parameter substitution, which replaces the matching value with an empty string instead of dropping the element. Rebuild the array properly (claude nit, coderabbit nit).
- aggregate.py: sort json_files for deterministic selection if a partial rerun left stale files (coderabbit nit).
- aggregate.py: mkdir parent of --out before writing in case the user passes a fresh path (coderabbit nit).

Signed-off-by: key4ng <[email protected]>
Colima boot is unstable on macos-latest GitHub runners (intermittent 'error starting vm: exit status 1' from the lima VZ driver). genai-bench is a public pip package with the same CLI as the docker entrypoint, so 'genai-bench benchmark <args>' replaces 'docker run ... benchmark <args>' verbatim. Also drops the host.docker.internal URL rewriting (no container = no VM = 127.0.0.1 reaches the host directly), and the HF cache mount (genai-bench reads HF_HOME from the same env it inherits).

Workflow:

- Removed 'Install Docker (Colima)' step.
- Replaced 'Pull genai-bench image' with 'Install genai-bench' via pip.
- Verification step also calls 'genai-bench --help' to fail fast if the CLI didn't install cleanly.

run.sh is ~30 lines smaller, README updated.

Signed-off-by: key4ng <[email protected]>
genai-bench multiplies the input by 60 internally
(genai_bench/cli/cli.py:326: 'max_time_per_run *= 60'), so passing 120
makes each cell run for 7200 seconds = 2 hours, not 2 minutes.
The misleading comment in e2e_test/benchmarks/test_nightly_perf.py:30
('seconds per scenario×concurrency combo') is what threw me off in
the previous round of review fixes. Reading the actual genai-bench
source confirms minutes is the correct unit.
This was the cause of run #24978016312 sitting at 'in progress' for
74 minutes when it should have completed in ~10 minutes total.
Signed-off-by: key4ng <[email protected]>
Two fixes:
1. run.sh: '${new_pids[@]}: unbound variable' under set -u when phase 1
finished and the only tracked pid (mlx-lm) was filtered out. The
array reference produced 'new_pids[@]: unbound variable' → script
exited silently before phase 2 (gRPC) could run.
Replaced the rebuild loop with PIDS=() since mlx-lm is the only
server running during phase 1; resetting PIDS to empty is correct
and avoids the unbound-array gotcha entirely.
2. workflow: PR-trigger now runs the full matrix (4 conc × 2 scenarios
× 10 min × 2 phases = ~2.7 h) instead of the smoke matrix, so we
can collect real comparison numbers on this PR before merge.
Marked TEMPORARY — switch back to the smoke shape (or remove the
pull_request trigger) before merging.
Signed-off-by: key4ng <[email protected]>
…ilure Two independent fixes that together give us partial-but-useful results even when something falls over:

1. run.sh now accepts PHASE=http|grpc|all (default 'all' for local). Each phase is a self-contained function with its own server lifecycle, so the workflow can run them as two separate steps. Phase failure no longer leaks to the other phase.
2. run_bench wraps the genai-bench call in 'if !'. A failing cell (e.g. mlx-lm.server collapsing under c=64 with all 503s) writes a .failed marker to the experiment dir and the loop continues. aggregate.py already prints '—' for cells with no JSON, so partial results render cleanly.

Workflow:

- 'Run benchmark — Phase 1 (direct HTTP)' runs PHASE=http.
- 'Run benchmark — Phase 2 (SMG router + gRPC)' runs PHASE=grpc with if: always() so phase 2 still executes if phase 1 errored out.

This unblocks the case from run #25015745343 where phase 1 c=64 took mlx-lm.server down and the script exited before phase 2 ever started.

Signed-off-by: key4ng <[email protected]>
Per discussion on PR — 'vllm bench serve' isn't usable on macos
because vllm publishes Linux/CUDA wheels only. Replaces genai-bench
(which only supports synthetic D(in,out) length distributions) with a
small custom client that loads real prompt distributions:
- chat: anon8231489123/ShareGPT_Vicuna_unfiltered — typical chat
user prompts, exercises TTFT + decode throughput.
- agent: vdaita/edit_10k_char — real local coding-agent prompts with
~2.5k token code context. Models the canonical Mac local-AI
workload (Cursor/Continue/Cline editing files). Stresses
prefill + tokenization, where Rust router should win big.
bench_client.py:
- Async httpx + HF datasets, ~250 LOC.
- Output JSON shaped like 'vllm bench --save-result' so we can swap
to vllm bench upstream if/when mac support lands.
- Per-cell exit non-zero only when *every* request failed (server
overload), letting run.sh distinguish complete failure from partial.
run.sh:
- prompts_for(scenario, concurrency) scales prompt count so each
cell has a meaningful number of completed requests at every level
(50 → conc*4 for chat, 30 → conc*2 for agent — slower prefill).
- Bench cell call replaced with python3 bench_client.py.
aggregate.py:
- Reads result.json from each experiment dir (new shape).
- Renders per-scenario tables (chat, agent) with RPS, output tok/s,
TTFT p50/p95, TPOT p50, and errors/total.
Workflow:
- pip install httpx[http2] + datasets (replaces genai-bench).
- workflow_dispatch inputs updated for new scenario names.
- Default scenarios = 'chat agent', no DURATION_MIN (driven by
num_prompts now).
Signed-off-by: key4ng <[email protected]>
Signed-off-by: key4ng <[email protected]>
cancel-in-progress: false meant new PR pushes queued behind the running one, and GitHub auto-cancelled stale queued runs when even newer pushes came in — so a series of fast pushes left the actual latest code never running.

Switch to cancel-in-progress: true ONLY for PR events. Scheduled nightlies and manual workflow_dispatch runs still queue normally (don't kill in-flight nightly runs).

Signed-off-by: key4ng <[email protected]>
The custom bench_client.py + vllm bench experiments produced unstable TTFT/TPOT numbers for mlx-lm.server (vllm bench saw all SSE chunks coalesced in one network read on localhost, so per-token ITL was always ~0). Switching to genai-bench — the same harness the rest of the SMG nightly suite uses — gives a clean schema with stats.ttft / stats.tpot / mean_output_throughput_tokens_per_s, and matches what e2e_test/benchmarks already reports.

The same change adds vllm-metal as a third backend so the nightly publishes a real 3-way comparison: mlx-lm direct HTTP vs SMG router + MLX gRPC servicer vs vllm-project/vllm-metal `vllm serve`.

Changes:

- run.sh: unified 3-phase orchestrator (mlx | grpc | vllm), each opt-in via PHASES env. genai-bench replaces the custom client. vllm phase auto-skips if VLLM_VENV is missing.
- aggregate.py: parses genai-bench's aggregated_metrics.stats schema and emits one comparison table per scenario across all three backends.
- nightly-mlx-bench.yml: best-effort vllm-metal install via the project's install.sh into a workspace-local venv; three independent phase steps with `if: always()` so a crash in one backend doesn't block the others.
- README.md: documents the new 3-way layout, scenarios, and tunables.
- bench_client.py / aggregate_local.py / prepare_datasets.py / local_data removed — replaced by genai-bench.

Scenarios are now genai-bench's deterministic shapes:

- chat = D(100,256) — short prompt + medium output
- agent = D(2500,256) — RAG / code-edit / Cursor-style local agent traffic

Signed-off-by: key4ng <[email protected]>
… budget Three fixes for the first 3-way CI run on PR #1399:

1. vllm-metal install path. The project's install.sh hard-codes `$HOME/.venv-vllm-metal` when run via `curl | bash` — VLLM_METAL_VENV wasn't honored, so we ended up with the venv at the upstream default while VLLM_VENV pointed at the workspace. The workflow now drops the manual venv creation and matches that hard-coded path; install.sh is self-sufficient (auto-installs uv, creates the venv, installs the wheel). VLLM_VENV is set to /Users/runner/.venv-vllm-metal.
2. vllm phase no longer aborts the script when `vllm serve` fails to come up. We now invoke `$VLLM_VENV/bin/vllm` directly (no `source activate` — that way background children inherit the right interpreter reliably), and wrap `wait_for_openai` in `if !` so a startup failure logs the tail of vllm-metal.log and returns 0 instead of tripping `set -e`. Phase 3 stays green even if vllm-metal can't boot.
3. PR runs use 4 min/cell (was 2). At 2 minutes, the `D(2500,256) × concurrency=4` cell finished zero requests on the macos-latest runner — genai-bench then crashed in its Excel post-processing because metrics were empty, leaving the cell as `—` in the summary table. 4 minutes gives at least one full prefill+decode cycle to complete.

Also documented in the README:

- mlx-lm.server's TPOT≈0 is a real localhost SSE buffering quirk in mlx-lm; the smg→mlx-grpc row in the same scenario shows a real TPOT despite hitting the same MLX BatchGenerator underneath. Worth highlighting rather than hiding — it's exactly the kind of streaming correctness the SMG router gives you.
- High-concurrency agent cells will render `—` if DURATION_MIN runs out before any 2.5k-token prefill completes.

Signed-off-by: key4ng <[email protected]>
… TPOT
The "mlx-lm TPOT is always 0" symptom traced earlier to the wrong cause.
Wire-level probes (raw socket recv + aiohttp.iter_any) BOTH show mlx-lm
streaming per-token SSE chunks at ~18 ms intervals — exactly the cadence
you'd expect for ~50 tok/s decode. mlx-lm and aiohttp are not the bug.
The actual bug is in genai-bench (genai_bench/user/openai_user.py:335):
for chunk in response.iter_lines(chunk_size=None):
`requests.iter_lines(chunk_size=None)` calls
`urllib3.HTTPResponse.read(amt=None)` under the hood, which reads to EOF
before yielding when the response has neither `Content-Length` nor
`Transfer-Encoding: chunked`. mlx-lm.server uses Python's
`BaseHTTPRequestHandler`, which defaults to HTTP/1.0 and emits neither
header for streaming responses → the bench client buffers the entire
generation and processes all per-token chunks at end-of-stream → TPOT
collapses to ~0 and TTFT inflates to e2e latency.
Local repro after applying the same patch:
| Metric      | chunk_size=None | chunk_size=1 |
|-------------|-----------------|--------------|
| TTFT mean   | 5,452 ms        | 486 ms       |
| TPOT mean   | ~0 ms           | 19.3 ms      |
| E2E latency | 5,452 ms        | 2,938 ms     |
19.3 ms TPOT matches the wire-level inter-chunk gap exactly.
The patch is benign for the smg→mlx-grpc and vllm-metal phases — both
emit HTTP/1.1 with `Transfer-Encoding: chunked`, where urllib3 yields
per chunked frame regardless of the chunk_size argument.
Workflow change: a `Patch genai-bench streaming chunk size` step runs
`sed` on the installed openai_user.py right after `pip install
genai-bench`, swapping `chunk_size=None` for `chunk_size=1`. Fails the
job loudly (rather than silently) if the pattern isn't found, so a
genai-bench upgrade that renames or removes the call site won't ship a
bench run with bogus mlx-lm metrics.
README updated to reflect the actual root cause (drops the wrong "SSE
buffering quirk in mlx-lm" wording from the previous fix attempt) and
documents the equivalent local one-liner for off-CI runs.
Signed-off-by: key4ng <[email protected]>
PR #1414 (drain-and-batch fix for the gRPC servicer's BatchGenerator rope crash) is now on main; rebasing this branch onto it lets the next CI run finally produce a clean three-way comparison without the c≥4 agent rope corruption that dominated previous runs. This commit makes two changes on top of that rebase: 1. CI MODEL default → mlx-community/gemma-3-1b-it-4bit (~0.6 GB). The previous 4 B QAT model was ~3 GB on disk; combined with vllm-metal's ~1.7 GB overhead and the macos-latest runner's ~5 GB Metal budget, the KV-cache budget went negative and the vllm phase OOM'd at startup every run — we never got vllm-metal numbers in CI. The 1 B variant fits comfortably and lets all three backends produce metrics. The bench is comparing the *same model* across three serving paths, so the relative router/streaming comparison (TTFT shape, RPS scaling, TPOT correctness) stays meaningful at 1 B; only the absolute throughput numbers shift. For production-shaped 4 B+ numbers, use local_three_way (run.sh) on a real Mac, which still defaults to gemma-3-4b-it-qat-4bit, or override MODEL via workflow_dispatch on a beefier runner (self-hosted, or macos-latest-xlarge with 14 GB). 2. Tightened the genai-bench chunk_size=1 patch comment in the workflow. Previous wording overstated the case as "HTTP/1.0 buffering bug" — the actual bug class is "no length-delimiting framing on the response", and HTTP/1.0 is just one way mlx-lm lands in that state (HTTP/1.1 with Connection: close and no Content-Length / chunked would reproduce too). Also clarified that chunk_size=1 doesn't read 1 byte at a time — urllib3's read(amt=N) returns up to N bytes from whatever's currently buffered, so it effectively means "yield per recv()". The patch is a no-op for the smg→mlx-grpc and vllm-metal phases (HTTP/1.1 + Transfer-Encoding: chunked), since urllib3 decodes chunk frames before the buffer the iterator reads from. 
README updated to document the CI vs local model split as a table with the rationale, so the next reader doesn't have to dig through git history to understand why CI numbers are at 1 B. Signed-off-by: key4ng <[email protected]>
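The read(amt) behaviour invoked in the chunk_size=1 explanation above can be illustrated with a toy stand-in. This is an illustration of the buffering contract only — not urllib3's actual code — and the simulated payload is made up:

```python
# Toy stand-in (NOT urllib3 itself) for the read(amt) contract:
# read(amt=N) returns up to N bytes from whatever is already buffered,
# and only touches the socket when the buffer is empty -- so amt=1
# does not mean one recv() per byte.
class BufferedBody:
    def __init__(self, recvs):
        self._recvs = list(recvs)  # each entry models one socket recv()
        self._buf = b""
        self.recv_count = 0

    def read(self, amt):
        if not self._buf and self._recvs:
            self._buf = self._recvs.pop(0)  # one simulated recv()
            self.recv_count += 1
        out, self._buf = self._buf[:amt], self._buf[amt:]
        return out

body = BufferedBody([b"data: tok\n\n"])
pieces = []
while True:
    b = body.read(1)
    if not b:
        break
    pieces.append(b)
# All eleven 1-byte reads were served by a single simulated recv().
```

The same drain-the-buffer behaviour is why chunk_size=1 surfaces each received token line promptly instead of forcing a syscall per byte.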
…le variant)

The 1B non-QAT model from the previous commit didn't fix the vllm phase: vllm-metal silently fell back to PyTorch CPU/fp32 because the non-QAT 4-bit MLX checkpoints aren't recognized by vllm-metal's Metal path. Once on the CPU path it tries to allocate `--gpu-memory-utilization 0.92 × 7 GiB = 6.44 GiB`, but only 3.65 GiB is free on the runner — startup aborts before the engine reports ready.

Two facts collected from the docs and the failing logs:

1. macos-latest exposes ~5 GB of Metal memory; lowering `--gpu-memory-utilization` doesn't help fit a 4B model (the flag gates how much Metal vllm-metal can use, not how much it needs). The 4B QAT already failed with a NEGATIVE KV-cache budget at the default fraction=0.9; lowering the fraction makes it worse.

2. vllm-metal's Metal path only picks up QAT-format MLX checkpoints (`*-qat-4bit`). Regular 4-bit checkpoints silently route through a PyTorch CPU/fp32 fallback that's documented as "experimental and unreliable in 2026".

The intersection of those two constraints is `gemma-3-1b-it-qat-4bit` (~0.7 GB on disk): small enough to leave ~2 GB of KV headroom inside the runner's 5 GB Metal budget, and QAT-format so vllm-metal picks the Metal path. This is the only CI-viable Metal-backed model on the macos-latest runner today. run.sh's local default (gemma-3-4b-it-qat-4bit) is unchanged — real Mac hardware with ≥16 GB unified RAM happily serves 4B QAT.

README's CI-vs-local model table updated with the QAT requirement, the gpu-memory-utilization explanation, and a more accurate set of "why CI uses 1B-QAT" reasons. workflow_dispatch input description likewise expanded so override callers see the QAT constraint up front.

Signed-off-by: key4ng <[email protected]>
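The allocation arithmetic quoted above can be checked directly. Figures are the ones in the commit message; the multiplication is only the requested allocation, not vllm-metal's full memory accounting:

```python
# Back-of-envelope check of the numbers in the failing log above.
gpu_memory_utilization = 0.92
total_gib = 7.0   # macos-latest runner RAM
free_gib = 3.65   # free at vllm startup, per the log

requested_gib = gpu_memory_utilization * total_gib  # ~6.44 GiB
shortfall_gib = requested_gib - free_gib            # startup aborts here
```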
Two changes after another vllm-metal CI failure on the gemma-3-1b-it-qat-4bit attempt: even with the QAT model, vllm-metal silently fell back to PyTorch CPU/fp32 (`device_config=cpu`) and aborted at startup trying to allocate `--gpu-memory-utilization 0.92 × 7 GiB = 6.44 GiB` in only 3.39 GiB of available RAM. Root cause is structural: the macos-latest GitHub runner is a VM that doesn't reliably expose Metal GPU access, so vllm-metal's Metal worker can't initialize. The PyTorch MPS fallback path is documented as "experimental and unreliable in 2026". This isn't fixable from our workflow.

1. Drop vllm from the CI default `PHASES`. New default is `mlx grpc` (two-way). The workflow_dispatch input default still includes vllm so self-hosted M-series runs can opt in. The `Install vllm-metal` step now has `if: contains(env.PHASES, 'vllm')` so it doesn't burn ~2 min on every CI run downloading wheels that won't get used. Local `run.sh` default is still `mlx grpc vllm` — real Apple Silicon hardware (≥16 GB unified RAM) runs the three-way fine.

2. Swap MODEL default to `mlx-community/gemma-4-e2b-it-4bit` (~3.2 GB). Gemma 4 E2B has day-0 mlx-lm support and is the smallest production-quality MLX checkpoint that fits the runner. Local `run.sh` default stays at gemma-3-4b-it-qat-4bit for hardware that can serve 4B comfortably.

README's defaults table now lists PHASES alongside MODEL, with a short explanation of why vllm-metal isn't in the CI matrix and how to opt back into the three-way (local hardware or workflow_dispatch override on a self-hosted M-series runner). The previous two-paragraph "why CI uses 1B-QAT" rationale is gone — it was specific to a constraint that no longer applies once vllm-metal is dropped.

The job name is updated to "MLX two-way (mlx-lm vs SMG+gRPC; vllm-metal local-only)" so the GitHub Actions UI tells the truth about what runs in CI vs what's deferred to local hardware.

Signed-off-by: key4ng <[email protected]>
…loadable via mlx_lm.server)
Latest CI run failed the mlx phase with
ValueError: Received 140 parameters not in model:
language_model.model.layers.X.self_attn.k_norm.weight, ...
mlx-lm 0.31.x added Gemma 4 support, but only for the *dense* 26B/31B
text variants. The Gemma 4 "Effective" e2b/e4b checkpoints are
multimodal-only — their safetensors carry text + vision + audio tower
weights under `language_model.*`, `vision_tower.*`, `audio_tower.*`
namespaces, and `mlx_lm.server`'s `load()` only instantiates text-only
CausalLM classes. There is no mlx-lm version (current or imminent)
that loads multimodal Gemma 4 checkpoints through the text-only server
path; bridging that to mlx-vlm's multimodal loader is a separate
workstream.
Switching CI MODEL back to gemma-3-4b-it-qat-4bit:
- Same as local run.sh default (apples-to-apples between CI and
local), so anyone reading both summaries can compare directly.
- Fits macos-latest's 7 GB RAM for the mlx and grpc phases (which
run sequentially, so each gets full RAM).
- Known-working with the mlx-lm pip pin we already use — no version
gambling.
- The merged drain-and-batch fix (PR #1414) eliminates the c≥4
rope crash that was the real bench-quality concern; with that
landed, gemma-3-4b QAT produces clean three-way numbers locally
and clean two-way numbers in CI.
The previous Gemma 4 swap commit (57e8888) was made on the
assumption that mlx-lm's day-0 Gemma 4 line covered all variants;
this is the corrective revert with a docstring explaining the
multimodal-vs-text-only gap so the next reader doesn't try the same.
Signed-off-by: key4ng <[email protected]>
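The namespace mismatch described above can be sketched in a few lines. This is illustrative only — not mlx-lm's actual `load()` code — and the example weight names are assumptions shaped after the error excerpt and the namespaces named in the commit message:

```python
def params_not_in_model(checkpoint_keys, model_keys):
    """Keys present in the safetensors but absent from the instantiated
    text-only model -- the set reported in the
    'Received N parameters not in model' error."""
    return sorted(set(checkpoint_keys) - set(model_keys))

# A multimodal Gemma 4 checkpoint nests text weights under
# language_model.* and adds vision/audio towers; a text-only CausalLM
# expects bare model.* names, so every multimodal key is "not in model".
checkpoint = [
    "language_model.model.layers.0.self_attn.k_norm.weight",
    "vision_tower.encoder.blocks.0.attn.qkv.weight",
    "audio_tower.conformer.0.ffn.weight",
]
text_only_model = ["model.layers.0.self_attn.k_norm.weight"]
extra = params_not_in_model(checkpoint, text_only_model)
```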
Three issues flagged by bot review on commit 6570420 / 57e8888:

1. Workflow name said "Nightly MLX Three-Way Benchmark" but the job name was already "MLX two-way (mlx-lm vs SMG+gRPC; vllm-metal local-only)" since vllm was dropped from the CI default. The two were inconsistent; renamed the top-level workflow to "Nightly MLX Two-Way Benchmark" so the GitHub Actions UI tells the truth at every level.

2. `pip install "mlx-lm>=0.22.0"` let each nightly run pull whatever the latest mlx-lm release was at install time. For a benchmark meant to track SMG-side performance trends over time, that drift is noise — a faster decode loop in mlx-lm 0.31.4 would look like an SMG improvement. Pinned to exact version `mlx-lm==0.31.3` (what the failing run actually resolved to and what works with the gemma-3-4b-it-qat-4bit checkpoint). Future bumps now happen in their own commits, making trend breaks explicit in the bench summary diffs rather than implicit in HF/PyPI release timing.

3. Local README install instructions said `pip install "mlx-lm>=0.22.0" genai-bench` but the workflow knows `oci_openai` is a transitive dep that genai-bench imports eagerly — without it, `genai-bench --help` crashes on a clean wheel install. The CI workflow already pins it explicitly; the README didn't, so a developer following local instructions would fail before the bench even started. Updated the README to include `oci_openai` and the same pinned mlx-lm version as CI, with a brief note explaining why both are needed.

The temporary `pull_request` trigger flagged by CodeRabbit (#1399 review on b068350) is deliberately kept until just before merge — we need PR-trigger smoke runs while the harness is still being validated. Will be removed in the final pre-merge commit.

Signed-off-by: key4ng <[email protected]>
…hes reality
Until this commit, the bench summary was hardcoded to render all
three backend rows ("mlx-lm.server", "smg → mlx-grpc", "vllm-metal")
with the heading "MLX Three-Way Benchmark", regardless of which
phases the workflow actually executed. With the CI default now
`PHASES=mlx grpc` (vllm-metal needs Metal GPU access the macos-
latest VM doesn't expose), the summary table looked like vllm-metal
had failed every cell — every vllm-metal row was a column of `—`,
and the document title still said "Three-Way".
Aggregator now infers which backends are present from the result
directories that exist:
- `LABEL_ORDER` keeps mlx → grpc → vllm display order; only labels
with at least one matching result dir are rendered.
- Title shifts: 1 backend = "Single-backend", 2 = "Two-Way",
3 = "Three-Way".
- The "Backends:" intro list also drops anything that didn't run, so
readers see exactly the backends present in the rows below.
Verified locally with synthetic data:
- Two-way (mlx + grpc): "MLX Two-Way Benchmark", two rows per cell,
no vllm-metal artifacts.
- Three-way (mlx + grpc + vllm): "MLX Three-Way Benchmark", three
rows per cell, behavior unchanged from before.
Signed-off-by: key4ng <[email protected]>
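The backend-inference logic described above can be sketched as follows. This is a hedged sketch, not the real aggregate.py: result dirs are assumed to be named `<prefix>_*` (the PR shows `http_*` for the mlx phase; `grpc_*` and `vllm_*` are guesses), and the labels come from the commit message:

```python
from pathlib import Path

LABEL_ORDER = [  # keeps mlx -> grpc -> vllm display order
    ("http", "mlx-lm.server"),
    ("grpc", "smg → mlx-grpc"),
    ("vllm", "vllm-metal"),
]
TITLES = {1: "Single-backend", 2: "Two-Way", 3: "Three-Way"}

def present_backends(results_dir):
    """Only labels with at least one matching result dir are rendered."""
    root = Path(results_dir)
    return [label for prefix, label in LABEL_ORDER
            if any(root.glob(f"{prefix}_*"))]

def title(present):
    """1 backend = Single-backend, 2 = Two-Way, 3 = Three-Way."""
    return f"MLX {TITLES[len(present)]} Benchmark"
```

With only `http_*` and `grpc_*` dirs on disk this yields two rows per cell under a "MLX Two-Way Benchmark" heading, matching the synthetic-data check above.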
> doesn't reliably expose Apple GPU passthrough — vllm-metal silently
> falls back to a PyTorch CPU/fp32 path that allocates ~6 GB and OOMs
> at startup on the 7 GB runner. To run the three-way comparison,
> either run `local_three_way` on real Apple Silicon, or override
🟡 Nit: `local_three_way` is referenced here and in two workflow comments (lines 76, 191 of nightly-mlx-bench.yml), but no script, Makefile target, or function by that name exists. The local three-way run is just `./benchmarks/mlx_grpc_vs_http/run.sh` with the default `PHASES=mlx grpc vllm`.
Suggested change:
- either run `local_three_way` on real Apple Silicon, or override
+ either run `./benchmarks/mlx_grpc_vs_http/run.sh` on real Apple Silicon, or override
Fixed in 81e97b7. README.md line 138 now points at `./benchmarks/mlx_grpc_vs_http/run.sh` (with default `PHASES=mlx grpc vllm`) to make the three-way default explicit. Same in nightly-mlx-bench.yml lines 76 and 191.
…VLLM_VENV portability, doc references

* run.sh: replace `sleep 5` between mlx-grpc port-open and SMG router start with `wait_for_log_line "gRPC server listening on"`. The port binds before warmup; the log line is the real ready signal.
* aggregate.py: guard `collect()` against a missing results dir so the `if: always()` aggregate step doesn't crash when no phase produced output (setup failure, empty PHASES).
* nightly-mlx-bench.yml: stop pinning `VLLM_VENV` to `/Users/runner/...`. vllm-metal's installer writes to `$HOME/.venv-vllm-metal` and run.sh defaults to the same; hard-coding the GitHub-hosted path silently broke the vllm phase on self-hosted runners with a different $HOME.
* README.md + workflow comments: replace `local_three_way` (which isn't a real script/target) with the actual `run.sh` invocation.

Signed-off-by: key4ng <[email protected]>
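The readiness pattern in the first bullet — poll the log for the real ready line instead of sleeping a fixed 5 s — looks roughly like this. run.sh implements it as a bash helper; this Python rendering is illustrative, and the timeout/poll values are arbitrary:

```python
import time

def wait_for_log_line(log_path, needle, timeout=60.0, poll=0.5):
    """Return True once `needle` appears in the log, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open(log_path, errors="replace") as f:
                if needle in f.read():
                    return True
        except FileNotFoundError:
            pass  # server may not have created the log file yet
        time.sleep(poll)
    return False
```

The point of keying on the log line rather than the open port is exactly the one the commit makes: the port binds before model warmup finishes, so the log line is the only trustworthy ready signal.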
Description
Problem
There's no way to measure how the SMG router + MLX gRPC servicer (#1099) compares against mlx-lm's own HTTP server in throughput / latency terms. We expect the router to win at high concurrency (Rust tokenization, no GIL contention, protobuf for streaming) and roughly tie at concurrency=1 — but expectations need data.
Solution
A nightly benchmark on `macos-latest` (Apple Silicon GitHub-hosted runners) that drives the same `genai-bench` matrix against both paths and emits a comparison table. Both paths land at the same mlx-lm `BatchGenerator`, so the diff is purely transport + tokenization location.

Changes
Bench harness (`benchmarks/mlx_grpc_vs_http/`)
- `run.sh` — bash orchestrator. Phase 1 starts `mlx_lm.server` and runs the matrix; Phase 2 starts the SMG MLX servicer + SMG router and runs the matrix. All tunables via env vars.
- `aggregate.py` — parses `genai-bench` output JSON (`aggregated_metrics.stats.*`) and emits a markdown comparison table grouped by traffic scenario.
- `README.md` — local-run instructions, what the benchmark measures, expected results.
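The `aggregated_metrics.stats.*` parsing that aggregate.py does can be sketched as a small reader. The JSON path comes from the description above; the metric and stat names in the example ("ttft", "mean") are assumptions, not confirmed genai-bench field names:

```python
import json

def load_stat(path, metric, stat="mean"):
    """Pull one statistic out of a genai-bench result file,
    navigating the aggregated_metrics.stats.* structure."""
    with open(path) as f:
        doc = json.load(f)
    return doc["aggregated_metrics"]["stats"][metric][stat]
```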
CI (`.github/workflows/nightly-mlx-bench.yml`)
- Nightly schedule (`17 9 * * *` UTC), and `workflow_dispatch` with overridable inputs (`model`, `concurrencies`, `scenarios`, `duration_min`).
- Runs on `macos-latest` (Apple Silicon).
- `cargo build --release --bin smg` (release profile — performance-critical, not the `ci` profile we use for correctness E2E).
- `ghcr.io/moirai-internal/genai-bench:0.0.4` — same image used by the existing `nightly-benchmark.yml` workflow.
- Writes the comparison to `$GITHUB_STEP_SUMMARY` so it is visible directly in the run UI without artifact download.

Default matrix
- Model: `mlx-community/gemma-3-4b-it-qat-4bit` (~3 GB)
- Scenarios: `D(100,256)` short prompts, `D(2000,128)` long-prompt RAG-style

Test Plan
In CI: trigger via Actions → Nightly MLX Benchmark → Run workflow with `duration_min=1` for a fast smoke run before relying on the full nightly schedule.

Checklist
- `cargo +nightly fmt` passes (no Rust changes)
- `cargo clippy --all-targets --all-features -- -D warnings` passes (no Rust changes)
- `bash -n run.sh` ok
- `ruff check` clean

Summary by CodeRabbit
New Features
Documentation