feat(mlx-bench): nightly benchmark — MLX direct HTTP vs Router+gRPC #1399

key4ng wants to merge 20 commits into
Conversation
📝 Walkthrough

Adds a nightly/manual macOS GitHub Actions workflow and a benchmark suite that compares MLX direct HTTP vs Router+gRPC: workflow config, a Bash orchestrator to run two benchmark phases, a Python aggregator to summarize results, and README documentation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CI as GitHub Actions
    participant Runner as run.sh Orchestrator
    participant Bench as genai-bench
    participant HTTPSrv as MLX HTTP Server
    participant GRPCSvc as MLX gRPC Servicer
    participant Router as SMG Router
    participant FS as Filesystem
    CI->>Runner: Start with MODEL, SCENARIOS, CONCURRENCIES, DURATION
    rect rgba(100, 150, 200, 0.5)
        Note over Runner,FS: Phase 1 — Direct HTTP Benchmark
        Runner->>HTTPSrv: Start mlx_lm.server (HTTP)
        Runner->>HTTPSrv: Poll /v1/models for readiness
        Runner->>Bench: Launch genai-bench per scenario×concurrency
        Bench->>HTTPSrv: Send requests
        HTTPSrv->>Bench: Return responses
        Bench->>FS: Write results to bench-results/http_*
        Runner->>HTTPSrv: Stop HTTP server
    end
    rect rgba(150, 200, 100, 0.5)
        Note over Runner,FS: Phase 2 — gRPC via Router Benchmark
        Runner->>GRPCSvc: Start MLX gRPC servicer
        Runner->>Router: Start SMG router (forward to gRPC)
        Runner->>Router: Poll /v1/models for readiness
        Runner->>Bench: Launch genai-bench per scenario×concurrency
        Bench->>Router: Send requests
        Router->>GRPCSvc: Forward via gRPC
        GRPCSvc->>Router: Return responses
        Router->>Bench: Return responses
        Bench->>FS: Write results to bench-results/grpc_*
        Runner->>Router: Stop router
        Runner->>GRPCSvc: Stop servicer
    end
    Runner->>FS: Aggregate bench-results into SUMMARY.md
    CI->>FS: Upload artifacts and summary
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
```yaml
jobs:
  bench:
    name: MLX HTTP vs Router+gRPC
    runs-on: macos-latest
```
🔴 Important: macos-latest GitHub-hosted runners do not have Docker pre-installed. The docker pull (line 89) and the docker run calls in run.sh will fail.
Options:

- Add a step to install Docker (e.g. via colima or the `docker/setup-docker-action`).
- Install `genai-bench` as a native binary/pip package instead of using the container image.
- Run on a self-hosted macOS runner that has Docker available.
Fixed in 1f2953e. Went with option 1 (Colima): `brew install colima docker && colima start`. Adds ~1-2 min to setup but keeps the existing genai-bench docker pattern intact, matching e2e_test/benchmarks/conftest.py.
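For reference, the Colima setup described above would land in the workflow as a step along these lines (the step name, resource flags, and the `docker info` sanity check are assumptions based on this reply, not the exact committed YAML):

```yaml
- name: Set up Docker via Colima
  run: |
    brew install colima docker
    colima start --cpu 2 --memory 4 --disk 20
    # Fail fast if the daemon is not reachable before the benchmark steps run
    docker info
```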
```bash
# ──────────────────────────────────────────────────────────────────────────
# Phase 2: SMG router + MLX gRPC servicer
# ──────────────────────────────────────────────────────────────────────────
```
🟡 Nit: This pattern-substitution on an array doesn't actually remove the element — it replaces the matching value with an empty string, leaving a ghost entry in PIDS.
```bash
PIDS=("${PIDS[@]/$MLX_HTTP_PID}")
```
Not harmful here because the cleanup trap guards with `kill -0`, but the idiomatic fix is:

```bash
local new_pids=()
for p in "${PIDS[@]}"; do
  [[ "$p" != "$MLX_HTTP_PID" ]] && new_pids+=("$p")
done
PIDS=("${new_pids[@]}")
```

Or since this is the only place you remove a PID, you could just skip the removal entirely — the trap already handles dead PIDs gracefully.
Same fix as the coderabbit thread — addressed in 1f2953e by rebuilding the array instead of substituting.
```bash
benchmark \
  --api-backend openai \
```
🔴 Important: Even if Docker is installed on the macOS runner, --network host does not work the same way on macOS as on Linux. On Docker for Mac (Docker Desktop, colima, etc.), the Docker engine runs inside a Linux VM, so --network host gives the container the VM's network — not the macOS host's. The --api-base URL pointing to 127.0.0.1 will fail to reach the mlx-lm/smg servers running on the Mac host.
On macOS you'd need to either:

- Replace `127.0.0.1` with `host.docker.internal` in the `--api-base` URL passed to genai-bench, or
- Drop `--network host` and use `-p` port publishing instead.
Since this script also needs to work on macOS for local dev (per the README), consider detecting the OS:
```bash
if [[ "$(uname)" == "Darwin" ]]; then
  BENCH_HOST="host.docker.internal"
  DOCKER_NET_ARGS=()
else
  BENCH_HOST="127.0.0.1"
  DOCKER_NET_ARGS=(--network host)
fi
```
Fixed in 1f2953e. Dropped `--network host`, added `--add-host=host.docker.internal:host-gateway`, and rewrote the base URL inside run_bench so the bench container reaches the macOS host's inference servers via `host.docker.internal` instead of trying to share the VM's network namespace.
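A sketch of what such a rewrite could look like (the variable name and port are hypothetical; the actual run_bench implementation isn't shown in this thread):

```shell
#!/usr/bin/env bash
# On Docker for Mac / Colima the engine runs in a Linux VM, so a container
# cannot reach the host via 127.0.0.1; host.docker.internal resolves to the
# host instead (paired with --add-host=host.docker.internal:host-gateway).
api_base="http://127.0.0.1:30000/v1"   # port is illustrative
if [[ "$(uname)" == "Darwin" ]]; then
  api_base="${api_base/127.0.0.1/host.docker.internal}"
fi
echo "$api_base"
```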
```bash
  --num-concurrency "$concurrency" \
  --traffic-scenario "$scenario" \
  --max-requests-per-run "$MAX_REQUESTS" \
  --max-time-per-run "$DURATION_MIN" \
```
🔴 Important: `--max-time-per-run` takes seconds, not minutes. See e2e_test/benchmarks/test_nightly_perf.py:30:

```python
_MAX_TIME_PER_RUN = 10  # seconds per scenario×concurrency combo
```

With the current code, DURATION_MIN=10 passes 10 to `--max-time-per-run`, meaning each cell runs for 10 seconds instead of the intended 10 minutes. The total runtime would be ~160 seconds, not ~2.7 hours, producing statistically meaningless results.
Fix: convert minutes to seconds:

```diff
-  --max-time-per-run "$DURATION_MIN" \
+  --max-time-per-run "$((DURATION_MIN * 60))" \
```
Critical catch — fixed in 1f2953e. run.sh now multiplies DURATION_MIN by 60 before passing it as --max-time-per-run. Verified against the comment in e2e_test/benchmarks/test_nightly_perf.py:30. Without this fix the entire 'nightly' bench would have been 16 × 10 seconds = 160 seconds total, not 2.7 hours.
Reverted the previous 'fix' in 16769d5 after empirically catching it on run #24978016312. genai-bench takes --max-time-per-run in minutes (it multiplies by 60 internally at genai_bench/cli/cli.py:326). The comment in e2e_test/benchmarks/test_nightly_perf.py:30 is misleading — the existing tests pass 10 and get 10 minutes, not 10 seconds. With DURATION_MIN=2 and the seconds conversion I added, each cell was scheduled for 7200 seconds = 2 hours, which matches the 74-minute c=1 cell observed in the cancelled run.
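The arithmetic behind the revert, spelled out (the numbers come from the comment above):

```shell
#!/usr/bin/env bash
DURATION_MIN=2
converted=$((DURATION_MIN * 60))  # the reverted fix passed 120 as --max-time-per-run
scheduled=$((converted * 60))     # genai-bench multiplies by 60 internally (minutes to seconds)
echo "$scheduled"                 # 7200 seconds = 2 hours per cell, not 2 minutes
```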
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 15a26171e8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```yaml
jobs:
  bench:
    name: MLX HTTP vs Router+gRPC
    runs-on: macos-latest
```
Move Docker benchmark job off macOS runner
This job is pinned to runs-on: macos-latest but the workflow later relies on docker pull and docker run (via run.sh) without provisioning a container runtime. On GitHub-hosted runners, macos-latest maps to the arm64 macOS image, and the published macos-15-arm64 image software manifest does not include Docker, so the benchmark fails before collecting any results. Please run this benchmark on a Linux runner or add explicit Docker runtime setup/startup for macOS.
Fixed in 1f2953e. Added a Colima install step to the workflow before any docker call: `brew install colima docker && colima start --cpu 2 --memory 4 --disk 20`. The macOS runner now has a working Docker daemon.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/nightly-mlx-bench.yml:
- Around line 29-31: The concurrency group currently uses a dynamic name
`nightly-mlx-bench-${{ github.event_name }}` which separates scheduled and
manual runs; change it to a fixed group name like `nightly-mlx-bench` so all
workflows share a single concurrency group and only one instance runs at a
time—update the `concurrency.group` value to the fixed string and keep
`cancel-in-progress` as desired to ensure proper single-instance behavior.
- Around line 88-92: The workflow currently pulls and runs a Docker image on
macos-latest without installing Docker and the benchmark script run.sh uses
"--network host" which does not behave the same on macOS; either switch the job
to run on ubuntu-latest or add a step to install/configure Docker on macOS
(e.g., use actions/setup-docker or install Docker Desktop CLI) before the "Pull
genai-bench image" and "Run benchmark (this is the long step)" steps, and modify
run.sh to avoid or conditionalize the "--network host" flag on macOS (detect
platform in run.sh and skip or use alternative networking) so the benchmark runs
reliably across runners.
In `@benchmarks/mlx_grpc_vs_http/aggregate.py`:
- Around line 136-139: The code currently writes output via
Path(args.out).write_text(table) without ensuring the parent directory exists;
update the logic in aggregate.py right before the write_text call (near where
collect and build_table are used) to create the parent directory if missing (use
Path(args.out).parent.mkdir(parents=True, exist_ok=True)) and then perform
Path(args.out).write_text(table) so standalone runs won't fail when the target
directory doesn't exist.
- Around line 27-45: parse_experiment currently picks json_files[0] arbitrarily
when multiple JSONs exist; make selection deterministic by sorting json_files
before choosing one (e.g., sort by modification time using Path.stat().st_mtime
or by filename) and then read the chosen file; update the logic in
parse_experiment (variable json_files and the selection of json_files[0]) to
pick the latest/appropriately ordered file so repeated or partial reruns yield a
predictable result.
In `@benchmarks/mlx_grpc_vs_http/run.sh`:
- Line 147: The current removal using PIDS=("${PIDS[@]/$MLX_HTTP_PID}") does
substring replacement and can corrupt entries; replace it by rebuilding the
array to exclude the exact PID: iterate over PIDS (use the PIDS variable) and
append only entries not equal to MLX_HTTP_PID into a NEW_PIDS array, then assign
PIDS=("${NEW_PIDS[@]}"); update the code in run.sh where PIDS and MLX_HTTP_PID
are handled (or alternatively rely on the existing cleanup trap that checks kill
-0), but do not use the substring-replacement form.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cf4091d4-f882-44f5-9a62-62fda3e76faf
📒 Files selected for processing (4)
- .github/workflows/nightly-mlx-bench.yml
- benchmarks/mlx_grpc_vs_http/README.md
- benchmarks/mlx_grpc_vs_http/aggregate.py
- benchmarks/mlx_grpc_vs_http/run.sh
```yaml
concurrency:
  group: nightly-mlx-bench-${{ github.event_name }}
  cancel-in-progress: false
```
🧹 Nitpick | 🔵 Trivial
Concurrency group may cause unexpected cancellations.
The concurrency group nightly-mlx-bench-${{ github.event_name }} means scheduled runs and manual dispatches use different groups (nightly-mlx-bench-schedule vs nightly-mlx-bench-workflow_dispatch). This allows a manual run to execute concurrently with a scheduled run, which could compete for resources or cause confusion. Consider using a fixed group name if only one instance should ever run.
💡 Suggested fix for single-instance concurrency
```diff
 concurrency:
-  group: nightly-mlx-bench-${{ github.event_name }}
+  group: nightly-mlx-bench
   cancel-in-progress: false
```
Already mitigated — the workflow has cancel-in-progress: false, so even though ${{ github.event_name }} produces different group keys for schedule vs workflow_dispatch, neither cancels the other. Keeping as-is intentionally so a manual dispatch never kills an in-progress nightly.
@key4ng, thanks for the clarification — that's a valid and deliberate design. Using ${{ github.event_name }}-scoped groups with cancel-in-progress: false is the correct way to let a manual dispatch and a nightly coexist independently: each group serializes its own event type without interfering with the other. I've noted this for future reviews.
✏️ Learnings added
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1399
File: .github/workflows/nightly-mlx-bench.yml:39-41
Timestamp: 2026-04-27T04:50:22.772Z
Learning: In repo lightseekorg/smg, `.github/workflows/nightly-mlx-bench.yml`: The concurrency group `nightly-mlx-bench-${{ github.event_name }}` is intentional. It creates separate groups for `schedule` and `workflow_dispatch` events so that a manual dispatch run and a nightly scheduled run can coexist without blocking or cancelling each other. `cancel-in-progress: false` ensures neither a new nightly nor a new manual dispatch kills an already-running instance of the same event type. Do not flag this as a bug or suggest collapsing to a single fixed group name.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 635
File: .github/workflows/pr-test-rust.yml:621-621
Timestamp: 2026-03-04T23:20:28.110Z
Learning: In lightseekorg/smg, keeping `-s` (no capture) unconditionally in pytest invocations for parallel E2E jobs in `.github/workflows/pr-test-rust.yml` is intentional. Removing `-s` for parallel jobs was explicitly rejected because: (1) captured output is discarded on CI timeout/hang, which is unacceptable for debugging, and (2) it removes real-time log visibility. Log interleaving from parallel runs is instead addressed by a `_TestNameFormatter` in `e2e_test/conftest.py` that prefixes every log line with `[test_name]`, allowing users to grep for a specific test name to follow its output through interleaved logs.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/benches/wasm_middleware_latency.rs:88-91
Timestamp: 2026-02-21T02:37:02.009Z
Learning: Repo: lightseekorg/smg — For clippy-only/enforcement PRs (e.g., PR `#489`), even micro-optimizations (like replacing an async closure with std::future::ready in benches such as model_gateway/benches/wasm_middleware_latency.rs) should be deferred to a follow-up PR rather than included inline.
Learnt from: pallasathena92
Repo: lightseekorg/smg PR: 687
File: model_gateway/src/routers/openai/realtime/webrtc.rs:238-289
Timestamp: 2026-03-11T01:29:56.655Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/openai/realtime/webrtc.rs: `Metrics::record_router_request` is already emitted in router.rs (around line 1152) before `handle_realtime_webrtc` is called, and `Metrics::record_router_error` is emitted inside `handle_realtime_webrtc` for the no-workers case. The missing instrumentation is success/duration recording after `setup_and_spawn_bridge` returns — this is a metrics improvement deferred to a follow-up PR, not a correctness gap. Do not flag missing success/duration metrics as a blocking issue for PR `#687`.
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1250
File: model_gateway/tests/api/responses_api_test.rs:643-653
Timestamp: 2026-04-24T20:01:12.229Z
Learning: In repo lightseekorg/smg, PR `#1250` (`model_gateway/tests/api/responses_api_test.rs`): The router-side rejection path for `previous_response_id + mcp_approval_response + no MCP tools` (i.e., a resumed approval continuation where the follow-up request carries zero MCP tool bindings) is intentionally not covered by integration tests in this PR. The existing continuation fixtures all inherit `req1` which carries an MCP binding. A dedicated test for this rejection branch is deferred to a follow-up PR. Do not re-flag the missing coverage in PR `#1250` or similar approval-continuation PRs.
Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 723
File: model_gateway/tests/api/interactions_api_test.rs:316-346
Timestamp: 2026-03-11T05:32:45.536Z
Learning: In repo lightseekorg/smg, `test_interactions_multiple_workers` in `model_gateway/tests/api/interactions_api_test.rs` intentionally only verifies that requests succeed with multiple Gemini backends (connectivity), not load distribution. Load distribution across workers is covered separately in `tests/routing/load_balancing_test.rs`. Do not flag the absence of per-worker request count assertions in this test as a gap — distribution verification is deferred to a follow-up PR once the Gemini router implementation matures.
Learnt from: RohanSogani
Repo: lightseekorg/smg PR: 1333
File: model_gateway/tests/api/responses_api_test.rs:325-373
Timestamp: 2026-04-23T18:59:11.719Z
Learning: In repo lightseekorg/smg, PR `#1333` adds `extract_forwardable_request_headers` and threads forwarded headers through `McpToolSession` → `McpRequestContext` → `HandlerRequestContext`. This PR's scope is strictly internal request-context preservation (preserving request-scoped headers so they are accessible in MCP execution context), NOT outbound MCP header propagation to dynamic MCP servers. The integration test `test_non_streaming_mcp_e2e_accepts_forwardable_request_headers` in `model_gateway/tests/api/responses_api_test.rs` is intentionally a smoke test (verifies MCP still executes when forwardable headers are present); wire-level header assertions via `MockMCPServer::captured_headers()` belong in the follow-up PR covering config-driven outbound MCP header propagation. Do not flag the absence of `captured_headers()` assertions in this test as a gap for PR `#1333`.
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 1305
File: model_gateway/src/routers/grpc/harmony/builder.rs:726-728
Timestamp: 2026-04-22T20:29:00.426Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/harmony/builder.rs`: The warn message "Approval item reached Harmony conversion" in `parse_response_item_to_harmony_message` is pre-existing code. The `ComputerCall { .. }` and `ComputerCallOutput { .. }` match arms added in PR `#1305` are forced-cascade additions to keep the exhaustive match valid within a protocol-scope PR. Rewording the warning to a neutral message (e.g. "Unsupported input item reached Harmony conversion") is intentionally deferred to a follow-up router task. Do not re-flag this stale log message in protocol-scope PRs.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 936
File: model_gateway/src/routers/gemini/router.rs:82-90
Timestamp: 2026-03-30T18:52:19.689Z
Learning: In repo lightseekorg/smg, the Gemini router's (`model_gateway/src/routers/gemini/router.rs`) use of `model_id` for per-model retry config lookup in `route_interactions` has a pre-existing limitation: when the request carries only an agent identifier, `get_retry_config(agent_id)` finds no match because agent→model resolution happens later inside `driver::execute` (in `worker_selection.rs`). This same limitation affects policy lookup and worker selection in the Gemini router. It is NOT introduced by PR `#936` and is a known accepted gap. The Gemini router is not actively used currently. Do not flag this as a bug introduced by per-model retry config changes to the Gemini router.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1099
File: .github/workflows/pr-test-mlx.yml:43-43
Timestamp: 2026-04-13T02:02:07.209Z
Learning: In this repo, when reviewing GitHub Actions workflow YAML files, treat `runs-on: macos-latest` as acceptable for Apple Silicon: GitHub’s standard `macos-latest` runner resolves to an arm64 macOS runner (Intel is being deprecated). Do not flag `macos-latest` as potentially non-arm64/Intel for Apple Silicon + MLX workloads.
Actionable comments posted: 1
♻️ Duplicate comments (1)
.github/workflows/nightly-mlx-bench.yml (1)
100-104: ⚠️ Potential issue | 🟠 Major

Docker runtime is assumed but not provisioned on the macOS runner.

Line 101 and Line 104 depend on a working Docker daemon, but this job has no Docker setup/start step. Please provision Docker explicitly on macOS (or remove the container dependency for `genai-bench`) before these steps.

Run this verification to confirm the current macOS runner image contents:
```bash
#!/bin/bash
set -euo pipefail
# Check whether Docker appears in the official macOS arm64 runner image manifest.
curl -fsSL https://raw.githubusercontent.com/actions/runner-images/main/images/macos/macos-15-arm64-Readme.md \
  | rg -n -i '\bdocker\b' || true
# Expected result:
# - No Docker tool entry found in the manifest (or no matching lines),
#   which indicates the workflow must install/start Docker itself.
```
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 282d6bcd-1201-4a4b-9f0a-f1578d5112c4
📒 Files selected for processing (1)
.github/workflows/nightly-mlx-bench.yml
```yaml
# TEMPORARY: enable PR trigger so this workflow can be smoke-tested before
# merge. Path-filtered so unrelated PRs don't fire it, and the matrix is
# reduced via the env block below to ~12 minutes. REMOVE this block before
# merging — once the harness is validated, only schedule + workflow_dispatch
# should remain.
pull_request:
  branches: [main]
  paths:
    - "benchmarks/mlx_grpc_vs_http/**"
    - ".github/workflows/nightly-mlx-bench.yml"
```
Remove the temporary pull_request trigger before merge.
This block is explicitly marked temporary; merging it as-is keeps PR smoke runs enabled beyond the validation phase and expands CI spend/surface area unintentionally.
✂️ Proposed cleanup before final merge
```diff
-# TEMPORARY: enable PR trigger so this workflow can be smoke-tested before
-# merge. Path-filtered so unrelated PRs don't fire it, and the matrix is
-# reduced via the env block below to ~12 minutes. REMOVE this block before
-# merging — once the harness is validated, only schedule + workflow_dispatch
-# should remain.
-pull_request:
-  branches: [main]
-  paths:
-    - "benchmarks/mlx_grpc_vs_http/**"
-    - ".github/workflows/nightly-mlx-bench.yml"
```
Deliberately deferred. The PR trigger smoke-tests the harness on each push while the bench is still being validated; without it I can't see whether changes break the workflow until manual dispatch. Will remove the trigger and the surrounding TEMPORARY comment in the final pre-merge commit. The path filter (benchmarks/mlx_grpc_vs_http/** + the workflow file itself) keeps unrelated PRs from firing it, and the env-block matrix reduction (concurrencies=1 4, duration=4 min) caps PR-trigger spend at ~25 min vs the full ~2 h scheduled run.
@key4ng, understood — that's a reasonable validation strategy. The path filter + reduced matrix (≈25 min cap) makes the temporary trigger cost-conscious, and having it fire on every push is genuinely more convenient than repeated manual dispatches during harness development.
I've noted your commitment to remove the pull_request block and its surrounding TEMPORARY comment in the final pre-merge commit. I'll keep this comment open as a pre-merge reminder.
✏️ Learnings added
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1399
File: .github/workflows/nightly-mlx-bench.yml:29-38
Timestamp: 2026-05-01T00:13:32.765Z
Learning: In repo lightseekorg/smg, `.github/workflows/nightly-mlx-bench.yml`: The `pull_request:` trigger block (with `branches: [main]` and `paths: ["benchmarks/mlx_grpc_vs_http/**", ".github/workflows/nightly-mlx-bench.yml"]`) is explicitly marked TEMPORARY and is intentionally kept during the harness validation phase so the author can smoke-test the workflow on each push without needing manual `workflow_dispatch`. The env-block matrix reduction caps PR-trigger spend at ~25 min. The block is committed to be removed in the final pre-merge commit, leaving only `schedule` and `workflow_dispatch` triggers. Do NOT re-flag this temporary trigger as a blocking issue during the validation period, but do flag it if it is still present at merge time.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1399
File: .github/workflows/nightly-mlx-bench.yml:39-41
Timestamp: 2026-04-27T04:50:22.772Z
Learning: In `.github/workflows/nightly-mlx-bench.yml`, do not flag or recommend changing the `concurrency.group` value `nightly-mlx-bench-${{ github.event_name }}`. It is intentionally designed to separate concurrency between `schedule` and `workflow_dispatch` so a manual run can coexist with (rather than block/cancel) a scheduled nightly run. Keep `cancel-in-progress: false` as-is, since it intentionally prevents new runs of the same event type from cancelling an already-running instance.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1398
File: .github/workflows/pr-test-mlx.yml:66-119
Timestamp: 2026-04-30T06:05:20.312Z
Learning: In repo lightseekorg/smg, `.github/workflows/pr-test-mlx.yml`: Do NOT suggest adding a Hugging Face model cache step for the `mlx-community/Qwen3-0.6B-4bit` model (~400 MB). The model is intentionally left uncached because it is small enough that the network download cost is acceptable and caching it wastes GitHub Actions cache storage.
Learnt from: VadivelanU
Repo: lightseekorg/smg PR: 1407
File: model_gateway/src/routers/openai/responses/streaming.rs:1061-1173
Timestamp: 2026-04-29T13:26:42.244Z
Learning: In repo lightseekorg/smg, `handle_streaming_with_tool_interception` in `model_gateway/src/routers/openai/responses/streaming.rs`: The inner `break` on `tool_calls_detected` inside the `for normalized_event in normalized_events` loop (after `StreamAction::ExecuteTools` sets the flag) is a pre-existing behavior preserved intentionally from prior iterations. It can theoretically drop later events in the same normalized batch, but fixing it (moving the break outside the inner loop so all normalized events are processed before acting on the flag) is deferred to a follow-up PR. Do not re-flag this early break as a blocking issue in PR `#1407` or similar streaming-refactor PRs.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 635
File: .github/workflows/pr-test-rust.yml:621-621
Timestamp: 2026-03-04T23:20:28.110Z
Learning: In lightseekorg/smg, keeping `-s` (no capture) unconditionally in pytest invocations for parallel E2E jobs in `.github/workflows/pr-test-rust.yml` is intentional. Removing `-s` for parallel jobs was explicitly rejected because: (1) captured output is discarded on CI timeout/hang, which is unacceptable for debugging, and (2) it removes real-time log visibility. Log interleaving from parallel runs is instead addressed by a `_TestNameFormatter` in `e2e_test/conftest.py` that prefixes every log line with `[test_name]`, allowing users to grep for a specific test name to follow its output through interleaved logs.
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1402
File: model_gateway/src/routers/grpc/harmony/streaming.rs:509-513
Timestamp: 2026-04-27T23:48:21.853Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/harmony/streaming.rs`: The unconditional `prefill_stream.mark_completed()` call after `process_decode_stream(...)` (in both `process_responses_dual_stream` and `process_dual_stream`) is a pre-existing pattern that predates PR `#1402`. PR `#1402` only relocated the function; no logic was changed. The fix — guarding `mark_completed()` behind `result.is_ok()` so that decode/parser errors still trigger abort-on-drop cleanup for the prefill RPC — is intentionally deferred to a follow-up PR covering both dual-stream paths. Do not re-flag this as a regression in PR `#1402` or similar relocation PRs.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/benches/wasm_middleware_latency.rs:88-91
Timestamp: 2026-02-21T02:37:02.009Z
Learning: Repo: lightseekorg/smg — For clippy-only/enforcement PRs (e.g., PR `#489`), even micro-optimizations (like replacing an async closure with std::future::ready in benches such as model_gateway/benches/wasm_middleware_latency.rs) should be deferred to a follow-up PR rather than included inline.
Learnt from: slin1237
Repo: lightseekorg/smg PR: 1305
File: model_gateway/src/routers/grpc/harmony/builder.rs:726-728
Timestamp: 2026-04-22T20:29:00.426Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/harmony/builder.rs`: The warn message "Approval item reached Harmony conversion" in `parse_response_item_to_harmony_message` is pre-existing code. The `ComputerCall { .. }` and `ComputerCallOutput { .. }` match arms added in PR `#1305` are forced-cascade additions to keep the exhaustive match valid within a protocol-scope PR. Rewording the warning to a neutral message (e.g. "Unsupported input item reached Harmony conversion") is intentionally deferred to a follow-up router task. Do not re-flag this stale log message in protocol-scope PRs.
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1250
File: model_gateway/tests/api/responses_api_test.rs:1347-1355
Timestamp: 2026-04-24T20:01:23.917Z
Learning: In repo lightseekorg/smg, the response transformer in the non-streaming OpenAI path rewrites synthesized `function_call_output` items into `mcp_call` items before they reach the final client-visible output. Consequently, the assertion in `test_final_response_hides_internal_mcp_trace_items` (model_gateway/tests/api/responses_api_test.rs) does not need to explicitly block `"function_call_output"` — any such leak would first require the transformer to fail, which would already be caught by the `mcp_call` guard. Adding `"function_call_output"` to that assertion is defense-in-depth only, not a fix for a live gap; it is intentionally deferred to a follow-up PR.
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1250
File: model_gateway/tests/api/responses_api_test.rs:643-653
Timestamp: 2026-04-24T20:01:18.002Z
Learning: In repo lightseekorg/smg, PR `#1250` (`model_gateway/tests/api/responses_api_test.rs`): The router-side rejection path for `previous_response_id + mcp_approval_response + no MCP tools` (i.e., a resumed approval continuation where the follow-up request carries zero MCP tool bindings) is intentionally not covered by integration tests in this PR. The existing continuation fixtures all inherit `req1` which carries an MCP binding. A dedicated test for this rejection branch is deferred to a follow-up PR. Do not re-flag the missing coverage in PR `#1250` or similar approval-continuation PRs.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 1090
File: crates/tool_parser/src/parsers/mistral.rs:48-75
Timestamp: 2026-04-13T23:31:17.654Z
Learning: In repo lightseekorg/smg, `crates/tool_parser/src/parsers/mistral.rs` `MistralParser::build_structural_tag`: The `triggered_tags` format intentionally constrains only single tool calls (one `begin`/`end` tag pair per tool, trigger `"[TOOL_CALLS]"`). Supporting parallel Mistral tool calls (array with multiple items) requires `tags_with_separator` (as sglang implements) and is deferred to a follow-up PR. Do not flag the absence of multi-call/continuation tags in `build_structural_tag` as a bug; the primary target use case is `tool_choice: function` (single call).
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 990
File: crates/tokenizer/benches/stop_decoder.rs:23-24
Timestamp: 2026-04-01T08:02:32.590Z
Learning: In `lightseekorg/smg` (`crates/tokenizer/benches/stop_decoder.rs`), benchmarks intentionally use `MockTokenizer` (not a real HuggingFace tokenizer) to keep them hermetic and avoid network/filesystem dependencies for model downloads. The HuggingFace native `step_decode_stream` fast-path is covered by the integration test `test_sequence_operations` in `crates/tokenizer/tests/tokenizer_integration.rs`, which uses a real HF tokenizer. A fixture-backed HF benchmark is a known follow-up item but is not required in this PR. Do not flag the absence of HF tokenizer usage in the bench file as a gap.
Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 723
File: model_gateway/tests/api/interactions_api_test.rs:316-346
Timestamp: 2026-03-11T05:32:45.536Z
Learning: In repo lightseekorg/smg, `test_interactions_multiple_workers` in `model_gateway/tests/api/interactions_api_test.rs` intentionally only verifies that requests succeed with multiple Gemini backends (connectivity), not load distribution. Load distribution across workers is covered separately in `tests/routing/load_balancing_test.rs`. Do not flag the absence of per-worker request count assertions in this test as a gap — distribution verification is deferred to a follow-up PR once the Gemini router implementation matures.
Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1174
File: crates/mcp/src/core/orchestrator.rs:1097-1131
Timestamp: 2026-04-17T06:48:38.553Z
Learning: In repo lightseekorg/smg, `execute_tool_entry_result` in `crates/mcp/src/core/orchestrator.rs`: `record_call_end` is intentionally called for both `ToolExecutionResult::Executed` and `ToolExecutionResult::PendingApproval` to preserve pre-refactor metric shape (pending approval closes out the current tool-handling attempt). For `PendingApproval`, `succeeded = false` was a bug fixed in PR `#1174` so it no longer inflates error rates, but a dedicated `record_call_pending_approval` counter and full lifecycle split (only recording completion when approval is resolved and execution runs) are deferred to a follow-up PR. Do not flag the absence of a dedicated pending-approval metric as a blocking issue until that follow-up work is done.
Learnt from: pallasathena92
Repo: lightseekorg/smg PR: 687
File: model_gateway/src/routers/openai/realtime/webrtc.rs:238-289
Timestamp: 2026-03-11T01:29:56.655Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/openai/realtime/webrtc.rs: `Metrics::record_router_request` is already emitted in router.rs (around line 1152) before `handle_realtime_webrtc` is called, and `Metrics::record_router_error` is emitted inside `handle_realtime_webrtc` for the no-workers case. The missing instrumentation is success/duration recording after `setup_and_spawn_bridge` returns — this is a metrics improvement deferred to a follow-up PR, not a correctness gap. Do not flag missing success/duration metrics as a blocking issue for PR `#687`.
Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.
Learnt from: shenoyvvarun
Repo: lightseekorg/smg PR: 1397
File: model_gateway/src/routers/http/pd_router.rs:104-116
Timestamp: 2026-04-29T15:14:12.061Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/http/pd_router.rs` and `router.rs`: When `enforce_token_rate_limit` (or its equivalent in the regular HTTP router) returns a 429/413 early exit, do NOT emit the existing worker-dispatch metrics (`Metrics::record_router_request`, `record_router_error`, etc.) for throttled traffic. Those metrics are semantically tied to requests dispatched to backend workers. Instead, a dedicated metric such as `Metrics::record_request_throttled(router_type, backend, model_id, tenant_key)` should be added to track admission-control rejections separately. Do not re-flag the absence of worker-dispatch metrics on throttled paths as a bug; flag only the absence of a dedicated throttle counter if one is not present.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 936
File: model_gateway/src/routers/gemini/router.rs:82-90
Timestamp: 2026-03-30T18:52:19.689Z
Learning: In repo lightseekorg/smg, the Gemini router's (`model_gateway/src/routers/gemini/router.rs`) use of `model_id` for per-model retry config lookup in `route_interactions` has a pre-existing limitation: when the request carries only an agent identifier, `get_retry_config(agent_id)` finds no match because agent→model resolution happens later inside `driver::execute` (in `worker_selection.rs`). This same limitation affects policy lookup and worker selection in the Gemini router. It is NOT introduced by PR `#936` and is a known accepted gap. The Gemini router is not actively used currently. Do not flag this as a bug introduced by per-model retry config changes to the Gemini router.
Learnt from: RohanSogani
Repo: lightseekorg/smg PR: 1333
File: model_gateway/tests/api/responses_api_test.rs:325-373
Timestamp: 2026-04-23T18:59:15.656Z
Learning: In repo lightseekorg/smg, PR `#1333` adds `extract_forwardable_request_headers` and threads forwarded headers through `McpToolSession` → `McpRequestContext` → `HandlerRequestContext`. This PR's scope is strictly internal request-context preservation (preserving request-scoped headers so they are accessible in MCP execution context), NOT outbound MCP header propagation to dynamic MCP servers. The integration test `test_non_streaming_mcp_e2e_accepts_forwardable_request_headers` in `model_gateway/tests/api/responses_api_test.rs` is intentionally a smoke test (verifies MCP still executes when forwardable headers are present); wire-level header assertions via `MockMCPServer::captured_headers()` belong in the follow-up PR covering config-driven outbound MCP header propagation. Do not flag the absence of `captured_headers()` assertions in this test as a gap for PR `#1333`.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1099
File: .github/workflows/pr-test-mlx.yml:43-43
Timestamp: 2026-04-13T02:02:07.209Z
Learning: In this repo, when reviewing GitHub Actions workflow YAML files, treat `runs-on: macos-latest` as acceptable for Apple Silicon: GitHub’s standard `macos-latest` runner resolves to an arm64 macOS runner (Intel is being deprecated). Do not flag `macos-latest` as potentially non-arm64/Intel for Apple Silicon + MLX workloads.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b068350219
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```bash
    --num-concurrency "$concurrency" \
    --traffic-scenario "$scenario" \
    --max-requests-per-run "$MAX_REQUESTS" \
    --max-time-per-run "$DURATION_MIN" \
```
Convert max-time-per-run from minutes to seconds
run.sh treats DURATION_MIN as minutes, but passes it directly to genai-bench --max-time-per-run, which is defined in seconds (the upstream manage_run_time loop increments once per second, and this repo’s own benchmark harness documents _MAX_TIME_PER_RUN = 10 # seconds in e2e_test/benchmarks/test_nightly_perf.py). With the current default DURATION_MIN=10, each cell runs for ~10 seconds instead of 10 minutes, so nightly results will be statistically noisy and can mislead HTTP vs gRPC comparisons.
Actionable comments posted: 1
♻️ Duplicate comments (1)
.github/workflows/nightly-mlx-bench.yml (1)
25-34: ⚠️ Potential issue | 🟡 Minor

Remove the temporary `pull_request` trigger before merging.

This block is explicitly marked temporary; keeping it in the final workflow expands CI scope/cost beyond the intended nightly + manual modes.

Suggested cleanup

```diff
-  # TEMPORARY: enable PR trigger so this workflow can be smoke-tested before
-  # merge. Path-filtered so unrelated PRs don't fire it, and the matrix is
-  # reduced via the env block below to ~12 minutes. REMOVE this block before
-  # merging — once the harness is validated, only schedule + workflow_dispatch
-  # should remain.
-  pull_request:
-    branches: [main]
-    paths:
-      - "benchmarks/mlx_grpc_vs_http/**"
-      - ".github/workflows/nightly-mlx-bench.yml"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/nightly-mlx-bench.yml around lines 25 - 34, Remove the temporary pull_request trigger block from the workflow so it only runs on the intended schedule and manual dispatch; specifically delete the entire pull_request: block (including the branches: [main] and paths: entries that reference "benchmarks/mlx_grpc_vs_http/**" and ".github/workflows/nightly-mlx-bench.yml") before merging so CI scope reverts to nightly + workflow_dispatch only.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/mlx_grpc_vs_http/aggregate.py`:
- Around line 49-52: The collect function currently calls results_dir.iterdir()
which raises if the directory doesn't exist; update collect (function name:
collect) to handle a missing results_dir by early-returning the empty out dict
or by wrapping the iterdir() call in a try/except FileNotFoundError (or checking
results_dir.exists() / results_dir.is_dir()) so the function emits the intended
empty-results summary instead of crashing; ensure you still return the same out
variable shape when the directory is absent.
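A minimal sketch of that guard; the `collect` signature and return shape here are assumed from the prompt above, not copied from the actual aggregator:

```python
from pathlib import Path

def collect(results_dir: Path) -> dict:
    """Gather experiment dirs; tolerate a results dir that was never created."""
    out: dict = {}
    # Early return preserves the empty-summary shape instead of letting
    # iterdir() raise FileNotFoundError when no phase produced output.
    if not results_dir.is_dir():
        return out
    for exp_dir in sorted(results_dir.iterdir()):
        if exp_dir.is_dir():
            out[exp_dir.name] = exp_dir  # stand-in for real per-cell parsing
    return out
```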
---
Duplicate comments:
In @.github/workflows/nightly-mlx-bench.yml:
- Around line 25-34: Remove the temporary pull_request trigger block from the
workflow so it only runs on the intended schedule and manual dispatch;
specifically delete the entire pull_request: block (including the branches:
[main] and paths: entries that reference "benchmarks/mlx_grpc_vs_http/**" and
".github/workflows/nightly-mlx-bench.yml") before merging so CI scope reverts to
nightly + workflow_dispatch only.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6e389c04-0c6b-402c-967e-dd4f8060b3e5
📒 Files selected for processing (3)
- .github/workflows/nightly-mlx-bench.yml
- benchmarks/mlx_grpc_vs_http/aggregate.py
- benchmarks/mlx_grpc_vs_http/run.sh
```bash
local container_url="${base_url/127.0.0.1/host.docker.internal}"
container_url="${container_url/localhost/host.docker.internal}"

# genai-bench's --max-time-per-run is in SECONDS (despite the name).
local duration_s=$(( DURATION_MIN * 60 ))

docker run --rm \
  --add-host=host.docker.internal:host-gateway \
```
🔴 Important: The servers started in Phase 1 and Phase 2 all bind to 127.0.0.1 (lines 141, 171, 181). With Colima, host.docker.internal resolves to the Mac host's IP on the virtual network (e.g. 192.168.106.1), not to 127.0.0.1. Since the servers only listen on the loopback interface, connections from the container to host.docker.internal:<port> will be refused.
Docker Desktop for Mac has a built-in proxy that forwards host.docker.internal traffic to the host's loopback, but Colima (Lima-based) does not — it relies on standard IP routing.
Fix: change the three server invocations to bind to 0.0.0.0:
```bash
# Line 141 (Phase 1 — mlx-lm)
mlx_lm.server --model "$MODEL" --host 0.0.0.0 --port "$HTTP_PORT" ...

# Line 171 (Phase 2 — gRPC servicer)
python3 -m smg_grpc_servicer.mlx.server --model "$MODEL" --host 0.0.0.0 --port "$GRPC_PORT" ...

# Line 181 (Phase 2 — SMG router)
"$SMG_BIN" launch --host 0.0.0.0 --port "$ROUTER_PORT" ...
```

This is safe on a CI runner. The `wait_for_port` and `wait_for_openai` helpers should also be updated to check 127.0.0.1 (or just localhost) since they're run from the host.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e8203fc57c
```bash
ROUTER_PORT="${ROUTER_PORT:-30000}"
SMG_BIN="${SMG_BIN:-target/release/smg}"

mkdir -p "$RESULTS_DIR"
```
Clear RESULTS_DIR before starting a new benchmark run
run.sh only creates $RESULTS_DIR and never removes prior contents, but the aggregator later walks every experiment subfolder under that directory. In practice this mixes old and new runs (for example, after a smoke run with a reduced matrix and then a full run), so SUMMARY.md can include stale cells that were not executed in the current benchmark and mislead HTTP vs gRPC comparisons.
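In run.sh the fix is a one-liner (`rm -rf "$RESULTS_DIR"` before the `mkdir -p`). The same reset expressed as a Python sketch, with the helper name invented here:

```python
import shutil
from pathlib import Path

def reset_results_dir(results_dir: Path) -> None:
    # Drop any stale experiment folders from earlier runs before the new
    # run starts, so the aggregator only sees cells produced by this run.
    shutil.rmtree(results_dir, ignore_errors=True)
    results_dir.mkdir(parents=True, exist_ok=True)
```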
```python
        return None
    # Each cell produces one summary JSON; sorted() makes the choice
    # deterministic if a partial rerun left a stale file.
    data = json.loads(json_files[0].read_text())
```
Read latest metrics file instead of first sorted JSON
parse_experiment currently loads json_files[0] after sorting, which selects the lexicographically earliest file in the folder. If a cell is rerun and multiple result JSON files remain in the same experiment directory, the summary can report stale metrics from an older run rather than the latest output, silently corrupting the final table.
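A sketch of mtime-based selection instead; the helper name is illustrative, and the real aggregator may prefer to key on filename conventions:

```python
from pathlib import Path
from typing import Optional

def latest_summary(exp_dir: Path) -> Optional[Path]:
    # Pick the most recently written JSON so a rerun's metrics win over
    # a stale file left behind by an earlier partial attempt.
    json_files = list(exp_dir.glob("*.json"))
    if not json_files:
        return None
    return max(json_files, key=lambda p: p.stat().st_mtime)
```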
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/mlx_grpc_vs_http/run.sh`:
- Around line 157-160: The fixed "sleep 5" after starting the gRPC servicer is
brittle; replace it with a deterministic readiness gate instead of sleep by
adding/using a wait-for-ready helper (e.g., a new wait_for_generator_ready or
reuse wait_for_port) that polls a concrete readiness signal (health HTTP/gRPC
health endpoint, a socket response on "$GRPC_PORT", or a PID-ready file written
by the gen thread) with a timeout and backoff, then call that helper in place of
"sleep 5" so the script proceeds only when the generator is actually warmed up
(reference symbols: wait_for_port, GRPC_PORT, GRPC_PID, and the current "sleep
5" location).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: f7c9e65f-8711-485f-bcad-9de3dcf6ebe2
📒 Files selected for processing (3)
- .github/workflows/nightly-mlx-bench.yml
- benchmarks/mlx_grpc_vs_http/README.md
- benchmarks/mlx_grpc_vs_http/run.sh
♻️ Duplicate comments (1)
benchmarks/mlx_grpc_vs_http/run.sh (1)
159-162: ⚠️ Potential issue | 🟠 Major

Replace fixed warmup sleep with a deterministic readiness gate.
A hardcoded delay is still brittle for CI variability and can skew/flake the first measured runs.
🔧 Proposed hardening
```diff
+wait_for_log_line() {
+  local file="$1"
+  local pattern="$2"
+  local timeout="${3:-300}"
+  local start=$SECONDS
+  while ! grep -q "$pattern" "$file" 2>/dev/null; do
+    if (( SECONDS - start > timeout )); then
+      log "Timeout waiting for '$pattern' in $file"
+      return 1
+    fi
+    sleep 1
+  done
+}
+
 wait_for_port "$GRPC_PORT" 300
+wait_for_log_line "$RESULTS_DIR/mlx-grpc.log" "gRPC server listening on" 300
 log "MLX gRPC servicer up on :$GRPC_PORT (pid=$GRPC_PID)"
-# Give the gen thread a beat to finish warmup before starting the router.
-sleep 5
```
Verify each finding against the current code and only fix it if needed. In `@benchmarks/mlx_grpc_vs_http/run.sh` around lines 159 - 162, Replace the brittle fixed "sleep 5" warmup with a deterministic readiness gate: implement/use a readiness check that polls the generator's health (e.g., a gen-ready HTTP/IPC/UDP/flag endpoint or log marker) and block until it reports ready before starting the router; in this script replace the sleep 5 with a call to that gate (e.g., wait_for_generation_ready or similar) invoked after wait_for_port "$GRPC_PORT" and before starting the router so the script relies on an explicit readiness signal tied to the generator process (identified by GRPC_PID/GRPC_PORT) rather than a fixed timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@benchmarks/mlx_grpc_vs_http/run.sh`:
- Around line 159-162: Replace the brittle fixed "sleep 5" warmup with a
deterministic readiness gate: implement/use a readiness check that polls the
generator's health (e.g., a gen-ready HTTP/IPC/UDP/flag endpoint or log marker)
and block until it reports ready before starting the router; in this script
replace the sleep 5 with a call to that gate (e.g., wait_for_generation_ready or
similar) invoked after wait_for_port "$GRPC_PORT" and before starting the router
so the script relies on an explicit readiness signal tied to the generator
process (identified by GRPC_PID/GRPC_PORT) rather than a fixed timeout.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 94b6b10a-473f-4fcc-bd6e-f339ba3ff658
📒 Files selected for processing (1)
benchmarks/mlx_grpc_vs_http/run.sh
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b4820432bc
```bash
  local base="$1"
  local timeout="${2:-180}"
  local start=$SECONDS
  while ! curl -fsS "$base/v1/models" >/dev/null 2>&1; do
```
Add request timeouts to readiness curl probe
The readiness loop is meant to stop after timeout seconds, but curl -fsS has no --connect-timeout/--max-time, so if the endpoint accepts a socket and then stalls, curl can block indefinitely and the timeout branch is never evaluated. In that startup-failure mode the benchmark hangs until the job-level timeout instead of failing fast.
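The shell fix is adding per-request bounds to the probe, e.g. `curl -fsS --connect-timeout 2 --max-time 5 ...`. The same bounded-probe idea as a Python sketch; the helper name and timeout values are illustrative:

```python
import time
import urllib.error
import urllib.request

def wait_for_openai(base: str, timeout: float = 180.0) -> bool:
    # Each probe carries its own deadline; without it, a server that
    # accepts the socket but never replies would block forever and the
    # outer loop's timeout branch would never be evaluated.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base}/v1/models", timeout=5):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)
    return False
```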
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5aed138a9d
```yaml
      - name: Install genai-bench
        # Native pip install instead of the docker image — avoids the
        # Colima-on-macos-latest VM hop, which is unstable in CI.
        run: pip install "genai-bench>=0.0.4"
```
Pin genai-bench version for reproducible nightly runs
Installing genai-bench with >=0.0.4 makes nightly results non-reproducible over time because the benchmark harness can silently change behavior, defaults, or metric calculations as newer releases are published. This job is meant to compare performance trends, so tool-version drift can corrupt those comparisons even when router code is unchanged; use an exact pinned version (or lock file) to keep runs comparable.
```python
scenarios = sorted({k[1] for k in results})
concurrencies = sorted({k[2] for k in results})
```
Preserve full benchmark matrix when cells fail
The table header and row dimensions are inferred only from successful result keys, so any scenario/concurrency where both HTTP and gRPC produced no JSON is dropped entirely instead of being shown as missing. In failure-heavy runs (for example, highest concurrency crashing both paths), this hides failed cells and makes the summary look like a smaller intended matrix, which can mislead comparisons.
Intentional — committed in 4580bbc with rationale: the aggregator now infers the rendered matrix from observed data so a CI run with PHASES=mlx grpc produces a two-row table, not a three-row table with a column of —. The same logic applies cell-by-cell. Failed cells are not silent: run.sh writes a .failed marker per failed cell and CI uploads bench-results/ as an artifact, so the failure record is preserved even when the row collapses. The header that previously showed every requested concurrency could imply intent the run did not satisfy — preferring "show what ran" matches how the rest of the SMG nightly suite renders its summary tables. Rendering attempted-but-failed cells from .failed markers is a reasonable follow-up if/when CI starts swallowing whole concurrencies regularly, but the current data does not show that pattern.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fdd9fe6559
```python
if content:
    if first_token_t is None:
        first_token_t = time.monotonic()
    tokens += 1
```
Measure generated tokens instead of SSE chunks
This increments tokens once per streamed chunk, but OpenAI-style SSE chunks are transport frames, not token boundaries, so chunk coalescing/splitting differences between mlx_lm.server and the router path will skew total_output_tokens, output_throughput, and tpot_ms even when model speed is unchanged. Because this benchmark’s purpose is comparing those paths, the current counting method can produce misleading regressions/improvements unrelated to actual token generation.
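One way to decouple the count from chunk framing is to accumulate the streamed text and tokenize it once at end of stream (a sketch, not this PR's code; `tokenize` should be the serving model's tokenizer, e.g. one loaded via `transformers.AutoTokenizer`, which is an assumption here — the whitespace split below is only a stand-in so the sketch is self-contained):

```python
def count_output_tokens(chunk_texts, tokenize=None):
    """Count generated tokens from the accumulated text, not per SSE chunk.

    Chunk coalescing/splitting between transports then cannot change the
    count; only the actual generated text does.
    """
    # Stand-in tokenizer (assumption): replace with the model's tokenizer.
    tokenize = tokenize or (lambda text: text.split())
    return len(tokenize("".join(chunk_texts)))
```

Where the server can report usage in the final streamed chunk (OpenAI-style `stream_options: {"include_usage": true}`, if supported), preferring that reported `completion_tokens` count avoids client-side tokenization entirely.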
```python
ds = load_dataset(
    "anon8231489123/ShareGPT_Vicuna_unfiltered",
    data_files="ShareGPT_V3_unfiltered_cleaned_split.json",
    split="train",
```
Pin dataset revisions for stable nightly comparisons
The benchmark loads Hugging Face datasets without a revision, so each run implicitly tracks the latest dataset state; upstream edits (new/removed rows or content changes) will change prompt-length distributions and can shift latency/throughput trends even when router code is unchanged. This undermines the nightly benchmark’s comparability over time; pin immutable dataset revisions (for both chat and agent loaders) to keep trend lines meaningful.
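`datasets.load_dataset` accepts a `revision` argument for exactly this; a small sketch of building the pinned call's keyword arguments (the helper is illustrative, and any SHA passed in must come from the dataset repo on the Hub — the values in the test are placeholders, not real commits):

```python
def pinned_dataset_kwargs(repo, data_files, revision):
    """Kwargs for datasets.load_dataset pinned to an immutable Hub commit.

    Pass a full commit SHA as `revision`; a branch name like "main" still
    drifts as the dataset is edited upstream.
    """
    return {
        "path": repo,
        "data_files": data_files,
        "split": "train",
        "revision": revision,  # freezes prompt-length distributions across runs
    }
```

Usage would then be `ds = load_dataset(**pinned_dataset_kwargs(...))` for both the chat and agent loaders.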
```yaml
          # downloading artifacts.
          cat "$RESULTS_DIR/SUMMARY.md" >> "$GITHUB_STEP_SUMMARY"

      - name: Upload raw genai-bench output
```
🟡 Nit: Stale step name — genai-bench was replaced by bench_client.py in this push.
Suggested change:

```diff
-      - name: Upload raw genai-bench output
+      - name: Upload raw benchmark output
```
```yaml
    # 4 concurrencies × 2 scenarios × 2 setups × 10 min ≈ 2.7 h, plus build
    # and model download. Cap at 5h45m for safety.
```
🟡 Nit: This comment still references the old "10 min per cell" duration model. The new approach uses prompt-count-based cells (README says ~30 min for a full run), so the 345-min timeout is ~10x the expected runtime. Consider updating the comment and reducing the timeout.
Suggested change:

```diff
-    # 4 concurrencies × 2 scenarios × 2 setups × 10 min ≈ 2.7 h, plus build
-    # and model download. Cap at 5h45m for safety.
+    # Custom bench client uses prompt-count-based cells (not time-based).
+    # Full run ~30 min; cap generously for model download + build overhead.
```
```
"request_throughput": 3.24,
"output_throughput": 832.1,
"total_input_tokens_est": 312000,
"total_output_tokens": 31750,
"ttft_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...},
"tpot_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...}
}
```
🟡 Nit: The example JSON includes total_input_tokens_est but _summarize() doesn't produce that field. Either add it to the output or remove it from the docstring to avoid confusing consumers.
Suggested change:

```diff
-"request_throughput": 3.24,
-"output_throughput": 832.1,
-"total_input_tokens_est": 312000,
-"total_output_tokens": 31750,
-"ttft_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...},
-"tpot_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...}
-}
+{
+"scenario": "agent",
+"concurrency": 16,
+"num_prompts_requested": 128,
+"completed_requests": 124,
+"error_count": 4,
+"duration_s": 38.2,
+"request_throughput": 3.24,
+"output_throughput": 832.1,
+"total_output_tokens": 31750,
+"ttft_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...},
+"tpot_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...}
+}
```
💡 Codex Review (reviewed commit 406e0a4f55)
```bash
then
    log "[$exp_name] FAILED — recording marker, continuing"
    date -u +"%Y-%m-%dT%H:%M:%SZ" >"$exp_dir/.failed"
    return 0
```
Propagate per-cell benchmark failures to step status
This branch converts any non-zero exit from bench_client.py into success by returning 0, so a phase can finish green even when every cell fails (e.g., invalid workflow_dispatch input, dataset/API breakage, or systemic server errors). In that case the nightly job reports success with empty/placeholder results, which hides that the benchmark is actually broken and can silently drop regression signal.
```python
"p95": xs[min(int(len(xs) * 0.95), len(xs) - 1)],
"p99": xs[min(int(len(xs) * 0.99), len(xs) - 1)],
```
Use correct percentile indexing for latency stats
The percentile calculation uses int(len * q) as a zero-based index, which biases p95/p99 upward for the small sample sizes used here (for example, with 30 samples p95 resolves to the 29th element, effectively ~96.7th percentile). That systematically distorts tail latency comparisons between HTTP and gRPC cells and can mislead nightly trend analysis.
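A linearly interpolated quantile (the same shape as NumPy's default `linear` method; a sketch, not the aggregator's actual code) removes the upward bias on small samples:

```python
def percentile(xs, q):
    """Linear-interpolated percentile over a sorted copy of the sample."""
    if not xs:
        raise ValueError("empty sample")
    ys = sorted(xs)
    pos = q * (len(ys) - 1)          # fractional rank into the sorted sample
    lo = int(pos)
    hi = min(lo + 1, len(ys) - 1)
    frac = pos - lo
    # Interpolate between the two neighbouring order statistics.
    return ys[lo] * (1 - frac) + ys[hi] * frac
```

With 30 samples this puts p95 between the 28th and 29th order statistics instead of pinning it to the top of the sample, so HTTP and gRPC tails are compared on the same footing.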
Force-pushed: 406e0a4 → 8abd73a
💡 Codex Review (reviewed commit 8abd73a1b2)
```python
"model": model,
"messages": [{"role": "user", "content": prompt}],
"max_tokens": max_tokens,
"temperature": 1.0,
```
Use deterministic decoding for benchmark traffic
The benchmark requests are sent with temperature set to 1.0 and no fixed seed, so each run samples different continuations and output lengths even when code and prompts are unchanged. Since this harness computes output_throughput and TPOT from generated-token counts, that sampling noise can shift nightly metrics independently of HTTP vs gRPC transport changes and create false regressions/improvements. For comparability, send deterministic generation settings for benchmark runs (for example temperature=0 and a fixed seed when supported).
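A deterministic request body might look like the sketch below (assumption: the servers honor OpenAI-style `temperature` and `seed` fields; `seed` support varies by backend, but `temperature=0` alone makes decoding greedy and removes sampling noise):

```python
def bench_payload(model, prompt, max_tokens, seed=1234):
    """Request body for benchmark traffic with sampling noise removed."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decode: identical continuations per run
        "seed": seed,        # honored by servers that support it; ignored otherwise
        "stream": True,
    }
```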
💡 Codex Review (reviewed commit a6f1afee4b)
```python
result = folder / "result.json"
if not result.exists():
    return None
```
Ignore failed experiment folders when building summary
run.sh records a .failed marker for any non-zero bench cell (benchmarks/mlx_grpc_vs_http/run.sh lines 129-132), but parse_experiment only checks whether result.json exists and then loads it. That means failed cells can still be included in SUMMARY.md when result.json is stale or partially written, which can silently skew HTTP vs gRPC comparisons instead of rendering the cell as missing as intended.
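A guard for the marker could sit at the top of `parse_experiment` (a sketch against the folder layout described above; it assumes run.sh's `.failed` marker filename and a `result.json` per cell):

```python
import json
from pathlib import Path


def parse_experiment(folder: Path):
    """Return the cell's metrics, or None if it failed or produced no JSON."""
    if (folder / ".failed").exists():
        return None  # run.sh marked the cell failed; render it as missing
    result = folder / "result.json"
    if not result.exists():
        return None
    return json.loads(result.read_text())
```

The marker check runs first, so a stale or partially written `result.json` in a failed cell can no longer leak into SUMMARY.md.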
Force-pushed: 2d9cfc1 → 6570420
💡 Codex Review (reviewed commit 65704207bb)
```yaml
        run: pip install -e "./grpc_servicer[mlx]"

      - name: Install mlx-lm CLI (for the direct-HTTP baseline)
        run: pip install "mlx-lm>=0.22.0"
```
Pin mlx-lm version for reproducible benchmark trends
This workflow installs mlx-lm as >=0.22.0, so each nightly run can pick up a different upstream runtime/server implementation. Because this job is used to compare performance over time, that version drift can change latency/throughput independently of any SMG changes and produce misleading regressions or improvements; use an exact version (or lockfile) for stable baselines.
Resolved in f7f465e: pinned to mlx-lm==0.31.3. Each nightly will now use the same mlx-lm runtime, so trend changes in the summary diffs reflect SMG-side changes instead of upstream mlx-lm release timing. Future bumps happen in their own commits so breaks are explicit.
```bash
# Python deps in the main env
pip install -e ./crates/grpc_client/python
pip install -e "./grpc_servicer[mlx]"
pip install "mlx-lm>=0.22.0" genai-bench
```
There was a problem hiding this comment.
Include oci_openai in documented local install command
The local setup instructions install only genai-bench, but this same commit’s workflow notes that genai-bench eagerly imports oci_openai and can crash on a clean install without it. As written, developers following the README can fail before the benchmark even starts, so the documented local workflow is not reliably runnable; add oci_openai (or pin a version where it is guaranteed present).
Resolved in f7f465e: README's local install line now matches the CI workflow — pip install "mlx-lm==0.31.3" genai-bench oci_openai, with a short note explaining why oci_openai is needed (genai-bench eagerly imports it; pip doesn't resolve it on a fresh wheel install).
```diff
@@ -0,0 +1,245 @@
+name: Nightly MLX Three-Way Benchmark
```
🟡 Nit: The top-level workflow name still says "Three-Way" but the job name (line 51) was updated to "two-way" in this push. They should match now that vllm is dropped from the CI default.
Suggested change:

```diff
-name: Nightly MLX Three-Way Benchmark
+name: Nightly MLX Two-Way Benchmark
```
Resolved in f7f465e: workflow name: is now Nightly MLX Two-Way Benchmark matching the job name.
Three issues flagged by bot review on commit 6570420 / 57e8888:

1. Workflow name said "Nightly MLX Three-Way Benchmark" but the job name was already "MLX two-way (mlx-lm vs SMG+gRPC; vllm-metal local-only)" since vllm was dropped from the CI default. The two were inconsistent; renamed the top-level workflow to "Nightly MLX Two-Way Benchmark" so the GitHub Actions UI tells the truth at every level.

2. `pip install "mlx-lm>=0.22.0"` let each nightly run pull whatever the latest mlx-lm release was at install time. For a benchmark meant to track SMG-side performance trends over time, that drift is noise — a faster decode loop in mlx-lm 0.31.4 would look like an SMG improvement. Pinned to exact version `mlx-lm==0.31.3` (what the failing run actually resolved to and what works with the gemma-3-4b-it-qat-4bit checkpoint). Future bumps now happen in their own commits, making trend breaks explicit in the bench summary diffs rather than implicit in HF/PyPI release timing.

3. Local README install instructions said `pip install "mlx-lm>=0.22.0" genai-bench` but the workflow knows `oci_openai` is a transitive dep that genai-bench imports eagerly — without it, `genai-bench --help` crashes on a clean wheel install. The CI workflow already pins it explicitly; the README didn't, so a developer following local instructions would fail before the bench even started. Updated the README to include `oci_openai` and the same pinned mlx-lm version as CI, with a brief note explaining why both are needed.

The temporary `pull_request` trigger flagged by CodeRabbit (#1399 review on b068350) is deliberately kept until just before merge — we need PR-trigger smoke runs while the harness is still being validated. Will be removed in the final pre-merge commit.

Signed-off-by: key4ng <[email protected]>
💡 Codex Review (reviewed commit 1a267fc039)
```yaml
  pull_request:
    branches: [main]
    paths:
      - "benchmarks/mlx_grpc_vs_http/**"
      - ".github/workflows/nightly-mlx-bench.yml"
```
Remove temporary PR trigger from nightly benchmark workflow
This workflow is documented as nightly/manual, and the in-file note says the pull_request trigger is temporary and should be removed before merge, but it is still enabled here. Keeping it will run this long benchmark on every matching PR update, which can add hours of CI latency and consume runner capacity/costs for non-nightly validation paths. If this check is required, it can also slow or block normal PR iteration unexpectedly.
Acknowledged but keeping for now. The pull_request trigger is path-filtered (only fires on benchmarks/mlx_grpc_vs_http/** or this workflow), the matrix is reduced under pull_request (concurrencies 1 4, duration 4 min), and the bench harness itself is what is being iterated on in this PR — without the trigger, every code change requires a manual workflow_dispatch, which has been the bottleneck during validation. The TEMPORARY note in the YAML is the standing commitment to remove the trigger before merging to main; that removal will be the final commit on this branch.
💡 Codex Review (reviewed commit f7f465e837)
```python
def collect(results_dir: Path) -> dict[tuple[str, str, int], dict[str, Any]]:
    out: dict[tuple[str, str, int], dict[str, Any]] = {}
    for sub in sorted(results_dir.iterdir()):
```
Guard missing results directory before iterating
collect() unconditionally calls results_dir.iterdir(), which raises FileNotFoundError when the benchmark directory was never created. In this workflow the aggregate step is if: always(), so early setup failures (or a run where all benchmark phases are skipped) will crash aggregation instead of emitting a graceful "no results" summary and continuing artifact/log upload steps.
Fixed in 81e97b7. Added if not results_dir.is_dir(): return out at the top of collect(). Aggregate runs under if: always(), so this is the safe path when no phase ever created the dir (setup failure, empty PHASES).
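The fix reduces to an early return before the directory walk (a sketch; the per-folder parsing is elided):

```python
from pathlib import Path
from typing import Any


def collect(results_dir: Path) -> dict[tuple[str, str, int], dict[str, Any]]:
    out: dict[tuple[str, str, int], dict[str, Any]] = {}
    if not results_dir.is_dir():
        return out  # no phase ever created the dir; emit an empty summary
    for sub in sorted(results_dir.iterdir()):
        ...  # parse each experiment folder as before
    return out
```

With the guard in place, the `if: always()` aggregate step writes a "no results" summary and the artifact/log upload steps still run.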
💡 Codex Review (reviewed commit 4580bbcf35)
```yaml
  RESULTS_DIR: bench-results
  # Match the path that vllm-metal's install.sh uses when run via
  # curl | bash (it hard-codes $HOME/.venv-vllm-metal in that branch).
  VLLM_VENV: /Users/runner/.venv-vllm-metal
```
Derive VLLM_VENV from runner home directory
Avoid hard-coding VLLM_VENV to /Users/runner/.venv-vllm-metal: the vllm-metal installer writes into ~/.venv-vllm-metal, so on self-hosted macOS runners where $HOME is not /Users/runner, the workflow looks in the wrong path, reports vllm as missing, and silently skips the vLLM phase even when phases includes vllm. This makes the advertised three-way self-hosted run degrade to two-way without a hard failure.
Fixed in 81e97b7 by removing the VLLM_VENV env entirely. run.sh already defaults to ${VLLM_VENV:-$HOME/.venv-vllm-metal} (line 43), and the vllm-metal installer writes there too — letting the default resolve at shell-time gets the correct path on both GitHub-hosted (/Users/runner/...) and self-hosted runners.
Compares two paths into the same MLX inference engine:

1. mlx-lm.server (direct HTTP/OpenAI)
2. SMG router + SMG MLX gRPC servicer

Both end at the same BatchGenerator; the difference is the transport and the work the router does (Rust tokenization, chat templating, JSON↔proto conversion).

Files:

- benchmarks/mlx_grpc_vs_http/run.sh — bash orchestrator. Spins up mlx-lm HTTP, runs genai-bench across the matrix, tears down, then starts SMG MLX servicer + SMG router and runs the same matrix against http://router.
- benchmarks/mlx_grpc_vs_http/aggregate.py — parses genai-bench JSON output (aggregated_metrics.stats.*) and emits a side-by-side markdown table.
- benchmarks/mlx_grpc_vs_http/README.md — local-run instructions and what the benchmark measures.
- .github/workflows/nightly-mlx-bench.yml — runs on macos-latest (Apple Silicon) on a nightly schedule and via workflow_dispatch with overridable inputs (model, concurrencies, scenarios, duration). Mirrors the SUMMARY.md to GITHUB_STEP_SUMMARY so the table is visible without downloading artifacts.

Defaults match the user's request:

- Model: mlx-community/gemma-3-4b-it-qat-4bit (~3 GB, fits 16 GB Mac)
- Concurrencies: 1, 4, 16, 64
- Scenarios: D(100,256) and D(2000,128)
- 10 min/cell ≈ 2.7 hours total per run

Reuses the genai-bench docker image already used by the existing nightly-benchmark workflow (same patterns from e2e_test/benchmarks/conftest.py).

Signed-off-by: key4ng <[email protected]>
… test PR pushes now run a reduced matrix (2 concurrencies × 1 scenario × 2 min/cell ≈ 12 min) to validate the orchestrator + aggregator end to end before merge. Path-filtered so only changes to the benchmark files or workflow itself trigger it.

REMOVE the pull_request: block before merging — production triggers should only be schedule + workflow_dispatch.

Signed-off-by: key4ng <[email protected]>
Critical fixes:

- macos-latest runners don't ship Docker. Install Colima in the workflow so docker pull/run actually work. (claude P1, codex P1, coderabbit major).
- --max-time-per-run is in SECONDS, not minutes. Multiply DURATION_MIN by 60 in run.sh before passing it to genai-bench, otherwise DURATION_MIN=10 produces 10-second cells, not 10-minute cells. Confirmed against e2e_test/benchmarks/test_nightly_perf.py:30 (claude P1).
- --network host doesn't share the host network on Docker for Mac / Colima — containers run inside a Linux VM. Drop --network host and rewrite 127.0.0.1/localhost in the bench client's base URL to host.docker.internal, with --add-host=host.docker.internal:host-gateway to make the lookup work (claude P1, coderabbit major).

Minor fixes:

- PIDS array element 'removal' was using parameter substitution, which replaces the matching value with an empty string instead of dropping the element. Rebuild the array properly (claude nit, coderabbit nit).
- aggregate.py: sort json_files for deterministic selection if a partial rerun left stale files (coderabbit nit).
- aggregate.py: mkdir parent of --out before writing in case the user passes a fresh path (coderabbit nit).

Signed-off-by: key4ng <[email protected]>
Colima boot is unstable on macos-latest GitHub runners (intermittent 'error starting vm: exit status 1' from the lima VZ driver). genai-bench is a public pip package with the same CLI as the docker entrypoint, so 'genai-bench benchmark <args>' replaces 'docker run ... benchmark <args>' verbatim. Also drops the host.docker.internal URL rewriting (no container = no VM = 127.0.0.1 reaches the host directly), and the HF cache mount (genai-bench reads HF_HOME from the same env it inherits).

Workflow:

- Removed 'Install Docker (Colima)' step.
- Replaced 'Pull genai-bench image' with 'Install genai-bench' via pip.
- Verification step also calls 'genai-bench --help' to fail fast if the CLI didn't install cleanly.

run.sh is ~30 lines smaller, README updated.

Signed-off-by: key4ng <[email protected]>
genai-bench multiplies the input by 60 internally
(genai_bench/cli/cli.py:326: 'max_time_per_run *= 60'), so passing 120
makes each cell run for 7200 seconds = 2 hours, not 2 minutes.
The misleading comment in e2e_test/benchmarks/test_nightly_perf.py:30
('seconds per scenario×concurrency combo') is what threw me off in
the previous round of review fixes. Reading the actual genai-bench
source confirms minutes is the correct unit.
This was the cause of run #24978016312 sitting at 'in progress' for
74 minutes when it should have completed in ~10 minutes total.
Signed-off-by: key4ng <[email protected]>
Two fixes:
1. run.sh: '${new_pids[@]}: unbound variable' under set -u when phase 1
finished and the only tracked pid (mlx-lm) was filtered out. The
array reference produced 'new_pids[@]: unbound variable' → script
exited silently before phase 2 (gRPC) could run.
Replaced the rebuild loop with PIDS=() since mlx-lm is the only
server running during phase 1; resetting PIDS to empty is correct
and avoids the unbound-array gotcha entirely.
2. workflow: PR-trigger now runs the full matrix (4 conc × 2 scenarios
× 10 min × 2 phases = ~2.7 h) instead of the smoke matrix, so we
can collect real comparison numbers on this PR before merge.
Marked TEMPORARY — switch back to the smoke shape (or remove the
pull_request trigger) before merging.
Signed-off-by: key4ng <[email protected]>
…ilure Two independent fixes that together give us partial-but-useful results even when something falls over:

1. run.sh now accepts PHASE=http|grpc|all (default 'all' for local). Each phase is a self-contained function with its own server lifecycle, so the workflow can run them as two separate steps. Phase failure no longer leaks to the other phase.
2. run_bench wraps the genai-bench call in 'if !'. A failing cell (e.g. mlx-lm.server collapsing under c=64 with all 503s) writes a .failed marker to the experiment dir and the loop continues. aggregate.py already prints '—' for cells with no JSON, so partial results render cleanly.

Workflow:

- 'Run benchmark — Phase 1 (direct HTTP)' runs PHASE=http.
- 'Run benchmark — Phase 2 (SMG router + gRPC)' runs PHASE=grpc with if: always() so phase 2 still executes if phase 1 errored out.

This unblocks the case from run #25015745343 where phase 1 c=64 took mlx-lm.server down and the script exited before phase 2 ever started.

Signed-off-by: key4ng <[email protected]>
Per discussion on PR — 'vllm bench serve' isn't usable on macos
because vllm publishes Linux/CUDA wheels only. Replaces genai-bench
(which only supports synthetic D(in,out) length distributions) with a
small custom client that loads real prompt distributions:
- chat: anon8231489123/ShareGPT_Vicuna_unfiltered — typical chat
user prompts, exercises TTFT + decode throughput.
- agent: vdaita/edit_10k_char — real local coding-agent prompts with
~2.5k token code context. Models the canonical Mac local-AI
workload (Cursor/Continue/Cline editing files). Stresses
prefill + tokenization, where Rust router should win big.
bench_client.py:
- Async httpx + HF datasets, ~250 LOC.
- Output JSON shaped like 'vllm bench --save-result' so we can swap
to vllm bench upstream if/when mac support lands.
- Per-cell exit non-zero only when *every* request failed (server
overload), letting run.sh distinguish complete failure from partial.
run.sh:
- prompts_for(scenario, concurrency) scales prompt count so each
cell has a meaningful number of completed requests at every level
(50 → conc*4 for chat, 30 → conc*2 for agent — slower prefill).
- Bench cell call replaced with python3 bench_client.py.
aggregate.py:
- Reads result.json from each experiment dir (new shape).
- Renders per-scenario tables (chat, agent) with RPS, output tok/s,
TTFT p50/p95, TPOT p50, and errors/total.
Workflow:
- pip install httpx[http2] + datasets (replaces genai-bench).
- workflow_dispatch inputs updated for new scenario names.
- Default scenarios = 'chat agent', no DURATION_MIN (driven by
num_prompts now).
Signed-off-by: key4ng <[email protected]>
Signed-off-by: key4ng <[email protected]>
cancel-in-progress: false meant new PR pushes queued behind the running one, and GitHub auto-cancelled stale queued runs when even newer pushes came in — so a series of fast pushes left the actual latest code never running.

Switch to cancel-in-progress: true ONLY for PR events. Scheduled nightlies and manual workflow_dispatch runs still queue normally (don't kill in-flight nightly runs).

Signed-off-by: key4ng <[email protected]>
The custom bench_client.py + vllm bench experiments produced unstable TTFT/TPOT numbers for mlx-lm.server (vllm bench saw all SSE chunks coalesced in one network read on localhost, so per-token ITL was always ~0). Switching to genai-bench — the same harness the rest of the SMG nightly suite uses — gives a clean schema with stats.ttft / stats.tpot / mean_output_throughput_tokens_per_s, and matches what e2e_test/benchmarks already reports.

The same change adds vllm-metal as a third backend so the nightly publishes a real 3-way comparison: mlx-lm direct HTTP vs SMG router + MLX gRPC servicer vs vllm-project/vllm-metal `vllm serve`.

Changes:

- run.sh: unified 3-phase orchestrator (mlx | grpc | vllm), each opt-in via PHASES env. genai-bench replaces the custom client. vllm phase auto-skips if VLLM_VENV is missing.
- aggregate.py: parses genai-bench's aggregated_metrics.stats schema and emits one comparison table per scenario across all three backends.
- nightly-mlx-bench.yml: best-effort vllm-metal install via the project's install.sh into a workspace-local venv; three independent phase steps with `if: always()` so a crash in one backend doesn't block the others.
- README.md: documents the new 3-way layout, scenarios, and tunables.
- bench_client.py / aggregate_local.py / prepare_datasets.py / local_data removed — replaced by genai-bench.

Scenarios are now genai-bench's deterministic shapes:

- chat = D(100,256) — short prompt + medium output
- agent = D(2500,256) — RAG / code-edit / Cursor-style local agent traffic

Signed-off-by: key4ng <[email protected]>
… budget Three fixes for the first 3-way CI run on PR #1399:

1. vllm-metal install path. The project's install.sh hard-codes `$HOME/.venv-vllm-metal` when run via `curl | bash` — VLLM_METAL_VENV wasn't honored, so we ended up with the venv at the upstream default while VLLM_VENV pointed at the workspace. The workflow now drops the manual venv creation and matches that hard-coded path; install.sh is self-sufficient (auto-installs uv, creates the venv, installs the wheel). VLLM_VENV is set to /Users/runner/.venv-vllm-metal.
2. vllm phase no longer aborts the script when `vllm serve` fails to come up. We now invoke `$VLLM_VENV/bin/vllm` directly (no `source activate` — that way background children inherit the right interpreter reliably), and wrap `wait_for_openai` in `if !` so a startup failure logs the tail of vllm-metal.log and returns 0 instead of tripping `set -e`. Phase 3 stays green even if vllm-metal can't boot.
3. PR runs use 4 min/cell (was 2). At 2 minutes, the `D(2500,256) × concurrency=4` cell finished zero requests on the macos-latest runner — genai-bench then crashed in its Excel post-processing because metrics were empty, leaving the cell as `—` in the summary table. 4 minutes gives at least one full prefill+decode cycle to complete.

Also documented in the README:

- mlx-lm.server's TPOT≈0 is a real localhost SSE buffering quirk in mlx-lm; the smg→mlx-grpc row in the same scenario shows a real TPOT despite hitting the same MLX BatchGenerator underneath. Worth highlighting rather than hiding — it's exactly the kind of streaming correctness the SMG router gives you.
- High-concurrency agent cells will render `—` if DURATION_MIN runs out before any 2.5k-token prefill completes.

Signed-off-by: key4ng <[email protected]>
… TPOT
The "mlx-lm TPOT is always 0" symptom traced earlier to the wrong cause.
Wire-level probes (raw socket recv + aiohttp.iter_any) BOTH show mlx-lm
streaming per-token SSE chunks at ~18 ms intervals — exactly the cadence
you'd expect for ~50 tok/s decode. mlx-lm and aiohttp are not the bug.
The actual bug is in genai-bench (genai_bench/user/openai_user.py:335):
for chunk in response.iter_lines(chunk_size=None):
`requests.iter_lines(chunk_size=None)` calls
`urllib3.HTTPResponse.read(amt=None)` under the hood, which reads to EOF
before yielding when the response has neither `Content-Length` nor
`Transfer-Encoding: chunked`. mlx-lm.server uses Python's
`BaseHTTPRequestHandler`, which defaults to HTTP/1.0 and emits neither
header for streaming responses → the bench client buffers the entire
generation and processes all per-token chunks at end-of-stream → TPOT
collapses to ~0 and TTFT inflates to e2e latency.
Local repro after applying the same patch:
| Metric      | chunk_size=None | chunk_size=1 |
|-------------|-----------------|--------------|
| TTFT mean   | 5,452 ms        | 486 ms       |
| TPOT mean   | ~0 ms           | 19.3 ms      |
| E2E latency | 5,452 ms        | 2,938 ms     |
19.3 ms TPOT matches the wire-level inter-chunk gap exactly.
The patch is benign for the smg→mlx-grpc and vllm-metal phases — both
emit HTTP/1.1 with `Transfer-Encoding: chunked`, where urllib3 yields
per chunked frame regardless of the chunk_size argument.
Workflow change: a `Patch genai-bench streaming chunk size` step runs
`sed` on the installed openai_user.py right after `pip install
genai-bench`, swapping `chunk_size=None` for `chunk_size=1`. Fails the
job loudly (rather than silently) if the pattern isn't found, so a
genai-bench upgrade that renames or removes the call site won't ship a
bench run with bogus mlx-lm metrics.
README updated to reflect the actual root cause (drops the wrong "SSE
buffering quirk in mlx-lm" wording from the previous fix attempt) and
documents the equivalent local one-liner for off-CI runs.
Signed-off-by: key4ng <[email protected]>
PR #1414 (drain-and-batch fix for the gRPC servicer's BatchGenerator rope crash) is now on main; rebasing this branch onto it lets the next CI run finally produce a clean three-way comparison without the c≥4 agent rope corruption that dominated previous runs. This commit makes two changes on top of that rebase: 1. CI MODEL default → mlx-community/gemma-3-1b-it-4bit (~0.6 GB). The previous 4 B QAT model was ~3 GB on disk; combined with vllm-metal's ~1.7 GB overhead and the macos-latest runner's ~5 GB Metal budget, the KV-cache budget went negative and the vllm phase OOM'd at startup every run — we never got vllm-metal numbers in CI. The 1 B variant fits comfortably and lets all three backends produce metrics. The bench is comparing the *same model* across three serving paths, so the relative router/streaming comparison (TTFT shape, RPS scaling, TPOT correctness) stays meaningful at 1 B; only the absolute throughput numbers shift. For production-shaped 4 B+ numbers, use local_three_way (run.sh) on a real Mac, which still defaults to gemma-3-4b-it-qat-4bit, or override MODEL via workflow_dispatch on a beefier runner (self-hosted, or macos-latest-xlarge with 14 GB). 2. Tightened the genai-bench chunk_size=1 patch comment in the workflow. Previous wording overstated the case as "HTTP/1.0 buffering bug" — the actual bug class is "no length-delimiting framing on the response", and HTTP/1.0 is just one way mlx-lm lands in that state (HTTP/1.1 with Connection: close and no Content-Length / chunked would reproduce too). Also clarified that chunk_size=1 doesn't read 1 byte at a time — urllib3's read(amt=N) returns up to N bytes from whatever's currently buffered, so it effectively means "yield per recv()". The patch is a no-op for the smg→mlx-grpc and vllm-metal phases (HTTP/1.1 + Transfer-Encoding: chunked), since urllib3 decodes chunk frames before the buffer the iterator reads from. 
README updated to document the CI vs local model split as a table with the rationale, so the next reader doesn't have to dig through git history to understand why CI numbers are at 1 B. Signed-off-by: key4ng <[email protected]>
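The read(amt) behaviour invoked in the chunk_size=1 explanation above can be illustrated with a toy stand-in. This is an illustration of the buffering contract only — not urllib3's actual code — and the simulated payload is made up:

```python
# Toy stand-in (NOT urllib3 itself) for the read(amt) contract:
# read(amt=N) returns up to N bytes from whatever is already buffered,
# and only touches the socket when the buffer is empty -- so amt=1
# does not mean one recv() per byte.
class BufferedBody:
    def __init__(self, recvs):
        self._recvs = list(recvs)  # each entry models one socket recv()
        self._buf = b""
        self.recv_count = 0

    def read(self, amt):
        if not self._buf and self._recvs:
            self._buf = self._recvs.pop(0)  # one simulated recv()
            self.recv_count += 1
        out, self._buf = self._buf[:amt], self._buf[amt:]
        return out

body = BufferedBody([b"data: tok\n\n"])
pieces = []
while True:
    b = body.read(1)
    if not b:
        break
    pieces.append(b)
# All eleven 1-byte reads were served by a single simulated recv().
```

The same drain-the-buffer behaviour is why chunk_size=1 surfaces each received token line promptly instead of forcing a syscall per byte.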
…le variant)

The 1B non-QAT model from the previous commit didn't fix the vllm phase: vllm-metal silently fell back to PyTorch CPU/fp32 because the non-QAT 4-bit MLX checkpoints aren't recognized by vllm-metal's Metal path. Once on the CPU path it tries to allocate `--gpu-memory-utilization 0.92 × 7 GiB = 6.44 GiB`, but only 3.65 GiB is free on the runner — startup aborts before the engine reports ready.

Two facts collected from the docs and the failing logs:

1. macos-latest exposes ~5 GB of Metal memory; lowering `--gpu-memory-utilization` doesn't help fit a 4B model (the flag gates how much Metal vllm-metal can use, not how much it needs). The 4B QAT already failed with a NEGATIVE KV-cache budget at the default fraction=0.9; lowering the fraction makes it worse.

2. vllm-metal's Metal path only picks up QAT-format MLX checkpoints (`*-qat-4bit`). Regular 4-bit checkpoints silently route through a PyTorch CPU/fp32 fallback that's documented as "experimental and unreliable in 2026".

The intersection of those two constraints is `gemma-3-1b-it-qat-4bit` (~0.7 GB on disk): small enough to leave ~2 GB of KV headroom inside the runner's 5 GB Metal budget, and QAT-format so vllm-metal picks the Metal path. This is the only CI-viable Metal-backed model on the macos-latest runner today. run.sh's local default (gemma-3-4b-it-qat-4bit) is unchanged — real Mac hardware with ≥16 GB unified RAM happily serves 4B QAT.

README's CI-vs-local model table updated with the QAT requirement, the gpu-memory-utilization explanation, and a more accurate set of "why CI uses 1B-QAT" reasons. workflow_dispatch input description likewise expanded so override callers see the QAT constraint up front.

Signed-off-by: key4ng <[email protected]>
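The allocation arithmetic quoted above can be checked directly. Figures are the ones in the commit message; the multiplication is only the requested allocation, not vllm-metal's full memory accounting:

```python
# Back-of-envelope check of the numbers in the failing log above.
gpu_memory_utilization = 0.92
total_gib = 7.0   # macos-latest runner RAM
free_gib = 3.65   # free at vllm startup, per the log

requested_gib = gpu_memory_utilization * total_gib  # ~6.44 GiB
shortfall_gib = requested_gib - free_gib            # startup aborts here
```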
Two changes after another vllm-metal CI failure on the gemma-3-1b-it-qat-4bit attempt: even with the QAT model, vllm-metal silently fell back to PyTorch CPU/fp32 (`device_config=cpu`) and aborted at startup trying to allocate `--gpu-memory-utilization 0.92 × 7 GiB = 6.44 GiB` in only 3.39 GiB of available RAM. Root cause is structural: the macos-latest GitHub runner is a VM that doesn't reliably expose Metal GPU access, so vllm-metal's Metal worker can't initialize. The PyTorch MPS fallback path is documented as "experimental and unreliable in 2026". This isn't fixable from our workflow.

1. Drop vllm from the CI default `PHASES`. New default is `mlx grpc` (two-way). The workflow_dispatch input default still includes vllm so self-hosted M-series runs can opt in. The `Install vllm-metal` step now has `if: contains(env.PHASES, 'vllm')` so it doesn't burn ~2 min on every CI run downloading wheels that won't get used. Local `run.sh` default is still `mlx grpc vllm` — real Apple Silicon hardware (≥16 GB unified RAM) runs the three-way fine.

2. Swap MODEL default to `mlx-community/gemma-4-e2b-it-4bit` (~3.2 GB). Gemma 4 E2B has day-0 mlx-lm support and is the smallest production-quality MLX checkpoint that fits the runner. Local `run.sh` default stays at gemma-3-4b-it-qat-4bit for hardware that can serve 4B comfortably.

README's defaults table now lists PHASES alongside MODEL, with a short explanation of why vllm-metal isn't in the CI matrix and how to opt back into the three-way (local hardware or workflow_dispatch override on a self-hosted M-series runner). The previous two-paragraph "why CI uses 1B-QAT" rationale is gone — it was specific to a constraint that no longer applies once vllm-metal is dropped.

The job name is updated to "MLX two-way (mlx-lm vs SMG+gRPC; vllm-metal local-only)" so the GitHub Actions UI tells the truth about what runs in CI vs what's deferred to local hardware.

Signed-off-by: key4ng <[email protected]>
…loadable via mlx_lm.server)
Latest CI run failed the mlx phase with
ValueError: Received 140 parameters not in model:
language_model.model.layers.X.self_attn.k_norm.weight, ...
mlx-lm 0.31.x added Gemma 4 support, but only for the *dense* 26B/31B
text variants. The Gemma 4 "Effective" e2b/e4b checkpoints are
multimodal-only — their safetensors carry text + vision + audio tower
weights under `language_model.*`, `vision_tower.*`, `audio_tower.*`
namespaces, and `mlx_lm.server`'s `load()` only instantiates text-only
CausalLM classes. There is no mlx-lm version (current or imminent)
that loads multimodal Gemma 4 checkpoints through the text-only server
path; bridging that to mlx-vlm's multimodal loader is a separate
workstream.
Switching CI MODEL back to gemma-3-4b-it-qat-4bit:
- Same as local run.sh default (apples-to-apples between CI and
local), so anyone reading both summaries can compare directly.
- Fits macos-latest's 7 GB RAM for the mlx and grpc phases (which
run sequentially, so each gets full RAM).
- Known-working with the mlx-lm pip pin we already use — no version
gambling.
- The merged drain-and-batch fix (PR #1414) eliminates the c≥4
rope crash that was the real bench-quality concern; with that
landed, gemma-3-4b QAT produces clean three-way numbers locally
and clean two-way numbers in CI.
The previous Gemma 4 swap commit (57e8888) was made on the
assumption that mlx-lm's day-0 Gemma 4 line covered all variants;
this is the corrective revert with a docstring explaining the
multimodal-vs-text-only gap so the next reader doesn't try the same.
Signed-off-by: key4ng <[email protected]>
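The namespace mismatch described above can be sketched in a few lines. This is illustrative only — not mlx-lm's actual `load()` code — and the example weight names are assumptions shaped after the error excerpt and the namespaces named in the commit message:

```python
def params_not_in_model(checkpoint_keys, model_keys):
    """Keys present in the safetensors but absent from the instantiated
    text-only model -- the set reported in the
    'Received N parameters not in model' error."""
    return sorted(set(checkpoint_keys) - set(model_keys))

# A multimodal Gemma 4 checkpoint nests text weights under
# language_model.* and adds vision/audio towers; a text-only CausalLM
# expects bare model.* names, so every multimodal key is "not in model".
checkpoint = [
    "language_model.model.layers.0.self_attn.k_norm.weight",
    "vision_tower.encoder.blocks.0.attn.qkv.weight",
    "audio_tower.conformer.0.ffn.weight",
]
text_only_model = ["model.layers.0.self_attn.k_norm.weight"]
extra = params_not_in_model(checkpoint, text_only_model)
```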
Three issues flagged by bot review on commit 6570420 / 57e8888:

1. Workflow name said "Nightly MLX Three-Way Benchmark" but the job name was already "MLX two-way (mlx-lm vs SMG+gRPC; vllm-metal local-only)" since vllm was dropped from the CI default. The two were inconsistent; renamed the top-level workflow to "Nightly MLX Two-Way Benchmark" so the GitHub Actions UI tells the truth at every level.

2. `pip install "mlx-lm>=0.22.0"` let each nightly run pull whatever the latest mlx-lm release was at install time. For a benchmark meant to track SMG-side performance trends over time, that drift is noise — a faster decode loop in mlx-lm 0.31.4 would look like an SMG improvement. Pinned to exact version `mlx-lm==0.31.3` (what the failing run actually resolved to and what works with the gemma-3-4b-it-qat-4bit checkpoint). Future bumps now happen in their own commits, making trend breaks explicit in the bench summary diffs rather than implicit in HF/PyPI release timing.

3. Local README install instructions said `pip install "mlx-lm>=0.22.0" genai-bench` but the workflow knows `oci_openai` is a transitive dep that genai-bench imports eagerly — without it, `genai-bench --help` crashes on a clean wheel install. The CI workflow already pins it explicitly; the README didn't, so a developer following local instructions would fail before the bench even started. Updated the README to include `oci_openai` and the same pinned mlx-lm version as CI, with a brief note explaining why both are needed.

The temporary `pull_request` trigger flagged by CodeRabbit (#1399 review on b068350) is deliberately kept until just before merge — we need PR-trigger smoke runs while the harness is still being validated. Will be removed in the final pre-merge commit.

Signed-off-by: key4ng <[email protected]>
…hes reality
Until this commit, the bench summary was hardcoded to render all
three backend rows ("mlx-lm.server", "smg → mlx-grpc", "vllm-metal")
with the heading "MLX Three-Way Benchmark", regardless of which
phases the workflow actually executed. With the CI default now
`PHASES=mlx grpc` (vllm-metal needs Metal GPU access the macos-
latest VM doesn't expose), the summary table looked like vllm-metal
had failed every cell — every vllm-metal row was a column of `—`,
and the document title still said "Three-Way".
Aggregator now infers which backends are present from the result
directories that exist:
- `LABEL_ORDER` keeps mlx → grpc → vllm display order; only labels
with at least one matching result dir are rendered.
- Title shifts: 1 backend = "Single-backend", 2 = "Two-Way",
3 = "Three-Way".
- The "Backends:" intro list also drops anything that didn't run, so
readers see exactly the backends present in the rows below.
Verified locally with synthetic data:
- Two-way (mlx + grpc): "MLX Two-Way Benchmark", two rows per cell,
no vllm-metal artifacts.
- Three-way (mlx + grpc + vllm): "MLX Three-Way Benchmark", three
rows per cell, behavior unchanged from before.
Signed-off-by: key4ng <[email protected]>
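The backend-inference logic described above can be sketched as follows. This is a hedged sketch, not the real aggregate.py: result dirs are assumed to be named `<prefix>_*` (the PR shows `http_*` for the mlx phase; `grpc_*` and `vllm_*` are guesses), and the labels come from the commit message:

```python
from pathlib import Path

LABEL_ORDER = [  # keeps mlx -> grpc -> vllm display order
    ("http", "mlx-lm.server"),
    ("grpc", "smg → mlx-grpc"),
    ("vllm", "vllm-metal"),
]
TITLES = {1: "Single-backend", 2: "Two-Way", 3: "Three-Way"}

def present_backends(results_dir):
    """Only labels with at least one matching result dir are rendered."""
    root = Path(results_dir)
    return [label for prefix, label in LABEL_ORDER
            if any(root.glob(f"{prefix}_*"))]

def title(present):
    """1 backend = Single-backend, 2 = Two-Way, 3 = Three-Way."""
    return f"MLX {TITLES[len(present)]} Benchmark"
```

With only `http_*` and `grpc_*` dirs on disk this yields two rows per cell under a "MLX Two-Way Benchmark" heading, matching the synthetic-data check above.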
> doesn't reliably expose Apple GPU passthrough — vllm-metal silently
> falls back to a PyTorch CPU/fp32 path that allocates ~6 GB and OOMs
> at startup on the 7 GB runner. To run the three-way comparison,
> either run `local_three_way` on real Apple Silicon, or override
🟡 Nit: `local_three_way` is referenced here and in two workflow comments (lines 76, 191 of nightly-mlx-bench.yml), but no script, Makefile target, or function by that name exists. The local three-way run is just `./benchmarks/mlx_grpc_vs_http/run.sh` with the default `PHASES=mlx grpc vllm`.
Suggested change:
- either run `local_three_way` on real Apple Silicon, or override
+ either run `./benchmarks/mlx_grpc_vs_http/run.sh` on real Apple Silicon, or override
Fixed in 81e97b7. README.md line 138 now points at `./benchmarks/mlx_grpc_vs_http/run.sh` (with default `PHASES=mlx grpc vllm`) to make the three-way default explicit. Same in nightly-mlx-bench.yml lines 76 and 191.
…VLLM_VENV portability, doc references

* run.sh: replace `sleep 5` between mlx-grpc port-open and SMG router start with `wait_for_log_line "gRPC server listening on"`. The port binds before warmup; the log line is the real ready signal.
* aggregate.py: guard `collect()` against a missing results dir so the `if: always()` aggregate step doesn't crash when no phase produced output (setup failure, empty PHASES).
* nightly-mlx-bench.yml: stop pinning `VLLM_VENV` to `/Users/runner/...`. vllm-metal's installer writes to `$HOME/.venv-vllm-metal` and run.sh defaults to the same; hard-coding the GitHub-hosted path silently broke the vllm phase on self-hosted runners with a different $HOME.
* README.md + workflow comments: replace `local_three_way` (which isn't a real script/target) with the actual `run.sh` invocation.

Signed-off-by: key4ng <[email protected]>
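The readiness pattern in the first bullet — poll the log for the real ready line instead of sleeping a fixed 5 s — looks roughly like this. run.sh implements it as a bash helper; this Python rendering is illustrative, and the timeout/poll values are arbitrary:

```python
import time

def wait_for_log_line(log_path, needle, timeout=60.0, poll=0.5):
    """Return True once `needle` appears in the log, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open(log_path, errors="replace") as f:
                if needle in f.read():
                    return True
        except FileNotFoundError:
            pass  # server may not have created the log file yet
        time.sleep(poll)
    return False
```

The point of keying on the log line rather than the open port is exactly the one the commit makes: the port binds before model warmup finishes, so the log line is the only trustworthy ready signal.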
Description
Problem
There's no way to measure how the SMG router + MLX gRPC servicer (#1099) compares against mlx-lm's own HTTP server in throughput / latency terms. We expect the router to win at high concurrency (Rust tokenization, no GIL contention, protobuf for streaming) and roughly tie at concurrency=1 — but expectations need data.
Solution
A nightly benchmark on `macos-latest` (Apple Silicon GitHub-hosted runners) that drives the same `genai-bench` matrix against both paths and emits a comparison table. Both paths land at the same mlx-lm `BatchGenerator`, so the diff is purely transport + tokenization location.

Changes
Bench harness (`benchmarks/mlx_grpc_vs_http/`)
- `run.sh` — bash orchestrator. Phase 1 starts `mlx_lm.server` and runs the matrix; Phase 2 starts the SMG MLX servicer + SMG router and runs the matrix. All tunables via env vars.
- `aggregate.py` — parses `genai-bench` output JSON (`aggregated_metrics.stats.*`) and emits a markdown comparison table grouped by traffic scenario.
- `README.md` — local-run instructions, what the benchmark measures, expected results.
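The `aggregated_metrics.stats.*` parsing that aggregate.py does can be sketched as a small reader. The JSON path comes from the description above; the metric and stat names in the example ("ttft", "mean") are assumptions, not confirmed genai-bench field names:

```python
import json

def load_stat(path, metric, stat="mean"):
    """Pull one statistic out of a genai-bench result file,
    navigating the aggregated_metrics.stats.* structure."""
    with open(path) as f:
        doc = json.load(f)
    return doc["aggregated_metrics"]["stats"][metric][stat]
```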
CI (`.github/workflows/nightly-mlx-bench.yml`)
- Nightly schedule (`17 9 * * *` UTC), and `workflow_dispatch` with overridable inputs (`model`, `concurrencies`, `scenarios`, `duration_min`).
- Runs on `macos-latest` (Apple Silicon).
- `cargo build --release --bin smg` (release profile — performance-critical, not the `ci` profile we use for correctness E2E).
- `ghcr.io/moirai-internal/genai-bench:0.0.4` — same image used by the existing `nightly-benchmark.yml` workflow.
- Writes the comparison to `$GITHUB_STEP_SUMMARY` so it is visible directly in the run UI without artifact download.

Default matrix
- Model: `mlx-community/gemma-3-4b-it-qat-4bit` (~3 GB)
- Scenarios: `D(100,256)` short prompts, `D(2000,128)` long-prompt RAG-style

Test Plan
In CI: trigger via Actions → Nightly MLX Benchmark → Run workflow with `duration_min=1` for a fast smoke run before relying on the full nightly schedule.

Checklist
- `cargo +nightly fmt` passes (no Rust changes)
- `cargo clippy --all-targets --all-features -- -D warnings` passes (no Rust changes)
- `bash -n run.sh` ok
- `ruff check` clean

Summary by CodeRabbit
New Features
Documentation