
feat(mlx-bench): nightly benchmark — MLX direct HTTP vs Router+gRPC#1399

Open
key4ng wants to merge 20 commits into main from keyang/mlx-bench

Conversation

@key4ng
Collaborator

@key4ng key4ng commented Apr 27, 2026

Description

Problem

There's no way to measure how the SMG router + MLX gRPC servicer (#1099) compares against mlx-lm's own HTTP server in terms of throughput and latency. We expect the router to win at high concurrency (Rust tokenization, no GIL contention, protobuf for streaming) and roughly tie at concurrency=1 — but expectations need data.

Solution

A nightly benchmark on macos-latest (Apple Silicon GitHub-hosted runners) that drives the same genai-bench matrix against both paths and emits a comparison table.

       ┌─ Phase 1 ─────────────────────────┐
clients ─▶ mlx_lm.server (HTTP) ─▶ MLX engine
       └───────────────────────────────────┘

       ┌─ Phase 2 ─────────────────────────────────────────────┐
clients ─▶ SMG router (HTTP) ─gRPC─▶ MLX servicer ─▶ MLX engine
       └───────────────────────────────────────────────────────┘

Both paths land at the same mlx-lm BatchGenerator, so the diff is purely transport + tokenization location.

Changes

Bench harness (benchmarks/mlx_grpc_vs_http/)

  • run.sh — bash orchestrator (matrix loop sketched after this list). Phase 1 starts mlx_lm.server and runs the matrix; Phase 2 starts the SMG MLX servicer + SMG router and runs the matrix. All tunables via env vars.
  • aggregate.py — parses genai-bench output JSON (aggregated_metrics.stats.*) and emits a markdown comparison table grouped by traffic scenario.
  • README.md — local-run instructions, what the benchmark measures, expected results.
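
A minimal sketch of the scenario × concurrency matrix loop that run.sh drives in each phase (defaults mirror the matrix table below; run_genai_bench is a purely illustrative stand-in for the real genai-bench docker invocation, and the ports are assumptions):

#!/usr/bin/env bash
set -euo pipefail

# Tunables, all overridable via env vars.
MODEL="${MODEL:-mlx-community/gemma-3-4b-it-qat-4bit}"
CONCURRENCIES="${CONCURRENCIES:-1 4 16 64}"
SCENARIOS="${SCENARIOS:-D(100,256) D(2000,128)}"
DURATION_MIN="${DURATION_MIN:-10}"

# Stand-in for the real genai-bench container invocation in run.sh.
run_genai_bench() { echo "genai-bench $*"; }

# run_matrix <phase> <api_base>: run every scenario × concurrency cell against
# one serving path, writing results under bench-results/<phase>_*.
run_matrix() {
  local phase="$1" api_base="$2"
  local scenario concurrency
  for scenario in $SCENARIOS; do
    for concurrency in $CONCURRENCIES; do
      run_genai_bench "$api_base" "$scenario" "$concurrency" \
        "bench-results/${phase}_${scenario}_c${concurrency}"
    done
  done
}

run_matrix http "http://127.0.0.1:8080/v1"   # Phase 1: mlx_lm.server (port illustrative)
run_matrix grpc "http://127.0.0.1:8000/v1"   # Phase 2: SMG router (port illustrative)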

CI (.github/workflows/nightly-mlx-bench.yml)

  • Triggers: nightly cron (17 9 * * * UTC), and workflow_dispatch with overridable inputs (model, concurrencies, scenarios, duration_min).
  • Runner: macos-latest (Apple Silicon).
  • Build: cargo build --release --bin smg (release profile — performance-critical, not the ci profile we use for correctness E2E).
  • Reuses ghcr.io/moirai-internal/genai-bench:0.0.4 — same image used by the existing nightly-benchmark.yml workflow.
  • Mirrors the SUMMARY.md table to $GITHUB_STEP_SUMMARY so the comparison is visible directly in the run UI without artifact download.
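
The step-summary mirroring is plain file concatenation in a workflow run step; a sketch of what that shell presumably looks like (paths taken from the description above):

# Append the generated comparison table to the Actions step summary so it
# renders in the run UI without downloading artifacts.
cat bench-results/SUMMARY.md >> "$GITHUB_STEP_SUMMARY"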

Default matrix

Dimension        Default
Model            mlx-community/gemma-3-4b-it-qat-4bit (~3 GB)
Concurrencies    1, 4, 16, 64
Scenarios        D(100,256) short prompts, D(2000,128) long-prompt RAG-style
Duration         10 min/cell
Total runtime    ~2.7 hours (16 cells)

Test Plan

# Local on Apple Silicon
cargo build --release --bin smg
pip install -e ./crates/grpc_client/python
pip install -e "./grpc_servicer[mlx]"
docker pull ghcr.io/moirai-internal/genai-bench:0.0.4

# Quick sanity (1 min/cell, just one scenario, c=1)
DURATION_MIN=1 CONCURRENCIES="1" SCENARIOS="D(100,100)" \
  ./benchmarks/mlx_grpc_vs_http/run.sh

python3 benchmarks/mlx_grpc_vs_http/aggregate.py

In CI: trigger via Actions → Nightly MLX Benchmark → Run workflow with duration_min=1 for a fast smoke run before relying on the full nightly schedule.
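
The same smoke run can be dispatched from a terminal with the GitHub CLI (the workflow file name is taken from the Changes section above; other dispatch inputs keep their defaults):

# Dispatch a one-minute-per-cell smoke run of the nightly benchmark workflow.
gh workflow run nightly-mlx-bench.yml -f duration_min=1
# Follow the dispatched run's progress from the terminal.
gh run watch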

Checklist
  • cargo +nightly fmt passes (no Rust changes)
  • cargo clippy --all-targets --all-features -- -D warnings passes (no Rust changes)
  • Bash syntax: bash -n run.sh ok
  • Python lint: ruff check clean
  • Workflow YAML validated locally
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • Added a scheduled/manual macOS benchmark to compare MLX HTTP vs Router+gRPC, capture artifacts, and publish summaries.
    • Added a runnable benchmark that executes HTTP and gRPC phases, manages lifecycle, and uploads results and logs.
    • Added a CLI to aggregate experiment outputs into a human-readable SUMMARY.md.
  • Documentation

    • Added detailed README describing how to run, tune, and interpret the benchmarks.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the documentation (Improvements or additions to documentation) and ci (CI/CD configuration changes) labels on Apr 27, 2026
@coderabbitai

coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds a nightly/manual macOS GitHub Actions workflow and a benchmark suite that compares MLX direct HTTP vs Router+gRPC: workflow config, a Bash orchestrator to run two benchmark phases, a Python aggregator to summarize results, and README documentation.

Changes

Cohort / File(s) Summary
GitHub Actions Workflow
.github/workflows/nightly-mlx-bench.yml
New macOS workflow (cron + manual dispatch, temporary PR filter) that provisions toolchain, caches Rust builds, builds smg, installs Python deps and MLX components, runs the benchmark runner, aggregates results to bench-results/SUMMARY.md (appends to GITHUB_STEP_SUMMARY), and uploads artifacts/logs with retention rules.
Benchmark Docs
benchmarks/mlx_grpc_vs_http/README.md
Adds README describing benchmark purpose, metrics, local run instructions, env tunables, and CI invocation.
Aggregation Utility
benchmarks/mlx_grpc_vs_http/aggregate.py
New CLI that parses bench-results/* experiment dirs, extracts TTFT/latency/throughput from genai-bench JSONs, formats values, and emits a per-scenario HTTP vs gRPC markdown summary (bench-results/SUMMARY.md).
Benchmark Orchestration Script
benchmarks/mlx_grpc_vs_http/run.sh
New runner performing two phases: start MLX HTTP server and run genai-bench matrix; then start MLX gRPC servicer + SMG router and repeat. Implements readiness polling, result naming, PID tracking, traps, and result output to bench-results/.
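
A minimal sketch of the readiness-polling and trap-based cleanup pattern this describes (the /v1/models endpoint comes from the sequence diagram below; the timeout and structure are assumptions, not the script's actual code):

# PIDs of background servers started by the orchestrator.
PIDS=()

cleanup() {
  local pid
  for pid in "${PIDS[@]}"; do
    # kill -0 guards against servers that have already exited.
    kill -0 "$pid" 2>/dev/null && kill "$pid"
  done
}
trap cleanup EXIT

# wait_ready <base_url>: poll /v1/models until the server answers or we give up.
wait_ready() {
  local base_url="$1" i
  for i in $(seq 1 120); do
    curl -fsS "${base_url}/v1/models" >/dev/null 2>&1 && return 0
    sleep 1
  done
  echo "server at ${base_url} never became ready" >&2
  return 1
}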

Sequence Diagram(s)

sequenceDiagram
    participant CI as GitHub Actions
    participant Runner as run.sh Orchestrator
    participant Bench as genai-bench
    participant HTTPSrv as MLX HTTP Server
    participant GRPCSvc as MLX gRPC Servicer
    participant Router as SMG Router
    participant FS as Filesystem

    CI->>Runner: Start with MODEL, SCENARIOS, CONCURRENCIES, DURATION

    rect rgba(100, 150, 200, 0.5)
    Note over Runner,FS: Phase 1 — Direct HTTP Benchmark
    Runner->>HTTPSrv: Start mlx_lm.server (HTTP)
    Runner->>HTTPSrv: Poll /v1/models for readiness
    Runner->>Bench: Launch genai-bench per scenario×concurrency
    Bench->>HTTPSrv: Send requests
    HTTPSrv->>Bench: Return responses
    Bench->>FS: Write results to bench-results/http_*
    Runner->>HTTPSrv: Stop HTTP server
    end

    rect rgba(150, 200, 100, 0.5)
    Note over Runner,FS: Phase 2 — gRPC via Router Benchmark
    Runner->>GRPCSvc: Start MLX gRPC servicer
    Runner->>Router: Start SMG router (forward to gRPC)
    Runner->>Router: Poll /v1/models for readiness
    Runner->>Bench: Launch genai-bench per scenario×concurrency
    Bench->>Router: Send requests
    Router->>GRPCSvc: Forward via gRPC
    GRPCSvc->>Router: Return responses
    Router->>Bench: Return responses
    Bench->>FS: Write results to bench-results/grpc_*
    Runner->>Router: Stop router
    Runner->>GRPCSvc: Stop servicer
    end

    Runner->>FS: Aggregate bench-results into SUMMARY.md
    CI->>FS: Upload artifacts and summary

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

tests

Suggested reviewers

  • slin1237
  • CatherineSue

Poem

🐰 I hopped through scripts at break of night,
Started servers, chased each benchmark flight;
HTTP hummed, gRPC joined the race,
I tallied rows and saved each trace —
A crunchy carrot earned for every write.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 15.38%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The pull request title clearly and specifically describes the main change: introducing a nightly benchmark that compares MLX direct HTTP serving versus Router+gRPC architectures.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch keyang/mlx-bench

Comment @coderabbitai help to get the list of available commands and usage tips.

jobs:
  bench:
    name: MLX HTTP vs Router+gRPC
    runs-on: macos-latest

🔴 Important: macos-latest GitHub-hosted runners do not have Docker pre-installed. The docker pull (line 89) and the docker run calls in run.sh will fail.

Options:

  1. Add a step to install Docker (e.g. via colima or the docker/setup-docker-action).
  2. Install genai-bench as a native binary/pip package instead of using the container image.
  3. Run on a self-hosted macOS runner that has Docker available.

Collaborator Author

Fixed in 1f2953e. Went with option 1 (Colima): brew install colima docker && colima start. Adds ~1-2 min to setup but keeps the existing genai-bench docker pattern intact, matching e2e_test/benchmarks/conftest.py.
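
For reference, a sketch of such a setup step's shell (the brew and colima commands come from this thread and the later reply below; the daemon-ready wait loop is an added assumption):

# Provision a Docker daemon on the macOS runner via Colima.
brew install colima docker
colima start --cpu 2 --memory 4 --disk 20
# Wait until the daemon actually responds before any docker pull/run (assumed guard).
until docker info >/dev/null 2>&1; do sleep 2; done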

Comment thread benchmarks/mlx_grpc_vs_http/run.sh Outdated

# ──────────────────────────────────────────────────────────────────────────
# Phase 2: SMG router + MLX gRPC servicer
# ──────────────────────────────────────────────────────────────────────────

🟡 Nit: This pattern-substitution on an array doesn't actually remove the element — it replaces the matching value with an empty string, leaving a ghost entry in PIDS.

Suggested change
# ──────────────────────────────────────────────────────────────────────────
PIDS=("${PIDS[@]/$MLX_HTTP_PID}")

Not harmful here because the cleanup trap guards with kill -0, but the idiomatic fix is:

local new_pids=()
for p in "${PIDS[@]}"; do
  [[ "$p" != "$MLX_HTTP_PID" ]] && new_pids+=("$p")
done
PIDS=("${new_pids[@]}")

Or since this is the only place you remove a PID, you could just skip the removal entirely — the trap already handles dead PIDs gracefully.

Collaborator Author

Same fix as the coderabbit thread — addressed in 1f2953e by rebuilding the array instead of substituting.

Comment thread benchmarks/mlx_grpc_vs_http/run.sh Outdated
Comment on lines +111 to +112
benchmark \
--api-backend openai \

🔴 Important: Even if Docker is installed on the macOS runner, --network host does not work the same way on macOS as on Linux. On Docker for Mac (Docker Desktop, colima, etc.), the Docker engine runs inside a Linux VM, so --network host gives the container the VM's network — not the macOS host's. The --api-base URL pointing to 127.0.0.1 will fail to reach the mlx-lm/smg servers running on the Mac host.

On macOS you'd need to either:

  • Replace 127.0.0.1 with host.docker.internal in the --api-base URL passed to genai-bench, or
  • Drop --network host and use -p port publishing instead.

Since this script also needs to work on macOS for local dev (per the README), consider detecting the OS:

if [[ "$(uname)" == "Darwin" ]]; then
    BENCH_HOST="host.docker.internal"
    DOCKER_NET_ARGS=()
else
    BENCH_HOST="127.0.0.1"
    DOCKER_NET_ARGS=(--network host)
fi

Collaborator Author

Fixed in 1f2953e. Dropped --network host, added --add-host=host.docker.internal:host-gateway, and rewrote the base URL inside run_bench so the bench container reaches the macOS host's inference servers via host.docker.internal instead of trying to share the VM's network namespace.
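
A sketch of that rewrite (the --add-host flag and the host.docker.internal substitution come from this comment; variable names and the port are illustrative):

# The inference servers listen on the macOS host, so the genai-bench container
# must reach them via host.docker.internal rather than 127.0.0.1.
API_BASE="http://127.0.0.1:30000/v1"                          # illustrative port
BENCH_API_BASE="${API_BASE/127.0.0.1/host.docker.internal}"

# Only the networking-relevant docker flags are shown; the rest of the
# genai-bench invocation is unchanged.
docker run --rm --add-host=host.docker.internal:host-gateway \
  ghcr.io/moirai-internal/genai-bench:0.0.4 \
  benchmark --api-backend openai --api-base "$BENCH_API_BASE"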

--num-concurrency "$concurrency" \
--traffic-scenario "$scenario" \
--max-requests-per-run "$MAX_REQUESTS" \
--max-time-per-run "$DURATION_MIN" \

🔴 Important: --max-time-per-run takes seconds, not minutes. See e2e_test/benchmarks/test_nightly_perf.py:30:

_MAX_TIME_PER_RUN = 10  # seconds per scenario×concurrency combo

With the current code, DURATION_MIN=10 passes 10 to --max-time-per-run, meaning each cell runs for 10 seconds instead of the intended 10 minutes. The total runtime would be ~160 seconds, not ~2.7 hours, producing statistically meaningless results.

Fix: convert minutes to seconds:

Suggested change
-  --max-time-per-run "$DURATION_MIN" \
+  --max-time-per-run "$((DURATION_MIN * 60))" \

Collaborator Author

Critical catch — fixed in 1f2953e. run.sh now multiplies DURATION_MIN by 60 before passing it as --max-time-per-run. Verified against the comment in e2e_test/benchmarks/test_nightly_perf.py:30. Without this fix the entire 'nightly' bench would have been 16 × 10 seconds = 160 seconds total, not 2.7 hours.

Collaborator Author

Reverted the previous 'fix' in 16769d5 after empirically catching it on run #24978016312. genai-bench takes --max-time-per-run in minutes (it multiplies by 60 internally at genai_bench/cli/cli.py:326). The comment in e2e_test/benchmarks/test_nightly_perf.py:30 is misleading — the existing tests pass 10 and get 10 minutes, not 10 seconds. With DURATION_MIN=2 and the seconds conversion I added, each cell was scheduled for 7200 seconds = 2 hours (2 × 60 = 120 passed to the flag, then interpreted as 120 minutes), which matches the 74-minute c=1 cell observed in the cancelled run.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 15a26171e8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

jobs:
  bench:
    name: MLX HTTP vs Router+gRPC
    runs-on: macos-latest

P1 Badge Move Docker benchmark job off macOS runner

This job is pinned to runs-on: macos-latest but the workflow later relies on docker pull and docker run (via run.sh) without provisioning a container runtime. On GitHub-hosted runners, macos-latest maps to the arm64 macOS image, and the published macos-15-arm64 image software manifest does not include Docker, so the benchmark fails before collecting any results. Please run this benchmark on a Linux runner or add explicit Docker runtime setup/startup for macOS.

Useful? React with 👍 / 👎.

Collaborator Author

Fixed in 1f2953e. Added a Colima install step to the workflow before any docker call: brew install colima docker && colima start --cpu 2 --memory 4 --disk 20. macOS runner now has a working docker daemon.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/nightly-mlx-bench.yml:
- Around line 29-31: The concurrency group currently uses a dynamic name
`nightly-mlx-bench-${{ github.event_name }}` which separates scheduled and
manual runs; change it to a fixed group name like `nightly-mlx-bench` so all
workflows share a single concurrency group and only one instance runs at a
time—update the `concurrency.group` value to the fixed string and keep
`cancel-in-progress` as desired to ensure proper single-instance behavior.
- Around line 88-92: The workflow currently pulls and runs a Docker image on
macos-latest without installing Docker and the benchmark script run.sh uses
"--network host" which does not behave the same on macOS; either switch the job
to run on ubuntu-latest or add a step to install/configure Docker on macOS
(e.g., use actions/setup-docker or install Docker Desktop CLI) before the "Pull
genai-bench image" and "Run benchmark (this is the long step)" steps, and modify
run.sh to avoid or conditionalize the "--network host" flag on macOS (detect
platform in run.sh and skip or use alternative networking) so the benchmark runs
reliably across runners.

In `@benchmarks/mlx_grpc_vs_http/aggregate.py`:
- Around line 136-139: The code currently writes output via
Path(args.out).write_text(table) without ensuring the parent directory exists;
update the logic in aggregate.py right before the write_text call (near where
collect and build_table are used) to create the parent directory if missing (use
Path(args.out).parent.mkdir(parents=True, exist_ok=True)) and then perform
Path(args.out).write_text(table) so standalone runs won't fail when the target
directory doesn't exist.
- Around line 27-45: parse_experiment currently picks json_files[0] arbitrarily
when multiple JSONs exist; make selection deterministic by sorting json_files
before choosing one (e.g., sort by modification time using Path.stat().st_mtime
or by filename) and then read the chosen file; update the logic in
parse_experiment (variable json_files and the selection of json_files[0]) to
pick the latest/appropriately ordered file so repeated or partial reruns yield a
predictable result.

In `@benchmarks/mlx_grpc_vs_http/run.sh`:
- Line 147: The current removal using PIDS=("${PIDS[@]/$MLX_HTTP_PID}") does
substring replacement and can corrupt entries; replace it by rebuilding the
array to exclude the exact PID: iterate over PIDS (use the PIDS variable) and
append only entries not equal to MLX_HTTP_PID into a NEW_PIDS array, then assign
PIDS=("${NEW_PIDS[@]}"); update the code in run.sh where PIDS and MLX_HTTP_PID
are handled (or alternatively rely on the existing cleanup trap that checks kill
-0), but do not use the substring-replacement form.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cf4091d4-f882-44f5-9a62-62fda3e76faf

📥 Commits

Reviewing files that changed from the base of the PR and between e4d4452 and 15a2617.

📒 Files selected for processing (4)
  • .github/workflows/nightly-mlx-bench.yml
  • benchmarks/mlx_grpc_vs_http/README.md
  • benchmarks/mlx_grpc_vs_http/aggregate.py
  • benchmarks/mlx_grpc_vs_http/run.sh

Comment thread .github/workflows/nightly-mlx-bench.yml Outdated
Comment on lines +29 to +31
concurrency:
  group: nightly-mlx-bench-${{ github.event_name }}
  cancel-in-progress: false
@coderabbitai coderabbitai Bot Apr 27, 2026

🧹 Nitpick | 🔵 Trivial

Concurrency group may cause unexpected cancellations.

The concurrency group nightly-mlx-bench-${{ github.event_name }} means scheduled runs and manual dispatches use different groups (nightly-mlx-bench-schedule vs nightly-mlx-bench-workflow_dispatch). This allows a manual run to execute concurrently with a scheduled run, which could compete for resources or cause confusion. Consider using a fixed group name if only one instance should ever run.

💡 Suggested fix for single-instance concurrency
 concurrency:
-  group: nightly-mlx-bench-${{ github.event_name }}
+  group: nightly-mlx-bench
   cancel-in-progress: false
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-mlx-bench.yml around lines 29 - 31, The
concurrency group currently uses a dynamic name `nightly-mlx-bench-${{
github.event_name }}` which separates scheduled and manual runs; change it to a
fixed group name like `nightly-mlx-bench` so all workflows share a single
concurrency group and only one instance runs at a time—update the
`concurrency.group` value to the fixed string and keep `cancel-in-progress` as
desired to ensure proper single-instance behavior.

Collaborator Author

Already mitigated — the workflow has cancel-in-progress: false, so even though ${{ github.event_name }} produces different group keys for schedule vs workflow_dispatch, neither cancels the other. Keeping as-is intentionally so a manual dispatch never kills an in-progress nightly.


@key4ng, thanks for the clarification — that's a valid and deliberate design. Using ${{ github.event_name }}-scoped groups with cancel-in-progress: false is the correct way to let a manual dispatch and a nightly coexist independently: each group serializes its own event type without interfering with the other. I've noted this for future reviews.


✏️ Learnings added
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1399
File: .github/workflows/nightly-mlx-bench.yml:39-41
Timestamp: 2026-04-27T04:50:22.772Z
Learning: In repo lightseekorg/smg, `.github/workflows/nightly-mlx-bench.yml`: The concurrency group `nightly-mlx-bench-${{ github.event_name }}` is intentional. It creates separate groups for `schedule` and `workflow_dispatch` events so that a manual dispatch run and a nightly scheduled run can coexist without blocking or cancelling each other. `cancel-in-progress: false` ensures neither a new nightly nor a new manual dispatch kills an already-running instance of the same event type. Do not flag this as a bug or suggest collapsing to a single fixed group name.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 635
File: .github/workflows/pr-test-rust.yml:621-621
Timestamp: 2026-03-04T23:20:28.110Z
Learning: In lightseekorg/smg, keeping `-s` (no capture) unconditionally in pytest invocations for parallel E2E jobs in `.github/workflows/pr-test-rust.yml` is intentional. Removing `-s` for parallel jobs was explicitly rejected because: (1) captured output is discarded on CI timeout/hang, which is unacceptable for debugging, and (2) it removes real-time log visibility. Log interleaving from parallel runs is instead addressed by a `_TestNameFormatter` in `e2e_test/conftest.py` that prefixes every log line with `[test_name]`, allowing users to grep for a specific test name to follow its output through interleaved logs.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/benches/wasm_middleware_latency.rs:88-91
Timestamp: 2026-02-21T02:37:02.009Z
Learning: Repo: lightseekorg/smg — For clippy-only/enforcement PRs (e.g., PR `#489`), even micro-optimizations (like replacing an async closure with std::future::ready in benches such as model_gateway/benches/wasm_middleware_latency.rs) should be deferred to a follow-up PR rather than included inline.

Learnt from: pallasathena92
Repo: lightseekorg/smg PR: 687
File: model_gateway/src/routers/openai/realtime/webrtc.rs:238-289
Timestamp: 2026-03-11T01:29:56.655Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/openai/realtime/webrtc.rs: `Metrics::record_router_request` is already emitted in router.rs (around line 1152) before `handle_realtime_webrtc` is called, and `Metrics::record_router_error` is emitted inside `handle_realtime_webrtc` for the no-workers case. The missing instrumentation is success/duration recording after `setup_and_spawn_bridge` returns — this is a metrics improvement deferred to a follow-up PR, not a correctness gap. Do not flag missing success/duration metrics as a blocking issue for PR `#687`.

Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1250
File: model_gateway/tests/api/responses_api_test.rs:643-653
Timestamp: 2026-04-24T20:01:12.229Z
Learning: In repo lightseekorg/smg, PR `#1250` (`model_gateway/tests/api/responses_api_test.rs`): The router-side rejection path for `previous_response_id + mcp_approval_response + no MCP tools` (i.e., a resumed approval continuation where the follow-up request carries zero MCP tool bindings) is intentionally not covered by integration tests in this PR. The existing continuation fixtures all inherit `req1` which carries an MCP binding. A dedicated test for this rejection branch is deferred to a follow-up PR. Do not re-flag the missing coverage in PR `#1250` or similar approval-continuation PRs.

Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 723
File: model_gateway/tests/api/interactions_api_test.rs:316-346
Timestamp: 2026-03-11T05:32:45.536Z
Learning: In repo lightseekorg/smg, `test_interactions_multiple_workers` in `model_gateway/tests/api/interactions_api_test.rs` intentionally only verifies that requests succeed with multiple Gemini backends (connectivity), not load distribution. Load distribution across workers is covered separately in `tests/routing/load_balancing_test.rs`. Do not flag the absence of per-worker request count assertions in this test as a gap — distribution verification is deferred to a follow-up PR once the Gemini router implementation matures.

Learnt from: RohanSogani
Repo: lightseekorg/smg PR: 1333
File: model_gateway/tests/api/responses_api_test.rs:325-373
Timestamp: 2026-04-23T18:59:11.719Z
Learning: In repo lightseekorg/smg, PR `#1333` adds `extract_forwardable_request_headers` and threads forwarded headers through `McpToolSession` → `McpRequestContext` → `HandlerRequestContext`. This PR's scope is strictly internal request-context preservation (preserving request-scoped headers so they are accessible in MCP execution context), NOT outbound MCP header propagation to dynamic MCP servers. The integration test `test_non_streaming_mcp_e2e_accepts_forwardable_request_headers` in `model_gateway/tests/api/responses_api_test.rs` is intentionally a smoke test (verifies MCP still executes when forwardable headers are present); wire-level header assertions via `MockMCPServer::captured_headers()` belong in the follow-up PR covering config-driven outbound MCP header propagation. Do not flag the absence of `captured_headers()` assertions in this test as a gap for PR `#1333`.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 1305
File: model_gateway/src/routers/grpc/harmony/builder.rs:726-728
Timestamp: 2026-04-22T20:29:00.426Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/harmony/builder.rs`: The warn message "Approval item reached Harmony conversion" in `parse_response_item_to_harmony_message` is pre-existing code. The `ComputerCall { .. }` and `ComputerCallOutput { .. }` match arms added in PR `#1305` are forced-cascade additions to keep the exhaustive match valid within a protocol-scope PR. Rewording the warning to a neutral message (e.g. "Unsupported input item reached Harmony conversion") is intentionally deferred to a follow-up router task. Do not re-flag this stale log message in protocol-scope PRs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 936
File: model_gateway/src/routers/gemini/router.rs:82-90
Timestamp: 2026-03-30T18:52:19.689Z
Learning: In repo lightseekorg/smg, the Gemini router's (`model_gateway/src/routers/gemini/router.rs`) use of `model_id` for per-model retry config lookup in `route_interactions` has a pre-existing limitation: when the request carries only an agent identifier, `get_retry_config(agent_id)` finds no match because agent→model resolution happens later inside `driver::execute` (in `worker_selection.rs`). This same limitation affects policy lookup and worker selection in the Gemini router. It is NOT introduced by PR `#936` and is a known accepted gap. The Gemini router is not actively used currently. Do not flag this as a bug introduced by per-model retry config changes to the Gemini router.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1099
File: .github/workflows/pr-test-mlx.yml:43-43
Timestamp: 2026-04-13T02:02:07.209Z
Learning: In this repo, when reviewing GitHub Actions workflow YAML files, treat `runs-on: macos-latest` as acceptable for Apple Silicon: GitHub’s standard `macos-latest` runner resolves to an arm64 macOS runner (Intel is being deprecated). Do not flag `macos-latest` as potentially non-arm64/Intel for Apple Silicon + MLX workloads.

Comment thread .github/workflows/nightly-mlx-bench.yml Outdated
Comment thread benchmarks/mlx_grpc_vs_http/aggregate.py Outdated
Comment thread benchmarks/mlx_grpc_vs_http/aggregate.py
Comment thread benchmarks/mlx_grpc_vs_http/run.sh Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/nightly-mlx-bench.yml (1)

100-104: ⚠️ Potential issue | 🟠 Major

Docker runtime is assumed but not provisioned on the macOS runner.

Line 101 and Line 104 depend on a working Docker daemon, but this job has no Docker setup/start step. Please provision Docker explicitly on macOS (or remove the container dependency for genai-bench) before these steps.

Run this verification to confirm the current macOS runner image contents:

#!/bin/bash
set -euo pipefail

# Check whether Docker appears in the official macOS arm64 runner image manifest.
curl -fsSL https://raw.githubusercontent.com/actions/runner-images/main/images/macos/macos-15-arm64-Readme.md \
  | rg -n -i '\bdocker\b' || true

# Expected result:
# - No Docker tool entry found in the manifest (or no matching lines),
#   which indicates the workflow must install/start Docker itself.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-mlx-bench.yml around lines 100 - 104, The workflow
assumes a Docker daemon but the macOS runner doesn't provide one; before the
steps named "Pull genai-bench image" and "Run benchmark (this is the long step)"
either provision Docker on macOS or remove the container dependency. Fix by
adding a setup step that installs and starts Docker on macOS (for example:
install Docker Desktop via brew install --cask docker, launch Docker.app and
wait for the daemon to be ready) or change the job to use an Ubuntu runner or
modify ./benchmarks/mlx_grpc_vs_http/run.sh to use a non-container execution
path so the docker pull/run calls are no longer required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/nightly-mlx-bench.yml:
- Around line 25-34: Remove the temporary CI trigger by deleting the
pull_request block (the YAML mapping starting with pull_request: and its nested
branches: and paths: entries) so only the intended schedule and
workflow_dispatch triggers remain; ensure the commented note about it being
temporary is also removed or updated so the workflow no longer runs on PRs for
the benchmarks/mlx_grpc_vs_http path after merge.

---

Duplicate comments:
In @.github/workflows/nightly-mlx-bench.yml:
- Around line 100-104: The workflow assumes a Docker daemon but the macOS runner
doesn't provide one; before the steps named "Pull genai-bench image" and "Run
benchmark (this is the long step)" either provision Docker on macOS or remove
the container dependency. Fix by adding a setup step that installs and starts
Docker on macOS (for example: install Docker Desktop via brew install --cask
docker, launch Docker.app and wait for the daemon to be ready) or change the job
to use an Ubuntu runner or modify ./benchmarks/mlx_grpc_vs_http/run.sh to use a
non-container execution path so the docker pull/run calls are no longer
required.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 282d6bcd-1201-4a4b-9f0a-f1578d5112c4

📥 Commits

Reviewing files that changed from the base of the PR and between 15a2617 and b068350.

📒 Files selected for processing (1)
  • .github/workflows/nightly-mlx-bench.yml

Comment on lines +25 to +34
  # TEMPORARY: enable PR trigger so this workflow can be smoke-tested before
  # merge. Path-filtered so unrelated PRs don't fire it, and the matrix is
  # reduced via the env block below to ~12 minutes. REMOVE this block before
  # merging — once the harness is validated, only schedule + workflow_dispatch
  # should remain.
  pull_request:
    branches: [main]
    paths:
      - "benchmarks/mlx_grpc_vs_http/**"
      - ".github/workflows/nightly-mlx-bench.yml"
@coderabbitai coderabbitai Bot Apr 27, 2026

⚠️ Potential issue | 🟡 Minor

Remove the temporary pull_request trigger before merge.

This block is explicitly marked temporary; merging it as-is keeps PR smoke runs enabled beyond the validation phase and expands CI spend/surface area unintentionally.

✂️ Proposed cleanup before final merge
-  # TEMPORARY: enable PR trigger so this workflow can be smoke-tested before
-  # merge. Path-filtered so unrelated PRs don't fire it, and the matrix is
-  # reduced via the env block below to ~12 minutes. REMOVE this block before
-  # merging — once the harness is validated, only schedule + workflow_dispatch
-  # should remain.
-  pull_request:
-    branches: [main]
-    paths:
-      - "benchmarks/mlx_grpc_vs_http/**"
-      - ".github/workflows/nightly-mlx-bench.yml"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  # TEMPORARY: enable PR trigger so this workflow can be smoke-tested before
  # merge. Path-filtered so unrelated PRs don't fire it, and the matrix is
  # reduced via the env block below to ~12 minutes. REMOVE this block before
  # merging — once the harness is validated, only schedule + workflow_dispatch
  # should remain.
  pull_request:
    branches: [main]
    paths:
      - "benchmarks/mlx_grpc_vs_http/**"
      - ".github/workflows/nightly-mlx-bench.yml"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-mlx-bench.yml around lines 25 - 34, Remove the
temporary CI trigger by deleting the pull_request block (the YAML mapping
starting with pull_request: and its nested branches: and paths: entries) so only
the intended schedule and workflow_dispatch triggers remain; ensure the
commented note about it being temporary is also removed or updated so the
workflow no longer runs on PRs for the benchmarks/mlx_grpc_vs_http path after
merge.

Collaborator Author

Deliberately deferred. The PR trigger smoke-tests the harness on each push while the bench is still being validated; without it I can't see whether changes break the workflow until manual dispatch. Will remove the trigger and the surrounding TEMPORARY comment in the final pre-merge commit. The path filter (benchmarks/mlx_grpc_vs_http/** + the workflow file itself) keeps unrelated PRs from firing it, and the env-block matrix reduction (concurrencies=1 4, duration=4 min) caps PR-trigger spend at ~25 min vs the full ~2 h scheduled run.


@key4ng, understood — that's a reasonable validation strategy. The path filter + reduced matrix (≈25 min cap) makes the temporary trigger cost-conscious, and having it fire on every push is genuinely more convenient than repeated manual dispatches during harness development.

I've noted your commitment to remove the pull_request block and its surrounding TEMPORARY comment in the final pre-merge commit. I'll keep this comment open as a pre-merge reminder.


✏️ Learnings added
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1399
File: .github/workflows/nightly-mlx-bench.yml:29-38
Timestamp: 2026-05-01T00:13:32.765Z
Learning: In repo lightseekorg/smg, `.github/workflows/nightly-mlx-bench.yml`: The `pull_request:` trigger block (with `branches: [main]` and `paths: ["benchmarks/mlx_grpc_vs_http/**", ".github/workflows/nightly-mlx-bench.yml"]`) is explicitly marked TEMPORARY and is intentionally kept during the harness validation phase so the author can smoke-test the workflow on each push without needing manual `workflow_dispatch`. The env-block matrix reduction caps PR-trigger spend at ~25 min. The block is committed to be removed in the final pre-merge commit, leaving only `schedule` and `workflow_dispatch` triggers. Do NOT re-flag this temporary trigger as a blocking issue during the validation period, but do flag it if it is still present at merge time.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: key4ng
Repo: lightseekorg/smg PR: 1399
File: .github/workflows/nightly-mlx-bench.yml:39-41
Timestamp: 2026-04-27T04:50:22.772Z
Learning: In `.github/workflows/nightly-mlx-bench.yml`, do not flag or recommend changing the `concurrency.group` value `nightly-mlx-bench-${{ github.event_name }}`. It is intentionally designed to separate concurrency between `schedule` and `workflow_dispatch` so a manual run can coexist with (rather than block/cancel) a scheduled nightly run. Keep `cancel-in-progress: false` as-is, since it intentionally prevents new runs of the same event type from cancelling an already-running instance.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1398
File: .github/workflows/pr-test-mlx.yml:66-119
Timestamp: 2026-04-30T06:05:20.312Z
Learning: In repo lightseekorg/smg, `.github/workflows/pr-test-mlx.yml`: Do NOT suggest adding a Hugging Face model cache step for the `mlx-community/Qwen3-0.6B-4bit` model (~400 MB). The model is intentionally left uncached because it is small enough that the network download cost is acceptable and caching it wastes GitHub Actions cache storage.

Learnt from: VadivelanU
Repo: lightseekorg/smg PR: 1407
File: model_gateway/src/routers/openai/responses/streaming.rs:1061-1173
Timestamp: 2026-04-29T13:26:42.244Z
Learning: In repo lightseekorg/smg, `handle_streaming_with_tool_interception` in `model_gateway/src/routers/openai/responses/streaming.rs`: The inner `break` on `tool_calls_detected` inside the `for normalized_event in normalized_events` loop (after `StreamAction::ExecuteTools` sets the flag) is a pre-existing behavior preserved intentionally from prior iterations. It can theoretically drop later events in the same normalized batch, but fixing it (moving the break outside the inner loop so all normalized events are processed before acting on the flag) is deferred to a follow-up PR. Do not re-flag this early break as a blocking issue in PR `#1407` or similar streaming-refactor PRs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 635
File: .github/workflows/pr-test-rust.yml:621-621
Timestamp: 2026-03-04T23:20:28.110Z
Learning: In lightseekorg/smg, keeping `-s` (no capture) unconditionally in pytest invocations for parallel E2E jobs in `.github/workflows/pr-test-rust.yml` is intentional. Removing `-s` for parallel jobs was explicitly rejected because: (1) captured output is discarded on CI timeout/hang, which is unacceptable for debugging, and (2) it removes real-time log visibility. Log interleaving from parallel runs is instead addressed by a `_TestNameFormatter` in `e2e_test/conftest.py` that prefixes every log line with `[test_name]`, allowing users to grep for a specific test name to follow its output through interleaved logs.

Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1402
File: model_gateway/src/routers/grpc/harmony/streaming.rs:509-513
Timestamp: 2026-04-27T23:48:21.853Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/harmony/streaming.rs`: The unconditional `prefill_stream.mark_completed()` call after `process_decode_stream(...)` (in both `process_responses_dual_stream` and `process_dual_stream`) is a pre-existing pattern that predates PR `#1402`. PR `#1402` only relocated the function; no logic was changed. The fix — guarding `mark_completed()` behind `result.is_ok()` so that decode/parser errors still trigger abort-on-drop cleanup for the prefill RPC — is intentionally deferred to a follow-up PR covering both dual-stream paths. Do not re-flag this as a regression in PR `#1402` or similar relocation PRs.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/benches/wasm_middleware_latency.rs:88-91
Timestamp: 2026-02-21T02:37:02.009Z
Learning: Repo: lightseekorg/smg — For clippy-only/enforcement PRs (e.g., PR `#489`), even micro-optimizations (like replacing an async closure with std::future::ready in benches such as model_gateway/benches/wasm_middleware_latency.rs) should be deferred to a follow-up PR rather than included inline.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 1305
File: model_gateway/src/routers/grpc/harmony/builder.rs:726-728
Timestamp: 2026-04-22T20:29:00.426Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/grpc/harmony/builder.rs`: The warn message "Approval item reached Harmony conversion" in `parse_response_item_to_harmony_message` is pre-existing code. The `ComputerCall { .. }` and `ComputerCallOutput { .. }` match arms added in PR `#1305` are forced-cascade additions to keep the exhaustive match valid within a protocol-scope PR. Rewording the warning to a neutral message (e.g. "Unsupported input item reached Harmony conversion") is intentionally deferred to a follow-up router task. Do not re-flag this stale log message in protocol-scope PRs.

Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1250
File: model_gateway/tests/api/responses_api_test.rs:1347-1355
Timestamp: 2026-04-24T20:01:23.917Z
Learning: In repo lightseekorg/smg, the response transformer in the non-streaming OpenAI path rewrites synthesized `function_call_output` items into `mcp_call` items before they reach the final client-visible output. Consequently, the assertion in `test_final_response_hides_internal_mcp_trace_items` (model_gateway/tests/api/responses_api_test.rs) does not need to explicitly block `"function_call_output"` — any such leak would first require the transformer to fail, which would already be caught by the `mcp_call` guard. Adding `"function_call_output"` to that assertion is defense-in-depth only, not a fix for a live gap; it is intentionally deferred to a follow-up PR.

Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1250
File: model_gateway/tests/api/responses_api_test.rs:643-653
Timestamp: 2026-04-24T20:01:18.002Z
Learning: In repo lightseekorg/smg, PR `#1250` (`model_gateway/tests/api/responses_api_test.rs`): The router-side rejection path for `previous_response_id + mcp_approval_response + no MCP tools` (i.e., a resumed approval continuation where the follow-up request carries zero MCP tool bindings) is intentionally not covered by integration tests in this PR. The existing continuation fixtures all inherit `req1` which carries an MCP binding. A dedicated test for this rejection branch is deferred to a follow-up PR. Do not re-flag the missing coverage in PR `#1250` or similar approval-continuation PRs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 1090
File: crates/tool_parser/src/parsers/mistral.rs:48-75
Timestamp: 2026-04-13T23:31:17.654Z
Learning: In repo lightseekorg/smg, `crates/tool_parser/src/parsers/mistral.rs` `MistralParser::build_structural_tag`: The `triggered_tags` format intentionally constrains only single tool calls (one `begin`/`end` tag pair per tool, trigger `"[TOOL_CALLS]"`). Supporting parallel Mistral tool calls (array with multiple items) requires `tags_with_separator` (as sglang implements) and is deferred to a follow-up PR. Do not flag the absence of multi-call/continuation tags in `build_structural_tag` as a bug; the primary target use case is `tool_choice: function` (single call).

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 990
File: crates/tokenizer/benches/stop_decoder.rs:23-24
Timestamp: 2026-04-01T08:02:32.590Z
Learning: In `lightseekorg/smg` (`crates/tokenizer/benches/stop_decoder.rs`), benchmarks intentionally use `MockTokenizer` (not a real HuggingFace tokenizer) to keep them hermetic and avoid network/filesystem dependencies for model downloads. The HuggingFace native `step_decode_stream` fast-path is covered by the integration test `test_sequence_operations` in `crates/tokenizer/tests/tokenizer_integration.rs`, which uses a real HF tokenizer. A fixture-backed HF benchmark is a known follow-up item but is not required in this PR. Do not flag the absence of HF tokenizer usage in the bench file as a gap.

Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 723
File: model_gateway/tests/api/interactions_api_test.rs:316-346
Timestamp: 2026-03-11T05:32:45.536Z
Learning: In repo lightseekorg/smg, `test_interactions_multiple_workers` in `model_gateway/tests/api/interactions_api_test.rs` intentionally only verifies that requests succeed with multiple Gemini backends (connectivity), not load distribution. Load distribution across workers is covered separately in `tests/routing/load_balancing_test.rs`. Do not flag the absence of per-worker request count assertions in this test as a gap — distribution verification is deferred to a follow-up PR once the Gemini router implementation matures.

Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 1174
File: crates/mcp/src/core/orchestrator.rs:1097-1131
Timestamp: 2026-04-17T06:48:38.553Z
Learning: In repo lightseekorg/smg, `execute_tool_entry_result` in `crates/mcp/src/core/orchestrator.rs`: `record_call_end` is intentionally called for both `ToolExecutionResult::Executed` and `ToolExecutionResult::PendingApproval` to preserve pre-refactor metric shape (pending approval closes out the current tool-handling attempt). For `PendingApproval`, `succeeded = false` was a bug fixed in PR `#1174` so it no longer inflates error rates, but a dedicated `record_call_pending_approval` counter and full lifecycle split (only recording completion when approval is resolved and execution runs) are deferred to a follow-up PR. Do not flag the absence of a dedicated pending-approval metric as a blocking issue until that follow-up work is done.

Learnt from: pallasathena92
Repo: lightseekorg/smg PR: 687
File: model_gateway/src/routers/openai/realtime/webrtc.rs:238-289
Timestamp: 2026-03-11T01:29:56.655Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/openai/realtime/webrtc.rs: `Metrics::record_router_request` is already emitted in router.rs (around line 1152) before `handle_realtime_webrtc` is called, and `Metrics::record_router_error` is emitted inside `handle_realtime_webrtc` for the no-workers case. The missing instrumentation is success/duration recording after `setup_and_spawn_bridge` returns — this is a metrics improvement deferred to a follow-up PR, not a correctness gap. Do not flag missing success/duration metrics as a blocking issue for PR `#687`.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.

Learnt from: shenoyvvarun
Repo: lightseekorg/smg PR: 1397
File: model_gateway/src/routers/http/pd_router.rs:104-116
Timestamp: 2026-04-29T15:14:12.061Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/http/pd_router.rs` and `router.rs`: When `enforce_token_rate_limit` (or its equivalent in the regular HTTP router) returns a 429/413 early exit, do NOT emit the existing worker-dispatch metrics (`Metrics::record_router_request`, `record_router_error`, etc.) for throttled traffic. Those metrics are semantically tied to requests dispatched to backend workers. Instead, a dedicated metric such as `Metrics::record_request_throttled(router_type, backend, model_id, tenant_key)` should be added to track admission-control rejections separately. Do not re-flag the absence of worker-dispatch metrics on throttled paths as a bug; flag only the absence of a dedicated throttle counter if one is not present.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 936
File: model_gateway/src/routers/gemini/router.rs:82-90
Timestamp: 2026-03-30T18:52:19.689Z
Learning: In repo lightseekorg/smg, the Gemini router's (`model_gateway/src/routers/gemini/router.rs`) use of `model_id` for per-model retry config lookup in `route_interactions` has a pre-existing limitation: when the request carries only an agent identifier, `get_retry_config(agent_id)` finds no match because agent→model resolution happens later inside `driver::execute` (in `worker_selection.rs`). This same limitation affects policy lookup and worker selection in the Gemini router. It is NOT introduced by PR `#936` and is a known accepted gap. The Gemini router is not actively used currently. Do not flag this as a bug introduced by per-model retry config changes to the Gemini router.

Learnt from: RohanSogani
Repo: lightseekorg/smg PR: 1333
File: model_gateway/tests/api/responses_api_test.rs:325-373
Timestamp: 2026-04-23T18:59:15.656Z
Learning: In repo lightseekorg/smg, PR `#1333` adds `extract_forwardable_request_headers` and threads forwarded headers through `McpToolSession` → `McpRequestContext` → `HandlerRequestContext`. This PR's scope is strictly internal request-context preservation (preserving request-scoped headers so they are accessible in MCP execution context), NOT outbound MCP header propagation to dynamic MCP servers. The integration test `test_non_streaming_mcp_e2e_accepts_forwardable_request_headers` in `model_gateway/tests/api/responses_api_test.rs` is intentionally a smoke test (verifies MCP still executes when forwardable headers are present); wire-level header assertions via `MockMCPServer::captured_headers()` belong in the follow-up PR covering config-driven outbound MCP header propagation. Do not flag the absence of `captured_headers()` assertions in this test as a gap for PR `#1333`.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 633
File: .github/workflows/pr-test-rust.yml:439-440
Timestamp: 2026-03-04T20:41:53.013Z
Learning: Ensure GitHub Actions workflows always include an actions/checkout step when referencing local composite actions (uses: ./... or similar) on self-hosted runners. Do not rely on a pre-checked-out workspace unless your environment explicitly guarantees it; follow the standard behavior and include checkout to ensure reliable action resolution.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: .github/workflows/release-grpc.yml:35-37
Timestamp: 2026-03-05T05:50:07.158Z
Learning: In workflows under .github/workflows that trigger only on push to main, you do not need to guard against the all-zeros github.event.before (0000000000000000000000000000000000000000) since that value only occurs on initial branch creation and cannot happen for an already-existing main. Apply this guidance to similar push-to-main workflows across the repository (pattern: **/.github/workflows/*.yml). If a workflow also handles other events or branches, retain appropriate guards for those cases.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1036
File: .github/workflows/claude-code-review.yml:71-73
Timestamp: 2026-04-06T16:49:28.633Z
Learning: In GitHub Actions workflow YAMLs, treat any `prompt:` field intended for an LLM (e.g., Claude) as natural-language instructions for the model. Do not embed shell-script-specific constructs or error-handling patterns inside the prompt content (e.g., `git rev-parse --verify`, `2>/dev/null`, explicit exit-code checks like `if [ $? -ne 0 ]`). Instead, express fallback/error-handling guidance in natural language so the LLM can reason about failures from context rather than parsing shell stderr/stdout semantics.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1099
File: .github/workflows/pr-test-mlx.yml:43-43
Timestamp: 2026-04-13T02:02:07.209Z
Learning: In this repo, when reviewing GitHub Actions workflow YAML files, treat `runs-on: macos-latest` as acceptable for Apple Silicon: GitHub’s standard `macos-latest` runner resolves to an arm64 macOS runner (Intel is being deprecated). Do not flag `macos-latest` as potentially non-arm64/Intel for Apple Silicon + MLX workloads.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b068350219

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

--num-concurrency "$concurrency" \
--traffic-scenario "$scenario" \
--max-requests-per-run "$MAX_REQUESTS" \
--max-time-per-run "$DURATION_MIN" \

P1 Badge Convert max-time-per-run from minutes to seconds

run.sh treats DURATION_MIN as minutes, but passes it directly to genai-bench --max-time-per-run, which is defined in seconds (the upstream manage_run_time loop increments once per second, and this repo’s own benchmark harness documents _MAX_TIME_PER_RUN = 10 # seconds in e2e_test/benchmarks/test_nightly_perf.py). With the current default DURATION_MIN=10, each cell runs for ~10 seconds instead of 10 minutes, so nightly results will be statistically noisy and can mislead HTTP vs gRPC comparisons.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/nightly-mlx-bench.yml (1)

25-34: ⚠️ Potential issue | 🟡 Minor

Remove the temporary pull_request trigger before merging.

This block is explicitly marked temporary; keeping it in the final workflow expands CI scope/cost beyond the intended nightly + manual modes.

Suggested cleanup
-  # TEMPORARY: enable PR trigger so this workflow can be smoke-tested before
-  # merge. Path-filtered so unrelated PRs don't fire it, and the matrix is
-  # reduced via the env block below to ~12 minutes. REMOVE this block before
-  # merging — once the harness is validated, only schedule + workflow_dispatch
-  # should remain.
-  pull_request:
-    branches: [main]
-    paths:
-      - "benchmarks/mlx_grpc_vs_http/**"
-      - ".github/workflows/nightly-mlx-bench.yml"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-mlx-bench.yml around lines 25 - 34, Remove the
temporary pull_request trigger block from the workflow so it only runs on the
intended schedule and manual dispatch; specifically delete the entire
pull_request: block (including the branches: [main] and paths: entries that
reference "benchmarks/mlx_grpc_vs_http/**" and
".github/workflows/nightly-mlx-bench.yml") before merging so CI scope reverts to
nightly + workflow_dispatch only.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/mlx_grpc_vs_http/aggregate.py`:
- Around line 49-52: The collect function currently calls results_dir.iterdir()
which raises if the directory doesn't exist; update collect (function name:
collect) to handle a missing results_dir by early-returning the empty out dict
or by wrapping the iterdir() call in a try/except FileNotFoundError (or checking
results_dir.exists() / results_dir.is_dir()) so the function emits the intended
empty-results summary instead of crashing; ensure you still return the same out
variable shape when the directory is absent.

---

Duplicate comments:
In @.github/workflows/nightly-mlx-bench.yml:
- Around line 25-34: Remove the temporary pull_request trigger block from the
workflow so it only runs on the intended schedule and manual dispatch;
specifically delete the entire pull_request: block (including the branches:
[main] and paths: entries that reference "benchmarks/mlx_grpc_vs_http/**" and
".github/workflows/nightly-mlx-bench.yml") before merging so CI scope reverts to
nightly + workflow_dispatch only.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6e389c04-0c6b-402c-967e-dd4f8060b3e5

📥 Commits

Reviewing files that changed from the base of the PR and between b068350 and 1f2953e.

📒 Files selected for processing (3)
  • .github/workflows/nightly-mlx-bench.yml
  • benchmarks/mlx_grpc_vs_http/aggregate.py
  • benchmarks/mlx_grpc_vs_http/run.sh

Comment thread benchmarks/mlx_grpc_vs_http/aggregate.py
Comment thread benchmarks/mlx_grpc_vs_http/run.sh Outdated
Comment on lines +109 to +116
local container_url="${base_url/127.0.0.1/host.docker.internal}"
container_url="${container_url/localhost/host.docker.internal}"

# genai-bench's --max-time-per-run is in SECONDS (despite the name).
local duration_s=$(( DURATION_MIN * 60 ))

docker run --rm \
--add-host=host.docker.internal:host-gateway \

🔴 Important: The servers started in Phase 1 and Phase 2 all bind to 127.0.0.1 (lines 141, 171, 181). With Colima, host.docker.internal resolves to the Mac host's IP on the virtual network (e.g. 192.168.106.1), not to 127.0.0.1. Since the servers only listen on the loopback interface, connections from the container to host.docker.internal:<port> will be refused.

Docker Desktop for Mac has a built-in proxy that forwards host.docker.internal traffic to the host's loopback, but Colima (Lima-based) does not — it relies on standard IP routing.

Fix: change the three server invocations to bind to 0.0.0.0:

# Line 141 (Phase 1 — mlx-lm)
mlx_lm.server --model "$MODEL" --host 0.0.0.0 --port "$HTTP_PORT" ...

# Line 171 (Phase 2 — gRPC servicer)
python3 -m smg_grpc_servicer.mlx.server --model "$MODEL" --host 0.0.0.0 --port "$GRPC_PORT" ...

# Line 181 (Phase 2 — SMG router)
"$SMG_BIN" launch --host 0.0.0.0 --port "$ROUTER_PORT" ...

This is safe on a CI runner. The wait_for_port and wait_for_openai helpers should also be updated to check 127.0.0.1 (or just localhost) since they're run from the host.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8203fc57c


ROUTER_PORT="${ROUTER_PORT:-30000}"
SMG_BIN="${SMG_BIN:-target/release/smg}"

mkdir -p "$RESULTS_DIR"

P2 Badge Clear RESULTS_DIR before starting a new benchmark run

run.sh only creates $RESULTS_DIR and never removes prior contents, but the aggregator later walks every experiment subfolder under that directory. In practice this mixes old and new runs (for example, after a smoke run with a reduced matrix and then a full run), so SUMMARY.md can include stale cells that were not executed in the current benchmark and mislead HTTP vs gRPC comparisons.


return None
# Each cell produces one summary JSON; sorted() makes the choice
# deterministic if a partial rerun left a stale file.
data = json.loads(json_files[0].read_text())

P2 Badge Read latest metrics file instead of first sorted JSON

parse_experiment currently loads json_files[0] after sorting, which selects the lexicographically earliest file in the folder. If a cell is rerun and multiple result JSON files remain in the same experiment directory, the summary can report stale metrics from an older run rather than the latest output, silently corrupting the final table.
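A minimal sketch of the fix being asked for (not the PR's actual parse_experiment; `folder` follows the snippet above): pick the most recently written summary JSON so a rerun supersedes any stale file.

import json
from pathlib import Path

def load_latest_summary(folder: Path) -> dict | None:
    # Prefer the newest summary JSON; lexicographic order can pick a stale one.
    json_files = sorted(folder.glob("*.json"), key=lambda p: p.stat().st_mtime)
    if not json_files:
        return None
    return json.loads(json_files[-1].read_text())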


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/mlx_grpc_vs_http/run.sh`:
- Around line 157-160: The fixed "sleep 5" after starting the gRPC servicer is
brittle; replace it with a deterministic readiness gate instead of sleep by
adding/using a wait-for-ready helper (e.g., a new wait_for_generator_ready or
reuse wait_for_port) that polls a concrete readiness signal (health HTTP/gRPC
health endpoint, a socket response on "$GRPC_PORT", or a PID-ready file written
by the gen thread) with a timeout and backoff, then call that helper in place of
"sleep 5" so the script proceeds only when the generator is actually warmed up
(reference symbols: wait_for_port, GRPC_PORT, GRPC_PID, and the current "sleep
5" location).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f7c9e65f-8711-485f-bcad-9de3dcf6ebe2

📥 Commits

Reviewing files that changed from the base of the PR and between 1f2953e and e8203fc.

📒 Files selected for processing (3)
  • .github/workflows/nightly-mlx-bench.yml
  • benchmarks/mlx_grpc_vs_http/README.md
  • benchmarks/mlx_grpc_vs_http/run.sh

Comment thread benchmarks/mlx_grpc_vs_http/run.sh Outdated

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
benchmarks/mlx_grpc_vs_http/run.sh (1)

159-162: ⚠️ Potential issue | 🟠 Major

Replace fixed warmup sleep with a deterministic readiness gate.

A hardcoded delay is still brittle for CI variability and can skew/flake the first measured runs.

🔧 Proposed hardening
+wait_for_log_line() {
+    local file="$1"
+    local pattern="$2"
+    local timeout="${3:-300}"
+    local start=$SECONDS
+    while ! grep -q "$pattern" "$file" 2>/dev/null; do
+        if (( SECONDS - start > timeout )); then
+            log "Timeout waiting for '$pattern' in $file"
+            return 1
+        fi
+        sleep 1
+    done
+}
+
 wait_for_port "$GRPC_PORT" 300
+wait_for_log_line "$RESULTS_DIR/mlx-grpc.log" "gRPC server listening on" 300
 log "MLX gRPC servicer up on :$GRPC_PORT (pid=$GRPC_PID)"
-# Give the gen thread a beat to finish warmup before starting the router.
-sleep 5
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/mlx_grpc_vs_http/run.sh` around lines 159 - 162, Replace the
brittle fixed "sleep 5" warmup with a deterministic readiness gate:
implement/use a readiness check that polls the generator's health (e.g., a
gen-ready HTTP/IPC/UDP/flag endpoint or log marker) and block until it reports
ready before starting the router; in this script replace the sleep 5 with a call
to that gate (e.g., wait_for_generation_ready or similar) invoked after
wait_for_port "$GRPC_PORT" and before starting the router so the script relies
on an explicit readiness signal tied to the generator process (identified by
GRPC_PID/GRPC_PORT) rather than a fixed timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@benchmarks/mlx_grpc_vs_http/run.sh`:
- Around line 159-162: Replace the brittle fixed "sleep 5" warmup with a
deterministic readiness gate: implement/use a readiness check that polls the
generator's health (e.g., a gen-ready HTTP/IPC/UDP/flag endpoint or log marker)
and block until it reports ready before starting the router; in this script
replace the sleep 5 with a call to that gate (e.g., wait_for_generation_ready or
similar) invoked after wait_for_port "$GRPC_PORT" and before starting the router
so the script relies on an explicit readiness signal tied to the generator
process (identified by GRPC_PID/GRPC_PORT) rather than a fixed timeout.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 94b6b10a-473f-4fcc-bd6e-f339ba3ff658

📥 Commits

Reviewing files that changed from the base of the PR and between e8203fc and 16769d5.

📒 Files selected for processing (1)
  • benchmarks/mlx_grpc_vs_http/run.sh


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b4820432bc


local base="$1"
local timeout="${2:-180}"
local start=$SECONDS
while ! curl -fsS "$base/v1/models" >/dev/null 2>&1; do

P2 Badge Add request timeouts to readiness curl probe

The readiness loop is meant to stop after timeout seconds, but curl -fsS has no --connect-timeout/--max-time, so if the endpoint accepts a socket and then stalls, curl can block indefinitely and the timeout branch is never evaluated. In that startup-failure mode the benchmark hangs until the job-level timeout instead of failing fast.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5aed138a9d


Comment thread .github/workflows/nightly-mlx-bench.yml Outdated
- name: Install genai-bench
# Native pip install instead of the docker image — avoids the
# Colima-on-macos-latest VM hop, which is unstable in CI.
run: pip install "genai-bench>=0.0.4"

P2 Badge Pin genai-bench version for reproducible nightly runs

Installing genai-bench with >=0.0.4 makes nightly results non-reproducible over time because the benchmark harness can silently change behavior, defaults, or metric calculations as newer releases are published. This job is meant to compare performance trends, so tool-version drift can corrupt those comparisons even when router code is unchanged; use an exact pinned version (or lock file) to keep runs comparable.


Comment on lines +112 to +113
scenarios = sorted({k[1] for k in results})
concurrencies = sorted({k[2] for k in results})

P2 Badge Preserve full benchmark matrix when cells fail

The table header and row dimensions are inferred only from successful result keys, so any scenario/concurrency where both HTTP and gRPC produced no JSON is dropped entirely instead of being shown as missing. In failure-heavy runs (for example, highest concurrency crashing both paths), this hides failed cells and makes the summary look like a smaller intended matrix, which can mislead comparisons.
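For illustration only (the names below are not from aggregate.py), the suggestion amounts to deriving the table dimensions from the requested matrix and rendering missing cells explicitly:

# Sketch: render every requested (scenario, concurrency) cell, using "—"
# when no backend produced a result for it.
REQUESTED_SCENARIOS = ["chat", "agent"]
REQUESTED_CONCURRENCIES = [1, 4, 16, 64]

def cell(results: dict, backend: str, scenario: str, conc: int) -> str:
    entry = results.get((backend, scenario, conc))
    return f"{entry['request_throughput']:.2f}" if entry else "—"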

Collaborator Author


Intentional — committed in 4580bbc with rationale: the aggregator now infers the rendered matrix from observed data, so a CI run with PHASES=mlx grpc produces a two-row table, not a three-row table with a column of `—`. The same logic applies cell-by-cell. Failed cells are not silent: run.sh writes a .failed marker per failed cell and CI uploads bench-results/ as an artifact, so the failure record is preserved even when the row collapses. The header that previously showed every requested concurrency could imply intent the run did not satisfy — preferring "show what ran" matches how the rest of the SMG nightly suite renders its summary tables. Rendering attempted-but-failed cells from .failed markers is a reasonable follow-up if/when CI starts swallowing whole concurrencies regularly, but the current data does not show that pattern.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fdd9fe6559


Comment on lines +164 to +167
if content:
if first_token_t is None:
first_token_t = time.monotonic()
tokens += 1

P1 Badge Measure generated tokens instead of SSE chunks

This increments tokens once per streamed chunk, but OpenAI-style SSE chunks are transport frames, not token boundaries, so chunk coalescing/splitting differences between mlx_lm.server and the router path will skew total_output_tokens, output_throughput, and tpot_ms even when model speed is unchanged. Because this benchmark’s purpose is comparing those paths, the current counting method can produce misleading regressions/improvements unrelated to actual token generation.
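One way to address this (a sketch, not the PR's bench_client.py; the tokenizer checkpoint is illustrative) is to tokenize the accumulated completion once rather than counting SSE frames:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/gemma-3-1b-it-4bit")

def count_output_tokens(chunks: list[str]) -> int:
    # Chunk boundaries are a transport detail; count tokens over the joined text.
    text = "".join(chunks)
    return len(tokenizer.encode(text, add_special_tokens=False))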


Comment on lines +54 to +57
ds = load_dataset(
"anon8231489123/ShareGPT_Vicuna_unfiltered",
data_files="ShareGPT_V3_unfiltered_cleaned_split.json",
split="train",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pin dataset revisions for stable nightly comparisons

The benchmark loads Hugging Face datasets without a revision, so each run implicitly tracks the latest dataset state; upstream edits (new/removed rows or content changes) will change prompt-length distributions and can shift latency/throughput trends even when router code is unchanged. This undermines the nightly benchmark’s comparability over time; pin immutable dataset revisions (for both chat and agent loaders) to keep trend lines meaningful.
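For illustration, datasets.load_dataset accepts a revision argument; the value below is a placeholder, not a real commit hash:

from datasets import load_dataset

ds = load_dataset(
    "anon8231489123/ShareGPT_Vicuna_unfiltered",
    data_files="ShareGPT_V3_unfiltered_cleaned_split.json",
    split="train",
    revision="<pinned-commit-sha>",  # placeholder: pin to an immutable dataset commit
)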


# downloading artifacts.
cat "$RESULTS_DIR/SUMMARY.md" >> "$GITHUB_STEP_SUMMARY"

- name: Upload raw genai-bench output

🟡 Nit: Stale step name — genai-bench was replaced by bench_client.py in this push.

Suggested change
- name: Upload raw genai-bench output
- name: Upload raw benchmark output

Comment thread .github/workflows/nightly-mlx-bench.yml Outdated
Comment on lines +43 to +44
# 4 concurrencies × 2 scenarios × 2 setups × 10 min ≈ 2.7 h, plus build
# and model download. Cap at 5h45m for safety.

🟡 Nit: This comment still references the old "10 min per cell" duration model. The new approach uses prompt-count-based cells (README says ~30 min for a full run), so the 345-min timeout is ~10x the expected runtime. Consider updating the comment and reducing the timeout.

Suggested change
# 4 concurrencies × 2 scenarios × 2 setups × 10 min ≈ 2.7 h, plus build
# and model download. Cap at 5h45m for safety.
# Custom bench client uses prompt-count-based cells (not time-based).
# Full run ~30 min; cap generously for model download + build overhead.

Comment on lines +29 to +35
"request_throughput": 3.24,
"output_throughput": 832.1,
"total_input_tokens_est": 312000,
"total_output_tokens": 31750,
"ttft_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...},
"tpot_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...}
}

🟡 Nit: The example JSON includes total_input_tokens_est but _summarize() doesn't produce that field. Either add it to the output or remove it from the docstring to avoid confusing consumers.

Suggested change
"request_throughput": 3.24,
"output_throughput": 832.1,
"total_input_tokens_est": 312000,
"total_output_tokens": 31750,
"ttft_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...},
"tpot_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...}
}
{
"scenario": "agent",
"concurrency": 16,
"num_prompts_requested": 128,
"completed_requests": 124,
"error_count": 4,
"duration_s": 38.2,
"request_throughput": 3.24,
"output_throughput": 832.1,
"total_output_tokens": 31750,
"ttft_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...},
"tpot_ms": {"mean": ..., "p50": ..., "p95": ..., "p99": ...}
}


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 406e0a4f55


Comment thread benchmarks/mlx_grpc_vs_http/run.sh Outdated
Comment on lines +129 to +132
then
log "[$exp_name] FAILED — recording marker, continuing"
date -u +"%Y-%m-%dT%H:%M:%SZ" >"$exp_dir/.failed"
return 0

P1 Badge Propagate per-cell benchmark failures to step status

This branch converts any non-zero exit from bench_client.py into success by returning 0, so a phase can finish green even when every cell fails (e.g., invalid workflow_dispatch input, dataset/API breakage, or systemic server errors). In that case the nightly job reports success with empty/placeholder results, which hides that the benchmark is actually broken and can silently drop regression signal.


Comment on lines +213 to +214
"p95": xs[min(int(len(xs) * 0.95), len(xs) - 1)],
"p99": xs[min(int(len(xs) * 0.99), len(xs) - 1)],

P2 Badge Use correct percentile indexing for latency stats

The percentile calculation uses int(len * q) as a zero-based index, which biases p95/p99 upward for the small sample sizes used here (for example, with 30 samples p95 resolves to the 29th element, effectively ~96.7th percentile). That systematically distorts tail latency comparisons between HTTP and gRPC cells and can mislead nightly trend analysis.
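A sketch of a less biased alternative (linear interpolation over a sorted sample, the method numpy uses by default); xs is assumed to match the snippet above:

import math

def percentile(xs: list[float], p: float) -> float:
    # Linear-interpolation percentile over a non-empty sorted sample.
    if not xs:
        raise ValueError("empty sample")
    k = (len(xs) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    if lo == hi:
        return xs[int(k)]
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)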


@key4ng key4ng force-pushed the keyang/mlx-bench branch from 406e0a4 to 8abd73a on April 28, 2026 04:17

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8abd73a1b2


"model": model,
"messages": [{"role": "user", "content": prompt}],
"max_tokens": max_tokens,
"temperature": 1.0,

P2 Badge Use deterministic decoding for benchmark traffic

The benchmark requests are sent with temperature set to 1.0 and no fixed seed, so each run samples different continuations and output lengths even when code and prompts are unchanged. Since this harness computes output_throughput and TPOT from generated-token counts, that sampling noise can shift nightly metrics independently of HTTP vs gRPC transport changes and create false regressions/improvements. For comparability, send deterministic generation settings for benchmark runs (for example temperature=0 and a fixed seed when supported).
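A sketch of what a deterministic request body could look like (illustrative values; seed support varies by backend):

model = "mlx-community/gemma-3-1b-it-4bit"   # example values, not the script's
prompt = "Summarize the benchmark design in one sentence."
max_tokens = 256

payload = {
    "model": model,
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": max_tokens,
    "temperature": 0.0,  # greedy decoding: same prompt yields the same continuation
    "seed": 1234,        # extra determinism where the server honours it
    "stream": True,
}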



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a6f1afee4b


Comment on lines +22 to +24
result = folder / "result.json"
if not result.exists():
return None

P2 Badge Ignore failed experiment folders when building summary

run.sh records a .failed marker for any non-zero bench cell (benchmarks/mlx_grpc_vs_http/run.sh lines 129-132), but parse_experiment only checks whether result.json exists and then loads it. That means failed cells can still be included in SUMMARY.md when result.json is stale or partially written, which can silently skew HTTP vs gRPC comparisons instead of rendering the cell as missing as intended.
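A sketch of the check being asked for (names follow the snippet above; summary-building details are elided):

import json
from pathlib import Path

def parse_experiment(folder: Path) -> dict | None:
    # Treat the cell as missing when run.sh left a .failed marker, even if
    # a stale result.json from an earlier attempt is still present.
    if (folder / ".failed").exists():
        return None
    result = folder / "result.json"
    if not result.exists():
        return None
    return json.loads(result.read_text())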


@key4ng key4ng force-pushed the keyang/mlx-bench branch from 2d9cfc1 to 6570420 on April 30, 2026 02:39

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65704207bb


Comment thread .github/workflows/nightly-mlx-bench.yml Outdated
run: pip install -e "./grpc_servicer[mlx]"

- name: Install mlx-lm CLI (for the direct-HTTP baseline)
run: pip install "mlx-lm>=0.22.0"

P2 Badge Pin mlx-lm version for reproducible benchmark trends

This workflow installs mlx-lm as >=0.22.0, so each nightly run can pick up a different upstream runtime/server implementation. Because this job is used to compare performance over time, that version drift can change latency/throughput independently of any SMG changes and produce misleading regressions or improvements; use an exact version (or lockfile) for stable baselines.

Collaborator Author


Resolved in f7f465e: pinned to mlx-lm==0.31.3. Each nightly will now use the same mlx-lm runtime, so trend changes in the summary diffs reflect SMG-side changes instead of upstream mlx-lm release timing. Future bumps happen in their own commits so breaks are explicit.

Comment thread benchmarks/mlx_grpc_vs_http/README.md Outdated
# Python deps in the main env
pip install -e ./crates/grpc_client/python
pip install -e "./grpc_servicer[mlx]"
pip install "mlx-lm>=0.22.0" genai-bench

P2 Badge Include oci_openai in documented local install command

The local setup instructions install only genai-bench, but this same commit’s workflow notes that genai-bench eagerly imports oci_openai and can crash on a clean install without it. As written, developers following the README can fail before the benchmark even starts, so the documented local workflow is not reliably runnable; add oci_openai (or pin a version where it is guaranteed present).

Collaborator Author


Resolved in f7f465e: README's local install line now matches the CI workflow — pip install "mlx-lm==0.31.3" genai-bench oci_openai, with a short note explaining why oci_openai is needed (genai-bench eagerly imports it; pip doesn't resolve it on a fresh wheel install).

Comment thread .github/workflows/nightly-mlx-bench.yml Outdated
@@ -0,0 +1,245 @@
name: Nightly MLX Three-Way Benchmark

🟡 Nit: The top-level workflow name still says "Three-Way" but the job name (line 51) was updated to "two-way" in this push. They should match now that vllm is dropped from the CI default.

Suggested change
name: Nightly MLX Three-Way Benchmark
name: Nightly MLX Two-Way Benchmark

Collaborator Author


Resolved in f7f465e: workflow name: is now Nightly MLX Two-Way Benchmark matching the job name.

key4ng added a commit that referenced this pull request May 1, 2026
Three issues flagged by bot review on commit 6570420 / 57e8888:

1. Workflow name said "Nightly MLX Three-Way Benchmark" but the job
   name was already "MLX two-way (mlx-lm vs SMG+gRPC; vllm-metal
   local-only)" since vllm was dropped from the CI default. The two
   were inconsistent; renamed the top-level workflow to
   "Nightly MLX Two-Way Benchmark" so the GitHub Actions UI tells the
   truth at every level.

2. `pip install "mlx-lm>=0.22.0"` let each nightly run pull whatever
   the latest mlx-lm release was at install time. For a benchmark
   meant to track SMG-side performance trends over time, that drift
   is noise — a faster decode loop in mlx-lm 0.31.4 would look like
   an SMG improvement. Pinned to exact version `mlx-lm==0.31.3`
   (what the failing run actually resolved to and what works with
   the gemma-3-4b-it-qat-4bit checkpoint). Future bumps now happen
   in their own commits, making trend breaks explicit in the bench
   summary diffs rather than implicit in HF/PyPI release timing.

3. Local README install instructions said
   `pip install "mlx-lm>=0.22.0" genai-bench`
   but the workflow knows `oci_openai` is a transitive dep that
   genai-bench imports eagerly — without it, `genai-bench --help`
   crashes on a clean wheel install. The CI workflow already pins
   it explicitly; the README didn't, so a developer following local
   instructions would fail before the bench even started. Updated
   the README to include `oci_openai` and the same pinned mlx-lm
   version as CI, with a brief note explaining why both are needed.

The temporary `pull_request` trigger flagged by CodeRabbit (#1399
review on b068350) is deliberately kept until just before merge — we
need PR-trigger smoke runs while the harness is still being validated.
Will be removed in the final pre-merge commit.

Signed-off-by: key4ng <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a267fc039


Comment on lines +34 to +38
pull_request:
branches: [main]
paths:
- "benchmarks/mlx_grpc_vs_http/**"
- ".github/workflows/nightly-mlx-bench.yml"

P2 Badge Remove temporary PR trigger from nightly benchmark workflow

This workflow is documented as nightly/manual, and the in-file note says the pull_request trigger is temporary and should be removed before merge, but it is still enabled here. Keeping it will run this long benchmark on every matching PR update, which can add hours of CI latency and consume runner capacity/costs for non-nightly validation paths. If this check is required, it can also slow or block normal PR iteration unexpectedly.

Collaborator Author


Acknowledged but keeping for now. The pull_request trigger is path-filtered (only fires on benchmarks/mlx_grpc_vs_http/** or this workflow), the matrix is reduced under pull_request (concurrencies 1 4, duration 4 min), and the bench harness itself is what is being iterated on in this PR — without the trigger, every code change requires a manual workflow_dispatch, which has been the bottleneck during validation. The TEMPORARY note in the YAML is the standing commitment to remove the trigger before merging to main; that removal will be the final commit on this branch.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7f465e837



def collect(results_dir: Path) -> dict[tuple[str, str, int], dict[str, Any]]:
out: dict[tuple[str, str, int], dict[str, Any]] = {}
for sub in sorted(results_dir.iterdir()):

P2 Badge Guard missing results directory before iterating

collect() unconditionally calls results_dir.iterdir(), which raises FileNotFoundError when the benchmark directory was never created. In this workflow the aggregate step is if: always(), so early setup failures (or a run where all benchmark phases are skipped) will crash aggregation instead of emitting a graceful "no results" summary and continuing artifact/log upload steps.

Collaborator Author


Fixed in 81e97b7. Added if not results_dir.is_dir(): return out at the top of collect(). Aggregate runs under if: always(), so this is the safe path when no phase ever created the dir (setup failure, empty PHASES).
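A sketch of that guard in context (the loop body is elided):

from pathlib import Path
from typing import Any

def collect(results_dir: Path) -> dict[tuple[str, str, int], dict[str, Any]]:
    out: dict[tuple[str, str, int], dict[str, Any]] = {}
    if not results_dir.is_dir():
        return out  # no phase created the directory; emit the empty-results summary
    for sub in sorted(results_dir.iterdir()):
        ...  # parse each experiment folder as before
    return out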


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4580bbcf35


Comment thread .github/workflows/nightly-mlx-bench.yml Outdated
RESULTS_DIR: bench-results
# Match the path that vllm-metal's install.sh uses when run via
# curl | bash (it hard-codes $HOME/.venv-vllm-metal in that branch).
VLLM_VENV: /Users/runner/.venv-vllm-metal

P2 Badge Derive VLLM_VENV from runner home directory

Avoid hard-coding VLLM_VENV to /Users/runner/.venv-vllm-metal: the vllm-metal installer writes into ~/.venv-vllm-metal, so on self-hosted macOS runners where $HOME is not /Users/runner, the workflow looks in the wrong path, reports vllm as missing, and silently skips the vLLM phase even when phases includes vllm. This makes the advertised three-way self-hosted run degrade to two-way without a hard failure.

Collaborator Author


Fixed in 81e97b7 by removing the VLLM_VENV env entirely. run.sh already defaults to ${VLLM_VENV:-$HOME/.venv-vllm-metal} (line 43), and the vllm-metal installer writes there too — letting the default resolve at shell-time gets the correct path on both GitHub-hosted (/Users/runner/...) and self-hosted runners.

key4ng added 19 commits May 1, 2026 13:07
Compares two paths into the same MLX inference engine:
1. mlx-lm.server (direct HTTP/OpenAI)
2. SMG router + SMG MLX gRPC servicer

Both end at the same BatchGenerator; the difference is the transport
and the work the router does (Rust tokenization, chat templating,
JSON↔proto conversion).

Files:
- benchmarks/mlx_grpc_vs_http/run.sh — bash orchestrator. Spins up
  mlx-lm HTTP, runs genai-bench across the matrix, tears down, then
  starts SMG MLX servicer + SMG router and runs the same matrix
  against http://router.
- benchmarks/mlx_grpc_vs_http/aggregate.py — parses genai-bench JSON
  output (aggregated_metrics.stats.*) and emits a side-by-side
  markdown table.
- benchmarks/mlx_grpc_vs_http/README.md — local-run instructions and
  what the benchmark measures.
- .github/workflows/nightly-mlx-bench.yml — runs on macos-latest
  (Apple Silicon) on a nightly schedule and via workflow_dispatch
  with overridable inputs (model, concurrencies, scenarios,
  duration). Mirrors the SUMMARY.md to GITHUB_STEP_SUMMARY so the
  table is visible without downloading artifacts.

Defaults match the user's request:
- Model: mlx-community/gemma-3-4b-it-qat-4bit (~3 GB, fits 16 GB Mac)
- Concurrencies: 1, 4, 16, 64
- Scenarios: D(100,256) and D(2000,128)
- 10 min/cell ≈ 2.7 hours total per run

Reuses the genai-bench docker image already used by the existing
nightly-benchmark workflow (same patterns from
e2e_test/benchmarks/conftest.py).

Signed-off-by: key4ng <[email protected]>
… test

PR pushes now run a reduced matrix (2 concurrencies × 1 scenario × 2 min/cell
≈ 12 min) to validate the orchestrator + aggregator end to end before
merge. Path-filtered so only changes to the benchmark files or workflow
itself trigger it.

REMOVE the pull_request: block before merging — production triggers should
only be schedule + workflow_dispatch.

Signed-off-by: key4ng <[email protected]>
Critical fixes:
- macos-latest runners don't ship Docker. Install Colima in the
  workflow so docker pull/run actually work. (claude P1, codex P1,
  coderabbit major).
- --max-time-per-run is in SECONDS, not minutes. Multiply DURATION_MIN
  by 60 in run.sh before passing it to genai-bench, otherwise
  DURATION_MIN=10 produces 10-second cells, not 10-minute cells.
  Confirmed against e2e_test/benchmarks/test_nightly_perf.py:30
  (claude P1).
- --network host doesn't share the host network on Docker for Mac /
  Colima — containers run inside a Linux VM. Drop --network host and
  rewrite 127.0.0.1/localhost in the bench client's base URL to
  host.docker.internal, with --add-host=host.docker.internal:
  host-gateway to make the lookup work (claude P1, coderabbit major).

Minor fixes:
- PIDS array element 'removal' was using parameter substitution which
  replaces the matching value with an empty string instead of dropping
  the element. Rebuild the array properly (claude nit, coderabbit nit).
- aggregate.py: sort json_files for deterministic selection if a
  partial rerun left stale files (coderabbit nit).
- aggregate.py: mkdir parent of --out before writing in case the user
  passes a fresh path (coderabbit nit).

Signed-off-by: key4ng <[email protected]>
Colima boot is unstable on macos-latest GitHub runners (intermittent
'error starting vm: exit status 1' from the lima VZ driver). genai-bench
is a public pip package with the same CLI as the docker entrypoint,
so 'genai-bench benchmark <args>' replaces 'docker run ... benchmark
<args>' verbatim.

Also drops the host.docker.internal URL rewriting (no container = no
VM = 127.0.0.1 reaches the host directly), and the HF cache mount
(genai-bench reads HF_HOME from the same env it inherits).

Workflow:
- Removed 'Install Docker (Colima)' step.
- Replaced 'Pull genai-bench image' with 'Install genai-bench' via pip.
- Verification step also calls 'genai-bench --help' to fail fast if
  the CLI didn't install cleanly.

run.sh is ~30 lines smaller, README updated.

Signed-off-by: key4ng <[email protected]>
genai-bench multiplies the input by 60 internally
(genai_bench/cli/cli.py:326: 'max_time_per_run *= 60'), so passing 120
makes each cell run for 7200 seconds = 2 hours, not 2 minutes.

The misleading comment in e2e_test/benchmarks/test_nightly_perf.py:30
('seconds per scenario×concurrency combo') is what threw me off in
the previous round of review fixes. Reading the actual genai-bench
source confirms minutes is the correct unit.

This was the cause of run #24978016312 sitting at 'in progress' for
74 minutes when it should have completed in ~10 minutes total.

Signed-off-by: key4ng <[email protected]>
Two fixes:

1. run.sh: '${new_pids[@]}: unbound variable' under set -u when phase 1
   finished and the only tracked pid (mlx-lm) was filtered out. The
   array reference produced 'new_pids[@]: unbound variable' → script
   exited silently before phase 2 (gRPC) could run.
   Replaced the rebuild loop with PIDS=() since mlx-lm is the only
   server running during phase 1; resetting PIDS to empty is correct
   and avoids the unbound-array gotcha entirely.

2. workflow: PR-trigger now runs the full matrix (4 conc × 2 scenarios
   × 10 min × 2 phases = ~2.7 h) instead of the smoke matrix, so we
   can collect real comparison numbers on this PR before merge.
   Marked TEMPORARY — switch back to the smoke shape (or remove the
   pull_request trigger) before merging.

Signed-off-by: key4ng <[email protected]>
…ilure

Two independent fixes that together give us partial-but-useful results
even when something falls over:

1. run.sh now accepts PHASE=http|grpc|all (default 'all' for local).
   Each phase is a self-contained function with its own server lifecycle,
   so the workflow can run them as two separate steps. Phase failure no
   longer leaks to the other phase.

2. run_bench wraps the genai-bench call in 'if !'. A failing cell (e.g.
   mlx-lm.server collapsing under c=64 with all 503s) writes a .failed
   marker to the experiment dir and the loop continues. aggregate.py
   already prints '—' for cells with no JSON, so partial results render
   cleanly.

Workflow:
- 'Run benchmark — Phase 1 (direct HTTP)' runs PHASE=http.
- 'Run benchmark — Phase 2 (SMG router + gRPC)' runs PHASE=grpc with
  if: always() so phase 2 still executes if phase 1 errored out.

This unblocks the case from run #25015745343 where phase 1 c=64 took
mlx-lm.server down and the script exited before phase 2 ever started.

Signed-off-by: key4ng <[email protected]>
Per discussion on PR — 'vllm bench serve' isn't usable on macOS
because vllm publishes Linux/CUDA wheels only. Replaces genai-bench
(which only supports synthetic D(in,out) length distributions) with a
small custom client that loads real prompt distributions:

- chat:  anon8231489123/ShareGPT_Vicuna_unfiltered — typical chat
         user prompts, exercises TTFT + decode throughput.
- agent: vdaita/edit_10k_char — real local coding-agent prompts with
         ~2.5k token code context. Models the canonical Mac local-AI
         workload (Cursor/Continue/Cline editing files). Stresses
         prefill + tokenization, where Rust router should win big.

bench_client.py:
- Async httpx + HF datasets, ~250 LOC.
- Output JSON shaped like 'vllm bench --save-result' so we can swap
  to vllm bench upstream if/when mac support lands.
- Per-cell exit non-zero only when *every* request failed (server
  overload), letting run.sh distinguish complete failure from partial.

run.sh:
- prompts_for(scenario, concurrency) scales prompt count so each
  cell has a meaningful number of completed requests at every level
  (50 → conc*4 for chat, 30 → conc*2 for agent — slower prefill).
- Bench cell call replaced with python3 bench_client.py.

aggregate.py:
- Reads result.json from each experiment dir (new shape).
- Renders per-scenario tables (chat, agent) with RPS, output tok/s,
  TTFT p50/p95, TPOT p50, and errors/total.

Workflow:
- pip install httpx[http2] + datasets (replaces genai-bench).
- workflow_dispatch inputs updated for new scenario names.
- Default scenarios = 'chat agent', no DURATION_MIN (driven by
  num_prompts now).

Signed-off-by: key4ng <[email protected]>
cancel-in-progress: false meant new PR pushes queued behind the
running one, and GitHub auto-cancelled stale queued runs when even
newer pushes came in — so a series of fast pushes left the actual
latest code never running.

Switch to cancel-in-progress: true ONLY for PR events. Scheduled
nightlies and manual workflow_dispatch runs still queue normally
(don't kill in-flight nightly runs).

Signed-off-by: key4ng <[email protected]>
The custom bench_client.py + vllm bench experiments produced unstable
TTFT/TPOT numbers for mlx-lm.server (vllm bench saw all SSE chunks
coalesced in one network read on localhost, so per-token ITL was always
~0). Switching to genai-bench — the same harness the rest of the SMG
nightly suite uses — gives clean schema with stats.ttft / stats.tpot /
mean_output_throughput_tokens_per_s, and matches what e2e_test/benchmarks
already reports.

Same change adds vllm-metal as a third backend so the nightly publishes
a real 3-way comparison: mlx-lm direct HTTP vs SMG router + MLX gRPC
servicer vs vllm-project/vllm-metal `vllm serve`.

Changes:
- run.sh: unified 3-phase orchestrator (mlx | grpc | vllm), each opt-in
  via PHASES env. genai-bench replaces the custom client. vllm phase
  auto-skips if VLLM_VENV is missing.
- aggregate.py: parses genai-bench's aggregated_metrics.stats schema and
  emits one comparison table per scenario across all three backends.
- nightly-mlx-bench.yml: best-effort vllm-metal install via the project's
  install.sh into a workspace-local venv; three independent phase steps
  with `if: always()` so a crash in one backend doesn't block the others.
- README.md: documents the new 3-way layout, scenarios, and tunables.
- bench_client.py / aggregate_local.py / prepare_datasets.py / local_data
  removed — replaced by genai-bench.

Scenarios are now genai-bench's deterministic shapes:
- chat = D(100,256)  — short prompt + medium output
- agent = D(2500,256) — RAG / code-edit / Cursor-style local agent traffic

Signed-off-by: key4ng <[email protected]>
… budget

Three fixes for the first 3-way CI run on PR #1399:

1. vllm-metal install path. The project's install.sh hard-codes
   `$HOME/.venv-vllm-metal` when run via `curl | bash` — VLLM_METAL_VENV
   wasn't honored, so we ended up with the venv at the upstream default
   while VLLM_VENV pointed at the workspace. The workflow now drops the
   manual venv creation and matches that hard-coded path; install.sh is
   self-sufficient (auto-installs uv, creates the venv, installs the
   wheel). VLLM_VENV is set to /Users/runner/.venv-vllm-metal.

2. vllm phase no longer aborts the script when `vllm serve` fails to
   come up. We now invoke `$VLLM_VENV/bin/vllm` directly (no `source
   activate` — that way background children inherit the right interpreter
   reliably), and wrap `wait_for_openai` in `if !` so a startup failure
   logs the tail of vllm-metal.log and returns 0 instead of tripping
   `set -e`. Phase 3 stays green even if vllm-metal can't boot.

3. PR runs use 4 min/cell (was 2). At 2 minutes, the
   `D(2500,256) × concurrency=4` cell finished zero requests on the
   macos-latest runner — genai-bench then crashed in its Excel
   post-processing because metrics were empty, leaving the cell as
   `—` in the summary table. 4 minutes gives at least one full
   prefill+decode cycle to complete.

Also documented in the README:
 - mlx-lm.server's TPOT≈0 is a real localhost SSE buffering quirk in
   mlx-lm; the smg→mlx-grpc row in the same scenario shows a real TPOT
   despite hitting the same MLX BatchGenerator underneath. Worth
   highlighting rather than hiding — it's exactly the kind of streaming
   correctness the SMG router gives you.
 - High-concurrency agent cells will render `—` if DURATION_MIN runs
   out before any 2.5k-token prefill completes.

Signed-off-by: key4ng <[email protected]>
… TPOT

The "mlx-lm TPOT is always 0" symptom traced earlier to the wrong cause.
Wire-level probes (raw socket recv + aiohttp.iter_any) BOTH show mlx-lm
streaming per-token SSE chunks at ~18 ms intervals — exactly the cadence
you'd expect for ~50 tok/s decode. mlx-lm and aiohttp are not the bug.

The actual bug is in genai-bench (genai_bench/user/openai_user.py:335):

    for chunk in response.iter_lines(chunk_size=None):

`requests.iter_lines(chunk_size=None)` calls
`urllib3.HTTPResponse.read(amt=None)` under the hood, which reads to EOF
before yielding when the response has neither `Content-Length` nor
`Transfer-Encoding: chunked`. mlx-lm.server uses Python's
`BaseHTTPRequestHandler`, which defaults to HTTP/1.0 and emits neither
header for streaming responses → the bench client buffers the entire
generation and processes all per-token chunks at end-of-stream → TPOT
collapses to ~0 and TTFT inflates to e2e latency.

Local repro after applying the same patch:

    | Metric        | chunk_size=None | chunk_size=1 |
    | TTFT mean     | 5,452 ms        | 486 ms       |
    | TPOT mean     | ~0 ms           | 19.3 ms      |
    | E2E latency   | 5,452 ms        | 2,938 ms     |

19.3 ms TPOT matches the wire-level inter-chunk gap exactly.

The patch is benign for the smg→mlx-grpc and vllm-metal phases — both
emit HTTP/1.1 with `Transfer-Encoding: chunked`, where urllib3 yields
per chunked frame regardless of the chunk_size argument.

Workflow change: a `Patch genai-bench streaming chunk size` step runs
`sed` on the installed openai_user.py right after `pip install
genai-bench`, swapping `chunk_size=None` for `chunk_size=1`. Fails the
job loudly (rather than silently) if the pattern isn't found, so a
genai-bench upgrade that renames or removes the call site won't ship a
bench run with bogus mlx-lm metrics.

README updated to reflect the actual root cause (drops the wrong "SSE
buffering quirk in mlx-lm" wording from the previous fix attempt) and
documents the equivalent local one-liner for off-CI runs.

Signed-off-by: key4ng <[email protected]>
PR #1414 (drain-and-batch fix for the gRPC servicer's BatchGenerator
rope crash) is now on main; rebasing this branch onto it lets the next
CI run finally produce a clean three-way comparison without the c≥4
agent rope corruption that dominated previous runs.

This commit makes two changes on top of that rebase:

1. CI MODEL default → mlx-community/gemma-3-1b-it-4bit (~0.6 GB).
   The previous 4 B QAT model was ~3 GB on disk; combined with
   vllm-metal's ~1.7 GB overhead and the macos-latest runner's
   ~5 GB Metal budget, the KV-cache budget went negative and the
   vllm phase OOM'd at startup every run — we never got vllm-metal
   numbers in CI. The 1 B variant fits comfortably and lets all
   three backends produce metrics. The bench is comparing the
   *same model* across three serving paths, so the relative
   router/streaming comparison (TTFT shape, RPS scaling, TPOT
   correctness) stays meaningful at 1 B; only the absolute
   throughput numbers shift. For production-shaped 4 B+ numbers,
   use local_three_way (run.sh) on a real Mac, which still
   defaults to gemma-3-4b-it-qat-4bit, or override MODEL via
   workflow_dispatch on a beefier runner (self-hosted, or
   macos-latest-xlarge with 14 GB).

2. Tightened the genai-bench chunk_size=1 patch comment in the
   workflow. Previous wording overstated the case as "HTTP/1.0
   buffering bug" — the actual bug class is "no length-delimiting
   framing on the response", and HTTP/1.0 is just one way mlx-lm
   lands in that state (HTTP/1.1 with Connection: close and no
   Content-Length / chunked would reproduce too). Also clarified
   that chunk_size=1 doesn't read 1 byte at a time — urllib3's
   read(amt=N) returns up to N bytes from whatever's currently
   buffered, so it effectively means "yield per recv()". The
   patch is a no-op for the smg→mlx-grpc and vllm-metal phases
   (HTTP/1.1 + Transfer-Encoding: chunked), since urllib3 decodes
   chunk frames before they reach the buffer the iterator reads from.

README updated to document the CI vs local model split as a table
with the rationale, so the next reader doesn't have to dig through
git history to understand why CI numbers are at 1 B.

Signed-off-by: key4ng <[email protected]>
…le variant)

The 1B non-QAT model from the previous commit didn't fix the vllm
phase: vllm-metal silently fell back to PyTorch CPU/fp32 because the
non-QAT 4-bit MLX checkpoints aren't recognized by vllm-metal's Metal
path. Once on the CPU path it tries to allocate
`--gpu-memory-utilization 0.92 × 7 GiB = 6.44 GiB`, but only 3.65 GiB
is free on the runner — startup aborts before the engine reports ready.

Two facts collected from the docs and the failing logs:

1. macos-latest exposes ~5 GB of Metal memory; lowering
   `--gpu-memory-utilization` doesn't help fit a 4 B model (the flag
   gates how much Metal vllm-metal can use, not how much it needs).
   The 4 B QAT
   already failed with a NEGATIVE KV-cache budget at the default
   fraction=0.9; lowering the fraction makes it worse.

2. vllm-metal's Metal path only picks up QAT-format MLX checkpoints
   (`*-qat-4bit`). Regular 4-bit checkpoints silently route through a
   PyTorch CPU/fp32 fallback that's documented as "experimental and
   unreliable in 2026".

The intersection of those two constraints is `gemma-3-1b-it-qat-4bit`
(~0.7 GB on disk): small enough to leave ~2 GB of KV headroom inside
the runner's 5 GB Metal budget, and QAT-format so vllm-metal picks
the Metal path. This is the only CI-viable Metal-backed model on the
macos-latest runner today.

run.sh's local default (gemma-3-4b-it-qat-4bit) is unchanged — real
Mac hardware with ≥16 GB unified RAM happily serves 4B QAT.

README's CI-vs-local model table updated with the QAT requirement,
the gpu-memory-utilization explanation, and a more accurate set of
"why CI uses 1B-QAT" reasons. workflow_dispatch input description
likewise expanded so override callers see the QAT constraint up front.

Signed-off-by: key4ng <[email protected]>
Two changes after another vllm-metal CI failure on the gemma-3-1b-it-
qat-4bit attempt: even with the QAT model, vllm-metal silently fell
back to PyTorch CPU/fp32 (`device_config=cpu`) and aborted at startup
trying to allocate `--gpu-memory-utilization 0.92 × 7 GiB = 6.44 GiB`
in only 3.39 GiB of available RAM. Root cause is structural: the
macos-latest GitHub runner is a VM that doesn't reliably expose Metal
GPU access, so vllm-metal's Metal worker can't initialize. The
PyTorch MPS fallback path is documented as "experimental and
unreliable in 2026". This isn't fixable from our workflow.

1. Drop vllm from the CI default `PHASES`. New default is `mlx grpc`
   (two-way). The workflow_dispatch input default still includes vllm so
   self-hosted M-series runs can opt in. The `Install vllm-metal`
   step now has `if: contains(env.PHASES, 'vllm')` so it doesn't burn
   ~2 min on every CI run downloading wheels that won't get used.
   Local `run.sh` default is still `mlx grpc vllm` — real Apple
   Silicon hardware (≥16 GB unified RAM) runs the three-way fine.

2. Swap MODEL default to `mlx-community/gemma-4-e2b-it-4bit`
   (~3.2 GB). Gemma 4 E2B has day-0 mlx-lm support and is the
   smallest production-quality MLX checkpoint that fits the runner.
   Local `run.sh` default stays at gemma-3-4b-it-qat-4bit for hardware
   that can serve 4B comfortably.

README's defaults table now lists PHASES alongside MODEL, with a
short explanation of why vllm-metal isn't in the CI matrix and how
to opt back into the three-way (local hardware or workflow_dispatch
override on a self-hosted M-series runner). The previous two-paragraph
"why CI uses 1B-QAT" rationale is gone — it was specific to a
constraint that no longer applies once vllm-metal is dropped.

The job name is updated to "MLX two-way (mlx-lm vs SMG+gRPC; vllm-
metal local-only)" so the GitHub Actions UI tells the truth about
what runs in CI vs what's deferred to local hardware.

Signed-off-by: key4ng <[email protected]>
…loadable via mlx_lm.server)

Latest CI run failed the mlx phase with

    ValueError: Received 140 parameters not in model:
      language_model.model.layers.X.self_attn.k_norm.weight, ...

mlx-lm 0.31.x added Gemma 4 support, but only for the *dense* 26B/31B
text variants. The Gemma 4 "Effective" e2b/e4b checkpoints are
multimodal-only — their safetensors carry text + vision + audio tower
weights under `language_model.*`, `vision_tower.*`, `audio_tower.*`
namespaces, and `mlx_lm.server`'s `load()` only instantiates text-only
CausalLM classes. There is no mlx-lm version (current or imminent)
that loads multimodal Gemma 4 checkpoints through the text-only server
path; bridging that to mlx-vlm's multimodal loader is a separate
workstream.

Switching CI MODEL back to gemma-3-4b-it-qat-4bit:

  - Same as local run.sh default (apples-to-apples between CI and
    local), so anyone reading both summaries can compare directly.
  - Fits macos-latest's 7 GB RAM for the mlx and grpc phases (which
    run sequentially, so each gets full RAM).
  - Known-working with the mlx-lm pip pin we already use — no version
    gambling.
  - The merged drain-and-batch fix (PR #1414) eliminates the c≥4
    rope crash that was the real bench-quality concern; with that
    landed, gemma-3-4b QAT produces clean three-way numbers locally
    and clean two-way numbers in CI.

The previous Gemma 4 swap commit (57e8888) was made on the
assumption that mlx-lm's day-0 Gemma 4 line covered all variants;
this is the corrective revert with a docstring explaining the
multimodal-vs-text-only gap so the next reader doesn't try the same.

Signed-off-by: key4ng <[email protected]>
Three issues flagged by bot review on commit 6570420 / 57e8888:

1. Workflow name said "Nightly MLX Three-Way Benchmark" but the job
   name was already "MLX two-way (mlx-lm vs SMG+gRPC; vllm-metal
   local-only)" since vllm was dropped from the CI default. The two
   were inconsistent; renamed the top-level workflow to
   "Nightly MLX Two-Way Benchmark" so the GitHub Actions UI tells the
   truth at every level.

2. `pip install "mlx-lm>=0.22.0"` let each nightly run pull whatever
   the latest mlx-lm release was at install time. For a benchmark
   meant to track SMG-side performance trends over time, that drift
   is noise — a faster decode loop in mlx-lm 0.31.4 would look like
   an SMG improvement. Pinned to exact version `mlx-lm==0.31.3`
   (what the failing run actually resolved to and what works with
   the gemma-3-4b-it-qat-4bit checkpoint). Future bumps now happen
   in their own commits, making trend breaks explicit in the bench
   summary diffs rather than implicit in HF/PyPI release timing.

3. Local README install instructions said
   `pip install "mlx-lm>=0.22.0" genai-bench`
   but genai-bench eagerly imports `oci_openai` (a transitive dep);
   without it, `genai-bench --help` crashes on a clean wheel install.
   The CI workflow already pins
   it explicitly; the README didn't, so a developer following local
   instructions would fail before the bench even started. Updated
   the README to include `oci_openai` and the same pinned mlx-lm
   version as CI, with a brief note explaining why both are needed.

The temporary `pull_request` trigger flagged by CodeRabbit (#1399
review on b068350) is deliberately kept until just before merge — we
need PR-trigger smoke runs while the harness is still being validated.
Will be removed in the final pre-merge commit.

Signed-off-by: key4ng <[email protected]>
…hes reality

Until this commit, the bench summary was hardcoded to render all
three backend rows ("mlx-lm.server", "smg → mlx-grpc", "vllm-metal")
with the heading "MLX Three-Way Benchmark", regardless of which
phases the workflow actually executed. With the CI default now
`PHASES=mlx grpc` (vllm-metal needs Metal GPU access the macos-
latest VM doesn't expose), the summary table looked like vllm-metal
had failed every cell — every vllm-metal row was a column of `—`,
and the document title still said "Three-Way".

The aggregator now infers which backends are present from the result
directories that exist (a sketch follows the list):

- `LABEL_ORDER` keeps mlx → grpc → vllm display order; only labels
  with at least one matching result dir are rendered.
- Title shifts: 1 backend = "Single-backend", 2 = "Two-Way",
  3 = "Three-Way".
- The "Backends:" intro list also drops anything that didn't run, so
  readers see exactly the backends present in the rows below.
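
A minimal sketch of that inference, assuming result directories are
named after the backend key (the directory naming and the exact shape
of LABEL_ORDER here are assumptions beyond what this commit states):

    from pathlib import Path

    # Fixed display order; only backends with at least one result dir render.
    LABEL_ORDER = [
        ("mlx", "mlx-lm.server"),
        ("grpc", "smg → mlx-grpc"),
        ("vllm", "vllm-metal"),
    ]

    def present_backends(results_root: Path) -> list[tuple[str, str]]:
        """(key, label) pairs for backends that actually produced results."""
        if not results_root.is_dir():   # survives a missing results dir
            return []
        return [(key, label) for key, label in LABEL_ORDER
                if any(results_root.glob(f"{key}_*"))]

    def title(backends: list[tuple[str, str]]) -> str:
        kind = {1: "Single-backend", 2: "Two-Way", 3: "Three-Way"}
        return f"MLX {kind.get(len(backends), 'Multi-backend')} Benchmark"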

Verified locally with synthetic data:
- Two-way (mlx + grpc): "MLX Two-Way Benchmark", two rows per cell,
  no vllm-metal artifacts.
- Three-way (mlx + grpc + vllm): "MLX Three-Way Benchmark", three
  rows per cell, behavior unchanged from before.

Signed-off-by: key4ng <[email protected]>
@key4ng force-pushed the keyang/mlx-bench branch from 4580bbc to bfc2af4 on May 1, 2026 at 20:07
Review comment thread on benchmarks/mlx_grpc_vs_http/README.md (outdated):
doesn't reliably expose Apple GPU passthrough — vllm-metal silently
falls back to a PyTorch CPU/fp32 path that allocates ~6 GB and OOMs
at startup on the 7 GB runner. To run the three-way comparison,
either run `local_three_way` on real Apple Silicon, or override

🟡 Nit: local_three_way is referenced here and in two workflow comments (lines 76, 191 of nightly-mlx-bench.yml), but no script, Makefile target, or function by that name exists. The local three-way run is just ./benchmarks/mlx_grpc_vs_http/run.sh with the default PHASES=mlx grpc vllm.

Suggested change:
- either run `local_three_way` on real Apple Silicon, or override
+ either run `./benchmarks/mlx_grpc_vs_http/run.sh` on real Apple Silicon, or override

Collaborator Author (@key4ng) replied:
Fixed in 81e97b7. README.md line 138 now points at ./benchmarks/mlx_grpc_vs_http/run.sh (with default `PHASES=mlx grpc vllm` to make the three-way default explicit). Same in nightly-mlx-bench.yml lines 76 and 191.

…VLLM_VENV portability, doc references

* run.sh: replace `sleep 5` between mlx-grpc port-open and SMG router
  start with `wait_for_log_line "gRPC server listening on"`. The port
  binds before warmup; the log line is the real ready signal (a sketch
  of such a helper follows this list).
* aggregate.py: guard `collect()` against a missing results dir so the
  `if: always()` aggregate step doesn't crash when no phase produced
  output (setup failure, empty PHASES).
* nightly-mlx-bench.yml: stop pinning `VLLM_VENV` to `/Users/runner/...`.
  vllm-metal's installer writes to `$HOME/.venv-vllm-metal` and run.sh
  defaults to the same; hard-coding the GitHub-hosted path silently
  broke the vllm phase on self-hosted runners with a different $HOME.
* README.md + workflow comments: replace `local_three_way` (which
  isn't a real script/target) with the actual `run.sh` invocation.
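
A sketch of such a wait_for_log_line helper (argument order, default
log path, and timeout are assumptions; only the function name and the
ready log line come from the first bullet):

    # Poll a log file until a pattern appears, or fail after a timeout.
    wait_for_log_line() {
      local pattern="$1" logfile="${2:-mlx-grpc.log}" timeout="${3:-180}" waited=0
      until grep -q -- "$pattern" "$logfile" 2>/dev/null; do
        if (( waited >= timeout )); then
          echo "timed out after ${timeout}s waiting for '$pattern' in $logfile" >&2
          return 1
        fi
        sleep 1
        waited=$(( waited + 1 ))
      done
    }

    # e.g. after launching the servicer in the background:
    #   wait_for_log_line "gRPC server listening on" mlx-grpc.log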

Signed-off-by: key4ng <[email protected]>

Labels

ci (CI/CD configuration changes), documentation (Improvements or additions to documentation)
