LLM Serving Benchmark Lab

Artifact-backed LLM serving performance lab for vLLM baselines, official /metrics ingestion, GuideLLM cross-checks, and SGLang/PD experiment scaffolding. Current completed claims are bounded to a saved Modal L40S vLLM run with Qwen/Qwen2.5-1.5B-Instruct and the chat_short workload.

Warning

Status as of 2026-04-23: M3 reporting checkpoint complete. This checkout contains a fresh real vLLM artifact pack at artifacts/m2-qwen-l40s-modal-chat-short-20260423-r2/, a standalone M3 report at artifacts/m2-qwen-l40s-modal-chat-short-20260423-r2/m3_report.md, a concise result summary at artifacts/m2-qwen-l40s-modal-chat-short-20260423-r2/m3_summary.md, and a completed GuideLLM cross-check at artifacts/m2-qwen-l40s-modal-chat-short-20260423-r2/guidellm/. Public claims must stay bounded to one Modal-hosted L40S x1, Qwen/Qwen2.5-1.5B-Instruct, and chat_short workload. The artifact records git_dirty: true, and the GuideLLM cross-check uses a synthetic token summary rather than exact trace replay.

Current status: M3 reporting checkpoint is complete with a stored real baseline, standalone report, concise summary, and saved official-tool cross-check. The next required work order is M4 SGLang + PD baseline.

Highlighted state

Current hero artifact: artifacts/m2-qwen-l40s-modal-chat-short-20260423-r2/ with standalone report, concise summary, and sibling GuideLLM output under artifacts/m2-qwen-l40s-modal-chat-short-20260423-r2/guidellm/.
What remains risky: the stored hero artifact came from a dirty checkout, the GuideLLM cross-check uses a synthetic token summary, and Modal cold start can make a first probe time out before a later clean warm probe passes. The result is only one bounded baseline and does not support performance-win, routing, PD, regression, profiler, or production-readiness claims.
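
The cold-start caveat can be read as a small classification rule. The sketch below is illustrative only and is not the repo's probe code: `ProbeResult` and `classify_probe_sequence` are hypothetical names, and it assumes each probe attempt is recorded as either passing, timing out, or failing for another reason.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """One probe attempt against the endpoint (hypothetical record shape)."""

    ok: bool
    timed_out: bool = False


def classify_probe_sequence(results: list[ProbeResult]) -> str:
    """Classify an ordered list of probe attempts.

    Returns "clean" when every probe passed, "warmup" when earlier timeouts
    are followed by a final passing probe, and "failed" otherwise.
    """
    # Without a final passing probe there is nothing to excuse as warmup.
    if not results or not results[-1].ok:
        return "failed"
    earlier = results[:-1]
    if all(r.ok for r in earlier):
        return "clean"
    # Only timeouts count as cold-start behavior; other failures are real.
    if all(r.ok or r.timed_out for r in earlier):
        return "warmup"
    return "failed"
```

Under this rule a lone timeout is never treated as warmup; it only becomes warmup once a later probe passes.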

Highlighted finding

M2 baseline: On L40S x1 via Modal with Qwen/Qwen2.5-1.5B-Instruct and chat_short, the stored controller run completed 500/500 requests with p50/p95/p99 client latency 0.780 / 1.316 / 1.651 s, 420 official /metrics rows, 920 total metric rows, and no required official metrics missing. The saved GuideLLM cross-check also completed 500/500 requests and reported median TTFT 187.2 ms, p95 TTFT 266.3 ms, and mean throughput 1.21 req/s.
Primary caveat: the GuideLLM cross-check uses a synthetic token summary derived from the repo workload config rather than a byte-for-byte replay of the controller trace; the current hero artifact was generated from a dirty checkout; and Modal cold-start probe timeouts should be treated as warmup behavior only when a subsequent probe passes cleanly.
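
For readers unfamiliar with p50/p95/p99 figures, here is a minimal sketch of one common way to compute them from per-request client latencies, the nearest-rank method. This is an assumption for illustration: the percentile method the repo actually uses is not stated here and may differ (for example, by interpolating), and `nearest_rank_percentile` is a hypothetical helper.

```python
import math


def nearest_rank_percentile(values: list[float], pct: float) -> float:
    """Return the nearest-rank percentile of values for pct in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the smallest value covering pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up latencies in seconds, purely to show the call shape.
sample_latencies = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
p50 = nearest_rank_percentile(sample_latencies, 50)
p95 = nearest_rank_percentile(sample_latencies, 95)
```

With small samples, nearest-rank and interpolating methods can give noticeably different tail values, which is one more reason to compare percentiles only between runs produced by the same tool.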

Reproduce The Hero Artifact

Audit the stored artifact:

uv run lsp validate-artifact artifacts/m2-qwen-l40s-modal-chat-short-20260423-r2

Produce a fresh equivalent run with a non-colliding run id:

RUN_ID=m2-qwen-l40s-modal-chat-short-repro-$(date +%Y%m%d-%H%M%S)
make reproduce RUN=m2-real REPRO_BACKEND=configs/backends/vllm_modal_m2_qwen_l40s.yaml REPRO_WORKLOAD=configs/workloads/chat_short.yaml REPRO_RUN_ID="$RUN_ID"
uv run lsp cross-check-guidellm --backend-config configs/backends/vllm_modal_m2_qwen_l40s.yaml --workload-config configs/workloads/chat_short.yaml --output-dir "artifacts/$RUN_ID/guidellm" --execute

Why this repo exists

The goal is to build a small but serious inference-performance lab that eventually demonstrates:

real backend bring-up for vLLM and SGLang
workload-shaped benchmark discipline
reproducible artifact packs and regression checks
profiler-backed performance investigation
public writeups with bounded, hardware-specific claims

What exists today

Current implemented scope:

Python package and CLI scaffold (lsp)
strict YAML config validation for backend, workload, policy, threshold, and experiment configs
deterministic workload generation for M1 workload families
zero-GPU synthetic fake-run and dry-run benchmark paths for smoke testing repo wiring
real-mode vLLM adapter path with official Prometheus /metrics ingestion and failure artifacts
official metric expectations aligned with current documented vLLM production metrics rather than deprecated throughput gauges
explicit backend hardware metadata capture for real-mode artifact credibility
external base_url target support for Modal or other HTTPS vLLM deployments
external endpoint validation that rejects non-root base_url values such as /v1 and mismatched /metrics targets before the real run
repo-owned vLLM launch-plan rendering from configs/backends/vllm_dev.yaml
endpoint probe for health, runtime metadata, and official metrics exposure
external GuideLLM cross-check scaffolding plus plan/log capture for M2 verification
stored real M2 artifact pack and saved GuideLLM cross-check for a single-GPU Modal L40S baseline
unit and smoke tests
example configs for future milestones
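
As an illustration of what "deterministic workload generation" in the list above can mean in practice, the following sketch derives every request from a seeded local random generator, so the same seed always yields the same workload. All names and ranges here (`SyntheticRequest`, `generate_workload`, the token bounds) are hypothetical and do not reflect the repo's actual schema or the M1 workload families.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticRequest:
    """Hypothetical request shape: an id plus token budgets."""

    request_id: str
    prompt_tokens: int
    max_output_tokens: int


def generate_workload(
    seed: int,
    num_requests: int,
    prompt_range: tuple[int, int] = (32, 256),
    output_range: tuple[int, int] = (16, 128),
) -> list[SyntheticRequest]:
    """Generate a reproducible list of requests from a seed."""
    # A local RNG keeps the result independent of global random state.
    rng = random.Random(seed)
    return [
        SyntheticRequest(
            request_id=f"req-{i:05d}",
            prompt_tokens=rng.randint(*prompt_range),
            max_output_tokens=rng.randint(*output_range),
        )
        for i in range(num_requests)
    ]
```

Calling `generate_workload(7, 20)` twice returns equal lists, which is the property that lets two runs be compared request by request.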

What does not exist yet

To keep claims honest, this repo does not yet provide:

PD comparison studies
regression gate examples
profiler-backed optimization reports
routing studies
public writeups
upstream contributions
a clean-checkout rerun of the current M2 hero artifact for stronger artifact claims

Quickstart

Requirements:

Python 3.12+
uv

Setup:

make install

Verify the current repo state:

make verify-m2
make smoke
make reproduce RUN=m1 REPRO_RUN_ID=demo-m1

Prepare for an external Modal-backed M2 run after filling the placeholder config:

make verify-m2 BACKEND_CONFIG=configs/backends/vllm_modal_example.yaml
make check-m2-readiness BACKEND_CONFIG=configs/backends/vllm_modal_example.yaml
make probe-m2 BACKEND_CONFIG=configs/backends/vllm_modal_example.yaml

Validate a generated artifact directory:

uv run lsp validate-artifact artifacts/<run_id>

Reproduce the current stable aliases:

make reproduce RUN=m0 REPRO_RUN_ID=demo-m0
make reproduce RUN=m1 REPRO_RUN_ID=demo-m1
make reproduce RUN=configs/workloads/sharegpt_like.yaml REPRO_RUN_ID=demo-sharegpt
make reproduce RUN=m2-real REPRO_WORKLOAD=configs/workloads/chat_short.yaml REPRO_RUN_ID=demo-m2-real
make reproduce RUN=m2-real REPRO_BACKEND=configs/backends/vllm_modal_example.yaml REPRO_WORKLOAD=configs/workloads/chat_short.yaml REPRO_RUN_ID=demo-m2-modal
make reproduce RUN=m2-real REPRO_BACKEND=configs/backends/vllm_modal_m2_qwen_l40s.yaml REPRO_WORKLOAD=configs/workloads/chat_short.yaml REPRO_RUN_ID=demo-m2-modal-live

make reproduce RUN=m2-real is a real-mode path. It requires a reachable vLLM endpoint or a host where you can turn the repo launch template into a working local server command. The stable make entrypoint now runs check-m2-readiness first so placeholder Modal URLs, /v1 base URLs, mismatched metrics endpoints, and missing cross-check tooling fail before workload traffic starts. A real M2 claim only becomes valid once the resulting artifact directory is stored and auditable from repo state.

For a Modal-backed M2 run, fill in configs/backends/vllm_modal_example.yaml with the deployed https://...modal.run URL from Modal's official vLLM example, then use REPRO_BACKEND=configs/backends/vllm_modal_example.yaml. Also replace the placeholder hardware block before the real run so the artifact names the tested GPU explicitly. The repo expects base_url to be the endpoint root and now validates that metrics.scrape_endpoint is exactly <base_url>/metrics, because it derives /health, /version, and /v1/completions from that root. Use make check-m2-readiness BACKEND_CONFIG=configs/backends/vllm_modal_example.yaml before the remote run to catch leftover placeholders and missing local tooling such as GuideLLM.
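
The endpoint rules described above can be sketched as follows. This is not the repo's validator: `validate_endpoints` is a hypothetical helper, the error wording is invented, and `example.invalid` is a placeholder host. It only demonstrates the two stated checks, that base_url is an endpoint root and that the metrics endpoint is exactly `<base_url>/metrics`.

```python
from urllib.parse import urlsplit


def validate_endpoints(base_url: str, scrape_endpoint: str) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    errors: list[str] = []
    parts = urlsplit(base_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        errors.append("base_url must be an absolute http(s) URL")
    # A path such as /v1 would break paths derived from the root.
    if parts.path not in ("", "/"):
        errors.append(f"base_url must be the endpoint root, got path {parts.path!r}")
    expected = base_url.rstrip("/") + "/metrics"
    if scrape_endpoint != expected:
        errors.append(f"metrics.scrape_endpoint must be {expected!r}")
    return errors
```

For example, `validate_endpoints("https://example.invalid/v1", "https://example.invalid/metrics")` reports both a non-root base_url and a mismatched metrics endpoint, while the root URL paired with its own `/metrics` returns an empty list.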

Inspect the repo-owned M2 launch and cross-check scaffolding directly:

uv run lsp render-vllm-launch --backend-config configs/backends/vllm_dev.yaml
uv run lsp cross-check-guidellm \
  --backend-config configs/backends/vllm_dev.yaml \
  --workload-config configs/workloads/chat_short.yaml
uv run lsp render-vllm-launch --backend-config configs/backends/vllm_modal_example.yaml
uv run lsp cross-check-guidellm \
  --backend-config configs/backends/vllm_modal_example.yaml \
  --workload-config configs/workloads/chat_short.yaml
uv run lsp check-m2-readiness --backend-config configs/backends/vllm_modal_example.yaml
uv run lsp probe-vllm-target --backend-config configs/backends/vllm_modal_example.yaml

CLI

uv run lsp --help
uv run lsp validate-config configs/backends/vllm_dev.yaml
uv run lsp validate-config configs/backends/vllm_modal_example.yaml
uv run lsp validate-examples
uv run lsp render-vllm-launch --backend-config configs/backends/vllm_dev.yaml
uv run lsp cross-check-guidellm \
  --backend-config configs/backends/vllm_dev.yaml \
  --workload-config configs/workloads/chat_short.yaml
uv run lsp render-vllm-launch --backend-config configs/backends/vllm_modal_example.yaml
uv run lsp cross-check-guidellm \
  --backend-config configs/backends/vllm_modal_example.yaml \
  --workload-config configs/workloads/chat_short.yaml
uv run lsp run \
  --backend-config configs/backends/vllm_dev.yaml \
  --workload-config configs/workloads/mixed_short_long.yaml \
  --output-dir artifacts \
  --run-id demo-dry-run \
  --dry-run
uv run lsp fake-run \
  --backend-config configs/backends/vllm_dev.yaml \
  --workload-config configs/workloads/chat_short.yaml \
  --output-dir artifacts \
  --run-id demo-run
uv run lsp validate-artifact artifacts/demo-run

Modal M2 Path

Use configs/backends/vllm_modal_example.yaml as the starting point for a real external HTTPS target. Fill in the placeholder endpoint and hardware fields first. The shortest repo-owned path is:

make verify-m2 BACKEND_CONFIG=configs/backends/vllm_modal_example.yaml
make check-m2-readiness BACKEND_CONFIG=configs/backends/vllm_modal_example.yaml
make probe-m2 BACKEND_CONFIG=configs/backends/vllm_modal_example.yaml
make reproduce RUN=m2-real REPRO_BACKEND=configs/backends/vllm_modal_example.yaml REPRO_WORKLOAD=configs/workloads/chat_short.yaml REPRO_RUN_ID=<run_id>
uv run lsp cross-check-guidellm \
  --backend-config configs/backends/vllm_modal_example.yaml \
  --workload-config configs/workloads/chat_short.yaml \
  --output-dir artifacts/<run_id>/guidellm \
  --execute

After a fresh real run and completed cross-check exist in artifacts/<run_id>/, keep the run id, command invocation, backend config, workload config, and caveats with the artifact.

Artifact contract

A successful run artifact is expected to contain:

run.json
backend_config_resolved.json
system_info.json
scorecard.json
report.md
requests.parquet
responses.parquet
metrics.parquet
plots/
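
A quick presence check for the entries listed above might look like the sketch below. It is a hypothetical helper, not a substitute for `uv run lsp validate-artifact`: it only confirms that the listed files and the plots/ directory exist and does not inspect their contents.

```python
from pathlib import Path

# Entries copied from the artifact contract list above.
REQUIRED_FILES = (
    "run.json",
    "backend_config_resolved.json",
    "system_info.json",
    "scorecard.json",
    "report.md",
    "requests.parquet",
    "responses.parquet",
    "metrics.parquet",
)
REQUIRED_DIRS = ("plots",)


def missing_artifact_entries(artifact_dir: Path) -> list[str]:
    """Return the contract entries that are absent from artifact_dir."""
    missing = [name for name in REQUIRED_FILES if not (artifact_dir / name).is_file()]
    missing += [f"{name}/" for name in REQUIRED_DIRS if not (artifact_dir / name).is_dir()]
    return missing
```

An empty return value means every expected entry is present; anything else names what is missing, in contract order.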

When the external GuideLLM cross-check is executed into artifacts/<run_id>/guidellm, the repo also persists:

repo_cross_check_plan.json
repo_cross_check_execution.json
repo_cross_check_stdout.log
repo_cross_check_stderr.log

The repo writes real parquet files for request, response, and metrics tables. Nested fields such as maps/lists are serialized into JSON strings inside parquet cells to keep the artifact writer deterministic and lightweight.
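
To make the JSON-in-parquet convention concrete, here is a minimal sketch of flattening a row before it is written. `flatten_row` is a hypothetical helper; the use of sorted keys and compact separators is an assumption chosen so that equal inputs always produce byte-identical strings, and the repo's writer may serialize differently.

```python
import json
from typing import Any


def flatten_row(row: dict[str, Any]) -> dict[str, Any]:
    """Replace nested maps and lists with deterministic JSON strings."""
    flat: dict[str, Any] = {}
    for key, value in row.items():
        if isinstance(value, (dict, list, tuple)):
            # sort_keys makes the output independent of dict insertion order.
            flat[key] = json.dumps(value, sort_keys=True, separators=(",", ":"))
        else:
            flat[key] = value
    return flat
```

A reader of the parquet file recovers the nested value with `json.loads` on the affected cell. The trade-off is that nested fields cannot be filtered or projected by parquet-native tooling without that decode step.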

Development checks

make lint
make format-check
make typecheck
make test
make verify-m2

Repository layout

configs/ — validated example configs
lsp/ — package source for config loading, artifact writing, and CLI behavior
tests/ — unit and smoke coverage
artifacts/ — generated run outputs; most ad hoc runs are gitignored, while the current M2 hero artifact is intentionally tracked

Roadmap

M1 execution is complete. M2 has a fresh artifact-backed baseline. M3 reporting checkpoint is complete. The next required stop is M4 SGLang + PD baseline.

Public-sharing guidance

If you post progress publicly right now, keep the framing honest:

this is an open-source inference-performance lab with one stored real vLLM baseline and one saved GuideLLM cross-check
the safe measured claim today is limited to the stored Modal L40S x1, Qwen/Qwen2.5-1.5B-Instruct, and chat_short artifact
official metrics ingestion follows the current documented core vLLM production metrics
do not generalize the measured latency or throughput beyond that artifact; call out that the artifact records git_dirty: true
do not claim PD, routing, regression, or profiler depth yet

License

MIT — see LICENSE.
