Measure per-call latency; add --concurrency flag by tsushanth · Pull Request #2 · kalpalabs/stt-bench

tsushanth · 2026-06-06T15:25:18Z

What

Adds a latency dimension to the benchmark, which is currently accuracy-only (WER/CER).

src/timing.py — new dependency-free module wrapping model.transcribe(...) in time.perf_counter() inside the worker thread. Latency is captured per-call (not at the future level), so reported numbers reflect actual API time and not queue-wait.
run CLI — new --concurrency N flag (default 6, preserves current behavior). Pin to 1 for clean single-call latency.
Predictions CSV — every row now carries a latency_ms column alongside the existing transcript / ground_truth.
evaluate CLI — emits latency_ms_p50, latency_ms_p95, latency_ms_mean per split. Dataset-overall percentiles are computed from the pooled raw samples across splits, not averaged across per-split percentiles (which would understate tail latency).
README — documents the new column, the new flag, and the concurrency caveat for hosted models that throttle.
Tests — tests/test_latency.py with 4 hermetic tests (no model APIs, no datasets). All pass under pytest.
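
To illustrate the per-call vs. future-level distinction above, here is a minimal sketch. It is not the code in src/timing.py: `FakeModel`, `run`, and the single-argument `transcribe(audio)` signature are assumptions made for the example. Only the idea of wrapping the call in `time.perf_counter()` inside the worker comes from this PR.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeModel:
    """Hypothetical stand-in for an STT adapter; sleeps instead of calling an API."""

    def transcribe(self, audio):
        time.sleep(0.05)
        return f"text for {audio}"


def timed_transcribe(model, audio):
    """Return (text, latency_ms). The clock runs only around the call itself."""
    start = time.perf_counter()
    text = model.transcribe(audio)  # exceptions propagate to the caller unchanged
    return text, (time.perf_counter() - start) * 1000.0


def run(clips, concurrency):
    """Return (text, per_call_ms, future_ms) per clip for comparison."""
    model = FakeModel()
    submitted = time.perf_counter()
    rows = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_transcribe, model, c) for c in clips]
        for fut in futures:
            text, per_call_ms = fut.result()
            # Measured from submit time, so it also counts time spent queued.
            future_ms = (time.perf_counter() - submitted) * 1000.0
            rows.append((text, per_call_ms, future_ms))
    return rows
```

With `run(["a", "b", "c"], concurrency=1)`, each per-call figure stays near the 50 ms sleep, while the future-level figure for the last clip also includes the two calls queued ahead of it.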

Why

For voice-agent workloads (Kalpa Labs' own product surface), latency becomes the binding constraint long before WER differences are user-visible. A benchmark that ranks models on WER alone can mislead product teams; a latency-aware ranking lets them first filter to "models that clear the <500ms budget" and then optimize for accuracy within that pool.

The HFT habit of caring about p99 over the mean applies directly here: averages hide the cases where a user gets a 4-second pause and disengages.

Compatibility

No breaking changes. Old predictions CSVs without latency_ms are still evaluated; latency columns are simply skipped in the metrics output.
Default concurrency stays at 6, so existing scripts behave identically.
New dep: numpy (already transitively present via pandas/torch).
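
The backward-compatibility rule above can be sketched as follows. This is an illustration only: `read_latencies` is a hypothetical helper, not a function from this PR, and it uses the standard-library `csv` module so the example stays self-contained, whatever the evaluate code uses internally.

```python
import csv
import io


def read_latencies(csv_text):
    """Return latency_ms values as floats, or None if the column is absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "latency_ms" not in reader.fieldnames:
        return None  # old-format file: the caller reports WER/CER only
    # Skip rows where the latency cell is empty or missing.
    return [float(row["latency_ms"]) for row in reader if row["latency_ms"]]


old_csv = "transcript,ground_truth\nhello,hello\n"
new_csv = "transcript,ground_truth,latency_ms\nhello,hello,412.5\nhi,hi,388.0\n"
```

Here `read_latencies(old_csv)` returns `None`, signalling that latency metrics should be skipped, while `read_latencies(new_csv)` returns the two recorded values.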

Sample output

$ stt-bench evaluate --dir inference/gpt-4o-transcribe
Overall WER = 14.32, Overall CER = 6.81, Latency p50 = 487ms, p95 = 1284ms for gpt-4o-transcribe over Fleurs

Tests

$ python3 -m pytest tests/test_latency.py -v
tests/test_latency.py::test_timed_transcribe_returns_text_and_latency PASSED
tests/test_latency.py::test_timed_transcribe_propagates_exceptions PASSED
tests/test_latency.py::test_latency_percentiles_match_pooled_samples PASSED
tests/test_latency.py::test_predictions_csv_carries_latency_column PASSED
4 passed in 6.23s

What I deliberately did not include

No per-model audio-duration normalization (RTF / real-time factor). Would need every dataset's audio length cached; happy to add as a follow-up if useful.
No cold-start vs steady-state separation. First call to a model often includes connection + auth handshake; if you care, I can add a --warmup N flag that discards the first N samples from latency stats.
No ElevenLabs Scribe / Groq Whisper / AssemblyAI adapters. Have drafts ready; figured a single-purpose PR is easier to review than a model-add + framework-change combo.

Happy to iterate on naming, additional percentiles (p99, p99.9), or pushing latency into the comparison plots under results/.

Note

Low Risk
Benchmarking-only changes with backward-compatible evaluate and default concurrency; no auth or production inference path changes.

Overview
Adds per-call latency to the STT benchmark alongside WER/CER, so runs and evaluation can rank models on speed as well as accuracy.

Run path: New --concurrency flag (default 6, unchanged behavior) controls in-flight transcribe calls per split. Inference uses timed_transcribe in worker threads and writes a latency_ms column on each predictions CSV row.
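
A minimal sketch of how such a flag might be declared, assuming an `argparse`-based CLI (the actual CLI framework is not stated here). `build_parser` and `parse_concurrency` are hypothetical names, and the lower-bound check is an addition for the example rather than documented behavior.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="run")
    parser.add_argument(
        "--concurrency",
        type=int,
        default=6,  # matches the default described in this PR
        help="maximum in-flight transcribe calls per split",
    )
    return parser


def parse_concurrency(argv):
    args = build_parser().parse_args(argv)
    if args.concurrency < 1:
        # Assumption: reject values that cannot size a worker pool.
        raise SystemExit("--concurrency must be >= 1")
    return args.concurrency
```

`parse_concurrency([])` yields the default of 6, and `parse_concurrency(["--concurrency", "1"])` yields 1, the setting recommended for clean single-call latency.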

Evaluate path: When latency_ms is present, metrics include latency_ms_p50, latency_ms_p95, and latency_ms_mean per split; dataset overall (split=all) latency stats are computed from pooled raw samples (not averaged split percentiles). Old CSVs without latency still evaluate WER/CER only. Console output can show overall p50/p95.
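
The difference between pooling raw samples and averaging per-split percentiles can be seen in a small constructed case. The split names and latency values below are invented for illustration, and `percentile` is a hand-written linear-interpolation helper (the same definition as numpy's default method) so the sketch has no dependencies. It is not the evaluate implementation.

```python
def percentile(samples, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(samples)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Hypothetical data: a large fast split and a small slow one.
splits = {
    "en": [300.0] * 90,
    "sw": [2000.0] * 10,
}

# Average of the per-split p95 values.
averaged_p95 = sum(percentile(v, 95) for v in splits.values()) / len(splits)

# p95 of all raw samples pooled together.
pooled = [x for v in splits.values() for x in v]
pooled_p95 = percentile(pooled, 95)
```

In this case `averaged_p95` is 1150.0 while `pooled_p95` is 2000.0, so the averaged figure understates the tail. With other split sizes the two can also diverge in the opposite direction, which is another reason to pool.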

Docs & tests: README covers concurrency caveats for fair latency comparisons (--concurrency 1). New dependency-light src/timing.py plus tests/test_latency.py (fake model, no APIs).

Reviewed by Cursor Bugbot for commit 2981201.

…s splits

Adds a latency dimension to the benchmark that's currently accuracy-only:

- New module `src/timing.py` wraps `model.transcribe(...)` in `time.perf_counter()` inside the worker thread, so reported latency reflects per-call API time (not future queue-wait). Kept dep-free so tests run without torch/openai/deepgram imports.
- `run` adds `--concurrency N` (default 6, preserves current behavior). Pin to 1 for clean single-call latency numbers.
- Every predictions CSV now carries a `latency_ms` column.
- `evaluate` emits `latency_ms_{p50,p95,mean}` per split. Dataset-overall percentiles are computed from the pooled raw samples; averaging per-split percentiles would understate tail latency.
- README documents the new column + flag with the apples-to-apples caveat for hosted-model concurrency throttling.
- pytest suite for timing wrapper + percentile sanity (4/4 passing).

Why this matters: for voice-agent workloads (Kalpa Labs' own product surface), latency is the binding constraint long before WER differences become user-visible. A benchmark that ranks models on WER alone steers product teams wrong; latency-aware ranking lets them pick models that clear the <500ms-to-first-token budget that real-time agents need.

tsushanth wants to merge 1 commit into kalpalabs:main from tsushanth:feat/latency-measurement
