Measure per-call latency; add --concurrency flag #2

Open
tsushanth wants to merge 1 commit into kalpalabs:main from tsushanth:feat/latency-measurement

tsushanth commented Jun 6, 2026

What

Adds a latency dimension to the benchmark, which is currently accuracy-only (WER/CER).

  • src/timing.py — new dependency-free module wrapping model.transcribe(...) in time.perf_counter() inside the worker thread. Latency is captured per-call (not at the future level), so reported numbers reflect actual API time and not queue-wait.
  • run CLI — new --concurrency N flag (default 6, preserves current behavior). Pin to 1 for clean single-call latency.
  • Predictions CSV — every row now carries a latency_ms column alongside the existing transcript / ground_truth.
  • evaluate CLI — emits latency_ms_p50, latency_ms_p95, latency_ms_mean per split. Dataset-overall percentiles are computed from the pooled raw samples across splits, not averaged across per-split percentiles (which would understate tail latency).
  • README — documents the new column, the new flag, and the concurrency caveat for hosted models that throttle.
  • Tests: tests/test_latency.py with 4 hermetic tests (no model APIs, no datasets). All pass under pytest.
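
For reviewers who want the timing idea in miniature, here is a rough sketch of what "timing inside the worker thread" means. It is not the code in src/timing.py: `FakeModel`, `run_split`, and the exact return shape of `timed_transcribe` are assumptions made for illustration. Only the use of time.perf_counter() around model.transcribe(...) and the default concurrency of 6 come from the description above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeModel:
    """Hypothetical stand-in for a real STT client; sleeps to simulate API time."""

    def transcribe(self, audio):
        time.sleep(0.01)
        return f"text:{audio}"


def timed_transcribe(model, audio):
    """Return (transcript, latency_ms), timing only the call itself."""
    start = time.perf_counter()
    text = model.transcribe(audio)  # exceptions propagate to the caller
    latency_ms = (time.perf_counter() - start) * 1000.0
    return text, latency_ms


def run_split(model, audios, concurrency=6):
    """Transcribe a split with at most `concurrency` calls in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # The timer starts inside the worker, so time a task spends
        # queued behind other futures is not counted as latency.
        return list(pool.map(lambda a: timed_transcribe(model, a), audios))


results = run_split(FakeModel(), ["a.wav", "b.wav", "c.wav"], concurrency=1)
for text, latency_ms in results:
    print(text, round(latency_ms, 1))
```

Had the timer been started when the future was submitted, the third call in this run would also include the time spent waiting for the first two, which is the queue-wait the description says is excluded.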

Why

For voice-agent workloads (Kalpa Labs' own product surface), latency is the binding constraint long before WER differences become user-visible. A benchmark that ranks models on WER alone steers product teams wrong; a latency-aware ranking lets them filter to "models that clear the <500ms budget" before optimizing for accuracy within that pool.

The HFT habit of caring about p99 over mean directly applies here — averages hide the cases where a user gets a 4-second pause and disengages.

Compatibility

  • No breaking changes. Old predictions CSVs without latency_ms are still evaluated; latency columns are simply skipped in the metrics output.
  • Default concurrency stays at 6, so existing scripts behave identically.
  • New dep: numpy (already transitively present via pandas/torch).
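
The backward-compatibility behavior can be pictured with a small sketch like the one below. It is an illustration under assumptions, not the evaluate implementation: `read_latencies` is a hypothetical helper, and the two inline CSVs are made-up rows that only reuse the column names mentioned in this PR (transcript, ground_truth, latency_ms).

```python
import csv
import io


def read_latencies(csv_text):
    """Return latency_ms values as floats, or None if the column is absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if "latency_ms" not in (reader.fieldnames or []):
        return None  # old-format file: the caller reports WER/CER only
    return [float(row["latency_ms"]) for row in reader if row["latency_ms"]]


# Made-up rows for illustration only.
old_csv = "transcript,ground_truth\nhello,hello\n"
new_csv = "transcript,ground_truth,latency_ms\nhello,hello,412.5\nhi,hi,388.0\n"

print(read_latencies(old_csv))  # None
print(read_latencies(new_csv))  # [412.5, 388.0]
```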

Sample output

$ stt-bench evaluate --dir inference/gpt-4o-transcribe
Overall WER = 14.32, Overall CER = 6.81, Latency p50 = 487ms, p95 = 1284ms for gpt-4o-transcribe over Fleurs
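
To see why the overall percentiles are pooled instead of averaged across splits, a small worked example may help. The numbers are synthetic, the split names are invented, and `percentile` is a hypothetical nearest-rank helper written with the standard library so the snippet stays self-contained; the PR itself lists numpy as the dependency, and its percentile method may differ.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms: one uniformly fast split, one with a slow tail.
splits = {
    "split_a": [100.0] * 100,
    "split_b": [100.0] * 80 + [2000.0] * 20,
}

per_split_p95 = {name: percentile(vals, 95) for name, vals in splits.items()}
averaged_p95 = sum(per_split_p95.values()) / len(per_split_p95)

pooled = [v for vals in splits.values() for v in vals]
pooled_p95 = percentile(pooled, 95)

print(per_split_p95)  # {'split_a': 100.0, 'split_b': 2000.0}
print(averaged_p95)   # 1050.0
print(pooled_p95)     # 2000.0
```

In this example 10% of all calls take 2000 ms, so the pooled p95 is 2000 ms, while averaging the two per-split p95 values reports 1050 ms and understates the tail.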

Tests

$ python3 -m pytest tests/test_latency.py -v
tests/test_latency.py::test_timed_transcribe_returns_text_and_latency PASSED
tests/test_latency.py::test_timed_transcribe_propagates_exceptions PASSED
tests/test_latency.py::test_latency_percentiles_match_pooled_samples PASSED
tests/test_latency.py::test_predictions_csv_carries_latency_column PASSED
4 passed in 6.23s

What I deliberately did not include

  • No per-model audio-duration normalization (RTF / real-time factor). Would need every dataset's audio length cached; happy to add as a follow-up if useful.
  • No cold-start vs steady-state separation. First call to a model often includes connection + auth handshake; if you care, I can add a --warmup N flag that discards the first N samples from latency stats.
  • No ElevenLabs Scribe / Groq Whisper / AssemblyAI adapters. Have drafts ready; figured a single-purpose PR is easier to review than a model-add + framework-change combo.
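
If the first two follow-ups are wanted, they could look roughly like the sketch below. Neither function exists in this PR: `real_time_factor` and `drop_warmup` are hypothetical names, the RTF definition used is the usual one (processing time divided by audio duration), and `drop_warmup` assumes latencies arrive in call order, which holds cleanly only with --concurrency 1.

```python
def real_time_factor(latency_ms, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return (latency_ms / 1000.0) / audio_seconds


def drop_warmup(latencies_ms, warmup=0):
    """Discard the first `warmup` samples, assumed to be in call order."""
    return list(latencies_ms[warmup:])


print(real_time_factor(500.0, 2.0))            # 0.25
print(drop_warmup([1900.0, 450.0, 470.0], 1))  # [450.0, 470.0]
```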

Happy to iterate on naming, additional percentiles (p99, p99.9), or pushing latency into the comparison plots under results/.


Note

Low Risk
Benchmarking-only changes with backward-compatible evaluate and default concurrency; no auth or production inference path changes.

Overview
Adds per-call latency to the STT benchmark alongside WER/CER, so runs and evaluation can rank models on speed as well as accuracy.

Run path: New --concurrency flag (default 6, unchanged behavior) controls in-flight transcribe calls per split. Inference uses timed_transcribe in worker threads and writes a latency_ms column on each predictions CSV row.

Evaluate path: When latency_ms is present, metrics include latency_ms_p50, latency_ms_p95, and latency_ms_mean per split; dataset overall (split=all) latency stats are computed from pooled raw samples (not averaged split percentiles). Old CSVs without latency still evaluate WER/CER only. Console output can show overall p50/p95.

Docs & tests: README covers concurrency caveats for fair latency comparisons (--concurrency 1). New dependency-light src/timing.py plus tests/test_latency.py (fake model, no APIs).

Reviewed by Cursor Bugbot for commit 2981201. Bugbot is set up for automated code reviews on this repo.

…s splits

Adds a latency dimension to the benchmark that's currently accuracy-only:

- New module `src/timing.py` wraps `model.transcribe(...)` in
  `time.perf_counter()` inside the worker thread, so reported latency
  reflects per-call API time (not future queue-wait). Kept dep-free so
  tests run without torch/openai/deepgram imports.
- `run` adds `--concurrency N` (default 6, preserves current behavior).
  Pin to 1 for clean single-call latency numbers.
- Every predictions CSV now carries a `latency_ms` column.
- `evaluate` emits `latency_ms_{p50,p95,mean}` per split. Dataset-overall
  percentiles are computed from the pooled raw samples — averaging
  per-split percentiles would understate tail latency.
- README documents the new column + flag with the apples-to-apples
  caveat for hosted-model concurrency throttling.
- pytest suite for timing wrapper + percentile sanity (4/4 passing).

Why this matters: for voice-agent workloads (Kalpa Labs' own product
surface), latency is the binding constraint long before WER differences
become user-visible. A benchmark that ranks models on WER alone steers
product teams wrong; latency-aware ranking lets them pick models that
clear the <500ms-to-first-token budget that real-time agents need.