Measure per-call latency; add --concurrency flag #2
Open
tsushanth wants to merge 1 commit into
…s splits
Adds a latency dimension to the benchmark that's currently accuracy-only:
- New module `src/timing.py` wraps `model.transcribe(...)` in
`time.perf_counter()` inside the worker thread, so reported latency
reflects per-call API time (not future queue-wait). Kept dep-free so
tests run without torch/openai/deepgram imports.
- `run` adds `--concurrency N` (default 6, preserves current behavior).
Pin to 1 for clean single-call latency numbers.
- Every predictions CSV now carries a `latency_ms` column.
- `evaluate` emits `latency_ms_{p50,p95,mean}` per split. Dataset-overall
percentiles are computed from the pooled raw samples — averaging
per-split percentiles would understate tail latency.
- README documents the new column + flag with the apples-to-apples
caveat for hosted-model concurrency throttling.
- pytest suite for timing wrapper + percentile sanity (4/4 passing).
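A minimal sketch of the per-call timing idea described in the list above, not the PR's actual code. The name `timed_transcribe` appears in the review summary, but its signature here is an assumption, and `FakeModel` and `run_split` are hypothetical helpers added for illustration. The point is that the clock starts inside the worker, so time a task spends queued behind the concurrency limit is not counted.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_transcribe(model, audio):
    """Return (transcript, latency_ms) for one model.transcribe call (assumed signature)."""
    start = time.perf_counter()
    transcript = model.transcribe(audio)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return transcript, latency_ms


class FakeModel:
    """Hypothetical stand-in for a real STT client; sleeps to simulate API time."""

    def transcribe(self, audio):
        time.sleep(0.02)
        return f"text for {audio}"


def run_split(model, clips, concurrency=6):
    # Timing happens inside the worker thread, so time spent waiting in the
    # executor queue is excluded from latency_ms.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda clip: timed_transcribe(model, clip), clips))


rows = run_split(FakeModel(), [f"clip{i}.wav" for i in range(8)], concurrency=2)
for transcript, latency_ms in rows:
    print(f"{latency_ms:7.1f} ms  {transcript}")
```

With 8 clips and `concurrency=2`, later clips wait for a free worker, yet each reported latency should stay close to the simulated call time rather than growing with queue position.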
Why this matters: for voice-agent workloads (Kalpa Labs' own product
surface), latency is the binding constraint long before WER differences
become user-visible. A benchmark that ranks models on WER alone steers
product teams wrong; latency-aware ranking lets them pick models that
clear the <500ms-to-first-token budget that real-time agents need.
What
Adds a latency dimension to the benchmark, which is currently accuracy-only (WER/CER).
- `src/timing.py`: new dependency-free module wrapping `model.transcribe(...)` in `time.perf_counter()` inside the worker thread. Latency is captured per call (not at the future level), so reported numbers reflect actual API time and not queue wait.
- `run` CLI: new `--concurrency N` flag (default `6`, preserves current behavior). Pin to `1` for clean single-call latency.
- Predictions CSVs: new `latency_ms` column alongside the existing `transcript`/`ground_truth`.
- `evaluate` CLI: emits `latency_ms_p50`, `latency_ms_p95`, `latency_ms_mean` per split. Dataset-overall percentiles are computed from the pooled raw samples across splits, not averaged across per-split percentiles (which would understate tail latency).
- `tests/test_latency.py` with 4 hermetic tests (no model APIs, no datasets). All pass under pytest.
Why
For voice-agent workloads (Kalpa Labs' own product surface), latency is the binding constraint long before WER differences become user-visible. A benchmark that ranks models on WER alone steers product teams wrong; a latency-aware ranking lets them filter to "models that clear the <500ms budget" before optimizing for accuracy within that pool.
The HFT habit of caring about p99 over mean directly applies here — averages hide the cases where a user gets a 4-second pause and disengages.
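To illustrate why pooling raw samples matters for the dataset-overall percentiles, here is a small sketch with made-up numbers. The split names and latencies are hypothetical, and `percentile` is a dependency-free stand-in written for this example (linear interpolation), not the function the PR uses.

```python
def percentile(samples, q):
    """Linear-interpolated percentile of samples, with q in [0, 100]."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


# Hypothetical latencies in ms: a large fast split and a small slow one.
splits = {
    "clean": [100.0] * 90,
    "noisy": [4000.0] * 10,
}

# Averaging per-split p95 values weights both splits equally.
per_split_p95 = {name: percentile(vals, 95) for name, vals in splits.items()}
averaged_p95 = sum(per_split_p95.values()) / len(per_split_p95)

# Pooling keeps every raw sample, so the slow calls appear in the tail.
pooled = [v for vals in splits.values() for v in vals]
pooled_p95 = percentile(pooled, 95)

print(f"averaged p95: {averaged_p95:.0f} ms, pooled p95: {pooled_p95:.0f} ms")
```

In this example the averaged figure is 2050 ms while the pooled p95 is 4000 ms, so the average of per-split percentiles hides the 4-second calls that one in ten users would hit.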
Compatibility
- Predictions CSVs without `latency_ms` are still evaluated; latency columns are simply skipped in the metrics output.
- Uses `numpy` (already transitively present via pandas/torch).
Sample output
Tests
What I deliberately did not include
- A `--warmup N` flag that discards the first N samples from latency stats.
Happy to iterate on naming, additional percentiles (p99, p99.9), or pushing latency into the comparison plots under `results/`.
Note
Low Risk
Benchmarking-only changes with backward-compatible evaluate and default concurrency; no auth or production inference path changes.
Overview
Adds per-call latency to the STT benchmark alongside WER/CER, so runs and evaluation can rank models on speed as well as accuracy.
Run path: New `--concurrency` flag (default 6, unchanged behavior) controls in-flight `transcribe` calls per split. Inference uses `timed_transcribe` in worker threads and writes a `latency_ms` column on each predictions CSV row.
Evaluate path: When `latency_ms` is present, metrics include `latency_ms_p50`, `latency_ms_p95`, and `latency_ms_mean` per split; dataset overall (split=all) latency stats are computed from pooled raw samples (not averaged split percentiles). Old CSVs without latency still evaluate WER/CER only. Console output can show overall p50/p95.
Docs & tests: README covers concurrency caveats for fair latency comparisons (`--concurrency 1`). New dependency-light `src/timing.py` plus `tests/test_latency.py` (fake model, no APIs).
Reviewed by Cursor Bugbot for commit 2981201.