Add performance regression test suite #97
Merged
prajwal1210 merged 9 commits on Feb 14, 2026
Replaces the manual workflow of running examples/infer.py on two branches and eyeballing TTFT/TBT numbers. Tests run a parametrized matrix (batch_size x prompt_length x decode_length), average metrics over multiple iterations after warmup, and compare against stored JSON baselines with a configurable tolerance threshold. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
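For readers who have not opened the test file, the averaging and comparison steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the metric key names, the `LOWER_IS_BETTER` set, the helper names, and the 5% default tolerance are all assumptions.

```python
from statistics import mean

# Assumed metric keys: latencies regress when they grow, throughput when it shrinks.
LOWER_IS_BETTER = {"ttft", "tbt", "e2e_latency"}


def average_metrics(samples, warmup=1):
    """Average each metric over the samples left after dropping warmup runs."""
    measured = samples[warmup:]
    if not measured:
        raise ValueError("need at least one measured iteration after warmup")
    return {key: mean(s[key] for s in measured) for key in measured[0]}


def find_regressions(baseline, current, tolerance=0.05):
    """Return {metric: relative_change} for metrics worse than the tolerance."""
    regressions = {}
    for name, base in baseline.items():
        change = (current[name] - base) / base
        # Flip the sign for higher-is-better metrics so "worse" is always positive.
        worse = change if name in LOWER_IS_BETTER else -change
        if worse > tolerance:
            regressions[name] = change
    return regressions
```

A test for one cell of the matrix would then fail only when `find_regressions` returns a non-empty dict, so improvements and noise within the tolerance pass.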
Pull request overview
This PR adds a comprehensive performance regression test suite for YALIS that automates the process of measuring and comparing performance metrics (TTFT, TBT, E2E latency, and throughput) across code changes.
Changes:
- Implements parametrized performance tests with configurable batch sizes, prompt lengths, and decode lengths
- Introduces a baseline management system that stores performance metrics in JSON format with metadata tracking
- Adds CLI options for baseline updates, tolerance thresholds, and iteration counts for warmup and measurement
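As a rough picture of the baseline management described in the list above, a JSON-backed store might look like the sketch below. Only the `BaselineStore` name comes from this PR; the file layout (`metadata` plus `benchmarks`), the method names, and the `case_id` key format are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def case_id(batch_size, prompt_length, decode_length):
    """Build a stable key for one cell of the test matrix (assumed format)."""
    return f"bs{batch_size}_p{prompt_length}_d{decode_length}"


class BaselineStore:
    """Hypothetical JSON store: run metadata plus one metrics dict per case."""

    def __init__(self, path, metadata):
        self.path = Path(path)
        self.data = {"metadata": dict(metadata), "benchmarks": {}}
        if self.path.exists():
            # An existing file wins over the freshly supplied metadata.
            self.data = json.loads(self.path.read_text())

    def get(self, key):
        return self.data["benchmarks"].get(key)

    def set(self, key, metrics):
        self.data["benchmarks"][key] = dict(metrics)

    def flush(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.data, indent=2, sort_keys=True))


# Example round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    store = BaselineStore(Path(tmp) / "baselines" / "demo.json", {"model": "demo-model"})
    store.set(case_id(1, 128, 32), {"ttft": 10.0})
    store.flush()
    reloaded = BaselineStore(store.path, {})
    roundtrip = reloaded.get(case_id(1, 128, 32))
```

Keeping the metadata next to the numbers is what lets a later run check that a baseline was recorded under the same configuration before comparing against it.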
Reviewed changes
Copilot reviewed 2 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/performance/test_perf_regression.py | Main test file implementing performance regression tests with metric averaging, regression detection, and detailed reporting |
| tests/performance/conftest.py | Pytest configuration with CLI options, BaselineStore for managing JSON baselines, and fixtures for engine, tokenizer, and dataset |
| tests/performance/`__init__.py` | Empty `__init__.py` file to make the performance directory a Python package |
| tests/performance/baselines/.gitkeep | Placeholder to ensure baselines directory is tracked in git |
Validate that the stored baseline metadata (model, dtype, attn_backend, use_paged_kv_caching) matches the current test configuration before comparing metrics. Raises pytest.UsageError on mismatch to prevent misleading comparisons across different settings. Fields not yet tracked in older baselines are gracefully skipped. Also adds use_paged_kv_caching as an INI-configurable option (default False) wired through fixture -> InferenceConfig -> LLMEngine. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
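The check described in this commit could be sketched along these lines. The four field names are taken from the commit message; the function name, the message format, and the choice to return a list rather than raise directly are assumptions for illustration.

```python
# Field names come from the commit message; everything else is illustrative.
TRACKED_FIELDS = ("model", "dtype", "attn_backend", "use_paged_kv_caching")


def metadata_mismatches(stored, current, fields=TRACKED_FIELDS):
    """List mismatches between stored baseline metadata and the current config.

    Fields absent from the stored metadata (older baselines) are skipped.
    """
    problems = []
    for field in fields:
        if field not in stored:
            continue  # not tracked when this baseline was written
        if stored[field] != current.get(field):
            problems.append(
                f"{field}: baseline={stored[field]!r}, current={current.get(field)!r}"
            )
    return problems
```

In a fixture, a non-empty result would then be turned into `raise pytest.UsageError("\n".join(problems))`, which stops the run before any metric is compared.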
Replace print()-based output (which requires -s / --capture=no) with pytest's terminal reporter hooks so the numbers are always visible.
- pytest_configure: initialise a session-wide results collector via pytest.StashKey
- pytest_terminal_summary: render a formatted table at the end of the run, covering both update mode (saved values) and compare mode (baseline vs current with % change and !! markers for regressions)
- perf_results fixture: gives tests write access to the collector
- Tests now append structured dicts instead of printing

https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
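To make the compare-mode table concrete, here is a sketch of the row formatting only, kept free of pytest imports so it can be read in isolation. The shape of each result dict (`case`, `baseline`, `current`, `regressions`) and the column widths are assumptions, not the structure used in this PR.

```python
def format_compare_rows(results):
    """Render one line per (case, metric) with % change and a !! regression marker."""
    lines = [
        f"{'case':<20} {'metric':<12} {'baseline':>10} {'current':>10} {'change':>8}"
    ]
    for entry in results:
        for metric, current in entry["current"].items():
            base = entry["baseline"][metric]
            pct = (current - base) / base * 100.0
            marker = " !!" if metric in entry.get("regressions", ()) else ""
            lines.append(
                f"{entry['case']:<20} {metric:<12} {base:>10.3f} "
                f"{current:>10.3f} {pct:>+7.1f}%{marker}"
            )
    return lines
```

In `conftest.py`, `pytest_terminal_summary(terminalreporter, exitstatus, config)` could read the collected dicts from `config.stash` under the `pytest.StashKey` created in `pytest_configure` and pass each returned line to `terminalreporter.write_line`, which prints regardless of output capture.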
fd5e13b to e47594b
Prevents a ZeroDivisionError in _average_metrics when the option is set to 0. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
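The commit message does not say which option is guarded or how, so the following only shows one possible approach: rejecting values below 1 at parse time. The flag name `--iterations` and the helper `positive_int` are hypothetical, and plain `argparse` is used here so the sketch runs on its own; pytest's `parser.addoption` accepts the same kind of `type=` callable.

```python
import argparse


def positive_int(value):
    """argparse-style type that rejects values below 1."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError(f"expected an integer >= 1, got {value!r}")
    return number


parser = argparse.ArgumentParser()
# "--iterations" is a hypothetical flag name used only for this sketch.
parser.add_argument("--iterations", type=positive_int, default=3)
```

Validating at parse time reports the bad value before any model is loaded, rather than failing later inside the averaging helper.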
Add run_perf_regression_tests.sh following the same pattern as the existing correctness test scripts, setting up environment variables and launching via srun. Extra pytest flags (e.g. --perf-update-baselines) can be passed through the PERF_PYTEST_ARGS env var. Also add tests/README.md documenting the performance regression test workflow, configuration options, and available pytest flags. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
Only rank 0 populates benchmark entries during --perf-update-baselines, but every rank was flushing the store on teardown. On multi-GPU runs a non-zero rank could overwrite the JSON with empty benchmark data. Guard the flush so only rank 0 writes the file. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
cleanup_dist (scope=module) destroys the process group before the session-scoped baseline_store fixture tears down, so checking dist.is_initialized() at teardown always returns False on every rank. Capture whether the process is rank 0 at setup time and use the saved value during teardown to ensure only rank 0 flushes baselines. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
The previous approach of capturing dist.get_rank() at fixture setup still fails when the session fixture is created before init_process_group — dist.is_initialized() is false on every rank, so is_rank_zero becomes true everywhere, reintroducing the last-writer-wins race. Read RANK (set by torchrun/torch.distributed.launch) or SLURM_PROCID (set by SLURM) instead. These env vars are set by the launcher before the process starts and survive the entire process lifetime, independent of the dist init/teardown lifecycle. Falls back to 0 for single-GPU runs where neither is set. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
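The environment-variable lookup described in this commit can be sketched as below. The variable names `RANK` and `SLURM_PROCID` and the fallback to 0 come from the commit message; the function names and the choice to check `RANK` first are assumptions.

```python
import os


def launcher_rank(environ=None):
    """Return the global rank set by the launcher, or 0 when none is set."""
    env = os.environ if environ is None else environ
    # Assumed precedence: RANK (torchrun) first, then SLURM_PROCID (SLURM).
    for name in ("RANK", "SLURM_PROCID"):
        value = env.get(name)
        if value is not None:
            return int(value)
    return 0  # single-GPU run with no launcher


def is_rank_zero(environ=None):
    """True when this process should be the one writing baselines."""
    return launcher_rank(environ) == 0
```

Because the result depends only on the process environment, the session-scoped fixture can call `is_rank_zero()` at teardown and get the same answer it would have got at setup, whether or not the process group still exists.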