Add performance regression test suite by prajwal1210 · Pull Request #97 · axonn-ai/yalis

prajwal1210 · 2026-02-13T17:23:35Z

Replaces the manual workflow of running examples/infer.py on two branches and eyeballing TTFT/TBT numbers. Tests run a parametrized matrix (batch_size x prompt_length x decode_length), average metrics over multiple iterations after warmup, and compare against stored JSON baselines with a configurable tolerance threshold.

https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
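A minimal sketch of the matrix-plus-tolerance idea described above, not the code from this PR. The matrix values, the `case_key` and `find_regressions` names, the metric names, and the 5% default tolerance are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical matrix values; the PR does not list the actual sizes.
BATCH_SIZES = [1, 8]
PROMPT_LENGTHS = [128, 1024]
DECODE_LENGTHS = [32]

# Assumption: throughput is the only metric where a larger value is better;
# TTFT, TBT and E2E latency are treated as "lower is better".
HIGHER_IS_BETTER = {"throughput"}


def case_key(batch_size, prompt_length, decode_length):
    """Build a stable key identifying one cell of the test matrix."""
    return f"bs{batch_size}_p{prompt_length}_d{decode_length}"


def find_regressions(baseline, current, tolerance=0.05):
    """Return the metrics that got worse by more than `tolerance` (a fraction)."""
    regressions = []
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # nothing to compare against
        change = (current[name] - base) / base
        if name in HIGHER_IS_BETTER:
            change = -change  # a drop in throughput counts as "worse"
        if change > tolerance:
            regressions.append(name)
    return regressions


# Every combination of the three axes becomes one test case.
cases = [case_key(*combo) for combo in product(BATCH_SIZES, PROMPT_LENGTHS, DECODE_LENGTHS)]
```

In a pytest suite the same matrix would typically be expressed with stacked `pytest.mark.parametrize` decorators, with the assertion failing whenever `find_regressions` returns a non-empty list.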

Copilot

Pull request overview

This PR adds a comprehensive performance regression test suite for YALIS that automates the process of measuring and comparing performance metrics (TTFT, TBT, E2E latency, and throughput) across code changes.

Changes:

- Implements parametrized performance tests with configurable batch sizes, prompt lengths, and decode lengths
- Introduces a baseline management system that stores performance metrics in JSON format with metadata tracking
- Adds CLI options for baseline updates, tolerance thresholds, and iteration counts for warmup and measurement

Reviewed changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated 2 comments.

File	Description
tests/performance/test_perf_regression.py	Main test file implementing performance regression tests with metric averaging, regression detection, and detailed reporting
tests/performance/conftest.py	Pytest configuration with CLI options, BaselineStore for managing JSON baselines, and fixtures for engine, tokenizer, and dataset
tests/performance/__init__.py	Empty __init__.py file to make the performance directory a Python package
tests/performance/baselines/.gitkeep	Placeholder to ensure baselines directory is tracked in git

Validate that the stored baseline metadata (model, dtype, attn_backend, use_paged_kv_caching) matches the current test configuration before comparing metrics. Raises pytest.UsageError on mismatch to prevent misleading comparisons across different settings. Fields not yet tracked in older baselines are gracefully skipped. Also adds use_paged_kv_caching as an INI-configurable option (default False) wired through fixture -> InferenceConfig -> LLMEngine. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
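The metadata check described in this commit could look roughly like the sketch below. The function name, the `BaselineMismatch` exception, and the dictionary layout are assumptions; the commit message says the real code raises `pytest.UsageError`, which is replaced here by a local exception so the sketch runs without pytest installed.

```python
# Fields named in the commit message as part of the baseline metadata.
TRACKED_FIELDS = ("model", "dtype", "attn_backend", "use_paged_kv_caching")


class BaselineMismatch(Exception):
    """Stand-in for pytest.UsageError (assumption: keeps the sketch stdlib-only)."""


def validate_metadata(stored, current):
    """Raise if any field recorded in the baseline differs from the current config."""
    mismatches = []
    for field in TRACKED_FIELDS:
        if field not in stored:
            continue  # older baseline that did not track this field: skip it
        if stored[field] != current.get(field):
            mismatches.append(
                f"{field}: baseline={stored[field]!r}, current={current.get(field)!r}"
            )
    if mismatches:
        raise BaselineMismatch("baseline metadata mismatch: " + "; ".join(mismatches))
```

Skipping absent fields instead of failing keeps baselines written before a field was introduced usable, at the cost of not catching a mismatch on that field until the baseline is regenerated.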

Replace print()-based output (which requires -s / --capture=no) with pytest's terminal reporter hooks so the numbers are always visible.
- pytest_configure: initialise a session-wide results collector via pytest.StashKey
- pytest_terminal_summary: render a formatted table at the end of the run, covering both update mode (saved values) and compare mode (baseline vs current with % change and !! markers for regressions)
- perf_results fixture: gives tests write access to the collector
- Tests now append structured dicts instead of printing
https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
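As an illustration of the compare-mode table, here is a small sketch of how collected result dicts might be turned into rows with a percentage change and a `!!` marker. The dict keys (`case`, `metric`, `baseline`, `current`, `higher_is_better`), the column widths, and the function name are assumptions, not the PR's actual format. The function is kept free of pytest so it can run standalone; in a real `pytest_terminal_summary` hook each returned line would be passed to the terminal reporter's `write_line`.

```python
def format_compare_rows(results, tolerance=0.05):
    """Render collected results as fixed-width text lines, flagging regressions."""
    header = f"{'case':<20}{'metric':<12}{'baseline':>10}{'current':>10}{'change':>9}"
    lines = [header]
    for r in results:
        pct = (r["current"] - r["baseline"]) / r["baseline"] * 100.0
        # For "higher is better" metrics a negative change is the bad direction.
        worse = -pct if r.get("higher_is_better") else pct
        marker = " !!" if worse > tolerance * 100.0 else ""
        lines.append(
            f"{r['case']:<20}{r['metric']:<12}"
            f"{r['baseline']:>10.2f}{r['current']:>10.2f}{pct:>+8.1f}%{marker}"
        )
    return lines
```

Keeping the formatting in a pure function makes it easy to test separately from the hook that prints it.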

Validate that --perf-measure-iters is at least 1. Prevents a ZeroDivisionError in _average_metrics when the option is set to 0. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
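The failure mode is easy to see in a sketch of warmup-then-average measurement. The names `average_metrics`, `run_case`, and `measure_once` are hypothetical and only mirror the description in this PR; the real `_average_metrics` may differ.

```python
def average_metrics(samples):
    """Average each metric across a non-empty list of per-iteration dicts."""
    if not samples:
        # Without this guard, dividing by len(samples) == 0 raises ZeroDivisionError.
        raise ValueError("need at least one measured iteration")
    return {key: sum(s[key] for s in samples) / len(samples) for key in samples[0]}


def run_case(measure_once, warmup_iters, measure_iters):
    """Discard warmup runs, then average the metrics of the measured runs."""
    if measure_iters < 1:
        raise ValueError("--perf-measure-iters must be at least 1")
    for _ in range(warmup_iters):
        measure_once()  # result intentionally ignored
    return average_metrics([measure_once() for _ in range(measure_iters)])
```

Rejecting the value up front gives a clear error message at option-parsing time rather than a division error deep inside a test.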

Add run_perf_regression_tests.sh following the same pattern as the existing correctness test scripts, setting up environment variables and launching via srun. Extra pytest flags (e.g. --perf-update-baselines) can be passed through the PERF_PYTEST_ARGS env var. Also add tests/README.md documenting the performance regression test workflow, configuration options, and available pytest flags. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2

Only rank 0 populates benchmark entries during --perf-update-baselines, but every rank was flushing the store on teardown. On multi-GPU runs a non-zero rank could overwrite the JSON with empty benchmark data. Guard the flush so only rank 0 writes the file. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2

cleanup_dist (scope=module) destroys the process group before the session-scoped baseline_store fixture tears down, so checking dist.is_initialized() at teardown always returns False on every rank. Capture whether the process is rank 0 at setup time and use the saved value during teardown to ensure only rank 0 flushes baselines. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2

The previous approach of capturing dist.get_rank() at fixture setup still fails when the session fixture is created before init_process_group — dist.is_initialized() is false on every rank, so is_rank_zero becomes true everywhere, reintroducing the last-writer-wins race. Read RANK (set by torchrun/torch.distributed.launch) or SLURM_PROCID (set by SLURM) instead. These env vars are set by the launcher before the process starts and survive the entire process lifetime, independent of the dist init/teardown lifecycle. Falls back to 0 for single-GPU runs where neither is set. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
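The env-var approach described in this commit might look like the following sketch. The function names and the choice to check `RANK` before `SLURM_PROCID` are assumptions; the commit message only states that one of the two is read, with a fallback to 0.

```python
import os


def launcher_rank(environ=None):
    """Return the process rank as set by the launcher, or 0 if none is set."""
    env = os.environ if environ is None else environ
    # Assumption: RANK (torchrun) takes precedence over SLURM_PROCID (SLURM).
    for var in ("RANK", "SLURM_PROCID"):
        value = env.get(var)
        if value is not None:
            return int(value)
    return 0  # single-GPU run: neither variable is set


def is_rank_zero(environ=None):
    """True only for the process that should flush baselines to disk."""
    return launcher_rank(environ) == 0
```

Because the value comes from the environment rather than from `torch.distributed`, it gives the same answer at fixture setup and at teardown, regardless of when the process group is initialised or destroyed.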

prajwal1210 requested a review from Copilot February 13, 2026 17:23

Copilot started reviewing on behalf of prajwal1210 February 13, 2026 17:24

Copilot AI reviewed Feb 13, 2026

Comment thread tests/performance/test_perf_regression.py

Comment thread tests/performance/test_perf_regression.py

claude added 2 commits February 13, 2026 18:06

prajwal1210 force-pushed the claude/add-perf-regression-tests-FsB8d branch from fd5e13b to e47594b February 14, 2026 02:53

claude added 6 commits February 14, 2026 02:54

Validate --perf-measure-iters is at least 1

140b0e5

Prevents a ZeroDivisionError in _average_metrics when the option is set to 0. https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2

Fix black formatting for rank env var lookup

a34d951

https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2

prajwal1210 merged commit 6849713 into axonn-ai:develop Feb 14, 2026
1 check passed

prajwal1210 merged 9 commits into axonn-ai:develop from prajwal1210:claude/add-perf-regression-tests-FsB8d

3 participants
