Add performance regression test suite #97

Merged: prajwal1210 merged 9 commits into axonn-ai:develop from prajwal1210:claude/add-perf-regression-tests-FsB8d on Feb 14, 2026

@prajwal1210 (Collaborator)

Replaces the manual workflow of running examples/infer.py on two branches and eyeballing TTFT/TBT numbers. Tests run a parametrized matrix (batch_size x prompt_length x decode_length), average metrics over multiple iterations after warmup, and compare against stored JSON baselines with a configurable tolerance threshold.
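
As a rough sketch of the comparison described above, assuming lower-is-better latency metrics and a relative tolerance. The names `config_matrix` and `find_regressions`, the metric keys, and the sample numbers are hypothetical and not taken from the PR:

```python
from itertools import product


def config_matrix(batch_sizes, prompt_lengths, decode_lengths):
    """Enumerate every (batch_size, prompt_length, decode_length) combination."""
    return list(product(batch_sizes, prompt_lengths, decode_lengths))


def find_regressions(baseline, current, tolerance=0.05):
    """Return metrics whose current value exceeds the baseline by more than `tolerance`.

    Assumes lower is better (latencies such as TTFT/TBT); a throughput
    metric would need the comparison inverted.
    """
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current or base_value <= 0:
            continue  # nothing meaningful to compare against
        change = (current[name] - base_value) / base_value
        if change > tolerance:
            regressions[name] = change
    return regressions


# Illustrative numbers only, not measurements from the PR.
baseline = {"ttft_ms": 100.0, "tbt_ms": 10.0}
current = {"ttft_ms": 112.0, "tbt_ms": 10.2}
print(config_matrix([1, 8], [128, 512], [64]))
print(find_regressions(baseline, current, tolerance=0.05))
```

With these sample values only `ttft_ms` is flagged, since its 12% increase exceeds the 5% tolerance while the roughly 2% change in `tbt_ms` does not.
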

https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds a comprehensive performance regression test suite for YALIS that automates the process of measuring and comparing performance metrics (TTFT, TBT, E2E latency, and throughput) across code changes.

Changes:

  • Implements parametrized performance tests with configurable batch sizes, prompt lengths, and decode lengths
  • Introduces a baseline management system that stores performance metrics in JSON format with metadata tracking
  • Adds CLI options for baseline updates, tolerance thresholds, and iteration counts for warmup and measurement

Reviewed changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/performance/test_perf_regression.py | Main test file implementing performance regression tests with metric averaging, regression detection, and detailed reporting |
| tests/performance/conftest.py | Pytest configuration with CLI options, BaselineStore for managing JSON baselines, and fixtures for engine, tokenizer, and dataset |
| tests/performance/__init__.py | Empty init file to make the performance directory a Python package |
| tests/performance/baselines/.gitkeep | Placeholder to ensure the baselines directory is tracked in git |

Validate that the stored baseline metadata (model, dtype, attn_backend,
use_paged_kv_caching) matches the current test configuration before
comparing metrics. Raises pytest.UsageError on mismatch to prevent
misleading comparisons across different settings. Fields not yet tracked
in older baselines are gracefully skipped.

Also adds use_paged_kv_caching as an INI-configurable option (default
False) wired through fixture -> InferenceConfig -> LLMEngine.
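
A minimal sketch of the metadata check this commit describes, assuming the stored metadata and the current configuration are both plain dicts. The function name, the `BaselineMismatchError` stand-in (the commit itself raises `pytest.UsageError`), and the sample values are hypothetical:

```python
TRACKED_FIELDS = ("model", "dtype", "attn_backend", "use_paged_kv_caching")


class BaselineMismatchError(Exception):
    """Stand-in for pytest.UsageError so the sketch runs without pytest."""


def check_baseline_metadata(stored, current, fields=TRACKED_FIELDS):
    """Raise if a tracked field present in the stored metadata differs from the current config."""
    mismatches = []
    for field in fields:
        if field not in stored:
            continue  # older baseline that predates this field: skip it
        if stored[field] != current.get(field):
            mismatches.append(
                f"{field}: baseline={stored[field]!r}, current={current.get(field)!r}"
            )
    if mismatches:
        raise BaselineMismatchError("Baseline metadata mismatch: " + "; ".join(mismatches))


# Placeholder values; an older baseline that never recorded attn_backend.
old_baseline = {"model": "example-model", "dtype": "bfloat16"}
config = {
    "model": "example-model",
    "dtype": "bfloat16",
    "attn_backend": "flash",
    "use_paged_kv_caching": False,
}
check_baseline_metadata(old_baseline, config)  # passes: untracked fields are skipped
```

Collecting every mismatch before raising means a single error message can list all differing fields rather than only the first one found.
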

https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
Replace print()-based output (which requires -s / --capture=no) with
pytest's terminal reporter hooks so the numbers are always visible.

- pytest_configure: initialise a session-wide results collector via
  pytest.StashKey
- pytest_terminal_summary: render a formatted table at the end of
  the run, covering both update mode (saved values) and compare mode
  (baseline vs current with % change and !! markers for regressions)
- perf_results fixture: gives tests write access to the collector
- Tests now append structured dicts instead of printing
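
The compare-mode table could be rendered along these lines. This sketch covers only the formatting step, under the assumption that each collected result is a dict with `metric`, `baseline`, and `current` keys; the function name, keys, and column layout are hypothetical:

```python
def format_comparison_rows(rows, tolerance=0.05):
    """Render baseline-vs-current rows, marking regressions with '!!'.

    Assumes lower-is-better metrics and non-zero baseline values.
    """
    lines = [f"{'metric':<12}{'baseline':>12}{'current':>12}{'change':>10}"]
    for row in rows:
        change = (row["current"] - row["baseline"]) / row["baseline"]
        marker = " !!" if change > tolerance else ""
        lines.append(
            f"{row['metric']:<12}{row['baseline']:>12.2f}"
            f"{row['current']:>12.2f}{change:>+10.1%}{marker}"
        )
    return lines


# Illustrative numbers only.
results = [
    {"metric": "ttft_ms", "baseline": 100.0, "current": 112.0},
    {"metric": "tbt_ms", "baseline": 10.0, "current": 10.2},
]
for line in format_comparison_rows(results):
    print(line)
```

Inside a `pytest_terminal_summary(terminalreporter, exitstatus, config)` hook, each line could be emitted with `terminalreporter.write_line(line)`, which is shown regardless of output capture.
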

https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2

prajwal1210 force-pushed the claude/add-perf-regression-tests-FsB8d branch from fd5e13b to e47594b on February 14, 2026, 02:53

Prevents a ZeroDivisionError in _average_metrics when the option is
set to 0.
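
The commit message does not say how the guard is implemented. One possible approach, sketched here with hypothetical names (`validate_iterations`, and `average_metrics` as a stand-in for the PR's `_average_metrics`), is to reject non-positive counts before any measurement runs:

```python
def validate_iterations(value):
    """Reject non-positive iteration counts up front instead of dividing by zero later."""
    count = int(value)
    if count < 1:
        raise ValueError(f"iteration count must be >= 1, got {count}")
    return count


def average_metrics(samples):
    """Average each metric across measured iterations; `samples` is a list of dicts."""
    if not samples:
        raise ValueError("need at least one measured iteration to average")
    return {key: sum(s[key] for s in samples) / len(samples) for key in samples[0]}


# Illustrative numbers only.
print(average_metrics([{"ttft_ms": 100.0}, {"ttft_ms": 110.0}]))
```

Failing at option-parsing time gives a clearer error than a `ZeroDivisionError` raised deep inside a test.
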

https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
Add run_perf_regression_tests.sh following the same pattern as the
existing correctness test scripts, setting up environment variables
and launching via srun. Extra pytest flags (e.g. --perf-update-baselines)
can be passed through the PERF_PYTEST_ARGS env var.

Also add tests/README.md documenting the performance regression
test workflow, configuration options, and available pytest flags.

https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
Only rank 0 populates benchmark entries during --perf-update-baselines,
but every rank was flushing the store on teardown. On multi-GPU runs a
non-zero rank could overwrite the JSON with empty benchmark data. Guard
the flush so only rank 0 writes the file.

https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
cleanup_dist (scope=module) destroys the process group before the
session-scoped baseline_store fixture tears down, so checking
dist.is_initialized() at teardown always returns False on every rank.
Capture whether the process is rank 0 at setup time and use the saved
value during teardown to ensure only rank 0 flushes baselines.

https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2
The previous approach of capturing dist.get_rank() at fixture setup
still fails when the session fixture is created before
init_process_group — dist.is_initialized() is false on every rank,
so is_rank_zero becomes true everywhere, reintroducing the
last-writer-wins race.

Read RANK (set by torchrun/torch.distributed.launch) or SLURM_PROCID
(set by SLURM) instead. These env vars are set by the launcher before
the process starts and survive the entire process lifetime, independent
of the dist init/teardown lifecycle. Falls back to 0 for single-GPU
runs where neither is set.
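
A minimal sketch of the environment-based rank lookup described above. The function names are hypothetical, and checking `RANK` before `SLURM_PROCID` is an assumption about precedence that the commit message does not state:

```python
import os


def launcher_rank(environ=None):
    """Return the process rank from launcher-set env vars, defaulting to 0."""
    env = os.environ if environ is None else environ
    for name in ("RANK", "SLURM_PROCID"):
        value = env.get(name)
        if value is not None:
            return int(value)
    return 0  # single-GPU run: neither variable is set


def should_flush_baselines(environ=None):
    """Only rank 0 should write the baseline JSON file."""
    return launcher_rank(environ) == 0


print(launcher_rank({"SLURM_PROCID": "2"}))
print(should_flush_baselines({}))
```

Because the lookup never touches `torch.distributed`, the result is the same whether it runs before `init_process_group` or after the process group has been destroyed.
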

https://claude.ai/code/session_01W6qwSimLLJTSLXReAQ5Au2

prajwal1210 merged commit 6849713 into axonn-ai:develop on Feb 14, 2026
1 check passed