
test(app): add PR0.2 perf comparator gate #608

Merged

Astro-Han merged 4 commits into dev from codex/pr0-perf-gate on May 13, 2026

Conversation

@Astro-Han (Owner) commented May 13, 2026

Summary

  • add PR0.2 perf comparator rules plus focused tests for the delta, catastrophic, warn-only, and missing-scenario cases (the gate rule is sketched below)
  • tag perf runs as base/head from env and add explicit cooldown between the existing 3 runs in each scenario
  • extend the non-required perf workflow to run base and head on the same runner, compare their JSON outputs, and upload base/head/compare artifacts
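
For reviewers, here is a minimal TypeScript sketch of the gate rule named in the first bullet. The metric names interaction_ms_worst and interaction_ms_worst_delta appear in the review below; the threshold values, the trimmed summary shape, and the function name compareScenario are illustrative assumptions, not the shipped constants.

```ts
// Sketch only: PerfScenarioSummary is trimmed to two fields, and both
// threshold values are placeholders, not the constants in perf-metrics.ts.
interface PerfScenarioSummary {
  scenario: string;
  interaction_ms_worst: number;
}

interface PerfScenarioComparison {
  scenario: string;
  pass: boolean;
  failures: string[]; // hard-fail checks (delta + catastrophic)
  warnings: string[]; // warn-only checks (Web Vitals "good" lines)
}

const DELTA_BUDGET_MS = 50; // assumed: head may regress at most this much
const CATASTROPHIC_WORST_MS = 500; // assumed: absolute ceiling, delta or not

function compareScenario(
  base: PerfScenarioSummary,
  head: PerfScenarioSummary,
): PerfScenarioComparison {
  const failures: string[] = [];
  const warnings: string[] = [];

  // Delta rule: regressing past the budget is a hard failure.
  if (head.interaction_ms_worst - base.interaction_ms_worst > DELTA_BUDGET_MS) {
    failures.push("interaction_ms_worst_delta");
  }

  // Catastrophic rule: an absolute breach fails even with zero delta.
  if (head.interaction_ms_worst > CATASTROPHIC_WORST_MS) {
    failures.push("interaction_ms_worst");
  }

  // Web Vitals style metrics would push onto `warnings` here, never `failures`.

  return { scenario: head.scenario, pass: failures.length === 0, failures, warnings };
}
```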

Why

PR0.1 only produced baseline artifacts. PR0.2 needs the first real regression gate before any UI rewrite PR can ship, but it stays narrow: no new scenarios, no trace capture yet, and no probe semantic rewrite.

Related Issue

#600

Human Review Status

Pending. A human should make the final merge decision after reviewing the final diff and verification evidence.

Review Focus

  • comparator thresholds match the locked PR0.2 rule: delta + catastrophic hard fail, Web Vitals good lines warn-only
  • workflow really runs base and head on the same runner and compares the generated JSON files
  • run tagging and cooldown do not change the 4-scenario surface from PR0.1

Risk Notes

Low to medium. This changes perf CI behavior and can fail the non-required workflow when the comparator sees a regression, but it does not touch product runtime code or add new user-facing surfaces. Trace-on-failure stays deferred to PR0.3 by design.

How To Verify

  • Focused tests: bun test --preload ./happydom.ts src/testing/perf-metrics.test.ts -> 6 passed
  • Perf scenarios: bun test:e2e:local:perf -> 4 passed
  • Typecheck: bun typecheck -> ok
  • Comparator dry-run: bun packages/app/script/compare-perf.ts --base .../pr0.1-baseline.json --head .../pr0.1-baseline.json -> pass=true, warnings only (the flow this exercises is sketched below)
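
For context, the comparator dry-run above exercises roughly the flow below. This is a sketch, not the actual script: the --base/--head flags match the command line above, while the stubbed comparator and the JSON shape are assumptions.

```ts
import { readFileSync } from "node:fs";

// Hypothetical stub: the real comparePerfBaselines is exported from
// packages/app/src/testing/perf-metrics.ts and is not reproduced here.
function comparePerfBaselines(base: unknown[], head: unknown[]) {
  return { pass: true, failures: [] as string[], warnings: [] as string[] };
}

// Read "--base <path>" / "--head <path>" straight from argv (no validation
// here, unlike the script's readArg helper discussed in the review below).
const basePath = process.argv[process.argv.indexOf("--base") + 1];
const headPath = process.argv[process.argv.indexOf("--head") + 1];

const base = JSON.parse(readFileSync(basePath, "utf8")) as unknown[];
const head = JSON.parse(readFileSync(headPath, "utf8")) as unknown[];

const result = comparePerfBaselines(base, head);
console.log(JSON.stringify(result, null, 2));

// A failed comparison flips the exit code so the CI step fails.
if (!result.pass) process.exitCode = 1;
```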

Screenshots or Recordings

Not applicable. No visible UI changes.

Checklist

  • Human review status is stated above as pending, approved, or not required
  • I linked the related issue, or stated why there is no issue
  • This PR has type, primary area, and priority labels, or I requested maintainer labeling
  • I described the review focus and any meaningful risks
  • I listed the relevant verification steps and the key result for each
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope
  • I manually checked visible UI or copy changes when needed, with screenshots or recordings
  • I considered macOS and Windows impact for platform, packaging, updater, signing, paths, shell, or permissions changes
  • I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant
  • I reviewed the final diff for unrelated changes and suspicious dependency changes
  • I am targeting dev, and my PR title and commit messages use Conventional Commits in English

Summary by CodeRabbit

  • New Features

    • Implemented automated performance baseline comparison to detect regressions and validate performance thresholds.
    • Enhanced CI/CD pipeline with performance metrics analysis and automated failure detection across builds.
  • Tests

    • Added comprehensive test coverage for performance scenario comparisons and metric validation.


@coderabbitai bot commented May 13, 2026

Warning

Rate limit exceeded

@Astro-Han has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 24 minutes and 31 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6c5cac9b-328a-471b-97b4-0fb5d6dbae6f

📥 Commits

Reviewing files that changed from the base of the PR and between 183108e and 914eaba.

📒 Files selected for processing (3)
  • .github/workflows/perf-probe-baseline.yml
  • packages/app/script/compare-perf.ts
  • packages/app/src/testing/perf-metrics.test.ts
📝 Walkthrough

This PR introduces a performance baseline comparison system: new comparison utilities with configurable thresholds, a CLI script that orchestrates baseline comparison, E2E test instrumentation for branch tracking and cooldown, and a restructured CI workflow that captures dual baselines (base and head) and invokes the comparison to detect regressions.

Changes

Performance baseline comparison

  • packages/app/src/testing/perf-metrics.ts (comparison types and logic foundation): Introduces PerfScenarioComparison and PerfBaselineComparison result types; defines threshold constants for delta/ratio and absolute regression detection; implements comparePerfScenarioSummaries to evaluate individual scenario metrics against thresholds, and comparePerfBaselines to align arrays of scenario summaries by name, report failures for missing scenarios, and aggregate pass/fail status.
  • packages/app/src/testing/perf-metrics.test.ts (comparison validation tests): Adds Bun test cases covering median regression budget breaches, catastrophic absolute threshold detection, Web Vitals warn-only behavior under the PR0.2 config, and missing-scenario detection when the head baseline lacks a scenario from base.
  • packages/app/script/compare-perf.ts (compare-perf CLI script): Implements a Bun/Node entrypoint that parses --base, --head, and optional --output arguments, reads JSON perf scenario arrays, invokes comparePerfBaselines, optionally writes the comparison output to disk, logs a summary to stdout, and sets exit code 1 on failure.
  • packages/app/e2e/perf/perf-probe.spec.ts (E2E branch tracking and cooldown): Adds the PAWWORK_PERF_BRANCH environment variable for dynamic branch labeling; introduces a cooldownAfterRun helper for post-run animation-frame settling and timeout; applies the cooldown between the first two iterations in the four baseline scenarios (homepage-cold, session-streaming-long, tool-call-expand, session-scroll-reading). A sketch of this helper follows the list.
  • .github/workflows/perf-probe-baseline.yml (dual-baseline CI orchestration): Triggers on compare-perf.ts changes; restructures the job into two separate checkouts (head and base); sets a shared PERF_ARTIFACT_DIR; installs and runs the perf probes on both, writing perf-head.json and perf-base.json; invokes the compare-perf CLI to generate perf-compare.json; and updates the artifact upload to include baseline reports and test results from both workspaces.
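
A plausible shape for the cooldownAfterRun helper named in the E2E row above. The PR only states that the helper settles animation frames and then waits a fixed timeout, so the double requestAnimationFrame trick and the 500 ms default are assumptions.

```ts
import type { Page } from "@playwright/test";

// Sketch: wait for the frame loop to go idle, then pad with a fixed delay so
// the next measured run does not inherit work from the previous one.
// `settleMs` is an assumed default, not the value used in perf-probe.spec.ts.
async function cooldownAfterRun(page: Page, settleMs = 500): Promise<void> {
  await page.evaluate(
    () =>
      new Promise<void>((resolve) => {
        // Two chained rAF callbacks ensure at least one full frame has
        // rendered before the timeout starts counting.
        requestAnimationFrame(() => requestAnimationFrame(() => resolve()));
      }),
  );
  await page.waitForTimeout(settleMs);
}
```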

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

  • Astro-Han/pawwork#600: Implements the perf-gated regression CI features (comparison utilities, compare-perf script, dual-baseline workflow, and E2E instrumentation) described in the issue.

Possibly related PRs

  • Astro-Han/pawwork#607: Builds directly on this PR's E2E instrumentation changes to the same perf-probe.spec.ts and perf-probe-baseline.yml workflow.

Suggested labels

ci, app, enhancement, P2

Poem

🐰 A rabbit hops through metrics deep,
Comparing baselines that we keep—
From head to base, the thresholds gleam,
Regression gates fulfill the dream!
No slowdowns sneak past watchful eyes,
Performance truth will never lie.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4)

  • Title check: the title 'test(app): add PR0.2 perf comparator gate' clearly and specifically describes the main change, introducing a performance comparator gate for PR0.2 verification.
  • Description check: the description covers all major template sections: summary of changes, rationale (Why), related issue (#600), human review status (Pending), review focus, risk notes, verification steps with results, and checklist items.
  • Linked Issues check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: skipped because no linked issues were found for this pull request.


@github-actions github-actions Bot left a comment


Suggested priority: P2 (includes user-path files: packages/app/src/testing/perf-metrics.test.ts, packages/app/src/testing/perf-metrics.ts).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

@github-actions github-actions Bot added the ci (Continuous integration / GitHub Actions) and app (Application behavior and product flows) labels on May 13, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/app/src/testing/perf-metrics.test.ts (1)

166-209: ⚡ Quick win

Test title doesn't match test data.

The test title states "even when there is no regression delta", but the test data shows a 390ms regression (base: 120ms, head: 510ms), which exceeds the 50ms delta threshold. This means both the delta check (interaction_ms_worst_delta) and the catastrophic check (interaction_ms_worst) would trigger, though the test only asserts the catastrophic failure is present.

To better isolate the catastrophic check, use test data where base and head are equal but both exceed the catastrophic threshold (e.g., base: 510, head: 510).

📝 Suggested fix to align test data with title
       base: {
         branch: "base",
         scenario: "tool-call-expand",
         runs: 3,
         interaction_ms_median: 120,
-        interaction_ms_worst: 120,
+        interaction_ms_worst: 510,
         interaction_ms: 120,
         interaction_delay_ms: 12,
         long_task_count: 1,
         long_task_max_ms: 90,
         tbt_ms: 36,
         frame_gap_p95_ms: 32,
         frame_gap_max_ms: 90,
         jank_count_50ms: 1,
         cls: 0.01,
         window_ms: 900,
         run_details: [],
       },

This change ensures both base and head are at 510ms, demonstrating that the catastrophic threshold triggers even with zero delta.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/app/src/testing/perf-metrics.test.ts` around lines 166 - 209, The
test titled "fails a scenario on catastrophic absolute thresholds even when
there is no regression delta" uses differing interaction_ms_worst values (base
120, head 510) which creates a regression delta; update the test data in the
test case that calls comparePerfScenarioSummaries so base.interaction_ms_worst
and head.interaction_ms_worst are equal and both above the catastrophic
threshold (e.g., set both to 510) to ensure only the catastrophic absolute
threshold triggers while keeping the assertions (expect(result.pass).toBe(false)
and expect(result.failures).toContain("interaction_ms_worst")) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/app/script/compare-perf.ts`:
- Around line 5-9: The readArg function treats the next argv token as a value
even when that token is another flag; update readArg to validate the returned
value by ensuring process.argv[index + 1] exists and is not a flag token (e.g.,
startsWith('-') or '--') before returning it, and return undefined if the next
token is missing or looks like a flag; preserve the current signature and use
the same process.argv lookup in readArg to perform this check.

---

Nitpick comments:
In `@packages/app/src/testing/perf-metrics.test.ts`:
- Around line 166-209: The test titled "fails a scenario on catastrophic
absolute thresholds even when there is no regression delta" uses differing
interaction_ms_worst values (base 120, head 510) which creates a regression
delta; update the test data in the test case that calls
comparePerfScenarioSummaries so base.interaction_ms_worst and
head.interaction_ms_worst are equal and both above the catastrophic threshold
(e.g., set both to 510) to ensure only the catastrophic absolute threshold
triggers while keeping the assertions (expect(result.pass).toBe(false) and
expect(result.failures).toContain("interaction_ms_worst")) unchanged.
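
One way to implement the readArg guard the inline comment above describes. The original function body is not shown in this thread, so the sketch keeps the stated signature and argv lookup and adds only the flag check.

```ts
// Returns the value following a flag such as "--base", or undefined when the
// flag is absent, has no following token, or is followed by another flag.
function readArg(name: string): string | undefined {
  const index = process.argv.indexOf(name);
  if (index === -1) return undefined;
  const value = process.argv[index + 1];
  // A missing token or something that looks like a flag ("-x" / "--x")
  // means the caller passed no value.
  if (value === undefined || value.startsWith("-")) return undefined;
  return value;
}
```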

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 24ba7a0f-fe5a-4249-9dd2-edf3abb93932

📥 Commits

Reviewing files that changed from the base of the PR and between 0c012e5 and 183108e.

📒 Files selected for processing (5)
  • .github/workflows/perf-probe-baseline.yml
  • packages/app/e2e/perf/perf-probe.spec.ts
  • packages/app/script/compare-perf.ts
  • packages/app/src/testing/perf-metrics.test.ts
  • packages/app/src/testing/perf-metrics.ts

Comment thread packages/app/script/compare-perf.ts

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces performance comparison utilities, including a new script for comparing performance baselines and updated test logic to validate performance metrics, and adds a cooldown mechanism to the end-to-end performance tests to improve stability. The review suggested reducing the baseline comparison loop from O(N*M) to O(N) by using a Set, plus defensive checks for safer data handling; a sketch of that alignment follows.
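
For concreteness, the O(N) alignment could look like the sketch below. Only the scenario-name keying comes from the walkthrough; the ScenarioSummary shape and the helper name alignBaselines are assumptions. A Map is used rather than a bare Set so the matched head summary stays available for the metric comparison.

```ts
interface ScenarioSummary {
  scenario: string;
  [metric: string]: string | number;
}

// Index head summaries by scenario name once, so each base lookup is O(1)
// and the whole alignment pass is O(N + M) instead of O(N * M).
function alignBaselines(base: ScenarioSummary[], head: ScenarioSummary[]) {
  const headByName = new Map<string, ScenarioSummary>(
    head.map((summary) => [summary.scenario, summary]),
  );
  const missing: string[] = [];
  const pairs: Array<{ base: ScenarioSummary; head: ScenarioSummary }> = [];

  for (const baseSummary of base) {
    const headSummary = headByName.get(baseSummary.scenario);
    if (!headSummary) {
      // A scenario present in base but absent in head is reported as missing
      // (a hard failure under the PR0.2 rules).
      missing.push(baseSummary.scenario);
      continue;
    }
    pairs.push({ base: baseSummary, head: headSummary });
  }
  return { pairs, missing };
}
```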

Comment thread packages/app/src/testing/perf-metrics.ts
@Astro-Han Astro-Han merged commit d3e4e1b into dev May 13, 2026
24 checks passed
@Astro-Han Astro-Han deleted the codex/pr0-perf-gate branch May 13, 2026 14:30

Labels

app (Application behavior and product flows), ci (Continuous integration / GitHub Actions)
