
test(app): add PR0.2 perf comparator gate #608

Merged

Astro-Han merged 4 commits into dev from codex/pr0-perf-gate on May 13, 2026

Conversation

@Astro-Han (Owner) commented May 13, 2026

Summary

  • add PR0.2 perf comparator rules plus focused tests for the delta, catastrophic, warn-only, and missing-scenario cases (the gate rule is sketched below)
  • tag perf runs as base/head from env and add explicit cooldown between the existing 3 runs in each scenario
  • extend the non-required perf workflow to run base and head on the same runner, compare their JSON outputs, and upload base/head/compare artifacts
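
For reviewers, here is a minimal TypeScript sketch of the gate rule named in the first bullet. The metric names interaction_ms_worst and interaction_ms_worst_delta appear in the review below; the threshold values, the trimmed summary shape, and the function name compareScenario are illustrative assumptions, not the shipped constants.

```ts
// Sketch only: PerfScenarioSummary is trimmed to two fields, and both
// threshold values are placeholders, not the constants in perf-metrics.ts.
interface PerfScenarioSummary {
  scenario: string;
  interaction_ms_worst: number;
}

interface PerfScenarioComparison {
  scenario: string;
  pass: boolean;
  failures: string[]; // hard-fail checks (delta + catastrophic)
  warnings: string[]; // warn-only checks (Web Vitals "good" lines)
}

const DELTA_BUDGET_MS = 50; // assumed: head may regress at most this much
const CATASTROPHIC_WORST_MS = 500; // assumed: absolute ceiling, delta or not

function compareScenario(
  base: PerfScenarioSummary,
  head: PerfScenarioSummary,
): PerfScenarioComparison {
  const failures: string[] = [];
  const warnings: string[] = [];

  // Delta rule: regressing past the budget is a hard failure.
  if (head.interaction_ms_worst - base.interaction_ms_worst > DELTA_BUDGET_MS) {
    failures.push("interaction_ms_worst_delta");
  }

  // Catastrophic rule: an absolute breach fails even with zero delta.
  if (head.interaction_ms_worst > CATASTROPHIC_WORST_MS) {
    failures.push("interaction_ms_worst");
  }

  // Web Vitals style metrics would push onto `warnings` here, never `failures`.

  return { scenario: head.scenario, pass: failures.length === 0, failures, warnings };
}
```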

Why

PR0.1 only produced baseline artifacts. PR0.2 needs the first real regression gate before any UI rewrite PR can ship, but it stays narrow: no new scenarios, no trace capture yet, and no probe semantic rewrite.

Related Issue

#600

Human Review Status

Pending. A human should make the final merge decision after reviewing the final diff and verification evidence.

Review Focus

  • comparator thresholds match the locked PR0.2 rule: delta + catastrophic hard fail, Web Vitals good lines warn-only
  • workflow really runs base and head on the same runner and compares the generated JSON files
  • run tagging and cooldown do not change the 4-scenario surface from PR0.1

Risk Notes

Low to medium. This changes perf CI behavior and can fail the non-required workflow when the comparator sees a regression, but it does not touch product runtime code or add new user-facing surfaces. Trace-on-failure stays deferred to PR0.3 by design.

How To Verify

  • Focused tests: bun test --preload ./happydom.ts src/testing/perf-metrics.test.ts -> 6 passed
  • Perf scenarios: bun test:e2e:local:perf -> 4 passed
  • Typecheck: bun typecheck -> ok
  • Comparator dry-run: bun packages/app/script/compare-perf.ts --base .../pr0.1-baseline.json --head .../pr0.1-baseline.json -> pass=true, warnings only (the flow this exercises is sketched below)
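
For context, the comparator dry-run above exercises roughly the flow below. This is a sketch, not the actual script: the --base/--head flags match the command line above, while the stubbed comparator and the JSON shape are assumptions.

```ts
import { readFileSync } from "node:fs";

// Hypothetical stub: the real comparePerfBaselines is exported from
// packages/app/src/testing/perf-metrics.ts and is not reproduced here.
function comparePerfBaselines(base: unknown[], head: unknown[]) {
  return { pass: true, failures: [] as string[], warnings: [] as string[] };
}

// Read "--base <path>" / "--head <path>" straight from argv (no validation
// here, unlike the script's readArg helper discussed in the review below).
const basePath = process.argv[process.argv.indexOf("--base") + 1];
const headPath = process.argv[process.argv.indexOf("--head") + 1];

const base = JSON.parse(readFileSync(basePath, "utf8")) as unknown[];
const head = JSON.parse(readFileSync(headPath, "utf8")) as unknown[];

const result = comparePerfBaselines(base, head);
console.log(JSON.stringify(result, null, 2));

// A failed comparison flips the exit code so the CI step fails.
if (!result.pass) process.exitCode = 1;
```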

Screenshots or Recordings

Not applicable. No visible UI changes.

Checklist

  • Human review status is stated above as pending, approved, or not required
  • I linked the related issue, or stated why there is no issue
  • This PR has type, primary area, and priority labels, or I requested maintainer labeling
  • I described the review focus and any meaningful risks
  • I listed the relevant verification steps and the key result for each
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope
  • I manually checked visible UI or copy changes when needed, with screenshots or recordings
  • I considered macOS and Windows impact for platform, packaging, updater, signing, paths, shell, or permissions changes
  • I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant
  • I reviewed the final diff for unrelated changes and suspicious dependency changes
  • I am targeting dev, and my PR title and commit messages use Conventional Commits in English

Summary by CodeRabbit

  • New Features

    • Implemented automated performance baseline comparison to detect regressions and validate performance thresholds.
    • Enhanced CI/CD pipeline with performance metrics analysis and automated failure detection across builds.
  • Tests

    • Added comprehensive test coverage for performance scenario comparisons and metric validation.


@coderabbitai bot commented May 13, 2026

Warning

Rate limit exceeded

@Astro-Han has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 24 minutes and 31 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6c5cac9b-328a-471b-97b4-0fb5d6dbae6f

📥 Commits

Reviewing files that changed from the base of the PR and between 183108e and 914eaba.

📒 Files selected for processing (3)
  • .github/workflows/perf-probe-baseline.yml
  • packages/app/script/compare-perf.ts
  • packages/app/src/testing/perf-metrics.test.ts
📝 Walkthrough

This PR introduces a performance baseline comparison system: new comparison utilities with configurable thresholds, a CLI script that orchestrates baseline comparison, E2E test instrumentation for branch tracking and cooldown, and a restructured CI workflow that captures dual baselines (base and head) and invokes the comparison to detect regressions.

Changes

Performance baseline comparison

  • packages/app/src/testing/perf-metrics.ts (comparison types and logic foundation): Introduces PerfScenarioComparison and PerfBaselineComparison result types; defines threshold constants for delta/ratio and absolute regression detection; implements comparePerfScenarioSummaries to evaluate individual scenario metrics against thresholds, and comparePerfBaselines to align arrays of scenario summaries by name, report failures for missing scenarios, and aggregate pass/fail status.
  • packages/app/src/testing/perf-metrics.test.ts (comparison validation tests): Adds Bun test cases covering median regression budget breaches, catastrophic absolute threshold detection, Web Vitals warn-only behavior under the PR0.2 config, and missing-scenario detection when the head baseline lacks a scenario from base.
  • packages/app/script/compare-perf.ts (compare-perf CLI script): Implements a Bun/Node entrypoint that parses --base, --head, and optional --output arguments, reads JSON perf scenario arrays, invokes comparePerfBaselines, optionally writes the comparison output to disk, logs a summary to stdout, and sets exit code 1 on failure.
  • packages/app/e2e/perf/perf-probe.spec.ts (E2E branch tracking and cooldown): Adds the PAWWORK_PERF_BRANCH environment variable for dynamic branch labeling; introduces a cooldownAfterRun helper for post-run animation-frame settling and timeout; applies the cooldown between the first two iterations in the four baseline scenarios (homepage-cold, session-streaming-long, tool-call-expand, session-scroll-reading). A sketch of this helper follows the list.
  • .github/workflows/perf-probe-baseline.yml (dual-baseline CI orchestration): Triggers on compare-perf.ts changes; restructures the job into two separate checkouts (head and base); sets a shared PERF_ARTIFACT_DIR; installs and runs the perf probes on both, writing perf-head.json and perf-base.json; invokes the compare-perf CLI to generate perf-compare.json; and updates the artifact upload to include baseline reports and test results from both workspaces.
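
A plausible shape for the cooldownAfterRun helper named in the E2E row above. The PR only states that the helper settles animation frames and then waits a fixed timeout, so the double requestAnimationFrame trick and the 500 ms default are assumptions.

```ts
import type { Page } from "@playwright/test";

// Sketch: wait for the frame loop to go idle, then pad with a fixed delay so
// the next measured run does not inherit work from the previous one.
// `settleMs` is an assumed default, not the value used in perf-probe.spec.ts.
async function cooldownAfterRun(page: Page, settleMs = 500): Promise<void> {
  await page.evaluate(
    () =>
      new Promise<void>((resolve) => {
        // Two chained rAF callbacks ensure at least one full frame has
        // rendered before the timeout starts counting.
        requestAnimationFrame(() => requestAnimationFrame(() => resolve()));
      }),
  );
  await page.waitForTimeout(settleMs);
}
```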

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

  • Astro-Han/pawwork#600: Implements the perf-gated regression CI features (comparison utilities, compare-perf script, dual-baseline workflow, and E2E instrumentation) described in the issue.

Possibly related PRs

  • Astro-Han/pawwork#607: Builds directly on this PR's E2E instrumentation changes to the same perf-probe.spec.ts and perf-probe-baseline.yml workflow.

Suggested labels

ci, app, enhancement, P2

Poem

🐰 A rabbit hops through metrics deep,
Comparing baselines that we keep—
From head to base, the thresholds gleam,
Regression gates fulfill the dream!
No slowdowns sneak past watchful eyes,
Performance truth will never lie.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4)

  • Title check: the title 'test(app): add PR0.2 perf comparator gate' clearly and specifically describes the main change, introducing a performance comparator gate for PR0.2 verification.
  • Description check: the description covers all major template sections: summary of changes, rationale (Why), related issue (#600), human review status (Pending), review focus, risk notes, verification steps with results, and checklist items.
  • Linked Issues check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: skipped because no linked issues were found for this pull request.


@github-actions github-actions Bot left a comment


Suggested priority: P2 (includes user-path files: packages/app/src/testing/perf-metrics.test.ts, packages/app/src/testing/perf-metrics.ts).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

@github-actions github-actions Bot added the ci (Continuous integration / GitHub Actions) and app (Application behavior and product flows) labels on May 13, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/app/src/testing/perf-metrics.test.ts (1)

166-209: ⚡ Quick win

Test title doesn't match test data.

The test title states "even when there is no regression delta", but the test data shows a 390ms regression (base: 120ms, head: 510ms), which exceeds the 50ms delta threshold. This means both the delta check (interaction_ms_worst_delta) and the catastrophic check (interaction_ms_worst) would trigger, though the test only asserts the catastrophic failure is present.

To better isolate the catastrophic check, use test data where base and head are equal but both exceed the catastrophic threshold (e.g., base: 510, head: 510).

📝 Suggested fix to align test data with title
       base: {
         branch: "base",
         scenario: "tool-call-expand",
         runs: 3,
         interaction_ms_median: 120,
-        interaction_ms_worst: 120,
+        interaction_ms_worst: 510,
         interaction_ms: 120,
         interaction_delay_ms: 12,
         long_task_count: 1,
         long_task_max_ms: 90,
         tbt_ms: 36,
         frame_gap_p95_ms: 32,
         frame_gap_max_ms: 90,
         jank_count_50ms: 1,
         cls: 0.01,
         window_ms: 900,
         run_details: [],
       },

This change ensures both base and head are at 510ms, demonstrating that the catastrophic threshold triggers even with zero delta.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/app/src/testing/perf-metrics.test.ts` around lines 166 - 209, The
test titled "fails a scenario on catastrophic absolute thresholds even when
there is no regression delta" uses differing interaction_ms_worst values (base
120, head 510) which creates a regression delta; update the test data in the
test case that calls comparePerfScenarioSummaries so base.interaction_ms_worst
and head.interaction_ms_worst are equal and both above the catastrophic
threshold (e.g., set both to 510) to ensure only the catastrophic absolute
threshold triggers while keeping the assertions (expect(result.pass).toBe(false)
and expect(result.failures).toContain("interaction_ms_worst")) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/app/script/compare-perf.ts`:
- Around line 5-9: The readArg function treats the next argv token as a value
even when that token is another flag; update readArg to validate the returned
value by ensuring process.argv[index + 1] exists and is not a flag token (e.g.,
startsWith('-') or '--') before returning it, and return undefined if the next
token is missing or looks like a flag; preserve the current signature and use
the same process.argv lookup in readArg to perform this check.

---

Nitpick comments:
In `@packages/app/src/testing/perf-metrics.test.ts`:
- Around line 166-209: The test titled "fails a scenario on catastrophic
absolute thresholds even when there is no regression delta" uses differing
interaction_ms_worst values (base 120, head 510) which creates a regression
delta; update the test data in the test case that calls
comparePerfScenarioSummaries so base.interaction_ms_worst and
head.interaction_ms_worst are equal and both above the catastrophic threshold
(e.g., set both to 510) to ensure only the catastrophic absolute threshold
triggers while keeping the assertions (expect(result.pass).toBe(false) and
expect(result.failures).toContain("interaction_ms_worst")) unchanged.
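
One way to implement the readArg guard the inline comment above describes. The original function body is not shown in this thread, so the sketch keeps the stated signature and argv lookup and adds only the flag check.

```ts
// Returns the value following a flag such as "--base", or undefined when the
// flag is absent, has no following token, or is followed by another flag.
function readArg(name: string): string | undefined {
  const index = process.argv.indexOf(name);
  if (index === -1) return undefined;
  const value = process.argv[index + 1];
  // A missing token or something that looks like a flag ("-x" / "--x")
  // means the caller passed no value.
  if (value === undefined || value.startsWith("-")) return undefined;
  return value;
}
```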

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 24ba7a0f-fe5a-4249-9dd2-edf3abb93932

📥 Commits

Reviewing files that changed from the base of the PR and between 0c012e5 and 183108e.

📒 Files selected for processing (5)
  • .github/workflows/perf-probe-baseline.yml
  • packages/app/e2e/perf/perf-probe.spec.ts
  • packages/app/script/compare-perf.ts
  • packages/app/src/testing/perf-metrics.test.ts
  • packages/app/src/testing/perf-metrics.ts

Comment thread packages/app/script/compare-perf.ts

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces performance comparison utilities, including a new script for comparing performance baselines and updated test logic to validate performance metrics, and adds a cooldown mechanism to the end-to-end performance tests to improve stability. The review suggested reducing the baseline comparison loop from O(N*M) to O(N) by using a Set, plus defensive checks for safer data handling; a sketch of that alignment follows.
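
For concreteness, the O(N) alignment could look like the sketch below. Only the scenario-name keying comes from the walkthrough; the ScenarioSummary shape and the helper name alignBaselines are assumptions. A Map is used rather than a bare Set so the matched head summary stays available for the metric comparison.

```ts
interface ScenarioSummary {
  scenario: string;
  [metric: string]: string | number;
}

// Index head summaries by scenario name once, so each base lookup is O(1)
// and the whole alignment pass is O(N + M) instead of O(N * M).
function alignBaselines(base: ScenarioSummary[], head: ScenarioSummary[]) {
  const headByName = new Map<string, ScenarioSummary>(
    head.map((summary) => [summary.scenario, summary]),
  );
  const missing: string[] = [];
  const pairs: Array<{ base: ScenarioSummary; head: ScenarioSummary }> = [];

  for (const baseSummary of base) {
    const headSummary = headByName.get(baseSummary.scenario);
    if (!headSummary) {
      // A scenario present in base but absent in head is reported as missing
      // (a hard failure under the PR0.2 rules).
      missing.push(baseSummary.scenario);
      continue;
    }
    pairs.push({ base: baseSummary, head: headSummary });
  }
  return { pairs, missing };
}
```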

Comment thread packages/app/src/testing/perf-metrics.ts
@Astro-Han Astro-Han merged commit d3e4e1b into dev May 13, 2026
24 checks passed
@Astro-Han Astro-Han deleted the codex/pr0-perf-gate branch May 13, 2026 14:30

Labels

app (Application behavior and product flows), ci (Continuous integration / GitHub Actions)
