test(app): add PR0.2 perf comparator gate #608
Conversation
📝 Walkthrough

This PR introduces a performance baseline comparison system: new comparison utilities with configurable thresholds, a CLI script that orchestrates baseline comparison, E2E test instrumentation for branch tracking and cooldown, and a restructured CI workflow that captures dual baselines (base and head) and invokes the comparison to detect regressions.

Changes

Performance baseline comparison
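The comparator itself isn't shown on this page, but the mechanics discussed in the review below can be sketched. This is a minimal illustration, assuming per-metric thresholds like the 50 ms delta and a ~500 ms absolute ceiling the review mentions; `checkMetric` and the `Thresholds` shape are hypothetical, while `interaction_ms_worst` and the `*_delta` failure naming come from the tests:

```ts
// Hypothetical sketch of the two-tier check the review describes:
// a relative (delta) threshold against the base branch, plus an
// absolute "catastrophic" ceiling that fires even with zero delta.
interface Thresholds {
  delta_ms: number; // max allowed regression: head - base
  catastrophic_ms: number; // absolute ceiling, independent of base
}

function checkMetric(
  name: string,
  base: number,
  head: number,
  t: Thresholds,
  failures: string[],
): void {
  if (head - base > t.delta_ms) failures.push(`${name}_delta`);
  if (head > t.catastrophic_ms) failures.push(name);
}

// With base 120 ms and head 510 ms (the values from the test discussed
// below), both checks fire:
// ["interaction_ms_worst_delta", "interaction_ms_worst"].
const failures: string[] = [];
checkMetric("interaction_ms_worst", 120, 510, { delta_ms: 50, catastrophic_ms: 500 }, failures);
```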
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Suggested priority: P2 (includes user-path files: packages/app/src/testing/perf-metrics.test.ts, packages/app/src/testing/perf-metrics.ts).
P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.
Actionable comments posted: 1
🧹 Nitpick comments (1)
packages/app/src/testing/perf-metrics.test.ts (1)
166-209: ⚡ Quick win: Test title doesn't match test data.
The test title states "even when there is no regression delta", but the test data shows a 390ms regression (base: 120ms, head: 510ms), which exceeds the 50ms delta threshold. This means both the delta check (`interaction_ms_worst_delta`) and the catastrophic check (`interaction_ms_worst`) would trigger, though the test only asserts that the catastrophic failure is present. To better isolate the catastrophic check, use test data where base and head are equal but both exceed the catastrophic threshold (e.g., base: 510, head: 510).
📝 Suggested fix to align test data with title
```diff
  base: {
    branch: "base",
    scenario: "tool-call-expand",
    runs: 3,
    interaction_ms_median: 120,
-   interaction_ms_worst: 120,
+   interaction_ms_worst: 510,
    interaction_ms: 120,
    interaction_delay_ms: 12,
    long_task_count: 1,
    long_task_max_ms: 90,
    tbt_ms: 36,
    frame_gap_p95_ms: 32,
    frame_gap_max_ms: 90,
    jank_count_50ms: 1,
    cls: 0.01,
    window_ms: 900,
    run_details: [],
  },
```

This change ensures both base and head are at 510ms, demonstrating that the catastrophic threshold triggers even with zero delta.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/app/src/testing/perf-metrics.test.ts` around lines 166 - 209, The test titled "fails a scenario on catastrophic absolute thresholds even when there is no regression delta" uses differing interaction_ms_worst values (base 120, head 510) which creates a regression delta; update the test data in the test case that calls comparePerfScenarioSummaries so base.interaction_ms_worst and head.interaction_ms_worst are equal and both above the catastrophic threshold (e.g., set both to 510) to ensure only the catastrophic absolute threshold triggers while keeping the assertions (expect(result.pass).toBe(false) and expect(result.failures).toContain("interaction_ms_worst")) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/app/script/compare-perf.ts`:
- Around line 5-9: The readArg function treats the next argv token as a value
even when that token is another flag; update readArg to validate the returned
value by ensuring process.argv[index + 1] exists and is not a flag token (e.g.,
startsWith('-') or '--') before returning it, and return undefined if the next
token is missing or looks like a flag; preserve the current signature and use
the same process.argv lookup in readArg to perform this check.
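A minimal sketch of the hardened helper, assuming readArg scans process.argv for a `--name` flag; the lookup shape is an assumption, and only compare-perf.ts, readArg, and the "next token must not be a flag" rule come from the comment above:

```ts
// Hypothetical hardened readArg for compare-perf.ts: returns the value
// following `--name`, or undefined when the next token is missing or
// looks like another flag.
function readArg(name: string): string | undefined {
  const index = process.argv.indexOf(`--${name}`);
  if (index === -1) return undefined;
  const next = process.argv[index + 1];
  // Reject a missing value or a token that starts another flag.
  if (next === undefined || next.startsWith("-")) return undefined;
  return next;
}
```

With this check, an invocation like `compare-perf --base --head out.json` yields undefined for `base` instead of swallowing `--head` as its value.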
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 24ba7a0f-fe5a-4249-9dd2-edf3abb93932
📒 Files selected for processing (5)
- .github/workflows/perf-probe-baseline.yml
- packages/app/e2e/perf/perf-probe.spec.ts
- packages/app/script/compare-perf.ts
- packages/app/src/testing/perf-metrics.test.ts
- packages/app/src/testing/perf-metrics.ts
Code Review
This pull request introduces performance comparison utilities, including a new script for comparing performance baselines and updated test logic to validate performance metrics. It also adds a cooldown mechanism to end-to-end performance tests to improve stability. The reviewer suggested reducing the baseline comparison loop from O(N*M) to O(N) by using a Set, and adding defensive checks for safer data handling, as sketched below.
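A minimal sketch of that suggestion, assuming summaries are matched by scenario name; the `PerfScenarioSummary` shape here is illustrative, and a Map is used in place of the suggested Set so the matched base entry is retained (the O(1) lookup idea is the same):

```ts
// Illustrative shape; the real summary type lives in perf-metrics.ts.
interface PerfScenarioSummary {
  scenario: string;
  interaction_ms_worst: number;
}

// Linear pairing instead of a nested O(N*M) scan: index base summaries
// once, then look each head summary up in constant time.
function pairScenarios(
  base: PerfScenarioSummary[],
  head: PerfScenarioSummary[],
): Array<{ base: PerfScenarioSummary; head: PerfScenarioSummary }> {
  const baseByScenario = new Map(base.map((s) => [s.scenario, s] as const));
  const pairs: Array<{ base: PerfScenarioSummary; head: PerfScenarioSummary }> = [];
  for (const h of head) {
    const b = baseByScenario.get(h.scenario);
    if (!b) continue; // defensive: skip head scenarios with no baseline
    pairs.push({ base: b, head: h });
  }
  return pairs;
}
```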
Summary
Why
PR0.1 only produced baseline artifacts. PR0.2 needs the first real regression gate before any UI rewrite PR can ship, but it stays narrow: no new scenarios, no trace capture yet, and no probe semantic rewrite.
Related Issue
#600
Human Review Status
Pending. A human should make the final merge decision after reviewing the final diff and verification evidence.
Review Focus
Risk Notes
Low to medium. This changes perf CI behavior and can fail the non-required workflow when the comparator sees a regression, but it does not touch product runtime code or add new user-facing surfaces. Trace-on-failure stays deferred to PR0.3 by design.
How To Verify
Screenshots or Recordings
Not applicable. No visible UI changes.
Checklist
dev, and my PR title and commit messages use Conventional Commits in English

Summary by CodeRabbit
New Features
Tests