
test(app): add PR0 perf baseline probe #607

Merged

Astro-Han merged 3 commits into dev from codex/pr0-perf-probe on May 13, 2026

Conversation

@Astro-Han Astro-Han (Owner) commented May 13, 2026

Summary

Add the PR0.1 perf-baseline slice for #600: a Playwright perf probe, four baseline scenarios, JSON baseline output, and a non-blocking artifact workflow.

Why

PR #589 regressed interaction performance, and both review and manual checks missed it. PR0.1 installs the measurement layer first so PR0.2 can add the comparator and gate on top of a real baseline instead of guesses.

The probe is intentionally separate from debug-bar.tsx and renderer-diagnostics.ts. Those paths serve runtime and developer-facing diagnostics inside the product, while PR0.1 needs a test-owned sampler that runs from page.addInitScript, stays outside the production dependency graph, and can evolve without leaking E2E-only concerns into the shipped app. At this stage reliability of the measurement path matters more than DRY. If the schema stabilizes and we prove the signal is useful, a later follow-up can extract shared probe primitives deliberately.
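As a rough sketch of that separation (names here are illustrative, not the actual probe.ts API), the test-owned sampler boils down to a closed-over sample buffer with reset/snapshot semantics that the E2E harness can install before any app code runs:

```typescript
// Sketch only: a test-owned sampler that lives outside the production
// dependency graph. Names (createProbe, PerfSample) are illustrative,
// not the real probe.ts API.
type PerfSample = { name: string; duration: number };

function createProbe() {
  const samples: PerfSample[] = [];
  return {
    // In the real probe, PerformanceObserver and rAF callbacks feed this.
    record(sample: PerfSample): void {
      samples.push(sample);
    },
    // Marks a scenario boundary without reinstalling the probe.
    reset(): void {
      samples.length = 0;
    },
    // Returns a copy so later mutation cannot corrupt a taken snapshot.
    snapshot(): PerfSample[] {
      return samples.slice();
    },
  };
}
```

In the real setup an object like this is attached to the page via page.addInitScript, which re-runs on every navigation, so the probe survives reloads without touching the shipped bundle.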

Related Issue

Closes #600

Human Review Status

Pending. A human should make the final merge decision after reviewing the final diff and verification evidence.

Review Focus

  1. Whether the probe metrics and JSON shape match "[Task] UI rewrite v2 PR0: perf-gated regression CI" (#600) closely enough for PR0.2 to consume.
  2. Whether the four scenarios are the right minimal baseline surfaces for home, streaming, tool expand, and scroll.
  3. Whether the workflow stays non-blocking and artifact-only in PR0.1.

Risk Notes

Low. This adds test-only helpers, a new perf E2E suite, and a non-required artifact workflow. The only runtime-facing change is ignoring packages/app/e2e/perf-results so generated baseline JSON does not dirty the repo.

How To Verify

bun test --preload ./happydom.ts ./src/testing/perf-metrics.test.ts
2 passed

bun --cwd packages/app test:e2e:local:perf
4 passed; emitted packages/app/e2e/perf-results/pr0.1-baseline.json with 4 scenario summaries x 3 runs each

bun --cwd packages/app typecheck
ok

Screenshots or Recordings

None. This PR adds perf instrumentation, E2E coverage, and CI artifact plumbing only.

Checklist

  • Human review status is stated above as pending, approved, or not required
  • I linked the related issue, or stated why there is no issue
  • This PR has type, primary area, and priority labels, or I requested maintainer labeling
  • I described the review focus and any meaningful risks
  • I listed the relevant verification steps and the key result for each
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope
  • I manually checked visible UI or copy changes when needed, with screenshots or recordings
  • I considered macOS and Windows impact for platform, packaging, updater, signing, paths, shell, or permissions changes
  • I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant
  • I reviewed the final diff for unrelated changes and suspicious dependency changes
  • I am targeting dev, and my PR title and commit messages use Conventional Commits in English

PR0.2 Notes

  • interaction_ms and interaction_ms_median are slightly redundant in PR0.1. PR0.2 should consume *_median and *_worst for comparison, so no schema rename is required in this PR.
  • branch: "dev" is intentionally hardcoded in the PR0.1 baseline artifact. PR0.2 will switch that value to an environment-driven label such as PAWWORK_PERF_BRANCH=base|head when the comparator lands.
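To make the *_median / *_worst consumption concrete, here is a minimal sketch of how a PR0.2 comparator might aggregate the three runs per scenario (function and field names are assumptions, not the shipped schema):

```typescript
// Hypothetical aggregation over per-scenario runs; a PR0.2 comparator would
// read the *_median and *_worst fields rather than the raw per-run values.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function aggregateRuns(runsMs: number[]): { median_ms: number; worst_ms: number } {
  return { median_ms: median(runsMs), worst_ms: Math.max(...runsMs) };
}
```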

Summary by CodeRabbit

  • Tests

    • Added end-to-end performance baseline testing infrastructure to measure and track application performance across key user scenarios.
  • Chores

    • Added automated GitHub Actions workflow for performance testing in pull requests.
    • Added performance measurement and metrics aggregation utilities.

Review Change Stack

@github-actions github-actions Bot added ci Continuous integration / GitHub Actions app Application behavior and product flows labels May 13, 2026
@coderabbitai coderabbitai Bot (Contributor) commented May 13, 2026

📝 Walkthrough

Adds a Playwright-based perf probe (browser instrumentation, summarization utilities, unit tests), four serial E2E baseline scenarios that collect 3 runs each and write aggregated JSON, a CI workflow to run and upload perf artifacts, and SSE multi-stage streaming support for test fixtures.

Changes

Perf probe measurement pipeline

  • Perf metrics model and summarization logic (packages/app/src/testing/perf-metrics.ts, packages/app/src/testing/perf-metrics.test.ts): Defines perf sample and summary types, helpers for median/percentiles, implements summarizePerfRun and aggregatePerfRuns, and adds unit tests validating per-run and aggregated metrics.
  • Browser performance probe instrumentation (packages/app/e2e/perf/probe.ts): Installs in-page PerformanceObserver handlers and RAF frame tracking, exposes window.__pawwork_perf_probe with reset()/snapshot() (including optional heap usage), and provides test helpers installPerfProbe, resetPerfProbe, snapshotPerfProbe, and summarizeScenarioRuns.
  • E2E baseline test scenarios and harness (packages/app/e2e/perf/perf-probe.spec.ts): Adds four serial baseline scenarios (homepage-cold, session-streaming-long, tool-call-expand, session-scroll-reading) that run 3 traces each, use probe helpers and frame-settling utilities, snapshot and summarize runs, and write aggregated JSON to PAWWORK_PERF_OUTPUT or e2e/perf-results/pr0.1-baseline.json.
  • CI workflow and package configuration (.github/workflows/perf-probe-baseline.yml, packages/app/package.json, packages/app/.gitignore): New GitHub Actions workflow triggers on path-filtered PRs and manual dispatch, sets up Node.js and Bun, caches deps and Playwright, runs bun --cwd packages/app test:e2e:local:perf with CI=true, and uploads perf results, Playwright report, and test results (7-day retention). Adds test:e2e:local:perf and test:e2e:perf scripts and ignores e2e/perf-results in git.

SSE multi-stage streaming

  • SSE stages typing and fixture passthrough (packages/opencode/test/lib/llm-server.ts): Adds stages?: Array<{ wait?: PromiseLike<unknown>; chunks: unknown[] }> to the Sse shape and passes input.stages through in the raw(...) fixture builder.
  • flowParts helper and responses assembly (packages/opencode/test/lib/llm-server.ts): Adds flowParts(parts) to derive Flow events and rewrites responses(item, model) to produce head, per-stage staged chunks (wired to stage.wait), and tail while preserving hang/error/call completion semantics.
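Assuming the stages shape described above, the staged emission can be sketched as an async generator that gates each stage on its optional wait promise (emitStages is a hypothetical helper, not the llm-server.ts implementation):

```typescript
// Illustrative only: each stage may hold its chunks behind a gate promise,
// which lets a test release streaming output stage by stage.
type SseStage = { wait?: PromiseLike<unknown>; chunks: unknown[] };

async function* emitStages(stages: SseStage[]): AsyncGenerator<unknown> {
  for (const stage of stages) {
    if (stage.wait) await stage.wait; // block until the test releases this stage
    yield* stage.chunks;
  }
}
```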
sequenceDiagram
  participant Runner as Playwright Runner
  participant Page as Browser Page
  participant Probe as __pawwork_perf_probe
  participant Storage as Artifact Upload
  Runner->>Page: installPerfProbe
  Runner->>Page: perform scenario actions (navigate/stream/expand/scroll)
  Runner->>Page: settleFrames
  Runner->>Page: snapshotPerfProbe
  Page-->>Runner: PerfRunSummary
  Runner->>Runner: aggregatePerfRuns (3 runs per scenario)
  Runner->>Storage: upload perf JSON + report + test results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

enhancement, P2

"I'm a rabbit with a stopwatch, I hop and I note,
Frames and shifts and TBT I quote,
Four scenarios ran, three runs apiece,
Baselines captured — may regressions cease! 🐇📈"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The PR title 'test(app): add PR0 perf baseline probe' accurately summarizes the main change: adding a performance baseline probe for the test suite in the app package.
  • Linked Issues check: ✅ Passed. The PR fully implements PR0.1 scope from #600: Playwright perf probe with 4 scenarios, JSON output schema, artifact workflow, and metrics (median/worst values) as required.
  • Out of Scope Changes check: ✅ Passed. All changes are in-scope: perf probe, E2E scenarios, metrics helpers, workflow, .gitignore update, and test infrastructure. The LLM server change supports multi-stage streaming for test scenarios.
  • Description check: ✅ Passed. The pull request description is comprehensive and well-structured, covering all required template sections including summary, rationale, related issue, review focus, risks, verification steps, and a completed checklist.



Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@github-actions github-actions Bot left a comment


Suggested priority: P2 (includes user-path files packages/app/src/testing/perf-metrics.test.ts and packages/app/src/testing/perf-metrics.ts).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

@coderabbitai coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/app/e2e/perf/probe.ts`:
- Line 18: The reset(label?: string) parameter is declared but never used;
remove the unused label parameter from the type/signature and helper so it
becomes reset(): void (update the declaration where reset: (label?: string) =>
void is defined and the resetPerfProbe function that currently accepts/passes
label), and then update all call sites (e.g., in perf-probe.spec.ts) to stop
passing a label argument. Ensure only the symbol names reset and resetPerfProbe
are changed to the no-arg signature so callers compile cleanly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 733706bc-d0cc-4981-9b2c-982e3cb316ee

📥 Commits

Reviewing files that changed from the base of the PR and between 935396d and f1d26b9.

📒 Files selected for processing (7)
  • .github/workflows/perf-probe-baseline.yml
  • packages/app/.gitignore
  • packages/app/e2e/perf/perf-probe.spec.ts
  • packages/app/e2e/perf/probe.ts
  • packages/app/package.json
  • packages/app/src/testing/perf-metrics.test.ts
  • packages/app/src/testing/perf-metrics.ts

Comment thread packages/app/e2e/perf/probe.ts (Outdated)

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a performance testing framework for the application, including a new performance probe utility, test scenarios for baseline generation, and metrics aggregation logic. I have reviewed the implementation and provided feedback regarding the scope of the tool-call-expand test, the unused label parameter in the performance probe reset function, and the redundancy in the performance metrics aggregation logic.

Comment thread packages/app/e2e/perf/perf-probe.spec.ts
Comment thread packages/app/e2e/perf/probe.ts
Comment thread packages/app/src/testing/perf-metrics.ts
@github-actions github-actions Bot added the harness Model harness, prompts, tool descriptions, and session mechanics label May 13, 2026
@Astro-Han Astro-Han (Owner, Author) commented:

Follow-up on the review bot notes:

  • Removed the unused label parameter from reset / resetPerfProbe, and updated the perf probe call sites.
  • The perf-probe-baseline failure was a workflow setup issue, not a scenario or threshold issue. The job cached ${{ github.workspace }}/.playwright-browsers but did not export PLAYWRIGHT_BROWSERS_PATH, so Playwright could not find chromium_headless_shell. This PR now aligns the workflow env with the existing e2e-artifacts job.
  • The perf metrics aggregation redundancy (interaction_ms vs interaction_ms_median) remains intentionally deferred to PR0.2. The comparator will consume *_median / *_worst, and the current schema stays stable for the baseline artifact.
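For reference, the cache/env alignment described in the workflow bullet above can be sketched roughly as follows (paths, keys, and lockfile names are illustrative, not the exact workflow contents):

```yaml
# Sketch: the cache path and PLAYWRIGHT_BROWSERS_PATH must agree, otherwise
# Playwright resolves browsers from its default location and misses the cache.
env:
  PLAYWRIGHT_BROWSERS_PATH: ${{ github.workspace }}/.playwright-browsers
steps:
  - uses: actions/cache@v4
    with:
      path: ${{ github.workspace }}/.playwright-browsers
      key: playwright-${{ runner.os }}-${{ hashFiles('**/bun.lock') }}
```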

@Astro-Han Astro-Han added the P2 Medium priority label May 13, 2026
@Astro-Han Astro-Han merged commit 0c012e5 into dev May 13, 2026
26 checks passed
@Astro-Han Astro-Han deleted the codex/pr0-perf-probe branch May 13, 2026 13:21

Labels

  • app: Application behavior and product flows
  • ci: Continuous integration / GitHub Actions
  • harness: Model harness, prompts, tool descriptions, and session mechanics
  • P2: Medium priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Task] UI rewrite v2 PR0: perf-gated regression CI

1 participant