
test(app): add PR0 perf baseline probe #607

Merged

Astro-Han merged 3 commits into dev from codex/pr0-perf-probe on May 13, 2026

Conversation

@Astro-Han Astro-Han (Owner) commented May 13, 2026

Summary

Add the PR0.1 perf-baseline slice for #600: a Playwright perf probe, four baseline scenarios, JSON baseline output, and a non-blocking artifact workflow.

Why

PR #589 regressed interaction performance, and both review and manual checks missed it. PR0.1 installs the measurement layer first so PR0.2 can add the comparator and gate on top of a real baseline instead of guesses.

The probe is intentionally separate from debug-bar.tsx and renderer-diagnostics.ts. Those paths serve runtime and developer-facing diagnostics inside the product, while PR0.1 needs a test-owned sampler that runs from page.addInitScript, stays outside the production dependency graph, and can evolve without leaking E2E-only concerns into the shipped app. At this stage reliability of the measurement path matters more than DRY. If the schema stabilizes and we prove the signal is useful, a later follow-up can extract shared probe primitives deliberately.
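As a rough sketch of that separation (names here are illustrative, not the actual probe.ts API), the test-owned sampler boils down to a closed-over sample buffer with reset/snapshot semantics that the E2E harness can install before any app code runs:

```typescript
// Sketch only: a test-owned sampler that lives outside the production
// dependency graph. Names (createProbe, PerfSample) are illustrative,
// not the real probe.ts API.
type PerfSample = { name: string; duration: number };

function createProbe() {
  const samples: PerfSample[] = [];
  return {
    // In the real probe, PerformanceObserver and rAF callbacks feed this.
    record(sample: PerfSample): void {
      samples.push(sample);
    },
    // Marks a scenario boundary without reinstalling the probe.
    reset(): void {
      samples.length = 0;
    },
    // Returns a copy so later mutation cannot corrupt a taken snapshot.
    snapshot(): PerfSample[] {
      return samples.slice();
    },
  };
}
```

In the real setup an object like this is attached to the page via page.addInitScript, which re-runs on every navigation, so the probe survives reloads without touching the shipped bundle.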

Related Issue

Closes #600

Human Review Status

Pending. A human should make the final merge decision after reviewing the final diff and verification evidence.

Review Focus

  1. Whether the probe metrics and JSON shape match "[Task] UI rewrite v2 PR0: perf-gated regression CI" (#600) closely enough for PR0.2 to consume.
  2. Whether the four scenarios are the right minimal baseline surfaces for home, streaming, tool expand, and scroll.
  3. Whether the workflow stays non-blocking and artifact-only in PR0.1.

Risk Notes

Low. This adds test-only helpers, a new perf E2E suite, and a non-required artifact workflow. The only runtime-facing change is ignoring packages/app/e2e/perf-results so generated baseline JSON does not dirty the repo.

How To Verify

bun test --preload ./happydom.ts ./src/testing/perf-metrics.test.ts
2 passed

bun --cwd packages/app test:e2e:local:perf
4 passed; emitted packages/app/e2e/perf-results/pr0.1-baseline.json with 4 scenario summaries x 3 runs each

bun --cwd packages/app typecheck
ok

Screenshots or Recordings

None. This PR adds perf instrumentation, E2E coverage, and CI artifact plumbing only.

Checklist

  • Human review status is stated above as pending, approved, or not required
  • I linked the related issue, or stated why there is no issue
  • This PR has type, primary area, and priority labels, or I requested maintainer labeling
  • I described the review focus and any meaningful risks
  • I listed the relevant verification steps and the key result for each
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope
  • I manually checked visible UI or copy changes when needed, with screenshots or recordings
  • I considered macOS and Windows impact for platform, packaging, updater, signing, paths, shell, or permissions changes
  • I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant
  • I reviewed the final diff for unrelated changes and suspicious dependency changes
  • I am targeting dev, and my PR title and commit messages use Conventional Commits in English

PR0.2 Notes

  • interaction_ms and interaction_ms_median are slightly redundant in PR0.1. PR0.2 should consume *_median and *_worst for comparison, so no schema rename is required in this PR.
  • branch: "dev" is intentionally hardcoded in the PR0.1 baseline artifact. PR0.2 will switch that value to an environment-driven label such as PAWWORK_PERF_BRANCH=base|head when the comparator lands.
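To make the *_median / *_worst consumption concrete, here is a minimal sketch of how a PR0.2 comparator might aggregate the three runs per scenario (function and field names are assumptions, not the shipped schema):

```typescript
// Hypothetical aggregation over per-scenario runs; a PR0.2 comparator would
// read the *_median and *_worst fields rather than the raw per-run values.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function aggregateRuns(runsMs: number[]): { median_ms: number; worst_ms: number } {
  return { median_ms: median(runsMs), worst_ms: Math.max(...runsMs) };
}
```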

Summary by CodeRabbit

  • Tests

    • Added end-to-end performance baseline testing infrastructure to measure and track application performance across key user scenarios.
  • Chores

    • Added automated GitHub Actions workflow for performance testing in pull requests.
    • Added performance measurement and metrics aggregation utilities.

Review Change Stack

@github-actions github-actions Bot added ci Continuous integration / GitHub Actions app Application behavior and product flows labels May 13, 2026
@coderabbitai coderabbitai Bot (Contributor) commented May 13, 2026

📝 Walkthrough

Adds a Playwright-based perf probe (browser instrumentation, summarization utilities, unit tests), four serial E2E baseline scenarios that collect 3 runs each and write aggregated JSON, a CI workflow to run and upload perf artifacts, and SSE multi-stage streaming support for test fixtures.

Changes

Perf probe measurement pipeline

  • Perf metrics model and summarization logic (packages/app/src/testing/perf-metrics.ts, packages/app/src/testing/perf-metrics.test.ts): Defines perf sample and summary types, helpers for median/percentiles, implements summarizePerfRun and aggregatePerfRuns, and adds unit tests validating per-run and aggregated metrics.
  • Browser performance probe instrumentation (packages/app/e2e/perf/probe.ts): Installs in-page PerformanceObserver handlers and RAF frame tracking, exposes window.__pawwork_perf_probe with reset()/snapshot() (including optional heap usage), and provides test helpers installPerfProbe, resetPerfProbe, snapshotPerfProbe, and summarizeScenarioRuns.
  • E2E baseline test scenarios and harness (packages/app/e2e/perf/perf-probe.spec.ts): Adds four serial baseline scenarios (homepage-cold, session-streaming-long, tool-call-expand, session-scroll-reading) that run 3 traces each, use probe helpers and frame-settling utilities, snapshot and summarize runs, and write aggregated JSON to PAWWORK_PERF_OUTPUT or e2e/perf-results/pr0.1-baseline.json.
  • CI workflow and package configuration (.github/workflows/perf-probe-baseline.yml, packages/app/package.json, packages/app/.gitignore): New GitHub Actions workflow triggers on path-filtered PRs and manual dispatch, sets up Node.js and Bun, caches deps and Playwright, runs bun --cwd packages/app test:e2e:local:perf with CI=true, and uploads perf results, Playwright report, and test results (7-day retention). Adds test:e2e:local:perf and test:e2e:perf scripts and ignores e2e/perf-results in git.

SSE multi-stage streaming

  • SSE stages typing and fixture passthrough (packages/opencode/test/lib/llm-server.ts): Adds stages?: Array<{ wait?: PromiseLike<unknown>; chunks: unknown[] }> to the Sse shape and passes input.stages through in the raw(...) fixture builder.
  • flowParts helper and responses assembly (packages/opencode/test/lib/llm-server.ts): Adds flowParts(parts) to derive Flow events and rewrites responses(item, model) to produce head, per-stage staged chunks (wired to stage.wait), and tail while preserving hang/error/call completion semantics.
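Assuming the stages shape described above, the staged emission can be sketched as an async generator that gates each stage on its optional wait promise (emitStages is a hypothetical helper, not the llm-server.ts implementation):

```typescript
// Illustrative only: each stage may hold its chunks behind a gate promise,
// which lets a test release streaming output stage by stage.
type SseStage = { wait?: PromiseLike<unknown>; chunks: unknown[] };

async function* emitStages(stages: SseStage[]): AsyncGenerator<unknown> {
  for (const stage of stages) {
    if (stage.wait) await stage.wait; // block until the test releases this stage
    yield* stage.chunks;
  }
}
```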
sequenceDiagram
  participant Runner as Playwright Runner
  participant Page as Browser Page
  participant Probe as __pawwork_perf_probe
  participant Storage as Artifact Upload
  Runner->>Page: installPerfProbe
  Runner->>Page: perform scenario actions (navigate/stream/expand/scroll)
  Runner->>Page: settleFrames
  Runner->>Page: snapshotPerfProbe
  Page-->>Runner: PerfRunSummary
  Runner->>Runner: aggregatePerfRuns (3 runs per scenario)
  Runner->>Storage: upload perf JSON + report + test results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

enhancement, P2

"I'm a rabbit with a stopwatch, I hop and I note,
Frames and shifts and TBT I quote,
Four scenarios ran, three runs apiece,
Baselines captured — may regressions cease! 🐇📈"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The PR title 'test(app): add PR0 perf baseline probe' accurately summarizes the main change: adding a performance baseline probe for the test suite in the app package.
  • Linked Issues check: ✅ Passed. The PR fully implements PR0.1 scope from #600: Playwright perf probe with 4 scenarios, JSON output schema, artifact workflow, and metrics (median/worst values) as required.
  • Out of Scope Changes check: ✅ Passed. All changes are in-scope: perf probe, E2E scenarios, metrics helpers, workflow, .gitignore update, and test infrastructure. The LLM server change supports multi-stage streaming for test scenarios.
  • Description check: ✅ Passed. The pull request description is comprehensive and well-structured, covering all required template sections including summary, rationale, related issue, review focus, risks, verification steps, and a completed checklist.



Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@github-actions github-actions Bot left a comment


Suggested priority: P2 (includes user-path files packages/app/src/testing/perf-metrics.test.ts and packages/app/src/testing/perf-metrics.ts).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

@coderabbitai coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/app/e2e/perf/probe.ts`:
- Line 18: The reset(label?: string) parameter is declared but never used;
remove the unused label parameter from the type/signature and helper so it
becomes reset(): void (update the declaration where reset: (label?: string) =>
void is defined and the resetPerfProbe function that currently accepts/passes
label), and then update all call sites (e.g., in perf-probe.spec.ts) to stop
passing a label argument. Ensure only the symbol names reset and resetPerfProbe
are changed to the no-arg signature so callers compile cleanly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 733706bc-d0cc-4981-9b2c-982e3cb316ee

📥 Commits

Reviewing files that changed from the base of the PR and between 935396d and f1d26b9.

📒 Files selected for processing (7)
  • .github/workflows/perf-probe-baseline.yml
  • packages/app/.gitignore
  • packages/app/e2e/perf/perf-probe.spec.ts
  • packages/app/e2e/perf/probe.ts
  • packages/app/package.json
  • packages/app/src/testing/perf-metrics.test.ts
  • packages/app/src/testing/perf-metrics.ts

Comment thread packages/app/e2e/perf/probe.ts (Outdated)

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a performance testing framework for the application, including a new performance probe utility, test scenarios for baseline generation, and metrics aggregation logic. I have reviewed the implementation and provided feedback regarding the scope of the tool-call-expand test, the unused label parameter in the performance probe reset function, and the redundancy in the performance metrics aggregation logic.

Comment thread packages/app/e2e/perf/perf-probe.spec.ts
Comment thread packages/app/e2e/perf/probe.ts
Comment thread packages/app/src/testing/perf-metrics.ts
@github-actions github-actions Bot added the harness Model harness, prompts, tool descriptions, and session mechanics label May 13, 2026
@Astro-Han Astro-Han (Owner, Author) commented:

Follow-up on the review bot notes:

  • Removed the unused label parameter from reset / resetPerfProbe, and updated the perf probe call sites.
  • The perf-probe-baseline failure was a workflow setup issue, not a scenario or threshold issue. The job cached ${{ github.workspace }}/.playwright-browsers but did not export PLAYWRIGHT_BROWSERS_PATH, so Playwright could not find chromium_headless_shell. This PR now aligns the workflow env with the existing e2e-artifacts job.
  • The perf metrics aggregation redundancy (interaction_ms vs interaction_ms_median) remains intentionally deferred to PR0.2. The comparator will consume *_median / *_worst, and the current schema stays stable for the baseline artifact.
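For reference, the cache/env alignment described in the workflow bullet above can be sketched roughly as follows (paths, keys, and lockfile names are illustrative, not the exact workflow contents):

```yaml
# Sketch: the cache path and PLAYWRIGHT_BROWSERS_PATH must agree, otherwise
# Playwright resolves browsers from its default location and misses the cache.
env:
  PLAYWRIGHT_BROWSERS_PATH: ${{ github.workspace }}/.playwright-browsers
steps:
  - uses: actions/cache@v4
    with:
      path: ${{ github.workspace }}/.playwright-browsers
      key: playwright-${{ runner.os }}-${{ hashFiles('**/bun.lock') }}
```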

@Astro-Han Astro-Han added the P2 Medium priority label May 13, 2026
@Astro-Han Astro-Han merged commit 0c012e5 into dev May 13, 2026
26 checks passed
@Astro-Han Astro-Han deleted the codex/pr0-perf-probe branch May 13, 2026 13:21

Labels

  • app: Application behavior and product flows
  • ci: Continuous integration / GitHub Actions
  • harness: Model harness, prompts, tool descriptions, and session mechanics
  • P2: Medium priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Task] UI rewrite v2 PR0: perf-gated regression CI

1 participant