tests(issue-fix-workflow): eval suites for steps 3, 4, and 5 #239
More evals!
I was still going through this when it got merged 😄, but flagging now because the squash dragged in scope that wasn't in the PR description. The branch was cut off confirmed by diffing the merge against its parent. Side effect: #227 is still open but its content is already on main and never got its own review. Happy to draft a follow-up that closes #227 and patches the spec nits I caught in the step-3/4/5 fixtures (verdict rubric ambiguity on step-3, missing edge case on step-5, schema overlap between step-4
Damn. Too fast: I tried to see if I could speed up the merging before the rename (and before @justinmclean goes to bed :) ...
8 new eval cases across 3 suites, filling the gaps in issue-fix-workflow coverage (steps 3, 4, and 5 had no fixtures).
Changes
- tools/skill-evals/evals/issue-fix-workflow/: three new suites
  - step-3-failing-test (3 cases), assesses a proposed regression test:
    - case-1-test-fails-as-expected: issue key present, adapts from reproducer, run confirms FAILED → accept
    - case-2-missing-issue-key: test omits the issue key reference → reject
    - case-3-test-passes-on-main: test passes before any fix (silent-broken-test trap) → surface-gap
  - step-4-production-change (3 cases), assesses a proposed production fix:
    - case-1-minimal-fix-proceeds: root cause fixed, diff clean, targeted test green → proceed
    - case-2-symptom-masks-root-cause: symptom guard makes the test pass but root cause unaddressed → iterate
    - case-3-drive-by-in-diff: correct fix but diff includes an unrelated whitespace change → iterate
  - step-5-module-test-run (2 cases), interprets module test run output:
    - case-1-clean-module-run: all tests pass → proceed
    - case-2-regression-introduced: fix breaks an adjacent round-trip test → iterate
- README.md updated: suite table and case count (12 → 20).
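For quick reference, the expected verdicts listed above can be summarised as a lookup table. This is only a rough sketch: the real fixture format under tools/skill-evals/ is not shown in this PR description, so the dictionary layout and the `count_cases` and `mismatches` helpers are hypothetical names introduced for illustration.

```python
# Hypothetical summary of the suites and expected verdicts from this PR.
# The actual fixture files may be structured quite differently.
EXPECTED_VERDICTS = {
    "step-3-failing-test": {
        "case-1-test-fails-as-expected": "accept",
        "case-2-missing-issue-key": "reject",
        "case-3-test-passes-on-main": "surface-gap",
    },
    "step-4-production-change": {
        "case-1-minimal-fix-proceeds": "proceed",
        "case-2-symptom-masks-root-cause": "iterate",
        "case-3-drive-by-in-diff": "iterate",
    },
    "step-5-module-test-run": {
        "case-1-clean-module-run": "proceed",
        "case-2-regression-introduced": "iterate",
    },
}


def count_cases(suites):
    """Return the total number of cases across all suites."""
    return sum(len(cases) for cases in suites.values())


def mismatches(expected, actual):
    """List (suite, case, expected, got) for every verdict that differs."""
    return [
        (suite, case, want, actual.get(suite, {}).get(case))
        for suite, cases in expected.items()
        for case, want in cases.items()
        if actual.get(suite, {}).get(case) != want
    ]


if __name__ == "__main__":
    # 8 new cases on top of the previous 12 gives the README total of 20.
    print(count_cases(EXPECTED_VERDICTS), 12 + count_cases(EXPECTED_VERDICTS))
```

A table like this makes it easy to sanity-check the README count (12 existing cases plus 8 new ones) and to compare a run's verdicts against the expected ones.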