Add expediency correctness experiment pack #3

Merged
benedictbrady merged 6 commits into main from codex/expediency-correctness-experiment
May 1, 2026

@benedictbrady commented Apr 30, 2026

Summary

  • organize public experiment packs under experiments/<name>/{harness,data,results} with shared repo-level guidance
  • move the original 100-scenario C-vs-D corpus into experiments/c_vs_d/data and move its Opus 4.7 public artifacts into experiments/c_vs_d/results/opus-4.7
  • keep src/philosophy_bench/ as the shared benchmark package/harness, with bundled C-vs-D data retained as the wheel-install compatibility mirror
  • add the 100-scenario expediency-vs-correctness pack and keep stability pairs in the same experiment layout
  • include only Opus 4.7 public result artifacts for checked-in experiment outputs
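
As a rough illustration of the experiments/<name>/{harness,data,results} layout described above, the sketch below builds the three expected directories for a pack and reports which ones are missing. The helper names `pack_dirs` and `missing_pack_dirs` are hypothetical and are not part of the philosophy_bench package; only the directory layout comes from this PR.

```python
from pathlib import Path

# Subdirectories every experiment pack is expected to have, per the summary above.
PACK_SUBDIRS = ("harness", "data", "results")


def pack_dirs(repo_root, name):
    """Map each expected subdirectory name to its path under experiments/<name>/."""
    base = Path(repo_root) / "experiments" / name
    return {sub: base / sub for sub in PACK_SUBDIRS}


def missing_pack_dirs(repo_root, name):
    """Return the expected subdirectories that do not exist on disk, in layout order."""
    return [sub for sub, path in pack_dirs(repo_root, name).items() if not path.is_dir()]
```

For example, `missing_pack_dirs(".", "stability_pairs")` would return an empty list in a checkout where that pack has all three directories.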

Results included

  • experiments/c_vs_d/results/opus-4.7: original baseline, C-direct, and D-direct judged artifacts
  • experiments/expediency_vs_correctness/results/opus-4.7: 100/100 judged, axis_mean -0.489, botch_rate 0.06
  • experiments/stability_pairs/results/opus-4.7: 200/200 judged baseline, axis_mean 0.01, botch_rate 0.005
  • experiments/stability_pairs/results/stability_summary.txt and stability_table.json summarize the Opus mirror-pair stability results

Backward compatibility

  • source checkouts now default to experiments/c_vs_d/data for the original C-vs-D corpus
  • installed wheels fall back to the bundled mirror under src/philosophy_bench/data
  • a regression test enforces that the repo-level C-vs-D data and package-data mirror stay in sync
  • local/generated top-level results/ remain ignored; public artifacts live under each experiment directory
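
The fallback and sync behavior above could look roughly like the following. This is a minimal sketch, not the code in this PR: `resolve_data_root` and `trees_in_sync` are hypothetical names, and the actual lookup in src/philosophy_bench/ and the actual regression test may work differently. Only the two directory locations are taken from the description.

```python
import filecmp
import os
from pathlib import Path


def resolve_data_root(repo_root, package_dir):
    """Prefer the repo-level C-vs-D corpus; fall back to the bundled mirror.

    Hypothetical helper: repo_root is a source checkout (or any directory),
    package_dir is the installed philosophy_bench package directory.
    """
    repo_data = Path(repo_root) / "experiments" / "c_vs_d" / "data"
    if repo_data.is_dir():
        return repo_data
    # Installed wheels have no experiments/ tree, so use the packaged copy.
    return Path(package_dir) / "data"


def trees_in_sync(left, right):
    """Return True if two directory trees have the same files with the same bytes."""
    left, right = str(left), str(right)
    cmp = filecmp.dircmp(left, right)
    if cmp.left_only or cmp.right_only or cmp.common_funny:
        return False
    # shallow=False forces a byte-for-byte comparison rather than a stat check.
    _, mismatch, errors = filecmp.cmpfiles(left, right, cmp.common_files, shallow=False)
    if mismatch or errors:
        return False
    return all(
        trees_in_sync(os.path.join(left, d), os.path.join(right, d))
        for d in cmp.common_dirs
    )
```

A regression test in this style would assert `trees_in_sync("experiments/c_vs_d/data", "src/philosophy_bench/data")` from the repository root.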

Testing

  • uv run philosophy-bench scenarios
  • uv run philosophy-bench scenarios --root experiments/c_vs_d/data/scenarios
  • uv run philosophy-bench scenarios --root experiments/expediency_vs_correctness/data/scenarios
  • uv run philosophy-bench scenarios --root experiments/stability_pairs/data/scenarios
  • uv run python experiments/stability_pairs/harness/audit_pairs.py --quiet
  • uv run ruff check src tests && uv run ruff format --check src tests
  • uv run --extra dev pytest -q
  • uv build plus wheel install smoke test for tests/test_data_packaging.py

@benedictbrady force-pushed the codex/expediency-correctness-experiment branch from b3c3bee to 38f7a5d on May 1, 2026 00:22
@benedictbrady merged commit 8918787 into main on May 1, 2026
9 checks passed