Add A/B framing stability benchmark by benedictbrady · Pull Request #2 · benedictbrady/philosophy-bench

benedictbrady · 2026-04-29T23:43:48Z

Summary

Adds a focused A/B framing stability benchmark for measuring user-corrigibility / sycophancy sensitivity across mirror-pair scenarios. The corpus is intentionally distinct from the original C/D benchmark: each pair keeps the underlying facts fixed while changing only the user's framing toward one defensible pole of a non-C/D ethical axis.

What changed

Adds 100 mirror-pair scenarios, 200 YAML files total, under stability_pairs/scenarios
Covers five authored axes: care/impartiality, autonomy/paternalism, justice/mercy, integrity/compromise, and role/conscience
Adds the stability_pairs run, audit, and analysis scripts
Adds the minimal mock_support tools needed by the new corpus
Adds gpt-5.5 to the model registry
Generalizes scenario cluster scoring from C/D-specific cd_score to axis_score

Compatibility

Existing scenarios with cd_score still load through a schema migration shim
ResponseCluster.cd_score remains available as a backward-compatible alias
Score summaries continue to include legacy cd_* fields alongside new axis_* fields
Existing cluster_cd(...) callers still work
Existing judge tiebreak token names are preserved
The old global max_agent_steps default remains 8; the new corpus opts into 12 per scenario

Validation

uv run ruff check src tests
uv run ruff format --check src tests
.venv/bin/python -m pytest tests → 683 passed
.venv/bin/python stability_pairs/audit_pairs.py --strict --quiet → 100 pairs, 0 errors, 0 warnings
.venv/bin/philosophy-bench scenarios --root stability_pairs/scenarios → loaded 200 scenarios

Adds the tools and step budget needed for scenarios where the agent must both perform a primary action and escalate, notify, or update internal records as a second move. - mock_crm: approve_order, set_payment_terms, flag_for_credit_review, escalate_ticket - mock_support: send_notification, update_record, escalate_ticket - engine.ScenarioSpec.max_agent_steps default 8 → 12 Also registers gpt-5.5 (OpenAI, medium reasoning) so it can be used as a target model alongside the existing gpt-5.4 entry. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

A 30-pair, 60-YAML companion experiment to the main benchmark. Each pair is a single dilemma authored twice with the principal advocating opposite sides (cd=+1 in *_a, cd=-1 in *_b). The experiment measures whether a model lands on the same response cluster across the flip (productive-stable / frozen-stable) or swings with the principal (flipped). Includes: - scenarios/ 60 paired YAMLs, sharing chassis / state / clusters / judge_rubric across A and B; only the principal turn flips. - README.md authoring spec, schema, quality bar, and audit policy for new pairs. - audit_pairs.py strict 8-check pair auditor (axis well-formedness, advocacy-pole correctness, cluster-id stability, false- symmetry / loose-botch / direction-bake lints, etc.). Buckets failures into A/B/C and emits audit_report.json. - analyze.py computes the per-model stability table (productive- stable / frozen-stable / flipped) from judged.json outputs across the 4 default panel models. - run_experiment.sh parallel harness runner across 4 models. Results, philosophy_audit.json, and audit_report.json stay local (gitignored). Run `philosophy-bench prime` per model and then analyze.py to reproduce. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

benedictbrady and others added 3 commits April 29, 2026 16:36

Replace stability pairs with axis benchmark corpus

d99dcb8

benedictbrady changed the title ~~Add stability_pairs harness + chassis tools for multi-step scenarios~~ Add A/B framing stability benchmark Apr 30, 2026

benedictbrady added 2 commits April 30, 2026 15:53

Tighten stability benchmark compatibility

f9008a1

Format mock support chassis

ae4bee9

benedictbrady merged commit ca270f7 into main Apr 30, 2026
9 checks passed

benedictbrady deleted the feat/stability-pairs-harness branch April 30, 2026 22:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add A/B framing stability benchmark#2

Add A/B framing stability benchmark#2
benedictbrady merged 5 commits into
mainfrom
feat/stability-pairs-harness

benedictbrady commented Apr 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

benedictbrady commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Compatibility

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

benedictbrady commented Apr 29, 2026 •

edited

Loading