Add A/B framing stability benchmark#2
Merged
Conversation
Adds the tools and step budget needed for scenarios where the agent must both perform a primary action and escalate, notify, or update internal records as a second move. - mock_crm: approve_order, set_payment_terms, flag_for_credit_review, escalate_ticket - mock_support: send_notification, update_record, escalate_ticket - engine.ScenarioSpec.max_agent_steps default 8 → 12 Also registers gpt-5.5 (OpenAI, medium reasoning) so it can be used as a target model alongside the existing gpt-5.4 entry. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
A 30-pair, 60-YAML companion experiment to the main benchmark. Each
pair is a single dilemma authored twice with the principal advocating
opposite sides (cd=+1 in *_a, cd=-1 in *_b). The experiment measures
whether a model lands on the same response cluster across the flip
(productive-stable / frozen-stable) or swings with the principal
(flipped).
Includes:
- scenarios/ 60 paired YAMLs, sharing chassis / state / clusters /
judge_rubric across A and B; only the principal turn flips.
- README.md authoring spec, schema, quality bar, and audit policy
for new pairs.
- audit_pairs.py strict 8-check pair auditor (axis well-formedness,
advocacy-pole correctness, cluster-id stability, false-
symmetry / loose-botch / direction-bake lints, etc.).
Buckets failures into A/B/C and emits audit_report.json.
- analyze.py computes the per-model stability table (productive-
stable / frozen-stable / flipped) from judged.json
outputs across the 4 default panel models.
- run_experiment.sh parallel harness runner across 4 models.
Results, philosophy_audit.json, and audit_report.json stay local
(gitignored). Run `philosophy-bench prime` per model and then
analyze.py to reproduce.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a focused A/B framing stability benchmark for measuring user-corrigibility / sycophancy sensitivity across mirror-pair scenarios. The corpus is intentionally distinct from the original C/D benchmark: each pair keeps the underlying facts fixed while changing only the user's framing toward one defensible pole of a non-C/D ethical axis.
What changed
stability_pairs/scenariosstability_pairsrun, audit, and analysis scriptsmock_supporttools needed by the new corpusgpt-5.5to the model registrycd_scoretoaxis_scoreCompatibility
cd_scorestill load through a schema migration shimResponseCluster.cd_scoreremains available as a backward-compatible aliascd_*fields alongside newaxis_*fieldscluster_cd(...)callers still workmax_agent_stepsdefault remains8; the new corpus opts into12per scenarioValidation
uv run ruff check src testsuv run ruff format --check src tests.venv/bin/python -m pytest tests→ 683 passed.venv/bin/python stability_pairs/audit_pairs.py --strict --quiet→ 100 pairs, 0 errors, 0 warnings.venv/bin/philosophy-bench scenarios --root stability_pairs/scenarios→ loaded 200 scenarios