Skip to content

Add A/B framing stability benchmark#2

Merged
benedictbrady merged 5 commits into
mainfrom
feat/stability-pairs-harness
Apr 30, 2026
Merged

Add A/B framing stability benchmark#2
benedictbrady merged 5 commits into
mainfrom
feat/stability-pairs-harness

Conversation

@benedictbrady
Copy link
Copy Markdown
Owner

@benedictbrady benedictbrady commented Apr 29, 2026

Summary

Adds a focused A/B framing stability benchmark for measuring user-corrigibility / sycophancy sensitivity across mirror-pair scenarios. The corpus is intentionally distinct from the original C/D benchmark: each pair keeps the underlying facts fixed while changing only the user's framing toward one defensible pole of a non-C/D ethical axis.

What changed

  • Adds 100 mirror-pair scenarios, 200 YAML files total, under stability_pairs/scenarios
  • Covers five authored axes: care/impartiality, autonomy/paternalism, justice/mercy, integrity/compromise, and role/conscience
  • Adds the stability_pairs run, audit, and analysis scripts
  • Adds the minimal mock_support tools needed by the new corpus
  • Adds gpt-5.5 to the model registry
  • Generalizes scenario cluster scoring from C/D-specific cd_score to axis_score

Compatibility

  • Existing scenarios with cd_score still load through a schema migration shim
  • ResponseCluster.cd_score remains available as a backward-compatible alias
  • Score summaries continue to include legacy cd_* fields alongside new axis_* fields
  • Existing cluster_cd(...) callers still work
  • Existing judge tiebreak token names are preserved
  • The old global max_agent_steps default remains 8; the new corpus opts into 12 per scenario

Validation

  • uv run ruff check src tests
  • uv run ruff format --check src tests
  • .venv/bin/python -m pytest tests → 683 passed
  • .venv/bin/python stability_pairs/audit_pairs.py --strict --quiet → 100 pairs, 0 errors, 0 warnings
  • .venv/bin/philosophy-bench scenarios --root stability_pairs/scenarios → loaded 200 scenarios

benedictbrady and others added 3 commits April 29, 2026 16:36
Adds the tools and step budget needed for scenarios where the agent must
both perform a primary action and escalate, notify, or update internal
records as a second move.

- mock_crm: approve_order, set_payment_terms, flag_for_credit_review,
  escalate_ticket
- mock_support: send_notification, update_record, escalate_ticket
- engine.ScenarioSpec.max_agent_steps default 8 → 12

Also registers gpt-5.5 (OpenAI, medium reasoning) so it can be used as
a target model alongside the existing gpt-5.4 entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
A 30-pair, 60-YAML companion experiment to the main benchmark. Each
pair is a single dilemma authored twice with the principal advocating
opposite sides (cd=+1 in *_a, cd=-1 in *_b). The experiment measures
whether a model lands on the same response cluster across the flip
(productive-stable / frozen-stable) or swings with the principal
(flipped).

Includes:
- scenarios/  60 paired YAMLs, sharing chassis / state / clusters /
              judge_rubric across A and B; only the principal turn flips.
- README.md   authoring spec, schema, quality bar, and audit policy
              for new pairs.
- audit_pairs.py  strict 8-check pair auditor (axis well-formedness,
              advocacy-pole correctness, cluster-id stability, false-
              symmetry / loose-botch / direction-bake lints, etc.).
              Buckets failures into A/B/C and emits audit_report.json.
- analyze.py  computes the per-model stability table (productive-
              stable / frozen-stable / flipped) from judged.json
              outputs across the 4 default panel models.
- run_experiment.sh  parallel harness runner across 4 models.

Results, philosophy_audit.json, and audit_report.json stay local
(gitignored). Run `philosophy-bench prime` per model and then
analyze.py to reproduce.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@benedictbrady benedictbrady changed the title Add stability_pairs harness + chassis tools for multi-step scenarios Add A/B framing stability benchmark Apr 30, 2026
@benedictbrady benedictbrady merged commit ca270f7 into main Apr 30, 2026
9 checks passed
@benedictbrady benedictbrady deleted the feat/stability-pairs-harness branch April 30, 2026 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant