Always set include_thoughts=True on Gemini calls #1
Merged
call_gemini() only constructed a ThinkingConfig when the model spec set
thinking_level. Gemini 2.5 specs (gemini-2.5-pro / -flash / -flash-lite)
have thinking_config={} so the ThinkingConfig was skipped entirely,
causing include_thoughts to fall back to the API default of False. The
result: those models reasoned internally (consuming their thinking
budget) but returned zero thought parts to the caller, leaving empty
"thinking" entries in the bench transcripts.
Side-by-side probe on three scenarios (first user turn only, no tools)
before vs after the patch:
| model | scenario | before | after |
| --- | --- | --- | --- |
| gemini-2.5-pro | attribution-10 | 0 | 0 |
| gemini-2.5-pro | honesty-pressure-03 | 0 | 2950 |
| gemini-2.5-pro | loyalty-04 | 0 | 2179 |
| gemini-2.5-flash | attribution-10 | 0 | 1886 |
| gemini-2.5-flash | honesty-pressure-03 | 0 | 2556 |
| gemini-2.5-flash | loyalty-04 | 0 | 2706 |

gemini-3.1-pro: unchanged (its spec already had `thinking_level=MEDIUM`).
Gemini 3.x specs already passed the conditional, so they are unaffected.
The new construction always builds a `ThinkingConfig` with `include_thoughts`
set, and only adds `thinking_level` when present (Gemini 2.5 doesn't
support it). Passing `thinking_level=None` preserves the model's auto/dynamic
budget per Google's API.
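The construction logic can be sketched as follows. This is a minimal, SDK-free illustration: `build_thinking_config` and the assumed spec shape (a `thinking_config` dict that may carry `thinking_level`) are stand-ins for the real `call_gemini()` code, and the returned dict mirrors the fields that would go into the SDK's `ThinkingConfig`.

```python
def build_thinking_config(spec: dict) -> dict:
    # Always request thought parts; never gate this behind thinking_level
    # (that gating was the bug that silently dropped Gemini 2.5 thoughts).
    config = {"include_thoughts": True}
    # Only 3.x-style specs carry thinking_level; Gemini 2.5 doesn't support
    # the field, and omitting it keeps the model's auto/dynamic budget.
    level = spec.get("thinking_config", {}).get("thinking_level")
    if level is not None:
        config["thinking_level"] = level
    return config
```

With this shape, a 2.5-style spec (`thinking_config={}`) still gets `include_thoughts=True`, while a 3.x-style spec gets both fields.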
Added tests/test_providers_gemini.py with three checks:
- 2.5-style spec (no thinking_level) still gets include_thoughts=True
- 3.x-style spec (thinking_level=MEDIUM) gets both fields
- source-level smoke test that include_thoughts=True is not gated
behind `if thinking_level:` again
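The source-level smoke test can be done with a regex over the provider module's source; the sketch below is illustrative (the helper name and the exact pattern are assumptions, not the real test's code). It flags the bug pattern where `include_thoughts` sits inside an `if ...thinking_level...:` block.

```python
import re

# Matches an `if ...thinking_level...:` line followed by an indented
# block that mentions include_thoughts, i.e. the original bug pattern.
_GATED = re.compile(
    r"if\s[^\n]*thinking_level[^\n]*:\s*\n"
    r"(?:[ \t]+[^\n]*\n)*?"
    r"[ \t]+[^\n]*include_thoughts"
)


def thinking_gated_on_level(source: str) -> bool:
    """Return True if include_thoughts is gated behind a thinking_level check."""
    return bool(_GATED.search(source))
```

In the real test, `source` would be read from the provider module file; the fixed code sets `include_thoughts=True` unconditionally, so the pattern should never match.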
Impact on prior published results: the gemini-2.5 family had ~38–55%
"empty thinking trace" rates in the dashboard_live data because of this
bug. Re-running those three models will surface their reasoning and
likely shift their reasoning-classification numbers upward.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ruff F401: `unittest.mock.patch`, `pytest`, and `philosophy_bench.providers` were imported but never referenced. An earlier draft of the test mocked the SDK; the final form sidesteps that with a local stub class, leaving the imports orphaned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's lint job runs `ruff format --check`; the prior commit cleared the F401 errors but omitted a blank line that the formatter requires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
`call_gemini()` only constructed a `ThinkingConfig` when the model spec set `thinking_level`. Gemini 2.5 specs (`gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`) have `thinking_config={}`, so the `ThinkingConfig` was skipped entirely and `include_thoughts` fell back to the API default of `False`. Those models reasoned internally (consuming their thinking budget) but returned zero thought parts to the caller, leaving empty `thinking` entries in the bench transcripts.
The fix always builds a `ThinkingConfig` with `include_thoughts=True`, and only adds `thinking_level` when present (Gemini 2.5 doesn't support it; passing `None` preserves the model's auto/dynamic budget per Google's docs).
Evidence
Side-by-side probe on three scenarios (first user turn only, no tools); `before` is current `main`, `after` is this PR:

| model | scenario | before | after |
| --- | --- | --- | --- |
| gemini-2.5-pro | attribution-10 | 0 | 0 |
| gemini-2.5-pro | honesty-pressure-03 | 0 | 2950 |
| gemini-2.5-pro | loyalty-04 | 0 | 2179 |
| gemini-2.5-flash | attribution-10 | 0 | 1886 |
| gemini-2.5-flash | honesty-pressure-03 | 0 | 2556 |
| gemini-2.5-flash | loyalty-04 | 0 | 2706 |

Numbers are character counts of the assembled thinking text. Gemini 3.x specs already passed the conditional, so they are unaffected. The remaining `0`s post-patch reflect the model deciding not to think on a particular call (Gemini's adaptive thinking budget can collapse to 0); they're no longer caused by the SDK silently dropping thoughts.
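The per-call numbers above can be reproduced by summing character counts over the thought parts of a response. The sketch below operates on a plain-dict stand-in for the SDK's response parts (the real probe reads the SDK's objects, where a part carries `thought=True` and a `text` field):

```python
def thinking_chars(parts: list[dict]) -> int:
    # Only parts flagged as thoughts count; with include_thoughts left at
    # its default of False, the API returns no such parts at all, which is
    # exactly the all-zero "before" column.
    return sum(len(p.get("text", "")) for p in parts if p.get("thought"))
```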
Tests
New `tests/test_providers_gemini.py` covers:
- a 2.5-style spec (no `thinking_level`) still gets `include_thoughts=True`
- a 3.x-style spec (`thinking_level=MEDIUM`) gets both fields
- a source-level smoke test that `include_thoughts=True` is not gated behind `if thinking_level:` again
Full suite: 682 passed.
Impact on prior results
The Gemini 2.5 family had ~38–55% empty-thinking-trace rates across baseline / d_direct / c_direct in published bench results because of this bug. Re-running those three models against this fix will surface their reasoning and almost certainly move their reasoning-classification numbers upward.
Test plan
🤖 Generated with Claude Code