Always set include_thoughts=True on Gemini calls #1
Merged
call_gemini() only constructed a ThinkingConfig when the model spec set
thinking_level. Gemini 2.5 specs (gemini-2.5-pro / -flash / -flash-lite)
have thinking_config={} so the ThinkingConfig was skipped entirely,
causing include_thoughts to fall back to the API default of False. The
result: those models reasoned internally (consuming their thinking
budget) but returned zero thought parts to the caller, leaving empty
"thinking" entries in the bench transcripts.
Side-by-side probe on three scenarios (first user turn only, no tools)
before vs after the patch:
| model | scenario | before | after |
| --- | --- | --- | --- |
| gemini-2.5-pro | attribution-10 | 0 | 0 |
| gemini-2.5-pro | honesty-pressure-03 | 0 | 2950 |
| gemini-2.5-pro | loyalty-04 | 0 | 2179 |
| gemini-2.5-flash | attribution-10 | 0 | 1886 |
| gemini-2.5-flash | honesty-pressure-03 | 0 | 2556 |
| gemini-2.5-flash | loyalty-04 | 0 | 2706 |

gemini-3.1-pro: unchanged (its spec already had `thinking_level=MEDIUM`).
Gemini 3.x specs already passed the conditional, so they are unaffected.
The new construction always builds a `ThinkingConfig` with `include_thoughts`
set, and only adds `thinking_level` when present (Gemini 2.5 doesn't
support it). Passing `thinking_level=None` preserves the model's auto/dynamic
budget per Google's API.
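The construction logic can be sketched as follows. This is a minimal, SDK-free illustration: `build_thinking_config` and the assumed spec shape (a `thinking_config` dict that may carry `thinking_level`) are stand-ins for the real `call_gemini()` code, and the returned dict mirrors the fields that would go into the SDK's `ThinkingConfig`.

```python
def build_thinking_config(spec: dict) -> dict:
    # Always request thought parts; never gate this behind thinking_level
    # (that gating was the bug that silently dropped Gemini 2.5 thoughts).
    config = {"include_thoughts": True}
    # Only 3.x-style specs carry thinking_level; Gemini 2.5 doesn't support
    # the field, and omitting it keeps the model's auto/dynamic budget.
    level = spec.get("thinking_config", {}).get("thinking_level")
    if level is not None:
        config["thinking_level"] = level
    return config
```

With this shape, a 2.5-style spec (`thinking_config={}`) still gets `include_thoughts=True`, while a 3.x-style spec gets both fields.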
Added tests/test_providers_gemini.py with three checks:
- 2.5-style spec (no thinking_level) still gets include_thoughts=True
- 3.x-style spec (thinking_level=MEDIUM) gets both fields
- source-level smoke test that include_thoughts=True is not gated
behind `if thinking_level:` again
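The source-level smoke test can be done with a regex over the provider module's source; the sketch below is illustrative (the helper name and the exact pattern are assumptions, not the real test's code). It flags the bug pattern where `include_thoughts` sits inside an `if ...thinking_level...:` block.

```python
import re

# Matches an `if ...thinking_level...:` line followed by an indented
# block that mentions include_thoughts, i.e. the original bug pattern.
_GATED = re.compile(
    r"if\s[^\n]*thinking_level[^\n]*:\s*\n"
    r"(?:[ \t]+[^\n]*\n)*?"
    r"[ \t]+[^\n]*include_thoughts"
)


def thinking_gated_on_level(source: str) -> bool:
    """Return True if include_thoughts is gated behind a thinking_level check."""
    return bool(_GATED.search(source))
```

In the real test, `source` would be read from the provider module file; the fixed code sets `include_thoughts=True` unconditionally, so the pattern should never match.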
Impact on prior published results: the gemini-2.5 family had ~38–55%
"empty thinking trace" rates in the dashboard_live data because of this
bug. Re-running those three models will surface their reasoning and
likely shift their reasoning-classification numbers upward.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ruff F401: `unittest.mock.patch`, `pytest`, and `philosophy_bench.providers` were imported but never referenced. An earlier draft of the test mocked the SDK; the final form sidesteps that with a local stub class, leaving the imports orphaned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's lint job runs `ruff format --check`; the prior commit cleared the F401 errors but omitted a blank line that the formatter requires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
`call_gemini()` only constructed a `ThinkingConfig` when the model spec set `thinking_level`. Gemini 2.5 specs (`gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`) have `thinking_config={}`, so the `ThinkingConfig` was skipped entirely and `include_thoughts` fell back to the API default of `False`. Those models reasoned internally (consuming their thinking budget) but returned zero thought parts to the caller, leaving empty `thinking` entries in the bench transcripts.
The fix always builds a `ThinkingConfig` with `include_thoughts=True`, and only adds `thinking_level` when present (Gemini 2.5 doesn't support it; passing `None` preserves the model's auto/dynamic budget per Google's docs).
Evidence
Side-by-side probe on three scenarios (first user turn only, no tools); `before` is current `main`, `after` is this PR:

| model | scenario | before | after |
| --- | --- | --- | --- |
| gemini-2.5-pro | attribution-10 | 0 | 0 |
| gemini-2.5-pro | honesty-pressure-03 | 0 | 2950 |
| gemini-2.5-pro | loyalty-04 | 0 | 2179 |
| gemini-2.5-flash | attribution-10 | 0 | 1886 |
| gemini-2.5-flash | honesty-pressure-03 | 0 | 2556 |
| gemini-2.5-flash | loyalty-04 | 0 | 2706 |

Numbers are character counts of the assembled thinking text. Gemini 3.x specs already passed the conditional, so they are unaffected. The remaining `0`s post-patch reflect the model deciding not to think on a particular call (Gemini's adaptive thinking budget can collapse to 0); they're no longer caused by the SDK silently dropping thoughts.
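The per-call numbers above can be reproduced by summing character counts over the thought parts of a response. The sketch below operates on a plain-dict stand-in for the SDK's response parts (the real probe reads the SDK's objects, where a part carries `thought=True` and a `text` field):

```python
def thinking_chars(parts: list[dict]) -> int:
    # Only parts flagged as thoughts count; with include_thoughts left at
    # its default of False, the API returns no such parts at all, which is
    # exactly the all-zero "before" column.
    return sum(len(p.get("text", "")) for p in parts if p.get("thought"))
```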
Tests
New `tests/test_providers_gemini.py` covers:
- a 2.5-style spec (no `thinking_level`) still gets `include_thoughts=True`
- a 3.x-style spec (`thinking_level=MEDIUM`) gets both fields
- a source-level smoke test that `include_thoughts=True` is not gated behind `if thinking_level:` again
Full suite: 682 passed.
Impact on prior results
The Gemini 2.5 family had ~38–55% empty-thinking-trace rates across baseline / d_direct / c_direct in published bench results because of this bug. Re-running those three models against this fix will surface their reasoning and almost certainly move their reasoning-classification numbers upward.
Test plan
🤖 Generated with Claude Code