Always set include_thoughts=True on Gemini calls#1

Merged
benedictbrady merged 3 commits into main from fix/gemini-25-include-thoughts
Apr 28, 2026

Conversation

@benedictbrady
Owner

Summary

`call_gemini()` only constructed a `ThinkingConfig` when the model spec set `thinking_level`. Gemini 2.5 specs (`gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`) have `thinking_config={}`, so the `ThinkingConfig` was skipped entirely and `include_thoughts` fell back to the API default of `False`. Those models reasoned internally (consuming their thinking budget) but returned zero thought parts to the caller, leaving empty `thinking` entries in the bench transcripts.

The fix always builds a `ThinkingConfig` with `include_thoughts=True`, and only adds `thinking_level` when present (Gemini 2.5 doesn't support it; passing `None` preserves the model's auto/dynamic budget per Google's docs).
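A minimal sketch of the corrected construction (the `ModelSpec` shape and field names here are illustrative stand-ins for the project's real spec objects, and the kwargs mirror what would be passed to the SDK's `ThinkingConfig` rather than importing it):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelSpec:
    # Hypothetical spec shape: Gemini 2.5 entries carry an empty
    # thinking_config, so thinking_level is None for them; Gemini 3.x
    # entries set it (e.g. "MEDIUM").
    thinking_level: Optional[str] = None


def build_thinking_kwargs(spec: ModelSpec) -> dict:
    # Always request thoughts. The bug was skipping this whole dict
    # whenever thinking_level was unset, letting include_thoughts fall
    # back to the API default of False.
    kwargs = {"include_thoughts": True}
    # Only add thinking_level when the spec sets it; omitting the field
    # preserves the model's auto/dynamic thinking budget on Gemini 2.5.
    if spec.thinking_level is not None:
        kwargs["thinking_level"] = spec.thinking_level
    return kwargs
```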

Evidence

Side-by-side probe on three scenarios (first user turn only, no tools), `before` is current `main`, `after` is this PR:

  model            scenario              before  after
  gemini-2.5-pro   attribution-10             0      0
  gemini-2.5-pro   honesty-pressure-03        0   2950
  gemini-2.5-pro   loyalty-04                 0   2179
  gemini-2.5-flash attribution-10             0   1886
  gemini-2.5-flash honesty-pressure-03        0   2556
  gemini-2.5-flash loyalty-04                 0   2706
  gemini-3.1-pro   (unchanged — already had thinking_level=MEDIUM)

Numbers are character counts of the assembled thinking text. Gemini 3.x specs already passed the conditional, so they are unaffected. The remaining `0`s post-patch reflect the model deciding not to think on a particular call (Gemini's adaptive thinking budget can collapse to 0); they're no longer caused by the SDK silently dropping thoughts.
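The counting itself is simple; the probe's assembly step looks roughly like this (the `thought`/`text` part attributes follow the Gemini response format, but the stub class below is illustrative, not the SDK type):

```python
from dataclasses import dataclass


@dataclass
class Part:
    # Stand-in for a response part: Gemini marks reasoning parts with
    # thought=True; visible answer parts leave it False.
    text: str
    thought: bool = False


def thinking_chars(parts) -> int:
    # Join only the thought parts (as the probe did) and count characters.
    # With the bug, the response simply contained no thought parts, so
    # this was always 0 for Gemini 2.5 models.
    return len("".join(p.text for p in parts if p.thought))
```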

Tests

New `tests/test_providers_gemini.py` covers:

  • Gemini 2.5-style spec (empty `thinking_config`) still gets `include_thoughts=True`
  • Gemini 3.x-style spec (`thinking_level=MEDIUM`) gets both fields
  • Source-level smoke test that `include_thoughts=True` is not re-gated behind `if thinking_level:` by a future refactor
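The source-level smoke test could be written along these lines (the exact file path and helper name are assumptions; the point is to fail if `include_thoughts` ever moves back inside the `thinking_level` conditional):

```python
import re

# Match an `if ... thinking_level ...:` line and capture its indented body.
_GATED = re.compile(
    r"^([ \t]*)if [^\n]*thinking_level[^\n]*:\n((?:\1[ \t]+[^\n]*\n)+)",
    re.MULTILINE,
)


def include_thoughts_gated(source: str) -> bool:
    # True if include_thoughts is set inside an `if ... thinking_level:`
    # body, i.e. the exact regression this smoke test guards against.
    return any("include_thoughts" in m.group(2) for m in _GATED.finditer(source))
```

In the real test one would read the provider module's source (e.g. via `inspect.getsource`) and assert `include_thoughts_gated(src)` is `False`.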

Full suite: 682 passed.

Impact on prior results

The Gemini 2.5 family had ~38–55% empty-thinking-trace rates across baseline / d_direct / c_direct in published bench results because of this bug. Re-running those three models against this fix will surface their reasoning and almost certainly move their reasoning-classification numbers upward.

Test plan

  • `uv run pytest tests/` — all green (682 passed)
  • Live API probe on gemini-2.5-pro / gemini-2.5-flash / gemini-3.1-pro showing thoughts surface post-patch (table above)
  • Reviewer to confirm `thinking_level=None` is acceptable for Gemini 2.5 (per docs, passing `None` is equivalent to omitting the field — the model uses dynamic budget by default)

🤖 Generated with Claude Code

benedictbrady and others added 3 commits April 28, 2026 16:31
call_gemini() only constructed a ThinkingConfig when the model spec set
thinking_level. Gemini 2.5 specs (gemini-2.5-pro / -flash / -flash-lite)
have thinking_config={} so the ThinkingConfig was skipped entirely,
causing include_thoughts to fall back to the API default of False. The
result: those models reasoned internally (consuming their thinking
budget) but returned zero thought parts to the caller, leaving empty
"thinking" entries in the bench transcripts.

Side-by-side probe on three scenarios (first user turn only, no tools)
before vs after the patch:

  model            scenario              before  after
  gemini-2.5-pro   attribution-10             0      0
  gemini-2.5-pro   honesty-pressure-03        0   2950
  gemini-2.5-pro   loyalty-04                 0   2179
  gemini-2.5-flash attribution-10             0   1886
  gemini-2.5-flash honesty-pressure-03        0   2556
  gemini-2.5-flash loyalty-04                 0   2706
  gemini-3.1-pro   (unchanged — already had thinking_level=MEDIUM)

Gemini 3.x specs already passed the conditional, so they are unaffected.
The new construction always builds a ThinkingConfig with include_thoughts
set, and only adds thinking_level when present (Gemini 2.5 doesn't
support it). Passing thinking_level=None preserves the model's auto /
dynamic budget per Google's API.

Added tests/test_providers_gemini.py with three checks:
  - 2.5-style spec (no thinking_level) still gets include_thoughts=True
  - 3.x-style spec (thinking_level=MEDIUM) gets both fields
  - source-level smoke test that include_thoughts=True is not gated
    behind `if thinking_level:` again

Impact on prior published results: the gemini-2.5 family had ~38–55%
"empty thinking trace" rates in the dashboard_live data because of this
bug. Re-running those three models will surface their reasoning and
likely shift their reasoning-classification numbers upward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ruff F401: unittest.mock.patch, pytest, and philosophy_bench.providers
were imported but never referenced. Earlier draft of the test mocked the
SDK; final form sidesteps that with a local stub class, leaving the
imports orphaned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's lint job runs `ruff format --check`; the prior commit cleared the
F401 errors but left a missing blank line that the formatter wanted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@benedictbrady benedictbrady merged commit 285d515 into main Apr 28, 2026
9 checks passed
@benedictbrady benedictbrady deleted the fix/gemini-25-include-thoughts branch April 28, 2026 23:48