Skip to content

feat: unify reasoning_content + thinking_blocks across providers (v0.4.9)#14

Merged
vitalii-dynamiq merged 3 commits into
mainfrom
feat/reasoning-content-unified
May 10, 2026
Merged

feat: unify reasoning_content + thinking_blocks across providers (v0.4.9)#14
vitalii-dynamiq merged 3 commits into
mainfrom
feat/reasoning-content-unified

Conversation

@vitalii-dynamiq

@vitalii-dynamiq vitalii-dynamiq commented May 7, 2026

Copy link
Copy Markdown
Contributor

Summary

Reasoning-capable models (DeepSeek-R1, GLM-4.5+, Anthropic Claude with extended thinking, Gemini 2.5 with `includeThoughts`, Groq DeepSeek/Qwen-thinking, Cerebras Qwen-thinking, Together / Fireworks DeepSeek-R1, OpenAI o-series via `/chat/completions`) all expose chain-of-thought, but each family uses a different field name. Previously arcllm dropped this entirely on the floor — callers could see the final answer but not the thinking.

This wires up a unified surface:

  • `Message.reasoning_content: str` — flat-string CoT, populated by every reasoning provider
  • `Message.thinking_blocks: list[ThinkingBlock]` — Anthropic's structured form (`thinking` | `redacted_thinking`, with signatures preserved for tool-use round-trips)
  • `ChunkDelta.reasoning_content / .thinking / .signature` — streaming deltas

Provider mapping

  • OpenAIAdapter (and DeepSeek, GLM, Groq, Cerebras, Together, Fireworks, Nebius, OVHcloud, Moonshot, OpenRouter, Perplexity — every OpenAI-compat subclass): reads `message.reasoning_content` or `message.reasoning`; same for `delta.reasoning_content / .reasoning` in stream events.
  • AnthropicAdapter: extracts `content[].type=="thinking"` and `"redacted_thinking"` blocks; populates both `thinking_blocks` (with signature) and a concatenated `reasoning_content`. Streaming handles `thinking_delta` / `signature_delta` with one block per signature.
  • GeminiAdapter: routes `parts[].thought=true` text into `reasoning_content` (non-thought parts stay in `content`). Same split for streaming.

`stream_chunk_builder` accumulates reasoning across chunks and rebuilds Anthropic's per-block grouping (`signature_delta` closes a block).

Live verification (through arcllm.completion)

Provider content reasoning_len thinking_blocks
Z.AI GLM-4.5-air `"5"` 730
DeepSeek-R1 `"5"` 67
Claude Sonnet 4.5 `"5"` 101 1 (signature ✅)
Gemini 2.5 Flash `"5"` 406

Streaming verified for all four — Anthropic's `thinking_delta` + `signature_delta` correctly group into a single `ThinkingBlock` with the signature attached.

Test plan

  • 18 new unit tests covering wire-format parsing per provider + `stream_chunk_builder` accumulation
  • arcllm full unit suite: 792 passed (was 782)
  • ruff / mypy --strict / pyright clean on changed files
  • dynamiq unit tests: 1149 passed (no regressions from the new fields)
  • dynamiq integration tests: 1066 passed
  • Live smoke against four reasoning families through real APIs

Coverage gap (not in this PR)

OpenAI's Responses API (`/v1/responses`, not `/v1/chat/completions`) returns reasoning as `output[].type=="reasoning"` items. arcllm only uses chat/completions today, so this didn't surface. If a Responses adapter is added, `Message.reasoning_items` (litellm's name) would be the natural extension.


Note

Medium Risk
Adds new response/stream fields and modifies core stream aggregation and multiple provider parsers, which could affect downstream consumers expecting the previous response shape or streaming semantics. Provider-specific handling (especially Anthropic streaming block grouping/signatures) increases edge-case risk but is well-covered by new tests.

Overview
Adds first-class support for reasoning/extended-thinking outputs by introducing Message.reasoning_content (flat string) and Anthropic-specific Message.thinking_blocks/ThinkingBlock, plus streaming deltas (ChunkDelta.reasoning_content, thinking, signature).

Updates OpenAI-compatible parsing to accept both reasoning_content and OpenAI’s reasoning alias; updates Gemini parsing to separate parts[].thought into reasoning_content; and updates Anthropic parsing/streaming to preserve thinking/redacted_thinking blocks (including signatures) and emit thinking/signature stream deltas.

Enhances stream_chunk_builder to accumulate reasoning across chunks and to rebuild per-choice Anthropic thinking blocks (with a fallback that populates reasoning_content from blocks), bumps version to 0.4.9, and adds extensive unit tests covering parsing, streaming accumulation, and serialization.

Reviewed by Cursor Bugbot for commit b5552ff. Bugbot is set up for automated code reviews on this repo. Configure here.

Three drop-in gaps prevented dynamiq's test fixtures from passing
against arcllm even though direct API calls worked.

Exception positional args:
litellm's exception classes take (message, llm_provider, model, ...)
positionally. arcllm previously made these keyword-only. Tests
construct errors as `RateLimitError(msg, "bedrock", "amazon.titan")`
which raised "takes 2 positional arguments but 4 were given".

- ArcLLMError: provider/model/status_code now positional after message;
  llm_provider stays keyword-only as the litellm-name alias
- RateLimitError: accepts (message, provider, model) positionally
- ProviderAPIError: detects litellm shape (status_code, message, ...)
  by type — first int positional becomes status_code
- BadRequestError (renamed from InvalidRequestError to match the
  canonical litellm/OpenAI name; InvalidRequestError stays as alias):
  accepts (message, model, provider) per litellm AND
  (message, provider, model) per arcllm. Disambiguates by checking
  SUPPORTED_PROVIDERS — common provider names always resolve correctly.

Streaming chunk serialisation:
Choice.model_dump() omitted .delta. dynamiq's streaming callback reads
chunk["choices"][0]["delta"]["content"] from the serialized dict, so it
saw KeyError on every streamed event.

token_counter overhead:
Counts now follow OpenAI's per-message formula (3 + per-key
+ 1 for name + 3 priming) so totals match litellm's. Previous
sum-of-fields undercount made dynamiq's history-summarisation logic
preserve more context than the model could actually accept.

ModelResponse defaults:
- choices defaults to [Choice()] so fixtures that do
  ModelResponse()["choices"][0]["message"]["content"] = ... work
- stream: bool = False added so ModelResponse(stream=True) is accepted
- Choice.delta added so streaming fixtures can set delta on the same
  Choice class litellm uses for both modes

Result: dynamiq main suite goes from 281 integration failures → 0
(1066 integration + 1149 unit, all passing). arcllm's own test suite
unchanged (8 pre-existing Ollama integration failures only).
…4.9)

Reasoning-capable models (DeepSeek-R1, GLM-4.5+, Anthropic Claude with
extended thinking, Gemini 2.5 with includeThoughts, Groq DeepSeek/Qwen,
Cerebras Qwen-thinking, Together / Fireworks DeepSeek-R1, OpenAI o-series
via chat/completions) all expose chain-of-thought, but each family uses
a different field name. Previously arcllm dropped this entirely on the
floor — callers could see the final answer but not the thinking.

This wires up a unified surface:

- Message.reasoning_content: str — flat-string CoT, populated by every
  reasoning provider
- Message.thinking_blocks: list[ThinkingBlock] — Anthropic's structured
  form (thinking | redacted_thinking, with signatures preserved for
  tool-use round-trips)
- ChunkDelta.reasoning_content / .thinking / .signature — streaming deltas

Provider mapping:
- OpenAIAdapter (and DeepSeek, GLM, Groq, Cerebras, Together, Fireworks,
  Nebius, OVHcloud, Moonshot, OpenRouter, Perplexity — all subclasses):
  reads message.reasoning_content or message.reasoning from the response;
  same for delta.reasoning_content / .reasoning in stream events.
- AnthropicAdapter: extracts content[].type=="thinking" and
  "redacted_thinking" blocks; populates both thinking_blocks (with
  signature) and a concatenated reasoning_content. Streaming handles
  thinking_delta / signature_delta with one block per signature.
- GeminiAdapter: routes parts[].thought=true text into reasoning_content
  (non-thought parts stay in content). Same split for streaming.

stream_chunk_builder accumulates reasoning across chunks and rebuilds
Anthropic's per-block grouping (signature_delta closes a block).

Verified live end-to-end through arcllm.completion:

  Z.AI GLM-4.5-air      content="5"  reasoning_len=730
  DeepSeek-R1           content="5"  reasoning_len=67
  Claude Sonnet 4.5     content="5"  reasoning_len=101  thinking_blocks=1 (sig)
  Gemini 2.5 Flash      content="5"  reasoning_len=406

Streaming verified for all four — Anthropic's thinking_delta +
signature_delta correctly group into a single ThinkingBlock with the
signature attached.

18 new unit tests cover wire-format parsing for every provider plus
stream_chunk_builder. arcllm own suite: 792 passed (was 782).
dynamiq integration suite unaffected: 1149 unit + 1066 integration,
all passing.
Comment thread arcllm/exceptions.py
kwargs.setdefault("provider", arg2)
kwargs.setdefault("model", arg3)
elif arg2 is not None:
kwargs.setdefault("provider", arg2)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single positional arg always treated as provider incorrectly

Medium Severity

When only arg2 is provided (without arg3), the elif arg2 is not None branch unconditionally treats it as provider. However, the docstring and litellm's documented signature BadRequestError(message, model, llm_provider) indicate the second positional is the model. If litellm callers pass only two positional args (message + model), the model name would be incorrectly stored as provider. The SUPPORTED_PROVIDERS heuristic is only applied when both arg2 and arg3 are present, leaving this single-arg case mishandled.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4dab9eb. Configure here.

@vitalii-dynamiq vitalii-dynamiq merged commit a899e5a into main May 10, 2026
15 checks passed

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit b5552ff. Configure here.

finish_reason=None,
)
],
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Streaming silently drops redacted_thinking blocks

Medium Severity

The Anthropic streaming handler in parse_stream_event handles content_block_start for type=="thinking" but silently drops type=="redacted_thinking" blocks (returns None). The non-streaming _build_model_response correctly preserves redacted_thinking blocks. Anthropic's streaming protocol does emit content_block_start with type: "redacted_thinking", and these blocks must be preserved unchanged for multi-turn conversation history. Additionally, stream_chunk_builder hardcodes type="thinking" for all assembled blocks, making it impossible to represent redacted_thinking even if the adapter were to emit them.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b5552ff. Configure here.

tool_calls=tool_calls,
function_call=delta_data.get("function_call"),
reasoning_content=delta_data.get("reasoning_content")
or delta_data.get("reasoning"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Falsy or conflates empty string with absent field

Low Severity

Using or to fall back from reasoning_content to reasoning means an explicit empty string "" in reasoning_content is treated as absent, falling through to the reasoning field. If a provider legitimately sends both fields (e.g., reasoning_content: "" alongside reasoning: null), the result is None rather than "". While semantically an empty string contributes nothing, it prevents callers from distinguishing "field present but empty" from "field absent" via is not None checks on the resulting Message.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b5552ff. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant