
@Parth220

Why are these changes needed?

Fix for Autogen MultimodalConversableAgent Tool Response Issue

Problem

When using MultimodalConversableAgent with tool functions, the agent wraps tool responses in a multimodal format:

[{'type': 'text', 'text': 'actual response text'}]

However, the ToolResponseEvent in autogen expects content to be a simple string, causing validation errors:

ValidationError: Input should be a valid string [type=string_type, input_value=[{'type': 'text', 'text':...}], input_type=list]

Root Cause

The MultimodalConversableAgent formats all responses (including tool responses) in a multimodal format to support both text and images. But the event validation system expects tool responses to be simple strings.

Solution

Modified the create_received_event_model function in agent_events.py to extract text from the multimodal format when handling "tool" responses:

if role == "tool":
    # Handle multimodal content format - extract text if content is a list of dicts
    content = event.get('content')
    if isinstance(content, list) and len(content) > 0 and isinstance(content[0], dict):
        # Extract text from multimodal format [{'type': 'text', 'text': '...'}]
        text_parts = []
        for item in content:
            if isinstance(item, dict) and item.get('type') == 'text':
                text_parts.append(item.get('text', ''))
        event['content'] = ''.join(text_parts)

This extracts the actual text content from the multimodal format before creating the ToolResponseEvent.
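For illustration, the same extraction logic can be exercised as a standalone sketch (the extract_text helper name is hypothetical, not part of the actual patch):

```python
def extract_text(content):
    """Collapse multimodal content [{'type': 'text', 'text': ...}, ...] to a plain string."""
    if isinstance(content, list) and content and isinstance(content[0], dict):
        return ''.join(
            item.get('text', '')
            for item in content
            if isinstance(item, dict) and item.get('type') == 'text'
        )
    return content

# Multimodal tool response as produced by MultimodalConversableAgent
print(extract_text([{'type': 'text', 'text': 'actual response text'}]))  # actual response text
# Plain strings pass through unchanged
print(extract_text('already a string'))  # already a string
```

With the content flattened to a string, ToolResponseEvent's string validation passes.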

Related issue number

N/A

Checks

@CLAassistant

CLAassistant commented Aug 18, 2025

CLA assistant check
All committers have signed the CLA.

@qingyun-wu
Collaborator

@Parth220 could you run pre-commit? Thanks! https://docs.ag2.ai/latest/docs/contributor-guide/pre-commit/

@codecov

codecov bot commented Aug 30, 2025

Codecov Report

❌ Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.

Files with missing lines       | Patch % | Lines
autogen/events/agent_events.py | 14.28%  | 5 Missing and 1 partial ⚠️

Files with missing lines       | Coverage Δ
autogen/events/agent_events.py | 95.98% <14.28%> (-1.66%) ⬇️

... and 36 files with indirect coverage changes


@github-actions
Contributor

Code Review for PR #2043

Summary

This PR addresses a validation error when using MultimodalConversableAgent with tool functions. The agent wraps tool responses in multimodal format [{'type': 'text', 'text': '...'}], but ToolResponseEvent expects a simple string, causing Pydantic validation errors.


Changes Review

1. autogen/events/agent_events.py (Lines 245-253)

Added: Multimodal content extraction for tool responses

if role == "tool":
    # Handle multimodal content format - extract text if content is a list of dicts
    content = event.get('content')
    if isinstance(content, list) and len(content) > 0 and isinstance(content[0], dict):
        # Extract text from multimodal format [{'type': 'text', 'text': '...'}]
        text_parts = []
        for item in content:
            if isinstance(item, dict) and item.get('type') == 'text':
                text_parts.append(item.get('text', ''))
        event['content'] = ''.join(text_parts)

Issues Found:

  1. ⚠️ Mutating Input Dictionary: The code directly modifies the event dictionary (event['content'] = ...), which could cause unexpected side effects if the same event object is used elsewhere. Consider using deepcopy or creating a new dict.

  2. ⚠️ Empty String Fallback: Using item.get('text', '') could silently hide missing text content. Consider logging a warning or raising an error for malformed content.

  3. ⚠️ Incomplete Type Checking: The check only verifies the first element is a dict but assumes all elements are. If the list contains mixed types, this could fail.

  4. ⚠️ Silent Content Loss: Non-text items (like images in {'type': 'image_url', ...}) are silently dropped. This might be intentional for tool responses, but should be documented.

2. autogen/agentchat/contrib/multimodal_conversable_agent.py (Lines 116-124)

Added: Tool response unpacking logic

# Fix tool response format for OpenAI API
fixed_messages = []
for msg in messages_with_b64_img:
    if isinstance(msg, dict) and msg.get("role") == "tool" and "tool_responses" in msg:
        # Unpack tool_responses to individual tool messages with tool_call_id
        for tool_response in msg["tool_responses"]:
            fixed_messages.append(tool_response)
    else:
        fixed_messages.append(msg)

Issues Found:

  1. ✅ Good: This properly unpacks tool responses for OpenAI API compatibility.

  2. ⚠️ Variable Naming: messages_with_b64_img is misleading since it's now being modified to also handle tool responses. Consider renaming to something like processed_messages.

  3. ⚠️ Line 128: The change from messages[-1].pop('context', None) to messages[-1].pop('context', None) if messages else None is good for preventing index errors, but this could be an unrelated fix that should be mentioned in the PR description.
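The unpacking step in section 2 can be sketched as a standalone function (unpack_tool_responses is a hypothetical name, not the PR's code):

```python
def unpack_tool_responses(messages):
    """Flatten aggregated 'tool' messages into individual tool messages."""
    fixed = []
    for msg in messages:
        if isinstance(msg, dict) and msg.get("role") == "tool" and "tool_responses" in msg:
            # Each entry already carries its own tool_call_id, as the OpenAI API expects
            fixed.extend(msg["tool_responses"])
        else:
            fixed.append(msg)
    return fixed

history = [
    {"role": "user", "content": "What's the weather?"},
    {"role": "tool", "tool_responses": [
        {"role": "tool", "tool_call_id": "call_1", "content": "sunny"},
        {"role": "tool", "tool_call_id": "call_2", "content": "22 C"},
    ]},
]
flat = unpack_tool_responses(history)
# flat now holds three messages: the user message plus two individual tool messages
```

This matches the shape the OpenAI Chat Completions API requires, where each tool result is its own message keyed by tool_call_id.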


Code Quality Assessment

Strengths:

  • ✅ Addresses a real bug with clear symptom description
  • ✅ Minimal, focused changes
  • ✅ Includes helpful inline comments
  • ✅ Properly handles empty lists in the safety check

Concerns:

  • Mutating input data: Direct modification of the event dict is a code smell
  • No error handling: Malformed multimodal content could cause silent failures
  • ⚠️ Mixed concerns: The multimodal_conversable_agent.py changes address two different issues

Test Coverage Issues

Major Concern: No Tests Added

The PR description states:

"I've added tests (if relevant) corresponding to the changes introduced in this PR."

However, no test files were modified in this PR. Codecov reports 14.28% patch coverage with 6 lines missing coverage.

Required Tests:

  1. ✅ Test case for tool response with multimodal format: [{'type': 'text', 'text': 'result'}]
  2. ✅ Test case for tool response with mixed content types
  3. ✅ Test case for empty multimodal list
  4. ✅ Test case for malformed multimodal content (missing 'text' key)
  5. ✅ Test case for unpacking tool_responses in MultimodalConversableAgent
  6. ✅ Integration test showing end-to-end tool call with MultimodalConversableAgent

Suggested Test Location:

  • test/events/test_agent_events.py - Add test to TestToolResponseEvent class
  • test/agentchat/test_multimodal_integration.py - Add integration test
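A sketch of what tests 1-4 above could look like. The _extract helper below is a hypothetical inline mirror of the PR's extraction logic, so the tests run without importing autogen internals; the real tests would exercise create_received_event_model directly:

```python
def _extract(content):
    # Mirrors the extraction logic from the PR snippet (hypothetical helper)
    if isinstance(content, list) and content and isinstance(content[0], dict):
        return ''.join(
            item.get('text', '') for item in content
            if isinstance(item, dict) and item.get('type') == 'text'
        )
    return content

def test_multimodal_text_is_extracted():
    assert _extract([{'type': 'text', 'text': 'result'}]) == 'result'

def test_mixed_content_keeps_only_text():
    mixed = [{'type': 'text', 'text': 'a'},
             {'type': 'image_url', 'image_url': {'url': 'http://example.com/x.png'}}]
    assert _extract(mixed) == 'a'

def test_empty_multimodal_list_passes_through():
    assert _extract([]) == []

def test_missing_text_key_yields_empty_string():
    assert _extract([{'type': 'text'}]) == ''
```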

Security Concerns

Low Risk, but consider:

  • Input validation: No checks for malicious content in multimodal payloads
  • Type confusion: Mixed types in content list could cause unexpected behavior

Performance Considerations

  • Minor overhead: Additional loop through content items for tool responses
  • Memory: Creating new fixed_messages list could be optimized for large message histories
  • Overall impact: Negligible for typical use cases

Recommendations

High Priority (Should Fix)

  1. Add comprehensive tests to achieve reasonable coverage (target 80%+)
  2. Avoid mutating input: Create a copy of the event dict before modification
  3. Add error handling: Validate multimodal content structure and provide clear error messages

Medium Priority (Should Consider)

  1. Document behavior: Add docstring explaining multimodal content extraction
  2. Type hints: Add type annotations to the new code sections
  3. Logging: Add debug logging for content transformation

Low Priority (Nice to Have)

  1. Refactor: Extract multimodal content parsing into a separate utility function
  2. Edge cases: Handle empty strings, None values, and nested structures

Suggested Code Improvements

For agent_events.py:

if role == "tool":
    # Handle multimodal content format - extract text if content is a list of dicts
    content = event.get('content')
    if isinstance(content, list) and len(content) > 0:
        # Extract text from multimodal format [{'type': 'text', 'text': '...'}]
        text_parts = []
        for item in content:
            if not isinstance(item, dict):
                continue  # Skip non-dict items
            if item.get('type') == 'text':
                text_value = item.get('text')
                if text_value is not None:
                    text_parts.append(str(text_value))
        # Only modify if we found text content
        if text_parts:
            event = event.copy()  # Don't mutate the original
            event['content'] = ''.join(text_parts)
    return ToolResponseEvent(**event, sender=sender.name, recipient=recipient.name, uuid=uuid)
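A quick check that the copy-before-modify approach leaves the caller's event untouched (normalize_tool_content is a hypothetical wrapper around the suggested logic, and the event shape is assumed):

```python
def normalize_tool_content(event):
    """Return an event whose 'content' is a plain string, without mutating the input."""
    content = event.get('content')
    if isinstance(content, list) and content:
        text_parts = [
            str(item.get('text'))
            for item in content
            if isinstance(item, dict) and item.get('type') == 'text'
            and item.get('text') is not None
        ]
        if text_parts:
            event = event.copy()  # shallow copy so the caller's dict is untouched
            event['content'] = ''.join(text_parts)
    return event

original = {'content': [{'type': 'text', 'text': 'result'}], 'tool_call_id': 'call_1'}
fixed = normalize_tool_content(original)
print(fixed['content'])                       # result
print(isinstance(original['content'], list))  # True -- original is untouched
```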

Conclusion

Overall Assessment: Needs Work ⚠️

The PR addresses a legitimate issue with a reasonable approach, but has significant gaps:

  1. Critical: Missing test coverage - only 14.28%
  2. ⚠️ Important: Direct mutation of input data
  3. ⚠️ Important: No error handling for malformed content
  4. ℹ️ Minor: Code could be more robust and maintainable

Recommendation: Request changes

  • Add comprehensive tests before merging
  • Fix the input mutation issue
  • Add basic error handling

Estimated Effort to Fix: 2-3 hours

  • 1-2 hours for comprehensive tests
  • 30 min for code improvements
  • 30 min for documentation

Additional Notes

  • The changes are merged from multiple branches; make sure all tests pass
  • Consider adding a notebook example demonstrating the fix
  • Pre-commit hooks need to be run (mentioned in comments)

Let me know if you need help writing the tests or making the recommended improvements!


Review generated by Claude Code

