
@Parth220

Why are these changes needed?

Fix for Autogen MultimodalConversableAgent Tool Response Issue

Problem

When using MultimodalConversableAgent with tool functions, the agent wraps tool responses in a multimodal format:

[{'type': 'text', 'text': 'actual response text'}]

However, the ToolResponseEvent in autogen expects content to be a simple string, causing validation errors:

ValidationError: Input should be a valid string [type=string_type, input_value=[{'type': 'text', 'text':...}], input_type=list]

Root Cause

The MultimodalConversableAgent formats all responses (including tool responses) in a multimodal format to support both text and images. But the event validation system expects tool responses to be simple strings.

Solution

Modified the create_received_event_model function in agent_events.py to extract text from the multimodal format when handling "tool" responses:

if role == "tool":
    # Handle multimodal content format - extract text if content is a list of dicts
    content = event.get('content')
    if isinstance(content, list) and len(content) > 0 and isinstance(content[0], dict):
        # Extract text from multimodal format [{'type': 'text', 'text': '...'}]
        text_parts = []
        for item in content:
            if isinstance(item, dict) and item.get('type') == 'text':
                text_parts.append(item.get('text', ''))
        event['content'] = ''.join(text_parts)

This extracts the actual text content from the multimodal format before creating the ToolResponseEvent.
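For illustration, the same extraction logic can be exercised as a standalone sketch (the extract_text helper name is hypothetical, not part of the actual patch):

```python
def extract_text(content):
    """Collapse multimodal content [{'type': 'text', 'text': ...}, ...] to a plain string."""
    if isinstance(content, list) and content and isinstance(content[0], dict):
        return ''.join(
            item.get('text', '')
            for item in content
            if isinstance(item, dict) and item.get('type') == 'text'
        )
    return content

# Multimodal tool response as produced by MultimodalConversableAgent
print(extract_text([{'type': 'text', 'text': 'actual response text'}]))  # actual response text
# Plain strings pass through unchanged
print(extract_text('already a string'))  # already a string
```

With the content flattened to a string, ToolResponseEvent's string validation passes.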

Related issue number

N/A

Checks

@CLAassistant

CLAassistant commented Aug 18, 2025

CLA assistant check
All committers have signed the CLA.

@qingyun-wu
Collaborator

@Parth220 could you run pre-commit? Thanks! https://docs.ag2.ai/latest/docs/contributor-guide/pre-commit/

@codecov

codecov bot commented Aug 30, 2025

Codecov Report

❌ Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.

Files with missing lines       | Patch % | Lines
autogen/events/agent_events.py | 14.28%  | 5 Missing and 1 partial ⚠️

Files with missing lines       | Coverage Δ
autogen/events/agent_events.py | 95.98% <14.28%> (-1.66%) ⬇️

... and 36 files with indirect coverage changes


@github-actions
Contributor

Code Review for PR #2043

Summary

This PR addresses a validation error when using MultimodalConversableAgent with tool functions. The agent wraps tool responses in multimodal format [{'type': 'text', 'text': '...'}], but ToolResponseEvent expects a simple string, causing Pydantic validation errors.


Changes Review

1. autogen/events/agent_events.py (Lines 245-253)

Added: Multimodal content extraction for tool responses

if role == "tool":
    # Handle multimodal content format - extract text if content is a list of dicts
    content = event.get('content')
    if isinstance(content, list) and len(content) > 0 and isinstance(content[0], dict):
        # Extract text from multimodal format [{'type': 'text', 'text': '...'}]
        text_parts = []
        for item in content:
            if isinstance(item, dict) and item.get('type') == 'text':
                text_parts.append(item.get('text', ''))
        event['content'] = ''.join(text_parts)

Issues Found:

  1. ⚠️ Mutating Input Dictionary: The code directly modifies the event dictionary (event['content'] = ...), which could cause unexpected side effects if the same event object is used elsewhere. Consider using deepcopy or creating a new dict.

  2. ⚠️ Empty String Fallback: Using item.get('text', '') could silently hide missing text content. Consider logging a warning or raising an error for malformed content.

  3. ⚠️ Incomplete Type Checking: The check only verifies the first element is a dict but assumes all elements are. If the list contains mixed types, this could fail.

  4. ⚠️ Silent Content Loss: Non-text items (like images in {'type': 'image_url', ...}) are silently dropped. This might be intentional for tool responses, but should be documented.

2. autogen/agentchat/contrib/multimodal_conversable_agent.py (Lines 116-124)

Added: Tool response unpacking logic

# Fix tool response format for OpenAI API
fixed_messages = []
for msg in messages_with_b64_img:
    if isinstance(msg, dict) and msg.get("role") == "tool" and "tool_responses" in msg:
        # Unpack tool_responses to individual tool messages with tool_call_id
        for tool_response in msg["tool_responses"]:
            fixed_messages.append(tool_response)
    else:
        fixed_messages.append(msg)

Issues Found:

  1. ✅ Good: This properly unpacks tool responses for OpenAI API compatibility.

  2. ⚠️ Variable Naming: messages_with_b64_img is misleading since it's now being modified to also handle tool responses. Consider renaming to something like processed_messages.

  3. ⚠️ Line 128: The change from messages[-1].pop('context', None) to messages[-1].pop('context', None) if messages else None is good for preventing index errors, but this could be an unrelated fix that should be mentioned in the PR description.
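The unpacking step in section 2 can be sketched as a standalone function (unpack_tool_responses is a hypothetical name, not the PR's code):

```python
def unpack_tool_responses(messages):
    """Flatten aggregated 'tool' messages into individual tool messages."""
    fixed = []
    for msg in messages:
        if isinstance(msg, dict) and msg.get("role") == "tool" and "tool_responses" in msg:
            # Each entry already carries its own tool_call_id, as the OpenAI API expects
            fixed.extend(msg["tool_responses"])
        else:
            fixed.append(msg)
    return fixed

history = [
    {"role": "user", "content": "What's the weather?"},
    {"role": "tool", "tool_responses": [
        {"role": "tool", "tool_call_id": "call_1", "content": "sunny"},
        {"role": "tool", "tool_call_id": "call_2", "content": "22 C"},
    ]},
]
flat = unpack_tool_responses(history)
# flat now holds three messages: the user message plus two individual tool messages
```

This matches the shape the OpenAI Chat Completions API requires, where each tool result is its own message keyed by tool_call_id.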


Code Quality Assessment

Strengths:

  • ✅ Addresses a real bug with clear symptom description
  • ✅ Minimal, focused changes
  • ✅ Includes helpful inline comments
  • ✅ Properly handles empty lists in the safety check

Concerns:

  • Mutating input data: Direct modification of the event dict is a code smell
  • No error handling: Malformed multimodal content could cause silent failures
  • ⚠️ Mixed concerns: The multimodal_conversable_agent.py changes address two different issues

Test Coverage Issues

Major Concern: No Tests Added

The PR description states:

"I've added tests (if relevant) corresponding to the changes introduced in this PR."

However, no test files were modified in this PR. Codecov reports 14.28% patch coverage with 6 lines missing coverage.

Required Tests:

  1. ✅ Test case for tool response with multimodal format: [{'type': 'text', 'text': 'result'}]
  2. ✅ Test case for tool response with mixed content types
  3. ✅ Test case for empty multimodal list
  4. ✅ Test case for malformed multimodal content (missing 'text' key)
  5. ✅ Test case for unpacking tool_responses in MultimodalConversableAgent
  6. ✅ Integration test showing end-to-end tool call with MultimodalConversableAgent

Suggested Test Location:

  • test/events/test_agent_events.py - Add test to TestToolResponseEvent class
  • test/agentchat/test_multimodal_integration.py - Add integration test
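A sketch of what tests 1-4 above could look like. The _extract helper below is a hypothetical inline mirror of the PR's extraction logic, so the tests run without importing autogen internals; the real tests would exercise create_received_event_model directly:

```python
def _extract(content):
    # Mirrors the extraction logic from the PR snippet (hypothetical helper)
    if isinstance(content, list) and content and isinstance(content[0], dict):
        return ''.join(
            item.get('text', '') for item in content
            if isinstance(item, dict) and item.get('type') == 'text'
        )
    return content

def test_multimodal_text_is_extracted():
    assert _extract([{'type': 'text', 'text': 'result'}]) == 'result'

def test_mixed_content_keeps_only_text():
    mixed = [{'type': 'text', 'text': 'a'},
             {'type': 'image_url', 'image_url': {'url': 'http://example.com/x.png'}}]
    assert _extract(mixed) == 'a'

def test_empty_multimodal_list_passes_through():
    assert _extract([]) == []

def test_missing_text_key_yields_empty_string():
    assert _extract([{'type': 'text'}]) == ''
```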

Security Concerns

Low Risk, but consider:

  • Input validation: No checks for malicious content in multimodal payloads
  • Type confusion: Mixed types in content list could cause unexpected behavior

Performance Considerations

  • Minor overhead: Additional loop through content items for tool responses
  • Memory: Creating new fixed_messages list could be optimized for large message histories
  • Overall impact: Negligible for typical use cases

Recommendations

High Priority (Should Fix)

  1. Add comprehensive tests to achieve reasonable coverage (target 80%+)
  2. Avoid mutating input: Create a copy of the event dict before modification
  3. Add error handling: Validate multimodal content structure and provide clear error messages

Medium Priority (Should Consider)

  1. Document behavior: Add docstring explaining multimodal content extraction
  2. Type hints: Add type annotations to the new code sections
  3. Logging: Add debug logging for content transformation

Low Priority (Nice to Have)

  1. Refactor: Extract multimodal content parsing into a separate utility function
  2. Edge cases: Handle empty strings, None values, and nested structures

Suggested Code Improvements

For agent_events.py:

if role == "tool":
    # Handle multimodal content format - extract text if content is a list of dicts
    content = event.get('content')
    if isinstance(content, list) and len(content) > 0:
        # Extract text from multimodal format [{'type': 'text', 'text': '...'}]
        text_parts = []
        for item in content:
            if not isinstance(item, dict):
                continue  # Skip non-dict items
            if item.get('type') == 'text':
                text_value = item.get('text')
                if text_value is not None:
                    text_parts.append(str(text_value))
        # Only modify if we found text content
        if text_parts:
            event = event.copy()  # Don't mutate the original
            event['content'] = ''.join(text_parts)
    return ToolResponseEvent(**event, sender=sender.name, recipient=recipient.name, uuid=uuid)
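A quick check that the copy-before-modify approach leaves the caller's event untouched (normalize_tool_content is a hypothetical wrapper around the suggested logic, and the event shape is assumed):

```python
def normalize_tool_content(event):
    """Return an event whose 'content' is a plain string, without mutating the input."""
    content = event.get('content')
    if isinstance(content, list) and content:
        text_parts = [
            str(item.get('text'))
            for item in content
            if isinstance(item, dict) and item.get('type') == 'text'
            and item.get('text') is not None
        ]
        if text_parts:
            event = event.copy()  # shallow copy so the caller's dict is untouched
            event['content'] = ''.join(text_parts)
    return event

original = {'content': [{'type': 'text', 'text': 'result'}], 'tool_call_id': 'call_1'}
fixed = normalize_tool_content(original)
print(fixed['content'])                       # result
print(isinstance(original['content'], list))  # True -- original is untouched
```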

Conclusion

Overall Assessment: Needs Work ⚠️

The PR addresses a legitimate issue with a reasonable approach, but has significant gaps:

  1. Critical: Missing test coverage - only 14.28%
  2. ⚠️ Important: Direct mutation of input data
  3. ⚠️ Important: No error handling for malformed content
  4. ℹ️ Minor: Code could be more robust and maintainable

Recommendation: Request changes

  • Add comprehensive tests before merging
  • Fix the input mutation issue
  • Add basic error handling

Estimated Effort to Fix: 2-3 hours

  • 1-2 hours for comprehensive tests
  • 30 min for code improvements
  • 30 min for documentation

Additional Notes

  • The changes are merged from multiple branches; make sure all tests pass
  • Consider adding a notebook example demonstrating the fix
  • Pre-commit hooks need to be run (mentioned in comments)

Let me know if you need help writing the tests or making the recommended improvements!


Review generated by Claude Code

