Contributor

@jottakka jottakka commented Nov 18, 2025

MCP Server Tool Evaluation Support

Overview

Add support for evaluating tools from remote MCP servers without requiring Python callables. Enables direct evaluation of any MCP-compatible tool server.

What's New

Core Features

  • MCPToolRegistry: Evaluate tools from a single MCP server
  • CompositeMCPRegistry: Evaluate tools from multiple MCP servers simultaneously
  • Automatic loaders: load_from_stdio() and load_from_http() to fetch tools from running servers
  • Automatic namespacing: Tools prefixed with server name (e.g., server_tool_name)
  • Smart name resolution: Use short names if unique, full names if ambiguous
  • OpenAI strict mode: Automatic schema conversion prevents parameter hallucinations
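The namespacing and name-resolution rules above can be illustrated with a minimal sketch. The helpers `build_name_index` and `resolve` are hypothetical names introduced here for illustration, not the PR's implementation; the sketch assumes each server contributes a flat list of tool names and that the prefix is joined with an underscore.

```python
from collections import defaultdict


def build_name_index(tool_lists: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map every short tool name to the namespaced names that provide it."""
    index: dict[str, list[str]] = defaultdict(list)
    for server, tool_names in tool_lists.items():
        for tool_name in tool_names:
            index[tool_name].append(f"{server}_{tool_name}")
    return dict(index)


def resolve(name: str, index: dict[str, list[str]]) -> str:
    """Return the namespaced name for a short or fully qualified name."""
    all_full_names = {full for fulls in index.values() for full in fulls}
    if name in all_full_names:
        return name  # already fully qualified
    candidates = index.get(name, [])
    if len(candidates) == 1:
        return candidates[0]  # short name is unique across servers
    if not candidates:
        raise KeyError(f"unknown tool: {name}")
    raise ValueError(f"ambiguous tool {name!r}; use one of {sorted(candidates)}")
```

Under these assumptions, a tool that only one server exposes resolves from its short name, while a name shared by two servers has to be written with its server prefix.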

Usage

Automatic Loading:

from arcade_evals import load_from_stdio, MCPToolRegistry

# Load tools automatically from MCP server
tools = load_from_stdio(["npx", "-y", "@modelcontextprotocol/server-github"])
registry = MCPToolRegistry(tools)

Single MCP Server:

from arcade_evals import EvalSuite, ExpectedToolCall, MCPToolRegistry

# mcp_tools: tool definitions already fetched from an MCP server
registry = MCPToolRegistry(mcp_tools)
suite = EvalSuite(catalog=registry)

suite.add_case(
    expected_tool_calls=[
        ExpectedToolCall(tool_name="tool_name", args={...})
    ]
)

Multiple MCP Servers:

from arcade_evals import CompositeMCPRegistry, EvalSuite, ExpectedToolCall, load_from_stdio

# Load from multiple servers
github_tools = load_from_stdio(["npx", "-y", "@modelcontextprotocol/server-github"])
slack_tools = load_from_stdio(["npx", "-y", "@modelcontextprotocol/server-slack"])

composite = CompositeMCPRegistry(
    tool_lists={
        "github": github_tools,
        "slack": slack_tools,
    }
)

suite = EvalSuite(catalog=composite)

suite.add_case(
    expected_tool_calls=[
        ExpectedToolCall(tool_name="github_list_issues", args={...})
    ]
)

Implementation

Files Changed

  • libs/arcade-evals/arcade_evals/registry.py (NEW): Registry abstractions and implementations
  • libs/arcade-evals/arcade_evals/loaders.py (NEW): Automatic tool loading from MCP servers
  • libs/arcade-evals/arcade_evals/eval.py (MODIFIED): Enhanced ExpectedToolCall and evaluation logic
  • libs/arcade-evals/arcade_evals/__init__.py (MODIFIED): Exported new registries and loaders

Key Technical Details

  • Added BaseToolRegistry interface for abstraction
  • MCPToolRegistry handles single server tools
  • CompositeMCPRegistry manages multiple servers with collision detection
  • load_from_stdio() and load_from_http() for automatic tool discovery
  • Fixed name normalization bug: MCP tools use underscores (not dots)
  • Optimized tool copying: 2.5x faster via shallow copy
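As a rough illustration of the shallow-copy point above: the sketch below renames a tool by copying only its top-level dictionary. It assumes tool definitions are plain dicts with a `name` key and a nested `inputSchema`; `namespaced_copy` is a hypothetical helper, and the sketch makes no attempt to reproduce the 2.5x figure reported in the PR.

```python
import copy


def namespaced_copy(tool: dict, server: str) -> dict:
    """Return a renamed copy of a tool definition without deep-copying its schema."""
    renamed = copy.copy(tool)  # shallow: new top-level dict, nested values are shared
    renamed["name"] = f"{server}_{tool['name']}"
    return renamed
```

The trade-off is that the nested schema object is shared between the original and the copy, so this approach is only safe if later code never mutates the schema in place.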

Testing

  • ✅ 41 tests passing (25 new tests added)
  • test_eval_mcp_registry.py: MCPToolRegistry functionality
  • test_eval_composite_mcp.py: CompositeMCPRegistry with multiple servers
  • ✅ Verified backward compatibility with Python tools

Backward Compatibility

100% backward compatible - No breaking changes

Breaking Changes

None


Note

Adds end-to-end eval UX: examples, a robust CLI runner, and rich outputs.

  • New examples: eval_arcade_gateway.py, eval_stdio_mcp_server.py, eval_http_mcp_server.py, eval_comprehensive_comparison.py with timeouts, error handling, and track-based comparisons; detailed README.md
  • CLI runner: arcade_cli/evals_runner.py to execute evals/capture in parallel with progress, error isolation, failed-only filtering, context inclusion, and multi-provider/model support
  • Output formatters: arcade_cli/formatters/ (txt, md, html, json) for evals and capture; comparative and multi-model HTML with tabs and context rendering
  • Display refactor: display.py now supports writing multiple formats, failed-only disclaimers, include-context, and improved console summaries
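The "parallel with error isolation" behaviour described for the CLI runner can be sketched as follows. This is not the code in arcade_cli/evals_runner.py; `run_one` and `run_all` are hypothetical stand-ins, and the sketch assumes each suite runs as an independent coroutine whose failure should not cancel the others.

```python
import asyncio


async def run_one(name: str, should_fail: bool) -> str:
    """Stand-in for running one eval suite (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if should_fail:
        raise RuntimeError(f"{name} failed")
    return f"{name} ok"


async def run_all(jobs: dict[str, bool]) -> dict[str, object]:
    """Run every job concurrently; a failing job yields its exception as a result."""
    names = list(jobs)
    results = await asyncio.gather(
        *(run_one(name, jobs[name]) for name in names),
        return_exceptions=True,  # isolate errors instead of aborting the batch
    )
    return dict(zip(names, results))


outcome = asyncio.run(run_all({"suite_a": False, "suite_b": True}))
```

With `return_exceptions=True`, the failing suite's exception is returned in place of its result, so the caller can report it alongside the successful runs.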

Written by Cursor Bugbot for commit ff8acf9. This will update automatically on new commits.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds MCP (Model Context Protocol) server support to Arcade Evals, enabling evaluation of remote server tools without requiring Python callables. The implementation introduces registry abstractions and two new registry types for single and multiple MCP servers.

Key changes:

  • New MCPToolRegistry for evaluating tools from a single MCP server
  • New CompositeMCPRegistry for evaluating tools from multiple MCP servers with automatic namespacing
  • Enhanced ExpectedToolCall to support both Python callables and MCP tool names
  • OpenAI strict mode schema conversion to prevent parameter hallucinations
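For readers unfamiliar with strict-mode schema conversion, here is a minimal sketch of the idea, not the PR's converter. It assumes the input is a simple JSON Schema built from `object`, `array`, and scalar types, and applies the constraints OpenAI strict mode places on function parameters: every property listed in `required`, `additionalProperties` set to false, and optional fields expressed as nullable. The function names are hypothetical.

```python
import copy


def to_strict_schema(schema: dict) -> dict:
    """Return a strict-mode copy of a JSON Schema; the input is left untouched."""
    strict = copy.deepcopy(schema)
    _make_strict(strict)
    return strict


def _make_strict(node: dict) -> None:
    """Recursively apply strict-mode constraints to object and array nodes."""
    if node.get("type") == "object":
        props = node.setdefault("properties", {})
        originally_required = set(node.get("required", []))
        for name, prop in props.items():
            _make_strict(prop)  # recurse before the type may become a list
            if name not in originally_required:
                _allow_null(prop)  # optional fields become nullable
        node["required"] = list(props)
        node["additionalProperties"] = False
    elif node.get("type") == "array" and isinstance(node.get("items"), dict):
        _make_strict(node["items"])


def _allow_null(prop: dict) -> None:
    """Add "null" to a property's type so the model may omit a value."""
    declared = prop.get("type")
    if isinstance(declared, str) and declared != "null":
        prop["type"] = [declared, "null"]
    elif isinstance(declared, list) and "null" not in declared:
        prop["type"] = [*declared, "null"]
```

Because the schema then forbids undeclared properties, a model cannot emit arguments the tool does not define, which is the hallucination the summary refers to.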

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Summary per file:

  • pyproject.toml: Version bump from 1.5.3 to 1.6.0 for new MCP features
  • libs/arcade-evals/arcade_evals/registry.py: New registry abstractions and MCP implementations
  • libs/arcade-evals/arcade_evals/eval.py: Enhanced to support MCP registries alongside Python tools
  • libs/arcade-evals/arcade_evals/__init__.py: Exported new registry classes
  • libs/tests/sdk/test_eval_mcp_registry.py: Comprehensive tests for MCPToolRegistry
  • libs/tests/sdk/test_eval_composite_mcp.py: Comprehensive tests for CompositeMCPRegistry
  • examples/mcp_servers/mcp_eval_example.py: Example usage of MCPToolRegistry
  • examples/mcp_evals_example.py: Example usage of MCP evaluations
  • examples/composite_mcp_evals_example.py: Example usage of CompositeMCPRegistry
Comments suppressed due to low confidence (2)

libs/arcade-evals/arcade_evals/registry.py:1

  • The docstring for BaseToolRegistry is incomplete. It starts with "This allows evaluations to work with both Python-based tools (ToolCatalog)" but cuts off without finishing the sentence or mentioning MCP-based tools properly.
"""Base registry interface for tool evaluation."""

libs/arcade-evals/arcade_evals/registry.py:1

  • Type narrowing assigns to MCPToolRegistry but should be MCPToolRegistry | CompositeMCPRegistry since both types support the same interface. This could cause issues if the catalog is actually a CompositeMCPRegistry.
"""Base registry interface for tool evaluation."""

Contributor

evantahler commented Nov 20, 2025

I'm of the opinion that this should be merged in from a product/business POV.

  1. We want to make arcade-mcp as useful as we can.
  2. Evals aren't unique to us anymore; other frameworks have evals now too
  3. This is OSS, and if we didn't do it, someone could fork it and add remote-MCP eval support anyway
  4. We (@jottakka and @torresmateo) want to use this to show how our tools are better anyway

@EricGustin EricGustin assigned EricGustin and unassigned EricGustin Dec 3, 2025
Member

@EricGustin EricGustin left a comment

Firstly, I really like how this PR keeps ToolCatalog pure (no impact on arcade_mcp_server) and I also agree that being able to evaluate non-local and also non-arcade tools will be very beneficial.

I do have various concerns about this code though, but I also don't want to completely block/stall the progress of something that will provide us benefit. This PR doesn't seem to break anything, so it could technically be merged, but I'm afraid that it will end up being dead code in its current form. cc @evantahler

@jottakka let's set up a ~30-minute call to chat about this. Ideally, in that time we can come up with a rough spec for this project.

With that being said, there are 4 main concerns that I have regarding this PR:

  1. The CLI is the interface through which arcade_evals is used and there is a specified way to define an evaluation suite that the CLI expects. This PR comes up with a second way to define an evaluation suite which is not able to be understood by the CLI. From what I can tell, the additions in this PR won't work for the CLI's arcade evals command.
  2. There is a lot of logic in this PR around resolving tool names into something else that is 'less ambiguous'. Changing anything about the definition of a tool concerns me, because we're no longer evaluating the same tools that the LLM will use in the real world. I'm advocating for removing this logic/complexity entirely.
  3. To support evaluating a remote MCP server's tools is to make the evaluation framework a full-fledged MCP client. This is a big task and (possibly?) more than we want to take on. This PR sort of does it, but it feels more like an ad-hoc MCP client than one built on an existing MCP client framework or a proper client layer of our own. As a result, all of the logic ends up bundled into one function with no clear types or separation of concerns. Before merging, I'm advocating that we either (1) adopt an existing MCP client library or (2) decide whether this use case is sufficient to justify creating our own MCP client framework.
  4. Architecturally, this PR is creating an entirely parallel code path via the new registry hierarchy (BaseToolRegistry, PythonToolRegistry, MCPToolRegistry, CompositeToolRegistry, etc.). I don't think we need this complexity. An alternative is to store MCP tools directly in EvalSuite. For example EvalSuite.add_mcp_tools(...) and EvalSuite.load_mcp_tools_http(...).
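To make the alternative in point 4 concrete, here is a minimal sketch of a suite that stores MCP tool definitions directly. `SketchEvalSuite` is a hypothetical stand-in, not the real EvalSuite, and only the method name `add_mcp_tools` comes from the comment above; its signature and the duplicate-name check are assumptions. In line with point 2, the sketch keeps tool names exactly as the server reports them.

```python
from dataclasses import dataclass, field


@dataclass
class SketchEvalSuite:
    """Hypothetical stand-in for EvalSuite; not the real arcade_evals class."""

    name: str
    mcp_tools: dict[str, dict] = field(default_factory=dict)

    def add_mcp_tools(self, tools: list[dict]) -> None:
        """Register MCP tool definitions under their original, unmodified names."""
        for tool in tools:
            tool_name = tool["name"]
            if tool_name in self.mcp_tools:
                raise ValueError(f"duplicate tool name: {tool_name}")
            self.mcp_tools[tool_name] = tool

    def has_tool(self, tool_name: str) -> bool:
        """Report whether a tool with this exact name has been registered."""
        return tool_name in self.mcp_tools
```

In this shape there is no separate registry hierarchy: the suite owns the tool definitions, and a name clash surfaces as an error rather than being resolved by renaming.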

@jottakka jottakka self-assigned this Dec 8, 2025
@jottakka jottakka requested a review from EricGustin December 29, 2025 02:35
Member

@EricGustin EricGustin left a comment

Mostly some DevEx comments regarding the CLI command.

Also, I'm seeing Session termination failed: 202 logged whenever I run evals for a gateway. What is this?

@jottakka jottakka requested a review from EricGustin January 5, 2026 18:54
Member

@EricGustin EricGustin left a comment

@jottakka a couple small comments. Will approve once merge conflicts are resolved & if you decide to address any of the below. DM when ready

@jottakka jottakka requested a review from EricGustin January 7, 2026 19:56
@jottakka jottakka merged commit 98fad93 into main Jan 7, 2026
14 checks passed
@jottakka jottakka deleted the francisco/updating-arcade-evails branch January 7, 2026 23:26

5 participants