Adding MCP Server support to Arcade Evals #689
Pull Request Overview
This PR adds MCP (Model Context Protocol) server support to Arcade Evals, enabling evaluation of remote server tools without requiring Python callables. The implementation introduces registry abstractions and two new registry types for single and multiple MCP servers.
Key changes:
- New `MCPToolRegistry` for evaluating tools from a single MCP server
- New `CompositeMCPRegistry` for evaluating tools from multiple MCP servers with automatic namespacing
- Enhanced `ExpectedToolCall` to support both Python callables and MCP tool names
- OpenAI strict mode schema conversion to prevent parameter hallucinations
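The strict-mode conversion mentioned above can be illustrated with a sketch (this shows the general idea, not the code merged in this PR): OpenAI strict mode requires every object schema to set `additionalProperties: false` and to list every property under `required`, which is what prevents the model from hallucinating parameters.

```python
# Sketch of converting a tool's JSON Schema to OpenAI "strict" mode; an
# illustration of the idea, not this PR's implementation. Strict mode
# requires additionalProperties=False and all properties marked required,
# so the model cannot invent extra parameters.
import copy

def to_strict_schema(schema: dict) -> dict:
    schema = copy.deepcopy(schema)  # leave the caller's schema untouched
    _strictify(schema)
    return schema

def _strictify(node: dict) -> None:
    if node.get("type") == "object":
        props = node.get("properties", {})
        node["additionalProperties"] = False
        node["required"] = list(props)  # strict mode: every property required
        for child in props.values():
            _strictify(child)
    elif node.get("type") == "array" and isinstance(node.get("items"), dict):
        _strictify(node["items"])

tool_schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "filters": {
            "type": "object",
            "properties": {"repo": {"type": "string"}},
        },
    },
}
strict = to_strict_schema(tool_schema)
print(strict["additionalProperties"], strict["required"])
```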
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pyproject.toml | Version bump from 1.5.3 to 1.6.0 for new MCP features |
| libs/arcade-evals/arcade_evals/registry.py | New registry abstractions and MCP implementations |
| libs/arcade-evals/arcade_evals/eval.py | Enhanced to support MCP registries alongside Python tools |
| libs/arcade-evals/arcade_evals/__init__.py | Exported new registry classes |
| libs/tests/sdk/test_eval_mcp_registry.py | Comprehensive tests for MCPToolRegistry |
| libs/tests/sdk/test_eval_composite_mcp.py | Comprehensive tests for CompositeMCPRegistry |
| examples/mcp_servers/mcp_eval_example.py | Example usage of MCPToolRegistry |
| examples/mcp_evals_example.py | Example usage of MCP evaluations |
| examples/composite_mcp_evals_example.py | Example usage of CompositeMCPRegistry |
Comments suppressed due to low confidence (2)
libs/arcade-evals/arcade_evals/registry.py:1
- The docstring for `BaseToolRegistry` is incomplete. It starts with "This allows evaluations to work with both Python-based tools (ToolCatalog)" but cuts off without finishing the sentence or properly mentioning MCP-based tools.
"""Base registry interface for tool evaluation."""
libs/arcade-evals/arcade_evals/registry.py:1
- Type narrowing assigns to `MCPToolRegistry` but should be `MCPToolRegistry | CompositeMCPRegistry`, since both types support the same interface. This could cause issues if the catalog is actually a `CompositeMCPRegistry`.
"""Base registry interface for tool evaluation."""
I'm of the opinion that this should be merged in from a product/business POV.
EricGustin
left a comment
Firstly, I really like how this PR keeps ToolCatalog pure (no impact on arcade_mcp_server), and I agree that being able to evaluate non-local and non-Arcade tools will be very beneficial.
I do have several concerns about this code, but I also don't want to completely block or stall the progress of something that will provide us benefit. This PR doesn't seem to break anything, so it could technically be merged, but I'm afraid that in its current form it will end up being dead code. cc @evantahler
@jottakka let's set up a ~30-minute call to chat about this; ideally in that time we can come up with a rough spec for this project.
With that being said, there are 4 main concerns that I have regarding this PR:
- The CLI is the interface through which `arcade_evals` is used, and there is a specified way to define an evaluation suite that the CLI expects. This PR introduces a second way to define an evaluation suite, one the CLI cannot understand. From what I can tell, the additions in this PR won't work with the CLI's `arcade evals` command.
- There is a lot of logic in this PR around resolving tool names into something 'less ambiguous'. Changing anything about the definition of a tool concerns me, because we're then no longer evaluating the same tools that the LLM will use in the real world. I'm advocating for removing this logic/complexity entirely.
- Supporting evaluation of a remote MCP server's tools means making the evaluation framework a full-fledged MCP client. This is a big task and possibly more than we want to take on. This PR sort of does it, but the result feels like an ad-hoc MCP client rather than an existing MCP client framework or a proper client layer of our own, so all of the logic ends up bundled into one function with no clear types or separation of concerns. Before merging, I'm advocating for either (1) adopting an existing MCP client library, or (2) deciding whether this use case is sufficient to justify creating our own MCP client framework.
- Architecturally, this PR creates an entirely parallel code path via the new registry hierarchy (`BaseToolRegistry`, `PythonToolRegistry`, `MCPToolRegistry`, `CompositeToolRegistry`, etc.). I don't think we need this complexity. An alternative is to store MCP tools directly in `EvalSuite`, for example `EvalSuite.add_mcp_tools(...)` and `EvalSuite.load_mcp_tools_http(...)`.
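The alternative sketched in that last point might look something like the following; every name and signature here is hypothetical, extrapolated from the method names in the comment, not the project's API:

```python
# Hypothetical sketch of the suggested alternative: store MCP tool
# definitions directly on EvalSuite instead of adding a registry hierarchy.
# All names and signatures here are illustrative only.
from dataclasses import dataclass, field

@dataclass
class EvalSuite:
    name: str
    mcp_tools: dict = field(default_factory=dict)

    def add_mcp_tools(self, tools: dict) -> None:
        # Tool schemas live directly on the suite; no registry layer needed.
        self.mcp_tools.update(tools)

    def load_mcp_tools_http(self, url: str) -> None:
        # A real implementation would fetch the server's tool list over
        # Streamable HTTP; omitted in this sketch.
        raise NotImplementedError

suite = EvalSuite(name="github-tools")
suite.add_mcp_tools({"create_issue": {"type": "object", "properties": {}}})
print(sorted(suite.mcp_tools))
```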
EricGustin
left a comment
Mostly some DevEx comments regarding the CLI command.
Also, I'm seeing `Session termination failed: 202` logged whenever I run evals for a gateway. What is this?
EricGustin
left a comment
@jottakka a couple of small comments. Will approve once merge conflicts are resolved and if you decide to address any of the below. DM me when ready.
MCP Server Tool Evaluation Support
Overview
Add support for evaluating tools from remote MCP servers without requiring Python callables. Enables direct evaluation of any MCP-compatible tool server.
What's New
Core Features
- `MCPToolRegistry`: Evaluate tools from a single MCP server
- `CompositeMCPRegistry`: Evaluate tools from multiple MCP servers simultaneously
- `load_from_stdio()` and `load_from_http()` to fetch tools from running servers
- Automatic tool namespacing (`server_tool_name`)

Usage
Automatic Loading:
Single MCP Server:
Multiple MCP Servers:
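A rough sketch of how single- and multi-server usage could look, using in-memory stand-ins instead of live servers (the real `load_from_stdio()`/`load_from_http()` loaders talk to running MCP servers); class shapes are assumptions based on the names in this PR, not the merged API:

```python
# Hypothetical sketch of single- and multi-server registries. Tools from a
# composite registry are namespaced as "<server>_<tool>" and duplicate
# qualified names are rejected (collision detection).
class MCPToolRegistry:
    def __init__(self, server_name: str, tools: dict):
        self.server_name = server_name
        self.tools = tools  # tool name -> JSON Schema

class CompositeMCPRegistry:
    """Combines several servers, namespacing tools as '<server>_<tool>'."""

    def __init__(self, registries: list):
        self.tools = {}
        for reg in registries:
            for name, schema in reg.tools.items():
                qualified = f"{reg.server_name}_{name}"
                if qualified in self.tools:  # collision detection
                    raise ValueError(f"duplicate tool name: {qualified}")
                self.tools[qualified] = schema

# Single MCP server
github = MCPToolRegistry("github", {"search": {"type": "object"}})

# Multiple MCP servers with automatic namespacing
slack = MCPToolRegistry("slack", {"post": {"type": "object"}})
composite = CompositeMCPRegistry([github, slack])
print(sorted(composite.tools))
```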
Implementation
Files Changed
- `libs/arcade-evals/arcade_evals/registry.py` (NEW): Registry abstractions and implementations
- `libs/arcade-evals/arcade_evals/loaders.py` (NEW): Automatic tool loading from MCP servers
- `libs/arcade-evals/arcade_evals/eval.py` (MODIFIED): Enhanced `ExpectedToolCall` and evaluation logic
- `libs/arcade-evals/arcade_evals/__init__.py` (MODIFIED): Exported new registries and loaders

Key Technical Details
- `BaseToolRegistry` interface for abstraction
- `MCPToolRegistry` handles single-server tools
- `CompositeMCPRegistry` manages multiple servers with collision detection
- `load_from_stdio()` and `load_from_http()` for automatic tool discovery

Testing
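A minimal sketch of the registry interface described above, with an assumed method set (the real interface may differ):

```python
# Sketch of a BaseToolRegistry-style abstraction. The class name comes from
# this PR; the method names are assumptions chosen for illustration.
from abc import ABC, abstractmethod

class BaseToolRegistry(ABC):
    """Common interface so evaluation code can treat Python-based and
    MCP-based tool sources uniformly."""

    @abstractmethod
    def get_tool_names(self) -> list:
        ...

    @abstractmethod
    def get_tool_schema(self, name: str) -> dict:
        ...

class InMemoryRegistry(BaseToolRegistry):
    # Trivial concrete registry used to show the interface in action.
    def __init__(self, tools: dict):
        self._tools = dict(tools)

    def get_tool_names(self) -> list:
        return sorted(self._tools)

    def get_tool_schema(self, name: str) -> dict:
        return self._tools[name]

reg = InMemoryRegistry({"search": {"type": "object"}})
print(reg.get_tool_names())
```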
- `test_eval_mcp_registry.py`: `MCPToolRegistry` functionality
- `test_eval_composite_mcp.py`: `CompositeMCPRegistry` with multiple servers

Backward Compatibility
✅ 100% backward compatible - No breaking changes
Breaking Changes
None
Note
Adds end-to-end eval UX: examples, a robust CLI runner, and rich outputs.
- `eval_arcade_gateway.py`, `eval_stdio_mcp_server.py`, `eval_http_mcp_server.py`, and `eval_comprehensive_comparison.py` with timeouts, error handling, and track-based comparisons; detailed `README.md`
- `arcade_cli/evals_runner.py` to execute evals/capture in parallel with progress, error isolation, failed-only filtering, context inclusion, and multi-provider/model support
- `arcade_cli/formatters/` (txt, md, html, json) for evals and capture; comparative and multi-model HTML with tabs and context rendering
- `display.py` now supports writing multiple formats, failed-only disclaimers, include-context, and improved console summaries

Written by Cursor Bugbot for commit ff8acf9. This will update automatically on new commits.
eval_arcade_gateway.py,eval_stdio_mcp_server.py,eval_http_mcp_server.py,eval_comprehensive_comparison.pywith timeouts, error handling, and track-based comparisons; detailedREADME.mdarcade_cli/evals_runner.pyto execute evals/capture in parallel with progress, error isolation, failed-only filtering, context inclusion, and multi-provider/model supportarcade_cli/formatters/(txt, md, html, json) for evals and capture; comparative and multi-model HTML with tabs and context renderingdisplay.pynow supports writing multiple formats, failed-only disclaimers, include-context, and improved console summariesWritten by Cursor Bugbot for commit ff8acf9. This will update automatically on new commits. Configure here.