Adding MCP Server support to Arcade Evals #689
Pull Request Overview
This PR adds MCP (Model Context Protocol) server support to Arcade Evals, enabling evaluation of remote server tools without requiring Python callables. The implementation introduces registry abstractions and two new registry types for single and multiple MCP servers.
Key changes:
- New `MCPToolRegistry` for evaluating tools from a single MCP server
- New `CompositeMCPRegistry` for evaluating tools from multiple MCP servers with automatic namespacing
- Enhanced `ExpectedToolCall` to support both Python callables and MCP tool names
- OpenAI strict mode schema conversion to prevent parameter hallucinations
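The strict-mode conversion mentioned above can be illustrated with a sketch (this shows the general idea, not the code merged in this PR): OpenAI strict mode requires every object schema to set `additionalProperties: false` and to list every property under `required`, which is what prevents the model from hallucinating parameters.

```python
# Sketch of converting a tool's JSON Schema to OpenAI "strict" mode; an
# illustration of the idea, not this PR's implementation. Strict mode
# requires additionalProperties=False and all properties marked required,
# so the model cannot invent extra parameters.
import copy

def to_strict_schema(schema: dict) -> dict:
    schema = copy.deepcopy(schema)  # leave the caller's schema untouched
    _strictify(schema)
    return schema

def _strictify(node: dict) -> None:
    if node.get("type") == "object":
        props = node.get("properties", {})
        node["additionalProperties"] = False
        node["required"] = list(props)  # strict mode: every property required
        for child in props.values():
            _strictify(child)
    elif node.get("type") == "array" and isinstance(node.get("items"), dict):
        _strictify(node["items"])

tool_schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "filters": {
            "type": "object",
            "properties": {"repo": {"type": "string"}},
        },
    },
}
strict = to_strict_schema(tool_schema)
print(strict["additionalProperties"], strict["required"])
```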
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pyproject.toml | Version bump from 1.5.3 to 1.6.0 for new MCP features |
| libs/arcade-evals/arcade_evals/registry.py | New registry abstractions and MCP implementations |
| libs/arcade-evals/arcade_evals/eval.py | Enhanced to support MCP registries alongside Python tools |
| libs/arcade-evals/arcade_evals/__init__.py | Exported new registry classes |
| libs/tests/sdk/test_eval_mcp_registry.py | Comprehensive tests for MCPToolRegistry |
| libs/tests/sdk/test_eval_composite_mcp.py | Comprehensive tests for CompositeMCPRegistry |
| examples/mcp_servers/mcp_eval_example.py | Example usage of MCPToolRegistry |
| examples/mcp_evals_example.py | Example usage of MCP evaluations |
| examples/composite_mcp_evals_example.py | Example usage of CompositeMCPRegistry |
Comments suppressed due to low confidence (2)
libs/arcade-evals/arcade_evals/registry.py:1
- The docstring for `BaseToolRegistry` is incomplete. It starts with "This allows evaluations to work with both Python-based tools (ToolCatalog)" but cuts off without finishing the sentence or properly mentioning MCP-based tools.
"""Base registry interface for tool evaluation."""
libs/arcade-evals/arcade_evals/registry.py:1
- Type narrowing assigns to `MCPToolRegistry` but should be `MCPToolRegistry | CompositeMCPRegistry`, since both types support the same interface. This could cause issues if the catalog is actually a `CompositeMCPRegistry`.
"""Base registry interface for tool evaluation."""
I'm of the opinion that this should be merged in from a product/business POV.
EricGustin
left a comment
Firstly, I really like how this PR keeps ToolCatalog pure (no impact on arcade_mcp_server), and I agree that being able to evaluate non-local and non-Arcade tools will be very beneficial.
I do have several concerns about this code, but I also don't want to completely block or stall the progress of something that will provide us benefit. This PR doesn't seem to break anything, so it could technically be merged, but I'm afraid that in its current form it will end up being dead code. cc @evantahler
@jottakka let's set up a ~30-minute call to chat about this; ideally in that time we can come up with a rough spec for this project.
With that being said, there are 4 main concerns that I have regarding this PR:
- The CLI is the interface through which `arcade_evals` is used, and there is a specified way to define an evaluation suite that the CLI expects. This PR introduces a second way to define an evaluation suite, one the CLI cannot understand. From what I can tell, the additions in this PR won't work with the CLI's `arcade evals` command.
- There is a lot of logic in this PR around resolving tool names into something 'less ambiguous'. Changing anything about the definition of a tool concerns me, because we're then no longer evaluating the same tools that the LLM will use in the real world. I'm advocating for removing this logic/complexity entirely.
- Supporting evaluation of a remote MCP server's tools means making the evaluation framework a full-fledged MCP client. This is a big task and possibly more than we want to take on. This PR sort of does it, but the result feels like an ad-hoc MCP client rather than an existing MCP client framework or a proper client layer of our own, so all of the logic ends up bundled into one function with no clear types or separation of concerns. Before merging, I'm advocating for either (1) adopting an existing MCP client library, or (2) deciding whether this use case is sufficient to justify creating our own MCP client framework.
- Architecturally, this PR creates an entirely parallel code path via the new registry hierarchy (`BaseToolRegistry`, `PythonToolRegistry`, `MCPToolRegistry`, `CompositeToolRegistry`, etc.). I don't think we need this complexity. An alternative is to store MCP tools directly in `EvalSuite`, for example `EvalSuite.add_mcp_tools(...)` and `EvalSuite.load_mcp_tools_http(...)`.
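The alternative sketched in that last point might look something like the following; every name and signature here is hypothetical, extrapolated from the method names in the comment, not the project's API:

```python
# Hypothetical sketch of the suggested alternative: store MCP tool
# definitions directly on EvalSuite instead of adding a registry hierarchy.
# All names and signatures here are illustrative only.
from dataclasses import dataclass, field

@dataclass
class EvalSuite:
    name: str
    mcp_tools: dict = field(default_factory=dict)

    def add_mcp_tools(self, tools: dict) -> None:
        # Tool schemas live directly on the suite; no registry layer needed.
        self.mcp_tools.update(tools)

    def load_mcp_tools_http(self, url: str) -> None:
        # A real implementation would fetch the server's tool list over
        # Streamable HTTP; omitted in this sketch.
        raise NotImplementedError

suite = EvalSuite(name="github-tools")
suite.add_mcp_tools({"create_issue": {"type": "object", "properties": {}}})
print(sorted(suite.mcp_tools))
```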
EricGustin
left a comment
Mostly some DevEx comments regarding the CLI command.
Also, I'm seeing `Session termination failed: 202` logged whenever I run evals for a gateway. What is this?
EricGustin
left a comment
@jottakka a couple of small comments. Will approve once merge conflicts are resolved and if you decide to address any of the below. DM me when ready.
MCP Server Tool Evaluation Support
Overview
Add support for evaluating tools from remote MCP servers without requiring Python callables. Enables direct evaluation of any MCP-compatible tool server.
What's New
Core Features
- `MCPToolRegistry`: Evaluate tools from a single MCP server
- `CompositeMCPRegistry`: Evaluate tools from multiple MCP servers simultaneously
- `load_from_stdio()` and `load_from_http()` to fetch tools from running servers
- Automatic tool namespacing (`server_tool_name`)

Usage
Automatic Loading:
Single MCP Server:
Multiple MCP Servers:
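A rough sketch of how single- and multi-server usage could look, using in-memory stand-ins instead of live servers (the real `load_from_stdio()`/`load_from_http()` loaders talk to running MCP servers); class shapes are assumptions based on the names in this PR, not the merged API:

```python
# Hypothetical sketch of single- and multi-server registries. Tools from a
# composite registry are namespaced as "<server>_<tool>" and duplicate
# qualified names are rejected (collision detection).
class MCPToolRegistry:
    def __init__(self, server_name: str, tools: dict):
        self.server_name = server_name
        self.tools = tools  # tool name -> JSON Schema

class CompositeMCPRegistry:
    """Combines several servers, namespacing tools as '<server>_<tool>'."""

    def __init__(self, registries: list):
        self.tools = {}
        for reg in registries:
            for name, schema in reg.tools.items():
                qualified = f"{reg.server_name}_{name}"
                if qualified in self.tools:  # collision detection
                    raise ValueError(f"duplicate tool name: {qualified}")
                self.tools[qualified] = schema

# Single MCP server
github = MCPToolRegistry("github", {"search": {"type": "object"}})

# Multiple MCP servers with automatic namespacing
slack = MCPToolRegistry("slack", {"post": {"type": "object"}})
composite = CompositeMCPRegistry([github, slack])
print(sorted(composite.tools))
```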
Implementation
Files Changed
- `libs/arcade-evals/arcade_evals/registry.py` (NEW): Registry abstractions and implementations
- `libs/arcade-evals/arcade_evals/loaders.py` (NEW): Automatic tool loading from MCP servers
- `libs/arcade-evals/arcade_evals/eval.py` (MODIFIED): Enhanced `ExpectedToolCall` and evaluation logic
- `libs/arcade-evals/arcade_evals/__init__.py` (MODIFIED): Exported new registries and loaders

Key Technical Details
- `BaseToolRegistry` interface for abstraction
- `MCPToolRegistry` handles single-server tools
- `CompositeMCPRegistry` manages multiple servers with collision detection
- `load_from_stdio()` and `load_from_http()` for automatic tool discovery

Testing
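A minimal sketch of the registry interface described above, with an assumed method set (the real interface may differ):

```python
# Sketch of a BaseToolRegistry-style abstraction. The class name comes from
# this PR; the method names are assumptions chosen for illustration.
from abc import ABC, abstractmethod

class BaseToolRegistry(ABC):
    """Common interface so evaluation code can treat Python-based and
    MCP-based tool sources uniformly."""

    @abstractmethod
    def get_tool_names(self) -> list:
        ...

    @abstractmethod
    def get_tool_schema(self, name: str) -> dict:
        ...

class InMemoryRegistry(BaseToolRegistry):
    # Trivial concrete registry used to show the interface in action.
    def __init__(self, tools: dict):
        self._tools = dict(tools)

    def get_tool_names(self) -> list:
        return sorted(self._tools)

    def get_tool_schema(self, name: str) -> dict:
        return self._tools[name]

reg = InMemoryRegistry({"search": {"type": "object"}})
print(reg.get_tool_names())
```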
- `test_eval_mcp_registry.py`: `MCPToolRegistry` functionality
- `test_eval_composite_mcp.py`: `CompositeMCPRegistry` with multiple servers

Backward Compatibility
✅ 100% backward compatible - No breaking changes
Breaking Changes
None
Note
Adds end-to-end eval UX: examples, a robust CLI runner, and rich outputs.
- `eval_arcade_gateway.py`, `eval_stdio_mcp_server.py`, `eval_http_mcp_server.py`, and `eval_comprehensive_comparison.py` with timeouts, error handling, and track-based comparisons; detailed `README.md`
- `arcade_cli/evals_runner.py` to execute evals/capture in parallel with progress, error isolation, failed-only filtering, context inclusion, and multi-provider/model support
- `arcade_cli/formatters/` (txt, md, html, json) for evals and capture; comparative and multi-model HTML with tabs and context rendering
- `display.py` now supports writing multiple formats, failed-only disclaimers, include-context, and improved console summaries

Written by Cursor Bugbot for commit ff8acf9. This will update automatically on new commits.
eval_arcade_gateway.py,eval_stdio_mcp_server.py,eval_http_mcp_server.py,eval_comprehensive_comparison.pywith timeouts, error handling, and track-based comparisons; detailedREADME.mdarcade_cli/evals_runner.pyto execute evals/capture in parallel with progress, error isolation, failed-only filtering, context inclusion, and multi-provider/model supportarcade_cli/formatters/(txt, md, html, json) for evals and capture; comparative and multi-model HTML with tabs and context renderingdisplay.pynow supports writing multiple formats, failed-only disclaimers, include-context, and improved console summariesWritten by Cursor Bugbot for commit ff8acf9. This will update automatically on new commits. Configure here.