Adding MCP Servers supports to Arcade Evals #689
Merged
56 commits
fb35d7b
Add --failed-only and --output flags to evals command
c11a908
Add tests for display_eval_results with --failed-only and --output flags
19199e7
Add additional test cases for better coverage of display_eval_results
22d0ecb
Extract filtering logic to testable function and add tests
9bbb691
Merge branch 'main' into francisco/arcade-cli/updating-evals-to-show-…
jottakka c293f18
Merge branch 'main' into francisco/arcade-cli/updating-evals-to-show-…
d593a4a
Adding MCP Servers supports to Arcade Evals
eda0260
Updating loading from mcp server
aa1fff9
Merge branch 'main' into francisco/updating-arcade-evails
2d889f6
Updating to accept headers
de2df04
fixing linting
8c5a096
Merge branch 'main' into francisco/arcade-cli/updating-evals-to-show-…
049f965
added session support for the http loader
torresmateo 3998b39
removed debug print
torresmateo 56399e2
handled unsupported protocol for http tool loader
torresmateo 477ccbe
Open API issue
5f28a55
Updating strict mode
bf7678e
Updating strict mode
baad441
Merge branch 'main' into francisco/arcade-cli/updating-evals-to-show-…
d28b572
Merge branch 'main' into francisco/updating-arcade-evails
627bcee
Merge branch 'francisco/arcade-cli/updating-evals-to-show-only-failed…
4a8c6a5
Merge branch 'main' into francisco/updating-arcade-evails
dd25223
Merge branch 'main' into francisco/updating-arcade-evails
823e39d
Merge branch 'main' into francisco/updating-arcade-evails
6b0a725
updating eval suit to contain all tool sources
ccfae39
Adding anthropic support
6318e4a
Adding fuzzy weights
0ccf790
fix cursor reported bug
73dc94c
Delete libs/arcade-evals/arcade_evals/_experimental/__init__.py
jottakka 39b7f91
Adding capture mode and smashing some bugs after reviews
jottakka 8b6e17d
fixing output formating when capture mode
jottakka bbbdbf8
added options to export result to md, txt and html
jottakka 702e2eb
fixing bugs
jottakka dd1e335
fixes after cursor bot review
jottakka e4beb77
some updates
jottakka 81006fc
Adding compare mode
jottakka de0f8e6
Updating evals for multiple models/providers/tracks
jottakka af916e9
removing self implemented loader and adding flag to override arcade url
jottakka f38ebbb
Add locks for loading tools from mcp servers only once and avoid conc…
jottakka aaca430
Add locks for loading tools from mcp servers only once and avoid conc…
jottakka 27ae785
Add locks for loading tools from mcp servers only once and avoid conc…
jottakka 1f3cb55
Fixing html template
jottakka b5e04aa
Add locks for loading tools from mcp servers only once and avoid conc…
jottakka ead5b13
Add locks for loading tools from mcp servers only once and avoid conc…
jottakka a234574
Add locks for loading tools from mcp servers only once and avoid conc…
jottakka 22d9943
Add locks for loading tools from mcp servers only once and avoid conc…
jottakka b26135c
Fix CLI help tests: strip ANSI codes before assertions
jottakka b460ac2
adressing some changes after code review
jottakka e0acb78
updating after erics review
jottakka 524c77d
adding examples
jottakka 5c074d3
fixing ci failing
jottakka 716787d
minor changes
jottakka 435e191
minor fix
jottakka 8c0d677
updates after erics review
jottakka b9847ad
Merge branch 'main' into francisco/updating-arcade-evails
jottakka ff8acf9
fixing some linting
jottakka
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,236 @@ | ||
| """ | ||
|
||
| Example: Evaluating Tools from Multiple MCP Servers | ||
|
|
||
| This example demonstrates how to use CompositeMCPRegistry to evaluate tools | ||
| from multiple MCP servers in a single evaluation suite. | ||
| """ | ||
|
|
||
| from arcade_evals import ( | ||
| BinaryCritic, | ||
| CompositeMCPRegistry, | ||
| EvalSuite, | ||
| ExpectedToolCall, | ||
| ) | ||
|
|
||
| # Step 1: Define tool descriptors from multiple MCP servers | ||
| # In practice, these would come from different MCP server tools/list responses | ||
|
|
||
| calculator_tools = [ | ||
| { | ||
| "name": "add", | ||
| "description": "Add two numbers together", | ||
| "inputSchema": { | ||
| "type": "object", | ||
| "properties": { | ||
| "a": {"type": "number", "description": "First number"}, | ||
| "b": {"type": "number", "description": "Second number", "default": 0}, | ||
| }, | ||
| "required": ["a"], | ||
| }, | ||
| }, | ||
| { | ||
| "name": "multiply", | ||
| "description": "Multiply two numbers together", | ||
| "inputSchema": { | ||
| "type": "object", | ||
| "properties": { | ||
| "a": {"type": "number"}, | ||
| "b": {"type": "number"}, | ||
| }, | ||
| "required": ["a", "b"], | ||
| }, | ||
| }, | ||
| ] | ||
|
|
||
| string_tools = [ | ||
| { | ||
| "name": "uppercase", | ||
| "description": "Convert string to uppercase", | ||
| "inputSchema": { | ||
| "type": "object", | ||
| "properties": { | ||
| "text": {"type": "string", "description": "Text to convert"}, | ||
| }, | ||
| "required": ["text"], | ||
| }, | ||
| }, | ||
| { | ||
| "name": "reverse", | ||
| "description": "Reverse a string", | ||
| "inputSchema": { | ||
| "type": "object", | ||
| "properties": { | ||
| "text": {"type": "string", "description": "Text to reverse"}, | ||
| }, | ||
| "required": ["text"], | ||
| }, | ||
| }, | ||
| ] | ||
|
|
||
| datetime_tools = [ | ||
| { | ||
| "name": "format_date", | ||
| "description": "Format a date string", | ||
| "inputSchema": { | ||
| "type": "object", | ||
| "properties": { | ||
| "date": {"type": "string"}, | ||
| "format": {"type": "string", "default": "%Y-%m-%d"}, | ||
| }, | ||
| "required": ["date"], | ||
| }, | ||
| }, | ||
| ] | ||
|
|
||
| # Step 2: Create a composite registry with tools from multiple servers | ||
| # Method 1: Pass tool lists directly | ||
| composite = CompositeMCPRegistry( | ||
| tool_lists={ | ||
| "calculator": calculator_tools, | ||
| "strings": string_tools, | ||
| "datetime": datetime_tools, | ||
| } | ||
| ) | ||
|
|
||
| print("🎯 Composite MCP Registry Created!") | ||
| print(f"Servers: {', '.join(composite.get_server_names())}") | ||
| print() | ||
|
|
||
| # Step 3: Show how tools are namespaced | ||
| print("📋 All Tools (with namespacing):") | ||
| tools = composite.list_tools_for_model(tool_format="openai") | ||
| for tool in tools: | ||
| name = tool["function"]["name"] | ||
| desc = tool["function"]["description"] | ||
| print(f" - {name}: {desc}") | ||
| print() | ||
|
|
||
| # Step 4: Create an evaluation suite using the composite registry | ||
| suite = EvalSuite( | ||
| name="Multi-Server Evaluation Suite", | ||
| system_message="You are a helpful assistant with access to calculator, string, and datetime tools.", | ||
| catalog=composite, | ||
| ) | ||
|
|
||
| # Step 5: Add test cases using tools from different servers | ||
|
|
||
| # Test 1: Calculator server - using fully namespaced name | ||
| suite.add_case( | ||
| name="Addition with namespace", | ||
| user_message="What is 15 plus 7?", | ||
| expected_tool_calls=[ | ||
| ExpectedToolCall( | ||
| tool_name="calculator.add", # Fully namespaced | ||
| args={"a": 15, "b": 7}, | ||
| ) | ||
| ], | ||
| critics=[ | ||
| BinaryCritic(critic_field="a", weight=0.5), | ||
| BinaryCritic(critic_field="b", weight=0.5), | ||
| ], | ||
| ) | ||
|
|
||
| # Test 2: String server - using short unique name | ||
| suite.add_case( | ||
| name="String uppercase", | ||
| user_message="Convert 'hello world' to uppercase", | ||
| expected_tool_calls=[ | ||
| ExpectedToolCall( | ||
| tool_name="uppercase", # Short name (unique across all servers) | ||
| args={"text": "hello world"}, | ||
| ) | ||
| ], | ||
| critics=[ | ||
| BinaryCritic(critic_field="text", weight=1.0), | ||
| ], | ||
| ) | ||
|
|
||
| # Test 3: Multiple tool calls from different servers | ||
| suite.add_case( | ||
| name="Mixed server operations", | ||
| user_message="Calculate 10 times 5, then reverse the result", | ||
| expected_tool_calls=[ | ||
| ExpectedToolCall( | ||
| tool_name="calculator.multiply", | ||
| args={"a": 10, "b": 5}, | ||
| ), | ||
| ExpectedToolCall( | ||
| tool_name="strings.reverse", | ||
| args={"text": "50"}, | ||
| ), | ||
| ], | ||
| critics=[ | ||
| BinaryCritic(critic_field="a", weight=0.25), | ||
| BinaryCritic(critic_field="b", weight=0.25), | ||
| BinaryCritic(critic_field="text", weight=0.5), | ||
| ], | ||
| ) | ||
|
|
||
| # Test 4: Using defaults from schema | ||
| suite.add_case( | ||
| name="Date formatting with default", | ||
| user_message="Format the date 2025-11-18", | ||
| expected_tool_calls=[ | ||
| ExpectedToolCall( | ||
| tool_name="datetime.format_date", | ||
| args={"date": "2025-11-18"}, # 'format' will use default | ||
| ) | ||
| ], | ||
| critics=[ | ||
| BinaryCritic(critic_field="date", weight=1.0), | ||
| ], | ||
| ) | ||
|
|
||
| # Step 6: Display configured cases | ||
| print("✅ Evaluation Suite Configured!") | ||
| print(f"Suite: {suite.name}") | ||
| print(f"Total cases: {len(suite.cases)}\n") | ||
|
|
||
| print("Configured test cases:") | ||
| for i, case in enumerate(suite.cases, 1): | ||
| print(f"\n{i}. {case.name}") | ||
| print(f" Expected {len(case.expected_tool_calls)} tool call(s):") | ||
| for tc in case.expected_tool_calls: | ||
| print(f" - {tc.name}({tc.args})") | ||
|
|
||
| # Step 7: Demonstrate name collision handling | ||
| print("\n\n🔍 Name Collision Example:") | ||
| print("=" * 60) | ||
|
|
||
| # Create two servers with the same tool name | ||
| tools_a = [ | ||
| { | ||
| "name": "process", | ||
| "description": "Process A", | ||
| "inputSchema": {"type": "object", "properties": {}}, | ||
| } | ||
| ] | ||
| tools_b = [ | ||
| { | ||
| "name": "process", | ||
| "description": "Process B", | ||
| "inputSchema": {"type": "object", "properties": {}}, | ||
| } | ||
| ] | ||
|
|
||
| collision_composite = CompositeMCPRegistry(tool_lists={"server_a": tools_a, "server_b": tools_b}) | ||
|
|
||
| # Short name is ambiguous | ||
| try: | ||
| collision_composite.resolve_tool_name("process") | ||
| except ValueError as e: | ||
| print(f"❌ Short name fails: {e}") | ||
|
|
||
| # But namespaced names work fine | ||
| print(f"✅ Namespaced works: {collision_composite.resolve_tool_name('server_a.process')}") | ||
| print(f"✅ Namespaced works: {collision_composite.resolve_tool_name('server_b.process')}") | ||
|
|
||
| print("\n\n💡 Key Features:") | ||
| print(" • Combine tools from multiple MCP servers") | ||
| print(" • Automatic namespacing prevents collisions (server.tool)") | ||
| print(" • Short names work when unique across all servers") | ||
| print(" • Each server's tools maintain their own schemas and defaults") | ||
| print(" • All existing Python tool evaluations still work unchanged") | ||
|
|
||
| print("\n💡 To run actual evaluations, use:") | ||
| print(" results = suite.run(provider_api_key='your-api-key', model='gpt-4')") | ||
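The name-collision behaviour shown in Step 7 can be illustrated with a small standalone sketch. This is not the `CompositeMCPRegistry` implementation; the helpers `build_name_index` and `resolve_tool_name` below are hypothetical and only assume the rules stated in the example: tools are namespaced as `server.tool`, short names resolve when they are unique across servers, and an ambiguous short name raises `ValueError`.

```python
from typing import Dict, List


def build_name_index(tool_lists: Dict[str, List[dict]]) -> Dict[str, List[str]]:
    """Map each short tool name to every namespaced name that provides it."""
    index: Dict[str, List[str]] = {}
    for server, tools in tool_lists.items():
        for tool in tools:
            index.setdefault(tool["name"], []).append(f"{server}.{tool['name']}")
    return index


def resolve_tool_name(name: str, tool_lists: Dict[str, List[dict]]) -> str:
    """Resolve a short or namespaced name to exactly one 'server.tool' name."""
    index = build_name_index(tool_lists)
    namespaced = {full for fulls in index.values() for full in fulls}
    if name in namespaced:
        # Already fully qualified: nothing to disambiguate.
        return name
    matches = index.get(name, [])
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"Unknown tool: {name!r}")
    options = ", ".join(sorted(matches))
    raise ValueError(f"Ambiguous tool name {name!r}; use one of: {options}")


# Hypothetical data mirroring the collision example above.
tool_lists = {
    "server_a": [{"name": "process"}, {"name": "uppercase"}],
    "server_b": [{"name": "process"}],
}

print(resolve_tool_name("uppercase", tool_lists))
print(resolve_tool_name("server_b.process", tool_lists))
try:
    resolve_tool_name("process", tool_lists)
except ValueError as err:
    print(f"error: {err}")
```

Building the index once per lookup keeps the sketch short; a real registry would presumably build it at construction time.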
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,129 @@ | ||
| """ | ||
| Example: Evaluating MCP Server Tools with arcade-evals | ||
|
|
||
| This example demonstrates how to evaluate tools from an MCP server | ||
| without requiring Python callables. | ||
| """ | ||
|
|
||
| from arcade_evals import BinaryCritic, EvalSuite, ExpectedToolCall, MCPToolRegistry | ||
|
|
||
| # Step 1: Define MCP tool descriptors | ||
| # These would typically come from an MCP server's tools/list response | ||
| mcp_tools = [ | ||
| { | ||
| "name": "calculator_add", | ||
| "description": "Add two numbers together", | ||
| "inputSchema": { | ||
| "type": "object", | ||
| "properties": { | ||
| "a": { | ||
| "type": "number", | ||
| "description": "First number", | ||
| }, | ||
| "b": { | ||
| "type": "number", | ||
| "description": "Second number", | ||
| "default": 0, # Optional: MCP tools can specify defaults | ||
| }, | ||
| }, | ||
| "required": ["a"], | ||
| }, | ||
| }, | ||
| { | ||
| "name": "calculator_multiply", | ||
| "description": "Multiply two numbers together", | ||
| "inputSchema": { | ||
| "type": "object", | ||
| "properties": { | ||
| "a": {"type": "number", "description": "First number"}, | ||
| "b": {"type": "number", "description": "Second number"}, | ||
| }, | ||
| "required": ["a", "b"], | ||
| }, | ||
| }, | ||
| ] | ||
|
|
||
| # Step 2: Create an MCP tool registry | ||
| registry = MCPToolRegistry(mcp_tools) | ||
|
|
||
| # Step 3: Create an evaluation suite using the MCP registry | ||
| suite = EvalSuite( | ||
| name="Calculator MCP Evaluation", | ||
| system_message="You are a helpful calculator assistant. Use the available tools to perform calculations.", | ||
| catalog=registry, # Use MCP registry instead of ToolCatalog | ||
| ) | ||
|
|
||
| # Step 4: Add test cases using tool names (not Python functions!) | ||
| suite.add_case( | ||
| name="Simple addition", | ||
| user_message="What is 5 plus 3?", | ||
| expected_tool_calls=[ | ||
| ExpectedToolCall( | ||
| tool_name="calculator_add", # String name, not a callable | ||
| args={"a": 5, "b": 3}, | ||
| ) | ||
| ], | ||
| critics=[ | ||
| BinaryCritic(critic_field="a", weight=0.5), | ||
| BinaryCritic(critic_field="b", weight=0.5), | ||
| ], | ||
| ) | ||
|
|
||
| suite.add_case( | ||
| name="Addition with implicit default", | ||
| user_message="Add 10", | ||
| expected_tool_calls=[ | ||
| ExpectedToolCall( | ||
| tool_name="calculator_add", | ||
| args={"a": 10}, # 'b' will use default value of 0 | ||
| ) | ||
| ], | ||
| critics=[ | ||
| BinaryCritic(critic_field="a", weight=1.0), | ||
| ], | ||
| ) | ||
|
|
||
| suite.add_case( | ||
| name="Multiplication", | ||
| user_message="What is 7 times 6?", | ||
| expected_tool_calls=[ | ||
| ExpectedToolCall( | ||
| tool_name="calculator_multiply", | ||
| args={"a": 7, "b": 6}, | ||
| ) | ||
| ], | ||
| critics=[ | ||
| BinaryCritic(critic_field="a", weight=0.5), | ||
| BinaryCritic(critic_field="b", weight=0.5), | ||
| ], | ||
| ) | ||
|
|
||
| # Step 5: Demo the configuration | ||
| if __name__ == "__main__": | ||
| print("Running MCP tool evaluations...") | ||
| print(f"Suite: {suite.name}") | ||
| print(f"Cases: {len(suite.cases)}") | ||
| print() | ||
|
|
||
| print("✅ MCP evaluation suite configured successfully!") | ||
| print("\nConfigured cases:") | ||
| for i, case in enumerate(suite.cases, 1): | ||
| print(f"{i}. {case.name}") | ||
| print(f" Expected: {len(case.expected_tool_calls)} tool call(s)") | ||
| for tc in case.expected_tool_calls: | ||
| print(f" - {tc.name}({tc.args})") | ||
|
|
||
| print("\n💡 To run actual evaluations, use:") | ||
| print(" results = suite.run(provider_api_key='your-api-key', model='gpt-4')") | ||
|
|
||
| # Demo: Show how MCP tools are converted to OpenAI format | ||
| print("\n📋 MCP tools converted to OpenAI format:") | ||
| tools = registry.list_tools_for_model(tool_format="openai") | ||
| for tool in tools: | ||
| print(f"\n- {tool['function']['name']}") | ||
| print(f" Description: {tool['function']['description']}") | ||
| function_params = tool["function"].get("parameters") | ||
| if function_params and isinstance(function_params, dict): | ||
| params = function_params.get("properties", {}) | ||
| if params: | ||
| print(f" Parameters: {', '.join(params.keys())}") | ||
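For readers who want to see what the conversion and the schema defaults amount to, here is a minimal standalone sketch. It does not use `arcade_evals`; `mcp_to_openai_tool` and `fill_defaults` are hypothetical helpers. The only firm facts relied on are that an MCP tool descriptor carries `name`, `description`, and `inputSchema`, and that OpenAI function tools use the `{"type": "function", "function": {...}}` envelope. Whether the library fills defaults in exactly this way is an assumption.

```python
from typing import Any, Dict


def mcp_to_openai_tool(descriptor: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap one MCP tool descriptor in the OpenAI function-tool envelope."""
    schema = descriptor.get("inputSchema") or {"type": "object", "properties": {}}
    return {
        "type": "function",
        "function": {
            "name": descriptor["name"],
            "description": descriptor.get("description", ""),
            "parameters": schema,
        },
    }


def fill_defaults(args: Dict[str, Any], input_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of args with schema defaults added for missing keys."""
    filled = dict(args)
    for prop, spec in input_schema.get("properties", {}).items():
        if prop not in filled and "default" in spec:
            filled[prop] = spec["default"]
    return filled


# Same shape as the calculator_add descriptor in the example above.
add_tool = {
    "name": "calculator_add",
    "description": "Add two numbers together",
    "inputSchema": {
        "type": "object",
        "properties": {
            "a": {"type": "number", "description": "First number"},
            "b": {"type": "number", "description": "Second number", "default": 0},
        },
        "required": ["a"],
    },
}

openai_tool = mcp_to_openai_tool(add_tool)
print(openai_tool["function"]["name"])
print(sorted(openai_tool["function"]["parameters"]["properties"]))
print(fill_defaults({"a": 10}, add_tool["inputSchema"]))
```

Under this reading, the "Addition with implicit default" case expects `{"a": 10}` and the missing `b` is taken from the schema default of 0.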