EVAL SYS is a living, open-source community to track and advance the agentic capabilities of models. We'll be releasing benchmarks, datasets, toolchains, and models to push the field forward. Initiated by LobeHub and Allison Zhang, EVAL SYS welcomes collaboration with research labs, MCP servers, independent contributors, and more.

Join us, contribute, or reach out!


MCPMark: Stress-Testing Comprehensive MCP Use

An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).

MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.

Pinned repositories

  1. mcpmark (Public)

     MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.

     Python · 159 stars · 10 forks

  2. mcpmark-experiments (Public)

     Collection of evaluation results for MCPMark

     1 fork
