EVAL SYS is a living, open-source community to track and advance the agentic capabilities of models. We'll be releasing benchmarks, datasets, toolchains, and models to push the field forward. Initiated by LobeHub and Allison Zhang, EVAL SYS welcomes collaboration with research labs, MCP servers, independent contributors, and more.

Join us, contribute, or reach out!


MCPMark: Stress-Testing Comprehensive MCP Use

An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).

MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.

Pinned repositories

  1. mcpmark (Public)

     MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.

     Python · 159 stars · 10 forks

  2. mcpmark-experiments (Public)

     Collection of evaluation results for MCPMark

     1 fork
