Goal
Expand from 10 to 50 benchmark tasks for more statistically meaningful results.
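To make the statistical motivation concrete, here is a rough sketch of how much a confidence interval on a pass rate narrows when going from 10 to 50 tasks. It uses the standard Wilson score interval for a binomial proportion. The 70% pass rate is a made-up illustrative number, not a measured result, and `wilson_interval` is a hypothetical helper, not part of the benchmark code.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical 70% pass rate at both suite sizes (assumption for illustration).
for passes, n in ((7, 10), (35, 50)):
    low, high = wilson_interval(passes, n)
    print(f"n={n}: interval=({low:.2f}, {high:.2f}), width={high - low:.2f}")
```

Under that assumed pass rate, the interval at 50 tasks is roughly half as wide as at 10 tasks, so smaller differences between runs become distinguishable from noise.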
Proposed new tasks
Code (5 more)
- Refactor a class hierarchy to use composition over inheritance
- Add type hints to an untyped Python module
- Fix a race condition in async code
- Implement a caching decorator with TTL
- Convert callbacks to async/await
Data (5 more)
- Clean and merge two messy CSVs
- Build a simple ML pipeline (train/eval/predict)
- Visualize time series with anomaly highlights
- Parse and analyze a web server access log
- Generate a PDF report from structured data
Research (3 more)
- Compare 3 database options for a specific use case
- Write a threat model for a web application
- Summarize and critique a technical RFC
Tool Use (4 more)
- Set up a GitHub Actions workflow
- Create and configure a Docker Compose stack
- Automate a Slack notification pipeline
- Build a simple MCP server
Multi-step (3 more)
- Full PR workflow: branch → code → test → PR
- Debug a production incident, from logs to fix
- Migrate a project from JavaScript to TypeScript
How to contribute
See the task authoring guide. Each task is a single YAML file.