
[ENH] Add agentic workflow benchmark suite for sktime-mcp #396

Open
Rohitstwt wants to merge 1 commit into sktime:main from Rohitstwt:benchmark/agentic-workflow-benchmark

Conversation


@Rohitstwt Rohitstwt commented Apr 30, 2026

ESoC 2026 - Agentic Benchmark Suite

This PR introduces the first benchmarking framework for sktime-mcp.
The vision is for this to become the backbone benchmarking layer for
the entire repo, with every agentic workflow measurable through one suite.

What's working right now

8/8 tests passing locally:

  • test_load_all_tasks
  • test_task_fields
  • test_task_from_dict
  • test_run_single_task
  • test_run_invalid_estimator
  • test_pipeline_validation
  • test_score_result
  • test_summary

Quick example

from sktime_mcp.benchmark.runner import BenchmarkRunner
from sktime_mcp.benchmark.scorer import BenchmarkScorer
from sktime_mcp.benchmark.tasks import load_all_tasks

runner = BenchmarkRunner()
scorer = BenchmarkScorer()

# Run every task once, then score and summarise each task's result set.
all_results = runner.run_all()
for task in load_all_tasks():
    scores = scorer.score_results(all_results[task.name], task)
    print(scorer.summary(scores))

3-month ESoC roadmap

Given 3 months, I want to build this into:

  • 20+ tasks covering forecasting, classification, detection
  • CI integration — benchmark runs on every PR automatically
  • LLM agent comparison — benchmark GPT-4o vs Claude vs Gemini
  • Estimator leaderboard per task
  • Pipeline complexity vs accuracy tradeoff analysis

Reference Issues/PRs

Part of ESoC 2026 application - sktime agentic track
Related to my previous PRs:

What does this implement?

A 3-dimensional benchmark scoring system:

  • Pipeline validity — uses CompositionValidator to check compositions (30%)
  • Task match — did the agent pick the right estimator type? (20%)
  • Performance — forecast accuracy via cross-validation MAPE (50%)
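
To make the weighting concrete, here is a minimal sketch of how the three dimensions could combine into one score. The function name and the assumption that each dimension is normalised to [0, 1] are mine; the actual logic lives in scorer.py and may differ.

```python
# Illustrative sketch of the 3-dimensional weighted score described above.
# Weights follow the PR description: validity 30%, task match 20%,
# performance 50%. This is an assumption-labelled example, not scorer.py.

def combined_score(pipeline_validity: float, task_match: float, performance: float) -> float:
    """Combine the three dimensions (each assumed to be in [0, 1])."""
    return 0.30 * pipeline_validity + 0.20 * task_match + 0.50 * performance

# A perfect run scores 1.0; a valid pipeline with the wrong estimator
# type and mediocre accuracy scores much lower.
print(combined_score(1.0, 1.0, 1.0))  # 1.0
print(combined_score(1.0, 0.0, 0.4))  # 0.5
```

The 50% weight on performance means accuracy dominates, but an invalid composition still caps the achievable score well below 1.0.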

Contributors can add new tasks by writing a single yaml file. No code needed.
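
As a sketch of what a yaml task definition might map to once loaded: the field names and the example task below are hypothetical placeholders, not the actual sktime-mcp task schema.

```python
# Hypothetical sketch of a loaded task definition. All field names here
# are assumptions for illustration; see tasks/ for the real format.
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    task_type: str           # e.g. "forecasting", "classification", "detection"
    dataset: str             # dataset identifier the runner should load
    expected_estimator: str  # estimator family the agent is expected to pick
    metric: str = "mape"     # metric used for the performance dimension

    @classmethod
    def from_dict(cls, d: dict) -> "Task":
        # A yaml file parsed with pyyaml yields exactly this kind of dict.
        return cls(**d)


task = Task.from_dict({
    "name": "airline_forecast",
    "task_type": "forecasting",
    "dataset": "airline",
    "expected_estimator": "forecaster",
})
print(task.metric)  # "mape"
```

Keeping the schema flat like this is what makes "one yaml file, no code" work: pyyaml turns the file into a dict, and a single constructor call turns the dict into a task.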

New dependencies?

No. It uses pyyaml, which is already in the project.

What should reviewers focus on?

  • The 3-dimensional scoring approach in scorer.py
  • The yaml-based task definition format in tasks/
  • Whether this architecture fits the long term vision for sktime-mcp
  • Suggestions for additional task types or scoring dimensions

PR Checklist

For all contributions

  • I've added unit tests and made sure they pass locally
  • I've added myself to the list of contributors with any new badges (code, bug, doc, maintenance)
  • The PR title starts with [ENH], [MNT], [DOC], or [BUG]

@Rohitstwt Rohitstwt changed the title ENH: add agentic workflow benchmark suite for sktime-mcp [ENH] Add agentic workflow benchmark suite for sktime-mcp Apr 30, 2026