
[ENH] Add agentic workflow benchmark suite for sktime-mcp #396

Open
Rohitstwt wants to merge 1 commit into sktime:main from Rohitstwt:benchmark/agentic-workflow-benchmark

Conversation


@Rohitstwt Rohitstwt commented Apr 30, 2026

ESoC 2026 - Agentic Benchmark Suite

This PR introduces the first benchmarking framework for sktime-mcp.
The vision is for this to become the backbone benchmarking layer for
the entire repo, with every agentic workflow measurable through one suite.

What's working right now

8/8 tests passing locally:

  • test_load_all_tasks
  • test_task_fields
  • test_task_from_dict
  • test_run_single_task
  • test_run_invalid_estimator
  • test_pipeline_validation
  • test_score_result
  • test_summary

Quick example

from sktime_mcp.benchmark.runner import BenchmarkRunner
from sktime_mcp.benchmark.scorer import BenchmarkScorer
from sktime_mcp.benchmark.tasks import load_all_tasks

runner = BenchmarkRunner()
scorer = BenchmarkScorer()

# Run every task once, then score and summarise each task's result set.
all_results = runner.run_all()
for task in load_all_tasks():
    scores = scorer.score_results(all_results[task.name], task)
    print(scorer.summary(scores))

3-month ESoC roadmap

Given 3 months, I want to build this into:

  • 20+ tasks covering forecasting, classification, detection
  • CI integration — benchmark runs on every PR automatically
  • LLM agent comparison — benchmark GPT-4o vs Claude vs Gemini
  • Estimator leaderboard per task
  • Pipeline complexity vs accuracy tradeoff analysis

Reference Issues/PRs

Part of ESoC 2026 application - sktime agentic track
Related to my previous PRs:

What does this implement?

A 3-dimensional benchmark scoring system:

  • Pipeline validity — uses CompositionValidator to check compositions (30%)
  • Task match — did the agent pick the right estimator type? (20%)
  • Performance — forecast accuracy via cross-validation MAPE (50%)
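
To make the weighting concrete, here is a minimal sketch of how the three dimensions could combine into one score. The function name and the assumption that each dimension is normalised to [0, 1] are mine; the actual logic lives in scorer.py and may differ.

```python
# Illustrative sketch of the 3-dimensional weighted score described above.
# Weights follow the PR description: validity 30%, task match 20%,
# performance 50%. This is an assumption-labelled example, not scorer.py.

def combined_score(pipeline_validity: float, task_match: float, performance: float) -> float:
    """Combine the three dimensions (each assumed to be in [0, 1])."""
    return 0.30 * pipeline_validity + 0.20 * task_match + 0.50 * performance

# A perfect run scores 1.0; a valid pipeline with the wrong estimator
# type and mediocre accuracy scores much lower.
print(combined_score(1.0, 1.0, 1.0))  # 1.0
print(combined_score(1.0, 0.0, 0.4))  # 0.5
```

The 50% weight on performance means accuracy dominates, but an invalid composition still caps the achievable score well below 1.0.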

Contributors can add new tasks by writing a single yaml file. No code needed.
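
As a sketch of what a yaml task definition might map to once loaded: the field names and the example task below are hypothetical placeholders, not the actual sktime-mcp task schema.

```python
# Hypothetical sketch of a loaded task definition. All field names here
# are assumptions for illustration; see tasks/ for the real format.
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    task_type: str           # e.g. "forecasting", "classification", "detection"
    dataset: str             # dataset identifier the runner should load
    expected_estimator: str  # estimator family the agent is expected to pick
    metric: str = "mape"     # metric used for the performance dimension

    @classmethod
    def from_dict(cls, d: dict) -> "Task":
        # A yaml file parsed with pyyaml yields exactly this kind of dict.
        return cls(**d)


task = Task.from_dict({
    "name": "airline_forecast",
    "task_type": "forecasting",
    "dataset": "airline",
    "expected_estimator": "forecaster",
})
print(task.metric)  # "mape"
```

Keeping the schema flat like this is what makes "one yaml file, no code" work: pyyaml turns the file into a dict, and a single constructor call turns the dict into a task.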

New dependencies?

No. It uses pyyaml, which is already in the project.

What should reviewers focus on?

  • The 3-dimensional scoring approach in scorer.py
  • The yaml-based task definition format in tasks/
  • Whether this architecture fits the long term vision for sktime-mcp
  • Suggestions for additional task types or scoring dimensions

PR Checklist

For all contributions

  • I've added unit tests and made sure they pass locally
  • I've added myself to the list of contributors with any new badges (code, bug, doc, maintenance)
  • The PR title starts with [ENH], [MNT], [DOC], or [BUG]

@Rohitstwt Rohitstwt changed the title ENH: add agentic workflow benchmark suite for sktime-mcp [ENH] Add agentic workflow benchmark suite for sktime-mcp Apr 30, 2026