feat: add diagnose_residuals_tool for agentic residual diagnostics #402

Open
DEEP-600 wants to merge 1 commit into sktime:main from DEEP-600:feat/diagnose-residuals

@DEEP-600

Implements #400.

What and why

Reading through the existing tool surface and #386, I noticed the
agentic loop has no feedback mechanism after a poor evaluation score.
The agent gets MAPE = 25% and has nothing to work with beyond trying
another model.

Human forecasters look at residuals at this point — ACF for missed
seasonality, normality checks, bias direction. Agents can't look at
plots, so this tool runs those same checks and returns structured text
they can reason over.

What's in this PR

  • src/sktime_mcp/tools/diagnose.py — the new tool
  • __init__.py and server.py updated to expose it
  • tests/test_diagnose.py — three test cases

How it works

The tool takes an estimator_handle and a dataset, reloads the data the
same way evaluate.py does, pulls the fitted instance from
_handle_manager, and runs three checks on the residuals:

  • Ljung-Box (statsmodels): catches residual autocorrelation, which
    often points to missed seasonality
  • Shapiro-Wilk (scipy): catches non-normality
  • Mean bias: catches systematic under/over-forecasting

No new dependencies — statsmodels and scipy are already in
pyproject.toml.

Example output

{
  "success": true,
  "diagnostics": {
    "bias": {"mean_error": -4.2, "status": "consistently over-forecasting"},
    "autocorrelation": {"ljung_box_passed": false, "significant_lags": [12, 24]},
    "normality": {"shapiro_passed": false, "p_value": 0.01}
  },
  "llm_hint": "Residuals show significant autocorrelation at lags [12, 24]. This may indicate missed annual seasonality. Consider switching to SARIMA or adding a Deseasonalizer pipeline."
}
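
To illustrate how a client or agent might act on this structure without parsing the free-text hint, here is a small sketch. The function suggest_next_steps and its mapping from flags to actions are hypothetical and not part of this PR; only the field names come from the example output above.

```python
import json


def suggest_next_steps(result):
    """Hypothetical policy mapping diagnostic flags to follow-up actions."""
    if not result.get("success"):
        return ["diagnostics failed: inspect the error before retrying"]

    diag = result.get("diagnostics", {})
    steps = []

    # Failed Ljung-Box: structure is left in the residuals.
    auto = diag.get("autocorrelation", {})
    if auto.get("ljung_box_passed") is False:
        lags = auto.get("significant_lags", [])
        steps.append(
            f"autocorrelation at lags {lags}: try a seasonal model "
            "or a Deseasonalizer step"
        )

    # Bias status is relayed as reported by the tool.
    status = diag.get("bias", {}).get("status", "")
    if "over-forecasting" in status or "under-forecasting" in status:
        steps.append(f"bias ({status}): revisit the trend or level component")

    # Failed Shapiro-Wilk: Gaussian error assumptions are doubtful.
    if diag.get("normality", {}).get("shapiro_passed") is False:
        steps.append(
            "non-normal residuals: treat Gaussian prediction intervals with caution"
        )

    return steps


payload = """
{
  "success": true,
  "diagnostics": {
    "bias": {"mean_error": -4.2, "status": "consistently over-forecasting"},
    "autocorrelation": {"ljung_box_passed": false, "significant_lags": [12, 24]},
    "normality": {"shapiro_passed": false, "p_value": 0.01}
  }
}
"""
steps = suggest_next_steps(json.loads(payload))
for step in steps:
    print(step)
```

Keeping the boolean flags alongside llm_hint means a caller can branch on them deterministically and use the hint only as supporting context.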

A couple of things I'd like feedback on

  • I used predict_residuals(y) as primary with a manual fallback for
    complex pipelines — is that the right call or overkill?
  • Kept heteroskedasticity out of scope for now, happy to add in a
    follow-up if that's useful.
