[STEP] split `sktime` into per component packages #45

yarnabrina · 2025-07-13T07:16:12Z

No description provided.

fkiraly · 2025-07-13T09:40:01Z

steps/25_split_package/step.md

+
+As the sktime codebase continues to expand, the current monolithic structure is
+presenting increasing challenges in terms of maintenance, dependency management, and
+release coordination. This proposal outlines a plan to modularise sktime into


Would more packages not increase the challenges in release *coordination*? It is easier to coordinate (or simply to carry out) a release for a single package than for five.

fkiraly · 2025-07-13T09:42:50Z

steps/25_split_package/step.md

+- **Meta-package (`sktime-all`):** While useful for transition, maintaining a
+  meta-package long-term increases maintenance overhead and can reintroduce dependency
+  conflicts.
+


There is another alternative: improving the sub-packaging using scikit-base. Estimators already have their own dependency sets.

That would leave a single pypi package, but might mitigate a significant part of the issues.
See the registry module, python_dependencies tag, tests:vm tag, and deps or craft.
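
To make the idea of per-estimator dependency sets concrete, here is a minimal stdlib-only sketch. It is not sktime's or scikit-base's actual implementation: the class names, the `missing_dependencies` helper, and the module name `hypothetical_missing_backend_xyz` are all assumptions made for illustration. Only the notion of a `python_dependencies` tag comes from the comment above.

```python
from importlib.util import find_spec


class BaseEstimator:
    """Hypothetical base class; real sktime estimators use a tag system instead."""

    # Assumed tag: top-level import names of the soft dependencies this estimator needs.
    python_dependencies = ()

    @classmethod
    def missing_dependencies(cls):
        # find_spec returns None when a top-level module cannot be located.
        return [name for name in cls.python_dependencies if find_spec(name) is None]

    def __init__(self):
        missing = self.missing_dependencies()
        if missing:
            raise ModuleNotFoundError(
                f"{type(self).__name__} requires {missing}; install them to use it."
            )


class PlainEstimator(BaseEstimator):
    # Depends only on a module that ships with Python, so it always constructs.
    python_dependencies = ("json",)


class ExoticEstimator(BaseEstimator):
    # Depends on a made-up backend, so construction fails with a clear message.
    python_dependencies = ("hypothetical_missing_backend_xyz",)
```

With this shape, importing the package never fails because of an optional backend; only constructing an estimator whose backend is absent does, which is roughly the behaviour a single PyPI package with estimator-level dependency sets would aim for.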

fkiraly · 2025-07-13T09:43:33Z

steps/25_split_package/step.md

+  - ...and others as needed.
+- **Optional:** If pipeline logic grows in complexity, introduce `sktime-pipeline` as a
+  separate package.
+- **Meta-package:** `sktime` (which essentially is `sktime-all`) would be retained only


deprecating sktime itself is a very bad idea. I am against it.

This sounds like a veto, and this is precisely the reason I never made a formal proposal in a GitHub issue, discussion, or otherwise. I gave this split proposal explicitly in Discord (to you individually and in the latest thread: https://discord.com/channels/1075852648688930887/1386186193066131517), and in my opinion everyone (probably excluding yourself) was clear that the single central package will no longer be needed. I don't think there is any point in me responding to any other comments any further.

Anirban, I don't know if I have the full context of your discussions with Franz, but this change is indeed something really important that impacts a lot of users and the current state of the package.

I would not bet on having the first version proposal approved by everyone. It will indeed require convincing and providing evidence that this change is beneficial. And it is also ok to have a first proposal that is not the best one and incorporating others' suggestions to make it better.

I'm still reading and thinking about the overall content, but I thought that this comment was important.

Hi Felipe, I was not expecting immediate approval, and I know that as of now it has little to no specific detail, as Franz requested below. I 100% agree that's important, and I'd have reviewed it the same way. I have no objection to Franz's other review comments, and would be very happy to discuss those.

But discussing that and/or spending any effort to make this proposal mature needs a collaborative discussion, not Franz or someone else commenting that deprecation and eventual removal of the single do-all-together-in-single-repo-or-package is not at all acceptable under any circumstances, especially after this specific suggestion is the core idea of the proposal and was of course mentioned in the Discord thread (and, if I am not too mistaken, in the last 1/2 annual roadmaps, but I am not certain). Otherwise it's a futile effort from my side (or from anyone else who contributes to the proposal in future).

@yarnabrina, I am strictly against removal of sktime as a package, which people can easily install via pip install sktime. As outlined below, this would impact our users negatively.

Compare scikit-learn: there is no scikit-learn-classification, scikit-learn-regression, or scikit-learn-pipeline which people install separately.

But that does not mean I am vetoing other ideas in this proposal, such as better modularity and even more packages to contain the growing set of estimators.

But discussing that and/or spending any effort to make this proposal mature needs a collaborative discussion

@yarnabrina, please do not suggest that I am not engaging in collaborative discussion. I am stating my opinions and am open to be convinced by arguments!

I have outlined why removing sktime as a package would be quite a bad idea, and nothing positive seems to offset it, since increased modularity or outsourcing of estimators to expansion packages can be had without removing sktime as a package.

Finally, my change requests below are constructive, I am asking you explicitly to be more detailed in your suggestions. Where would individual parts of the current package go? How would we manage releases?

especially after this specific suggestion is the core idea of the proposal

This is the first time I understand that your suggestion implies, and apparently has at its core, according to you, the removal of sktime as a package.

I was working under the assumption that you wanted to move sets of estimators into their own expansion packages, or move individual modules out, without removing the single-install UX and architectural cohesion.

This sounds like a veto

I am, at least, against doing something as massive as this, when benefits and drawbacks have not yet been spelled out and discussed, or implementation details - please do that if you are in favour of this change.

The onus of convincing core developers to do this rests on you, it is unfair to blame me or others for being skeptical. Constructive collaboration requires doing the homework to present your intention and arguments in a complete fashion.

fkiraly

Thanks for the effort. Some remarks:

there is no clear plan here for what goes where. I think this needs to be much more concrete so we can even consider implementation.
you also need to comment on the user journey. What would a user import now? What would the common vignettes look like? What would a user have to install, and how would they learn of this information?
I think it is an extremely bad idea to deprecate sktime itself. Very strongly against.
you need to comment on the maintenance and release workflow. What would it look like?
overall, the explanation of the end state lacks concreteness.
you should comment explicitly on how the repository or repositories would be managed.

Finally, who would carry this out? Would you be a part of this effort if some form of it reaches consensus?

fkiraly · 2025-07-13T10:17:50Z

I have also asked a popular chatbot whether they would recommend splitting a package like scikit-learn (which is in a similar situation) into multiple packages, here is the answer:

For a mature, widely-used, and heavily-integrated package like scikit-learn, I would not recommend breaking it up into multiple separately-distributed packages at this time — at least not publicly or for general use. Here's a reasoned breakdown that weighs technical, community, and ecosystem-wide considerations:

🔍 scikit-learn Context

Mature & Stable API used across industry and academia.
Broad Scope: classification, regression, clustering, dimensionality reduction, model selection, preprocessing, etc.
Tight Internal Coupling: Many modules depend on common conventions, shared utilities, and base classes.
Huge User Base: Breaking changes or friction would ripple through tutorials, courses, books, and production systems.

Reasons to Keep it as One Package

User Convenience and Consistency

scikit-learn’s current single-install UX (pip install scikit-learn) is extremely simple and beginner-friendly.

Most users rely on multiple submodules (e.g., sklearn.linear_model, sklearn.preprocessing) and expect them all to be co-installed and compatible.

Internal Cohesion

Algorithms, pipelines, metrics, and preprocessing tools often work together. Splitting into multiple packages could introduce version mismatches or break that cohesion.

Tooling and Ecosystem Integration

Scikit-learn integrates tightly with numpy, pandas, joblib, matplotlib, and others.
Downstream libraries (like imbalanced-learn, auto-sklearn, skorch, etc.) expect a unified API surface and directory structure.

High Maintenance Overhead of Splitting

Maintaining consistent documentation, versioning, testing, and compatibility across multiple modules is non-trivial and would increase CI, maintenance, and community overhead.

Recommended Alternatives to Full Split

Internal Refactoring with Clean Submodules:

Improve code modularity internally without affecting the external API.

Example: keeping sklearn.cluster or sklearn.ensemble in separate directories/repos internally, while still building and distributing a single wheel.

Pluggable Extension System:

Support optional extensions or plugin APIs (e.g., third-party models, transformers).

Could enable innovation outside the core without bloating the package.

Namespace Package Design (in future):

If scikit-learn 2.0 ever happens, consider using namespace packages (sklearn.linear, sklearn.ensemble, etc.) only if strict backward compatibility is not required.

Meta-packages (Not Recommended Now):

While some ecosystems use meta-packages (e.g., scikit-learn-core, scikit-learn-models), this would likely cause more confusion than benefit for this audience.

Risks of Splitting for scikit-learn

Massive Backward Compatibility Break.
Frustration in the community (especially educators and researchers).
Fragmentation of documentation and support.
Increased chances of dependency hell and version drift.

Final Recommendation

Stick with the single-package structure for scikit-learn — it is well-suited for its audience and scope, and the costs of splitting outweigh the benefits.

However, modularize the internals, encourage plugin ecosystems, and future-proof the architecture to allow partial optional loading or third-party innovation without bloating core.

fkiraly · 2025-07-13T10:20:44Z

I broadly agree with the above.

I think if we split, we should keep machinery and core estimators in sktime proper and under no circumstances deprecate the package.

We could move less commonly used estimators to separate packages and treat them as extensions.

Though that behaviour is already the status quo if users have no soft dependencies installed.

yarnabrina · 2025-07-13T11:45:53Z

This is scikit-learn's dependency list, just 4 of them. And this is not just the "core" dependency set, this is their entire dependency set unless you count specific docs/tests etc. dependencies.

vs

sktime has a lot of soft dependencies, a few of them to the point that they can't even be listed in pyproject.toml with its other dependencies, and recently we have started adding forks of other projects as part of its main codebase to be installed on users' systems. Just

scikit-learn's primary functionality is not as a framework for a unified interface, which is what I believe sktime's is. There is no comparability between the two packages with respect to maintenance complexity, in my opinion.

I have also asked a popular chatbot whether they would recommend splitting a package like scikit-learn (which is in a similar situation) into multiple packages, here is the answer:

I do not know which chatbot you asked, but I think scikit-learn being in a similar situation to sktime is fundamentally and factually wrong. This is what I received as a response when I just asked ChatGPT (free version) about the justification for keeping a monolithic architecture.

https://chatgpt.com/share/68739758-8c9c-800c-87a9-ee036edcad96

I think if we split, we should keep machinery and core estimators in sktime proper and under no circumstances deprecate the package.

Nothing stops sktime-core in my proposal from being named just sktime, if that naming is your only and primary concern, but it should not combine the logic of all modules in a single package if this is to be modular. But I am sorry to say "under no circumstances" sounds once again like a "shut up" veto instead of a collaborative discussion between core developers and all other contributors and users.

fkiraly · 2025-07-13T14:42:49Z

This is scikit-learn's dependency list, just 4 of them. And this is not just the "core" dependency set, this is their entire dependency set unless you count specific docs/tests etc. dependencies.

Our core dependency set is similarly lean:

  "joblib>=1.2.0,<1.6",  # required for parallel processing
  "numpy>=1.21,<2.4",  # required for framework layer and base class logic
  "packaging",  # for estimator specific dependency parsing
  "pandas<2.4.0,>=1.1",  # pandas is the main in-memory data container
  "scikit-base>=0.6.1,<0.13.0",  # base module for sklearn compatible base API
  "scikit-learn>=0.24,<1.8.0",  # required for estimators and framework layer
  "scipy<2.0.0,>=1.2",  # required for estimators and framework layer

Out of these, packaging and scikit-base have no dependencies, and scikit-learn implies most others, while scikit-base dependencies module soft-implies packaging, so the sktime core depset is really just scikit-base, scikit-learn, and pandas.
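
One way to compare declared dependency sets like the one above is to read them from installed package metadata. The following is a minimal sketch using only the standard library; the helper name `core_requirements` is an assumption for illustration, and the extra-marker check is a simple string heuristic rather than a full requirement parser.

```python
import re
from importlib.metadata import PackageNotFoundError, requires


def core_requirements(dist_name):
    """Return sorted names of non-extra requirements of an installed distribution.

    Returns an empty list if the distribution is not installed or declares none.
    """
    try:
        declared = requires(dist_name) or []
    except PackageNotFoundError:
        return []
    names = set()
    for req in declared:
        # Skip requirements that only apply when an optional extra is requested.
        if re.search(r"extra\s*==", req):
            continue
        # Requirement strings start with the project name.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", req)
        if match:
            names.add(match.group(0).lower())
    return sorted(names)


if __name__ == "__main__":
    for name in ("sktime", "scikit-learn"):
        print(name, core_requirements(name))
```

Running it in an environment where both packages are installed prints the two core sets side by side; in an environment without them it prints empty lists.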

scikit-learn's primary functionality is not as a framework for a unified interface, which is what I believe sktime's is. There is no comparability between the two packages with respect to maintenance complexity, in my opinion.

I think this is wrong. scikit-learn's primary functionality is being a unified API framework!

It is, in fact, the precursor of all unified APIs for AI in python!

The architecture is also quite similar, except that sktime also contains "extensions" with API adapters to other packages.

I do not know which chatbot you asked, but I think scikit-learn being in a similar situation to sktime is fundamentally and factually wrong.

I disagree about the architectural perception and vision.

fkiraly · 2025-07-13T14:51:40Z

Nothing stops sktime-core in my proposal from being named just sktime, if that naming is your only and primary concern, but it should not combine the logic of all modules in a single package if this is to be modular.

Naming is one of my concerns, but also layer architecture.

sktime should not be removed from pypi, or cease to exist as a package.
I think the framework layer should remain cohesive.

But I am sorry to say under no circumstance sounds once again like a "shut up" veto instead of a collaborative discussion between core-developers and all other contributors and users.

If you read my statement, I am against removal of sktime as a package.

For the rest, I am willing to listen to your proposal, once it spells out what would go where.

You also said the following:

and in my opinion everyone (probably excluding yourself) was clear that the single central package will no longer be needed.

I do not see how this could ever be the case, it seems like a very significant operation.

As I said, I am happy to listen to arguments, so please spell out where things would go and how things would move.

felipeangelimvieira · 2025-07-13T20:29:04Z

Anirban, I think this discussion is really useful, even if we decide not to go with it, because we would at least have the motivations behind splitting or not splitting clearly defined for future users and devs. Another good consequence of this proposal is having new ideas for the future and thinking about the long term.

About deprecating sktime: I'm not in favour of it, because it would probably damage sktime's organization in general. Old tutorials would not work, and people would have a reason to switch packages and could possibly leave the sktime userbase.
Before, I was thinking that the split proposal would be more in the sense of separating specific implementations or parts of sktime. For example: datasets, benchmarking, plotting utils, and maybe less-used features such as segmentation and detection, which we developers have less knowledge about.

Thinking about some of the concrete advantages and disadvantages of having separate packages:

Less cognitive load when trying to learn a sub-section of the codebase. Example: I want to benchmark, or extend benchmarking, but the documentation and codebase can be so big that it is hard to find it there.
Easier management of issues and PRs.

Disadvantages:

If the packages are not independent, we would need to sync versions anyway, and the maintenance cost can be larger since it would be hard to test both packages at the same time, without knowing if a change in one will break another.
Any change in skbase or other common dependencies would multiply the current effort required to test compatibility.

I wonder if we should make clear what problem we are trying to solve:

Making release easier?
Making user experience better?

felipeangelimvieira · 2025-07-13T20:36:22Z

I wonder if we should make clear what problem we are trying to solve:

Making release easier?
Making user experience better?

I believe that any split that creates packages that are coupled but separated is bad for both. I believe we should apply the same principles of software development here: aim for high cohesion and low coupling.

Question: what parts of sktime are currently together, but are not used together, or could be easily separated without problems?

About sktime-core/sktime: to avoid breaking user code, one potential solution would be requiring an optional dependency such as sktime-datasets, sktime-benchmarking, and so on. But when I think about that, I wonder if this is already achieved with optional extra dependencies and modules.

fkiraly · 2025-07-14T06:23:24Z

@felipeangelimvieira, to add to your thoughts: I think the idea of coupling/cohesion and modularity is a good one. I also want to add: we should not exclusively frame an initiative to increase modularity in the context of splitting the package.

In fact, it is almost always true that improved modularity can be achieved within a single package just as well as in multiple packages.

That increasing modularity and improving architectural structure requires splitting the package is a misleading implication.

Hence I would suggest first considering structure and architecture, and only second how to distribute the given structure across modules and packages.

From this perspective I suspect that "splitting package" will feel less necessary, since we uncouple actual pain points (like testing times, coupling, etc) from the package management question.

fkiraly · 2025-07-14T06:27:12Z

If the packages are not independent,

And I think this precisely is the crux of the problem. In the current state I do not think the modules are uncoupled; they are interdependent, and necessarily interdependent for the composition cases that are sktime's USP.

For instance, users will build pipelines from transformers, detectors, and forecasters. If we now separate into sktime-transformers, sktime-detection, sktime-forecasting, and sktime-pipelines, will such a user now have to install five packages (including sktime-core) and import from five packages?
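
The install burden described above can be sketched as a small calculation. The package names are the ones mentioned in this comment; the mapping from component kind to package and the helper `packages_to_install` are hypothetical, since the proposal has not yet specified what would go where.

```python
# Hypothetical mapping from component kind to the distribution that would ship it.
COMPONENT_TO_PACKAGE = {
    "transformer": "sktime-transformers",
    "detector": "sktime-detection",
    "forecaster": "sktime-forecasting",
    "pipeline": "sktime-pipelines",
}


def packages_to_install(components, split=True):
    """Return the distributions a user would need for the given component kinds."""
    if not split:
        # Status quo: one distribution covers every component.
        return ["sktime"]
    # Assumption: every split package depends on a shared sktime-core.
    needed = {"sktime-core"}
    needed.update(COMPONENT_TO_PACKAGE[kind] for kind in components)
    return sorted(needed)


pipeline_components = ["transformer", "detector", "forecaster", "pipeline"]
print(packages_to_install(pipeline_components, split=False))
print(packages_to_install(pipeline_components))
```

Under these assumptions, the composite pipeline needs one distribution today and five after the split, which is the user-journey cost the question is pointing at.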

This is why I think that @yarnabrina should spell out:

how he would concretely distribute the current package content across which concrete new packages
what the user journey for common vignettes would look like, including usage and installation, especially for common composition cases like common pipelines, but also for the base vignettes.

added first draft

6bcc5fe

yarnabrina force-pushed the split-package branch from 02fd6e8 to 6bcc5fe Compare July 13, 2025 07:17

fkiraly reviewed Jul 13, 2025

fkiraly requested changes Jul 13, 2025
