Sklearn Workflow Integration #5
Conversation
KnathanM left a comment

Looks like your notebook needs to have its imports updated (the tests fail).
You could add this as the license file:

```
MIT License

Copyright (c) 2025 Chemprop Dev Team

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
```python
        self.model, train_dataloaders=train_loader, val_dataloaders=val_loader
    )
else:
    trainer.fit(self.model, train_dataloaders=train_loader)
```
I think you can always do `trainer.fit(self.model, train_dataloaders=train_loader, val_dataloaders=val_loader)` even if `val_loader` is `None`, because that is the default for `val_dataloaders`.
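The reviewer's point can be illustrated in plain Python: a keyword argument whose default is `None` behaves identically whether you omit it or pass `None` explicitly, so no `if val_loader is None` branch is needed around the call (a toy `fit` standing in for Lightning's `Trainer.fit`, not the real API):

```python
def fit(model, train_dataloaders=None, val_dataloaders=None):
    # Toy stand-in for Trainer.fit: val_dataloaders defaults to None.
    if val_dataloaders is None:
        return (model, "no validation")
    return (model, "with validation")

val_loader = None
# Passing val_dataloaders=None explicitly is identical to omitting it,
# so a single unconditional fit call suffices.
assert fit("mpnn", train_dataloaders=[1], val_dataloaders=val_loader) == \
       fit("mpnn", train_dataloaders=[1])
```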
```python
def predict(self, X):
    if self.model is None:
        raise ValueError("The regressor has not been fitted.")
```
Maybe a `RuntimeError` is more accurate? I don't know a lot about Python errors, though.
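As a point of reference, scikit-learn signals this situation with a dedicated `NotFittedError` from `sklearn.exceptions`, which in sklearn inherits from both `ValueError` and `AttributeError`. A minimal sketch of the pattern, without importing sklearn:

```python
class NotFittedError(ValueError, AttributeError):
    """Mirrors sklearn.exceptions.NotFittedError, which inherits from both
    ValueError and AttributeError for backward compatibility."""

class Regressor:
    def __init__(self):
        self.model = None

    def predict(self, X):
        if self.model is None:
            raise NotFittedError("The regressor has not been fitted.")
        return [0.0] * len(X)

try:
    Regressor().predict([[1.0]])
except NotFittedError as e:
    print(e)  # The regressor has not been fitted.
```

Because `NotFittedError` subclasses `ValueError`, existing callers that catch `ValueError` keep working.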
```python
if not self.args.no_cache:
    test_set.cache = True
```
The test set doesn't need to be cached, as it is only ever used once.
```python
    test_set,
    batch_size=self.args.batch_size,
    num_workers=self.args.num_workers,
    collate_fn=pick_collate(test_set),
```
Maybe you could use `chemprop.data.build_dataloader` instead of defining your own `pick_collate` function.
```python
    accelerator=self.args.accelerator, devices=1, enable_progress_bar=True
)
preds = eval_trainer.predict(
    self.model, dataloaders=dl, return_predictions=True
```
The Lightning docs say:

> `return_predictions`: Whether to return predictions. `True` by default except when an accelerator that spawns processes is used (not supported).

So I don't think you need to include this arg.
```python
class ChempropEnsembleRegressor(ChempropRegressor):
    def __init__(self, ensemble_size: int = 5, **chemprop_kwargs):
```
If you use `**chemprop_kwargs`, I don't think you can use this class with cross-validation utilities like `cross_val_score` and `RandomizedSearchCV`. These functions clone the estimator for each separate split of the data, and the `clone` method uses the `__init__` signature to know which parameters to copy over. Here it looks like the ensemble regressor only takes two parameters, `ensemble_size` and `chemprop_kwargs`, but the estimator does not have an attribute `self.chemprop_kwargs`, so `clone` will error.

I think the structure that is more typical in sklearn is to make the sub-estimators outside the composite estimator and pass those in as parameters. Here is some documentation about that: https://scikit-learn.org/stable/modules/grid_search.html#composite-estimators-and-parameter-spaces

What I am not sure about is that the number of sub-estimators can change, so we would have to assign all of them to a single attribute, maybe as a list? Or maybe we just document that the ensemble estimator shouldn't be used with CV.
I'm a bit confused by this again. We discussed how the issue stems from the CV function directly mutating the arguments and not re-running `__init__()`, but doesn't that mean our issue remains unsolved even if we explicitly list each argument and create the namespace in `fit`, since that would refer to fields of `self`, and assignments to those fields are made in `__init__()`?
Assignments to these fields are also made when cross-validation goes to fit a cloned copy of the estimator: https://github.com/scikit-learn/scikit-learn/blob/eec13ccc9c81027ce9387e1fce6f04fd22e80d4d/sklearn/model_selection/_validation.py#L821

This is why the signature of `__init__()` needs to match the parameters the estimator has. Instead of relying on `__init__()` to set the parameters of a cloned estimator, sklearn manually sets the parameters using the signature of `__init__()`.
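The clone contract discussed above can be sketched without sklearn: `clone` rebuilds the estimator from the values returned by `get_params()`, and `get_params()` discovers parameter names from the `__init__` signature, which is why every `__init__` parameter must be stored verbatim as an attribute of the same name (toy classes below, not sklearn's actual implementation):

```python
import inspect

class ToyEnsembleRegressor:
    # sklearn convention: every __init__ parameter is stored verbatim on
    # self under the same name, and __init__ does nothing else.
    def __init__(self, ensemble_size=5, max_epochs=50):
        self.ensemble_size = ensemble_size
        self.max_epochs = max_epochs

    def get_params(self, deep=True):
        # Recover parameter names from the __init__ signature, as
        # sklearn's BaseEstimator does.
        names = list(inspect.signature(self.__init__).parameters)
        return {n: getattr(self, n) for n in names}

def clone(est):
    # Simplified version of sklearn.base.clone: rebuild from get_params().
    return type(est)(**est.get_params())

est = ToyEnsembleRegressor(ensemble_size=3)
assert clone(est).get_params() == {"ensemble_size": 3, "max_epochs": 50}
```

An estimator taking `**chemprop_kwargs` breaks this round trip: the kwargs never become a matching attribute, so `getattr(self, "chemprop_kwargs")` fails during cloning.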
```python
if self.checkpoint is not None:
    if len(self.checkpoint) != self.ensemble_size:
        logger.warning(
            f"The number of models in ensemble for each splitting of data is set to {len(self.args.checkpoint)}."
```
It would probably be good to add something to the warning along the lines of "the number of model checkpoints supplied is not equal to the specified ensemble size; got {len(self.args.checkpoint)} model checkpoints" to explain why the number of models is being set to that number.
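A sketch of what the expanded warning might look like (the variable names are assumed from the diff, not taken from the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)

checkpoint = ["ckpt_a.pt", "ckpt_b.pt"]  # hypothetical checkpoint paths
ensemble_size = 5

if checkpoint is not None and len(checkpoint) != ensemble_size:
    # State both the mismatch and its consequence in the warning itself.
    logger.warning(
        "The number of model checkpoints supplied (%d) is not equal to the "
        "specified ensemble size (%d); the number of models in the ensemble "
        "for each splitting of data is set to %d.",
        len(checkpoint), ensemble_size, len(checkpoint),
    )
```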
KnathanM left a comment

LGTM, thanks for your work on this. I hope that people will find this code useful when incorporating Chemprop models into an sklearn workflow.

I am rerunning the tests before merging, as it has been a while and Chemprop has released updated versions. Hopefully there are no problems.
```
  | Name            | Type                         | Params | Mode
-------------------------------------------------------------------------
0 | message_passing | MulticomponentMessagePassing | 252 K  | train
```
I'm now noticing that this is using `MulticomponentMessagePassing` instead of `BondMessagePassing` despite being single-component. This is because `make_datapoints` (imported from chemprop and used in chemprop_estimator.py) always returns a list of lists; if it is single-component, then it is a list with a single list. So instead of checking `isinstance(X[0], list)`, we should check `len(X) > 1` and otherwise do `X = X[0]`, like the chemprop CLI does here and here.
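The unwrapping described in that comment can be sketched with toy data (plain SMILES strings standing in for chemprop datapoints):

```python
# make_datapoints always returns a list of per-component lists; a
# single-component dataset is therefore a list containing one list.
X = [["CCO", "c1ccccc1", "CC(=O)O"]]  # one component, three molecules

# Branch on the number of components rather than isinstance(X[0], list):
if len(X) > 1:
    components = X        # multicomponent: keep the list of lists
else:
    components = X[0]     # single component: unwrap to a flat list

assert components == ["CCO", "c1ccccc1", "CC(=O)O"]
```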
Good call! The changes make sense to me, but they cause the examples in the notebook to fail with an sklearn complaint: `TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType)`, likely due to a type mismatch introduced by the fix. I will see if I can pin down the issue and let you know if I need help!
See the linked PR on Chemprop for why the tests fail on this PR.
Description
Implement sklearn transformer and regressor modules that encapsulate the functionality of Chemprop, so that users can readily employ a Chemprop model as an sklearn estimator and apply the sklearn library to validate and optimize it. Compatible with the latest Chemprop version.
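As a sketch of the interface this description implies (class and attribute names are illustrative, not the PR's actual implementation), an sklearn-compatible regressor only needs the `fit`/`predict` contract, with `fit` returning `self`:

```python
class ChempropRegressorSketch:
    """Toy stand-in: a real implementation would train a Chemprop MPNN."""

    def __init__(self, max_epochs=50, batch_size=64):
        # sklearn convention: store __init__ parameters verbatim.
        self.max_epochs = max_epochs
        self.batch_size = batch_size
        self.model = None

    def fit(self, X, y):
        # Real code would build chemprop datapoints from the SMILES in X
        # and train with Lightning; here we just memorize the mean target.
        self.model = sum(y) / len(y)
        return self  # returning self keeps pipelines and CV happy

    def predict(self, X):
        if self.model is None:
            raise ValueError("The regressor has not been fitted.")
        return [self.model] * len(X)

reg = ChempropRegressorSketch().fit(["CCO", "CCN"], [1.0, 3.0])
assert reg.predict(["CCC"]) == [2.0]
```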
Questions
N/A
Relevant Chemprop Issue
#1075
Checklist