fix: bugs in examples (#23)
* fix: remove unseen categorical column

* fix: example for featurizers and feature selection

* chore: add SMILES info and wikipage
vassilismin authored Dec 2, 2024
1 parent b6d22b8 commit 7b8fc94
Showing 2 changed files with 27 additions and 4 deletions.
@@ -75,7 +75,6 @@ We define a test dataset for external evaluation and prepare it using `JaqpotpyD
X_test = pd.DataFrame(
{
"smiles": ["CCCOC", "CO"],
- "cat_col": ["low", "low"],
"temperature": [27.0, 22.0],
"activity": [89.0, 86.0],
}
@@ -85,7 +84,7 @@ X_test = pd.DataFrame(
test_dataset = JaqpotpyDataset(
df=X_test,
smiles_cols="smiles",
- x_cols=["cat_col", "temperature"],
+ x_cols=["temperature"],
y_cols=["activity"],
task="regression",
featurizer=featurizer,
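
The commit message describes `cat_col` as an unseen categorical column. Assuming this means the column was not among the features used to build the training dataset, a quick check like the following can catch such a mismatch before a test dataset is constructed. This is a plain-Python sketch: `train_x_cols` and the helper `unseen_columns` are hypothetical names introduced for illustration and are not part of JaqpotPy.

```python
# Feature columns assumed to have been used for training (an assumption,
# not shown in this diff).
train_x_cols = ["temperature"]

# Feature columns requested for the test dataset before the fix.
requested_x_cols = ["cat_col", "temperature"]


def unseen_columns(requested, seen_in_training):
    """Return the requested columns that were not used in training, in order."""
    seen = set(seen_in_training)
    return [col for col in requested if col not in seen]


extra = unseen_columns(requested_x_cols, train_x_cols)
if extra:
    print(f"Columns not seen during training: {extra}")
```

With the fixed `x_cols=["temperature"]`, the same check returns an empty list.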
@@ -5,11 +5,33 @@ sidebar_position: 4

## Using Multiple Featurizers

- In the first script, we initialize two molecular descriptor calculators from JaqpotPy:
+ This guide is about using multiple featurizers and performing feature selection.

First, we import necessary libraries.

```python
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from jaqpotpy.models import SklearnModel
from jaqpotpy.datasets import JaqpotpyDataset
from jaqpotpy.descriptors import RDKitDescriptors, MACCSKeysFingerprint
```

Load a dataframe that contains SMILES strings, a categorical variable, temperature, and activity values. SMILES (Simplified Molecular Input Line Entry System) is a line notation that represents chemical structures as text strings. For more information about SMILES, see [this Wikipedia page](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System).

```python
data = pd.read_csv("https://github.com/ntua-unit-of-control-and-informatics/jaqpot-google-colab-examples/raw/doc/JAQPOT-425/Sklearn_jupyter_examples/datasets/regression_smiles_categorical.csv")
```

Define a list of desired featurizers.

```python
from jaqpotpy.descriptors import RDKitDescriptors, MACCSKeysFingerprint
featurizers = [RDKitDescriptors(), MACCSKeysFingerprint()]
```

@@ -48,6 +70,7 @@ Alternatively, you can directly select specific columns by name using the `Selec
```python
myList = [
"temperature",
+ "cat_col",
"MaxAbsEStateIndex",
"MaxEStateIndex",
"MinAbsEStateIndex",
@@ -91,10 +114,11 @@ We then pass these preprocessing pipelines to the `SklearnModel` object:
```python
jaqpot_model = SklearnModel(
dataset=train_dataset,
- model=model,
+ model=RandomForestRegressor(random_state=42),
preprocess_x=double_preprocessing,
preprocess_y=single_preprocessing,
)
+ jaqpot_model.fit()
```

This ensures that the feature and target variables are properly preprocessed before being used to train the machine learning model.
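
For readers who want to see the same idea outside JaqpotPy, here is a rough scikit-learn-only analogue. It is a sketch under assumptions, not a description of how `SklearnModel` works internally. The definitions of `double_preprocessing` and `single_preprocessing` are not shown in this diff, so the example assumes two chained feature scalers and one target scaler. The toy data is invented for illustration.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix and target values (invented for this sketch).
X = [[27.0, 1.0], [22.0, 3.0], [30.0, 2.0], [25.0, 5.0]]
y = [89.0, 86.0, 91.0, 88.0]

# Features pass through two scalers in sequence before reaching the model.
feature_pipeline = make_pipeline(
    StandardScaler(),
    MinMaxScaler(),
    RandomForestRegressor(random_state=42),
)

# The target is scaled before fitting, and predictions are mapped back
# to the original units automatically.
model = TransformedTargetRegressor(
    regressor=feature_pipeline,
    transformer=MinMaxScaler(),
)
model.fit(X, y)
predictions = model.predict(X)
```

Because the target transformation is inverted at prediction time, `predictions` is expressed in the same units as `y`.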