fix: bugs in examples (#23)

* fix: remove unseen categorical column
* fix: example for featurizers and feature selection
* chore: add SMILES info and wikipage
ntua-unit-of-control-and-informatics · Dec 2, 2024 · 7b8fc94
1 parent b6d22b8
commit 7b8fc94

Showing 2 changed files with 27 additions and 4 deletions.
diff --git a/docusaurus/docs/jaqpotpy/scikit-learn-models/evaluate-a-model.md b/docusaurus/docs/jaqpotpy/scikit-learn-models/evaluate-a-model.md
@@ -75,7 +75,6 @@ We define a test dataset for external evaluation and prepare it using `JaqpotpyD
 X_test = pd.DataFrame(
     {
         "smiles": ["CCCOC", "CO"],
-        "cat_col": ["low", "low"],
         "temperature": [27.0, 22.0],
         "activity": [89.0, 86.0],
     }
@@ -85,7 +84,7 @@ X_test = pd.DataFrame(
 test_dataset = JaqpotpyDataset(
     df=X_test,
     smiles_cols="smiles",
-    x_cols=["cat_col", "temperature"],
+    x_cols=["temperature"],
     y_cols=["activity"],
     task="regression",
     featurizer=featurizer,

diff --git a/docusaurus/docs/jaqpotpy/scikit-learn-models/feature-preprocessing.md b/docusaurus/docs/jaqpotpy/scikit-learn-models/feature-preprocessing.md
@@ -5,11 +5,33 @@ sidebar_position: 4
 
 ## Using Multiple Featurizers
 
-In the first script, we initialize two molecular descriptor calculators from JaqpotPy:
+This guide is about using multiple featurizers and performing feature selection.
+
+First, we import necessary libraries.
 
 ```python
+# Import necessary libraries
+import pandas as pd
+import numpy as np
+from sklearn.feature_selection import VarianceThreshold
+from sklearn.compose import ColumnTransformer
+from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
+from sklearn.ensemble import RandomForestRegressor
+from jaqpotpy.models import SklearnModel
+from jaqpotpy.datasets import JaqpotpyDataset
 from jaqpotpy.descriptors import RDKitDescriptors, MACCSKeysFingerprint
+```
+
+Create a dataframe with SMILES strings, a categorical variable, temperature, and activity values. SMILES is a unified method to represent chemical structures in the form of a line notation. For more info about SMILES check [this Wikipedia page](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System).
 
+```python
+data = pd.read_csv("https://github.com/ntua-unit-of-control-and-informatics/jaqpot-google-colab-examples/raw/doc/JAQPOT-425/Sklearn_jupyter_examples/datasets/regression_smiles_categorical.csv")
+```
+
+Define a list of desired featurizers.
+
+```python
+from jaqpotpy.descriptors import RDKitDescriptors, MACCSKeysFingerprint
 featurizers = [RDKitDescriptors(), MACCSKeysFingerprint()]
 ```
 
@@ -48,6 +70,7 @@ Alternatively, you can directly select specific columns by name using the `Selec
 ```python
 myList = [
     "temperature",
+    "cat_col",
     "MaxAbsEStateIndex",
     "MaxEStateIndex",
     "MinAbsEStateIndex",
@@ -91,10 +114,11 @@ We then pass these preprocessing pipelines to the `SklearnModel` object:
 ```python
 jaqpot_model = SklearnModel(
     dataset=train_dataset,
-    model=model,
+    model=RandomForestRegressor(random_state=42),
     preprocess_x=double_preprocessing,
     preprocess_y=single_preprocessing,
 )
+jaqpot_model.fit()
 ```
 
 This ensures that the feature and target variables are properly preprocessed before being used to train the machine learning model.
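
Review note: the last hunk passes `preprocess_x` and `preprocess_y` to `SklearnModel` and then calls `fit()`. As a rough illustration of what preprocessing both features and target amounts to, here is a minimal sketch in plain scikit-learn. It does not use the jaqpotpy API. The synthetic data, the definitions given to `double_preprocessing` and `single_preprocessing`, and the use of `TransformedTargetRegressor` are all assumptions made for illustration, not what `SklearnModel` necessarily does internally.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (assumption: not the dataset from the docs).
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=40)

# Assumed shape of the two preprocessing objects named in the docs:
# two chained scalers for the features, one scaler for the target.
double_preprocessing = Pipeline(
    [("standard", StandardScaler()), ("minmax", MinMaxScaler())]
)
single_preprocessing = MinMaxScaler()

# Features go through the scalers before reaching the regressor; the target
# is scaled for training and predictions are inverse-transformed back.
model = TransformedTargetRegressor(
    regressor=Pipeline(
        [
            ("preprocess_x", double_preprocessing),
            ("rf", RandomForestRegressor(n_estimators=20, random_state=42)),
        ]
    ),
    transformer=single_preprocessing,
)
model.fit(X, y)

# Predictions come back on the original target scale.
predictions = model.predict(X[:5])
```

Fitting the scalers inside one estimator means they are learned on the training data only and reapplied unchanged at prediction time, which is presumably the behaviour the docs rely on.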