
Conversation

@mazhurin
Collaborator

Model Interface class. Categorical features for spark.iForest.

…s. Categorical features for spark.iForest. ModelManager deleted.
@mazhurin mazhurin requested a review from mkaranasou May 26, 2020 13:12
@mkaranasou mkaranasou changed the base branch from master to develop May 27, 2020 14:29
Collaborator

@mkaranasou mkaranasou left a comment

Hey, great effort in wrapping the model stuff up together 👏 👍
Just some comments - I know some of the stuff here we have discussed before but please, bear with me :)

index_columns.append(index_model.getOutputCol())

add_categories = F.udf(lambda features, arr: Vectors.dense(np.append(features, [v for v in arr])),
VectorUDT())
Collaborator

👍
Do you think this could be moved under spark/udfs, or is it one of those cases where we need to have it here for it to work properly?

Collaborator Author

I would rather do the opposite: move to_dense_vector_udf from spark/udfs to anomaly_model.py and delete the spark/udfs import here. I feel these two udfs are not going to be used anywhere else and belong solely to the model implementation file. But you are right, we need to be consistent one way or another. It's not a big deal, I moved it.
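For illustration, a minimal sketch of what keeping both UDFs local to the model implementation could look like (the body of to_dense_vector_udf and the module-level placement are assumptions, not the actual Baskerville code):

```python
# anomaly_model.py (sketch): module-level UDFs used only by the model
import numpy as np
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# hypothetical body: densify whatever vector type the assembler produced
to_dense_vector_udf = F.udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())

# same UDF as in the diff above: append the indexed categorical values
# to the dense feature vector
add_categories_udf = F.udf(
    lambda features, arr: Vectors.dense(np.append(features, [v for v in arr])),
    VectorUDT()
)
```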

iforest.setSeed(self.seed)
params = {'threshold': self.threshold}
self.iforest_model = iforest.fit(df, params)
df.unpersist()
Collaborator

Is this the last part where df is used? (wondering if it makes sense to unpersist here)

Collaborator Author

Right, it should not be here. I call persist in both train and predict, inside build_features_vectors. For train() we don't need the dataframe persisted anymore, so I moved the unpersist to the training pipeline. Do you think that's OK?
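A rough sketch of that split, with hypothetical names, just to show where persist and unpersist end up:

```python
# sketch: the training pipeline owns the persisted dataframe, not the model's train()
features_df = model.build_features_vectors(df)   # persists internally (hypothetical)
model.train(features_df)                          # fit() only, no unpersist inside
features_df.unpersist()                           # released by the training pipeline
```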

from pyspark.ml.feature import StandardScalerModel


def pickle_model(model_id, db_config, out_path, ml_model_out_path):
Collaborator

Oh.. okay, the changes to this whole file must be a merge gone wrong?

Collaborator Author

Yes. We don't need this anymore; the model implementation is responsible for serialization now.
I removed the test_model parameter for now. We will need a proper model unit test. For the test_model parameter, I think we first need to implement a model_path parameter. Then we will be able to load/test/run the pipelines with a testing model saved in the repo.
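As a hedged sketch of that direction (class and attribute names are assumptions): the model interface itself exposes save/load, so a standalone pickle_model() helper is no longer needed, and a model saved in the repo could later be loaded through a model_path parameter for testing.

```python
import os
from pyspark.ml.feature import StandardScalerModel


class AnomalyModel:  # hypothetical class name
    def save(self, path):
        # Spark ML stages persist themselves; plain attributes (features,
        # threshold, seed, ...) would go to a small metadata file alongside
        self.scaler_model.write().overwrite().save(os.path.join(path, 'scaler'))
        # ... same idea for the indexers and the iForest model ...

    @classmethod
    def load(cls, path):
        model = cls()
        model.scaler_model = StandardScalerModel.load(os.path.join(path, 'scaler'))
        # ... load the remaining stages and metadata ...
        return model
```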

self.spark_pipeline.refresh_cache.assert_called_once()

@mock.patch('baskerville.models.base_spark.F.udf')
def test_predict_no_ml_model(self, mock_udf):
Collaborator

Why is this removed? We do have a case where we can run without an ML model, right?

Collaborator

Or is just unsalvageable? 😛 :)

Collaborator Author

Yes, I jumped the gun here; I put this test back. BTW, I wanted to talk to you about that threshold. It does not make much sense to store it per row; it will probably become a parameter in the engine config. Currently we don't need it yet, since in the dashboard we set it manually. We will start using it when we implement the challenge logic.

Collaborator

Sure, the threshold is something that belongs in configuration; it doesn't make sense to have it per row. The dashboard threshold, though, is something different, right? I mean there is the threshold with which the classifier was trained (e.g. fit parameters) and the threshold we adjust manually for now.

Collaborator Author

Yes, currently the dashboard ignores predictions and uses its own threshold to classify an anomaly from the score. So far this is convenient (very easy to change manually) since the dashboard itself is sending the notifications. But as soon as Baskerville becomes responsible for the challenge/ban commands, this threshold has to be in the Baskerville configuration. Don't forget, we also have another threshold (#2) for attack detection; threshold #2 defines the maximum portion of anomalies in a batch.
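To make the two thresholds concrete, a hedged sketch with invented config keys and values (none of these names exist in Baskerville yet):

```python
# hypothetical engine config: both thresholds live in configuration, not per row
engine_config = {
    'anomaly_threshold': 0.45,      # threshold #1: score above which a row is an anomaly
    'attack_anomaly_ratio': 0.05,   # threshold #2: max portion of anomalies in a batch
}


def is_attack(num_anomalies, batch_size, config=engine_config):
    """Batch-level check using threshold #2 (sketch)."""
    return batch_size > 0 and num_anomalies / batch_size > config['attack_anomaly_ratio']
```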

4
)

def test_get_active_features_all_features(self):
Collaborator

Could you please remind me how this will all work with the model features and the extra features?

Collaborator Author

Yes, I still don't understand what this 'extra' means.
Here is how I see it. The FeatureManager (running on the client) has a list of features; ideally, this is a superset of all the features we support. This parameter is engine.features. The training pipeline (running on Bakerstreet) has its own list of features, training.model_parameter.features, which might be a subset of all the features if needed. These training features will be owned by the model: they will be saved with the model and used by the model at prediction time. If at prediction time the FeatureManager does not provide some feature(s), the model uses the default values. In this way we can cherry-pick features for different models, introduce new models with new features, and deploy them without breaking the pipelines.
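A minimal sketch of the defaulting behaviour described above, with hypothetical names: the model keeps the feature list it was trained with, and any feature the FeatureManager did not compute falls back to a default value.

```python
# sketch: align incoming features with the list the model was trained with
def select_model_features(row_features, model_features, defaults):
    """
    row_features:   dict of feature name -> value from the FeatureManager
    model_features: ordered feature list saved with the model
    defaults:       dict of feature name -> default value
    """
    return [row_features.get(name, defaults[name]) for name in model_features]

# a model trained on a subset of engine.features still predicts even if the
# client stops sending one of them:
# select_model_features({'f1': 0.2}, ['f1', 'f2'], {'f1': 0.0, 'f2': 0.0}) -> [0.2, 0.0]
```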

@@ -1,107 +1,60 @@
import sys
Collaborator

This file is probably a merge fluke? :)

Collaborator Author

Yes, I deleted this file.

self.assertDictEqual(res, some_dict)

# HDFS tests are commented out since they should not be executed with all the unit tests
# def test_json_hdfs(self):
Collaborator

We should mock the hdfs stuff - or move these tests under entity / functional tests :)
Cool that you implemented them :) 👍

Collaborator Author

Mocking the hdfs is not a real test. It would be nice to have that real functional test. Do you want me to move these tests to the functional folder and keep them commented out?

Collaborator

Well, that's the purpose of unit tests, right? :) But again, between mocking and doing functional tests, I think the latter is preferable. (It would also be nice to have more functional tests.)
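For reference, a self-contained sketch of the mocking option (the helper and its signature are invented for illustration, not Baskerville code): the HDFS read is injected and mocked, so the unit test never touches a cluster, while the real round-trip stays in a functional test.

```python
import json
import unittest
from unittest import mock


def load_json_from_hdfs(path, read_fn):
    """Hypothetical helper standing in for the real json-from-HDFS reader."""
    return json.loads(read_fn(path))


class TestJsonHelpers(unittest.TestCase):
    def test_json_hdfs(self):
        # unit-test version: the HDFS read is mocked, no cluster needed
        fake_read = mock.Mock(return_value='{"a": 1}')
        res = load_json_from_hdfs('hdfs://namenode/some/path.json', fake_read)
        self.assertDictEqual(res, {'a': 1})
        fake_read.assert_called_once_with('hdfs://namenode/some/path.json')
```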

…ka pipeline. Unit test added. Some minor improvements.
@mazhurin mazhurin merged commit 34dc1ba into develop Jun 1, 2020