
Conversation

@mazhurin
Collaborator

Model Interface class. Categorical features for spark.iForest.

…s. Categorical features for spark.iForest. ModelManager deleted.
@mazhurin mazhurin requested a review from mkaranasou May 26, 2020 13:12
@mkaranasou mkaranasou changed the base branch from master to develop May 27, 2020 14:29
Collaborator

@mkaranasou mkaranasou left a comment

Hey, great effort in wrapping the model stuff up together 👏 👍
Just some comments - I know some of the stuff here we have discussed before but please, bear with me :)

index_columns.append(index_model.getOutputCol())

add_categories = F.udf(lambda features, arr: Vectors.dense(np.append(features, [v for v in arr])),
VectorUDT())
Collaborator

👍
Do you think this could be moved under spark/udfs, or is it one of those cases where we need to have it here for it to work properly?

Collaborator Author

I would rather do the opposite: move to_dense_vector_udf from spark/udfs to anomaly_model.py and delete the spark/udfs import here. I feel these two udfs are not going to be used anywhere else and belong solely to the model implementation file. But you are right, we need to be consistent one way or another. It's not a big deal, I moved it.
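For illustration, a minimal sketch of what keeping both UDFs local to the model implementation could look like (the body of to_dense_vector_udf and the module-level placement are assumptions, not the actual Baskerville code):

```python
# anomaly_model.py (sketch): module-level UDFs used only by the model
import numpy as np
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# hypothetical body: densify whatever vector type the assembler produced
to_dense_vector_udf = F.udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())

# same UDF as in the diff above: append the indexed categorical values
# to the dense feature vector
add_categories_udf = F.udf(
    lambda features, arr: Vectors.dense(np.append(features, [v for v in arr])),
    VectorUDT()
)
```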

iforest.setSeed(self.seed)
params = {'threshold': self.threshold}
self.iforest_model = iforest.fit(df, params)
df.unpersist()
Collaborator

Is this the last part where df is used? (wondering if it makes sense to unpersist here)

Collaborator Author

Right, it should not be here. I call persist in both train and predict, inside build_features_vectors. For train() we don't need the dataframe persisted anymore, so I moved the unpersist to the training pipeline. Do you think that's OK?
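A rough sketch of that split, with hypothetical names, just to show where persist and unpersist end up:

```python
# sketch: the training pipeline owns the persisted dataframe, not the model's train()
features_df = model.build_features_vectors(df)   # persists internally (hypothetical)
model.train(features_df)                          # fit() only, no unpersist inside
features_df.unpersist()                           # released by the training pipeline
```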

from pyspark.ml.feature import StandardScalerModel


def pickle_model(model_id, db_config, out_path, ml_model_out_path):
Collaborator

Oh.. okay, the changes to this whole file must be a merge gone wrong?

Collaborator Author

Yes. We don't need this anymore; the model implementation is responsible for serialization now.
I removed the test_model parameter for now. We will need a proper model unit test. For the test_model parameter, I think we first need to implement a model_path parameter. Then we will be able to load/test/run the pipelines with a testing model saved in the repo.
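As a hedged sketch of that direction (class and attribute names are assumptions): the model interface itself exposes save/load, so a standalone pickle_model() helper is no longer needed, and a model saved in the repo could later be loaded through a model_path parameter for testing.

```python
import os
from pyspark.ml.feature import StandardScalerModel


class AnomalyModel:  # hypothetical class name
    def save(self, path):
        # Spark ML stages persist themselves; plain attributes (features,
        # threshold, seed, ...) would go to a small metadata file alongside
        self.scaler_model.write().overwrite().save(os.path.join(path, 'scaler'))
        # ... same idea for the indexers and the iForest model ...

    @classmethod
    def load(cls, path):
        model = cls()
        model.scaler_model = StandardScalerModel.load(os.path.join(path, 'scaler'))
        # ... load the remaining stages and metadata ...
        return model
```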

self.spark_pipeline.refresh_cache.assert_called_once()

@mock.patch('baskerville.models.base_spark.F.udf')
def test_predict_no_ml_model(self, mock_udf):
Collaborator

Why is this removed? We do have a case where we can run without an ML model, right?

Collaborator

Or is just unsalvageable? 😛 :)

Collaborator Author

Yes, I jumped the gun here; I put this test back. BTW, I wanted to talk to you about that threshold. It does not make much sense to store it per row; it will probably become a parameter in the engine config. Currently we don't need it yet, since in the dashboard we set it manually. We will start using it when we implement the challenge logic.

Collaborator

Sure, the threshold is something that belongs in configuration; it doesn't make sense to have it per row. The dashboard threshold, though, is something different, right? I mean there is the threshold with which the classifier was trained (e.g. fit parameters) and the threshold we adjust manually for now.

Collaborator Author

Yes, currently the dashboard ignores predictions and uses its own threshold to classify an anomaly from the score. So far this is convenient (very easy to change manually) since the dashboard itself is sending the notifications. But as soon as Baskerville becomes responsible for the challenge/ban commands, this threshold has to be in the Baskerville configuration. Don't forget, we also have another threshold (#2) for attack detection; threshold #2 defines the maximum portion of anomalies in a batch.
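To make the two thresholds concrete, a hedged sketch with invented config keys and values (none of these names exist in Baskerville yet):

```python
# hypothetical engine config: both thresholds live in configuration, not per row
engine_config = {
    'anomaly_threshold': 0.45,      # threshold #1: score above which a row is an anomaly
    'attack_anomaly_ratio': 0.05,   # threshold #2: max portion of anomalies in a batch
}


def is_attack(num_anomalies, batch_size, config=engine_config):
    """Batch-level check using threshold #2 (sketch)."""
    return batch_size > 0 and num_anomalies / batch_size > config['attack_anomaly_ratio']
```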

4
)

def test_get_active_features_all_features(self):
Collaborator

Could you please remind me how this will all work with the model features and the extra features?

Collaborator Author

Yes, I still don't understand what this 'extra' means.
Here is how I see it. The FeatureManager (running on the client) has a list of features; ideally, this is a superset of all the features we support. This parameter is engine.features. The training pipeline (running on Bakerstreet) has its own list of features, training.model_parameter.features, which might be a subset of all the features if needed. These training features will be owned by the model: they will be saved with the model and used by the model at prediction time. If at prediction time the FeatureManager does not provide some feature(s), the model uses the default values. In this way we can cherry-pick features for different models, introduce new models with new features, and deploy them without breaking the pipelines.
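A minimal sketch of the defaulting behaviour described above, with hypothetical names: the model keeps the feature list it was trained with, and any feature the FeatureManager did not compute falls back to a default value.

```python
# sketch: align incoming features with the list the model was trained with
def select_model_features(row_features, model_features, defaults):
    """
    row_features:   dict of feature name -> value from the FeatureManager
    model_features: ordered feature list saved with the model
    defaults:       dict of feature name -> default value
    """
    return [row_features.get(name, defaults[name]) for name in model_features]

# a model trained on a subset of engine.features still predicts even if the
# client stops sending one of them:
# select_model_features({'f1': 0.2}, ['f1', 'f2'], {'f1': 0.0, 'f2': 0.0}) -> [0.2, 0.0]
```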

@@ -1,107 +1,60 @@
import sys
Collaborator

This file is probably a merge fluke? :)

Collaborator Author

Yes, I deleted this file.

self.assertDictEqual(res, some_dict)

# HDFS tests are commented out since they should not be executed with all the unit tests
# def test_json_hdfs(self):
Collaborator

We should mock the hdfs stuff - or move these tests under entity / functional tests :)
Cool that you implemented them :) 👍

Collaborator Author

Mocking the hdfs is not a real test. It would be nice to have that real functional test. Do you want me to move these tests to the functional folder and keep them commented out?

Collaborator

Well, that's the purpose of unit tests, right? :) But again, between mocking and doing functional tests, I think the latter is preferable. (It would also be nice to have more functional tests.)
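For reference, a self-contained sketch of the mocking option (the helper and its signature are invented for illustration, not Baskerville code): the HDFS read is injected and mocked, so the unit test never touches a cluster, while the real round-trip stays in a functional test.

```python
import json
import unittest
from unittest import mock


def load_json_from_hdfs(path, read_fn):
    """Hypothetical helper standing in for the real json-from-HDFS reader."""
    return json.loads(read_fn(path))


class TestJsonHelpers(unittest.TestCase):
    def test_json_hdfs(self):
        # unit-test version: the HDFS read is mocked, no cluster needed
        fake_read = mock.Mock(return_value='{"a": 1}')
        res = load_json_from_hdfs('hdfs://namenode/some/path.json', fake_read)
        self.assertDictEqual(res, {'a': 1})
        fake_read.assert_called_once_with('hdfs://namenode/some/path.json')
```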

…ka pipeline. Unit test added. Some minor improvements.
@mazhurin mazhurin merged commit 34dc1ba into develop Jun 1, 2020