Merge pull request #478 from aakankshaduggal/osp-model
Add readme for optimal stopping point
sesheta authored Mar 9, 2022
2 parents 67b91fc + 231473d commit 81cae51
Showing 2 changed files with 42 additions and 0 deletions.
14 changes: 14 additions & 0 deletions docs/content.md
@@ -10,6 +10,7 @@
* [Github Time to Merge Prediction](#github-time-to-merge-prediction)
* [TestGrid Failure Type Classification](#testgrid-failure-type-classification)
* [Prow Log Classification](#prow-log-classification)
* [Optimal Stopping Point Prediction](#optimal-stopping-point-prediction)
* [More Projects Coming Soon…](#more-projects-coming-soon)
- [Automate Notebook Pipelines using Elyra and Kubeflow](#automate-notebook-pipelines-using-elyra-and-kubeflow)

@@ -103,6 +104,19 @@ We start by applying a clustering algorithm to job runs based on the term freque

* [Build Log Classification Notebook](../notebooks/data-sources/gcsweb-ci/build-logs/build_log_term_freq.ipynb)

## Optimal Stopping Point Prediction

Every new pull request to a repository is subjected to an automated set of builds and tests before its code changes can be merged. Some tests run for long durations for various reasons, such as unoptimized algorithms, slow networks, or the simple fact that many independent services are part of a single test. Long-running tests are painful because they can block the CI/CD process for extended periods of time. By predicting the optimal stopping point for a test, we can better allocate development resources.

[TestGrid](https://testgrid.k8s.io/) is a platform used to aggregate and visually represent the results of all these automated tests. Based on the test and build duration data available on TestGrid, we can predict and suggest a stopping point beyond which a given test is likely to result in a failure.

* [Detailed project description](../notebooks/optimal-stopping-point/README.md)
* Interactive model endpoint: http://optimal-stopping-point-ds-ml-workflows-ws.apps.smaug.na.operate-first.cloud/predict
* [Model Inference Notebook](../notebooks/optimal-stopping-point/model_inference.ipynb)
* [Optimal Stopping Model Training Notebook](../notebooks/optimal-stopping-point/osp_model.ipynb)
* [Deployment Configuration for Seldon Service](../notebooks/optimal-stopping-point/seldon-deployment-config.yaml)


## More Projects Coming Soon…

* [List of potential ML projects](https://github.com/aicoe-aiops/ocp-ci-analysis/issues?q=is%3Aissue+is%3Aopen+%22ML+Request%22+).
28 changes: 28 additions & 0 deletions notebooks/optimal-stopping-point/README.md
@@ -0,0 +1,28 @@
# Optimal Stopping Point Prediction

The aim of this ML problem is to predict an optimal stopping point for CI tests based on their test durations (runtimes). We perform initial data analysis and feature engineering on the TestGrid data, then calculate the optimal stopping point by identifying the distribution of test duration values for different CI tests and comparing the distributions of passing and failing tests.

## Dataset

To determine the optimal stopping point, we look into the [testgrid](https://testgrid.k8s.io/) data for all the passing and failed tests and find the distribution type of the `test_duration` metric, which tracks the time it took a test to complete its execution. We can visualize the distribution of `test_duration` across various TestGrid dashboards and jobs. Based on the distribution type identified, we can find a point after which the test has a higher probability of failing.
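As a sketch of the shape of this data, the durations can be split by outcome and summarized. The values and column names below are assumptions for illustration, not the real TestGrid schema:

```python
import pandas as pd

# Hypothetical sample in the shape described above; real values come from
# TestGrid dashboards, and these column names are assumed for the sketch.
df = pd.DataFrame({
    "test_duration": [12.1, 14.3, 55.0, 13.2, 60.5, 12.8],  # minutes
    "passed": [True, True, False, True, False, True],
})

passing = df.loc[df["passed"], "test_duration"]
failing = df.loc[~df["passed"], "test_duration"]

# Compare summary statistics of the two groups
print(passing.describe())
print(failing.describe())
```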

## Feature Engineering

After fetching the data, we approximate the distributions of the `test_duration` metric and also check its goodness of fit for different TestGrid tests across all TestGrid dashboards and grids. Based on the type of distribution identified, we can calculate the probability of the test failing.

Based on the distribution types identified, we select the top two distributions by goodness of fit. Probability density plots are used to understand the distribution of a continuous variable and the likelihood (or probability) of it taking values in a given range. The area under the curve contains the probabilities of the test duration values: in a `test_duration` probability density function, the area under the curve from 0 to a given value represents the probability that `test_duration` is less than or equal to that value.

* [Probability To Fail notebook](../data-sources/TestGrid/metrics/probability_to_fail.ipynb)
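The fitting and ranking described above can be sketched as follows. The data is synthetic, and the candidate list and Kolmogorov-Smirnov scoring are assumptions for illustration, not the exact notebook code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic test_duration sample standing in for real TestGrid data
durations = rng.lognormal(mean=3.0, sigma=0.4, size=500)

# Candidate distribution families, scored by the Kolmogorov-Smirnov
# statistic (lower means a better fit)
candidates = ["lognorm", "gamma", "expon", "norm"]
fits = []
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(durations)
    ks_stat, _ = stats.kstest(durations, name, args=params)
    fits.append((ks_stat, name, params))

fits.sort()
top_two = fits[:2]  # the two best-fitting distributions

# With a fitted distribution, the CDF gives P(test_duration <= t)
ks, name, params = fits[0]
p_under_30 = getattr(stats, name).cdf(30.0, *params)
```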

## Model Training

After performing the initial data analysis and calculating the probability to fail, we predict the optimal stopping point for a given test based on its test duration (runtime). We find the best distribution(s) for the given test, then take as the optimal stopping point the point where the probability of the test failing becomes greater than the probability of the test passing.

* [Model Training Notebook](osp_model.ipynb)

## Model Deployment

To make the machine learning model available at an interactive endpoint, we serve the model that yields the best results as a Seldon service.

* Interactive model endpoint: http://optimal-stopping-point-ds-ml-workflows-ws.apps.smaug.na.operate-first.cloud/predict
* [Model Inference Notebook](model_inference.ipynb)
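A minimal client sketch for querying the endpoint is shown below. The request payload follows the generic Seldon `ndarray` convention; the exact input features this particular service expects are an assumption here, and the feature value is a placeholder:

```python
import json
from urllib import request

ENDPOINT = (
    "http://optimal-stopping-point-ds-ml-workflows-ws"
    ".apps.smaug.na.operate-first.cloud/predict"
)

# Payload in the generic Seldon ndarray format; the feature value below
# is a placeholder, not a real TestGrid test name.
payload = {"data": {"ndarray": [["example-testgrid-test-name"]]}}

def predict(payload, endpoint=ENDPOINT, timeout=10):
    """POST the payload to the Seldon service and return the parsed JSON reply."""
    req = request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```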
