If you are a Data Scientist and would like to get involved with this project, here are a few ways to get started.
One of the goals of the AI4CI project is to build an open AIOps community centered around open IT operations datasets. As a part of this project, we've curated various open source datasets from the Continuous Integration (CI) process of OpenShift (other projects to be added in the future) and made them available for others to explore and apply their own analytics and ML. A detailed overview of our current efforts with each of these data sources, including data collection scripts and exploratory analyses, is available here for you to learn more.
There are interactive and reproducible notebooks for this entire project available for anyone to start using on our public JupyterHub instance on the Massachusetts Open Cloud (MOC) right now!
- To get started, access JupyterHub, select log in with `moc-sso`, and sign in using your Google Account.
- After signing in, on the spawner page, please select the `ocp-ci-analysis:latest` image from the list in the JupyterHub Notebook Image section, select `Medium` from the container size drop-down, and hit `Start` to spawn your server.
- Once your server has spawned, you should see a directory titled `ocp-ci-analysis-<current-timestamp>`. This directory contains the entire project repo, including notebooks that can be run directly in this JupyterHub environment.
- To interact with the S3 bucket and access the stored datasets, make sure you have a `.env` file at the root of your repo. Check `.env-example` for an example `.env` file, and open an issue for access credentials.
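The notebooks typically read these credentials from the environment. As a rough sketch of how a `.env` file's `KEY=VALUE` lines end up in `os.environ` (the variable names below are illustrative assumptions only — the real names are in `.env-example`):

```python
import os

# Hypothetical .env contents; the actual keys are defined in .env-example.
sample_env = """\
S3_ENDPOINT_URL=https://s3.example.com
S3_ACCESS_KEY=your-access-key
S3_SECRET_KEY=your-secret-key
S3_BUCKET=your-bucket-name
"""

def load_env(text):
    """Parse KEY=VALUE lines (as a dotenv loader would) into os.environ."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_env(sample_env)
```

Once loaded this way, any notebook in the repo can pick up the S3 endpoint, bucket, and keys from the environment without hard-coding credentials.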
You can find more information on the various notebooks and their purpose here.
If you need more help navigating the Operate First environment, we have a few short videos to help you get started.
As a part of AI4CI, we collect the relevant metrics and key performance indicators (KPIs) and visualize them using dashboards. You can view and interact with the publicly available dashboard here.
- Github Time to Merge Model: We have an interactive endpoint available for a model that predicts the time taken to merge a PR, classifying it into one of a few predefined time ranges. To interact with the model, check out this Model Inference Notebook. You can find more information about the Github Time to Merge Model here.
- Build Log Clustering Model: We also have an interactive endpoint for a model that uses unsupervised machine learning techniques such as k-means and tf-idf for clustering build logs. To interact with the model, check out this Seldon deployment.
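Models deployed with Seldon are queried over HTTP with a JSON payload in Seldon's `ndarray` format. A minimal sketch of what a request to the time-to-merge endpoint might look like — the endpoint URL and the feature vector here are placeholders, not the model's real input schema (see the Model Inference Notebook for the actual features):

```python
import json

# Hypothetical PR features -- the real feature set is defined in the
# Model Inference Notebook; these values are placeholders.
features = [3, 120, 45, 2]  # e.g. files changed, additions, deletions, commits

# Seldon Core's prediction API expects the input wrapped like this:
payload = {"data": {"ndarray": [features]}}
body = json.dumps(payload)

# To query a live deployment (URL is a placeholder):
# import requests
# resp = requests.post(
#     "http://<seldon-endpoint>/api/v1.0/predictions",
#     json=payload,
# )
# resp.json() contains the predicted time-range class in the same format.
```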
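The clustering approach behind the build log model can be sketched in a few lines of scikit-learn. This is an illustrative toy on made-up log lines, not the deployed model's actual pipeline or hyperparameters:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy build-log lines; the real model is trained on full OpenShift CI logs.
logs = [
    "error: failed to pull image registry.example.com/app:latest",
    "error: failed to pull image registry.example.com/db:stable",
    "test suite passed in 214s, 0 failures",
    "test suite passed in 198s, 0 failures",
]

# Vectorize the logs with tf-idf, then group them with k-means.
vectors = TfidfVectorizer().fit_transform(logs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
labels = kmeans.labels_.tolist()

# Logs with similar wording should land in the same cluster.
```

The deployed model applies the same idea at scale, so similar failure messages in CI build logs are grouped together for triage.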
We use Elyra and Kubeflow Pipelines to automate the various steps in the project responsible for data collection, metric calculation, and ML analysis. To automate your notebook workflows, you can follow this guide or tutorial video.
Here is a video playlist for the AI4CI project that goes over the different analyses and walks through various notebooks within the project.