This repo demonstrates how GitOps can be used with Kubeflow Pipelines from deployKF.
NOTE:
- This repo is about using GitOps to manage pipeline definitions and pipeline schedules, NOT the platform itself.
- This repo only supports Kubeflow Pipelines compiled in V1 mode.
This repository is logically grouped into four steps:
- Render Pipelines: demonstrates how to render pipelines
- Run Pipelines: demonstrates how to run the rendered pipelines
- Schedule Pipelines: demonstrates how to schedule the rendered pipelines
- Automatic Reconciliation: demonstrates how to automatically reconcile the schedule configs
Unlike this demo, in the real world you typically store pipeline definitions and schedules in separate repositories.
For example, you may have the following repositories:
Repository | Purpose | Demo Steps Used
---|---|---
`ml-project-1` | pipeline definitions for "ml project 1" | "Step 1: Render Pipelines", "Step 2: Run Pipelines"
`ml-project-2` | pipeline definitions for "ml project 2" |
`ml-project-3` | pipeline definitions for "ml project 3" |
`kfp-schedules` | schedules for all pipelines | "Step 3: Schedule Pipelines", "Step 4: Automatic Reconciliation"
This repository contains the following content:
Directory | Description
---|---
`/.github/workflows/` | reference GitHub Actions workflows
`/common_python/` | shared Python code
`/common_scripts/` | shared Bash scripts
`/step-1--render-pipelines/` | examples/scripts for rendering pipelines
`/step-2--run-pipelines/` | examples/scripts for running rendered pipelines
`/step-3--schedule-pipelines/` | examples/scripts for scheduling rendered pipelines
The Kubeflow Pipelines SDK is a Python DSL that compiles down to Argo Workflows resources, which the Kubeflow Pipelines backend can execute on a Kubernetes cluster, either on-demand or on a schedule.
To manage pipeline definitions/schedules with GitOps, we need a reliable way to render the pipelines from their "dynamic Python representation" into their "static YAML representation".
You will find the following items under `/step-1--render-pipelines/example_pipeline_1/`:

File/Directory | Description
---|---
`./pipeline.py` | the pipeline definition, written with the Kubeflow Pipelines SDK
`./render_pipeline.sh` | script that renders the pipeline into `./RENDERED_PIPELINE/`
`./RENDERED_PIPELINE/` | the rendered "static YAML representation" of the pipeline
`./example_component.yaml` | an example component definition used by the pipeline
WARNING:
It is NOT recommended to run `pipeline.py` directly, but rather to use scripts like `render_pipeline.sh` that ensure the rendered pipeline is only updated if the pipeline definition actually changes.
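The guard that `render_pipeline.sh` implements (only replacing the rendered file when its content actually changes) can be sketched in plain Python; the function name is illustrative:

```python
import hashlib
from pathlib import Path


def write_if_changed(path: Path, new_content: str) -> bool:
    """Write `new_content` to `path` only if it differs from the existing file.

    Returns True if the file was (re)written. This mirrors what a script like
    `render_pipeline.sh` should do: avoid churning the rendered YAML (and your
    git history) when the pipeline definition has not actually changed.
    """
    new_digest = hashlib.sha256(new_content.encode()).hexdigest()
    if path.exists():
        old_digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if old_digest == new_digest:
            return False  # identical content, leave the file untouched
    path.write_text(new_content)
    return True
```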
TIP:
If each run of `render_pipeline.sh` results in a different rendered pipeline, your pipeline definition is not deterministic. For example, it might be using `datetime.now()` in the definition itself, rather than within a step.

If a step in your pipeline requires the current date/time, you may use the Argo Workflows "variables" feature to set a step's inputs:

- `{{workflow.creationTimestamp.RFC3339}}` becomes the run-time of the workflow (`"2030-01-01T00:00:00Z"`)
- `{{workflow.creationTimestamp.<STRFTIME_CHAR>}}` becomes the run-time formatted by a single strftime character
  - TIP: custom time formats can be created using multiple variables, for example, `{{workflow.creationTimestamp.Y}}-{{workflow.creationTimestamp.m}}-{{workflow.creationTimestamp.d}}` becomes `"2030-01-01"`
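For example, a rendered workflow can carry the variable verbatim in its static YAML, and Argo substitutes the value at run time. A hypothetical fragment of a rendered pipeline (the task and parameter names are illustrative):

```yaml
# hypothetical fragment of a rendered pipeline (an Argo Workflow):
# the variable string is stored as-is in the static YAML, keeping the
# render deterministic, and is only resolved when the workflow runs
- name: ingest-data
  template: ingest-data
  arguments:
    parameters:
      - name: run_date
        value: "{{workflow.creationTimestamp.Y}}-{{workflow.creationTimestamp.m}}-{{workflow.creationTimestamp.d}}"
```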
TIP:
Additional arguments may be added to `pipeline.py` so that the same pipeline definition can render multiple variants:

- If you do this, you will need to create a separate `render_pipeline.sh` script for each variant, for example, `render_pipeline_dev.sh`, `render_pipeline_test.sh`, `render_pipeline_prod.sh`.
- These scripts should be configured to render the pipeline into a separate directory, for example, `RENDERED_PIPELINE_dev/`, `RENDERED_PIPELINE_test/`, `RENDERED_PIPELINE_prod/`.
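A sketch of how `pipeline.py` might accept such an argument (stdlib only; the `--env` flag follows the naming in the examples above, and the KFP compile call is shown as a comment since it depends on your pipeline):

```python
import argparse
from pathlib import Path


def output_dir(env: str) -> Path:
    # each variant is rendered into its own directory, e.g. RENDERED_PIPELINE_dev/
    return Path(f"RENDERED_PIPELINE_{env}")


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="render one variant of the pipeline")
    parser.add_argument("--env", choices=["dev", "test", "prod"], required=True)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # a real pipeline.py would now compile the variant into its directory,
    # e.g. with the KFP v1 compiler:
    #   kfp.compiler.Compiler().compile(pipeline_func, str(output_dir(args.env) / "pipeline.yaml"))
    print(f"would render into: {output_dir(args.env)}")
```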
We provide the following GitHub Actions as reusable workflow templates under `/.github/workflows/`:

Workflow Template | Description
---|---
`./_check-pipelines-are-rendered.yaml` | checks that the committed rendered pipelines are up-to-date with their pipeline definitions
Before scheduling a pipeline, developers will likely want to run it manually to ensure it works as expected.
As we have already rendered the pipeline in "step 1", we now need a way to run it.
You will find the following items under `/step-2--run-pipelines/example_pipeline_1/`:

File/Directory | Description
---|---
`./run_pipeline.sh` | script that runs the rendered pipeline on the Kubeflow Pipelines backend
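What such a script does can be sketched with the KFP v1 SDK (`kfp<2`). This is a hedged sketch, not this repo's actual script; the function name and defaults are illustrative, and `host` is assumed to be the URL of your Kubeflow Pipelines API:

```python
def run_rendered_pipeline(host: str, package_path: str,
                          experiment_name: str = "Default",
                          arguments: dict = None):
    """Upload and run a pre-rendered pipeline package on a KFP backend."""
    # imported lazily so this sketch can be read without the SDK installed;
    # assumes the KFP v1 SDK (`kfp<2`)
    import kfp

    client = kfp.Client(host=host)
    return client.create_run_from_pipeline_package(
        pipeline_file=package_path,
        arguments=arguments or {},
        experiment_name=experiment_name,
    )
```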
To manage the pipeline schedules with GitOps, we need a system with the following features:
- Declarative Configs: The system should have a single set of configs which completely define the desired state of the scheduled pipelines.
- Reconciliation: The system should be able to read the declarative configs, determine if the current state matches the configs, and if not, make the required changes to bring the current state into alignment with the configs.
- Version Control: The system should store the declarative configs in a version control system, so that changes to the configs can be reviewed, and so that the history of changes can be viewed.
You will find the following items under `/step-3--schedule-pipelines/`:

File/Directory | Description
---|---
`./team-1/` | declarative configs for the "team 1" schedules
`./team-1/experiments.yaml` | declarative configs of KFP experiments for "team 1"
`./team-1/recurring_runs.yaml` | declarative configs of KFP recurring runs (schedules) for "team 1"
`./reconcile_team-1.sh` | script that reconciles the "team 1" configs with the cluster
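For illustration, an entry in `recurring_runs.yaml` might look like the following. Note that only `keep_history`, `job.enabled`, and `job.max_concurrency` are referenced in this README; the other field names are hypothetical, and the actual schema is defined by the reconciliation script:

```yaml
# hypothetical sketch of a recurring_runs.yaml entry
recurring_runs:
  - name: example-pipeline-hourly
    keep_history: 2            # how many old versions to retain (referenced below)
    job:
      enabled: true            # set to false to pause the recurring run
      max_concurrency: 1
      cron: "0 0 * * * *"
    # path to the rendered pipeline package (illustrative)
    pipeline_package: ../RENDERED_PIPELINE/pipeline.yaml
```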
WARNING:
Because Kubeflow Pipelines is NOT able to update existing recurring runs (kubeflow/pipelines#3789), the reconciliation script uses the following process:

- creates a paused recurring run with the new definition
- pauses the existing recurring run
  - NOTE: in-progress runs will continue to run until completion
- unpauses the new recurring run
- deletes old versions of the recurring run until there are only `keep_history` versions remaining
  - WARNING: in-progress runs for the deleted versions will be immediately terminated
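The `keep_history` pruning in the last step can be sketched as pure logic. This interprets the description above (the function name is illustrative, and the real script also makes KFP API calls):

```python
def versions_to_delete(version_ids: list, keep_history: int) -> list:
    """Return the recurring-run versions the reconciliation would delete.

    `version_ids` is ordered oldest -> newest; the newest entry is the
    currently active version and is always kept. Of the older versions,
    only the `keep_history` most recent are retained.
    """
    old_versions = version_ids[:-1]  # everything except the active (newest) version
    if keep_history <= 0:
        return old_versions  # keep_history=0: delete all old versions
    return old_versions[:-keep_history]
```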
WARNING:
The only way to ensure a recurring run never has more than one active instance is to do ONE of the following:

- set `keep_history` to `0` and `job.max_concurrency` to `1` (if your pipeline can safely be terminated at any time)
- create a step at the beginning of your pipeline which checks if there is already a run in progress, and if so, exits
WARNING:
Removing a recurring run from the `recurring_runs.yaml` file will NOT pause or delete any recurring runs already in the cluster. Deleting a recurring run requires the following steps:

- update the `job.enabled` flag to `false` for the recurring run (in the `recurring_runs.yaml` file)
- run the reconciliation script
- delete the recurring run from the `recurring_runs.yaml` file
- run the reconciliation script
- (optional) delete the remaining paused recurring runs using the KFP Web UI
We provide the following GitHub Actions as reusable workflow templates under `/.github/workflows/`:

Workflow Template | Description
---|---
`./_check-reconciliation-configs.yaml` | checks that the schedule reconciliation configs are valid
For true GitOps, we need to ensure the state of the cluster is ALWAYS in sync with the configs in this repo.
Generally speaking, there are two approaches to achieve automatic reconciliation:
- PUSH-Based (GitHub Actions): whenever a change is pushed to GitHub, a job is triggered to reconcile the configs.
- PULL-Based (Kubernetes Deployment): a Kubernetes Deployment in the cluster periodically reconciles the configs.
NOTE:
- This approach requires GitHub Actions to have access to your Kubeflow Pipelines API, either by it being public, or by connecting it to your private network.
- Drift is possible when the cluster state is changed outside the GitOps repo, because changes are only reverted when the next push occurs.
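A PUSH-based setup might look roughly like the following GitHub Actions workflow. This is a sketch only: the trigger paths, secret name, and script invocation are assumptions, not part of this repo's actual workflows:

```yaml
# hypothetical PUSH-based reconciliation workflow
name: reconcile-schedules
on:
  push:
    branches: [main]
    paths:
      - "step-3--schedule-pipelines/**"
jobs:
  reconcile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: reconcile team-1 schedules
        run: ./step-3--schedule-pipelines/reconcile_team-1.sh
        env:
          # the KFP API must be reachable from the runner (see NOTE above)
          KFP_API_URL: ${{ secrets.KFP_API_URL }}
```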
TBA
TBA