Synthesis of research related to deployment of Kedro to modern MLOps platforms
Authored with @alparibal
This document aims to cover the current state regarding deploying Kedro on enterprise-grade MLOps platforms:
- pain points observed integrating with distributed, container-based systems.
- feedback we gathered from Kedro developers, users, and plugin developers such as GetInData.
- learnings from implementing an MLRun-specific integration.
High level graphic summary of the problem space identified:
- A node within an orchestrator is typically an entire container.
- There is often a significant conceptual mismatch between a single Kedro node and an orchestrator container node.
- One needs to decide on what a "node" means in the orchestrator's environment i.e. the "granularity" of your nodes.
In a 1:1 mapping, a single Kedro node is translated to a single orchestrator node.
- Kedro encourages small, manageable nodes.
- These nodes contain smaller logic units than typical orchestrator containers.
- Distributing very small steps in orchestrators can lead to performance overhead. Consider running the pipeline in a single container mode (M:1 granularity) for efficiency.
Distributing each node also complicates the data flow between them:
- When the pipeline is run locally, non-persisted data is passed around as `MemoryDataset`s.
- When each step runs in isolation, this feature is lost and most implementations require every step to be persisted. See this section for more details.
Currently, most deployment plugins use 1:1 mapping and hence are impacted by these drawbacks.
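
To make the data-flow cost of a 1:1 mapping concrete, the following minimal sketch lists which datasets would have to be persisted once every node runs in its own container. It does not use the Kedro API: the `NODES` and `PERSISTED` structures are hypothetical stand-ins for a pipeline and its catalog.

```python
# Hypothetical, simplified stand-in for a pipeline: each node lists its inputs and outputs.
NODES = {
    "clean": {"inputs": ["raw"], "outputs": ["cleaned"]},
    "featurise": {"inputs": ["cleaned"], "outputs": ["features"]},
    "train": {"inputs": ["features"], "outputs": ["model"]},
}

# Dataset names assumed to already have a persistent catalog entry.
PERSISTED = {"raw", "model"}


def datasets_needing_persistence(nodes, persisted):
    """Return datasets that are produced by one node and consumed by another
    but have no persistent catalog entry (in-memory in a local run)."""
    produced = {name for spec in nodes.values() for name in spec["outputs"]}
    consumed = {name for spec in nodes.values() for name in spec["inputs"]}
    return sorted((produced & consumed) - set(persisted))


print(datasets_needing_persistence(NODES, PERSISTED))
```

Every name returned here is a dataset that a 1:1 deployment has to write to and read back from shared storage.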
In an M:1 mapping, the whole Kedro pipeline is run as a single node on the target platform.
- The main benefit is simplicity: one job goes to the orchestrator and is executed on a single machine.
- However, there are inefficiencies:
  - Large setup: Setting up the single node to execute all tasks involves creating an environment, handling potentially conflicting requirements, and more.
  - Limited parallelization: This approach often underutilizes available compute resources.
All dependencies need to be compatible in this configuration; see Requirements management in a Kedro project for more details.
In an M:N mapping, the full pipeline is divided into a set of sub-pipelines that can be run separately. Today, there is no obvious way to do this.
This approach provides a middle ground between the shortcomings of both the 1:1 and M:1 mappings:
- Small node groups form large buckets of work that justify the overhead of creating an execution environment.
- The orchestrator is free to schedule the sub-pipelines to be run in parallel / isolation.
Kedro is a fast, iterative development tool largely because the user is not required to think about execution contexts. This unmanaged complexity is why it is difficult to resolve this granularity mismatch in production contexts.
Piecemeal localised conventions for describing M:N granularity have emerged across mature users:
Convention | Merits | Drawbacks |
---|---|---|
Node tags | Simple to use, CLI accessible, applies across pipelines | No bounded context, zero validation |
Registered pipelines | Simple to use, conceptually maps to sub-pipelines, CLI accessible | No bounded context, zero validation |
Pipeline namespaces | Bounded context, CLI accessible, Visualisation integration | Harder to use, confusing error messages, verbose catalog¹ |
Each of these has merits and drawbacks. In every case, the user is given no easy way to validate if these groups are mutually exclusive or collectively exhaustive.
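
As an illustration of the missing validation, a check for mutually exclusive and collectively exhaustive groups could look roughly like the sketch below. It works on plain node names rather than Kedro objects; `check_mece` and the example groups are hypothetical.

```python
from collections import Counter


def check_mece(all_nodes, groups):
    """Report nodes in more than one group (not mutually exclusive), nodes in
    no group (not collectively exhaustive) and names that are not real nodes."""
    counts = Counter(node for members in groups.values() for node in set(members))
    return {
        "duplicated": sorted(node for node, n in counts.items() if n > 1),
        "ungrouped": sorted(set(all_nodes) - set(counts)),
        "unknown": sorted(set(counts) - set(all_nodes)),
    }


ALL_NODES = ["clean", "featurise", "train", "evaluate"]
GROUPS = {
    "preprocessing": ["clean", "featurise"],
    "training": ["featurise", "train"],
}

report = check_mece(ALL_NODES, GROUPS)
print(report)
```

The same check applies regardless of whether the groups come from tags, registered pipelines or namespaces, as long as each group can be resolved to a list of node names.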
Although namespaces are the most robust approach available (since v0.16.x), they are not in wide use across our power-user base. There are several hypotheses for the low adoption rate:
Hypothesis area | Comments |
---|---|
Confusing feature space | • namespaces != modular pipelines != micropackaging. Overlapping features, all unrelated to deployment, obscure the value for the user. • Today, namespaces are primarily used for visualisation and pipeline reuse, not deployment. • Internal monorepo tooling now covers much of the micropackaging feature space. |
UX | • Users have reported they dislike the catalog verbosity introduced by namespaces¹ • The error messages provided by Kedro when applying namespaces are unhelpful² |
¹ May be resolved by new dataset factory feature
² e.g. `Failed to map datasets and/or parameters: params:features`
Even for a mid-sized pipeline, it is not trivial to find the "optimum" grouping of nodes.
Approach | Thoughts |
---|---|
Manual Grouping | Pipeline developers are typically aware of broad groups (e.g. preprocessing, training). However, this is something that may take a while to stabilise during development |
Via Kedro metadata (nodes, tags, namespaces) | See M:N Mapping section above, each approach requires some human direction and unvalidated conventions. |
Via DAG branching | Nodes which split the pipeline graph into distinct branches can be used as sub-pipeline boundaries. This is similar to the mechanism used by `ParallelRunner` and `ThreadRunner`. |
Via persistence points | Treat nodes that persist data, i.e. nodes whose dataset type in the catalog is not `MemoryDataset`, as the starting node of a new group. The assumption is that users persist data after checkpointing meaningful work. In a theoretically perfect production system one would only persist the very end of the pipeline. |
After nodes are mapped to groups, several sanity checks and open questions need to be addressed:
- How may we enforce that the groups are still acyclic?
- Should a node be able to be re-used across multiple groups?
- How do we surface / manage un-grouped nodes?
- Do we try and add validation to registered pipelines / node tags to better bound their context?
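
For the first question, one possible check is to collapse the node-level dependencies into group-level dependencies and test the result for cycles. The sketch below uses only the standard library (`graphlib`, Python 3.9+); the `node_deps` and `node_to_group` inputs are hypothetical plain dictionaries, not Kedro objects.

```python
from graphlib import CycleError, TopologicalSorter


def group_graph(node_deps, node_to_group):
    """Collapse node-level dependencies into group-level dependencies."""
    deps = {group: set() for group in node_to_group.values()}
    for node, parents in node_deps.items():
        for parent in parents:
            src, dst = node_to_group[parent], node_to_group[node]
            if src != dst:  # dependencies inside a group are irrelevant here
                deps[dst].add(src)
    return deps


def groups_are_acyclic(node_deps, node_to_group):
    """Return True if the groups can be scheduled in some valid order."""
    try:
        TopologicalSorter(group_graph(node_deps, node_to_group)).prepare()
    except CycleError:
        return False
    return True


# A linear pipeline a -> b -> c, where each node maps to the nodes it depends on.
NODE_DEPS = {"a": [], "b": ["a"], "c": ["b"]}

print(groups_are_acyclic(NODE_DEPS, {"a": "g1", "b": "g1", "c": "g2"}))
print(groups_are_acyclic(NODE_DEPS, {"a": "g1", "b": "g2", "c": "g1"}))
```

The second grouping is rejected because `g1` would have to run both before and after `g2`, even though the underlying node graph is acyclic.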
A possible solution here is to introduce formal `before_pipelines_registered` and `after_pipelines_registered` hooks, which would expose the pipelines in a state where grouping validation could be injected and applied (see issue #3000). There is no way to do this on a portable, plug-in level at the time of writing.
- After the pipeline is broken down into groups, there must be a way to express these groups.
- This expression mechanism must be serialisable so that it can be stored, reused, and passed between Kedro core, plugins, and orchestrators.
A possible solution is to build upon `Pipeline.filter`. If run configuration parameters share the same names (`from_nodes`, `tags` etc.), then at execution time, we can get the pipeline with the given name and just execute `pipe.filter(**args)`.
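
A rough sketch of this idea follows. A group is stored as JSON holding a pipeline name plus keyword arguments, and is resolved at execution time by calling `filter` on the named pipeline. The helper names `dump_group` and `resolve_group` are hypothetical, and the argument names in `ALLOWED_FILTER_ARGS` are assumed to match those accepted by `Pipeline.filter` in the Kedro version in use.

```python
import json

# Assumption: these mirror the keyword arguments of Pipeline.filter.
ALLOWED_FILTER_ARGS = {
    "tags", "from_nodes", "to_nodes", "node_names",
    "from_inputs", "to_outputs", "node_namespace",
}


def dump_group(pipeline_name, **filter_args):
    """Serialise a group as a pipeline name plus filter arguments."""
    unknown = set(filter_args) - ALLOWED_FILTER_ARGS
    if unknown:
        raise ValueError(f"Unsupported filter arguments: {sorted(unknown)}")
    return json.dumps({"pipeline": pipeline_name, "filter": filter_args}, sort_keys=True)


def resolve_group(serialised, pipelines):
    """Look up the named pipeline and apply the stored filter arguments.

    `pipelines` is assumed to map names to objects exposing `.filter(**kwargs)`,
    such as the registered Kedro pipelines.
    """
    spec = json.loads(serialised)
    return pipelines[spec["pipeline"]].filter(**spec["filter"])


group_spec = dump_group("__default__", tags=["training"], from_nodes=["split"])
print(group_spec)
```

Because the specification is plain JSON, it can be written by a plugin at translation time and passed to each orchestrator step as a single string argument.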
- The Kedro project template comes with a single requirements file for the whole Kedro registry.
- The requirements of individual pipelines and nodes are not captured. All pipelines are usually run using the same environment.
- It can be hard to manage a single environment for large projects. We have evidence of users adopting a monorepo of distinct projects when this becomes a blocker.
- Modular pipelines do support localised `requirements.txt` files, but it is still up to the user to make these work neatly in independent environments.
There is a 1:1 relationship between pipeline granularity and the dependencies required for that scope. A full solution could include metadata such as dependencies, Docker base image, preferred execution engine (e.g. Pod, Spark job, Ray parallel processing), and other relevant aspects.
This section is closely coupled with Requirements management in a Kedro project.
- Most project-scoped CLI commands eagerly load all pipelines of the project.
- Since Kedro nodes keep a pointer to the function object that the node has to run, loading all pipelines means importing all modules at once:
- This is very expensive in large pipelines. The most common manifestation of this problem is where Kedro-Viz takes several minutes to load, despite not requiring a functional DAG.
- This hinders the ability to isolate different work teams within a project, e.g. the data science team has to install Spark and the data engineering team has to install TensorFlow.
There are active initiatives to address this, but no concrete progress has been made at the time of writing.
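
To illustrate the general idea of deferring imports, the sketch below stores a dotted path instead of a function object and imports the target only when it is called. This is not how Kedro nodes work today; `LazyFunction` is a hypothetical helper, and `math.sqrt` stands in for a node function that lives in a module with heavy dependencies.

```python
import importlib


class LazyFunction:
    """Holds a dotted path and imports the target only on first call."""

    def __init__(self, dotted_path):
        self.dotted_path = dotted_path
        self._func = None  # resolved lazily

    def __call__(self, *args, **kwargs):
        if self._func is None:
            module_name, _, attr = self.dotted_path.rpartition(".")
            self._func = getattr(importlib.import_module(module_name), attr)
        return self._func(*args, **kwargs)


lazy_sqrt = LazyFunction("math.sqrt")
print(lazy_sqrt.dotted_path)  # the reference can be inspected without importing
print(lazy_sqrt(9.0))         # the import happens here
```

With references like this, a tool that only needs the DAG structure could read node names, inputs and outputs without having every dependency installed.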
As described below, most deployment plugins run the Kedro CLI under the hood.
- When the execution of the pipeline is separated into multiple steps, a new `KedroSession` is created for each of these steps, and a separate `session_id` is assigned to each of them.
- This makes it hard to have a single overview of the pipeline execution.
This point has been raised by the community and there is ongoing work by the Kedro team. Users often report bypassing Kedro's `session_id` and introducing their own mechanism.
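
One common shape for such a user-defined mechanism is a run identifier that the orchestrator sets once and every step reads. The sketch below is an assumption about how this might be done, not a Kedro feature: the environment variable name `PIPELINE_RUN_ID` and the helper `get_run_id` are hypothetical.

```python
import os
import uuid

# Hypothetical variable name; the orchestrator would set it once per pipeline run.
RUN_ID_ENV_VAR = "PIPELINE_RUN_ID"


def get_run_id(environ=os.environ):
    """Return the run identifier shared by all steps, creating one if absent."""
    run_id = environ.get(RUN_ID_ENV_VAR)
    if not run_id:
        # Fallback for local runs where no orchestrator has set the variable.
        run_id = uuid.uuid4().hex
        environ[RUN_ID_ENV_VAR] = run_id
    return run_id


print(get_run_id({RUN_ID_ENV_VAR: "run-2023-09-01"}))
```

Each containerised step can then attach this value to its logs and tracked artifacts, so that separate sessions can be joined back into one pipeline execution.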
Kedro, by default, uses `MemoryDataset`s to hold intermediate data. However, this dataset type cannot be used in a distributed setting since containers do not share main memory.
Deployment plugins usually replace the `MemoryDataset` by either:
- Having a `Runner` implementation with another default dataset type.
  - From 0.18.12 onwards there is a more straightforward way of achieving this with dataset factories.
- Explicitly mapping catalog entries to another dataset type.
In either case, ephemeral data is, at least temporarily, persisted to storage (cloud bucket, Kubernetes volume, etc.). The [de-]serialisation of data throttles the pipeline execution speed and, in many cases, leads to worse performance in the distributed setting compared to a local run.
Some solutions, like the CNCF vineyard project, offer in-memory data access that might improve execution speed, but only in Kubernetes-specific situations.
There is a wider point here: granular information about the entire Kedro execution lifecycle often needs to be exposed to the underlying MLOps platform in order to maximise the features available.
- Most mature MLOps platforms differentiate between kinds of pipeline steps, models and artifacts.
- For example:
- SageMaker differentiates between ModelArtifacts, Processing and Training Steps
- MLRun has separate artifact kinds for datasets and models.
- Kedro Dataset classes do not contain metadata about the kind of data they store or load.
  - For example, a `PickleDataSet` can store any Python object and it is not known whether the dataset stores a model. In general, there is a strong argument that ONNX (LFAI) must be the default model serialisation mechanism within Kedro.
  - There are some 1st and 3rd party model specific datasets, but it is a manual exercise to classify these.
- Users who hit this problem are forced to rely on type hints or some sort of object introspection to retrieve this information (see example).
- Kedro hooks can be utilised to inspect objects at the right time during pipeline execution.
- At translation time, type annotations of the node functions can be used similarly.
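
The type-annotation approach can be sketched as follows. The return annotation of a node function is inspected to guess whether its output should be registered as a model or as a generic dataset. The `Model` marker class, the example functions and `artifact_kind` are hypothetical; a real project would check against the model classes of the libraries it uses.

```python
from typing import get_type_hints


class Model:
    """Hypothetical marker base class for anything to be tracked as a model."""


class LinearModel(Model):
    pass


def train(features: list) -> LinearModel:
    return LinearModel()


def clean(raw: list) -> list:
    return raw


def artifact_kind(func):
    """Guess the artifact kind of a node's output from its return annotation."""
    returned = get_type_hints(func).get("return")
    if isinstance(returned, type) and issubclass(returned, Model):
        return "model"
    # Unannotated or non-model outputs fall back to a generic dataset.
    return "dataset"


print(artifact_kind(train), artifact_kind(clean))
```

This only works when node functions are consistently annotated, which is the kind of convention that would need to be established and enforced.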
A potential solution here is to establish and enforce conventions. Introducing something like `AbstractModelDataSet` would make this much easier. We could also use the new `metadata` catalog key, but the onus is on the user to update this.
Currently, deployment plugins address the one-way task of converting a developed pipeline into a deployment. When deployment is viewed as an iterative process of development and deployment steps, additional gaps need to be bridged.
There are two popular configurations for coupling source code and platform: (1) tight and (2) loose.
- When the execution environment is not necessarily aware of ML concepts such as pipelines, models and artifacts, it is on the user to ensure that deployments are versioned correctly. For example, steps must be taken to avoid pushing untracked code into deployment.
- When the execution environment is loosely coupled with the project's source code (e.g. Databricks Repos, AzureML Environment, MLRun Function), the deployment platform usually maintains the linkage between code and pipeline execution.
- By design, Kedro separates code and configuration.
  - Configuration is not included as part of `kedro package`, in strict adherence with 12factor app.
- However, in most deployment patterns, the configuration is baked into the deployment.
  - Since v0.18.5 it is possible to easily pass a zip file containing configuration via the command line, but it is not easy to:
    - Point to a shared or cloud bucket location; only local directories are supported.
    - Inject configuration directly through the command line in something like JSON.
One option is to use environment variables in the configuration and manage environment variables at deployment time. There is significant complexity in doing this at scale.
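
As a minimal sketch of the environment-variable option, the function below substitutes `${VAR}` placeholders in an already loaded configuration dictionary. It uses only the standard library and does not represent Kedro's own configuration loaders; `resolve_env` and the example keys are hypothetical.

```python
import os
import string


def resolve_env(config, environ=os.environ):
    """Recursively substitute ${VAR} placeholders in string values."""
    if isinstance(config, dict):
        return {key: resolve_env(value, environ) for key, value in config.items()}
    if isinstance(config, list):
        return [resolve_env(value, environ) for value in config]
    if isinstance(config, str):
        # Template.substitute raises KeyError when a variable is not set,
        # which surfaces missing deployment settings early.
        return string.Template(config).substitute(environ)
    return config


catalog_entry = {"filepath": "${DATA_BUCKET}/model.pkl", "versioned": True}
print(resolve_env(catalog_entry, {"DATA_BUCKET": "s3://example-bucket"}))
```

The complexity at scale comes from keeping the set of variables consistent across every step, environment and platform, which this sketch does not address.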
In a setup where the pipeline is continuously deployed, repeating the same deployment workflow may lead to inefficiencies:
- re-translating the pipeline: Ideally, the pipeline is translated only for changes ("deltas") in the repo that alter the structure of the pipeline. For example, some changes in the bound node function should not necessitate re-translation.
- re-creating the environment: When source code is injected into the execution environment, the same Docker image should be re-used across several versions of the deployment. Some platforms support this out-of-the-box (MLRun, AzureML, Databricks).
It might be possible to implement a platform-agnostic solution, e.g. cloning the repo at execution time before executing the Kedro command.
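
One way to detect whether re-translation is needed is to fingerprint only the structure of the pipeline and compare it between commits. The sketch below assumes the pipeline has been exported to a list of dictionaries with `name`, `inputs` and `outputs` keys; this format and the `structure_fingerprint` helper are hypothetical.

```python
import hashlib
import json


def structure_fingerprint(nodes):
    """Hash only the DAG structure: node names, inputs and outputs.

    Other keys (for example the function body) are deliberately ignored.
    """
    structure = sorted(
        (n["name"], list(n["inputs"]), list(n["outputs"])) for n in nodes
    )
    payload = json.dumps(structure).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


NODES_V1 = [
    {"name": "clean", "inputs": ["raw"], "outputs": ["cleaned"], "body": "v1"},
    {"name": "train", "inputs": ["cleaned"], "outputs": ["model"], "body": "v1"},
]
# Same structure, but the function body of `train` changed.
NODES_V2 = [
    {"name": "train", "inputs": ["cleaned"], "outputs": ["model"], "body": "v2"},
    {"name": "clean", "inputs": ["raw"], "outputs": ["cleaned"], "body": "v1"},
]
# The output of `clean` was renamed, so the structure changed.
NODES_V3 = [
    {"name": "clean", "inputs": ["raw"], "outputs": ["tidy"], "body": "v1"},
    {"name": "train", "inputs": ["tidy"], "outputs": ["model"], "body": "v1"},
]

print(structure_fingerprint(NODES_V1) == structure_fingerprint(NODES_V2))
print(structure_fingerprint(NODES_V1) == structure_fingerprint(NODES_V3))
```

A deployment workflow could store the fingerprint of the last translated pipeline and skip translation whenever the new fingerprint matches.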
There may be some situations where Kedro integrating with a target platform leaves much of the platform feature set under-utilised. From the platform's perspective, deployed Kedro pipelines may feel like "closed boxes".
For many deployment plugins, translating a Kedro pipeline means encapsulating the Kedro project within a Docker container and executing specific nodes via the Kedro CLI.
So, pipeline execution depends on Kedro in two ways:
- Session management:
- Kedro still manages the run context, execution order, as well as importing and running lifecycle hooks.
- This gives the user a familiar way to modify execution behaviour but can also be limiting for the orchestrator. For example, the nodes in the pipeline may not be fully transparent to the orchestrator in the case of M:1 mapping.
- While it might be possible to remove session management from a simple project, it becomes very challenging when the project heavily utilizes hooks or any sort of dynamic pipelining.
- I/O:
- Kedro datasets contain arbitrary custom logic that cannot be reliably mapped to data loading native logic supported by the orchestrator or platform.
- If platforms are opinionated (like SageMaker `/opt/models`) in the way that they handle artifact management, these features will often be bypassed and not automatically available to the users.
- Distributed `session_id` Setting: Simplify `session_id` management in distributed Kedro pipelines (see issue #2182).
- Artifact kind assignment: Enhance dataset integration with artifact kinds. Make ONNX the default path.
- `M:N` Groups in Kedro: Establish conventions for `M:N` groups with deployment focus (see kedro-plugins PR#241).
- Modular Requirements: Simplify pipeline deployments and development constraints. (Slack conversation)
- Group-Level Validation Hooks: Add hooks for enforcing constraints like MECE pipelines (see issue #3000).
- Lazy loading of pipeline structure: Enable DAG resolution without dependencies present in the environment (#2829).
- Make Kedro pipeline serialisable: Inputs, outputs and fully qualified function references would enable easier translation into target DSLs. JSON target seems reasonable.
- Deterministic toposort: Users often report that the sort order is not reproducible, this affects any implicit grouping strategies considerably.
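
To illustrate the last point, a topological sort becomes reproducible once ties between nodes that are ready at the same time are broken by a fixed rule. The sketch below is Kahn's algorithm with a min-heap so that ties are resolved alphabetically. It is not Kedro's implementation, and `deterministic_toposort` is a hypothetical helper operating on plain node names.

```python
import heapq


def deterministic_toposort(deps):
    """Kahn's algorithm with a min-heap so ties are broken alphabetically.

    `deps` maps every node name to the names it depends on; each dependency
    must itself be a key of `deps`.
    """
    remaining = {node: set(parents) for node, parents in deps.items()}
    children = {node: [] for node in deps}
    for node, parents in remaining.items():
        for parent in parents:
            children[parent].append(node)

    ready = [node for node, parents in remaining.items() if not parents]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)  # always the alphabetically smallest
        order.append(node)
        for child in children[node]:
            remaining[child].discard(node)
            if not remaining[child]:
                heapq.heappush(ready, child)

    if len(order) != len(deps):
        raise ValueError("Dependency graph contains a cycle")
    return order


DEPS = {
    "train": {"features"},
    "features": {"clean_b", "clean_a"},
    "clean_b": set(),
    "clean_a": set(),
    "report": {"clean_a"},
}
print(deterministic_toposort(DEPS))
```

The result does not depend on dictionary or set iteration order, so any grouping strategy derived from the sort order stays stable between runs.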
Almost all plugins rely on a Docker image to wrap the Kedro project. The Docker image is usually built just before executing the pipeline, and source code is copied into the image as part of the build.
- A Docker container is spun up from this image on the MLOps platform.
- The Kedro pipeline is run through the Kedro CLI.
- Plugins provide hooks and datasets to manage the communication between Kedro and the platform.
- This communication includes mapping Kedro datasets to platform artifacts, managing experiment tracking via MLflow deployed on the platform and [de-]serialisation of `MemoryDatasets`.
It is also worth noting that beyond data management and experiment tracking, deployment plugins often fail to leverage or unlock the full potential of platform-specific capabilities.
These unused capabilities include:
- serving trained models via an endpoint
- labelling and retraining workflows
- incorporating feature stores
- model monitoring
Plugin | Mapping Support | Handling Memory Datasets | Execution Setup | Source Code | Translation | Platform Integration Reflections |
---|---|---|---|---|---|---|
kedro-airflow[O] | 1:1 only | Not supported | Airflow-defined environment | Source available in executor; cwd set to project path | Kedro DAG -> Python script using Airflow API | Designed for Airflow, not for container platforms. |
kedro-docker[O] | M:1 only | N/A since M:1 | Dockerfile environment | Source available via Docker mount at execution time | No translation or orchestration | Introductory, platform-agnostic tutorial. |
kedro-sagemaker[G] | 1:1 only | Cloudpickle & AWS bucket | Dockerfile environment | Source copied to container at build; auto-rebuilt | SageMakerPipeline object using SageMaker API | MLflow tracking, native pipeline visualization. |
kedro-vertexai[G] | 1:1 only | Cloudpickle & GCS | Dockerfile environment | Source expected in container | Kubeflow Pipelines DSL | MLflow tracking, elastic machine allocation. |
kedro-azureml[G] | 1:1 only | Cloudpickle & Azure Blob | AzureML Environment | Source in AzureML Environment | Inputs/outputs -> AzureML counterparts | AzureMLPipelineDataSet, MLflow, distributed training. |
kedro-kubeflow[G] | 1:1, M:1 | KFP Volumes | Dockerfile environment | Source expected in container | Kubeflow Pipelines DSL | Scaling issue with KFP Volumes. |
kedro-mlrun | 1:1, M:N (prototype) | MLRun artifact | Dockerfile environment | Source fetched from repo by MLRun | Kubeflow Pipelines DSL | Native model tracking, serving, pipeline visualization. |
* [O] maintained by the Kedro org, [G] maintained by the GetInData org