Commit 2d595db

docs: add machinepool rationale doc
Signed-off-by: Bharath Nallapeta <[email protected]>
1 parent 1c45137 commit 2d595db

5 files changed (+126 / -16 lines)

docs/book/src/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
44
[Quick Start](./user/quick-start.md)
55
[Quick Start Operator](./user/quick-start-operator.md)
66
[Concepts](./user/concepts.md)
7+
- [MachinePool](./concepts/machinepool.md)
78
[Manifesto](./user/manifesto.md)
89
- [Tasks](./tasks/index.md)
910
- [Certificate Management](./tasks/certs/index.md)
docs/book/src/concepts/machinepool.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# MachinePool

## Table of Contents

- [Introduction](#introduction)
- [What is a MachinePool?](#what-is-a-machinepool)
- [Why MachinePool?](#why-machinepool)
- [When to use MachinePool vs MachineDeployment](#when-to-use-machinepool-vs-machinedeployment)
- [Next steps](#next-steps)

## Introduction

Cluster API (CAPI) manages Kubernetes worker nodes primarily through Machine, MachineSet, and MachineDeployment objects. These primitives represent each node as an individual Machine and have served well across a wide variety of providers.

However, many infrastructure providers already offer first-class abstractions for groups of compute instances: Auto Scaling Groups (ASG) on AWS, Virtual Machine Scale Sets (VMSS) on Azure, and Managed Instance Groups (MIG) on GCP. These primitives natively support scaling, rolling upgrades, and health management.

MachinePool brings these provider features into Cluster API by introducing a higher-level abstraction for managing a group of machines as a single unit.

## What is a MachinePool?

A MachinePool is a Cluster API resource representing a group of worker nodes. Instead of reconciling each machine individually, CAPI delegates lifecycle management of the group to the infrastructure provider.

- **MachinePool (core API)**: defines the desired state (replicas, Kubernetes version, bootstrap configuration reference, infrastructure reference).
- **InfrastructureMachinePool (provider API)**: the provider-specific resource that backs a pool. A provider may offer more than one type, depending on whether the group is self-managed or run by a managed Kubernetes service. For example:
  - `AWSMachinePool`: self-managed ASG
  - `AWSManagedMachinePool`: EKS managed node group
  - `AzureMachinePool`: VM Scale Set
  - `AzureManagedMachinePool`: AKS managed node pool
  - `GCPManagedMachinePool`: GKE managed node pool
  - `OCIManagedMachinePool`: OKE managed node pool
  - `ScalewayManagedMachinePool`: Scaleway Kapsule node pool
- **Bootstrap configuration**: still applies (e.g., kubeadm configs), ensuring that new nodes join the cluster with the correct setup.

The MachinePool controller coordinates between the Cluster API core and provider-specific implementations:

- Reconciles desired replicas with the infrastructure pool.
- Matches provider IDs from the infrastructure resource with Kubernetes Nodes in the workload cluster.
- Updates MachinePool status (ready replicas, conditions, etc.).
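As a minimal sketch of how these pieces fit together, the hypothetical `render_machinepool` helper below prints a MachinePool that points at one bootstrap config and one infrastructure pool. It assumes the `cluster.x-k8s.io/v1beta1` MachinePool API, a kubeadm `KubeadmConfig`, and CAPA's `AWSMachinePool` at `infrastructure.cluster.x-k8s.io/v1beta2`; all names are illustrative, and the API versions should be checked against the providers you actually run.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a MachinePool manifest that references a single
# bootstrap config and a single infrastructure pool (API versions are assumed).
render_machinepool() {
  local cluster="$1" name="$2" replicas="$3" version="$4"
  cat <<EOF
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: ${name}
spec:
  clusterName: ${cluster}
  replicas: ${replicas}          # desired number of nodes in the pool
  template:
    spec:
      clusterName: ${cluster}
      version: ${version}        # Kubernetes version for every node in the pool
      bootstrap:
        configRef:               # one bootstrap config shared by the whole pool
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfig
          name: ${name}
      infrastructureRef:         # provider-specific pool (here backed by an ASG)
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachinePool
        name: ${name}
EOF
}

render_machinepool my-cluster my-cluster-mp-0 3 v1.30.0
```

Piping the output to `kubectl apply -f -` against a management cluster would create the MachinePool; the referenced `KubeadmConfig` and `AWSMachinePool` are separate objects that must exist as well.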
## Why MachinePool?

### Leverage provider primitives

Most cloud providers already manage scaling, instance replacement, and health monitoring at the group level. MachinePool lets CAPI delegate these lifecycle operations to the provider instead of duplicating that logic.

**Example:**
- AWS Auto Scaling Groups replace unhealthy instances automatically.
- Azure VM Scale Sets support rolling upgrades with configurable surge/availability strategies.

### Simplify upgrades and scaling

Upgrades and scaling events are managed at the pool level:
- Update the Kubernetes version or bootstrap configuration → the infrastructure provider handles the rolling replacement of instances.
- Scale replicas up or down → the provider adjusts the pool's capacity.

This keeps upgrade and scaling behavior consistent with the cloud's own tooling, instead of having Cluster API reconcile many individual Machine objects.
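A minimal sketch, assuming `kubectl` access to the management cluster and the `cluster.x-k8s.io/v1beta1` MachinePool schema: scaling edits `spec.replicas`, and an upgrade edits `spec.template.spec.version`. The helpers `scale_patch` and `upgrade_patch` are hypothetical; they only build JSON merge patches and send nothing to a cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helpers that build JSON merge patches for a MachinePool.

# Scaling: change only the desired replica count.
scale_patch() {
  local replicas="$1"
  printf '{"spec":{"replicas":%d}}' "$replicas"
}

# Upgrading: change only the Kubernetes version in the pool template;
# the rollout itself is left to the infrastructure provider.
upgrade_patch() {
  local version="$1"
  printf '{"spec":{"template":{"spec":{"version":"%s"}}}}' "$version"
}

scale_patch 5; echo
upgrade_patch v1.30.1; echo
```

For example, `kubectl patch machinepool my-cluster-mp-0 --type merge -p "$(upgrade_patch v1.30.1)"` would request the upgrade. Patching a single field keeps the change small; everything else in the pool template stays as it was.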
### Autoscaling integration

MachinePool integrates with the Cluster Autoscaler in the same way that MachineDeployments do. In practice, the autoscaler's Cluster API provider treats a MachinePool as a node group, scaling it up when pods cannot be scheduled and down when nodes are underutilized.
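As a sketch, assuming the Cluster Autoscaler runs with its Cluster API provider (in a version that supports MachinePools and with RBAC to read and scale them), a pool opts in as a node group through min/max size annotations. The `autoscaler_metadata` helper is hypothetical; the annotation keys are the ones read by the autoscaler's Cluster API provider.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the metadata block that marks a MachinePool as an
# autoscaler node group with the given bounds. Fails if min > max.
autoscaler_metadata() {
  local name="$1" min="$2" max="$3"
  if (( min > max )); then
    echo "min size ($min) must not exceed max size ($max)" >&2
    return 1
  fi
  cat <<EOF
metadata:
  name: ${name}
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "${min}"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "${max}"
EOF
}

autoscaler_metadata my-cluster-mp-0 1 10
```

The values are quoted because Kubernetes annotation values must be strings.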
### Tradeoffs and limitations

While powerful, MachinePool comes with tradeoffs:

- **Infrastructure provider complexity**: infrastructure providers must implement and maintain an InfrastructureMachinePool type.
- **Less per-machine granularity**: you cannot configure each node individually; the pool defines a shared template.
  > **Note**: Some cloud providers do allow limited variation within a pool.
  > **Example**: AWS offers `AWSMachinePool.spec.mixedInstancesPolicy.instancesDistribution`, while Azure offers `AzureMachinePool.spec.orchestrationMode`.
- **Complex reconciliation**: matching provider IDs to Nodes introduces edge cases, such as delays before a new instance registers as a Node and temporarily inconsistent states.
- **Draining**: The cloud resources behind a MachinePool do not necessarily drain Kubernetes worker nodes before removing instances. For example, with an AWSMachinePool, AWS normally terminates instances as quickly as possible. Graceful draining requires installing tools such as `aws-node-termination-handler` combined with ASG lifecycle hooks (defined in `AWSMachinePool.spec.lifecycleHooks`); this is not a built-in feature of the infrastructure provider (CAPA in this example).
- **Maturity**: The MachinePool API is still considered experimental/beta.
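The provider ID matching behind the "Complex reconciliation" tradeoff can be pictured with a toy sketch. Assume the pool's `spec.providerIDList` and the workload cluster's Nodes (name and `spec.providerID`) have been exported to two text files. The hypothetical `match_provider_ids` function pairs them up and reports instances that have no Node yet, which is the delay edge case. The real controller is written in Go and works against live API objects.

```bash
#!/usr/bin/env bash
# Toy illustration of provider-ID matching (needs bash 4+ for associative arrays).
#   $1: file with one provider ID per line (the pool's spec.providerIDList)
#   $2: file with "<node-name> <provider-id>" per line (workload cluster Nodes)
match_provider_ids() {
  local -A node_by_id=()
  local node id

  # Index Nodes by their provider ID.
  while read -r node id; do
    if [[ -n "$id" ]]; then
      node_by_id["$id"]="$node"
    fi
  done < "$2"

  # Walk the pool's provider IDs and look each one up.
  while read -r id; do
    [[ -z "$id" ]] && continue
    if [[ -n "${node_by_id[$id]:-}" ]]; then
      echo "matched $id -> ${node_by_id[$id]}"
    else
      # The instance exists, but no Node has registered with this ID yet.
      echo "pending $id"
    fi
  done < "$1"
}

ids="$(mktemp)"
nodes="$(mktemp)"
printf '%s\n' 'aws:///us-east-1a/i-0aaa' 'aws:///us-east-1b/i-0bbb' > "$ids"
printf '%s\n' 'ip-10-0-1-10 aws:///us-east-1a/i-0aaa' > "$nodes"
match_provider_ids "$ids" "$nodes"
rm -f "$ids" "$nodes"
```

With a real cluster, the Nodes file could come from `kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID --no-headers`.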
## When to use MachinePool vs MachineDeployment

Both MachineDeployment and MachinePool are valid options for managing worker nodes in Cluster API. The right choice depends on your infrastructure provider's capabilities and your operational requirements.

### Use MachinePool when:

- **The cloud provider supports scaling group primitives**: AWS Auto Scaling Groups, Azure Virtual Machine Scale Sets, GCP Managed Instance Groups, OCI Instance Pools, and Scaleway Kapsule node pools. These resources natively handle scaling, rolling upgrades, and health checks.
- **You want to leverage cloud provider-level features**: MachinePool enables direct use of cloud-native upgrade strategies (e.g., surge, maxUnavailable) and autoscaling behaviors.
- **You are operating medium-to-large node groups**: Managing 50+ nodes through individual Machine objects can add significant reconciliation overhead. MachinePool reduces this by consolidating the group into a single object.

### Use MachineDeployment when:

- **The provider does not support scaling groups**: This is common in environments such as bare metal, vSphere, or Docker.
- **You need fine-grained per-machine control**: MachineDeployments create a separate Machine, with its own BootstrapConfig and infrastructure machine, for every node, so individual nodes can be inspected and managed as distinct objects.
- **You prefer maturity and portability**: MachineDeployment is stable, GA, and supported across all providers. MachinePool remains experimental in some implementations.
- **Your clusters are small**: For clusters with only a handful of nodes, the additional API object overhead from Machines is minimal, and MachineDeployment provides simpler semantics.

## Next Steps

- **Enable the feature**: [MachinePool Experimental Feature](../tasks/experimental-features/machine-pools.md)
- **Developer documentation**: [MachinePool Controller](../developer/core/controllers/machine-pool.md)
- **Future work**: Planned improvements are tracked in [issue #9005](https://github.com/kubernetes-sigs/cluster-api/issues/9005).

docs/book/src/developer/core/controllers/machine-pool.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
22

33
![](../../../images/cluster-admission-machinepool-controller.png)
44

5+
📖 **For conceptual information about MachinePools, when to use them, and how they compare to MachineDeployments**, see the [MachinePool Concepts Guide](../../../concepts/machinepool.md).
6+
57
The MachinePool controller's main responsibilities are:
68

79
* Setting an OwnerReference on each MachinePool object to:
docs/book/src/tasks/experimental-features/machine-pools.md

Lines changed: 24 additions & 16 deletions
@@ -1,31 +1,34 @@
11
# Experimental Feature: MachinePool (beta)
22

3-
The `MachinePool` feature provides a way to manage a set of machines by defining a common configuration, number of desired machine replicas etc. similar to `MachineDeployment`,
4-
except `MachineSet` controllers are responsible for the lifecycle management of the machines for `MachineDeployment`, whereas in `MachinePools`,
5-
each infrastructure provider has a specific solution for orchestrating these `Machines`.
3+
The `MachinePool` feature provides a way to manage a set of machines by leveraging infrastructure provider scaling groups (e.g., AWS Auto Scaling Groups, Azure VM Scale Sets) rather than managing individual machines through MachineDeployments.
64

75
**Feature gate name**: `MachinePool`
86

97
**Variable name to enable/disable the feature gate**: `EXP_MACHINE_POOL`
108

11-
Infrastructure providers can support this feature by implementing their specific `MachinePool` such as `AzureMachinePool`.
9+
## Overview
1210

13-
More details on `MachinePool` can be found at:
14-
[MachinePool CAEP](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20190919-machinepool-api.md)
11+
Infrastructure providers can support this feature by implementing their specific `MachinePool` such as `AWSMachinePool` or `AzureMachinePool`.
1512

16-
For developer docs on the MachinePool controller, see [here](./../../developer/core/controllers/machine-pool.md).
13+
📖 **For comprehensive information about MachinePool concepts, use cases, and comparisons with MachineDeployment**, see the [MachinePool Concepts Guide](../../concepts/machinepool.md).
1714

18-
## MachinePools vs MachineDeployments
15+
## Enabling MachinePool
1916

20-
Although MachinePools provide a similar feature to MachineDeployments, MachinePools do so by leveraging an InfraMachinePool which corresponds 1:1 with a resource like VMSS on Azure or Autoscaling Groups on AWS which we treat as a black box. When a MachinePool is scaled up, the InfraMachinePool scales itself up and populates its provider ID list based on the response from the infrastructure provider. On the other hand, when a MachineDeployment is scaled up, new Machines are created which then create an individual InfraMachine, which corresponds to a VM in any infrastructure provider.
17+
Starting from Cluster API v1.7, MachinePool is enabled by default. No additional configuration is needed.
2118

22-
| MachinePools | MachineDeployments |
23-
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
24-
| Creates new instances through a single infrastructure resource like VMSS in Azure or Autoscaling Groups in AWS. | Creates new instances by creating new Machines, which create individual VM instances on the infra provider. |
25-
| Set of instances is orchestrated by the infrastructure provider. | Set of instances is orchestrated by Cluster API using a MachineSet. |
26-
| Each MachinePool corresponds 1:1 with an associated InfraMachinePool. | Each MachineDeployment includes a MachineSet, and for each replica, it creates a Machine and InfraMachine. |
27-
| Each MachinePool requires only a single BootstrapConfig. | Each MachineDeployment uses an InfraMachineTemplate and a BootstrapConfigTemplate, and each Machine requires a unique BootstrapConfig. |
28-
| Maintains a list of instances in the `providerIDList` field in the MachinePool spec. This list is populated based on the response from the infrastructure provider. | Maintains a list of instances through the Machine resources owned by the MachineSet. |
19+
For Cluster API versions prior to v1.7, you need to set the `EXP_MACHINE_POOL` environment variable:
20+
21+
```bash
22+
export EXP_MACHINE_POOL=true
23+
clusterctl init
24+
```
25+
26+
Or when upgrading an existing management cluster:
27+
28+
```bash
29+
export EXP_MACHINE_POOL=true
30+
clusterctl upgrade apply --contract v1beta1
31+
```
2932

3033
## MachinePool provider implementations
3134

@@ -38,3 +41,8 @@ The following Cluster API infrastructure providers have implemented support for
3841
| GCP | `GCPMachinePool` | In Progress | https://github.com/kubernetes-sigs/cluster-api-provider-gcp/pull/1506 |
3942
| OCI | `OCIManagedMachinePool`<br> `OCIMachinePool` | Implemented, MachinePoolMachines supported | https://oracle.github.io/cluster-api-provider-oci/managed/managedcluster.html |
4043
| Scaleway | `ScalewayManagedMachinePool` | Implemented | https://github.com/scaleway/cluster-api-provider-scaleway/blob/main/docs/scalewaymanagedmachinepool.md |
44+
45+
## Additional Resources
46+
47+
- **Design Document**: [MachinePool CAEP](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20190919-machinepool-api.md)
48+
- **Developer Documentation**: [MachinePool Controller](./../../developer/core/controllers/machine-pool.md)

docs/book/src/user/concepts.md

Lines changed: 4 additions & 0 deletions
@@ -61,6 +61,10 @@ A MachineDeployment provides declarative updates for Machines and MachineSets.
6161

6262
A MachineDeployment works similarly to a core Kubernetes [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/). A MachineDeployment reconciles changes to a Machine spec by rolling out changes to 2 MachineSets, the old and the newly updated.
6363

64+
### MachinePool
65+
66+
A MachinePool is a declarative spec for a group of Machines. It is similar to a MachineDeployment, but delegates lifecycle management of the group to a provider-specific resource, such as an AWS Auto Scaling Group or an Azure VM Scale Set. For more information, see [MachinePool](../concepts/machinepool.md).
67+
6468
### MachineSet
6569

6670
A MachineSet's purpose is to maintain a stable set of Machines running at any given time.
