Controlling Fault Domains

The operator provides multiple options for defining fault domains for your cluster. The fault domain defines how data is replicated and how processes and coordinators are distributed across machines. Choosing a fault domain is an important part of managing your deployments.

Fault domains are controlled through the faultDomain field in the cluster spec.

Option 1: Single-Kubernetes Replication

If the faultDomain field is not specified, the following configuration is used:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 7.1.26
  faultDomain:
    key: kubernetes.io/hostname
    valueFrom: spec.nodeName

This configuration will set the fdbmonitor configuration for all processes to use the value from spec.nodeName on the pod as the zoneid locality field:

[fdbserver.1]
locality_zoneid = $FDB_ZONE_ID

The operator will spread pods across nodes using a preferred pod anti-affinity rule (preferredDuringSchedulingIgnoredDuringExecution), which discourages running multiple pods of the same process class from the same cluster on the same node.

Fault domain from node label

NOTE: This feature requires the unified image.

The fdb-kubernetes-monitor can read the labels from the node it is running on and expose them as environment variables. The node label keys are prefixed with "NODE_LABEL", and "/" and "." in the key are replaced with "_". For example, the label "foundationdb.org/testing = awesome" produces the environment variable "NODE_LABEL_FOUNDATIONDB_ORG_TESTING = awesome". This feature requires that the appropriate RBAC permissions (ServiceAccount) for the unified image are set up. Those labels can then be added as localities or used as valueFrom in the faultDomain configuration of the cluster spec. The environment variable name in valueFrom must be prefixed with a $.
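
As a sketch of how this fits together, the generated environment variable can be referenced from the fault domain configuration. This example assumes the nodes carry the topology.kubernetes.io/zone label and that node watching is enabled for the unified image (see the three-data-hall example below):

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 7.1.26
  imageType: unified
  faultDomain:
    key: topology.kubernetes.io/zone
    # The fdb-kubernetes-monitor derives this variable from the node label topology.kubernetes.io/zone.
    valueFrom: $NODE_LABEL_TOPOLOGY_KUBERNETES_IO_ZONE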

Overriding pod anti-affinity

You can override the pod anti-affinity rules generated by the operator by specifying your own in the pod spec template (either in the general settings or for a specific process class), for example to enforce requiredDuringSchedulingIgnoredDuringExecution:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 7.1.26
  processes:
    storage:
      podTemplate:
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - topologyKey: kubernetes.io/hostname
                  labelSelector:
                    matchLabels:
                      foundationdb.org/fdb-cluster-name: sample-cluster
                      foundationdb.org/fdb-process-class: storage

This example spreads storage pods across nodes and enforces that two or more storage pods from this cluster are never scheduled on the same node.

Using Kubernetes failure zones

There is no clear pattern in Kubernetes for allowing pods to access node information other than the host name, which makes it challenging to use any other kind of fault domain. If you have some other mechanism to make this information available in your pod's environment, you can tell the operator to use an environment variable as the source for the zone locality:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 7.1.26
  faultDomain:
    key: topology.kubernetes.io/zone
    valueFrom: $RACK

If you specify this RACK variable in the cluster's spec.sidecarVariables, the zoneid locality will be set to whatever is in the RACK environment variable of the containers that provide the monitor conf, which are foundationdb-kubernetes-init and foundationdb-kubernetes-sidecar. For ideas on how to inject environment variables, see ADDITIONAL_ENV_FILE in Warnings.
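
A minimal sketch of this wiring, assuming RACK is injected as a static value on the init and sidecar containers (any other injection mechanism, such as ADDITIONAL_ENV_FILE, works the same way; the value rack1 is only a placeholder):

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 7.1.26
  # Make the RACK variable available for substitution in the monitor conf.
  sidecarVariables:
    - RACK
  faultDomain:
    key: topology.kubernetes.io/zone
    valueFrom: $RACK
  processes:
    general:
      podTemplate:
        spec:
          initContainers:
            - name: foundationdb-kubernetes-init
              env:
                - name: RACK
                  value: rack1 # Placeholder; inject the real value per zone.
          containers:
            - name: foundationdb-kubernetes-sidecar
              env:
                - name: RACK
                  value: rack1 # Placeholder; must match the init container.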

Option 2: Multi-Kubernetes Replication

Our second strategy is to run multiple Kubernetes clusters, each as its own fault domain. This strategy adds significant operational complexity, but may give you stronger fault domains and thus more reliable deployments. You can enable this strategy by using a special key in the fault domain:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 7.1.26
  processGroupIDPrefix: zone2
  faultDomain:
    key: foundationdb.org/kubernetes-cluster
    value: zone2
    zoneIndex: 2
    zoneCount: 5

This tells the operator to use the value "zone2" as the fault domain for every process it creates. The zoneIndex and zoneCount tell the operator where this fault domain sits within the list of Kubernetes clusters (KCs) you are using in this DC. This is used to divide processes across fault domains. For instance, if this cluster needs 7 stateless processes divided across 5 fault domains, the zones with zoneIndex 1 and 2 will allocate 2 stateless processes each, and the zones with zoneIndex 3, 4, and 5 will allocate 1 stateless process each.

When running across multiple KCs, you will need to apply more care in managing the configurations to make sure all the KCs converge on the same view of the desired configuration. You will likely need some kind of external, global system to store the canonical configuration and push it out to all of your KCs. You will also need to make sure that the different KCs are not fighting each other to control the database configuration.

You must always specify a processGroupIDPrefix when deploying an FDB cluster across multiple Kubernetes clusters, and you must set it to a different value in each Kubernetes cluster. This prevents duplicate process group IDs across the different Kubernetes clusters.
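
For illustration, the spec applied in another Kubernetes cluster of the same DC would differ only in the prefix, fault domain value, and zone index (the zone3 name and index below are assumptions, not values from the example above):

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 7.1.26
  # A distinct prefix per Kubernetes cluster avoids duplicate process group IDs.
  processGroupIDPrefix: zone3
  faultDomain:
    key: foundationdb.org/kubernetes-cluster
    value: zone3
    zoneIndex: 3
    zoneCount: 5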

Option 3: Fake Replication

In local test environments, you may not have any real fault domains to use, and may not care about availability. You can test in this environment while still having replication enabled by using fake fault domains:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 7.1.26
  faultDomain:
    key: foundationdb.org/none

This strategy uses the pod name as the fault domain, which allows each process to act as a separate fault domain. Any hardware failure could lead to a complete loss of the cluster. This configuration should not be used in any production environment.

Three-Data-Hall Replication

Three-data-hall replication can be used to replicate data across three data halls or availability zones. This requires that the fault domains are properly labeled on the Kubernetes nodes; most cloud providers use the well-known label topology.kubernetes.io/zone for this. NOTE: This setup expects the unified image to be used with operator version v1.53.0 or newer. It also requires that the fdb-kubernetes-monitor is able to read node resources to get the node labels from them; the RBAC setup can be used for this. We then create a FoundationDBCluster that the operator will spread across the availability zones.

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  labels:
    cluster-group: test-cluster
  name: sample-cluster
spec:
  # The unified image supports reading node labels, which makes setting up a three-data-hall
  # cluster easier.
  imageType: unified
  version: 7.1.63
  faultDomain:
    key: kubernetes.io/hostname
  processCounts:
    stateless: -1
  databaseConfiguration:
    # Ensure that enough coordinators are available. The processes will be spread across the different zones.
    logs: 9
    storage: 9
    redundancy_mode: "three_data_hall"
  processes:
    general:
      customParameters:
        # This is a special environment variable that will be populated by the fdb-kubernetes-monitor.
        - "locality_data_hall=$NODE_LABEL_TOPOLOGY_KUBERNETES_IO_ZONE"
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: "16G"
      podTemplate:
        spec:
          securityContext:
            runAsUser: 4059
            runAsGroup: 4059
            fsGroup: 4059
          serviceAccount: fdb-kubernetes
          # Make sure that the pods are spread equally across the different availability zones.
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  foundationdb.org/fdb-cluster-name: sample-cluster
                  # This must be repeated for the other process classes with the correct
                  # process class.
                  foundationdb.org/fdb-process-class: general
          containers:
            - name: foundationdb
              env:
                # This feature allows the fdb-kubernetes-monitor to read the labels from the node where
                # it is running.
                - name: ENABLE_NODE_WATCH
                  value: "true"
              resources:
                requests:
                  cpu: 250m
                  memory: 128Mi

Migrating an existing cluster to Three-Data-Hall Replication

NOTE: These steps are for the split image setup, which is still the default; with the unified image setup the migration path is simpler.

It is possible to gracefully migrate a cluster in single, double or triple replication to Three-Data-Hall replication by following these steps:

  1. Create 3 exact clones of the original k8s FoundationDBCluster object (thus still with the same replication) and change these fields: metadata.name, spec.processGroupIDPrefix, spec.dataHall, spec.dataCenter.
  2. Make sure that skip: true is set in their YAML definitions so that the operator will not attempt to configure them.
  3. Make sure that seedConnectionString: is set to the current connection string of the original cluster.
  4. Run kubectl create for each of them.
  5. Set the configured state and connection string in their status subresource by using: kubectl patch fdb my-new-cluster-a --type=merge --subresource status --patch "status: {configured: true, connectionString: \"...\" }"; use the original cluster's connection string here.
  6. Set skip: false on each of the 3 new clusters, and then wait for reconciliation to finish.
  7. Start a lengthy exclude procedure that excludes all the processes of the original cluster; suggested order: log, storage, coordinator, stateless.
  8. Delete the original cluster once all exclusions are complete.
  9. Set databaseConfiguration.redundancy_mode to three_data_hall for the 3 new FoundationDB clusters, one after another.
  10. Patch the seed connection string of 2 of the 3 new clusters to point to the third one, e.g. if you have created clusters A, B, C, set the seed connection string of B and C to point to A and make sure that A has no seed connection string; this step is not crucial but is practically helpful sometimes.
  11. Scale down the clusters to use 1/3 of the original resources (each of them needs only 3 coordinators and 1/3 of the resources used for the other classes).

This procedure mitigates temporary issues (transaction timeouts, error 1031) which may happen under sustained traffic when the data distributor and master are being reallocated while the redundancy mode is changed and/or data is being moved.
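
A sketch of what one of the clones from steps 1-3 might look like; the name, prefix, data hall, and connection string below are placeholders, not values from this document:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: my-new-cluster-a
spec:
  version: 7.1.26
  # Keep the original replication for now; three_data_hall is only configured in step 9.
  processGroupIDPrefix: hall-a
  dataHall: hall-a
  dataCenter: dc1
  # Prevent the operator from acting on this cluster until step 6.
  skip: true
  # Replace with the current connection string of the original cluster.
  seedConnectionString: "..."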

Multi-Region Replication

The replication strategies above all describe how data is replicated within a data center or a single region. They control the zoneid field in the cluster's locality. If you want to run a cluster across multiple data centers or regions, you can use FoundationDB's multi-region replication. This can work with any of the replication strategies above. The data center will be a separate fault domain from whatever you provide for the zone.

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 7.1.26
  dataCenter: dc1
  databaseConfiguration:
    regions:
      - datacenters:
          - id: dc1
            priority: 1
          - id: dc2
            priority: 1
            satellite: 1
      - datacenters:
          - id: dc3
            priority: 0
          - id: dc4
            priority: 1
            satellite: 1

The dataCenter field at the top level of the spec specifies which data center these process groups are running in. This will be used to set the dcid locality field. The regions section of the database configuration describes all of the available regions. See the FoundationDB documentation for more information on how to configure regions.

Replicating across data centers will likely mean running your cluster across multiple Kubernetes clusters, even if you are using a single-Kubernetes replication strategy within each DC. This will mean taking on the operational challenges described in the "Multi-Kubernetes Replication" section above.

An example of how to set up a multi-region FDB cluster with the operator can be found in multi-dc. If you want to experiment locally with Kind, you can use the setup_e2e.sh script. The example performs the following steps:

Create a single DC FDB cluster:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 7.1.26
  dataCenter: dc1
  # Using the processGroupIDPrefix will prevent name conflicts.
  processGroupIDPrefix: dc1
  databaseConfiguration:
    regions:
      - datacenters:
          - id: dc1
            priority: 1

Once the cluster is fully reconciled you can create the FDB clusters in the other DCs, now with the full configuration and a seedConnectionString:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 7.1.26
  # Adjust dataCenter and processGroupIDPrefix to match the DC this cluster runs in (e.g. dc2 or dc3).
  dataCenter: dc1
  processGroupIDPrefix: dc1
  seedConnectionString: # Replace with the value from the initial single DC cluster
  databaseConfiguration:
    regions:
      - datacenters:
          - id: dc1
            priority: 1
          - id: dc2
            priority: 1
            satellite: 1
      - datacenters:
          - id: dc3
            priority: 0
          - id: dc4
            priority: 1
            satellite: 1

Coordinating Global Operations

When running a FoundationDB cluster that is deployed across multiple Kubernetes clusters, each Kubernetes cluster will have its own instance of the operator working on the processes in its cluster. Some operations cannot be scoped to a single Kubernetes cluster, such as changing the database configuration. The operator provides a locking system to reduce the risk of those independent operator instances performing the same action at the same time. All actions that the operator performs, like changing the configuration or restarting processes, lead to the same desired state; the locking system is only intended to reduce the risk of frequent recurring recoveries.

You can enable this locking system by setting lockOptions.disableLocks = false in the cluster spec. The locking system is automatically enabled by default for any cluster that has multiple regions in its database configuration, a zoneCount greater than 1 in its fault domain configuration, or redundancy_mode equal to three_data_hall.

The locking system uses the processGroupIDPrefix from the cluster spec to identify an instance of the operator. Make sure to set this to a unique value for each Kubernetes cluster, both to support the locking system and to prevent duplicate process group IDs.

This locking system uses the FoundationDB cluster as its data source. This means that if the cluster is unavailable, no instance of the operator will be able to get a lock. If you hit a case where this becomes an issue, you can disable the locking system by setting lockOptions.disableLocks = true in the cluster spec.
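
For example, explicitly disabling the locking system is a one-line change in the spec (a minimal sketch using the field named above):

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  lockOptions:
    # Set to false (or remove) to allow the operator to take locks again.
    disableLocks: true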

In most cases, restarts will be done independently in each Kubernetes cluster, and the locking system will be used to try to ensure a minimum time between the different restarts and avoid multiple recoveries in a short span of time. During upgrades, however, all instances must be restarted at the same time. The operator will use the locking system to coordinate this. Each instance of the operator will store records indicating what processes it is managing and what version they will be running after the restart. Each instance will then try to acquire a lock and confirm that every process reporting to the cluster is ready for the upgrade. If all processes are prepared, the operator will restart all of them at once. If any instance of the operator is stuck and unable to prepare its processes for the upgrade, the restart will not occur.

Synchronization Mode For Coordination

The operator supports two different synchronization modes for coordination in multi-region clusters: local (default) and global. When the synchronization mode is set to local, the operator acts only on its local information for actions like restarts, exclusions, inclusions and coordinator changes. With this mode, rolling out a new knob will cause multiple recoveries, as each local operator instance restarts its "local" processes independently, synchronized by the locking system.

With the synchronization mode set to global, the operator tries to coordinate actions across the Kubernetes clusters. The coordination is based on an optimistic system where the operator adds additional information in FDB to coordinate; for more information see better coordination for multi-cluster operator. When doing operations in only a single Kubernetes cluster, e.g. replacing a single process group, the operation might be slowed down by the coordination overhead, as the operator cannot differentiate between a local and a global operation. The trade-off is that local actions might be a bit slower, but global operations, e.g. a knob rollout, will be faster and less disruptive.

Deny List

There are some situations where an instance of the operator is able to get locks but should not be trusted to perform global actions. For instance, the operator could be partitioned in a way where it cannot access the Kubernetes API but can access the FoundationDB cluster. To block such an instance from taking locks, you can add it to the denyList in the lock options. You can set this in the cluster spec on any Kubernetes cluster.

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  processGroupIDPrefix: dc1
  lockOptions:
    denyList:
      - id: dc2

This will clear any locks held by dc2, and prevent it from taking further locks. In order to clear this deny list, you must change it to allow that instance again:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  processGroupIDPrefix: dc1
  lockOptions:
    denyList:
      - id: dc2
        allow: true

Once that change is fully reconciled, you can clear the deny list from the spec.

Managing Disruption

Pod disruption budgets (PDBs) are a good idea to prevent simultaneous disruption of many components in a cluster, particularly during the upgrade of node pools in public clouds. The operator does not yet create these automatically. To aid in the creation of PDBs, the operator preferentially selects coordinators from storage pods only; if there are not enough storage pods, or the storage pods are not spread across enough fault domains, it also considers log pods, and finally transaction pods.
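
A PDB you might create yourself for the storage pods could look like the following; this is only a sketch, not something the operator generates, and it relies on the pod labels shown earlier in this document:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sample-cluster-storage
spec:
  # Allow at most one storage pod of this cluster to be voluntarily disrupted at a time.
  maxUnavailable: 1
  selector:
    matchLabels:
      foundationdb.org/fdb-cluster-name: sample-cluster
      foundationdb.org/fdb-process-class: storage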

Coordinators

By default the FDB operator will try to select the best-fitting processes to be coordinators. Depending on the requirements, the operator can be configured to either prefer or exclude specific processes. The number of coordinators is currently determined by a hardcoded mechanism based on the following algorithm:

// DesiredCoordinatorCount returns the number of coordinators to recruit for a cluster.
func (cluster *FoundationDBCluster) DesiredCoordinatorCount() int {
    if cluster.Spec.DatabaseConfiguration.UsableRegions > 1 || cluster.Spec.DatabaseConfiguration.RedundancyMode == RedundancyModeThreeDataHall {
        return 9
    }

    return cluster.MinimumFaultDomains() + cluster.DesiredFaultTolerance()
}

For all clusters that use more than one region or use three_data_hall, the operator will recruit 9 coordinators. If the number of regions is 1, the number of recruited coordinators depends on the redundancy mode. The number of coordinators is chosen based on the fact that the coordinators use a consensus protocol (Paxos) that needs a majority of processes to be up. A common pattern in majority-based systems is to run n * 2 + 1 processes, where n is the number of failures that should be tolerated. The FoundationDB documentation has more information about choosing coordination servers.

Redundancy mode      # Coordinators
Single               1
Double (default)     3
Triple               5

Every coordinator must be in a different zone. That means that for triple replication you need at least 5 different Kubernetes nodes with the default fault domain. Losing one Kubernetes node will leave the cluster with only 4 coordinators, since the operator can't recruit a 5th coordinator in a different zone.

Coordinator selection

The operator offers a flexible way to select which process classes are eligible for coordinator selection. By default the operator will choose all stateful process classes, i.e. storage, log and transaction. In order to get a deterministic result, the operator sorts the candidates by priority (by default all have the same priority) and then by instance ID.

If you want to modify the selection process you can add a coordinatorSelection in the FoundationDBCluster spec:

spec:
  coordinatorSelection:
    - priority: 0
      processClass: log
    - priority: 10
      processClass: storage

Only process classes defined in the coordinatorSelection will be considered as possible candidates. In this example only processes with the class log or storage will be used as coordinators. The priority defines whether a specific process class should be preferred over another. In this example, processes with the class storage will be preferred over processes with the class log. That means a log process will only be selected as a coordinator if there are not enough storage processes that can be selected without violating the fault domain requirements. Changing the coordinatorSelection can result in new coordinators being selected, e.g. if the currently preferred class is removed.

The operator supports the following classes as coordinators:

  • storage
  • log
  • transaction
  • coordinator

Known limitations

FoundationDB clusters that are spread across different DCs or Kubernetes clusters must all use the same coordinatorSelection. The reason is that the coordinator selection is a global process, and different coordinatorSelection settings in the FoundationDBCluster resources can lead to undefined behaviour or, in the worst case, flapping coordinators.

Next

You can continue on to the next section or go back to the table of contents.