diff --git a/modules/introduction/partials/new-features-80.adoc b/modules/introduction/partials/new-features-80.adoc index 05b0fecf08..fdb1f9a5a9 100644 --- a/modules/introduction/partials/new-features-80.adoc +++ b/modules/introduction/partials/new-features-80.adoc @@ -89,6 +89,30 @@ curl --get -u \ -d clusterLabels=none|uuidOnly|uuidAndName ---- +https://jira.issues.couchbase.com/browse/MB-33315[MB-33315] Allow auto-failover for ephemeral buckets without a replica:: +Previously, Couchbase Server always prevented auto-failover on nodes containing an ephemeral bucket that does not have replicas. +You can now configure Couchbase Server Enterprise Edition to allow a node to auto-failover even if it has an ephemeral bucket without a replica. +You can enable this setting using the Couchbase Server Web Console or through the REST API using the `allowFailoverEphemeralNoReplicas` auto-failover setting. +This option defaults to off. +When you enable it, Couchbase Server creates empty vBuckets on other nodes to replace the lost ephemeral vBuckets on the failed-over node. +If the failed-over node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to the rejoining node. +This option is useful if your application uses ephemeral buckets for data that can be recreated, such as caches. +This setting is not available in Couchbase Server Community Edition. + ++ +See xref:learn:clusters-and-availability/automatic-failover.adoc#auto-failover-and-ephemeral-buckets[Auto-Failover and Ephemeral Buckets] and xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] for more information. + +https://jira.issues.couchbase.com/browse/MB-34155[MB-34155] Support Auto-failover for exceptionally slow/hanging disks:: +You can now configure Couchbase Server to trigger an auto-failover on a node if its data disk is slow to respond or is hanging. +Before version 8.0, you could only configure Couchbase Server to auto-failover a node if the data disk returned errors for a set period of time. +The new `failoverOnDataDiskNonResponsiveness` setting, and the corresponding setting on the Couchbase Web Console *Settings* page, set the number of seconds allowed for read or write operations to complete. +If this period elapses before the operation completes, Couchbase Server triggers an auto-failover for the node. +This setting is off by default. + ++ +See xref:learn:clusters-and-availability/automatic-failover.adoc#failover-on-data-disk-non-responsiveness[Failover on Data Disk Non-Responsiveness] to learn more about this feature. +See xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] and xref:rest-api:rest-cluster-autofailover-enable.adoc[] to learn how to enable it. + https://jira.issues.couchbase.com/browse/MB-65779[MB-65779]:: Couchbase supports the REST API `DELETE pools/default/settings/memcached/global/setting/[setting_name]` for some of the settings that are not always passed from the Cluster Manager to memcached. + @@ -111,6 +135,7 @@ These are the services that can be modified: You can modify these services using the Couchbase xref:manage:manage-nodes/modify-services-on-nodes-and-rebalance.adoc#modify-mds-services-from-ui[UI], xref:rest-api:rest-set-up-services-existing-nodes.adoc[REST API], or xref:manage:manage-nodes/modify-services-on-nodes-and-rebalance.adoc#modify-mds-services-using-cli[CLI].
+ [#section-new-feature-data-service] === Data Service diff --git a/modules/learn/pages/buckets-memory-and-storage/buckets.adoc b/modules/learn/pages/buckets-memory-and-storage/buckets.adoc index 18ace1ea6a..0c5eb46cd6 100644 --- a/modules/learn/pages/buckets-memory-and-storage/buckets.adoc +++ b/modules/learn/pages/buckets-memory-and-storage/buckets.adoc @@ -203,9 +203,15 @@ You can add or remove buckets and nodes dynamically. | By default, auto-failover starts when a node is inaccessible for 120 seconds. Auto-failover can occur up to a specified maximum number of times before you must reset it manually. When a failed node becomes accessible again, delta-node recovery uses data on disk and resynchronizes it. -| Auto-reprovision starts as soon as a node is inaccessible. +| Auto-reprovision starts for ephemeral buckets with replicas on a failed node as soon as a node is inaccessible. Auto-reprovision can occur multiple times for multiple nodes. When a failed node becomes accessible again, the system does not require delta-node recovery because no data resides on disk. + +If you enable auto-failover for ephemeral buckets without replicas, a failed node can be automatically failed over. +In this case, when a failover occurs, Couchbase Server creates empty vBuckets on the remaining nodes to replace the missing vBuckets on the failed node. +When the failed node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to it. + +NOTE: The auto-failover for ephemeral buckets feature is only available in Couchbase Server Enterprise Edition. |=== == Bucket Security diff --git a/modules/learn/pages/clusters-and-availability/automatic-failover.adoc b/modules/learn/pages/clusters-and-availability/automatic-failover.adoc index 156205e8a1..a0fbdb108a 100644 --- a/modules/learn/pages/clusters-and-availability/automatic-failover.adoc +++ b/modules/learn/pages/clusters-and-availability/automatic-failover.adoc @@ -33,7 +33,7 @@ For information on managing auto-failover, see the information provided for Couc == Failover Events Auto-failover occurs in response to failed/failing events. -There are three types of event that can trigger auto-failover: +The following events can trigger auto-failover: * _Node failure_. A server-node within the cluster is unresponsive (due to a network failure, very high CPU utilization problem, out-of-memory problem, or other node-specific issue). This means that the the cluster manager of the node has not sent heartbeats in the configured timeout period, and therefore, the health of the services running on the node is unknown. @@ -42,7 +42,16 @@ A server-node within the cluster is unresponsive (due to a network failure, very Concurrent correlated failure of multiple nodes such as physical rack of machines or multiple virtual machines sharing a host. * _Data Service disk read/write issues_. -Data Service disk read/write errors. Attempts by the Data Service to read from or write to disk on a particular node have resulted in a significant rate of failure (errors returned), for longer than a specified time-period. +Data Service disk read/write errors. +Attempts by the Data Service to read from or write to disk on a particular node have resulted in a significant rate of failure (errors returned), for longer than a specified time-period. + +[#failover-on-data-disk-non-responsiveness] +* _Data Disk non-responsiveness_. +You can configure a timeout period for the Data Service's disk read/write threads to complete an operation.
+When you enable this setting, if the period elapses and the thread has not completed the operation, Couchbase Server can auto-fail over the node. +This setting differs from the disk error timeout because the data disk does not have to return errors. +Instead, if the disk is so slow that it cannot complete the operation or is hanging, Couchbase Server can take action even when it does not receive an error. + * _Index or Data Service running on the mode is non-responsive or unhealthy_. ** Index Service non-responsiveness. @@ -61,7 +70,12 @@ If a monitored or configured auto-failover event occurs, an auto-failover will n The xref:install:deployment-considerations-lt-3nodes.adoc#quorum-arbitration[quorum constraint] is a critical part of auto-failover since the cluster must be able to form a quorum to initiate a failover, following the failure of some of the nodes. For Server Groups, this means that if you have two server groups with equal number of nodes, for auto-failover of all nodes in one server group to be able to occur, you could deploy an xref:learn:clusters-and-availability:nodes.adoc#adding-arbiter-nodes[arbiter node] (or another node) in a third physical server group which will allow the remaining nodes to form a quorum. -Another critical auto-failover constraint for Server Groups is the maximum number of nodes to be automatically failed over (`maxCount` in `/settings/autoFailover`) before administrator-intervention is required. If you want one entire server group of nodes to be able to be all automatically failed over, then the `maxCount` value should be at least the number of nodes in the server group. You can check the value of `maxCount` in `GET /settings/autoFailover` to see what the `maxCount` setting is. The value of `count` in the same `GET /settings/autoFailover` output tells you how many node auto-failovers have occurred since the parameter was last reset. Running a rebalance will reset the count value back to 0. Running a rebalance will reset the count value back to 0. The count should not be reset manually unless guided by Support, since resetting manually will cause you to lose track of the number of auto-failovers that have already occurred without the cluster being rebalanced. +Another critical auto-failover constraint for Server Groups is the maximum number of nodes to be automatically failed over (`maxCount` in `/settings/autoFailover`) before administrator-intervention is required. +If you want an entire server group of nodes to be able to fail over automatically, the `maxCount` value should be at least the number of nodes in the server group. +You can check the current value of `maxCount` in the output of `GET /settings/autoFailover`. +The value of `count` in the same `GET /settings/autoFailover` output tells you how many node auto-failovers have occurred since the parameter was last reset. +Running a rebalance will reset the count value back to 0. +The count should not be reset manually unless guided by Support, since resetting manually will cause you to lose track of the number of auto-failovers that have already occurred without the cluster being rebalanced. The list below describes other conditions that must be met for an auto-failover to be executed even after a monitored or configured auto-failover event has occurred.
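A minimal sketch of the `maxCount`/`count` check described above, assuming an `Administrator:password` credential on `localhost`, with `jq` used only for readability:

[source,console]
----
curl -s -u Administrator:password \
  http://localhost:8091/settings/autoFailover | jq '{maxCount, count}'
----

If `count` has reached `maxCount`, no further auto-failover occurs until the count is reset or a rebalance succeeds.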
@@ -72,8 +86,10 @@ For example, given a cluster of 18 nodes, _10_ nodes are required for the quorum After this maximum number of auto-failovers has been reached, no further auto-failover occurs, until the count is manually reset by the administrator, or until a rebalance is successfully performed. Note, however, that the count can be manually reset, or a rebalance performed, prior to the maximum number being reached. -* In no circumstances where data-loss might result: for example, when a bucket has no replicas. -Therefore, even a single event may not trigger a response; and an administrator-specified maximum number of failed nodes may not be reached. +* By default, Couchbase Server does not allow an auto-failover if it might result in data loss. +For example, with default settings Couchbase Server does not allow the auto-failover of a node that contains a bucket with no replicas. +This restriction includes ephemeral buckets as well as Couchbase buckets. +See <<#auto-failover-and-ephemeral-buckets>> for more information on auto-failover and ephemeral buckets. * Only in accordance with the xref:learn:clusters-and-availability/automatic-failover.adoc#failover-policy[Service-Specific Auto-Failover Policy] for the service or services on the unresponsive node. @@ -89,7 +105,58 @@ Auto-failover is for intra-cluster use only: it does not work with xref:learn:cl See xref:manage:manage-settings/configure-alerts.adoc[Alerts], for details on configuring email alerts related to failover. -See xref:learn:clusters-and-availability/groups.adoc[Server Group Awareness], for information on server groups. +See xref:learn:clusters-and-availability/groups.adoc[Server Group Awareness], for information about server groups. + +[#auto-failover-and-ephemeral-buckets] +== Auto-Failover and Ephemeral Buckets +Couchbase Server supports ephemeral buckets, which are buckets that it stores only in memory. +Their data is never persisted to disk. +This lack of persistence poses several challenges when it comes to node failure. + +If an ephemeral bucket lacks replicas, it loses the data in its vBuckets on any node that fails and restarts. +To prevent this data loss, by default Couchbase Server does not allow auto-failover of a node that contains vBuckets for an unreplicated ephemeral bucket. +In this case, you must manually fail over the node if it's unresponsive. +When you do, all of the ephemeral bucket's data on the node is lost. + +Couchbase Server provides two settings that affect how node failures work with ephemeral buckets: + +Auto-reprovisioning for Ephemeral Buckets:: +This setting helps avoid data loss in cases where a node fails and restarts before Couchbase Server can begin an auto-failover for it. +This setting defaults to enabled. +When it's enabled, Couchbase Server automatically activates the replicas of any ephemeral vBuckets that were active on the restarting node. +If you turn off this setting, there's a risk that the restarting node could cause data loss. +It could roll back asynchronous writes that the replica vBuckets have received but that its own vBuckets are missing. + +[#ephemeral-buckets-with-no-replicas] +Auto-failover for Ephemeral Buckets with No Replicas [.edition]#{enterprise}#:: +When enabled, this setting allows Couchbase Server to auto-failover a node that contains vBuckets for an ephemeral bucket with no replicas. +When Couchbase Server fails over a node with an unreplicated ephemeral bucket, the data in the vBuckets on the node is lost.
+Couchbase Server creates empty vBuckets on the remaining nodes to replace the missing vBuckets on the failed node. +When the failed node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to it. + ++ +This setting is off by default. +When it's off, Couchbase Server does not auto-failover a node that contains an unreplicated ephemeral bucket's vBuckets. +If one of these nodes becomes unresponsive, you must manually fail over the node. + ++ +Enable this setting when preserving the data in the ephemeral bucket is not critical for your application. +For example, suppose you use the unreplicated ephemeral bucket for caching data. +In this case, consider enabling this setting to allow Couchbase Server to auto-failover nodes containing its vBuckets. +Losing the data in the cache is not critical, because your application can repopulate the cache with minimal performance cost. + ++ +NOTE: If the data in the ephemeral bucket is critical for your application, enable one or more replicas for it. +See xref:manage:manage-buckets/create-bucket.adoc#ephemeral-bucket-settings[Ephemeral Bucket Settings] for more information about adding replicas for an ephemeral bucket. + ++ +If the unreplicated ephemeral bucket is indexed, Couchbase Server rebuilds the index after it auto-fails over the node, even if the index is not on the failed node. +After this type of failover, the index must be rebuilt because it indexes data lost in the failed node's vBuckets. +For more information, see xref:learn:services-and-indexes/indexes/index-replication.adoc#index-rollback-after-failover[Index Rollback After Failover]. + ++ +See xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] to learn how to change these settings via the Couchbase Server Web Console. +See xref:rest-api:rest-cluster-autofailover-settings.adoc[] for information about changing these settings via the REST API. [#failover-policy] == Service-Specific Auto-Failover Policy @@ -231,7 +298,8 @@ This parameter is available in Enterprise Edition only: in Community Edition, th * _Count_. The number of nodes that have already failed over. The default value is 0. -The value is incremented by 1 for every node that has an automatic-failover that occurs, up to the defined maximum count: beyond this point, no further automatic failover can be triggered until the count is reset to 0. Running a rebalance will reset the count value back to 0. +The value is incremented by 1 for every node that is automatically failed over, up to the defined maximum count: beyond this point, no further automatic failover can be triggered until the count is reset to 0. +Run a rebalance to reset the count value back to 0. * _Enablement of disk-related automatic failover; with corresponding time-period_. Whether automatic failover is enabled to handle continuous read-write failures.
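To round out the additions to this page, here is a hedged sketch of turning on both new capabilities documented above, auto-failover on data-disk non-responsiveness and auto-failover for ephemeral buckets without replicas, through the REST API. The host and `Administrator:password` credentials are placeholders, and both periods use their 120-second defaults:

[source,console]
----
curl -X POST -u Administrator:password \
  http://localhost:8091/settings/autoFailover \
  -d 'enabled=true' \
  -d 'timeout=120' \
  -d 'failoverOnDataDiskNonResponsiveness[enabled]=true' \
  -d 'failoverOnDataDiskNonResponsiveness[timePeriod]=120' \
  -d 'allowFailoverEphemeralNoReplicas=true'
----

Note that `timeout` must accompany `enabled=true`, a `timePeriod` must accompany the non-responsiveness flag, and both new settings are Enterprise Edition only; see xref:rest-api:rest-cluster-autofailover-enable.adoc[] for the full parameter reference.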
diff --git a/modules/manage/assets/node-availability.png b/modules/manage/assets/node-availability.png new file mode 100644 index 0000000000..83b0cb9280 Binary files /dev/null and b/modules/manage/assets/node-availability.png differ diff --git a/modules/manage/pages/manage-settings/general-settings.adoc b/modules/manage/pages/manage-settings/general-settings.adoc index cf64a51615..acbf93c920 100644 --- a/modules/manage/pages/manage-settings/general-settings.adoc +++ b/modules/manage/pages/manage-settings/general-settings.adoc @@ -95,35 +95,74 @@ No identifiable information (such as bucket names, bucket data, design-document [#node-availability] === Node Availability +The options in the *Node Availability* panel control if and when Couchbase Server performs an automatic failover on a node. +Couchbase Server only performs an auto-failover in clusters containing three or more nodes. +For detailed information about automatic failover policy and constraints, see xref:learn:clusters-and-availability/automatic-failover.adoc[]. -The options in the *Node Availability* panel control whether and how *Automatic Failover* is applied. -For detailed information on policy and constraints, see xref:learn:clusters-and-availability/automatic-failover.adoc[Automatic Failover]. +image::manage-settings/node-availability.png["The Node Availability panel consists of fields described in the following section.",548,align=center] -The panel appears as follows: +The fields in this section are: + +Auto-failover after X seconds for up to Y node:: +When you enable this setting (which defaults to enabled), Couchbase Server automatically fails over a node after it has been unresponsive for the duration you set in the *seconds* box. +When Couchbase Server auto-fails over a node, it promotes the replica vBuckets on other nodes to active status to maintain data access. +The default duration is 120 seconds. +You can change it to any value from 5 to 3600 seconds. -image::manage-settings/node-availability.png["The Node Availability panel",548,align=center] ++ +You also set the maximum number of nodes that Couchbase Server can auto-fail over at a time in the *node* box. +If Couchbase Server has auto-failed over the maximum number of nodes, it refuses to auto-fail over more nodes until you either perform a successful rebalance or manually reset the count of failed-over nodes. +This maximum number of allowed auto-failovers also applies to the other auto-failover settings in this section. -The following checkboxes are provided: ++ +You can only enable the other settings in this section when you enable *Auto-failover after X seconds for up to Y node*. + +Auto-failover for sustained data disk read/write failures after X seconds [.edition]#{enterprise}#:: +If you enable this setting, Couchbase Server fails over a node that experiences sustained data disk errors for the duration set in the *seconds* box. +The default duration is 120 seconds. +You can change the duration to a value from 5 to 3600 seconds. +You can only enable this setting if you enable *Auto-failover after X seconds for up to Y node*. +This setting defaults to off. + +Auto-failover for data disk read/write non-responsiveness after X seconds [.edition]#{enterprise}#:: +This setting is similar to the previous setting, except the duration in the *seconds* box sets the amount of time a read or write operation has to finish. +This setting defaults to off.
+When you enable it, if a read or write operation on the data disk takes longer than the value in the *seconds* box, Couchbase Server performs an auto-failover on the node. +It lets you handle cases where a node's data disk is indefinitely hanging or is so busy that it becomes unresponsive without generating errors. +The default duration is 120 seconds. +You can change the duration to a value from 5 to 3600 seconds. + +Preserve durable writes [.edition]#{enterprise}#:: +If you enable this setting, Couchbase Server does not auto-fail over a node if it could result in the loss of durably written data. +This setting defaults to off. +For more information, see xref:learn:data/durability.adoc#preserving-durable-writes[Preserving Durable Writes]. + +The *Node Availability* section also contains a *For Ephemeral Buckets* subsection that you can expand. +The settings in this subsection are: + +Enable auto-reprovisioning for up to X node:: +This setting helps avoid data loss in cases where a node fails and restarts before Couchbase Server can begin an auto-failover for it. +This setting defaults to enabled. +When it's enabled, Couchbase Server automatically promotes the replicas of the active ephemeral vBuckets on the failed node. +By making these replicas active, Couchbase Server prevents the data loss that would otherwise occur when the restarted node comes back with empty ephemeral buckets. -* *Auto-failover after _x_ seconds for up to _y_ node*: After the timeout period set here as _x_ seconds has elapsed, an unresponsive or malfunctioning node is failed over, provided that the limit on actionable events set here as _y_ (with the default value of 1) has not yet been reached. -Data replicas are promoted to active on other nodes, as appropriate. -This feature can only be used when three or more nodes are present in the cluster. -The number of seconds to elapse is configurable: the default is 120; the minimum permitted is 5; the maximum 3600. -This option is selected by default. ++ +The *node* box sets the maximum number of nodes Couchbase Server can auto-reprovision at a time. +This value defaults to 1, meaning only a single node's ephemeral buckets are auto-reprovisioned at a time. +After the failed node rejoins the cluster, you must perform a rebalance before another node can be auto-reprovisioned. +Only set this limit greater than 1 if the remaining nodes have enough capacity to handle the increased workload of multiple ephemeral buckets. -* *Auto-failover for sustained data disk read/write failures after _z_ seconds*: After the timeout period set here as _z_ seconds has elapsed, a node is failed over if it has experienced sustained data disk read/write failures. -The timeout period is configurable: the default length is 120 seconds; the minimum permitted is 5; the maximum 3600. -This checkbox can only be checked if *Auto-failover after _x_ seconds for up to _y_ node* has also been checked. -This option is unchecked by default. +Allow auto-failover for ephemeral buckets with no replicas [.edition]#{enterprise}#:: +When enabled, this setting allows Couchbase Server to auto-failover a node that contains an ephemeral bucket with no replicas. +When Couchbase Server fails over a node containing an ephemeral bucket with no replicas, it creates empty vBuckets on the remaining nodes to replace the missing vBuckets on the failed node. +When the failed node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to it.
-* *Preserve durable writes*: If this checkbox is checked, a node is _not_ failed over if this might result in the loss of durably written data. -The default is that the checkbox is unchecked. -For information, see xref:learn:data/durability.adoc#preserving-durable-writes[Preserving Durable Writes]. ++ +This setting is off by default. +When it's off, Couchbase Server does not auto-failover a node that contains vBuckets for an unreplicated ephemeral bucket. +In this case, you must manually fail over any node that contains an unreplicated ephemeral bucket's vBuckets. -The *Node Availability* panel also contains a *For Ephemeral Buckets* option. -When opened, this provides an *Enable auto-reprovisioning* checkbox, with a configurable number of nodes. -Checking this ensures that if a node containing _active_ Ephemeral buckets becomes unavailable, its replicas on the specified number of other nodes are promoted to active status as appropriate, to avoid data-loss. -Note, however, that this may leave the cluster in an unbalanced state, requiring a rebalance. +See xref:learn:clusters-and-availability/automatic-failover.adoc#auto-failover-and-ephemeral-buckets[Auto-Failover and Ephemeral Buckets] for more information about auto-failover and ephemeral buckets. [#auto-failover-and-durability] ==== Auto-Failover and Durability diff --git a/modules/rest-api/pages/rest-cluster-autofailover-enable.adoc b/modules/rest-api/pages/rest-cluster-autofailover-enable.adoc index 90dcd8b768..8f1470de78 100644 --- a/modules/rest-api/pages/rest-cluster-autofailover-enable.adoc +++ b/modules/rest-api/pages/rest-cluster-autofailover-enable.adoc @@ -1,11 +1,11 @@ = Enabling and Disabling Auto-Failover -:description: pass:q[Auto-failover is enabled and disabled by means of the `POST /settings/autoFailover` HTTP method and URI.] +:description: pass:q[Send a POST message to the `/settings/autoFailover` endpoint to change auto-failover settings.] :page-topic-type: reference [abstract] {description} -== HTTP method and URI +== HTTP Method and URI ---- POST /settings/autoFailover @@ -13,102 +13,129 @@ POST /settings/autoFailover == Description -The `POST /settings/autoFailover` HTTP method and URI can be used to enable and disable auto-failover. +You can use the `POST /settings/autoFailover` HTTP method and URI to enable, turn off, and change auto-failover settings. -Auto-failover settings are global, and therefore apply to all nodes in the cluster. -The Full Admin, Cluster Admin, or Backup Full Admin role is required, to establish the settings. +Auto-failover settings are global, applying to all nodes in the cluster. -== Curl Syntax + +== Syntax [source,bourne] ---- curl -X POST http://:8091/settings/autoFailover -u : - -d enabled=[true|false] - -d timeout= - -d maxCount= - -d failoverOnDataDiskIssues[enabled]=[true|false] - -d failoverOnDataDiskIssues[timePeriod]= - -d canAbortRebalance=[true|false] - -d failoverPreserveDurabilityMajority=[true|false] + -d 'enabled=[true|false]' + -d 'timeout=' + -d 'maxCount=' + -d 'failoverOnDataDiskIssues[enabled]=[true|false]' + -d 'failoverOnDataDiskIssues[timePeriod]=' + -d 'canAbortRebalance=[true|false]' + -d 'failoverPreserveDurabilityMajority=[true|false]' + -d 'allowFailoverEphemeralNoReplicas=[true|false]' + -d 'failoverOnDataDiskNonResponsiveness[enabled]=[true|false]' + -d 'failoverOnDataDiskNonResponsiveness[timePeriod]=' ---- The parameters are as follows: -* `enabled=[true|false]`. +* `enabled`: Enables or disables automatic failover. Default setting is `true`. 
-Setting `enabled` to `false` automatically sets `failoverOnDataDiskIssues[enabled]` to `false`. -Note that when `enabled` is set to `false`, the values supplied for any additional parameters (including `failoverOnDataDiskIssues[enabled]` and `canAbortRebalance`) are ignored. -The `enabled` parameter is _required_. -Setting `enabled` to `true` requires that the `timeout` parameter also be specified. - -* `timeout=`. -Integer between 5 and 3600. -Specifies the number of seconds that must elapse, with a node unavailable, before automatic failover is triggered. +This parameter is required. +If you set `enabled` to `true`, you must also supply a value for the `timeout` parameter. +Setting `enabled` to `false` automatically sets `failoverOnDataDiskIssues[enabled]` and `failoverOnDataDiskNonResponsiveness` to `false`. + ++ +NOTE: When you set `enabled` to `false`, Couchbase Server ignores any values you supply for additional parameters including `failoverOnDataDiskIssues[enabled]` and `canAbortRebalance`. + +* `timeout`: +Sets the number of seconds Couchbase Server waits before performing an auto-failover on an unresponsive node. Default setting is 120. The `timeout` parameter can only be specified when `enabled` is set to `true`. This parameter and its values are ignored if the value for the `enabled` parameter is `false`. + + -The value of `timeout` can also be specified as `1` second. -Note that low setting of this kind (anything below 5 seconds) significantly increases the sensitivity of failure-detection; and this, in turn, makes responses to _false failures_ more likely. -Additionally, more CPU resources are consumed. -+ -Therefore, it is recommended that testing of a _representative workload_ should occur before the value of `timeout` is established as `1` for a production environment. -Such testing should include both of the following: +You can set the value of `timeout` to `1` second. +A low setting, such as anything less than 5 seconds, increases the sensitivity of failure detection. +This low setting can cause false positives which result in Couchbase Server triggering auto-failovers unnecessarily. +It also increases CPU usage. -** Measuring CPU usage -** Checking for false-failure recognition + ++ +If you want to use a low setting, test a representative workload before setting the value of `timeout` to `1` in a production environment. +Make sure to measure CPU usage. +Monitor the cluster for auto-failovers caused by false positives. [#maxcount] -* `maxCount=`. -Specifies the maximum number of nodes to be automatically failed over before administrator-intervention is required +* `maxCount`: +Sets the maximum number of nodes Couchbase Server can auto-failover at a time. +Once this number of nodes has been auto-failed over, Couchbase Server does not auto-failover more nodes until you reset the count or resolve the auto-failovers with a rebalance to recover or remove the failed-over nodes. The maximum value can be up to the number of configured nodes. The default value is 1. This parameter is optional, and is only supported by Couchbase Server Enterprise Edition. This parameter and its values are ignored if the value for the `enabled` parameter is `false`. -* `failoverOnDataDiskIssues[enabled]=[true|false]` -Allows the triggering of auto-failover when disk read-write attempts have failed continuously throughout at least 60% of the specified time-period. 
+* `failoverOnDataDiskIssues[enabled]`: +Sets whether Couchbase Server performs auto-failovers on nodes where the data disk read or write attempts have resulted in errors continuously throughout at least 60% of the time-period set in `failoverOnDataDiskIssues[timePeriod]`. The default value for `failoverOnDataDiskIssues[enabled]` is `false`. -A value for `failoverOnDataDiskIssues[timePeriod]` must be specified when `failoverOnDataDiskIssues[enabled]` is `true`. +When you set this value to `true`, you must also supply a value for `failoverOnDataDiskIssues[timePeriod]`. + +* `failoverOnDataDiskIssues[timePeriod]`: +Sets the period of time in seconds that a node's data disk can return errors before Couchbase Server performs an auto-failover. +The valid range for this value is between 5 and 3600 seconds. +If you set `failoverOnDataDiskIssues[enabled]` to `true`, you must also supply a value for this parameter. -* `failoverOnDataDiskIssues[timePeriod]=`. -The specified value should be an integer between 5 and 3600. -The default value (which is maintained while `failoverOnDataDiskIssues[enabled]` is `false`) is 120; but if `failoverOnDataDiskIssues[enabled]` is set to `true`, a value for `failoverOnDataDiskIssues[timePeriod]` must nevertheless be explicitly specified. + If `failoverOnDataDiskIssues[enabled]` is _not_ specified, but `failoverOnDataDiskIssues[timePeriod]` _is_ specified, the following error message is generated: `The value of "failoverOnDataDiskIssues[enabled]" must be true or false`. + -If `failoverOnDataDiskIssues[enabled]` is `false`, but `failoverOnDataDiskIssues[timePeriod]` is specified, the value specified for `failoverOnDataDiskIssues[timePeriod]` is ignored. -+ -These parameters are _optional_, and are only supported by Couchbase Server Enterprise Edition. -These parameters and their values are ignored, if `enabled` is set to `false`. +If you supply a value for this parameter while `failoverOnDataDiskIssues[enabled]` is `false`, Couchbase Server ignores the setting. * `canAbortRebalance`. -Whether or not auto-failover can be triggered if a _rebalance_ is in progress. +Sets whether Couchbase Server can perform an auto-failover while a rebalance is taking place. This parameter is optional, and is only available in Couchbase Enterprise Edition. The value can be either `true` (the default) or `false`. -The parameter and its value are ignored, if `enabled` is set to `false`. +Couchbase Server ignores this setting if you set `enabled` to `false`. [#preserve-durable-writes] * `failoverPreserveDurabilityMajority`. +Sets whether Couchbase Server refuses to auto-failover a node if doing so could result in the loss of durably written data. Can be `true` or `false` (the default). -If this setting is `true`, a node is _not_ failed over if this might result in the loss of durably written data. For information, see xref:learn:data/durability.adoc#preserving-durable-writes[Preserving Durable Writes]. -== Responses +* `failoverOnDataDiskNonResponsiveness[enabled]`: +Sets whether Couchbase Server performs an auto-failover on a node when the data disk has not completed an operation in the period set by `failoverOnDataDiskNonResponsiveness[timePeriod]`. +The default value is `false`. +When you set this value to `true`, you must also supply a value for `failoverOnDataDiskNonResponsiveness[timePeriod]`. -Success returns `200 OK`.
+* `failoverOnDataDiskNonResponsiveness[timePeriod]`: +Sets the period of time in seconds that a node's data disk has to be unresponsive before Couchbase Server performs an auto-failover. +The valid range for this value is between 5 and 3600 seconds. +If you set `failoverOnDataDiskNonResponsiveness[enabled]` to `true`, you must also supply a value for this parameter. -Incorrectly specified values are handled as follows: +* `allowFailoverEphemeralNoReplicas`: +Sets whether Couchbase Server can auto-failover a node that contains vBuckets for an unreplicated ephemeral bucket. +The default value is `false`, which means Couchbase Server does not perform an auto-failover on a node that contains vBuckets for an unreplicated ephemeral bucket . +When you set this value to `true`, Couchbase Server can perform an auto-failover on the node even through it results in the loss of the data from the ephemeral bucket's vBuckets on the node. +This setting is only available in Couchbase Server Enterprise Edition. -* If the value of `enabled` is neither `true` nor `false`, `400 Bad Request` is returned, with the message `The value of "enabled" must be true or false`. +== Required Permissions -* If the value of `timeout` is incorrectly specified, `400 Bad Request` is returned, with the message `The value of "timeout" must be a positive integer in a range from 5 to 3600`. +You must have one of the following roles to make changes to the auto-failover settings: -* If the value of `timePeriod` is incorrectly specified, `400 Bad Request` is returned, with the message `The value of "failoverOnDataDiskIssues[timePeriod]" must be a positive integer in a range from 5 to 3600`. +* xref:learn:security/roles.adoc#backup-full-admin[Backup Full Admin] +* xref:learn:security/roles.adoc#cluster-admin[Cluster Admin] +* xref:learn:security/roles.adoc#full-admin[Full Admin] -Failure to authenticate returns `401 Unauthorized`. +== Responses + +200 OK:: +The call succeeded, and the auto-failover settings were changed. + +400 Bad Request:: +The call failed because the request was malformed or lacked required settings. + +401 Unauthorized:: +The call failed because the user did not have the proper permissions to change the auto-failover settings. [#example] == Example @@ -116,7 +143,7 @@ Failure to authenticate returns `401 Unauthorized`. The following example enables auto-failover for the cluster, with a `timeout` of 72 seconds, and a `maxCount` of `2`. It also enabled auto-failover on disk issues, and establishes the corresponding time period as `89` seconds. -[source#curl-example,javascript] +[source,console] ---- curl -X POST -u Administrator:password \ http://10.144.231.101:8091/settings/autoFailover \ @@ -127,14 +154,25 @@ http://10.144.231.101:8091/settings/autoFailover \ -d 'failoverOnDataDiskIssues[timePeriod]=89' ---- +This example disables auto-failover for the cluster: + +[source,console] +---- +curl -X POST -u Administrator:password \ + http://localhost:8091/settings/autoFailover \ + -d 'enabled=false' +---- + == See Also -For information on retrieving the current auto-failover parameter-values with the REST API, see xref:rest-api:rest-cluster-autofailover-settings.adoc[Retrieving Auto-Failover Settings]. +* For an overview of auto-failover, see xref:learn:clusters-and-availability/automatic-failover.adoc[]. + +* For an overview of durability, see xref:learn:data/durability.adoc[]. 
+ +* To retrieve the current auto-failover setting using the REST API, see xref:rest-api:rest-cluster-autofailover-settings.adoc[Retrieving Auto-Failover Settings]. -The Couchbase CLI allows auto-failover to be managed by means of the xref:cli:cbcli/couchbase-cli-setting-autofailover.adoc[setting-autofailover] command. -For information on managing auto-failover with Couchbase Web Console, see xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability]. +* To manage auto-failover using the command line, see xref:cli:cbcli/couchbase-cli-setting-autofailover.adoc[setting-autofailover] command. -A full description of auto-failover is provided in xref:learn:clusters-and-availability/automatic-failover.adoc[Automatic Failover]. +* To learn how to manage auto-failover with Couchbase Server Web Console, see xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability]. -An overview of durability is provided in xref:learn:data/durability.adoc[Durability]. -For information on establishing durability settings for a bucket, see xref:rest-api:rest-bucket-create.adoc[Creating and Editing Buckets]. +* To learn how to change a bucket's durability settings, see xref:rest-api:rest-bucket-create.adoc[]. diff --git a/modules/rest-api/pages/rest-cluster-autofailover-settings.adoc b/modules/rest-api/pages/rest-cluster-autofailover-settings.adoc index dcc5528360..0897b93fed 100644 --- a/modules/rest-api/pages/rest-cluster-autofailover-settings.adoc +++ b/modules/rest-api/pages/rest-cluster-autofailover-settings.adoc @@ -1,11 +1,11 @@ = Retrieving Auto-Failover Settings -:description: pass:q[Auto-failover settings are retrieved by means of the `GET /settings/autoFailover` HTTP method and URI.] +:description: pass:q[Use the `/settings/autoFailover` endpoint to get the current auto-failover settings.] :page-topic-type: reference [abstract] {description} -== HTTP method and URI +== HTTP Method and URI [source,bourne] ---- @@ -16,63 +16,56 @@ GET /settings/autoFailover The `GET /settings/autoFailover` HTTP method and URI retrieve auto-failover settings for the cluster. -Auto-failover settings are global, and apply to all nodes in the cluster. -To read auto-failover settings, one of the following roles is required: Full Admin, Cluster Admin, Read-Only Admin, Backup Full Admin, Eventing Full Admin, Local User Security Admin, External User Security Admin. +Auto-failover settings are global, applying to all nodes in the cluster. + == Curl Syntax [source,bourne] ---- -curl -X GET http://:8091/settings/autoFailover - -u : +curl -X GET http://:8091/settings/autoFailover \ + -u : ---- -== Responses - -Success returns `200 OK`, and an object that contains the following parameters: - -* `enabled`. -Indicates whether automatic failover is enabled (a value of `true`) or disabled (a value of `false`). - -* `timeout`. -Returns an integer between 5 and 3600, which specifies the number of seconds set to elapse, after a node has become unavailable, before automatic failover is triggered. -The default value is 120. +== Required Permissions -* `count`. -This parameter represents how many auto-failover nodes have occurred since the parameter was itself last reset, to a value of 0, through administrator intervention. -The parameter's default value is 1. -Couchbase Server increments this value by 1 for every node that is auto-failed over, up to the administrator-specified _maximum count_. 
-If nodes are failed over automatically until the _maximum count_ is reached, no further auto-failover is triggered until a parameter-reset is performed. +You must have one of the following roles to be able to read the auto-failover settings: -* `failoverOnDataDiskIssues`. -This contains two values, which are: +* xref:learn:security/roles.adoc#backup-full-admin[Backup Full Admin] +* xref:learn:security/roles.adoc#bucket-admin[Bucket Admin] +* xref:learn:security/roles.adoc#cluster-admin[Cluster Admin] +* xref:learn:security/roles.adoc#eventing-full-admin[Eventing Full Admin] +* xref:learn:security/roles.adoc#external-user-security-admin[External User Admin] +* xref:learn:security/roles.adoc#full-admin[Full Admin] +* xref:learn:security/roles.adoc#local-user-security-admin[Local User Admin] +* xref:learn:security/roles.adoc#read-only-admin[Read-Only Admin] +* xref:learn:security/roles.adoc#security-admin[Security Admin] +* xref:learn:security/roles.adoc#views-admin[Views Admin] -** `enabled`, which indicates whether auto-failover can occur when a disk has been unresponsive, and which can be `true` or `false` (the default). -** `timePeriod`, which indicates the administrator-specified time-period, in seconds, after which auto-failover is triggered, when a disk is unresponsive. -The value is an integer between 5 and 3600. +== Responses -* `maxCount`. -The administrator-specified maximum number of nodes that can be concurrently auto-failed over. -If nodes are auto-failed over until the value of `maxCount` is reached, no further auto-failover is triggered until a parameter-reset is performed. -The default value is 1. +200 OK:: +The call was successful. +Also returns an object containing the current state of the auto-failover settings. +See <<#example>> for an example of the response. -* `canAbortRebalance`. -Whether or not auto-failover can be triggered if a _rebalance_ is in progress. -This feature is only available in Couchbase Enterprise Edition. -The value can be either `true` (the default) or `false`. +401 Unauthorized:: +The user credentials supplied with the call do not have the correct permissions to read the auto-failover settings. -Failure to authenticate returns `401 Unauthorized`. -An incorrectly specified URL returns `404 Object Not Found`. +404 Not Found:: +The URL was incorrect. +[#example] == Example The following example returns the auto-failover settings for the cluster. -The output is piped to the https://stedolan.github.io/jq[jq^] command, to facilitate readability. +It pipes the output through the https://stedolan.github.io/jq[`jq`^] command to improve readability. -[source,bourne] +[source,console] ---- -curl -X GET http://localhost:8091/settings/autoFailover -u Administrator:password | jq '.' +curl -X GET http://localhost:8091/settings/autoFailover \ + -u Administrator:password | jq '.' ---- If successful, execution returns the auto-failover settings for the cluster. 
@@ -82,22 +75,90 @@ For example: ---- { "enabled": true, - "timeout": 72, + "timeout": 120, "count": 0, "failoverOnDataDiskIssues": { - "enabled": true, - "timePeriod": 89 + "enabled": false, + "timePeriod": 120 }, - "maxCount": 2, - "canAbortRebalance": true + "maxCount": 1, + "canAbortRebalance": true, + "failoverPreserveDurabilityMajority": false, + "failoverOnDataDiskNonResponsiveness": { + "enabled": false, + "timePeriod": 120 + }, + "allowFailoverEphemeralNoReplicas": false } ---- +// Note: avoiding mention of disableMaxCount here intentionally -- gg 7-22-2025 + +The keys in the object returned in the example are: + +* `enabled` +Whether automatic failover is on (a value of `true`) or off (`false`). + +* `timeout` +The number of seconds Couchbase Server waits after a node has become unavailable before it performs an automatic failover. +This value can be between 5 and 3600. +The default value is 120. + +* `count`. +The number of nodes that Couchbase Server has auto-failed over. +Couchbase Server resets this value to zero either when the cluster rebalances to remove or rejoin the failed nodes, or when an administrator manually resets the count (see xref:rest-api:rest-cluster-autofailover-reset.adoc[]). +The parameter's default value is 0. +If the number of failed-over nodes reaches the maximum count set by `maxCount`, Couchbase Server refuses to auto-failover more nodes until you reset the count or resolve the auto-failovers with a recovery and rebalance. + +* `failoverOnDataDiskIssues`. +This object contains two keys: + +** `enabled` indicates whether auto-failover can occur when a disk has been unresponsive, and can be `true` or `false` (the default). + +** `timePeriod`, which indicates the administrator-specified time-period, in seconds, after which auto-failover is triggered, when a disk is unresponsive. +The value is an integer between 5 and 3600. + +* `maxCount`. +The maximum number of nodes that can be auto-failed over at the same time. +When the count of auto-failed over nodes reaches this value, Couchbase Server does not trigger additional auto-failovers. +You must either resolve the auto-failovers by rebalancing the cluster to remove or recover the failed-over nodes, or reset the count of failed-over nodes. +The default value is 1. + +* `canAbortRebalance` +Whether or not Couchbase Server can auto-failover a node while a rebalance is taking place. +This feature is only available in Couchbase Server Enterprise Edition. +The value can be either `true` (the default) or `false`. + +* `failoverPreserveDurabilityMajority` +Indicates whether Couchbase Server refuses to auto-failover a node if doing so could result in the loss of durably written data. + +* `failoverOnDataDiskNonResponsiveness` +This object contains two keys that control auto-failover when a data disk is non-responsive: ++ +-- +** `enabled` indicates whether Couchbase Server initiates an auto-failover on a node when its data disk has failed to complete an operation in the period set by `timePeriod`. +This value can be `true`, which enables the auto-failover, or the default `false`, which does not trigger a failover due to disk unresponsiveness. + +** `timePeriod` +Indicates the amount of time a data disk on a node has to be unresponsive before Couchbase Server can trigger an auto-failover. +This value defaults to 120. +-- ++ +For more information about these values, see xref:learn:clusters-and-availability/automatic-failover.adoc#failover-on-data-disk-non-responsiveness[Failover on Data Disk Non-Responsiveness].
+ +* `allowFailoverEphemeralNoReplicas` +Indicates whether Couchbase Server can auto-failover a node that contains vBuckets for an unreplicated ephemeral bucket. +This value can be `true`, which allows auto-failover of such nodes, or the default `false`, which prevents auto-failover of such nodes. + ++ +For more information about this value, see xref:learn:clusters-and-availability/automatic-failover.adoc#ephemeral-buckets-with-no-replicas[Auto-failover for Ephemeral Buckets with No Replicas]. + == See Also -For information on setting auto-failover parameters with the REST API, see xref:rest-api:rest-cluster-autofailover-enable.adoc[Enabling and Disabling Auto-Failover]. +* For information about setting auto-failover parameters with the REST API, see xref:rest-api:rest-cluster-autofailover-enable.adoc[Enabling and Disabling Auto-Failover]. + +* The Couchbase Server command-line tool xref:cli:cbcli/couchbase-cli-setting-autofailover.adoc[setting-autofailover] lets you manage auto-failover. -The Couchbase CLI allows auto-failover to be managed by means of the xref:cli:cbcli/couchbase-cli-setting-autofailover.adoc[setting-autofailover] command. -For information on managing auto-failover with Couchbase Web Console, see xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability]. +* For information about managing auto-failover with Couchbase Server Web Console, see xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability]. -A full description of auto-failover is provided in xref:learn:clusters-and-availability/automatic-failover.adoc[Automatic Failover]. +* For information about auto-failover, see xref:learn:clusters-and-availability/automatic-failover.adoc[].
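As a complement to the key descriptions above, a small sketch (placeholder `Administrator:password` credentials on `localhost`) that extracts just the two keys introduced in this release with `jq`:

[source,console]
----
curl -s -u Administrator:password \
  http://localhost:8091/settings/autoFailover | \
  jq '{failoverOnDataDiskNonResponsiveness, allowFailoverEphemeralNoReplicas}'
----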