Merged
Changes from 8 commits
24 changes: 24 additions & 0 deletions modules/introduction/partials/new-features-80.adoc
@@ -59,6 +59,30 @@ curl --get -u <username:password> \
-d clusterLabels=none|uuidOnly|uuidAndName
----

https://jira.issues.couchbase.com/browse/MB-33315[MB-33315] Allow auto-failover for ephemeral buckets without a replica::
Previously, Couchbase Server always prevented auto-failover on nodes containing an ephemeral bucket without replicas.
You can now configure Couchbase Server Enterprise Edition to allow a node to fail over automatically even if it has an ephemeral bucket without a replica.
You can enable this setting using the Couchbase Server Web Console or through the REST API using the `allowFailoverEphemeralNoReplicas` auto-failover setting.
This option defaults to off.
When you enable it, Couchbase Server creates empty vBuckets on other nodes to replace the lost ephemeral vBuckets on the failed over node.
If the failed over node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to the rejoining node.
This option is useful if your application uses ephemeral buckets for replaceable data, such as caches.
This setting is not available in Couchbase Server Community Edition.

+
See xref:learn:clusters-and-availability/automatic-failover.adoc#auto-failover-and-ephemeral-buckets[Auto-Failover and Ephemeral Buckets] and xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] for more information.
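+
The setting can be applied through the REST API.
The following is a minimal sketch, assuming a cluster at `localhost:8091` and `Administrator:password` credentials (both placeholders); the `allowFailoverEphemeralNoReplicas` parameter name comes from this note, and the other parameters are the standard `/settings/autoFailover` form fields.
The command is echoed as a dry run so it can be inspected before use:
+
```shell
# Assumed placeholders -- substitute your own cluster address and credentials.
CB_HOST="http://localhost:8091"
CB_CRED="Administrator:password"

# Dry run: the request is echoed rather than sent.
# Remove the leading `echo` to apply the setting on a live cluster.
echo curl -s -u "$CB_CRED" -X POST "$CB_HOST/settings/autoFailover" \
  -d enabled=true \
  -d timeout=120 \
  -d maxCount=1 \
  -d allowFailoverEphemeralNoReplicas=true
```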

https://jira.issues.couchbase.com/browse/MB-34155[MB-34155] Support Auto-failover for exceptionally slow/hanging disks::
You can now configure Couchbase Server to trigger an auto-failover on a node if its data disk is slow to respond or is hanging.
Before version 8.0, you could only configure Couchbase Server to auto-failover a node if the data disk returned errors for a set period of time.
The new `failoverOnDataDiskNonResponsiveness` setting and the corresponding setting on the Couchbase Web Console *Settings* page set the number of seconds allowed for read or write operations to complete.
If this period elapses before the operation completes, Couchbase Server triggers an auto-failover for the node.
This setting is off by default.

+
See xref:learn:clusters-and-availability/automatic-failover.adoc#failover-on-data-disk-non-responsiveness[Failover on Data Disk Non-Responsiveness] to learn more about this feature.
See xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] and xref:rest-api:rest-cluster-autofailover-enable.adoc[] to learn how to enable it.
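+
A sketch of enabling the check through the REST API, assuming a cluster at `localhost:8091` and `Administrator:password` credentials (placeholders).
The `failoverOnDataDiskNonResponsiveness` name is taken from this note; the nested `[enabled]`/`[timePeriod]` form shown here is an assumption modeled on the existing `failoverOnDataDiskIssues` parameters.
The command is echoed as a dry run:
+
```shell
# Assumed placeholders -- substitute your own cluster address and credentials.
CB_HOST="http://localhost:8091"
CB_CRED="Administrator:password"

# Dry run: remove the leading `echo` to send the request.
# The [enabled]/[timePeriod] shape is an assumption mirroring the
# failoverOnDataDiskIssues parameters; quoting prevents shell globbing.
echo curl -s -u "$CB_CRED" -X POST "$CB_HOST/settings/autoFailover" \
  -d enabled=true \
  -d timeout=120 \
  -d 'failoverOnDataDiskNonResponsiveness[enabled]=true' \
  -d 'failoverOnDataDiskNonResponsiveness[timePeriod]=120'
```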

[#section-new-feature-data-service]
=== Data Service

8 changes: 7 additions & 1 deletion modules/learn/pages/buckets-memory-and-storage/buckets.adoc
@@ -203,9 +203,15 @@ You can add or remove buckets and nodes dynamically.
| By default, auto-failover starts when a node is inaccessible for 120 seconds.
Auto-failover can occur up to a specified maximum number of times before you must reset it manually.
When a failed node becomes accessible again, delta-node recovery uses data on disk and resynchronizes it.
| Auto-reprovision starts as soon as a node is inaccessible.
| Auto-reprovision starts for ephemeral buckets with replicas on a failed node as soon as a node is inaccessible.
Auto-reprovision can occur multiple times for multiple nodes.
When a failed node becomes accessible again, the system does not require delta-node recovery because no data resides on disk.

If you enable auto-failover for ephemeral buckets without replicas, a failed node can auto-fail over.
In this case, when a failover occurs, Couchbase Server creates empty vBuckets on the remaining nodes to replace the missing vBuckets on the failed node.
When the failed node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to it.

NOTE: The auto-failover for ephemeral buckets feature is only available in Couchbase Server Enterprise Edition.
|===

== Bucket Security
@@ -33,7 +33,7 @@ For information on managing auto-failover, see the information provided for Couc
== Failover Events

Auto-failover occurs in response to failed/failing events.
There are three types of event that can trigger auto-failover:
The following events can trigger auto-failover:

* _Node failure_.
A server-node within the cluster is unresponsive (due to a network failure, very high CPU utilization problem, out-of-memory problem, or other node-specific issue). This means that the cluster manager of the node has not sent heartbeats in the configured timeout period, and therefore the health of the services running on the node is unknown.
@@ -42,7 +42,16 @@ A server-node within the cluster is unresponsive (due to a network failure, very
Concurrent correlated failure of multiple nodes such as physical rack of machines or multiple virtual machines sharing a host.

* _Data Service disk read/write issues_.
Data Service disk read/write errors. Attempts by the Data Service to read from or write to disk on a particular node have resulted in a significant rate of failure (errors returned), for longer than a specified time-period.
Data Service disk read/write errors.
Attempts by the Data Service to read from or write to disk on a particular node have resulted in a significant rate of failure (errors returned), for longer than a specified time-period.

[#failover-on-data-disk-non-responsiveness]
* _Data Disk non-responsiveness_.
You can configure a timeout period for the Data Service's disk read/write threads to complete an operation.
When you enable this setting, if the period elapses and the thread has not completed the operation, Couchbase Server can auto-fail over the node.
This setting differs from the disk error timeout because the data disk does not have to return errors.
Instead, if the disk is so slow that it cannot complete the operation or is hanging, Couchbase Server can take action even when it does not receive an error.


* _Index or Data Service running on the node is non-responsive or unhealthy_.
** Index Service non-responsiveness.
@@ -72,8 +81,10 @@ For example, given a cluster of 18 nodes, _10_ nodes are required for the quorum
After this maximum number of auto-failovers has been reached, no further auto-failover occurs, until the count is manually reset by the administrator, or until a rebalance is successfully performed.
Note, however, that the count can be manually reset, or a rebalance performed, prior to the maximum number being reached.

* In no circumstances where data-loss might result: for example, when a bucket has no replicas.
Therefore, even a single event may not trigger a response; and an administrator-specified maximum number of failed nodes may not be reached.
* By default, Couchbase Server does not allow an auto-failover if it may result in data loss.
For example, with default settings, Couchbase Server does not allow the auto-failover of a node that contains a bucket with no replicas.
This restriction includes ephemeral buckets as well as Couchbase buckets.
See <<#auto-failover-and-ephemeral-buckets>> for more information on auto-failover and ephemeral buckets.

* Only in accordance with the xref:learn:clusters-and-availability/automatic-failover.adoc#failover-policy[Service-Specific Auto-Failover Policy] for the service or services on the unresponsive node.

@@ -89,7 +100,58 @@ Auto-failover is for intra-cluster use only: it does not work with xref:learn:cl
See xref:manage:manage-settings/configure-alerts.adoc[Alerts], for
details on configuring email alerts related to failover.

See xref:learn:clusters-and-availability/groups.adoc[Server Group Awareness], for information on server groups.
See xref:learn:clusters-and-availability/groups.adoc[Server Group Awareness], for information about server groups.

[#auto-failover-and-ephemeral-buckets]
== Auto-Failover and Ephemeral Buckets

Couchbase Server supports ephemeral buckets, which are buckets that it stores only in memory.
Their data is never persisted to disk.
This lack of persistence poses several challenges when it comes to node failure.

If an ephemeral bucket lacks replicas, it loses the data in vBuckets on any node that fails and restarts.
To prevent this data loss, by default Couchbase Server does not allow auto-failover of a node that contains vBuckets for an unreplicated ephemeral bucket.
In this case, you must manually fail over the node if it is unresponsive.
However, all of the ephemeral bucket's data on the node is lost.

Couchbase Server provides two settings that affect how node failures work with ephemeral buckets:

Auto-reprovisioning for Ephemeral Buckets::
This setting helps avoid data loss in cases where a node fails and restarts before Couchbase Server can begin an auto-failover for it.
This setting defaults to enabled.
When it's enabled, Couchbase Server automatically activates the replicas of any ephemeral vBuckets that were active on the restarting node.
If you turn off this setting, the restarting node can cause data loss: it can roll back asynchronous writes that the replica vBuckets received but that its own restarted vBuckets no longer contain.
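+
As a sketch, this setting can be adjusted through the REST API's `/settings/autoReprovision` endpoint, which accepts `enabled` and `maxNodes` form parameters (the cluster address and credentials below are placeholders, and the command is echoed as a dry run):
+
```shell
CB_HOST="http://localhost:8091"   # assumed cluster address
CB_CRED="Administrator:password"  # assumed credentials

# Dry run: remove the leading `echo` to send the request.
# maxNodes limits how many nodes can be auto-reprovisioned at a time.
echo curl -s -u "$CB_CRED" -X POST "$CB_HOST/settings/autoReprovision" \
  -d enabled=true \
  -d maxNodes=1
```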

[#ephemeral-buckets-with-no-replicas]
Auto-failover for Ephemeral Buckets with No Replicas [.edition]#{enterprise}#::
When enabled, this setting allows Couchbase Server to auto-fail over a node that contains vBuckets for an ephemeral bucket with no replicas.
When Couchbase Server fails over a node with an unreplicated ephemeral bucket, the data in the vBuckets on the node is lost.
Couchbase Server creates empty vBuckets on the remaining nodes to replace the missing vBuckets on the failed node.
When the failed node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to it.

+
This setting is off by default.
When it's off, Couchbase Server does not auto-fail over a node that contains an unreplicated ephemeral bucket's vBuckets.
If one of these nodes becomes unresponsive, you must manually fail over the node.

+
Enable this setting when preserving the data in the ephemeral bucket is not critical for your application.
For example, suppose you use the unreplicated ephemeral bucket for caching data.
In this case, consider enabling this setting to allow Couchbase Server to auto-failover nodes containing its vBuckets.
Losing the data in the cache is not critical, because your application can repopulate the cache with minimal performance cost.

+
NOTE: If the data in the ephemeral bucket is critical for your application, enable one or more replicas for it.
See xref:manage:manage-buckets/create-bucket.adoc#ephemeral-bucket-settings[Ephemeral Bucket Settings] for more information about adding replicas for an ephemeral bucket.

+
If the unreplicated ephemeral bucket is indexed, Couchbase Server rebuilds the index after it auto-fails over the node, even if the index is not on the failed node.
After this type of failover, the index must be rebuilt because it indexes data lost with the failed node's vBuckets.
For more information, see xref:learn:services-and-indexes/indexes/index-replication.adoc#index-rollback-after-failover[Index Rollback After Failover].

+
See xref:manage:manage-settings/general-settings.adoc#node-availability[Node Availability] to learn how to change these settings via the Couchbase Server Web Console.
See xref:rest-api:rest-cluster-autofailover-settings.adoc[] for information about changing these settings via the REST API.
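+
To confirm how a cluster is currently configured, the settings can also be read back over the REST API (same placeholder host and credentials as above; echoed as a dry run):
+
```shell
CB_HOST="http://localhost:8091"   # assumed cluster address
CB_CRED="Administrator:password"  # assumed credentials

# Dry run: remove the leading `echo` to query a live cluster.
# Each endpoint returns a JSON object describing the current settings.
echo curl -s -u "$CB_CRED" "$CB_HOST/settings/autoFailover"
echo curl -s -u "$CB_CRED" "$CB_HOST/settings/autoReprovision"
```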

[#failover-policy]
== Service-Specific Auto-Failover Policy
Binary file added modules/manage/assets/node-availability.png
81 changes: 60 additions & 21 deletions modules/manage/pages/manage-settings/general-settings.adoc
@@ -95,35 +95,74 @@ No identifiable information (such as bucket names, bucket data, design-document

[#node-availability]
=== Node Availability
The options in the *Node Availability* panel control if and when Couchbase Server performs an automatic failover on a node.
Couchbase Server only performs an auto-failover in clusters containing three or more nodes.
For detailed information about automatic failover policy and constraints, see xref:learn:clusters-and-availability/automatic-failover.adoc[].

The options in the *Node Availability* panel control whether and how *Automatic Failover* is applied.
For detailed information on policy and constraints, see xref:learn:clusters-and-availability/automatic-failover.adoc[Automatic Failover].
image::manage-settings/node-availability.png["The Node Availability panel consists of fields described in the following section.",548,align=center]

The panel appears as follows:
The fields in this section are:

Auto-failover after X seconds for up to Y node::
When you enable this setting (which defaults to enabled), Couchbase Server automatically fails over a node after it has been unresponsive for the duration you set in the *seconds* box.
When Couchbase Server auto-fails over a node, it promotes the replica vBuckets on other nodes to active status to maintain data access.
The default duration is 120 seconds.
You can change it to any value from 5 to 3600.

image::manage-settings/node-availability.png["The Node Availability panel",548,align=center]
+
You also set the maximum number of nodes that Couchbase Server can auto-fail over at a time in the *node* box.
If Couchbase Server has auto-failed over the maximum number of nodes, it refuses to auto-fail over more nodes until you either perform a successful rebalance or manually reset the count of failed-over nodes.
This maximum number of allowed auto-failovers also applies to the other auto-failover settings in this section.

The following checkboxes are provided:
+
You can only enable the other settings in this section when you enable *Auto-failover after X seconds for up to Y node*.

Auto-failover for sustained data disk read/write failures after X seconds [.edition]#{enterprise}#::
If you enable this setting, Couchbase Server fails over a node that experiences sustained data disk errors for the duration set in the *seconds* box.
The default duration is 120 seconds.
You can change the duration to a value from 5 to 3600 seconds.
You can only enable this setting if you enable *Auto-failover after X seconds for up to Y node*.
This setting defaults to off.

Auto-failover for data disk read/write non-responsiveness after X seconds [.edition]#{enterprise}#::
This setting is similar to the previous setting, except the duration in the *seconds* box sets the amount of time a read or write operation has to finish.
This setting defaults to off.
When you enable it, if a read or write operation on the data disk takes longer than the value in the *seconds* box, Couchbase Server performs an auto-failover on the node.
It lets you handle cases where a node's data disk is indefinitely hanging or is so busy it becomes unresponsive without generating errors.
The default duration is 120 seconds.
You can change the duration to a value from 5 to 3600 seconds.

Preserve durable writes [.edition]#{enterprise}#::
If you enable this setting, Couchbase Server does not auto-fail over a node if it could result in the loss of durably written data.
This setting defaults to off.
For more information, see xref:learn:data/durability.adoc#preserving-durable-writes[Preserving Durable Writes].

The *Node Availability* section also contains a *For Ephemeral Buckets* subsection that you can expand.
The settings in this subsection are:

Enable auto-reprovisioning for up to X node::
This setting helps avoid data loss in cases where a node fails and restarts before Couchbase Server can begin an auto-failover for it.
This setting defaults to enabled.
When it's enabled, Couchbase Server automatically promotes the replicas of the active ephemeral vBuckets on the failed node.
By making these replicas active, Couchbase Server prevents the loss of data caused by the restarted node losing the data in its ephemeral buckets.

* *Auto-failover after _x_ seconds for up to _y_ node*: After the timeout period set here as _x_ seconds has elapsed, an unresponsive or malfunctioning node is failed over, provided that the limit on actionable events set here as _y_ (with the default value of 1) has not yet been reached.
Data replicas are promoted to active on other nodes, as appropriate.
This feature can only be used when three or more nodes are present in the cluster.
The number of seconds to elapse is configurable: the default is 120; the minimum permitted is 5; the maximum 3600.
This option is selected by default.
+
The *node* box sets the maximum number of nodes Couchbase Server can auto-reprovision at a time.
This value defaults to 1, meaning only a single node's ephemeral buckets are auto-reprovisioned at a time.
After the failed node rejoins the cluster, you must perform a rebalance before another node can be auto-reprovisioned.
Only set this limit greater than 1 if the remaining nodes have enough capacity to handle the increased workload of multiple ephemeral buckets.

* *Auto-failover for sustained data disk read/write failures after _z_ seconds*: After the timeout period set here as _z_ seconds has elapsed, a node is failed over if it has experienced sustained data disk read/write failures.
The timeout period is configurable: the default length is 120 seconds; the minimum permitted is 5; the maximum 3600.
This checkbox can only be checked if *Auto-failover after _x_ seconds for up to _y_ node* has also been checked.
This option is unchecked by default.
Allow auto-failover for ephemeral buckets with no replicas [.edition]#{enterprise}#::
When enabled, this setting allows Couchbase Server to auto-fail over a node that contains an ephemeral bucket with no replicas.
When Couchbase Server fails over such a node, it creates empty vBuckets on the remaining nodes to replace the missing vBuckets on the failed node.
When the failed node rejoins the cluster, Couchbase Server moves the replacement vBuckets back to it.

* *Preserve durable writes*: If this checkbox is checked, a node is _not_ failed over if this might result in the loss of durably written data.
The default is that the checkbox is unchecked.
For information, see xref:learn:data/durability.adoc#preserving-durable-writes[Preserving Durable Writes].
+
This setting is off by default.
When it's off, Couchbase Server does not auto-fail over a node that contains vBuckets for an unreplicated ephemeral bucket.
In this case, you must manually fail over any node that contains an unreplicated ephemeral bucket's vBuckets.

The *Node Availability* panel also contains a *For Ephemeral Buckets* option.
When opened, this provides an *Enable auto-reprovisioning* checkbox, with a configurable number of nodes.
Checking this ensures that if a node containing _active_ Ephemeral buckets becomes unavailable, its replicas on the specified number of other nodes are promoted to active status as appropriate, to avoid data-loss.
Note, however, that this may leave the cluster in an unbalanced state, requiring a rebalance.
See xref:learn:clusters-and-availability/automatic-failover.adoc#auto-failover-and-ephemeral-buckets[Auto-Failover and Ephemeral Buckets] for more information about auto-failover and ephemeral buckets.

[#auto-failover-and-durability]
==== Auto-Failover and Durability
Expand Down
Loading