feat: PG cross-region disaster recovery (CRDR) #533
Merged
wojcik-dorota
merged 54 commits into
main
from
dorota-postgresql-cross-region-disaster-recovery
Oct 27, 2025
Commits
7a87769
section structure
wojcik-dorota ffa155d
section structure
wojcik-dorota d68f637
concept part1
wojcik-dorota 3e3f1ab
crdr setup diagram
wojcik-dorota 71954f7
failover diagram
wojcik-dorota 0fdad39
revert opertation
wojcik-dorota e4edc4e
revert diagram
wojcik-dorota 0140a7b
diagrams highlighting
wojcik-dorota 8ec8b56
diagrams look and feel
wojcik-dorota 605f095
enable crdr
wojcik-dorota 7584859
crdr failover
wojcik-dorota 54e8344
revert
wojcik-dorota e323b57
cli instructions for crdr ops
wojcik-dorota f502399
api calls for CRDR management
wojcik-dorota 25f2463
update: separate swizzled components (#604)
ArthurFlag 2ed0a0a
enable crdr
wojcik-dorota d693f3d
fixing terminology re failover vs switchover
wojcik-dorota b4b4e14
fix
wojcik-dorota 2e4b6d5
fix
wojcik-dorota c5a11f0
set up cdrd via terraform
wojcik-dorota 7b6501b
how to detect region outage
wojcik-dorota 9d9d850
switchback
wojcik-dorota a97a49a
switchover diagram
wojcik-dorota a2a4637
switchover &switchback
wojcik-dorota 115610f
toc
wojcik-dorota f5fba14
switchover and switchback via console
wojcik-dorota 021e4ce
related pages
wojcik-dorota f186d84
fix
wojcik-dorota 300bfb7
fix
wojcik-dorota 5fc7971
fix
wojcik-dorota 88ac121
switchover api and cli
wojcik-dorota 815f447
fix
wojcik-dorota f1f4914
TF flows
wojcik-dorota bc66b8c
fix
wojcik-dorota b142cf7
fix
wojcik-dorota 8258af8
fix
wojcik-dorota 7355714
Apply suggestions from code review
wojcik-dorota 1b2c4cb
Update crdr-switchover.md
wojcik-dorota 7437436
Update crdr-revert-to-primary.md
wojcik-dorota a489312
fix typos
wojcik-dorota 4ec69ca
fix typos
wojcik-dorota 324dc58
fix typos
wojcik-dorota 9f55c37
fix typos
wojcik-dorota f88bc77
fix typos
wojcik-dorota d5d53c8
fix typos
wojcik-dorota 204c01e
removing automatic failover for LA
wojcik-dorota 5ccf969
fix
wojcik-dorota 871eb73
fix
wojcik-dorota 1e6bee4
fix
wojcik-dorota ee7ef5e
gui fixes to failover and failback
wojcik-dorota 78c58d3
removed console flows for switchover and switchback
wojcik-dorota 9ab0657
feedback
wojcik-dorota ae023b7
dns name
wojcik-dorota 7d3bfeb
restrictions
wojcik-dorota
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| --- | ||
| title: Cross-region disaster recovery in Aiven for PostgreSQL® | ||
| --- | ||
|
|
||
| import DocCardList from '@theme/DocCardList'; | ||
|
|
||
| <DocCardList /> | ||
|
|
||
| ## Related pages | ||
|
|
||
| - [Backups in Aiven for PostgreSQL®](/docs/products/postgresql/concepts/pg-backups) | ||
| - [Read-only replicas in Aiven for PostgreSQL®](/docs/products/postgresql/howto/create-read-replica) | ||
| - [High availability in Aiven for PostgreSQL®](/docs/products/postgresql/concepts/high-availability) | ||
| - [Upgrade and failover procedures in Aiven for PostgreSQL®](/docs/products/postgresql/concepts/upgrade-failover) | ||
| - [Backup to another region](/docs/platform/concepts/backup-to-another-region) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,230 @@ | ||
| --- | ||
| title: Cross-region disaster recovery in Aiven for PostgreSQL® | ||
| sidebar_label: CRDR overview | ||
| limited: true | ||
| keywords: [recovery, primary, outage, failure, failover] | ||
| --- | ||
|
|
||
| import ConsoleLabel from "@site/src/components/ConsoleIcons"; | ||
| import RelatedPages from "@site/src/components/RelatedPages"; | ||
| import readyForCrdr from "@site/static/images/content/figma/ready-for-crdr.png"; | ||
| import crdrSetup from "@site/static/images/content/figma/crdr-setup.png"; | ||
| import crdrFailover from "@site/static/images/content/figma/crdr-failover.png"; | ||
| import crdrSwitchover from "@site/static/images/content/figma/crdr-switchover.png"; | ||
| import crdrRevert from "@site/static/images/content/figma/crdr-revert.png"; | ||
| import crdrSwitchback from "@site/static/images/content/figma/crdr-switchback.png"; | ||
|
|
||
| The cross-region disaster recovery (CRDR) feature supports your business continuity by letting you recover your workloads to a remote region in the event of a region-wide | ||
| failure. | ||
|
|
||
| ## Region-wide outage | ||
|
|
||
| CRDR allows you to cope with a primary region failure by initiating a recovery transition | ||
| to another region. To identify a region outage, check the region status: | ||
|
|
||
| - Check your monitoring and alerts, and watch for the following signals: | ||
|   - Instance, node, and service failures | ||
|   - Connectivity loss, latency spikes, packet drops | ||
|   - High error rates, timeouts, 5xx server errors | ||
| - Check your cloud provider's status page: | ||
|   - [AWS](https://health.aws.amazon.com) | ||
|   - [Google Cloud](https://status.cloud.google.com) | ||
|   - [Azure](https://status.azure.com) | ||
| - Test connectivity and DNS resolution for your instances or services. | ||
|
|
||
| ## CRDR overview | ||
|
|
||
| The CRDR setup is a pair of integrated multi-node services, sharing credentials and a | ||
| DNS name but located in different regions. CRDR peer services can be hosted on 1-3 nodes. | ||
|
|
||
| - **Primary service** hosted in the primary region is the original service you use on a | ||
| regular basis. It hands over to the recovery service when you initiate | ||
| [a failover or a switchover](/docs/products/postgresql/crdr/crdr-overview#recovery-transition). | ||
| When you initiate | ||
| [a failback or a switchback](/docs/products/postgresql/crdr/crdr-overview#recovery-reversion), | ||
| the primary service takes back control from the recovery service as soon as the | ||
| infrastructure is up and running again. | ||
| - **Recovery service** hosted in the recovery region is the service you create for | ||
| disaster recovery purposes. It takes over from the primary service when you initiate | ||
| [a failover or a switchover](/docs/products/postgresql/crdr/crdr-overview#recovery-transition). | ||
| When you initiate | ||
| [a failback or a switchback](/docs/products/postgresql/crdr/crdr-overview#recovery-reversion), | ||
| the recovery service hands over to the primary service as soon as the infrastructure is | ||
| up and running again. | ||
|
|
||
| The CRDR cycle is a sequence of actions involving CRDR peer services aimed at enabling and | ||
| executing CRDR as well as resuming the original service operation. | ||
|
|
||
| Throughout the CRDR cycle, CRDR peer services or service nodes go into the following states: | ||
|
|
||
| - **Active**: A CRDR peer service is *active* when it runs on a node that is replicating data to | ||
| CRDR standby nodes. | ||
|   - The primary service is active during normal operations, when its region is up and running. | ||
|   - The recovery service is active after taking over from the primary service in the event of a region outage. | ||
|
|
||
| - **Passive**: A CRDR peer service is *passive* when it runs on CRDR standby nodes only. Either CRDR | ||
| peer service can be passive depending on the phase of the CRDR cycle. | ||
|
|
||
| - **Failed**: A CRDR peer service is *failed* when it's defunct or unreachable after failing over | ||
| in the event of a region outage. Only a primary service can be failed. | ||
|
|
||
| ## Limitations | ||
|
|
||
| - **Service plan requirements**: To set up CRDR, your primary service must use at least a | ||
| Startup plan. Hobbyist and Free plans are not supported. | ||
|
|
||
| :::tip[Upgrading your plan] | ||
| If your Aiven for PostgreSQL service uses a Hobbyist plan or a Free plan, | ||
| [upgrade your free plan](/docs/platform/concepts/service-pricing#free-plans) or | ||
| [change your Hobbyist plan](/docs/platform/howto/scale-services) to at least a Startup | ||
| plan. | ||
| ::: | ||
|
|
||
| - **Console restrictions**: When creating a recovery service through | ||
| the [Aiven Console](https://console.aiven.io/), you must use the same service plan and | ||
| cloud provider as your primary service. | ||
|
|
||
| :::tip[Alternative setup methods] | ||
| For different service plans or cloud providers, create your recovery service using the | ||
| [Aiven CLI](/docs/tools/cli), the [Aiven API](/docs/tools/api), or the | ||
| [Aiven Provider for Terraform](https://registry.terraform.io/providers/aiven/aiven/latest/docs). | ||
| ::: | ||
|
|
||
| ## How it works | ||
|
|
||
| The CRDR feature is available on all Startup, Business, and Premium service plans. | ||
|
|
||
| <img src={readyForCrdr} className="centered" alt="Ready for CRDR" width="100%" /> | ||
|
|
||
| ### CRDR setup | ||
|
|
||
| You [enable CRDR by creating a recovery service](/docs/products/postgresql/crdr/enable-crdr). | ||
| The CRDR setup completes as soon as the recovery service is created and in sync with the | ||
| primary service. At that point, the primary service is the **Active** service receiving | ||
| incoming traffic and replicating to the recovery service, and the recovery service is the | ||
| **Passive** service replicating from the primary service. | ||
|
|
||
| <img src={crdrSetup} className="centered" alt="CRDR setup" width="100%" /> | ||
|
|
||
| ### Recovery transition | ||
|
|
||
| CRDR supports two types of recovery transition: | ||
|
|
||
| - [Failover](/docs/products/postgresql/crdr/crdr-overview#failover-to-the-recovery-region) | ||
|   - **Triggered by you**, typically in the event of a region-wide outage. | ||
|   - **Destroys the primary service**, which must be recreated before you can fail back. | ||
| - [Switchover](/docs/products/postgresql/crdr/crdr-overview#switchover-to-the-recovery-region) | ||
|   - **Triggered by you** for any purpose other than a region-wide outage. | ||
|   - Leaves the **primary service intact**, so you can switch back without recreating it. | ||
|
|
||
| #### Failover to the recovery region | ||
|
|
||
| You typically trigger a | ||
| [failover to the recovery region](/docs/products/postgresql/crdr/failover/crdr-failover-to-recovery) | ||
| in the event of a region-wide outage. This destroys the primary service, which becomes | ||
| **Failed**, and promotes the recovery service to **Active**. Before you can fail back, | ||
| the primary service must be recreated. | ||
|
|
||
| <img src={crdrFailover} className="centered" alt="CRDR failover" width="100%" /> | ||
|
|
||
| #### Switchover to the recovery region | ||
|
|
||
| You trigger a | ||
| [switchover to the recovery service](/docs/products/postgresql/crdr/switchover/crdr-switchover) | ||
| for testing, simulating a disaster scenario, or verifying the disaster resilience of your | ||
| infrastructure. This demotes the primary service to **Passive** and promotes the recovery | ||
| service to **Active**. To switch back to the primary service, no service recreation is | ||
| needed. | ||
|
|
||
| <img src={crdrSwitchover} className="centered" alt="CRDR switchover" width="100%" /> | ||
|
|
||
| ### Recovery reversion | ||
|
|
||
| You trigger a recovery reversion to shift your workload back to the primary region and | ||
| restore the CRDR setup to its original configuration. | ||
|
|
||
| There are two types of recovery reversion: | ||
|
|
||
| - [Failback](/docs/products/postgresql/crdr/crdr-overview#failback-to-the-primary-region) | ||
|   - Reverts a | ||
|     [failover](/docs/products/postgresql/crdr/crdr-overview#failover-to-the-recovery-region). | ||
|   - Recreates the primary service. | ||
| - [Switchback](/docs/products/postgresql/crdr/crdr-overview#switchback-to-the-primary-region) | ||
|   - Reverts a | ||
|     [switchover](/docs/products/postgresql/crdr/crdr-overview#switchover-to-the-recovery-region). | ||
|   - Doesn't require recreating the primary service. | ||
|
|
||
| #### Failback to the primary region | ||
|
|
||
| The failback process consists of two steps you initiate at your convenience: | ||
|
|
||
| 1. [Primary service recreation](/docs/products/postgresql/crdr/failover/crdr-revert-to-primary) | ||
|
|
||
| You initiate this step to restore primary service nodes from the local backups and to | ||
| synchronize (replicate) the most recent data from the active service (recovery service). | ||
| When completed, the primary service is restored and in near real-time sync with the recovery service. | ||
|
|
||
| 1. [Primary service takeover](/docs/products/postgresql/crdr/failover/crdr-revert-to-primary) | ||
|
|
||
| You initiate a takeover as soon as the primary service is recreated. This switches the direction of | ||
| the replication to effectively route the traffic back to the primary region. When | ||
| completed, both the primary service and the recovery service are up and running again: the primary service as an active | ||
| service, and the recovery service as a passive service. | ||
|
|
||
| <img src={crdrRevert} className="centered" alt="CRDR revert" width="100%" /> | ||
|
|
||
| #### Switchback to the primary region | ||
|
|
||
| You initiate a switchback at your convenience to switch the direction of the | ||
| replication and route the traffic back to the primary region. When completed, both the primary service | ||
| and the recovery service are up and running again: the primary service as an active service, and the recovery service as a | ||
| passive service. | ||
|
|
||
| <img src={crdrSwitchback} className="centered" alt="CRDR switchback" width="100%" /> | ||
|
|
||
| ## DNS name and service URI | ||
|
|
||
| ### Active service DNS name | ||
|
|
||
| CRDR allows you to always access your active service using the same **Service URI**, | ||
| which doesn't change in the event of a failover to the recovery region. | ||
|
|
||
| :::note | ||
| **Service URI** is a locator shared between the primary service and the recovery service. It always points | ||
| to the replicating node of the active service, which is the only read-write node | ||
| across both CRDR regions. | ||
| ::: | ||
|
|
||
| The **Service URI** of an active service can remain unchanged in the event of a region outage | ||
| because the DNS record of this **Service URI** is updated to point to the active service. | ||
| This allows your applications to work uninterrupted and adapt to the change automatically | ||
| without updating their code or data. | ||
|
|
||
| ### Standby nodes DNS names | ||
|
|
||
| Regardless of the CRDR cycle phase, you can always connect to each standby node in the | ||
| CRDR peer services separately. This can help you compensate for potential | ||
| network delays by using the service geographically closer to your applications. | ||
|
|
||
| Standby nodes in the CRDR service pair can have two different URIs, depending on the CRDR | ||
| service (region) they belong to: | ||
|
|
||
| - For the **primary service standby URI**, the DNS record always points to the standby nodes | ||
| in the primary region. | ||
| - For the **recovery service standby URI**, the DNS record always points to the standby nodes | ||
| in the recovery region. | ||
|
|
||
| Both the primary service standby URI and the recovery service standby URI are dedicated, not shared, and read-only. | ||
|
|
||
| ## Backups in the recovery region | ||
|
|
||
| After a failover to the recovery region in the event of a primary region outage, service | ||
| backups start to be taken in the recovery region. You can use this backup history for | ||
| operations and data resiliency purposes. | ||
|
|
||
| <RelatedPages/> | ||
|
|
||
| - [Aiven for PostgreSQL high availability](/docs/products/postgresql/concepts/high-availability) | ||
| - [Aiven for PostgreSQL backups](/docs/products/postgresql/concepts/pg-backups) | ||
| - [Aiven for PostgreSQL read-only replica](/docs/products/postgresql/howto/create-read-replica) | ||
| - [Backup to another region](/docs/platform/concepts/backup-to-another-region) | ||
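
The state changes described in the CRDR overview (setup, failover or switchover, then failback or switchback) can be read as a small state machine over the two peer services. The following is a minimal sketch of that reading, not an Aiven implementation: `PeerState`, `CrdrPair`, and all method names are hypothetical and don't correspond to the Aiven API, CLI, or Terraform provider. It also assumes that a recreated primary service counts as **Passive** while it replicates from the recovery service, which the overview implies but doesn't state.

```python
from dataclasses import dataclass
from enum import Enum


class PeerState(Enum):
    ACTIVE = "active"    # runs the node that replicates to CRDR standby nodes
    PASSIVE = "passive"  # runs on CRDR standby nodes only
    FAILED = "failed"    # defunct or unreachable after a failover


@dataclass
class CrdrPair:
    """Hypothetical model of a CRDR pair; defaults match a completed setup."""

    primary: PeerState = PeerState.ACTIVE
    recovery: PeerState = PeerState.PASSIVE

    def _expect(self, action: str, primary: PeerState, recovery: PeerState) -> None:
        # Refuse transitions that the overview does not describe.
        if (self.primary, self.recovery) != (primary, recovery):
            raise RuntimeError(
                f"{action} is not modeled from primary={self.primary.value}, "
                f"recovery={self.recovery.value}"
            )

    def failover(self) -> None:
        """Recovery transition for an outage: primary Failed, recovery Active."""
        self._expect("failover", PeerState.ACTIVE, PeerState.PASSIVE)
        self.primary, self.recovery = PeerState.FAILED, PeerState.ACTIVE

    def switchover(self) -> None:
        """Recovery transition for testing: primary Passive, recovery Active."""
        self._expect("switchover", PeerState.ACTIVE, PeerState.PASSIVE)
        self.primary, self.recovery = PeerState.PASSIVE, PeerState.ACTIVE

    def recreate_primary(self) -> None:
        """Failback step 1 (assumed to leave the primary Passive and in sync)."""
        self._expect("recreate_primary", PeerState.FAILED, PeerState.ACTIVE)
        self.primary = PeerState.PASSIVE

    def _revert(self, action: str) -> None:
        # Reverse the replication direction: primary Active, recovery Passive.
        self._expect(action, PeerState.PASSIVE, PeerState.ACTIVE)
        self.primary, self.recovery = PeerState.ACTIVE, PeerState.PASSIVE

    def primary_takeover(self) -> None:
        """Failback step 2: the recreated primary takes over."""
        self._revert("primary_takeover")

    def switchback(self) -> None:
        """Reverts a switchover; no recreation step is needed."""
        self._revert("switchback")


if __name__ == "__main__":
    pair = CrdrPair()
    pair.failover()
    pair.recreate_primary()
    pair.primary_takeover()
    print(pair.primary.value, pair.recovery.value)
```

In this model, failback needs two calls (`recreate_primary`, then `primary_takeover`) while switchback needs one, which mirrors the difference between the two reversion types in the overview.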
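
The "Region-wide outage" section suggests testing connectivity and DNS resolution for your services. A minimal way to script that check with the Python standard library might look like the sketch below. The function name `check_endpoint` and the shape of the result are illustrative assumptions, and the host and port you pass in are your own. A failed check from a single vantage point doesn't prove a region-wide outage, so treat it as one signal alongside monitoring and the provider status pages.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 3.0) -> dict:
    """Resolve `host`, then try a TCP connection, and report what worked."""
    result = {"host": host, "port": port, "resolved": [], "connected": False, "error": None}

    # Step 1: DNS resolution. A failure here points at DNS, not the service.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result

    # Step 2: TCP connectivity. This only proves the port accepts connections,
    # not that PostgreSQL itself is healthy.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["connected"] = True
    except OSError as exc:
        result["error"] = f"TCP connection failed: {exc}"
    return result
```

For example, calling `check_endpoint` with the host and port taken from your **Service URI** shows whether the shared DNS name currently resolves and accepts connections.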