69 changes: 69 additions & 0 deletions docs/cloud/guides/disaster-recovery.md
@@ -0,0 +1,69 @@
---
slug: /cloud/disaster-recovery
sidebar_label: 'Disaster recovery'
title: 'Disaster recovery'
description: 'This guide provides an overview of disaster recovery.'
doc_type: 'guide'
---

# ClickHouse Cloud Disaster Recovery {#clickhouse-cloud-disaster-recovery}

This page covers disaster recovery recommendations for ClickHouse Cloud and guidance for customers on recovering from an outage. ClickHouse Cloud does not currently support automatic failover or automatic syncing across multiple geographical regions.

## Definitions {#definitions}

It is helpful to cover some definitions first.

**RPO (Recovery Point Objective)**: The maximum acceptable data loss, measured in time, following a disruptive event. Example: An RPO of 30 minutes means that in the event of a failure, the database should be restorable to data no older than 30 minutes. The achievable RPO depends on how frequently backups are taken.

**RTO (Recovery Time Objective)**: The maximum allowable downtime before normal operations must resume following an outage. Example: An RTO of 30 minutes means that in the event of a failure, the team can restore data and applications and resume normal operations within 30 minutes.

**Database Backups and Snapshots**: Backups provide durable long-term storage with a separate copy of the data. Snapshots do not create an additional copy of the data, are usually faster, and provide better RPOs.

## Database Backups {#database-backups}

Backing up your primary service lets you restore from that backup if the primary service goes down. ClickHouse Cloud supports the following backup capabilities.

**Default backups**: By default, ClickHouse Cloud takes a backup of your service every 24 hours. These backups are stored in the same region as the service, in the ClickHouse CSP (cloud service provider) storage bucket. If the data in the primary service becomes corrupted, the backup can be restored to a new service.

**External backups (in customer's own storage bucket)**: Enterprise Tier customers can export backups to object storage in their own account, either in the same region or in another region. Cross-cloud backup export support is coming soon. Data transfer charges apply for cross-region and cross-cloud backups.

**Configurable backups**: Customers can configure backups to happen at a higher frequency, up to every 6 hours, to improve the RPO. Customers can also configure longer retention.

## Restoring from a Backup {#restoring-from-a-backup}

- Default backups, in the ClickHouse Cloud bucket, can be restored to a new service in the same region.
- External backups (in customer object storage) can be restored to a new service in the same or different region.

> **NOTE**: There is currently NO support for automatic failover between two ClickHouse Cloud instances, whether in the same region or in different regions.

> **NOTE**: There is currently NO automatic syncing of data between different ClickHouse Cloud services in the same or different regions (i.e., Active-Active replication).

## Recovery Process {#recovery-process}

This section explains the various recovery options and the process that can be followed in each case.

### Primary Service Data Corruption {#primary-service-data-corruption}

In this case, the data can be restored from the backup to another service in the same region. The backup could be up to 24 hours old under the default backup policy, or up to 6 hours old with configurable backups set to a 6-hour frequency.
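
As a rough illustration of this arithmetic, the following Python sketch computes the worst-case data-loss window for a given backup interval and checks it against a target RPO. The function names and the 12-hour target are assumptions made for illustration; they are not part of any ClickHouse Cloud API, and the sketch ignores the time a backup takes to complete.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the failure happens just before the next backup would run."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the backup interval keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


# Hypothetical 12-hour RPO target, checked against the two policies above.
target_rpo = timedelta(hours=12)
print(meets_rpo(timedelta(hours=24), target_rpo))  # False
print(meets_rpo(timedelta(hours=6), target_rpo))   # True
```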

### Primary Region Downtime {#primary-region-downtime}

Customers in the Enterprise Tier can export backups to their own cloud provider bucket. If you are concerned about regional failures, we recommend exporting backups to a different region. Keep in mind that cross-region data transfer charges will apply.

If the primary region goes down, the backup stored in another region can be restored to a new service outside the primary region.

Once the backup has been restored to another service, you will need to ensure that any DNS, load balancer, or connection string configurations are updated to point to the new service. This may involve:

- Updating environment variables or secrets
- Restarting application services to establish new connections
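
One way an application might pick up the new endpoint is sketched below in Python. The environment variable names `CLICKHOUSE_HOST` and `CLICKHOUSE_DR_HOST` and the host names are hypothetical, chosen only for this example. The idea is that operators set the override after the restore completes, then restart the application so it reconnects to the restored service.

```python
import os
from typing import Mapping


def resolve_clickhouse_host(env: Mapping[str, str] = os.environ) -> str:
    """Return the restored service's host if one is set, else the primary host."""
    # CLICKHOUSE_DR_HOST and CLICKHOUSE_HOST are hypothetical variable names.
    dr_host = env.get("CLICKHOUSE_DR_HOST", "").strip()
    if dr_host:
        return dr_host
    return env["CLICKHOUSE_HOST"]


# Normal operation: only the primary host is configured.
print(resolve_clickhouse_host({"CLICKHOUSE_HOST": "primary.example.com"}))

# After a restore: the override points at the new service.
print(resolve_clickhouse_host({
    "CLICKHOUSE_HOST": "primary.example.com",
    "CLICKHOUSE_DR_HOST": "restored.example.com",
}))
```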

> **NOTE**: Backup to and restore from an external bucket are currently not supported for services utilizing Transparent Data Encryption (TDE).

## Additional Options {#additional-options}

There are some additional options to consider.

**Dual-writing to separate clusters**: With this option, you set up two separate clusters in different regions and write to both. This comes at a higher cost because it involves running multiple clusters, but it provides higher availability if one region becomes unavailable.
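
A minimal sketch of the dual-write pattern in Python follows. The writer callables stand in for whatever client code inserts into each cluster; all names here are hypothetical. The sketch only reports which clusters failed. A real deployment would also need retries or a replay queue to keep the clusters consistent.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]
Writer = Callable[[Sequence[Row]], None]


def dual_write(rows: Sequence[Row], writers: Dict[str, Writer]) -> List[str]:
    """Send the same batch to every cluster; return names of clusters that failed."""
    failed: List[str] = []
    for name, write in writers.items():
        try:
            write(rows)
        except Exception:
            # A real system would log the error and queue the batch for replay.
            failed.append(name)
    return failed


# In-memory stand-ins for two regional clusters (hypothetical).
region_a: List[Row] = []


def write_region_a(rows: Sequence[Row]) -> None:
    region_a.extend(rows)


def write_region_b(rows: Sequence[Row]) -> None:
    raise ConnectionError("region B unavailable")


failed = dual_write(
    [{"id": 1}],
    {"region_a": write_region_a, "region_b": write_region_b},
)
print(failed)  # ['region_b']
```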

**Utilize CSP replication**: With this option, you use the cloud service provider's native object storage replication to copy data to another region. For instance, with BYOB you can export the backup to a bucket that you own in the primary region and have it replicated to another region using AWS cross-region replication.
35 changes: 26 additions & 9 deletions scripts/aspell-ignore/en/aspell-dict.txt
@@ -1,11 +1,9 @@
personal_ws-1.1 en 3734
AArch
ACLs
Accepter
AICPA
ALTERs
AMPLab
AmazonKinesis
AMQP
ANNIndex
ANNIndexes
@@ -19,6 +17,7 @@ ASOF
ASan
AWND
AWST
Accepter
Actian
ActionsMenu
ActiveRecord
@@ -34,6 +33,7 @@ Airbyte
Akka
AlertManager
Alexey
AmazonKinesis
Amir
Anthropic
AnyEvent
@@ -60,8 +60,8 @@ Authenticators
Authy
AutoFDO
AutoML
Autoscaler
Autocompletion
Autoscaler
AvroConfluent
AzureQueue
Azurite
@@ -221,6 +221,7 @@ Cloudflare
CodeBlock
CodeLLDB
Codecs
Coinhall
CollapsingMergeTree
Combinators
CommonRoom
@@ -389,6 +390,7 @@ FQDN
Fabi
Failover
FarmHash
Fastly
FileCluster
FileLog
Filebeat
@@ -413,6 +415,7 @@ Fivetran
FixedString
FlameGraph
Flink
Fong
ForEach
FreeBSD
Fuzzer
@@ -423,6 +426,7 @@ GTID
GTIDs
GTest
GUID
GWLBs
Gb
Gbit
Gcc
@@ -480,6 +484,7 @@ HiveText
Holistics
Homebrew
Homebrew's
Hopsworks
HorizontalDivide
Hostname
HouseOps
@@ -524,6 +529,7 @@ InJodaSyntaxOrZero
Incrementing
IndexesAreNeighbors
InfluxDB
Instacart
Instana
IntN
Integrations
@@ -652,6 +658,7 @@ LOCALTIMESTAMP
LONGLONG
LOONGARCH
LaGuardia
Lakehouses
Lakekeeper
LangChain
LangGraph
@@ -703,6 +710,7 @@ MACStringToOUI
MCPHost
MEDIUMINT
MEMTABLE
MLOps
MMapCacheCells
MMappedAllocBytes
MMappedAllocs
@@ -712,6 +720,8 @@ MQTT
MQTTX
MSSQL
MSan
MTTD
MTTR
MVCC
MacBook
MacOS
@@ -803,6 +813,7 @@ NEWDECIMAL
NFKC
NFKD
NIST
NLBs
NOAA
NULLIF
NVME
@@ -845,6 +856,7 @@ NumberOfDatabases
NumberOfDetachedByUserParts
NumberOfDetachedParts
NumberOfTables
O'Reilly
OAuth
ODBCDriver
OFNS
@@ -1053,6 +1065,7 @@ REPL
RHEL
RIPEMD
ROLLUP
RPOs
RWLock
RWLockActiveReaders
RWLockActiveWriters
@@ -1155,6 +1168,7 @@ SSRF
SSSE
SaaS
Sackmann's
SageMaker
Sanjeev
Sankey
Sapchuk
@@ -1190,6 +1204,7 @@ SimHash
Simhash
SimpleAggregateFunction
SimpleState
SingleStore
SipHash
SlackBot
Smartbook
@@ -2054,6 +2069,7 @@ evalMLMethod
exFAT
expiryMsec
explainer
explorative
exponentialMovingAverage
exponentialTimeDecayedAvg
exponentialTimeDecayedCount
@@ -2078,6 +2094,7 @@ extractURLParameters
extractable
facto
failover
failovers
farmFingerprint
farmHash
fastmcp
@@ -2276,10 +2293,8 @@ hiveHash
hnsw
holistics
homebrew
homebrew
hopEnd
hopStart
Hopsworks
horgh
hostName
hostname
@@ -2423,7 +2438,7 @@ kusto
lagInFrame
laion
lakehouse
Lakehouses
lakehouses
lang
laravel
largestTriangleThreeBuckets
@@ -2660,7 +2675,6 @@ nats
navbar
ndjson
ness
Nessie
nestjs
netloc
newjson
@@ -2794,6 +2808,7 @@ plantuml
poco
pointInEllipses
pointInPolygon
pointwise
poller
polygonAreaCartesian
polygonAreaSpherical
@@ -3034,16 +3049,17 @@ reshards
resolvers
resourceGUID
restartable
restorable
resultset
resync
resynchronization
resyncing
failovers
retentions
rethrow
retransmit
retriable
retryable
reusability
reverseUTF
rewritable
rightPad
@@ -3289,6 +3305,7 @@ sumcount
sumkahan
summap
summapwithoverflow
summarization
summingmergetree
sumwithoverflow
superaggregates
@@ -3460,6 +3477,7 @@ transactionLatestSnapshot
transactionOldestSnapshot
transactional
transactionally
transformative
translateUTF
translocality
transpilation
@@ -3468,7 +3486,6 @@ trie
trimBoth
trimLeft
trimRight
Trino
trunc
tryBase
tryDecrypt