diff --git a/docs/cloud/guides/disaster-recovery.md b/docs/cloud/guides/disaster-recovery.md
new file mode 100644
index 00000000000..e35933e3969
--- /dev/null
+++ b/docs/cloud/guides/disaster-recovery.md
@@ -0,0 +1,69 @@
+---
+slug: /cloud/disaster-recovery
+sidebar_label: 'Disaster recovery'
+title: 'Disaster recovery'
+description: 'This guide provides an overview of disaster recovery.'
+doc_type: 'guide'
+---
+
+# ClickHouse Cloud Disaster Recovery {#clickhouse-cloud-disaster-recovery}
+
+This page covers disaster recovery recommendations for ClickHouse Cloud and guidance for recovering from an outage. ClickHouse Cloud does not currently support automatic failover or automatic syncing across multiple geographical regions.
+
+## Definitions {#definitions}
+
+It is helpful to cover some definitions first.
+
+**RPO (Recovery Point Objective)**: The maximum acceptable data loss measured in time following a disruptive event. Example: an RPO of 30 minutes means that after a failure the database should be restorable to a state no more than 30 minutes old. The achievable RPO depends on how frequently backups are taken.
+
+**RTO (Recovery Time Objective)**: The maximum allowable downtime before normal operations must resume following an outage. Example: an RTO of 30 minutes means that after a failure, the team can restore data and applications and resume normal operations within 30 minutes.
+
+**Database Backups and Snapshots**: Backups provide durable long-term storage with a separate copy of the data. Snapshots do not create an additional copy of the data, are usually faster, and provide better RPOs.
+
+## Database Backups {#database-backups}
+
+Keeping a backup of your primary service lets you restore from it in the event of primary service downtime. ClickHouse Cloud supports the following backup capabilities.
+
+**Default backups**: By default, ClickHouse Cloud takes a backup of your service every 24 hours. These backups are stored in the same region as the service, in the ClickHouse CSP (cloud service provider) storage bucket. If the data in the primary service gets corrupted, the backup can be used to restore to a new service.
+
+**External backups (in customer's own storage bucket)**: Enterprise Tier customers can export backups to object storage in their own account, either in the same region or in another region. Cross-cloud backup export support is coming soon. Data transfer charges apply to cross-region and cross-cloud backups.
+
+**Configurable backups**: Customers can configure backups to happen at a higher frequency, up to every 6 hours, to improve the RPO. Customers can also configure longer retention.
+
+## Restoring from a Backup {#restoring-from-a-backup}
+
+- Default backups, in the ClickHouse Cloud bucket, can be restored to a new service in the same region.
+- External backups (in customer object storage) can be restored to a new service in the same or a different region.
+
+> **NOTE**: There is currently NO support for automatic failover between two ClickHouse Cloud instances, whether in the same or different regions.
+
+> **NOTE**: There is currently NO automatic syncing of data between different ClickHouse Cloud services in the same or different regions (i.e., active-active replication).
+
+## Recovery Process {#recovery-process}
+
+This section explains the various recovery options and the process that can be followed in each case.
+
+### Primary Service Data Corruption {#primary-service-data-corruption}
+
+In this case, the data can be restored from the backup to another service in the same region. The backup could be up to 24 hours old under the default backup policy, or up to 6 hours old if configurable backups are set to a 6-hour frequency.
+
+### Primary Region Downtime {#primary-region-downtime}
+
+Customers in the Enterprise Tier can export backups to their own cloud provider bucket. If you are concerned about regional failures, we recommend exporting backups to a different region. Keep in mind that cross-region data transfer charges will apply.
+
+If the primary region goes down, the backup in another region can be restored to a new service in a different region.
+
+Once the backup has been restored to another service, you will need to ensure that any DNS, load balancer, or connection string configurations are updated to point to the new service. This may involve:
+
+- Updating environment variables or secrets
+- Restarting application services to establish new connections
+
+> **NOTE**: Backup to and restore from an external bucket is currently not supported for services utilizing Transparent Data Encryption (TDE).
+
+## Additional Options {#additional-options}
+
+There are some additional options to consider.
+
+**Dual-writing to separate clusters**: With this option, you set up two separate clusters in different regions and dual-write to both. It costs more because it involves running multiple clusters, but it provides higher availability if one region becomes unavailable.
+
+**Utilize CSP replication**: With this option, you use the cloud service provider's native object storage replication to copy data over. For instance, with BYOB you can export the backup to a bucket that you own in the primary region, and have that replicated over to another region using AWS cross-region replication.
\ No newline at end of file
diff --git a/scripts/aspell-ignore/en/aspell-dict.txt b/scripts/aspell-ignore/en/aspell-dict.txt
index e8344540201..4ea1cdd9b60 100644
--- a/scripts/aspell-ignore/en/aspell-dict.txt
+++ b/scripts/aspell-ignore/en/aspell-dict.txt
@@ -1,11 +1,9 @@
 personal_ws-1.1 en 3734
 AArch
 ACLs
-Accepter
 AICPA
 ALTERs
 AMPLab
-AmazonKinesis
 AMQP
 ANNIndex
 ANNIndexes
@@ -19,6 +17,7 @@ ASOF
 ASan
 AWND
 AWST
+Accepter
 Actian
 ActionsMenu
 ActiveRecord
@@ -34,6 +33,7 @@ Airbyte
 Akka
 AlertManager
 Alexey
+AmazonKinesis
 Amir
 Anthropic
 AnyEvent
@@ -60,8 +60,8 @@ Authenticators
 Authy
 AutoFDO
 AutoML
-Autoscaler
 Autocompletion
+Autoscaler
 AvroConfluent
 AzureQueue
 Azurite
@@ -221,6 +221,7 @@ Cloudflare
 CodeBlock
 CodeLLDB
 Codecs
+Coinhall
 CollapsingMergeTree
 Combinators
 CommonRoom
@@ -389,6 +390,7 @@ FQDN
 Fabi
 Failover
 FarmHash
+Fastly
 FileCluster
 FileLog
 Filebeat
@@ -413,6 +415,7 @@ Fivetran
 FixedString
 FlameGraph
 Flink
+Fong
 ForEach
 FreeBSD
 Fuzzer
@@ -423,6 +426,7 @@ GTID
 GTIDs
 GTest
 GUID
+GWLBs
 Gb
 Gbit
 Gcc
@@ -480,6 +484,7 @@ HiveText
 Holistics
 Homebrew
 Homebrew's
+Hopsworks
 HorizontalDivide
 Hostname
 HouseOps
@@ -524,6 +529,7 @@ InJodaSyntaxOrZero
 Incrementing
 IndexesAreNeighbors
 InfluxDB
+Instacart
 Instana
 IntN
 Integrations
@@ -652,6 +658,7 @@ LOCALTIMESTAMP
 LONGLONG
 LOONGARCH
 LaGuardia
+Lakehouses
 Lakekeeper
 LangChain
 LangGraph
@@ -703,6 +710,7 @@ MACStringToOUI
 MCPHost
 MEDIUMINT
 MEMTABLE
+MLOps
 MMapCacheCells
 MMappedAllocBytes
 MMappedAllocs
@@ -712,6 +720,8 @@ MQTT
 MQTTX
 MSSQL
 MSan
+MTTD
+MTTR
 MVCC
 MacBook
 MacOS
@@ -803,6 +813,7 @@ NEWDECIMAL
 NFKC
 NFKD
 NIST
+NLBs
 NOAA
 NULLIF
 NVME
@@ -845,6 +856,7 @@ NumberOfDatabases
 NumberOfDetachedByUserParts
 NumberOfDetachedParts
 NumberOfTables
+O'Reilly
 OAuth
 ODBCDriver
 OFNS
@@ -1053,6 +1065,7 @@ REPL
 RHEL
 RIPEMD
 ROLLUP
+RPOs
 RWLock
 RWLockActiveReaders
 RWLockActiveWriters
@@ -1155,6 +1168,7 @@ SSRF
 SSSE
 SaaS
 Sackmann's
+SageMaker
 Sanjeev
 Sankey
 Sapchuk
@@ -1190,6 +1204,7 @@ SimHash
 Simhash
 SimpleAggregateFunction
 SimpleState
+SingleStore
 SipHash
 SlackBot
 Smartbook
@@ -2054,6 +2069,7 @@ evalMLMethod
 exFAT
 expiryMsec
 explainer
+explorative
 exponentialMovingAverage
 exponentialTimeDecayedAvg
 exponentialTimeDecayedCount
@@ -2078,6 +2094,7 @@ extractURLParameters
 extractable
 facto
 failover
+failovers
 farmFingerprint
 farmHash
 fastmcp
@@ -2276,10 +2293,8 @@ hiveHash
 hnsw
 holistics
 homebrew
-homebrew
 hopEnd
 hopStart
-Hopsworks
 horgh
 hostName
 hostname
@@ -2423,7 +2438,7 @@ kusto
 lagInFrame
 laion
 lakehouse
-Lakehouses
+lakehouses
 lang
 laravel
 largestTriangleThreeBuckets
@@ -2660,7 +2675,6 @@ nats
 navbar
 ndjson
 ness
-Nessie
 nestjs
 netloc
 newjson
@@ -2794,6 +2808,7 @@ plantuml
 poco
 pointInEllipses
 pointInPolygon
+pointwise
 poller
 polygonAreaCartesian
 polygonAreaSpherical
@@ -3034,16 +3049,17 @@ reshards
 resolvers
 resourceGUID
 restartable
+restorable
 resultset
 resync
 resynchronization
 resyncing
-failovers
 retentions
 rethrow
 retransmit
 retriable
 retryable
+reusability
 reverseUTF
 rewritable
 rightPad
@@ -3289,6 +3305,7 @@ sumcount
 sumkahan
 summap
 summapwithoverflow
+summarization
 summingmergetree
 sumwithoverflow
 superaggregates
@@ -3460,6 +3477,7 @@ transactionLatestSnapshot
 transactionOldestSnapshot
 transactional
 transactionally
+transformative
 translateUTF
 translocality
 transpilation
@@ -3468,7 +3486,6 @@ trie
 trimBoth
 trimLeft
 trimRight
-Trino
 trunc
 tryBase
 tryDecrypt