---
title: Migrate and Upgrade a TiDB Cluster
summary: Learn how to migrate and upgrade a TiDB cluster using BR for full backup and restore, along with TiCDC for incremental data replication.
---

# Migrate and Upgrade a TiDB Cluster

This document describes how to migrate and upgrade a TiDB cluster (also known as a blue-green upgrade) using [BR](/br/backup-and-restore-overview.md) for full backup and restore, along with [TiCDC](/ticdc/ticdc-overview.md) for incremental data replication. This solution uses dual-cluster redundancy and incremental replication to enable smooth traffic switchover and fast rollback, providing a reliable and low-risk upgrade path for critical systems. It is recommended to regularly upgrade the database version to continuously benefit from performance improvements and new features, helping you maintain a secure and efficient database system. The key advantages of this solution include:

- **Controllable risk**: supports rollback to the original cluster within minutes, ensuring business continuity.
- **Data integrity**: uses a multi-stage verification mechanism to prevent data loss.
- **Minimal business impact**: requires only a brief maintenance window for the final switchover.

The core workflow for migration and upgrade is as follows:

1. **Pre-check risks**: verify cluster status and solution feasibility.
2. **Prepare the new cluster**: create a new cluster from a full backup of the old cluster and upgrade it to the target version.
3. **Replicate incremental data**: establish a forward data replication channel using TiCDC.
4. **Switch and verify**: perform multi-dimensional verification, switch business traffic to the new cluster, and set up a TiCDC reverse replication channel.
5. **Observe status**: maintain the reverse replication channel. After the observation period, clean up the environment.

**Rollback plan**: if the new cluster encounters issues during the migration and upgrade process, you can switch business traffic back to the original cluster at any time.

The following sections describe the standardized process and general steps for migrating and upgrading a TiDB cluster. The example commands are based on a TiDB Self-Managed environment.

## Step 1: Evaluate solution feasibility

Before migrating and upgrading, evaluate the compatibility of relevant components and check cluster health status.

- Check the TiDB cluster version: this solution applies to TiDB v6.5.0 or later versions.

- Verify TiCDC compatibility:

    - **Table schema requirements**: ensure that tables to be replicated contain valid indexes. For more information, see [TiCDC valid index](/ticdc/ticdc-overview.md#best-practices).
    - **Feature limitations**: TiCDC does not support Sequence or TiFlash DDL replication. For more information, see [TiCDC unsupported scenarios](/ticdc/ticdc-overview.md#unsupported-scenarios).
    - **Best practices**: avoid executing DDL operations on the upstream cluster of TiCDC during switchover.

- Verify BR compatibility:

    - Review the compatibility matrix of BR full backup. For more information, see [BR version compatibility matrix](/br/backup-and-restore-overview.md#version-compatibility).
    - Check the known limitations of BR backup and restore. For more information, see [BR usage restrictions](/br/backup-and-restore-overview.md#restrictions).

- Check the health status of the cluster, such as [Region](/glossary.md#regionpeerraft-group) health and node resource utilization.

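To quickly confirm the overall cluster and Region health before proceeding, you can use TiUP and pd-ctl. The following is a minimal sketch, assuming `${old_cluster_name}` is the name of the old cluster and PD is reachable at `${pd_host}:${pd_port}`:

```shell
# Confirm that all components of the old cluster are up.
tiup cluster display ${old_cluster_name}

# Check for unhealthy Regions (for example, Regions missing replicas).
tiup ctl:${cluster_version} pd -u http://${pd_host}:${pd_port} region check miss-peer
```

If the Region check returns an empty list and all services are `Up`, the cluster is in a reasonable state to start the migration.
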
## Step 2: Prepare the new cluster

### 1. Adjust the GC lifetime of the old cluster

To ensure data replication stability, adjust the system variable [`tidb_gc_life_time`](/system-variables.md#tidb_gc_life_time-new-in-v50) to a value that covers the total duration of the following operations and intervals: BR backup, BR restore, cluster upgrade, and TiCDC Changefeed replication setup. Otherwise, the replication task might enter an unrecoverable `failed` state, requiring a restart of the entire migration and upgrade process from a new full backup.

The following example sets `tidb_gc_life_time` to `60h`:

```sql
-- Check the current GC lifetime setting.
SHOW VARIABLES LIKE '%tidb_gc_life_time%';
-- Set the GC lifetime. Note that the value is a string.
SET GLOBAL tidb_gc_life_time = '60h';
```

> **Note:**
>
> Increasing `tidb_gc_life_time` increases storage usage for [MVCC](/glossary.md#multi-version-concurrency-control-mvcc) data and might affect query performance. For more information, see [GC Overview](/garbage-collection-overview.md). Adjust the GC duration based on the estimated operation time while considering the storage and performance impacts.

### 2. Migrate full data to the new cluster

When migrating full data to the new cluster, note the following:

- **Version compatibility**: the BR version used for backup and restore must match the major version of the old cluster.
- **Performance impact**: BR backup consumes system resources. To minimize business impact, perform backups during off-peak hours.
- **Time estimation**: under optimal hardware conditions (no disk I/O or network bandwidth bottlenecks), the estimated times are as follows:

    - Backup speed: backing up 1 TiB of data per TiKV node with 8 threads takes approximately 1 hour.
    - Restore speed: restoring 1 TiB of data per TiKV node takes approximately 20 minutes.

    For example, for 6 TiB of data spread evenly across 3 TiKV nodes (2 TiB per node), the backup takes roughly 2 hours and the restore roughly 40 minutes.

- **Configuration consistency**: ensure that the [`new_collations_enabled_on_first_bootstrap`](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) configuration is identical between the old and new clusters. Otherwise, BR restore will fail. A quick way to check this is sketched after this list.
- **System table restore**: use the `--with-sys-table` option during BR restore to recover system table data.

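To compare the collation setting, you can query the `new_collation_enabled` flag that TiDB records in the `mysql.tidb` system table. The following is a minimal sketch, assuming both clusters are reachable through the MySQL client and the hosts and ports are placeholders:

```shell
# The returned value must be identical on both clusters.
mysql -h ${old_tidb_host} -P ${old_tidb_port} -u root -p -N \
    -e "SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME = 'new_collation_enabled'"
mysql -h ${new_tidb_host} -P ${new_tidb_port} -u root -p -N \
    -e "SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME = 'new_collation_enabled'"
```
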
To migrate full data to the new cluster, take the following steps:

1. Perform a full backup on the old cluster:

    ```shell
    tiup br:${cluster_version} backup full --pd ${pd_host}:${pd_port} -s ${backup_location}
    ```

2. Record the TSO of the old cluster for later TiCDC Changefeed creation:

    ```shell
    tiup br:${cluster_version} validate decode --field="end-version" \
    --storage "s3://xxx?access-key=${access-key}&secret-access-key=${secret-access-key}" | tail -n1
    ```

3. Deploy the new cluster:

    ```shell
    tiup cluster deploy ${new_cluster_name} ${cluster_version} tidb-cluster.yaml
    ```

4. Restore the full backup to the new cluster:

    ```shell
    tiup br:${cluster_version} restore full --pd ${pd_host}:${pd_port} -s ${backup_location} --with-sys-table
    ```

### 3. Upgrade the new cluster to the target version

To save time, you can perform an offline upgrade using the following commands. For more upgrade methods, see [Upgrade TiDB Using TiUP](/upgrade-tidb-using-tiup.md).

```shell
tiup cluster stop <new_cluster_name> # Stop the cluster
tiup cluster upgrade <new_cluster_name> <v_target_version> --offline # Perform offline upgrade
tiup cluster start <new_cluster_name> # Start the cluster
```

To maintain business continuity, you need to replicate essential configurations from the old cluster to the new cluster, such as configuration items and system variables.

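The following is a minimal sketch for comparing global system variables between the two clusters, assuming both are reachable through the MySQL client. Some differences are expected across versions, so focus on the settings that your business has tuned:

```shell
mysql -h ${old_tidb_host} -P ${old_tidb_port} -u root -p -N -e "SHOW GLOBAL VARIABLES" | sort > old_vars.txt
mysql -h ${new_tidb_host} -P ${new_tidb_port} -u root -p -N -e "SHOW GLOBAL VARIABLES" | sort > new_vars.txt
# Review the differences and align only the variables your applications rely on.
diff old_vars.txt new_vars.txt
```
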
## Step 3: Replicate incremental data

### 1. Establish a forward data replication channel

At this stage, the old cluster remains at its original version, while the new cluster has been upgraded to the target version. In this step, you need to establish a forward data replication channel from the old cluster to the new cluster.

> **Note:**
>
> The TiCDC component version must match the major version of the old cluster.

- Create a Changefeed task and set the incremental replication starting point (`${tso}`) to the exact backup TSO recorded in [Step 2](#step-2-prepare-the-new-cluster) to prevent data loss:

    ```shell
    tiup ctl:${cluster_version} cdc changefeed create --server http://${cdc_host}:${cdc_port} --sink-uri="mysql://${username}:${password}@${tidb_endpoint}:${port}" --config config.toml --start-ts ${tso}
    ```

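    The `config.toml` file passed to `--config` is the Changefeed configuration. The following is a minimal sketch that replicates all databases and tables; narrow the filter rules to your business scope as needed:

    ```shell
    cat > config.toml <<'EOF'
    [filter]
    # Replicate all tables. Restrict these rules if only part of the schema is needed.
    rules = ['*.*']
    EOF
    ```
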
- Check the replication task status and confirm that `tso` or `checkpoint` is continuously advancing:

    ```shell
    tiup ctl:${cluster_version} cdc changefeed list --server http://${cdc_host}:${cdc_port}
    ```

    The output is as follows:

    ```shell
    [{
        "id": "cdcdb-cdc-task-standby",
        "summary": {
            "state": "normal",
            "tso": 417886179132964865,
            "checkpoint": "202x-xx-xx xx:xx:xx.xxx",
            "error": null
        }
    }]
    ```

During incremental data replication, continuously monitor the replication channel status and adjust the settings if needed:

- Latency metrics: ensure that `Changefeed checkpoint lag` remains within an acceptable range, such as within 5 minutes.
- Throughput health: ensure that `Sink flush rows/s` consistently exceeds the business write rate.
- Errors and alerts: regularly check TiCDC logs and alert information.
- (Optional) Test data replication: update test data and verify that the Changefeed correctly replicates it to the new cluster.
- (Optional) Adjust the TiCDC configuration item [`gc-ttl`](/ticdc/ticdc-server-config.md) (defaults to 24 hours).

    If a replication task is unavailable or interrupted and cannot be resolved in time, `gc-ttl` ensures that the data needed by TiCDC is retained in TiKV without being cleaned up by garbage collection (GC). If this duration is exceeded, the replication task enters a `failed` state and cannot recover. In this case, PD's GC safe point continues advancing, and a new full backup is required to restart the process.

    Increasing the value of `gc-ttl` accumulates more MVCC data, similar to increasing `tidb_gc_life_time`. Set it to a value that is long enough to cover expected interruptions, but no longer than necessary.

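To inspect a single replication task in more detail, for example its current checkpoint and any error message, you can query it by ID. The following is a sketch, assuming `${changefeed_id}` is the task ID shown in the `changefeed list` output:

```shell
tiup ctl:${cluster_version} cdc changefeed query --server http://${cdc_host}:${cdc_port} -c ${changefeed_id}
```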
|
### 2. Verify data consistency

After data replication is complete, verify data consistency between the old and new clusters using the following methods:

- Use the [sync-diff-inspector](/sync-diff-inspector/sync-diff-inspector-overview.md) tool (a sample configuration is sketched after this list):

    ```shell
    ./sync_diff_inspector --config=./config.toml
    ```

- Use the snapshot configuration of [sync-diff-inspector](/sync-diff-inspector/sync-diff-inspector-overview.md) with the [Syncpoint](/ticdc/ticdc-upstream-downstream-check.md) feature of TiCDC to verify data consistency without stopping Changefeed replication. For more information, see [Upstream and Downstream Clusters Data Validation and Snapshot Read](/ticdc/ticdc-upstream-downstream-check.md).

- Perform manual validation of business data, such as comparing table row counts.

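The following is a minimal `config.toml` sketch for sync-diff-inspector, using placeholder hosts and credentials. The optional `snapshot` fields accept a TSO, which is also how the comparison at `up-tso` and `down-tso` during the switchover in [Step 4](#step-4-switch-business-traffic-and-rollback) can be performed:

```shell
cat > config.toml <<'EOF'
# Number of comparison threads.
check-thread-count = 4

[data-sources.upstream]
host = "${old_tidb_host}"
port = 4000
user = "root"
password = ""
# snapshot = "<tso>"    # Optional: compare data as of a specific TSO.

[data-sources.downstream]
host = "${new_tidb_host}"
port = 4000
user = "root"
password = ""
# snapshot = "<tso>"

[task]
output-dir = "./output"
source-instances = ["upstream"]
target-instance = "downstream"
# The tables to compare. Adjust to your business schema.
target-check-tables = ["${db_name}.*"]
EOF
```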
|
### 3. Finalize the environment setup

This migration procedure restores some system table data using the BR `--with-sys-table` option. For tables outside that scope, you need to restore the data manually. Common items to check and supplement include the following (a sketch of some checks follows this list):

- User privileges: compare the `mysql.user` table.
- Configuration settings: ensure that configuration items and system variables are consistent.
- Auto-increment columns: clear the auto-increment ID caches in the new cluster.
- Statistics: collect statistics manually or enable automatic collection in the new cluster.

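A minimal sketch of the user comparison and manual statistics collection, assuming placeholder hosts and a business table `${db_name}.${table_name}`:

```shell
# Compare the user accounts of the two clusters.
mysql -h ${old_tidb_host} -P 4000 -u root -p -N -e "SELECT user, host FROM mysql.user ORDER BY user, host" > old_users.txt
mysql -h ${new_tidb_host} -P 4000 -u root -p -N -e "SELECT user, host FROM mysql.user ORDER BY user, host" > new_users.txt
diff old_users.txt new_users.txt

# Collect statistics for a business table on the new cluster (repeat per table).
mysql -h ${new_tidb_host} -P 4000 -u root -p -e "ANALYZE TABLE ${db_name}.${table_name}"
```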
|
Additionally, you can scale out the new cluster to handle expected workloads and migrate operational tasks, such as alert subscriptions, scheduled statistics collection scripts, and data backup scripts.

## Step 4: Switch business traffic and rollback

### 1. Prepare for the switchover

- Confirm replication status:

    - Monitor the latency of TiCDC Changefeed replication.
    - Ensure that the incremental replication throughput is greater than or equal to the peak business write rate.

- Perform multi-dimensional validation, such as:

    - Ensure that all data validation steps are complete and perform any necessary additional checks.
    - Conduct sanity or integration tests on the application in the new cluster.

### 2. Execute the switchover

1. Stop application services to prevent the old cluster from handling business traffic. To further restrict access, you can use one of the following methods:

    - Lock the business user accounts in the old cluster:

        ```sql
        ALTER USER '${username}' ACCOUNT LOCK;
        ```

    - Set the old cluster to read-only mode. It is recommended to restart the TiDB nodes in the old cluster afterwards to clear active business sessions and prevent existing connections from writing before the read-only setting takes effect:

        ```sql
        SET GLOBAL tidb_super_read_only=ON;
        ```

2. Ensure that TiCDC catches up:

    - After setting the old cluster to read-only mode, retrieve the current `up-tso`:

        ```sql
        SELECT @@tidb_current_ts;
        ```

    - Monitor the Changefeed `checkpointTs` to confirm that it has surpassed `up-tso`, indicating that TiCDC has completed data replication.

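    The following sketch queries the simplified status of the forward task, assuming `${changefeed_id}` is the forward Changefeed ID. Repeat the query until the reported checkpoint TSO exceeds `up-tso`:

    ```shell
    tiup ctl:${cluster_version} cdc changefeed query -s --server http://${cdc_host}:${cdc_port} -c ${changefeed_id}
    ```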
|
3. Verify data consistency between the new and old clusters:

    - After TiCDC catches up, obtain the `down-tso` from the new cluster.
    - Use the [sync-diff-inspector](/sync-diff-inspector/sync-diff-inspector-overview.md) tool to compare data consistency between the new and old clusters at `up-tso` and `down-tso`, for example, by filling the `snapshot` fields of the configuration sketched in [Step 3](#step-3-replicate-incremental-data).

4. Pause the forward Changefeed replication task:

    ```shell
    tiup ctl:${cluster_version} cdc changefeed pause --server http://${cdc_host}:${cdc_port} -c ${changefeed_id}
    ```

5. Restart the TiDB nodes in the new cluster to clear the auto-increment ID cache.

6. Check the operational status of the new cluster using the following methods:

    - Verify that the TiDB version matches the target version:

        ```shell
        tiup cluster display ${new_cluster_name}
        ```

    - Log into the database and confirm component versions:

        ```sql
        SELECT * FROM INFORMATION_SCHEMA.CLUSTER_INFO;
        ```

    - Use Grafana to monitor service status: navigate to [**Overview > Services Port Status**](/grafana-overview-dashboard.md) and confirm that all services are in the **Up** state.

7. Set up reverse replication from the new cluster to the old cluster.

    1. Unlock the user accounts in the old cluster and restore read-write mode:

        ```sql
        ALTER USER '${username}' ACCOUNT UNLOCK;
        SET GLOBAL tidb_super_read_only=OFF;
        ```

    2. Record the current TSO of the new cluster:

        ```sql
        SELECT @@tidb_current_ts;
        ```

    3. Configure the reverse replication link and ensure that the Changefeed task is running properly:

        - Because business operations are stopped at this stage, you can use the current TSO (`${tso}`).
        - Ensure that `sink-uri` is set to the address of the old cluster to avoid the risk of loopback writes.

        ```shell
        tiup ctl:${cluster_version} cdc changefeed create --server http://${cdc_host}:${cdc_port} --sink-uri="mysql://${username}:${password}@${tidb_endpoint}:${port}" --config config.toml --start-ts ${tso}

        tiup ctl:${cluster_version} cdc changefeed list --server http://${cdc_host}:${cdc_port}
        ```

8. Redirect business traffic to the new cluster.

9. Monitor the load and operational status of the new cluster using the following Grafana panels:

    - [**TiDB Dashboard > Query Summary**](/grafana-tidb-dashboard.md#query-summary): check the **Duration**, **QPS**, and **Failed Query OPM** metrics.
    - [**TiDB Dashboard > Server**](/grafana-tidb-dashboard.md#server): monitor the **Connection Count** metric to ensure an even distribution of connections across nodes.

At this point, business traffic has successfully switched to the new cluster, and the TiCDC reverse replication channel is established.

### 3. Execute emergency rollback

The rollback plan is as follows:

- Check data consistency between the new and old clusters regularly to ensure that the reverse replication link is operating properly.
- Monitor the system for a specified period, such as one week. If issues occur, switch back to the old cluster.
- After the observation period, remove the reverse replication link and delete the old cluster.

The following introduces the usage scenario and steps for an emergency rollback, which redirects traffic back to the old cluster:

- Usage scenario: execute the rollback plan if critical issues cannot be resolved.
- Steps (a command sketch follows this list):

    1. Stop business access to the new cluster.
    2. Reauthorize business accounts and restore read-write access to the old cluster.
    3. Check the reverse replication link, confirm that TiCDC has caught up, and verify data consistency between the new and old clusters.
    4. Redirect business traffic back to the old cluster.

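A sketch of the corresponding commands, assuming `${changefeed_id}` is the reverse Changefeed ID and the other placeholders follow the earlier conventions:

```shell
# Confirm that the reverse Changefeed has caught up before redirecting traffic.
tiup ctl:${cluster_version} cdc changefeed query -s --server http://${cdc_host}:${cdc_port} -c ${changefeed_id}

# Restore write access for business accounts on the old cluster.
mysql -h ${old_tidb_host} -P 4000 -u root -p \
    -e "ALTER USER '${username}' ACCOUNT UNLOCK; SET GLOBAL tidb_super_read_only=OFF;"
```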
|
## Step 5: Clean up

After monitoring the new cluster for a period and confirming stable business operations, you can remove the TiCDC reverse replication and delete the old cluster.

- Remove the TiCDC reverse replication:

    ```shell
    tiup ctl:${cluster_version} cdc changefeed remove --server http://${cdc_host}:${cdc_port} -c ${changefeed_id}
    ```

- Delete the old cluster. If you choose to retain it, restore `tidb_gc_life_time` to its original value:

    ```sql
    -- Restore to the original value before the modification.
    SET GLOBAL tidb_gc_life_time = '10m';
    ```