
Commit 6560e4a

add tidb-upgrade-migration-guide (#20688) (#20694)

1 parent 28b306f commit 6560e4a

2 files changed: +328 −0 lines changed

TOC.md (+1)

```diff
@@ -187,6 +187,7 @@
 - [Use TiUP](/upgrade-tidb-using-tiup.md)
 - [Use TiDB Operator](https://docs.pingcap.com/tidb-in-kubernetes/stable/upgrade-a-tidb-cluster)
 - [TiDB Smooth Upgrade](/smooth-upgrade-tidb.md)
+- [Migrate and Upgrade a TiDB Cluster](/tidb-upgrade-migration-guide.md)
 - [TiFlash Upgrade Guide](/tiflash-upgrade-guide.md)
 - Scale
 - [Use TiUP (Recommended)](/scale-tidb-using-tiup.md)
```

tidb-upgrade-migration-guide.md (+327)

---
title: Migrate and Upgrade a TiDB Cluster
summary: Learn how to migrate and upgrade a TiDB cluster using BR for full backup and restore, along with TiCDC for incremental data replication.
---

# Migrate and Upgrade a TiDB Cluster

This document describes how to migrate and upgrade a TiDB cluster (also known as a blue-green upgrade) using [BR](/br/backup-and-restore-overview.md) for full backup and restore, along with [TiCDC](/ticdc/ticdc-overview.md) for incremental data replication. This solution uses dual-cluster redundancy and incremental replication to enable smooth traffic switchover and fast rollback, providing a reliable and low-risk upgrade path for critical systems. It is recommended to regularly upgrade the database version to continuously benefit from performance improvements and new features, helping you maintain a secure and efficient database system. The key advantages of this solution include:

- **Controllable risk**: supports rollback to the original cluster within minutes, ensuring business continuity.
- **Data integrity**: uses a multi-stage verification mechanism to prevent data loss.
- **Minimal business impact**: requires only a brief maintenance window for the final switchover.

The core workflow for migration and upgrade is as follows:

1. **Pre-check risks**: verify cluster status and solution feasibility.
2. **Prepare the new cluster**: create a new cluster from a full backup of the old cluster and upgrade it to the target version.
3. **Replicate incremental data**: establish a forward data replication channel using TiCDC.
4. **Switch and verify**: perform multi-dimensional verification, switch business traffic to the new cluster, and set up a TiCDC reverse replication channel.
5. **Observe status**: maintain the reverse replication channel. After the observation period, clean up the environment.

**Rollback plan**: if the new cluster encounters issues during the migration and upgrade process, you can switch business traffic back to the original cluster at any time.

The following sections describe the standardized process and general steps for migrating and upgrading a TiDB cluster. The example commands are based on a TiDB Self-Managed environment.

## Step 1: Evaluate solution feasibility

Before migrating and upgrading, evaluate the compatibility of relevant components and check cluster health status.

- Check the TiDB cluster version: this solution applies to TiDB v6.5.0 or later versions. For a quick check, see the query sketch after this list.

- Verify TiCDC compatibility:

    - **Table schema requirements**: ensure that tables to be replicated contain valid indexes. For more information, see [TiCDC valid index](/ticdc/ticdc-overview.md#best-practices).
    - **Feature limitations**: TiCDC does not support Sequence or TiFlash DDL replication. For more information, see [TiCDC unsupported scenarios](/ticdc/ticdc-overview.md#unsupported-scenarios).
    - **Best practices**: avoid executing DDL operations on the upstream cluster of TiCDC during switchover.

- Verify BR compatibility:

    - Review the compatibility matrix of BR full backup. For more information, see [BR version compatibility matrix](/br/backup-and-restore-overview.md#version-compatibility).
    - Check the known limitations of BR backup and restore. For more information, see [BR usage restrictions](/br/backup-and-restore-overview.md#restrictions).

- Check the health status of the cluster, such as [Region](/glossary.md#regionpeerraft-group) health and node resource utilization.
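
The following SQL sketch shows one way to run the first two checks: it prints the cluster version and lists tables that have neither a primary key nor a unique constraint, as a rough approximation of the TiCDC valid index requirement. The schema filter and the approximation are assumptions; review any flagged tables against the TiCDC valid index rules, which additionally require unique indexes to cover `NOT NULL` columns.

```sql
-- Check the current TiDB version of the old cluster.
SELECT tidb_version();

-- Roughly list user tables without a primary key or unique constraint.
-- This is only a first-pass filter for the TiCDC valid index requirement.
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
WHERE t.table_type = 'BASE TABLE'
  AND UPPER(t.table_schema) NOT IN ('MYSQL', 'INFORMATION_SCHEMA', 'PERFORMANCE_SCHEMA', 'METRICS_SCHEMA')
  AND (t.table_schema, t.table_name) NOT IN (
      SELECT table_schema, table_name
      FROM information_schema.table_constraints
      WHERE constraint_type IN ('PRIMARY KEY', 'UNIQUE')
  );
```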

## Step 2: Prepare the new cluster

### 1. Adjust the GC lifetime of the old cluster

To ensure data replication stability, adjust the system variable [`tidb_gc_life_time`](/system-variables.md#tidb_gc_life_time-new-in-v50) to a value that covers the total duration of the following operations and intervals: BR backup, BR restore, cluster upgrade, and TiCDC Changefeed replication setup. Otherwise, the replication task might enter an unrecoverable `failed` state, requiring a restart of the entire migration and upgrade process from a new full backup.

The following example sets `tidb_gc_life_time` to `60h`:

```sql
-- Check the current GC lifetime setting.
SHOW VARIABLES LIKE '%tidb_gc_life_time%';
-- Set GC lifetime.
SET GLOBAL tidb_gc_life_time='60h';
```

> **Note:**
>
> Increasing `tidb_gc_life_time` increases storage usage for [MVCC](/glossary.md#multi-version-concurrency-control-mvcc) data and might affect query performance. For more information, see [GC Overview](/garbage-collection-overview.md). Adjust the GC duration based on estimated operation time while considering storage and performance impacts.
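
To confirm that the new value is in effect and to keep an eye on how far the GC safe point trails behind, you can read the GC bookkeeping rows that TiDB stores in the `mysql.tidb` table. A minimal sketch:

```sql
-- Check the effective GC lifetime, the last GC run, and the current GC safe point.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_last_run_time', 'tikv_gc_safe_point');
```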

### 2. Migrate full data to the new cluster

When migrating full data to the new cluster, note the following:

- **Version compatibility**: the BR version used for backup and restore must match the major version of the old cluster.
- **Performance impact**: BR backup consumes system resources. To minimize business impact, perform backups during off-peak hours.
- **Time estimation**: under optimal hardware conditions (no disk I/O or network bandwidth bottlenecks), estimated times are:

    - Backup speed: backing up 1 TiB of data per TiKV node with 8 threads takes approximately 1 hour.
    - Restore speed: restoring 1 TiB of data per TiKV node takes approximately 20 minutes.

- **Configuration consistency**: ensure that the [`new_collations_enabled_on_first_bootstrap`](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) configuration is identical between the old and new clusters. Otherwise, BR restore will fail. For a quick comparison, see the sketch after this list.
- **System table restore**: use the `--with-sys-table` option during BR restore to recover system table data.
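
Because the collation framework cannot be changed after a cluster is bootstrapped, it is worth confirming this setting before the restore. The sketch below reads the `new_collation_enabled` row that TiDB records in the `mysql.tidb` table; run it on both clusters and compare the results:

```sql
-- Returns "True" if the cluster was bootstrapped with the new collation framework.
SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME = 'new_collation_enabled';
```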

To migrate full data to the new cluster, take the following steps:

1. Perform a full backup on the old cluster:

    ```shell
    tiup br:${cluster_version} backup full --pd ${pd_host}:${pd_port} -s ${backup_location}
    ```

2. Record the TSO of the old cluster for later TiCDC Changefeed creation:

    ```shell
    tiup br:${cluster_version} validate decode --field="end-version" \
    --storage "s3://xxx?access-key=${access-key}&secret-access-key=${secret-access-key}" | tail -n1
    ```

3. Deploy the new cluster:

    ```shell
    tiup cluster deploy ${new_cluster_name} ${cluster_version} tidb-cluster.yaml
    ```

4. Restore the full backup to the new cluster:

    ```shell
    tiup br:${cluster_version} restore full --pd ${pd_host}:${pd_port} -s ${backup_location} --with-sys-table
    ```

### 3. Upgrade the new cluster to the target version

To save time, you can perform an offline upgrade using the following commands. For more upgrade methods, see [Upgrade TiDB Using TiUP](/upgrade-tidb-using-tiup.md).

```shell
tiup cluster stop <new_cluster_name>                                  # Stop the cluster
tiup cluster upgrade <new_cluster_name> <v_target_version> --offline  # Perform offline upgrade
tiup cluster start <new_cluster_name>                                 # Start the cluster
```

To maintain business continuity, you need to replicate essential configurations from the old cluster to the new cluster, such as configuration items and system variables.
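
One straightforward way to find configuration drift is to dump the global system variables from both clusters and diff the results, then reapply any intentional customizations on the new cluster. A minimal sketch, assuming the MySQL client can reach both clusters; the host variables are placeholders:

```shell
# Dump global system variables from the old and new clusters (placeholder hosts).
mysql -h ${old_tidb_host} -P 4000 -u root -p -N -e "SHOW GLOBAL VARIABLES" | sort > old_vars.txt
mysql -h ${new_tidb_host} -P 4000 -u root -p -N -e "SHOW GLOBAL VARIABLES" | sort > new_vars.txt

# Review the differences and reapply intentional settings on the new cluster.
diff old_vars.txt new_vars.txt
```

Expect some benign differences, such as version-specific variables and the temporarily increased `tidb_gc_life_time` on the old cluster. Configuration items of other components can be compared in a similar way, for example by reviewing the TiUP topology and configuration of both clusters.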

## Step 3: Replicate incremental data

### 1. Establish a forward data replication channel

At this stage, the old cluster remains at its original version, while the new cluster has been upgraded to the target version. In this step, you need to establish a forward data replication channel from the old cluster to the new cluster.

> **Note:**
>
> The TiCDC component version must match the major version of the old cluster.

- Create a Changefeed task and set the incremental replication starting point (`${tso}`) to the exact backup TSO recorded in [Step 2](#step-2-prepare-the-new-cluster) to prevent data loss:

    ```shell
    tiup ctl:${cluster_version} cdc changefeed create --server http://${cdc_host}:${cdc_port} --sink-uri="mysql://${username}:${password}@${tidb_endpoint}:${port}" --config config.toml --start-ts ${tso}
    ```

- Check the replication task status and confirm that `tso` or `checkpoint` is continuously advancing:

    ```shell
    tiup ctl:${cluster_version} cdc changefeed list --server http://${cdc_host}:${cdc_port}
    ```

    The output is as follows:

    ```shell
    [{
        "id": "cdcdb-cdc-task-standby",
        "summary": {
            "state": "normal",
            "tso": 417886179132964865,
            "checkpoint": "202x-xx-xx xx:xx:xx.xxx",
            "error": null
        }
    }]
    ```

During incremental data replication, continuously monitor the replication channel status and adjust settings if needed:

- Latency metrics: ensure that `Changefeed checkpoint lag` remains within an acceptable range, such as within 5 minutes.
- Throughput health: ensure that `Sink flush rows/s` consistently exceeds the business write rate.
- Errors and alerts: regularly check TiCDC logs and alert information.
- (Optional) Test data replication: update test data and verify that Changefeed correctly replicates it to the new cluster.
- (Optional) Adjust the TiCDC configuration item [`gc-ttl`](/ticdc/ticdc-server-config.md) (defaults to 24 hours).

If a replication task is unavailable or interrupted and cannot be resolved in time, `gc-ttl` ensures that data needed by TiCDC is retained in TiKV without being cleaned by garbage collection (GC). If this duration is exceeded, the replication task enters a `failed` state and cannot recover. In this case, PD's GC safe point continues advancing, requiring a new backup to restart the process.

Increasing the value of `gc-ttl` accumulates more MVCC data, similar to increasing `tidb_gc_life_time`. Set it to a value that is long enough to cover possible interruptions but not excessively long.
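
Beyond the summary that `changefeed list` prints, you can query a single Changefeed for its detailed status, including the checkpoint and any error information. A sketch, reusing the changefeed ID from the example output above and the same `-c` flag used elsewhere in this document:

```shell
# Query the detailed status of one changefeed, including checkpoint progress and errors.
tiup ctl:${cluster_version} cdc changefeed query --server http://${cdc_host}:${cdc_port} -c cdcdb-cdc-task-standby
```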

### 2. Verify data consistency

After data replication is complete, verify data consistency between the old and new clusters using the following methods:

- Use the [sync-diff-inspector](/sync-diff-inspector/sync-diff-inspector-overview.md) tool:

    ```shell
    ./sync_diff_inspector --config=./config.toml
    ```

- Use the snapshot configuration of [sync-diff-inspector](/sync-diff-inspector/sync-diff-inspector-overview.md) with the [Syncpoint](/ticdc/ticdc-upstream-downstream-check.md) feature of TiCDC to verify data consistency without stopping Changefeed replication. For more information, see [Upstream and Downstream Clusters Data Validation and Snapshot Read](/ticdc/ticdc-upstream-downstream-check.md).

- Perform manual validation of business data, such as comparing table row counts (see the sketch after this list).
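
For a manual spot check, run the same aggregate on both clusters and compare the results. `ADMIN CHECKSUM TABLE` gives a stronger signal than a plain row count, at the cost of scanning the whole table; the database and table names below are placeholders:

```sql
-- Run on both clusters and compare the results (placeholder names).
SELECT COUNT(*) FROM ${db_name}.${table_name};

-- Optional stronger check: compute a table checksum (scans the whole table).
ADMIN CHECKSUM TABLE ${db_name}.${table_name};
```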

### 3. Finalize the environment setup

This migration procedure restores some system table data using the BR `--with-sys-table` option. For system tables outside that scope, you need to restore the data manually. Common items to check and supplement include the following (see the sketch after this list):

- User privileges: compare the `mysql.user` table.
- Configuration settings: ensure that configuration items and system variables are consistent.
- Auto-increment columns: clear auto-increment ID caches in the new cluster.
- Statistics: collect statistics manually or enable automatic collection in the new cluster.
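
Some of these checks can be run directly in SQL. The sketch below compares accounts, spot-checks grants, and collects statistics manually; the account and table names are placeholders, and the auto-increment ID cache is cleared later in this guide by restarting the TiDB nodes of the new cluster:

```sql
-- Compare user accounts between the old and new clusters.
SELECT user, host FROM mysql.user ORDER BY user, host;

-- Spot-check the privileges of an important business account (placeholder name).
SHOW GRANTS FOR '${username}'@'%';

-- Manually collect statistics for a key business table (placeholder names).
ANALYZE TABLE ${db_name}.${table_name};
```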

Additionally, you can scale out the new cluster to handle expected workloads and migrate operational tasks, such as alert subscriptions, scheduled statistics collection scripts, and data backup scripts.

## Step 4: Switch business traffic and rollback

### 1. Prepare for the switchover

- Confirm replication status:

    - Monitor the latency of TiCDC Changefeed replication.
    - Ensure that the incremental replication throughput is greater than or equal to the peak business write rate.

- Perform multi-dimensional validation, such as:

    - Ensure that all data validation steps are complete and perform any necessary additional checks.
    - Conduct sanity or integration tests on the application in the new cluster.

### 2. Execute the switchover

1. Stop application services to prevent the old cluster from handling business traffic. To further restrict access, you can use one of the following methods:

    - Lock user accounts in the old cluster:

        ```sql
        -- Lock each business account (placeholder account name).
        ALTER USER '${username}' ACCOUNT LOCK;
        ```

    - Set the old cluster to read-only mode. It is recommended to restart the TiDB nodes in the old cluster to clear active business sessions and avoid connections that have not yet picked up the read-only setting:

        ```sql
        SET GLOBAL tidb_super_read_only=ON;
        ```

2. Ensure TiCDC catches up:

    - After setting the old cluster to read-only mode, retrieve the current `up-tso`:

        ```sql
        SELECT tidb_current_ts();
        ```

    - Monitor the Changefeed `checkpointTs` to confirm it has surpassed `up-tso`, indicating that TiCDC has completed data replication (for a TSO conversion sketch, see the end of this section).

3. Verify data consistency between the new and old clusters:

    - After TiCDC catches up, obtain the `down-tso` from the new cluster.
    - Use the [sync-diff-inspector](/sync-diff-inspector/sync-diff-inspector-overview.md) tool to compare data consistency between the new and old clusters at `up-tso` and `down-tso`.

4. Pause the forward Changefeed replication task:

    ```shell
    tiup ctl:${cluster_version} cdc changefeed pause --server http://${cdc_host}:${cdc_port} -c <changefeedid>
    ```

5. Restart the TiDB nodes in the new cluster to clear the auto-increment ID cache.

6. Check the operational status of the new cluster using the following methods:

    - Verify that the TiDB version matches the target version:

        ```shell
        tiup cluster display <cluster-name>
        ```

    - Log in to the database and confirm component versions:

        ```sql
        SELECT * FROM INFORMATION_SCHEMA.CLUSTER_INFO;
        ```

    - Use Grafana to monitor service status: navigate to [**Overview > Services Port Status**](/grafana-overview-dashboard.md) and confirm that all services are in the **Up** state.

7. Set up reverse replication from the new cluster to the old cluster.

    1. Unlock user accounts in the old cluster and restore read-write mode:

        ```sql
        -- Unlock each business account (placeholder account name).
        ALTER USER '${username}' ACCOUNT UNLOCK;
        SET GLOBAL tidb_super_read_only=OFF;
        ```

    2. Record the current TSO of the new cluster:

        ```sql
        SELECT tidb_current_ts();
        ```

    3. Configure the reverse replication link and ensure the Changefeed task is running properly:

        - Because business operations are stopped at this stage, you can use the current TSO.
        - Ensure that `sink-uri` is set to the address of the old cluster to avoid the risk of loopback writes.

        ```shell
        tiup ctl:${cluster_version} cdc changefeed create --server http://${cdc_host}:${cdc_port} --sink-uri="mysql://${username}:${password}@${tidb_endpoint}:${port}" --config config.toml --start-ts ${tso}

        tiup ctl:${cluster_version} cdc changefeed list --server http://${cdc_host}:${cdc_port}
        ```

8. Redirect business traffic to the new cluster.

9. Monitor the load and operational status of the new cluster using the following Grafana panels:

    - [**TiDB Dashboard > Query Summary**](/grafana-tidb-dashboard.md#query-summary): check the Duration, QPS, and Failed Query OPM metrics.
    - [**TiDB Dashboard > Server**](/grafana-tidb-dashboard.md#server): monitor the **Connection Count** metric to ensure even distribution of connections across nodes.

At this point, business traffic has successfully switched to the new cluster, and the TiCDC reverse replication channel is established.
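
When comparing the Changefeed `checkpointTs` with `up-tso` in step 2, it can help to convert the TSO values into physical timestamps. A sketch using pd-ctl, assuming the PD address of the old cluster used earlier; `${up_tso}` is the TSO recorded after enabling read-only mode:

```shell
# Convert a TSO into its physical timestamp to make the comparison easier to read.
tiup ctl:${cluster_version} pd -u http://${pd_host}:${pd_port} tso ${up_tso}
```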

### 3. Execute emergency rollback

The rollback plan is as follows:

- Check data consistency between the new and old clusters regularly to ensure the reverse replication link is operating properly.
- Monitor the system for a specified period, such as one week. If issues occur, switch back to the old cluster.
- After the observation period, remove the reverse replication link and delete the old cluster.

An emergency rollback redirects business traffic back to the old cluster. Its usage scenario and steps are as follows:

- Usage scenario: execute the rollback plan if critical issues cannot be resolved.
- Steps:

    1. Stop business access to the new cluster.
    2. Reauthorize business accounts and restore read-write access to the old cluster.
    3. Check the reverse replication link, confirm that TiCDC has caught up, and verify data consistency between the new and old clusters.
    4. Redirect business traffic back to the old cluster.

## Step 5: Clean up

After monitoring the new cluster for a period and confirming stable business operations, you can remove the TiCDC reverse replication and delete the old cluster.

- Remove the TiCDC reverse replication:

    ```shell
    tiup ctl:${cluster_version} cdc changefeed remove --server http://${cdc_host}:${cdc_port} -c <changefeedid>
    ```

- Delete the old cluster. If you choose to retain it, restore `tidb_gc_life_time` to its original value:

    ```sql
    -- Restore to the original value before modification.
    SET GLOBAL tidb_gc_life_time='10m';
    ```
