
Merge tikv 7.5 #400

Open · wants to merge 221 commits into base: raftstore-proxy-7.5
Conversation

CalvinNeo
Member

What is changed and how it works?

Issue Number: Close #xxx

What's Changed:


Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note


Connor1996 and others added 30 commits August 24, 2023 04:09
close tikv#11161

Add back the heap profile HTTP API and make it secure. The API was removed by tikv#11162 due to a
security issue that allowed visiting arbitrary files on the server. This PR makes it show only the file
name instead of the absolute path, and adds a paranoid check to make sure the passed file
name is in the set of heap profiles.
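As an illustrative sketch only (the file layout and helper names are assumptions, not the actual TiKV handler), the check amounts to validating the requested name against the set of known profile file names:

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Collect the bare file names of known heap profiles (hypothetical helper).
fn known_profile_names(dump_dir: &Path) -> std::io::Result<HashSet<String>> {
    let mut names = HashSet::new();
    for entry in std::fs::read_dir(dump_dir)? {
        let path = entry?.path();
        // Expose only the file name, never the absolute path.
        if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
            names.insert(name.to_owned());
        }
    }
    Ok(names)
}

/// Reject any requested name that is not a known profile, which also
/// rules out path traversal such as "../../etc/passwd".
fn resolve_profile(dump_dir: &Path, requested: &str) -> std::io::Result<Option<PathBuf>> {
    let known = known_profile_names(dump_dir)?;
    Ok(known.contains(requested).then(|| dump_dir.join(requested)))
}
```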

Signed-off-by: Connor1996 <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
… tablet (tikv#15332)

ref tikv#12842

- Fix a bug in compact range that causes a dirty tablet to be reported as clean.
- Add an additional check to ensure trim's correctness.
- Fix a bug where some tablets are not destroyed and block peer destroy progress.

Signed-off-by: tabokie <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…shutting down (tikv#15426)

ref tikv#15202

Do not panic in the case of an unexpectedly dropped channel when shutting down.

Signed-off-by: SpadeA-Tang <[email protected]>
…ikv#15427)

close tikv#15282

Disable the duplicated MVCC key check in compaction by default.

Signed-off-by: SpadeA-Tang <[email protected]>
close tikv#15357

Correct the raft_router/apply_router's alive and leak metrics.

Signed-off-by: tonyxuqqi <[email protected]>
…ikv#15440)

close tikv#15438

fix unwrap panic of region_compact_redundant_rows_percent

Signed-off-by: SpadeA-Tang <[email protected]>
close tikv#15430

Use concurrent hashmap to avoid router cache occupying too much memory
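For illustration only (TiKV's actual cache type differs), a sketch of the idea using the `dashmap` crate as a stand-in concurrent map: one shared map replaces per-thread cache clones, so memory no longer scales with the number of threads.

```rust
use std::sync::Arc;

use dashmap::DashMap;

/// Hypothetical stand-in for a mailbox handle.
#[derive(Clone)]
struct Mailbox;

/// A single shared concurrent map instead of a per-thread LRU clone.
#[derive(Clone)]
struct RouterCache {
    inner: Arc<DashMap<u64, Mailbox>>,
}

impl RouterCache {
    fn new() -> Self {
        RouterCache { inner: Arc::new(DashMap::new()) }
    }

    /// Lock-sharded lookup; safe to call from many threads concurrently.
    fn get(&self, region_id: u64) -> Option<Mailbox> {
        self.inner.get(&region_id).map(|m| m.value().clone())
    }

    fn insert(&self, region_id: u64, mailbox: Mailbox) {
        self.inner.insert(region_id, mailbox);
    }
}
```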

Signed-off-by: Connor1996 <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#13311

Fix the possible meta inconsistency issue.

Signed-off-by: cfzjywxk <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#14864

This is the first PR to fix OOM caused by Resolver tracking large txns.
Resolver checks memory quota before tracking a lock, and returns false
if it exceeds memory quota.
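A minimal sketch of that idea, assuming an atomic byte counter as the quota (the type and accounting are illustrative, not the exact tikv implementation):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative memory quota: a shared counter with a hard cap.
struct MemoryQuota {
    in_use: AtomicUsize,
    capacity: usize,
}

impl MemoryQuota {
    /// Try to reserve `bytes`; fail instead of exceeding the cap.
    fn alloc(&self, bytes: usize) -> bool {
        let mut cur = self.in_use.load(Ordering::Relaxed);
        loop {
            if cur + bytes > self.capacity {
                return false;
            }
            match self.in_use.compare_exchange_weak(
                cur,
                cur + bytes,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => cur = actual,
            }
        }
    }
}

/// Track a lock only if the quota admits it; returning false lets the
/// caller back off instead of growing the lock map without bound.
fn track_lock(quota: &MemoryQuota, key: &[u8]) -> bool {
    if !quota.alloc(key.len()) {
        return false;
    }
    // ... insert the key into the resolver's lock map ...
    true
}
```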

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…tikv#15425)

close tikv#15424

Signed-off-by: glorv <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#14864

Fix resolved ts OOM caused by Resolver tracking large txns. `ObserveRegion` is
deregistered if it exceeds memory quota. It may cause higher CPU usage because
of scanning locks, but it's better than OOM.

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…15453)

ref tikv#12842

support column family based write buffer manager

Signed-off-by: SpadeA-Tang <[email protected]>
ref tikv/pd#6556, close tikv#15428

pd_client: add store-level backoff for the reconnect retries

Signed-off-by: nolouch <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#12842

- Initialize `persisted_apply_index` on startup.

Signed-off-by: tabokie <[email protected]>
…for mvcc scan (tikv#15455)

ref tikv#14654

Consider the mismatch between region range and tablet range for MVCC scan.
close tikv#12304

Add logs for assertion failure

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#15403

1. Support updating the split config dynamically. Previously, the `optimize_for` function made the config immutable.

Signed-off-by: bufferflies <[email protected]>
ref tikv#15409

reuse failpoint tests in async_io_test

Signed-off-by: SpadeA-Tang <[email protected]>
close tikv#15490

Avoid duplicated `Instant::now` calls.
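The pattern, as a generic sketch: capture the timestamp once per batch and pass it down, rather than calling `Instant::now` for every item on the hot path.

```rust
use std::time::Instant;

fn process_batch(items: &[u64]) {
    // One clock read for the whole batch instead of one per item.
    let now = Instant::now();
    for item in items {
        record(*item, now);
    }
}

fn record(_item: u64, _at: Instant) {
    // ... bookkeeping that needs a timestamp ...
}
```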

Signed-off-by: SpadeA-Tang <[email protected]>
close tikv#15458

Resolver owns a hash map to track locks and unlock events, which it uses
to calculate the resolved ts. However, it does not shrink the map even
after all locks are removed; this may result in OOM if there are
transactions that modify many rows across many regions. The total memory
usage is proportional to the number of modified rows.
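One common remedy, sketched against a plain `HashMap` (illustrative only, with an arbitrary threshold): release the map's excess capacity once it becomes mostly empty.

```rust
use std::collections::HashMap;

/// A HashMap keeps its peak allocation even after all keys are removed,
/// so shrink it once occupancy drops well below capacity.
fn maybe_shrink(locks: &mut HashMap<Vec<u8>, u64>) {
    const SHRINK_CAPACITY: usize = 1024; // hypothetical threshold
    if locks.capacity() > SHRINK_CAPACITY && locks.len() < locks.capacity() / 4 {
        locks.shrink_to_fit();
    }
}
```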

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#15468

Return `RegionNotFound` when the peer cannot be found in the current store.

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#8235

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…15504)

close tikv#15503

fix panic of dynamic changing write-buffer-limit

Signed-off-by: SpadeA-Tang <[email protected]>
ref tikv#15409

reuse failpoint tests in test_early_apply

Signed-off-by: SpadeA-Tang <[email protected]>
…15456)

close tikv#15457

There are several triggers that can split a region:
1. Load-based split (size, keys, load, etc.). In these cases, the new region should contain the data after the split.
2. TiDB splitting tables or partitioned tables, for example `create table test.t1(id int,b int) shard_row_id_bits=4 partition by hash(id) partitions 2000`. In these cases, the new region shouldn't contain any data after the split.

Signed-off-by: bufferflies <[email protected]>
ref tikv#15461

Limit the number of flushes during server stop.

Signed-off-by: SpadeA-Tang <[email protected]>
ref tikv#14864

* Fix resolved ts OOM caused by adding large txns locks to `ResolverStatus`.
* Add initial scan backoff duration metrics.

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Co-authored-by: Connor <[email protected]>
ti-chi-bot and others added 25 commits August 29, 2024 23:58
…ocksdb compaction (tikv#17431) (tikv#17435)

close tikv#17269

compaction-filter: consider mvcc.delete as redundant key to trigger Rocksdb compaction

Signed-off-by: Shirly <[email protected]>

Co-authored-by: Shirly <[email protected]>
close tikv#17471

Add a script to renew certificates and fix the flaky test
`test_security_status_service_without_cn`.

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: Neil Shen <[email protected]>
…ethod (tikv#17357) (tikv#17481)

close tikv#17368

* Add a log to indicate the memory quota is freed when dropping the `Drain`.
* Free the truncated scanned event memory quota.
* Refactor the `finish_scan_lock` method to remove the else branch.
* Row size calculation should also consider the old value.
* Remove some outdated TODOs.

Signed-off-by: 3AceShowHand <[email protected]>

Co-authored-by: 3AceShowHand <[email protected]>
Co-authored-by: Ling Jin <[email protected]>
…sts (tikv#17500) (tikv#17517)

close tikv#17394

lock_manager: Skip updating lock wait info for non-fair-locking requests

This is a simpler and lower-risk fix of the OOM issue tikv#17394 for released branches, as an alternative solution to tikv#17451.
In this way, for acquire_pessimistic_lock requests without fair locking enabled, `update_wait_for` becomes a no-op, so if fair locking is globally disabled, the behavior is equivalent to versions before 7.0.
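Conceptually the fix is an early-return guard like this sketch (the field and function names are assumptions, not the exact tikv code):

```rust
/// Hypothetical request context for illustration.
struct LockWaitContext {
    /// Set only when fair locking is enabled for the request.
    allow_lock_with_conflict: bool,
}

fn update_wait_for(ctx: &LockWaitContext /* , wait info ... */) {
    // Without fair locking there is nothing to update; returning early
    // restores the pre-7.0 behavior and sidesteps the OOM path.
    if !ctx.allow_lock_with_conflict {
        return;
    }
    // ... update lock wait info for fair locking ...
}
```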

Signed-off-by: MyonKeminta <[email protected]>

Co-authored-by: MyonKeminta <[email protected]>
close tikv#17356

Make the disk-full check mechanism compatible with the configuration `raft-engine.spill-dir`.

Signed-off-by: lucasliang <[email protected]>

Co-authored-by: glorv <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#17272

TiKV no longer names bloom filter blocks with suffix like "FullBloom" or "Ribbon".

Signed-off-by: Yang Zhang <[email protected]>

Co-authored-by: Yang Zhang <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…tikv#17566)

close tikv#17469

The commit fixes a panic in TiKV that occurs in a rare scenario that
involves region splits and immediate removal of the new peer.

When a region splits, the new peer on a follower can be created in two 
ways: (1) By receiving a Raft message from the new region 
(`fn maybe_create_peer`) (2) By applying the split operation locally 
(`fn on_ready_split_region`). 

Depending on timing, a new peer might first be created by a Raft 
message and then again when the split is applied. This is a known 
situation. When it happens, the second peer replaces the first, and the 
first peer is discarded. However, the discarded peer may continue 
processing existing messages, leading to unexpected states.

The panic can be reproduced with the following sequence of events:
1. The first peer is created by a Raft message and is waiting for a 
   Raft snapshot.
2. The second peer (of the same region) is created by 
   `on_ready_split_region` when the split operation is applied, 
   replacing the first peer and closing its mailbox (as expected).
3. The second peer is immediately removed. This removes the region 
   metadata.
4. The first peer continues processing the Raft snapshot message, 
   expecting the metadata of the region to exist, causing the panic.
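A defensive guard for step 4 might look like this sketch (the types are simplified stand-ins; the actual fix in the PR may take a different shape):

```rust
use std::collections::HashMap;

/// Simplified stand-ins for illustration.
struct StoreMeta {
    regions: HashMap<u64, ()>, // region_id -> region metadata
}

struct Peer {
    region_id: u64,
}

impl Peer {
    /// Before acting on a pending snapshot message, verify that the
    /// region metadata still exists; a discarded peer whose region has
    /// been removed should drop the message instead of panicking.
    fn handle_snapshot_msg(&self, meta: &StoreMeta) {
        if !meta.regions.contains_key(&self.region_id) {
            return; // stale peer, skip the message
        }
        // ... proceed with normal snapshot handling ...
    }
}
```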

Signed-off-by: Bisheng Huang <[email protected]>

Co-authored-by: Bisheng Huang <[email protected]>
…#17458) (tikv#17565)

close tikv#17304

Fix unexpected flow control after unsafe destroy range

The flow controller detects a jump in pending compaction bytes before and after an unsafe destroy range. If there is a jump, the controller enters a state in which it ignores the high pending compaction bytes until they fall back to normal. Previously, the controller might not enter that state if the pending compaction bytes were below the threshold while the long-term average pending bytes was still high, and it would then trigger flow control mistakenly.
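A sketch of the sharpened condition (the smoothing factor and names are illustrative): consider the jump significant if either the instantaneous value or a long-term moving average is above the threshold.

```rust
/// Illustrative tracker for pending compaction bytes.
struct PendingBytesTracker {
    long_term_avg: f64,
    threshold: f64,
}

impl PendingBytesTracker {
    /// Exponential moving average with an assumed smoothing factor.
    fn update(&mut self, pending_bytes: f64) {
        const ALPHA: f64 = 0.1;
        self.long_term_avg = ALPHA * pending_bytes + (1.0 - ALPHA) * self.long_term_avg;
    }

    /// Enter the "ignore pending compaction bytes" state after an unsafe
    /// destroy range if either signal is still high; checking only the
    /// current value misses the case described above.
    fn should_ignore_pending_bytes(&self, pending_bytes: f64) -> bool {
        pending_bytes > self.threshold || self.long_term_avg > self.threshold
    }
}
```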

Signed-off-by: Connor1996 <[email protected]>

Co-authored-by: Connor1996 <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…#17326) (tikv#17567)

close tikv#16229

Reduce the memory usage of peers' message channel

Signed-off-by: lucasliang <[email protected]>

Co-authored-by: lucasliang <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…he log task (tikv#17317) (tikv#17570)

close tikv#17316

clean `pause-guard-gc-safepoint` when unregister the log task

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: Jianjun Liao <[email protected]>

Co-authored-by: Jianjun Liao <[email protected]>
Co-authored-by: Jianjun Liao <[email protected]>
…#17591)

close tikv#17579

Fix inaccurate storage async write duration metric, which mistakenly
included task wait time in the scheduler worker pool. This occurs
because the metric is observed in a future running on the scheduler
worker pool, leading to inflated values, especially under load. This
can be misleading and cause confusion during troubleshooting.
This commit corrects the metric by observing it in the async write callback.
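In outline, the fix moves the observation into the completion callback, along these lines (a sketch with hypothetical names):

```rust
use std::time::Instant;

/// Hypothetical metrics hook.
fn observe_async_write_duration(seconds: f64) {
    // ... record into the histogram ...
    let _ = seconds;
}

/// Stand-in for handing a write and its callback to the raftstore.
fn submit_write<F: FnOnce() + Send + 'static>(on_done: F) {
    on_done();
}

fn schedule_async_write() {
    let start = Instant::now();
    // Observe when the write actually finishes, not in a future that may
    // queue in the scheduler worker pool and inflate the measurement.
    submit_write(move || {
        observe_async_write_duration(start.elapsed().as_secs_f64());
    });
}
```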

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: lucasliang <[email protected]>

Co-authored-by: Neil Shen <[email protected]>
Co-authored-by: lucasliang <[email protected]>
close tikv#17224

Add a disk usage check when executing the `download` and `apply` RPCs from BR.
When the disk status is not `Normal`, the request is rejected.
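The guard is conceptually simple; a sketch (the enum mirrors TiKV's disk usage states, the function name is hypothetical):

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum DiskUsage {
    Normal,
    AlmostFull,
    AlreadyFull,
}

#[derive(Debug)]
struct DiskFullError;

/// Reject download/apply requests unless the disk is healthy, so a
/// restore cannot push an almost-full store over the edge.
fn check_disk_before_ingest(usage: DiskUsage) -> Result<(), DiskFullError> {
    if usage != DiskUsage::Normal {
        return Err(DiskFullError);
    }
    Ok(())
}
```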

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: hillium <[email protected]>

Co-authored-by: ris <[email protected]>
Co-authored-by: hillium <[email protected]>
…) (tikv#17598)

close tikv#17589

Add some metrics for the resource control priority resource limiter. Also adjust the build parameters of `QuotaLimiter` in the resource control module to avoid triggering waits too frequently.

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: glorv <[email protected]>

Co-authored-by: glorv <[email protected]>
…kv#17656)

close tikv#16601, close tikv#17620

cdc: filter events with the observed range before load old values

Signed-off-by: qupeng <[email protected]>
close tikv#17808

Use rust-rocksdb tikv-7.5 for 7.5 release

Signed-off-by: Yang Zhang <[email protected]>
close tikv#17689

Fixing yanked futures-util 0.3.15

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: glorv <[email protected]>

Co-authored-by: Yang Zhang <[email protected]>
Co-authored-by: glorv <[email protected]>
close tikv#17852

expr: fix panic when using radians and degree

Signed-off-by: gengliqi <[email protected]>

Co-authored-by: gengliqi <[email protected]>
…17841) (tikv#17848)

close tikv#17840

Skip handling remaining Raft messages after the peer FSM is stopped. This avoids a potential panic if a Raft message needs to read the Raft log from the Raft engine.
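Schematically, with simplified stand-in types:

```rust
/// Simplified stand-in for a peer FSM.
struct PeerFsm {
    stopped: bool,
}

enum Msg {
    RaftMessage, // payload elided
}

impl PeerFsm {
    fn handle_msgs(&mut self, msgs: Vec<Msg>) {
        for msg in msgs {
            // Once the FSM is stopped its raft logs may already be cleaned
            // up, so reading them for the remaining messages could panic;
            // drop the rest of the batch instead.
            if self.stopped {
                return;
            }
            self.handle_msg(msg);
        }
    }

    fn handle_msg(&mut self, _msg: Msg) {
        // ... normal handling; may set self.stopped ...
    }
}
```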

Signed-off-by: glorv <[email protected]>

Co-authored-by: glorv <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…individual disk performance factors.(tikv#17801) (tikv#17901)

close tikv#17884

This PR introduces an extra, individual inspector to detect I/O hang issues on the kvdb disk when kvdb is deployed with a separate mount path.

Signed-off-by: lucasliang <[email protected]>
…v#17885)

close tikv#17876, fix tikv#17876, close tikv#17877

cdc: skip loading old values for un-observed ranges

Signed-off-by: qupeng <[email protected]>

Co-authored-by: qupeng <[email protected]>
…ikv#17924)

close tikv#17701

add write batch limit for raft command batch

Signed-off-by: SpadeA-Tang <[email protected]>
Signed-off-by: SpadeA-Tang <[email protected]>

Co-authored-by: SpadeA-Tang <[email protected]>
Co-authored-by: SpadeA-Tang <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…ikv#17765) (tikv#17921)

close tikv#17383, close tikv#17760

Address the corner case where a read thread panics because it reads with a stale index from the `Memtable` in raft-engine after a background thread has already purged the stale logs.

Signed-off-by: lucasliang <[email protected]>

Co-authored-by: lucasliang <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
@ti-chi-bot ti-chi-bot bot added the size/XXL label Dec 5, 2024

ti-chi-bot bot commented Dec 5, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@CLAassistant

CLAassistant commented Dec 5, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 5 committers have signed the CLA.

✅ ti-chi-bot
✅ CalvinNeo
❌ hicqu
❌ LykxSassinator
❌ v01dstar
You have signed the CLA already but the status is still pending? Let us recheck it.

Signed-off-by: Calvin Neo <[email protected]>

ti-chi-bot bot commented Dec 5, 2024

@CalvinNeo: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test | 567bff8 | link | true | `/test pull-unit-test` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
