logstore: use SingleDelete for raft log entries#169641
trunk-io[bot] merged 2 commits into cockroachdb:master
😎 Merged directly without going through the merge queue, as the queue was empty and the PR was up to date with the target branch.
Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs.
pav-kv
left a comment
Quick pass on the first commit.
}
defer iter.Close()

ok, err := iter.SeekEngineKeyGE(storage.EngineKey{Key: start})
Alternatively (and potentially faster), we could just learn the first index and then generate all the keys, relying on the fact that the log is populated densely at indices <= lastIndex.
Great! In the PR that unifies the raft log truncations, I am planning to have two main functions: one that we use when the raft log size is not known, and one that we use when the log size is known, where we can easily decide whether to use SingleDelete or range deletes.
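To make the suggestion above concrete, here is a minimal sketch of generating the keys directly instead of iterating over the engine. It assumes the log is dense between a known first and last index. `logKey` and `denseLogKeys` are hypothetical names; `logKey` only stands in for `keys.RaftLogKeyFromPrefix` and does not reproduce its real encoding.

```go
package main

import "fmt"

// logKey is a hypothetical stand-in for keys.RaftLogKeyFromPrefix.
// The real key encoding is different; this one is only for illustration.
func logKey(prefix string, index uint64) string {
	return fmt.Sprintf("%s%016x", prefix, index)
}

// denseLogKeys returns one key per index in [first, last]. It assumes every
// index in that interval holds an entry, so no iterator is needed.
func denseLogKeys(prefix string, first, last uint64) []string {
	if first > last {
		return nil
	}
	out := make([]string, 0, last-first+1)
	for i := first; ; i++ {
		out = append(out, logKey(prefix, i))
		if i == last { // explicit break avoids overflow when last is MaxUint64
			break
		}
	}
	return out
}

func main() {
	for _, k := range denseLogKeys("raftlog/", 5, 7) {
		fmt.Println(k)
	}
}
```

The trade-off is that a gap in the log would produce deletions for keys that do not exist, so this only fits callers that can rely on density.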
pav-kv
left a comment
Mostly minor comments, one fix.
const (
// raftLogSingleDeleteDefault defers the choice to engine separation.
raftLogSingleDeleteDefault raftLogSingleDeleteMode = iota
// raftLogSingleDeleteEnabled forces SingleDelete on. This is only to be used
nit: cut the boilerplate/fillers. E.g.
// raftLogSingleDeleteEnabled forces SingleDelete on. Only for tests.
...
// raftLogSingleDeleteDisabled forces SingleDelete off. Only for tests.

Introduce a SingleClearUnversioned method on the storage Writer interface, and use it for clearing raft log entries when UseRaftLogSingleDelete is enabled. This applies to all raft log deletion paths: truncation (Compact), log overwrites, and tail trimming in logAppend.

Epic: none
Release note: None
Co-Authored-By: Claude Opus 4.6 <[email protected]>
AI Review: Potential Issue Detected
An inline comment has been added to [...]. Summary: When [...]. If helpful: add [...].
Inline Note:
The AI review is pointing to behaviour that we actually want. Even if SingleDelete is enabled, if we are truncating more than PointDelThreshold, we use a range clear instead.
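A minimal sketch of the decision described in the note above. It assumes the threshold is compared against the number of entries being truncated; `chooseClearMethod`, `clearMethod`, and the parameter names are hypothetical and not the identifiers used in the PR.

```go
package main

import "fmt"

type clearMethod int

const (
	pointDelete clearMethod = iota
	singleDelete
	rangeClear
)

// chooseClearMethod picks how to clear a truncated span of the raft log.
// Assumption: large truncations use a range clear regardless of whether
// SingleDelete is enabled, which is the behaviour the note calls intended.
func chooseClearMethod(numEntries, pointDelThreshold uint64, useSingleDelete bool) clearMethod {
	if numEntries > pointDelThreshold {
		return rangeClear
	}
	if useSingleDelete {
		return singleDelete
	}
	return pointDelete
}

func main() {
	fmt.Println(chooseClearMethod(10, 64, true) == singleDelete)
	fmt.Println(chooseClearMethod(100, 64, true) == rangeClear)
}
```

Mixing the two is fine for correctness, since a range clear is also a deletion that separates any earlier Set from a later one.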
_, _, err = storage.MVCCDelete(ctx, rw, key,
hlc.Timestamp{}, opts)
nit: this looks like a one-liner
// clearRaftLogWithSingleDelete clears raft log entries using SingleDelete for
// each point key. Unlike the regular truncation path, this always uses point
// deletions and never falls back to a range tombstone, because range tombstones
// are incompatible with SingleDelete.
Substitute "for simplicity" for "because range tombstones are incompatible with SingleDelete".
Claude misunderstood. Any deletion is fine; we only require that two Sets are always separated by some deletion.
I instructed it not to do range deletions for simplicity. Maybe add a TODO to use the same heuristic as in logstore.Compact.
Yes! It will be fixed in the next PR. I added the TODO for now.
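The invariant from this thread (two Sets on a key are always separated by some deletion, of any kind) can be sketched as a check over a history of operations. This is a hypothetical test helper, not code from the PR; the `op` type and `firstViolation` are made-up names, and keys are modelled as raft log indexes.

```go
package main

import "fmt"

type opKind int

const (
	opSet opKind = iota
	opDelete
	opSingleDelete
	opRangeDelete
)

type op struct {
	kind opKind
	key  uint64 // raft log index standing in for the key (start key for a range delete)
	end  uint64 // exclusive end index, used only by opRangeDelete
}

// firstViolation returns the position of the first Set that follows another
// Set on the same key with no deletion in between, or -1 if there is none.
func firstViolation(history []op) int {
	live := map[uint64]bool{} // keys whose most recent write is a Set
	for i, o := range history {
		switch o.kind {
		case opSet:
			if live[o.key] {
				return i
			}
			live[o.key] = true
		case opDelete, opSingleDelete:
			delete(live, o.key)
		case opRangeDelete:
			for k := range live {
				if k >= o.key && k < o.end {
					delete(live, k)
				}
			}
		}
	}
	return -1
}

func main() {
	ok := []op{{opSet, 1, 0}, {opRangeDelete, 0, 10}, {opSet, 1, 0}}
	bad := []op{{opSet, 1, 0}, {opSet, 1, 0}}
	fmt.Println(firstViolation(ok), firstViolation(bad))
}
```

Note that the range delete counts as a separating deletion just like the point deletions do, which is the point of the correction above.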
key := keys.RaftLogKeyFromPrefix(raftLogPrefix, i)
var err error
if UseRaftLogSingleDelete {
// These tail entries have exactly one Set each - they were
The entry has not necessarily been written only once; it could have been overwritten previously. But the same invariant holds: there is always a deletion between two puts.
// SingleDelete is safe since there is always a deletion between two puts.
// When using SingleDelete for truncation, we must ensure each raft
// log key has exactly one Set. Overwriting via MVCCPut would create a
// second Set. Instead, cancel the old Set with a SingleDelete, then
// write the new entry with a blind put. SingleDelete is safe here
// because every prior write followed this same pattern, so each key
// has exactly one Set.
The "exactly one Set" bit is a bit inaccurate. How about squashing this comment this way:
// Overwriting an existing log entry. To make SingleDelete here and in
// other places safe, maintain the invariant that there is always a
// deletion between two puts.

// SingleClearUnversioned removes an unversioned item from the db using a
// Pebble SingleDelete. It has the same purpose as ClearUnversioned but uses
// SingleDelete which is more efficient when the caller can guarantee that
// the key has been Set exactly once since the last SingleDelete/Delete.
"last SingleDelete/Delete" -> "last deletion"
(any deletion should work, including range deletions)
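To illustrate why the precondition matters, here is a deliberately simplified in-memory model. It is not Pebble: in the real engine the outcome of a SingleDelete over a key that was Set twice depends on compaction timing, while this toy always lets the older value resurface. `toyEngine` and its methods are hypothetical names for this sketch only.

```go
package main

import "fmt"

// toyEngine keeps, per key, the stack of Set values not yet cancelled by a
// deletion, newest last. This is a simplified model, not Pebble's behaviour.
type toyEngine struct {
	sets map[string][]string
}

func newToyEngine() *toyEngine { return &toyEngine{sets: map[string][]string{}} }

func (e *toyEngine) Set(key, value string) {
	e.sets[key] = append(e.sets[key], value)
}

// SingleDelete cancels only the newest Set for the key.
func (e *toyEngine) SingleDelete(key string) {
	if s := e.sets[key]; len(s) > 0 {
		e.sets[key] = s[:len(s)-1]
	}
}

// Get returns the newest surviving value, if any.
func (e *toyEngine) Get(key string) (string, bool) {
	s := e.sets[key]
	if len(s) == 0 {
		return "", false
	}
	return s[len(s)-1], true
}

// Overwrite follows the pattern from the review: cancel the old Set with a
// SingleDelete, then write the new value with a blind put.
func (e *toyEngine) Overwrite(key, value string) {
	e.SingleDelete(key)
	e.Set(key, value)
}

func main() {
	safe := newToyEngine()
	safe.Set("k", "v1")
	safe.Overwrite("k", "v2")
	safe.SingleDelete("k")
	_, found := safe.Get("k")
	fmt.Println("safe pattern, key still visible:", found)

	unsafe := newToyEngine()
	unsafe.Set("k", "v1")
	unsafe.Set("k", "v2") // second Set with no deletion in between
	unsafe.SingleDelete("k")
	v, found := unsafe.Get("k")
	fmt.Println("unsafe pattern, key still visible:", found, v)
}
```

In this model the safe pattern leaves the key gone after the final SingleDelete, while the double Set leaves the older value visible again.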
This commit gates the decision to use SingleDelete behind the UseRaftLogSingleDelete() function in the logstore package. The function first consults the raftLogSingleDelete variable; unless that variable is explicitly set to enabled or disabled (as in some tests), it returns true if the engines are separated.

Note: these changes can be cleaned up once Compact and logAppend are members of the logstore struct.

Release note: None
Co-Authored-By: Claude Opus 4.6 <[email protected]>
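A minimal sketch of the gating described in this commit message. The mode type and constant names are taken from the snippet quoted earlier in the review; the function name `useRaftLogSingleDelete`, its parameters, and the order of the Enabled/Disabled constants are assumptions, since the real UseRaftLogSingleDelete() reads package state rather than taking arguments.

```go
package main

import "fmt"

type raftLogSingleDeleteMode int

const (
	// raftLogSingleDeleteDefault defers the choice to engine separation.
	raftLogSingleDeleteDefault raftLogSingleDeleteMode = iota
	// raftLogSingleDeleteEnabled forces SingleDelete on. Only for tests.
	raftLogSingleDeleteEnabled
	// raftLogSingleDeleteDisabled forces SingleDelete off. Only for tests.
	raftLogSingleDeleteDisabled
)

// useRaftLogSingleDelete is a hypothetical, parameterized version of the
// gate: an explicit override wins, otherwise engine separation decides.
func useRaftLogSingleDelete(mode raftLogSingleDeleteMode, enginesSeparated bool) bool {
	switch mode {
	case raftLogSingleDeleteEnabled:
		return true
	case raftLogSingleDeleteDisabled:
		return false
	default:
		return enginesSeparated
	}
}

func main() {
	fmt.Println(useRaftLogSingleDelete(raftLogSingleDeleteDefault, true))
	fmt.Println(useRaftLogSingleDelete(raftLogSingleDeleteDisabled, true))
}
```

Keeping the default as the zero value means an unset variable falls through to the engine-separation check.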
TFTR! /trunk merge
🔴 Sysbench [SQL, 3node, oltp_read_write]
Reproduce
benchdiff binaries:
mkdir -p benchdiff/a122ca4/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a122ca4f567d9db918429cd77c7209620d40b3b5/bin/pkg_sql_tests benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/88ba9be/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/88ba9beefa2e1b37e72ffedb0af24edaabb68cb4/bin/pkg_sql_tests benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
benchdiff command:
# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=88ba9be --new=a122ca4 --memprofile ./pkg/sql/tests

⚪ Sysbench [KV, 3node, oltp_read_only]
Reproduce
benchdiff binaries:
mkdir -p benchdiff/a122ca4/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a122ca4f567d9db918429cd77c7209620d40b3b5/bin/pkg_sql_tests benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/88ba9be/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/88ba9beefa2e1b37e72ffedb0af24edaabb68cb4/bin/pkg_sql_tests benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
benchdiff command:
# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=88ba9be --new=a122ca4 --memprofile ./pkg/sql/tests

⚪ Sysbench [KV, 3node, oltp_write_only]
Reproduce
benchdiff binaries:
mkdir -p benchdiff/a122ca4/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a122ca4f567d9db918429cd77c7209620d40b3b5/bin/pkg_sql_tests benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/88ba9be/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/88ba9beefa2e1b37e72ffedb0af24edaabb68cb4/bin/pkg_sql_tests benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
benchdiff command:
# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=88ba9be --new=a122ca4 --memprofile ./pkg/sql/tests

Artifacts
download:
mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/a122ca4f567d9db918429cd77c7209620d40b3b5/25498297850-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/88ba9beefa2e1b37e72ffedb0af24edaabb68cb4/25498297850-1/\* old/
built with commit: a122ca4f567d9db918429cd77c7209620d40b3b5
Introduce a SingleClearUnversioned method on the storage Writer interface, and use it for clearing raft log entries when UseRaftLogSingleDelete is enabled. This applies to all raft log deletion paths: truncation (Compact), log overwrites, and tail trimming in logAppend. Moreover, it gates the choice between SingleDelete and regular point deletes on an env variable. The default is to use SingleDelete only if the engines are separated.
The first commit is basically: 4839132 + a test
References: #8979
Release note: None
Co-Authored-By: Claude Opus 4.6 [email protected]