Skip to content

logstore: use SingleDelete for raft log entries#169641

Merged
trunk-io[bot] merged 2 commits intocockroachdb:masterfrom
iskettaneh:single_delete_3
May 7, 2026
Merged

logstore: use SingleDelete for raft log entries#169641
trunk-io[bot] merged 2 commits intocockroachdb:masterfrom
iskettaneh:single_delete_3

Conversation

@iskettaneh
Copy link
Copy Markdown
Contributor

@iskettaneh iskettaneh commented May 3, 2026

Introduce a SingleClearUnversioned method on the storage Writer interface, and use it for clearing raft log entries when UseRaftLogSingleDelete is enabled. This applies to all raft log deletion paths: truncation (Compact), log
overwrites, and tail trimming in logAppend. Moreover, it gates the decision whether to use SingleDelete or regular point deletes on an env variable. The default use SingleDelete only if engines are separated.

The first commit is basically: 4839132 + a test

References: #8979

Release note: None

Co-Authored-By: Claude Opus 4.6 [email protected]

@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented May 3, 2026

😎 Merged directly without going through the merge queue, as the queue was empty and the PR was up to date with the target branch - details.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@iskettaneh iskettaneh force-pushed the single_delete_3 branch 2 times, most recently from 9ddac84 to 91ed565 Compare May 4, 2026 14:41
@iskettaneh iskettaneh force-pushed the single_delete_3 branch 2 times, most recently from c5f28af to be53c0d Compare May 4, 2026 15:57
@blathers-crl
Copy link
Copy Markdown

blathers-crl Bot commented May 4, 2026

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

@iskettaneh iskettaneh requested a review from pav-kv May 4, 2026 18:42
@iskettaneh iskettaneh marked this pull request as ready for review May 4, 2026 18:42
@iskettaneh iskettaneh requested review from a team as code owners May 4, 2026 18:42
@iskettaneh iskettaneh requested a review from RaduBerinde May 4, 2026 18:42
Copy link
Copy Markdown
Collaborator

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick pass on the first commit.

Comment thread pkg/kv/kvserver/kvstorage/wag_truncator.go
}
defer iter.Close()

ok, err := iter.SeekEngineKeyGE(storage.EngineKey{Key: start})
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively (potentially faster), we could just learn the first index, and then generate all keys. Relying on the fact that the log is populated densely at <= lastIndex.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! In the PR that unifies the raft log truncations, I am planning to have 2 main functions: First, is the function that we use when the raft log size is not known. Second, is the function that we use where the log size is known, and we can easily decide whether to use single delete or range deletes.

Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore_test.go
Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Copy link
Copy Markdown
Contributor Author

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iskettaneh made 5 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on pav-kv and RaduBerinde).

Comment thread pkg/kv/kvserver/kvstorage/wag_truncator.go
}
defer iter.Close()

ok, err := iter.SeekEngineKeyGE(storage.EngineKey{Key: start})
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! In the PR that unifies the raft log truncations, I am planning to have 2 main functions: First, is the function that we use when the raft log size is not known. Second, is the function that we use where the log size is known, and we can easily decide whether to use single delete or range deletes.

Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore_test.go
@iskettaneh iskettaneh requested a review from pav-kv May 5, 2026 17:26
Copy link
Copy Markdown
Collaborator

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor comments, one fix.

Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore.go
Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
const (
// raftLogSingleDeleteDefault defers the choice to engine separation.
raftLogSingleDeleteDefault raftLogSingleDeleteMode = iota
// raftLogSingleDeleteEnabled forces SingleDelete on. This is only to be used
Copy link
Copy Markdown
Collaborator

@pav-kv pav-kv May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: cut the boilerplate/fillers. E.g.

// raftLogSingleDeleteEnabled forces SingleDelete on. Only for tests.
...
// raftLogSingleDeleteDisabled forces SingleDelete off. Only for tests.

Comment thread pkg/kv/kvserver/logstore/logstore.go
Comment thread pkg/kv/kvserver/spanset/batch.go Outdated
Comment thread pkg/storage/pebble_batch.go
Comment thread pkg/kv/kvserver/logstore/logstore_test.go Outdated
Introduce a SingleClearUnversioned method on the storage Writer interface, and
use it for clearing raft log entries when UseRaftLogSingleDelete is enabled.
This applies to all raft log deletion paths: truncation (Compact), log
overwrites, and tail trimming in logAppend.

Epic: none
Release note: None

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@iskettaneh iskettaneh requested a review from pav-kv May 6, 2026 14:52
Copy link
Copy Markdown
Contributor Author

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iskettaneh made 10 comments and resolved 1 discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on pav-kv and RaduBerinde).

Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore.go
Comment thread pkg/kv/kvserver/logstore/logstore.go
Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore.go
Comment thread pkg/kv/kvserver/spanset/batch.go Outdated
Comment thread pkg/storage/pebble_batch.go
Comment thread pkg/kv/kvserver/logstore/logstore_test.go Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

AI Review: Potential Issue Detected

An inline comment has been added to pkg/kv/kvserver/logstore/logstore.go at lines 600-603 identifying an inconsistency in the Compact() function.

Summary: When numTruncatedEntries >= raftLogTruncationClearRangeThreshold, Compact() unconditionally uses ClearRawRange (range tombstone) even when SingleDelete mode is active (enginesSeparated=true). The else branch was correctly updated to use SingleClearUnversioned, but the if branch was not. This is inconsistent with wag_truncator.go:317-318 which explicitly states "range tombstones are incompatible with SingleDelete" and avoids them. The metamorphic constant can randomize the threshold down to 1 in tests, making this path easily triggerable.

View full analysis


If helpful: add O-AI-Review-Real-Issue-Found label.
If not helpful: add O-AI-Review-Not-Helpful label.

@github-actions github-actions Bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label May 6, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

Inline Note: pkg/kv/kvserver/logstore/logstore.go:600-607

(GitHub API prevented posting this as an inline comment)

Bug in Compact() function: The if branch (when numTruncatedEntries >= raftLogTruncationClearRangeThreshold) unconditionally uses ClearRawRange — a range tombstone — even when SingleDelete mode is active (enginesSeparated=true). The else branch (line 611-617) was correctly updated to check UseRaftLogSingleDelete(enginesSeparated) and use SingleClearUnversioned, but the if branch was missed.

As documented in wag_truncator.go:317-318: "range tombstones are incompatible with SingleDelete."

Fix: Check UseRaftLogSingleDelete(enginesSeparated) before line 600 and, when true, iterate through keys using SingleClearUnversioned (like clearRaftLogWithSingleDelete in wag_truncator.go:319-348) instead of issuing ClearRawRange. The metamorphic constant can randomize the threshold down to 1 in tests, making this path easily triggerable.

@iskettaneh
Copy link
Copy Markdown
Contributor Author

The AI review is pointing to a behaviour that we actually want. Even if SingleDelete is enabled, if we are truncating > PointDelThreshold, we use range clear instead.
The WAGTruncator behaviour will match that in a future PR.

Comment thread pkg/kv/kvserver/spanset/batch.go Outdated
Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment on lines +477 to +478
_, _, err = storage.MVCCDelete(ctx, rw, key,
hlc.Timestamp{}, opts)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this looks like a one-liner

// clearRaftLogWithSingleDelete clears raft log entries using SingleDelete for
// each point key. Unlike the regular truncation path, this always uses point
// deletions and never falls back to a range tombstone, because range tombstones
// are incompatible with SingleDelete.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Substitute "for simplicity" instead of "because ranges tombstones are incompatible with SingleDelete".

Claude misunderstood. Any deletions are fine, we only require that two Sets are always interleaved by some deletion.

I instructed it not to do range deletions for simplicity. Maybe TODO to use the same heuristic as in logstore.Compact.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! It will be fixed in the next PR. I added the TODO for now.

Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
key := keys.RaftLogKeyFromPrefix(raftLogPrefix, i)
var err error
if UseRaftLogSingleDelete {
// These tail entries have exactly one Set each — they were
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessarily that the entry has been written once, it could have been overwritten previously. But the same invariant holds: there is always a deletion between two puts.

			// SingleDelete is safe since there is always a deletion between two puts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment on lines +446 to +451
// When using SingleDelete for truncation, we must ensure each raft
// log key has exactly one Set. Overwriting via MVCCPut would create a
// second Set. Instead, cancel the old Set with a SingleDelete, then
// write the new entry with a blind put. SingleDelete is safe here
// because every prior write followed this same pattern, so each key
// has exactly one Set.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exactly one Set bit is a bit inaccurate. How about squashing this comment this way:

			// Overwriting an existing log entry. To make SingleDelete here and in
			// other places safe, maintain the invariant that there is always a
			// deletion between two puts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread pkg/storage/engine.go Outdated
// SingleClearUnversioned removes an unversioned item from the db using a
// Pebble SingleDelete. It has the same purpose as ClearUnversioned but uses
// SingleDelete which is more efficient when the caller can guarantee that
// the key has been Set exactly once since the last SingleDelete/Delete.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"last SingleDelete/Delete" -> "last deletion"

(any deletion should work, including range deletions)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

This commit gates the decision of using SingleDelete using
UseRaftLogSingleDelete() functions inside logstore package.
This function first consults the variable:
raftLogSingleDelete and unless it explicitly
sets it to enabled/disabled (in some tests), it returns true
if engines are separated.

Note: these changes can be cleaned up once Compact and
logAppend are members of logstore struct.

Release note: None

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Copy link
Copy Markdown
Contributor Author

@iskettaneh iskettaneh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iskettaneh made 5 comments and resolved 1 discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on pav-kv and RaduBerinde).

// clearRaftLogWithSingleDelete clears raft log entries using SingleDelete for
// each point key. Unlike the regular truncation path, this always uses point
// deletions and never falls back to a range tombstone, because range tombstones
// are incompatible with SingleDelete.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! It will be fixed in the next PR. I added the TODO for now.

Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
Comment on lines +446 to +451
// When using SingleDelete for truncation, we must ensure each raft
// log key has exactly one Set. Overwriting via MVCCPut would create a
// second Set. Instead, cancel the old Set with a SingleDelete, then
// write the new entry with a blind put. SingleDelete is safe here
// because every prior write followed this same pattern, so each key
// has exactly one Set.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread pkg/kv/kvserver/logstore/logstore.go Outdated
key := keys.RaftLogKeyFromPrefix(raftLogPrefix, i)
var err error
if UseRaftLogSingleDelete {
// These tail entries have exactly one Set each — they were
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread pkg/kv/kvserver/spanset/batch.go Outdated
Comment thread pkg/storage/engine.go Outdated
// SingleClearUnversioned removes an unversioned item from the db using a
// Pebble SingleDelete. It has the same purpose as ClearUnversioned but uses
// SingleDelete which is more efficient when the caller can guarantee that
// the key has been Set exactly once since the last SingleDelete/Delete.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@iskettaneh
Copy link
Copy Markdown
Contributor Author

TFTR!

/trunk merge

@trunk-io trunk-io Bot merged commit 9022a9f into cockroachdb:master May 7, 2026
29 checks passed
@cockroach-teamcity
Copy link
Copy Markdown
Member

🔴 Sysbench [SQL, 3node, oltp_read_write]
Metric Old Commit New Commit Delta Note
🔴 sec/op 10.48m ±3% 10.62m ±1% +1.37% p=0.000 n=15
allocs/op 8.157k ±0% 8.170k ±1% ~ p=0.783 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/a122ca4/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a122ca4f567d9db918429cd77c7209620d40b3b5/bin/pkg_sql_tests benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/88ba9be/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/88ba9beefa2e1b37e72ffedb0af24edaabb68cb4/bin/pkg_sql_tests benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=88ba9be --new=a122ca4 --memprofile ./pkg/sql/tests
⚪ Sysbench [KV, 3node, oltp_read_only]
Metric Old Commit New Commit Delta Note
sec/op 3.279m ±2% 3.274m ±2% ~ p=0.935 n=15
allocs/op 2.102k ±0% 2.102k ±0% ~ p=0.972 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/a122ca4/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a122ca4f567d9db918429cd77c7209620d40b3b5/bin/pkg_sql_tests benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/88ba9be/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/88ba9beefa2e1b37e72ffedb0af24edaabb68cb4/bin/pkg_sql_tests benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=88ba9be --new=a122ca4 --memprofile ./pkg/sql/tests
⚪ Sysbench [KV, 3node, oltp_write_only]
Metric Old Commit New Commit Delta Note
sec/op 3.023m ±5% 3.029m ±4% ~ p=0.902 n=15
allocs/op 4.207k ±0% 4.212k ±0% ~ p=0.290 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/a122ca4/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a122ca4f567d9db918429cd77c7209620d40b3b5/bin/pkg_sql_tests benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a122ca4/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/88ba9be/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/88ba9beefa2e1b37e72ffedb0af24edaabb68cb4/bin/pkg_sql_tests benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/88ba9be/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=88ba9be --new=a122ca4 --memprofile ./pkg/sql/tests
Artifacts

download:

mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/a122ca4f567d9db918429cd77c7209620d40b3b5/25498297850-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/88ba9beefa2e1b37e72ffedb0af24edaabb68cb4/25498297850-1/\* old/

built with commit: a122ca4f567d9db918429cd77c7209620d40b3b5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. target-release-26.3.0 X-perf-check Microbenchmarks CI: Added to a PR if a performance regression is detected and should be checked

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants