Skip to content

CASSANDRA-20519: Journal improvements #4058

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 228 commits into from

Conversation

ifesdjeen
Copy link
Contributor

  • Improve Journal threading model (switch to a single maintenance executer in place of multiple executors)
  • Improve signalling during segment allocation; avoid using wait queues for both segment allocation and flushes
  • Get rid of callback queues and record pointers, and instead use Allocation for signalling flush completion
  • Fix an issue with segment availability for read during switching
  • Wire up failure callbacks for journal

bdeggleston and others added 30 commits March 31, 2025 17:09
patch by Blake Eggleston; reviewed by Benedict Elliott Smith, David Capwell for CASSANDRA-17103
Patch by Blake Eggleston; Reviewed by David Capwell & Benedict Elliott Smith for CASSANDRA-18004
Patch by Blake Eggleston; Reviewed by David Capwell and Benedict Elliott Smith for Cassandra-18192
…n merging to mainline

patch by David Capwell; reviewed by Caleb Rackliffe for CASSANDRA-18309
…ce RandomSource

patch by David Capwell; reviewed by Blake Eggleston for CASSANDRA-18213
patch by Jacek Lewandowski; reviewed by David Capwell and Caleb Rackliffe for CASSANDRA-18302
…ss when TransactionStatement is prepared

patch by David Capwell; reviewed by Ariel Weisberg, Caleb Rackliffe for CASSANDRA-18337
…ID_RSP but got ACCORD_SIMPLE_RSP

patch by David Capwell; reviewed by Caleb Rackliffe for CASSANDRA-18375
…ition with PreApply where reads and writes are interleaved, causing one of the coordinators to see the writes from the other

patch by David Capwell; reviewed by Ariel Weisberg for CASSANDRA-18422
…o a Stage and run directly in the messageing handler

patch by David Capwell; reviewed by Ariel Weisberg, Benedict Elliott Smith for CASSANDRA-18364
…rt tests to add custom logic

patch by David Capwell; reviewed by Caleb Rackliffe for CASSANDRA-18485
in a durable log before processing by CommandStores

patch by Aleksey Yeschenko; reviewed by David Capwell for
CASSANDRA-18344
…tion history

patch by Benedict; reviewed by Blake Eggleston for CASSANDRA-18523
- removing unnecessary calls to ServerTestUtils.daemonInitialization() in a handful of tests
- minor cleanup in Verb and BTreeSet
patch by David Capwell; reviewed by Ariel Weisberg for CASSANDRA-18519
patch by Aleksey Yeschenko; reviewed by Benedic Elliott Smith for
CASSANDRA-18561
patch by Aleksey Yeschenko; reviewed by Blake Eggleston for
CASSANDRA-18563
Patch by Blake Eggleston and Benedict Elliott Smith; Reviewed by David
Capwell for CASSANDRA-17101

CEP-15: Accord TCM integration

Patch by Blake Eggleston; Reviewed by David Capwell for CASSANDRA-18444
…ctions that are known to be applied across the cluster)

patch by Benedict Elliott Smith; reviewed by Ariel Weisberg, Aleksey Yeschenko, and David Capwell for CASSANDRA-18883

Co-authored-by: Benedict Elliott Smith <[email protected]>
Co-authored-by: Ariel Weisberg <[email protected]>
Co-authored-by: Aleksey Yeschenko <[email protected]>
Co-authored-by: David Capwell <[email protected]>
…ing right away (apache#3575)

patch by David Capwell; reviewed by Blake Eggleston for CASSANDRA-18764
apache/cassandra-accord#56

Patch by Ariel Weisberg; Reviewed by David Capwell for CASSANDRA-18779
…ure not to loose the partial deps (apache#3590)

patch by David Capwell; reviewed by Aleksey Yeschenko for CASSANDRA-18783
Accord compaction purgers see random slices of Accord state during compaction (based on randomly selected compaction inputs).

For at least the `durability` column in the `commands` table the tombstone being created when truncating was deleting the latest value since we can get enough information to truncate without actuall yhaving the latest `durability` value.

To fix we can wait to emit a tombstone until we are erasing the entire command row when truncating or truncating with outcome and meanwhile we can drop the extra columns that are no longer needed instead of using a tombstone. We don't need to emit cell tombstones we can drop them from the purger when processing each row.

patch by Ariel Weisberg; reviewed by David Capwell for CASSANDRA-18795
…s nullable but C* serializer doesn't expect null
…s in TxnWrite, as they can simply be pulled from PartialTxn when needed in Write#apply()

- Avoid serializing full TxnData instances to Accord state tables

patch by Caleb Rackliffe; reviewed by David Capwell, Benedict Elliot Smith, and Ariel Weisberg for CASSANDRA-18355
patch by Aleksey Yeschenko; reviewed by Ariel Weisberg for CASSANDRA-18573
dcapwell and others added 25 commits April 2, 2025 14:37
…n validation logic missed the argument to String.format causing confusing errors

patch by David Capwell; reviewed by Blake Eggleston for CASSANDRA-20286
Use initializeTopologyUnsafe to for-load the last topology and create shards rather than replaying all topologies.

Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20294
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20297
patch by David Capwell; reviewed by Benedict Elliott Smith for CASSANDRA-20302
…ionTest#casOnAccordSimulationTest

patch by David Capwell; reviewed by Ariel Weisberg for CASSANDRA-20322
…is preaccepted

Also fix:
 - Topology slicing must declare whether we share/slice node ownership (to assist above)
 - CFK.visit removes transitive dependencies too eagerly across epoch change
 - apply cleanup to builder consistently, and construct the same value we would produce by purge (so that replay is idempotent)
 - Invoke ExecuteTxn.LocalExecute callbacks on originating CommandStore
 - misc other minor issues

patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20325
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20316
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20358
Improve:
 - Introduce pre/accept fast execution flags
 - Introduce searchable Deps serialization
 - Flatten AccordRoutingKey(s) into single type, using sentinel bits
 - Introduce new fast byte-comparable serialization methods for Token and TableId to support above
Fix:
 - Fix journal re-serialization logic
 - Enable RandomPartitioner for Accord by supporting fixed-width serialization for RouteIndex

patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20349
 - Only use persisted RedundantBefore for compaction
 - RouteIndex should index only touches, not Route
 - Flush RangesForEpoch updates to journal immediately, so we do not rely on the command we are processing succeeding
 - DurableBefore updates must wait for the epochs to be known locally
 - Shard.mustWitnessEpoch to support guaranteeing to witness relevant non-topology schema changes
 - We must propagate RedundantBefore RX shard bounds along with epoch syncs
 - Prevent a truncated transaction FetchData infinite loop
 - GC_BEFORE status being overwritten by bootstrappedAt, permitting old transaction state to be resurrected
 - Avoid CFK.maxUniqueHlc read race on bootstrap
 - TopologyManager.awaitEpoch could wait for wrong epoch
 - Journal fsync thread could miss notifications
Also improve:
 - CommandStores uses SearchableRangeList for finding matching stores
 - Refactor RedundantBefore to use a sorted array of TxnId/RedundantStatus pairs (to better fix GC_BEFORE issue)
 - Accord debug keyspace operates on keyspace/table, and sorts correctly by token

patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20361
 - Bad ArrayBuffers recycling logic
 - RX must ensure dependencies TRANSITIVE_VISIBLE
 - Permit constructing "antiRange" that spans multiple prefixes
 - Not computing range CommandSummary IsDep correctly
 - Truncated commands that aren't shard durable could not repopulate CFK on replay, permitting recovery of another command to make an incorrect decision
 - NPE on async persist of RX (i.e. supplying no callback)
 - NPE in Builder.shouldCleanup when durability is null

patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20370
…ding topology and other durability requirements.

Also improve:
 - Introduce DurabilityService
 - Retire SyncPoint, replace Barrier with Write and RX
 - MessageType -> enum, restore GetMaxConflict
 - Standardise backoff logic with WaitStrategy
 - improve TimeoutStrategy/RetryStrategy specification strings
 - Forbid KX, remove directKeyDeps
 - Introduce UniqueTimeService, permitting hlc reservations for sync points avoid delay when min TxnId is sufficiently in the past
 - Remove ListStore custom purge logic
Also fix:
 - RejectBefore should reject on both epoch and hlc
 - Do not record sync success for removed nodes
 - Support GlobalDurability detecting no command store to run on
 - Incorrect ballot constructor
 - Serializing 15-bit ballot flags incorrectly
 - TopologyManager.hasEpoch deadlock
 - Computing withOpenEpochs incorrectly, sometimes stopping one epoch short
 - PartitionKey serializer should not depend on schema information that can be erased

patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20395
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20393.
patch by Benedict Elliott Smith; reviewed by David Capwell for CASSANDRA-20420
…a NPE

patch by David Capwell; reviewed by Benedict Elliott Smith for CASSANDRA-20417
Patch by Alex Petrov; reviewed by Benedict Elliott Smith CASSANDRA-20347.
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20424.
… required topology changes

patch by David Capwell; reviewed by Benedict Elliott Smith for CASSANDRA-20426
 - Decouple command serialization from TableMetadata version; introduce ColumnMetadata ids; gracefully handle missing TableId
 - DataInputPlus.readLeastSignificantBytes must truncate high bits
 - Fix RandomPartitioner accord serialization
 - Fast path stable commits must not override recovery propose/commit decisions regarding visibility of a transaction
 - RejectBefore must mergeMax, not merge, to ensure we maintain epoch and hlc increasing independently
 - Bad commitInvalidate decision
 - consistent filtering for touches and stillTouches
 - ensure TRUNCATE_BEFORE implies SHARD_APPLIED
 - TopologyManager.unsyncedOnly off-by-one error
 - DurabilityQueue should not retry SyncPointErased
 - handle rare case of no deps but none needed
 - not updating CFK synchronously on recovery, which can lead to erroneous recovery decisions for other transactions
 - Don't return partial read response when one commandStore rejects the commit
 - Filter touches/stillTouches consistently
 - WaitingState computeLowEpoch must use hasTouched to handle historic key with no route
Improve:
 - Use format parameters to defer building Invariants.requireArgument string
 - streamline RedundantStatus/RedundantBefore
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20114
…e changes without impacting C* messaging

patch by David Capwell; reviewed by Aleksey Yeschenko, Alex Petrov for CASSANDRA-20403
 - Accord Journal purging was disabled
 - remove unique_id from schema keyspace
 - avoid String.format in Compactor hot path
 - avoid string concatenation on hot path; improve segment compactor partition build efficiency
 - Partial compaction should update records in place to ensure truncation of discontiguous compactions do not lead to an incorrect field version being used
 - StoreParticipants.touches behaviour for RX was erroneously modified; should touch all non-redundant ranges including those no longer owned
 - SetShardDurable should correctly set DurableBefore Majority/Universal based on the Durability parameter
 - fix erroneous prunedBefore invariant
 - Journal compaction should not rewrite fields shadowed by a newer record
 - Don't save updates to ERASED commands
 - Simplify CommandChange.getFlags
 - fix handling of Durability for Invalidated
 - Don't use ApplyAt for GC_BEFORE with partial input, as might be a saveStatus >= ApplyAtKnown but with executeAt < ApplyAtKnown

patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20441
  * Fix short accord simulation test (seed 0x6bea128ae851724b), ConcurrentModificationException
  * Increase wait time during closing to avoid Unterminated threads
  * Increase timeouts, improve test stability
  * More descriptive output from CQL test
  * Shorten max CMS delay
  * Improve future handling in config service

Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20440
  * Improve Journal threading model (switch to a single maintenance executer in place of multiple executors)
  * Improve signalling during segment allocation; avoid using wait queues for both segment allocation and flushes
  * Get rid of callback queues and record pointers, and instead use Allocation for signalling flush completion
  * Fix an issue with segment availability for read during switching
  * Wire up failure callbacks for journal
@ifesdjeen ifesdjeen changed the title Journal improvements CASSANDRA-20519: Journal improvements Apr 7, 2025
@iamaleksey iamaleksey self-requested a review April 11, 2025 13:00
@maedhroz maedhroz deleted the branch apache:cep-15-accord May 13, 2025 01:43
@maedhroz maedhroz closed this May 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants