CASSANDRA-20519: Journal improvements #4058
Closed
Contributor
ifesdjeen
commented
Apr 7, 2025
- Improve Journal threading model (switch to a single maintenance executor in place of multiple executors)
- Improve signalling during segment allocation; avoid using wait queues for both segment allocation and flushes
- Get rid of callback queues and record pointers, and instead use Allocation for signalling flush completion
- Fix an issue with segment availability for read during switching
- Wire up failure callbacks for journal
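
The third and fifth points above can be illustrated with a minimal sketch. It is not the code from this patch: `Allocation`, `Segment`, `flushed`, and `flush(Throwable)` here are hypothetical stand-ins, and the sketch only assumes that each allocation carries its own completion signal, so that no separate callback queue is needed and a flush failure reaches every waiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FlushSignalSketch {
    // Hypothetical stand-in for a journal allocation: it owns its own completion signal.
    static final class Allocation {
        final int position;
        final CompletableFuture<Void> flushed = new CompletableFuture<>();
        Allocation(int position) { this.position = position; }
    }

    // Hypothetical segment: hands out allocations and completes them on flush.
    static final class Segment {
        private final List<Allocation> pending = new ArrayList<>();
        private int next = 0;

        synchronized Allocation allocate(int size) {
            Allocation a = new Allocation(next);
            next += size;
            pending.add(a);
            return a;
        }

        // Completes every allocation made so far; a non-null failure is propagated to each waiter.
        void flush(Throwable failure) {
            List<Allocation> batch;
            synchronized (this) {
                batch = new ArrayList<>(pending);
                pending.clear();
            }
            for (Allocation a : batch) {
                if (failure == null) a.flushed.complete(null);
                else a.flushed.completeExceptionally(failure);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // A single maintenance thread performs the flush (assumption for illustration).
        ExecutorService maintenance = Executors.newSingleThreadExecutor();
        Segment segment = new Segment();
        Allocation first = segment.allocate(16);
        Allocation second = segment.allocate(32);
        maintenance.submit(() -> segment.flush(null)).get(5, TimeUnit.SECONDS);
        first.flushed.get(5, TimeUnit.SECONDS); // the writer waits on its own allocation
        System.out.println(second.position + " " + second.flushed.isDone()); // prints "16 true"
        maintenance.shutdown();
    }
}
```

In this sketch a writer only ever waits on the object it was handed at allocation time, which is one way to avoid keeping a parallel queue of callbacks keyed by record pointer.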
patch by Blake Eggleston; reviewed by Benedict Elliott Smith, David Capwell for CASSANDRA-17103
Patch by Blake Eggleston; Reviewed by David Capwell & Benedict Elliott Smith for CASSANDRA-18004
Patch by Blake Eggleston; Reviewed by David Capwell and Benedict Elliott Smith for CASSANDRA-18192
…n merging to mainline patch by David Capwell; reviewed by Caleb Rackliffe for CASSANDRA-18309
…ce RandomSource patch by David Capwell; reviewed by Blake Eggleston for CASSANDRA-18213
patch by Jacek Lewandowski; reviewed by David Capwell and Caleb Rackliffe for CASSANDRA-18302
…cep-15-accord on cep-21-tcm
…ss when TransactionStatement is prepared patch by David Capwell; reviewed by Ariel Weisberg, Caleb Rackliffe for CASSANDRA-18337
…ID_RSP but got ACCORD_SIMPLE_RSP patch by David Capwell; reviewed by Caleb Rackliffe for CASSANDRA-18375
…ition with PreApply where reads and writes are interleaved, causing one of the coordinators to see the writes from the other patch by David Capwell; reviewed by Ariel Weisberg for CASSANDRA-18422
…o a Stage and run directly in the messaging handler patch by David Capwell; reviewed by Ariel Weisberg, Benedict Elliott Smith for CASSANDRA-18364
…rt tests to add custom logic patch by David Capwell; reviewed by Caleb Rackliffe for CASSANDRA-18485
in a durable log before processing by CommandStores patch by Aleksey Yeschenko; reviewed by David Capwell for CASSANDRA-18344
…tion history patch by Benedict; reviewed by Blake Eggleston for CASSANDRA-18523
- removing unnecessary calls to ServerTestUtils.daemonInitialization() in a handful of tests - minor cleanup in Verb and BTreeSet
patch by David Capwell; reviewed by Ariel Weisberg for CASSANDRA-18519
patch by Aleksey Yeschenko; reviewed by Benedict Elliott Smith for CASSANDRA-18561
patch by Aleksey Yeschenko; reviewed by Blake Eggleston for CASSANDRA-18563
Patch by Blake Eggleston and Benedict Elliott Smith; Reviewed by David Capwell for CASSANDRA-17101 CEP-15: Accord TCM integration Patch by Blake Eggleston; Reviewed by David Capwell for CASSANDRA-18444
…ctions that are known to be applied across the cluster) patch by Benedict Elliott Smith; reviewed by Ariel Weisberg, Aleksey Yeschenko, and David Capwell for CASSANDRA-18883 Co-authored-by: Benedict Elliott Smith <[email protected]> Co-authored-by: Ariel Weisberg <[email protected]> Co-authored-by: Aleksey Yeschenko <[email protected]> Co-authored-by: David Capwell <[email protected]>
…ing right away (apache#3575) patch by David Capwell; reviewed by Blake Eggleston for CASSANDRA-18764
apache/cassandra-accord#56 Patch by Ariel Weisberg; Reviewed by David Capwell for CASSANDRA-18779
…ure not to lose the partial deps (apache#3590) patch by David Capwell; reviewed by Aleksey Yeschenko for CASSANDRA-18783
Accord compaction purgers see random slices of Accord state during compaction (based on randomly selected compaction inputs). For at least the `durability` column in the `commands` table, the tombstone created when truncating was deleting the latest value, since we can get enough information to truncate without actually having the latest `durability` value. To fix this, we wait to emit a tombstone until we are erasing the entire command row when truncating or truncating with outcome; meanwhile, we drop the extra columns that are no longer needed instead of using a tombstone. We don't need to emit cell tombstones; we can drop the cells in the purger when processing each row. patch by Ariel Weisberg; reviewed by David Capwell for CASSANDRA-18795
…s nullable but C* serializer doesn't expect null
…s in TxnWrite, as they can simply be pulled from PartialTxn when needed in Write#apply() - Avoid serializing full TxnData instances to Accord state tables patch by Caleb Rackliffe; reviewed by David Capwell, Benedict Elliott Smith, and Ariel Weisberg for CASSANDRA-18355
patch by Aleksey Yeschenko; reviewed by Ariel Weisberg for CASSANDRA-18573
…, SeedDefiner, RunStartDefiner, and Config
…n validation logic missed the argument to String.format causing confusing errors patch by David Capwell; reviewed by Blake Eggleston for CASSANDRA-20286
Use initializeTopologyUnsafe to force-load the last topology and create shards rather than replaying all topologies. Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20294
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20297
patch by David Capwell; reviewed by Benedict Elliott Smith for CASSANDRA-20302
…ionTest#casOnAccordSimulationTest patch by David Capwell; reviewed by Ariel Weisberg for CASSANDRA-20322
…is preaccepted Also fix: - Topology slicing must declare whether we share/slice node ownership (to assist above) - CFK.visit removes transitive dependencies too eagerly across epoch change - apply cleanup to builder consistently, and construct the same value we would produce by purge (so that replay is idempotent) - Invoke ExecuteTxn.LocalExecute callbacks on originating CommandStore - misc other minor issues patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20325
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20316
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20358
Improve: - Introduce pre/accept fast execution flags - Introduce searchable Deps serialization - Flatten AccordRoutingKey(s) into single type, using sentinel bits - Introduce new fast byte-comparable serialization methods for Token and TableId to support above Fix: - Fix journal re-serialization logic - Enable RandomPartitioner for Accord by supporting fixed-width serialization for RouteIndex patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20349
- Only use persisted RedundantBefore for compaction - RouteIndex should index only touches, not Route - Flush RangesForEpoch updates to journal immediately, so we do not rely on the command we are processing succeeding - DurableBefore updates must wait for the epochs to be known locally - Shard.mustWitnessEpoch to support guaranteeing to witness relevant non-topology schema changes - We must propagate RedundantBefore RX shard bounds along with epoch syncs - Prevent a truncated transaction FetchData infinite loop - GC_BEFORE status being overwritten by bootstrappedAt, permitting old transaction state to be resurrected - Avoid CFK.maxUniqueHlc read race on bootstrap - TopologyManager.awaitEpoch could wait for wrong epoch - Journal fsync thread could miss notifications Also improve: - CommandStores uses SearchableRangeList for finding matching stores - Refactor RedundantBefore to use a sorted array of TxnId/RedundantStatus pairs (to better fix GC_BEFORE issue) - Accord debug keyspace operates on keyspace/table, and sorts correctly by token patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20361
- Bad ArrayBuffers recycling logic - RX must ensure dependencies TRANSITIVE_VISIBLE - Permit constructing "antiRange" that spans multiple prefixes - Not computing range CommandSummary IsDep correctly - Truncated commands that aren't shard durable could not repopulate CFK on replay, permitting recovery of another command to make an incorrect decision - NPE on async persist of RX (i.e. supplying no callback) - NPE in Builder.shouldCleanup when durability is null patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20370
…ding topology and other durability requirements. Also improve: - Introduce DurabilityService - Retire SyncPoint, replace Barrier with Write and RX - MessageType -> enum, restore GetMaxConflict - Standardise backoff logic with WaitStrategy - improve TimeoutStrategy/RetryStrategy specification strings - Forbid KX, remove directKeyDeps - Introduce UniqueTimeService, permitting hlc reservations for sync points avoid delay when min TxnId is sufficiently in the past - Remove ListStore custom purge logic Also fix: - RejectBefore should reject on both epoch and hlc - Do not record sync success for removed nodes - Support GlobalDurability detecting no command store to run on - Incorrect ballot constructor - Serializing 15-bit ballot flags incorrectly - TopologyManager.hasEpoch deadlock - Computing withOpenEpochs incorrectly, sometimes stopping one epoch short - PartitionKey serializer should not depend on schema information that can be erased patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20395
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20393.
patch by Benedict Elliott Smith; reviewed by David Capwell for CASSANDRA-20420
…a NPE patch by David Capwell; reviewed by Benedict Elliott Smith for CASSANDRA-20417
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20347.
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20424.
… required topology changes patch by David Capwell; reviewed by Benedict Elliott Smith for CASSANDRA-20426
- Decouple command serialization from TableMetadata version; introduce ColumnMetadata ids; gracefully handle missing TableId - DataInputPlus.readLeastSignificantBytes must truncate high bits - Fix RandomPartitioner accord serialization - Fast path stable commits must not override recovery propose/commit decisions regarding visibility of a transaction - RejectBefore must mergeMax, not merge, to ensure we maintain epoch and hlc increasing independently - Bad commitInvalidate decision - consistent filtering for touches and stillTouches - ensure TRUNCATE_BEFORE implies SHARD_APPLIED - TopologyManager.unsyncedOnly off-by-one error - DurabilityQueue should not retry SyncPointErased - handle rare case of no deps but none needed - not updating CFK synchronously on recovery, which can lead to erroneous recovery decisions for other transactions - Don't return partial read response when one commandStore rejects the commit - Filter touches/stillTouches consistently - WaitingState computeLowEpoch must use hasTouched to handle historic key with no route Improve: - Use format parameters to defer building Invariants.requireArgument string - streamline RedundantStatus/RedundantBefore
Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20114
…e changes without impacting C* messaging patch by David Capwell; reviewed by Aleksey Yeschenko, Alex Petrov for CASSANDRA-20403
- Accord Journal purging was disabled - remove unique_id from schema keyspace - avoid String.format in Compactor hot path - avoid string concatenation on hot path; improve segment compactor partition build efficiency - Partial compaction should update records in place to ensure truncation of discontiguous compactions do not lead to an incorrect field version being used - StoreParticipants.touches behaviour for RX was erroneously modified; should touch all non-redundant ranges including those no longer owned - SetShardDurable should correctly set DurableBefore Majority/Universal based on the Durability parameter - fix erroneous prunedBefore invariant - Journal compaction should not rewrite fields shadowed by a newer record - Don't save updates to ERASED commands - Simplify CommandChange.getFlags - fix handling of Durability for Invalidated - Don't use ApplyAt for GC_BEFORE with partial input, as might be a saveStatus >= ApplyAtKnown but with executeAt < ApplyAtKnown patch by Benedict; reviewed by Alex Petrov for CASSANDRA-20441
* Fix short accord simulation test (seed 0x6bea128ae851724b), ConcurrentModificationException * Increase wait time during closing to avoid Unterminated threads * Increase timeouts, improve test stability * More descriptive output from CQL test * Shorten max CMS delay * Improve future handling in config service Patch by Alex Petrov; reviewed by Benedict Elliott Smith for CASSANDRA-20440
* Improve Journal threading model (switch to a single maintenance executor in place of multiple executors) * Improve signalling during segment allocation; avoid using wait queues for both segment allocation and flushes * Get rid of callback queues and record pointers, and instead use Allocation for signalling flush completion * Fix an issue with segment availability for read during switching * Wire up failure callbacks for journal