Commit c669c8f

Parallel ledger close (#4543)
Resolves #4317. Concludes #4128.

The implementation of this proposal requires massive changes to the stellar-core codebase and touches almost every subsystem. There are some paradigm shifts in how the program executes, which I will discuss below for posterity. The same ideas are reflected in code comments as well, as they'll be important for code maintenance and extensibility.

## Database access

Currently, only the Postgres DB backend is supported, as it required minimal changes to how DB queries are structured (Postgres provides a fairly nice concurrency model). SQLite concurrency support is a lot more rudimentary: only a single writer is allowed, and the whole database is locked during writing. This necessitates further changes in core (such as splitting the database into two). Given that most network infrastructure is on Postgres right now, SQLite support can be added later.

### Reduced responsibilities of SQL

SQL tables have been trimmed as much as possible to avoid conflicts; essentially, we only store persistent state such as the latest LCL and SCP history, as well as the legacy OFFER table.

## Asynchronous externalize flow

There are three important subsystems in core that are in charge of tracking consensus, externalizing and applying ledgers, and advancing the state machine to catchup or synced state:

- Herder: receives SCP messages, forwards them to SCP, decides if a ledger is externalized, and triggers voting for the next ledger.
- LedgerManager: implements closing of a ledger, sets catchup vs. synced state, and advances and persists the last closed ledger.
- CatchupManager: keeps track of any externalized ledgers that are not LCL+1. That is, it tracks future externalizing ledgers, attempts to apply them to keep core in sync, and triggers catchup if needed.

Prior to this change, externalize handling followed two different paths:

- If core received LCL+1, it applied it immediately, meaning the sequence externalize → closeLedger → set "synced" state happened in one synchronous function. After application, core triggers the next ledger, usually asynchronously, as it needs to wait to meet the 5s ledger time requirement.
- If core received ledger LCL+2..LCL+N, it buffered it asynchronously and continued buffering new ledgers. If core could not close the gap and apply everything sequentially, it would go into the catchup flow.

With the new changes, triggering ledger close moved entirely to CatchupManager. Essentially, CatchupManager::processLedger became the centralized place to decide whether to apply a ledger or trigger catchup. Because ledger close happens in the background, the transition between externalize and "closeLedger → set synced" becomes asynchronous.

## Concurrent ledger close

Below is a list of core items that moved to the background, each followed by an explanation of why it is safe to do so.

### Emitting meta

Ledger application is the only process that touches the meta pipe, so there are no conflicts with other subsystems.

### Writing checkpoint files

Only the background thread writes in-progress checkpoint files. The main thread deals exclusively with "complete" checkpoints, which, once complete, must not be touched by any subsystem except publishing.

### Updating ledger state

The rest of the system operates strictly on read-only BucketList snapshots and is unaffected by changing state. Note: there are still some calls to LedgerTxn in the codebase, but those appear only on startup during setup (when the node is not operational) or in offline commands.
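To make the snapshot discipline concrete, here is a minimal sketch of the pattern, using hypothetical names (`LedgerSnapshot`, `SnapshotHolder`) rather than the actual stellar-core types: the apply thread publishes an immutable snapshot after each close, and any reader copies the current pointer, so it may observe a slightly stale but always internally consistent view.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>

// Immutable once published: readers never observe partially-applied state.
// Hypothetical type for illustration, not the stellar-core class.
struct LedgerSnapshot
{
    uint32_t ledgerSeq;
    uint64_t closeTime;
};

class SnapshotHolder
{
    mutable std::mutex mMutex;
    std::shared_ptr<LedgerSnapshot const> mCurrent;

  public:
    // Apply thread: publish a fresh snapshot after closing a ledger.
    void
    publish(std::shared_ptr<LedgerSnapshot const> snap)
    {
        std::lock_guard<std::mutex> guard(mMutex);
        mCurrent = std::move(snap);
    }

    // Any reader (e.g. the main thread): the result may be slightly stale,
    // but it is always a complete, consistent ledger state.
    std::shared_ptr<LedgerSnapshot const>
    current() const
    {
        std::lock_guard<std::mutex> guard(mMutex);
        return mCurrent;
    }
};
```

Because readers hold a `shared_ptr` to an immutable object, the apply thread can publish a new snapshot at any time without invalidating views already in use.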
### Incrementing current LCL

Because ledger close moved to the background, guarantees about ledger state and its staleness are now different. Previously, ledger state queried by subsystems outside of apply was always up-to-date. With this change, the snapshot used by the main thread may become slightly stale (if the background thread has just closed a new ledger, but the main thread hasn't refreshed its snapshot yet). There are different use cases of the main thread's ledger state, which must be treated with caution and evaluated individually:

- When it is safe: in cases where LCL is used more like a heuristic or an approximation, and program correctness does not depend on its exact state. Example: post-externalize cleanup of the transaction queue. We load LCL's close time to purge invalid transactions from the queue. This is safe because even if LCL is updated while we do this, the queue remains in a consistent state. In fact, anything in the transaction queue is essentially an approximation, so a slightly stale snapshot is safe to use.
- When it is not safe: when the latest ledger state is critical, such as voting in SCP or validating blocks. To avoid any unnecessary headaches, we introduce a new invariant: "applying" is a new state in the state machine, which does not allow voting or triggering the next ledger. Core must first complete applying to be able to vote on the "latest state". In the meantime, if ledgers arrive while applying, we treat them like "future ledgers" and apply the same procedures in Herder that we do today (don't perform validation checks, don't vote on them, and buffer them in a separate queue). The state machine remains on the main thread _only_, which ensures SCP can execute safely as long as the state transitions are correct (for example, a block production function can safely grab LCL at the beginning of the function without worrying that it might change in the background).

### Reflecting state change in the BucketList

Closing a ledger is the only place in the code that updates the BucketList; other subsystems may only read it. An example is garbage collection, which queries the latest BucketList state to decide which buckets to delete. These reads are protected with a mutex (the same LCL mutex used in LedgerManager, as the BucketList is conceptually a part of LCL as well), as sketched below.
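As an illustration of that locking scheme, here is a minimal sketch with hypothetical names (`LedgerState`, `mLclMutex`, and a `std::set<std::string>` standing in for the BucketList's referenced buckets); it is not the actual LedgerManager code.

```cpp
#include <mutex>
#include <set>
#include <string>
#include <utility>

// One mutex guards both LCL and the BucketList, since the BucketList is
// conceptually part of LCL.
class LedgerState
{
    std::mutex mLclMutex;
    std::set<std::string> mReferencedBuckets; // stand-in for BucketList refs

  public:
    // Apply (background) thread: the only writer.
    void
    closeLedger(std::set<std::string> newBuckets)
    {
        std::lock_guard<std::mutex> guard(mLclMutex);
        mReferencedBuckets = std::move(newBuckets);
    }

    // Main thread (e.g. garbage collection): read a consistent copy under
    // the same lock, then delete any bucket files not in the returned set.
    std::set<std::string>
    referencedBuckets()
    {
        std::lock_guard<std::mutex> guard(mLclMutex);
        return mReferencedBuckets;
    }
};
```

Copying the set under the lock keeps the critical section short; the actual deletion of unreferenced files can then proceed without holding the mutex.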
2 parents d89edf1 + e417314 commit c669c8f

108 files changed: +2039 -1371 lines


src/bucket/BucketListBase.cpp (-2)

@@ -57,7 +57,6 @@ template <typename BucketT>
 void
 BucketLevel<BucketT>::setNext(FutureBucket<BucketT> const& fb)
 {
-    releaseAssert(threadIsMain());
     mNextCurr = fb;
 }

@@ -79,7 +78,6 @@ template <typename BucketT>
 void
 BucketLevel<BucketT>::setCurr(std::shared_ptr<BucketT> b)
 {
-    releaseAssert(threadIsMain());
     mNextCurr.clear();
     mCurr = b;
 }

src/bucket/BucketListSnapshotBase.cpp (-2)

@@ -19,8 +19,6 @@ BucketListSnapshot<BucketT>::BucketListSnapshot(
     BucketListBase<BucketT> const& bl, LedgerHeader header)
     : mHeader(std::move(header))
 {
-    releaseAssert(threadIsMain());
-
     for (uint32_t i = 0; i < BucketListBase<BucketT>::kNumLevels; ++i)
     {
         auto const& level = bl.getLevel(i);

src/bucket/BucketManager.cpp (+38 -39)

@@ -18,6 +18,7 @@
 #include "ledger/LedgerManager.h"
 #include "ledger/LedgerTxn.h"
 #include "ledger/LedgerTypeUtils.h"
+#include "ledger/NetworkConfig.h"
 #include "main/Application.h"
 #include "main/Config.h"
 #include "util/Fs.h"

@@ -62,6 +63,7 @@ void
 BucketManager::initialize()
 {
     ZoneScoped;
+    releaseAssert(threadIsMain());
     std::string d = mConfig.BUCKET_DIR_PATH;

     if (!fs::exists(d))

@@ -729,7 +731,7 @@ BucketManager::getBucketListReferencedBuckets() const
 }

 std::set<Hash>
-BucketManager::getAllReferencedBuckets() const
+BucketManager::getAllReferencedBuckets(HistoryArchiveState const& has) const
 {
     ZoneScoped;
     auto referenced = getBucketListReferencedBuckets();

@@ -740,8 +742,7 @@ BucketManager::getAllReferencedBuckets() const

     // retain any bucket referenced by the last closed ledger as recorded in the
     // database (as merges complete, the bucket list drifts from that state)
-    auto lclHas = mApp.getLedgerManager().getLastClosedLedgerHAS();
-    auto lclBuckets = lclHas.allBuckets();
+    auto lclBuckets = has.allBuckets();
     for (auto const& h : lclBuckets)
     {
         auto rit = referenced.emplace(hexToBin256(h));

@@ -752,39 +753,38 @@
     }

     // retain buckets that are referenced by a state in the publish queue.
-    auto pub = mApp.getHistoryManager().getBucketsReferencedByPublishQueue();
+    for (auto const& h :
+         HistoryManager::getBucketsReferencedByPublishQueue(mApp.getConfig()))
     {
-        for (auto const& h : pub)
+        auto rhash = hexToBin256(h);
+        auto rit = referenced.emplace(rhash);
+        if (rit.second)
         {
-            auto rhash = hexToBin256(h);
-            auto rit = referenced.emplace(rhash);
-            if (rit.second)
-            {
-                CLOG_TRACE(Bucket, "{} referenced by publish queue", h);
-
-                // Project referenced bucket `rhash` -- which might be a merge
-                // input captured before a merge finished -- through our weak
-                // map of merge input/output relationships, to find any outputs
-                // we'll want to retain in order to resynthesize the merge in
-                // the future, rather than re-run it.
-                mFinishedMerges.getOutputsUsingInput(rhash, referenced);
-            }
+            CLOG_TRACE(Bucket, "{} referenced by publish queue", h);
+
+            // Project referenced bucket `rhash` -- which might be a merge
+            // input captured before a merge finished -- through our weak
+            // map of merge input/output relationships, to find any outputs
+            // we'll want to retain in order to resynthesize the merge in
+            // the future, rather than re-run it.
+            mFinishedMerges.getOutputsUsingInput(rhash, referenced);
         }
     }
     return referenced;
 }

 void
-BucketManager::cleanupStaleFiles()
+BucketManager::cleanupStaleFiles(HistoryArchiveState const& has)
 {
     ZoneScoped;
+    releaseAssert(threadIsMain());
     if (mConfig.DISABLE_BUCKET_GC)
     {
         return;
     }

     std::lock_guard<std::recursive_mutex> lock(mBucketMutex);
-    auto referenced = getAllReferencedBuckets();
+    auto referenced = getAllReferencedBuckets(has);
     std::transform(std::begin(mSharedLiveBuckets), std::end(mSharedLiveBuckets),
                    std::inserter(referenced, std::end(referenced)),
                    [](std::pair<Hash, std::shared_ptr<LiveBucket>> const& p) {

@@ -818,11 +818,11 @@ BucketManager::cleanupStaleFiles()
 }

 void
-BucketManager::forgetUnreferencedBuckets()
+BucketManager::forgetUnreferencedBuckets(HistoryArchiveState const& has)
 {
     ZoneScoped;
     std::lock_guard<std::recursive_mutex> lock(mBucketMutex);
-    auto referenced = getAllReferencedBuckets();
+    auto referenced = getAllReferencedBuckets(has);
     auto blReferenced = getBucketListReferencedBuckets();

     auto bucketMapLoop = [&](auto& bucketMap, auto& futureMap) {

@@ -867,7 +867,7 @@
                 Bucket,
                 "BucketManager::forgetUnreferencedBuckets dropping {}",
                 filename);
-            if (!filename.empty() && !mApp.getConfig().DISABLE_BUCKET_GC)
+            if (!filename.empty() && !mConfig.DISABLE_BUCKET_GC)
             {
                 CLOG_TRACE(Bucket, "removing bucket file: {}", filename);
                 std::filesystem::remove(filename);

@@ -1049,15 +1049,15 @@ BucketManager::maybeSetIndex(std::shared_ptr<BucketBase> b,

 void
 BucketManager::startBackgroundEvictionScan(uint32_t ledgerSeq,
-                                           uint32_t ledgerVers)
+                                           uint32_t ledgerVers,
+                                           SorobanNetworkConfig const& cfg)
 {
     releaseAssert(mSnapshotManager);
     releaseAssert(!mEvictionFuture.valid());
     releaseAssert(mEvictionStatistics);

     auto searchableBL =
         mSnapshotManager->copySearchableLiveBucketListSnapshot();
-    auto const& cfg = mApp.getLedgerManager().getSorobanNetworkConfigForApply();
     auto const& sas = cfg.stateArchivalSettings();

     using task_t = std::packaged_task<EvictionResultCandidates()>;

@@ -1078,31 +1078,27 @@ BucketManager::startBackgroundEvictionScan(uint32_t ledgerSeq,
 }

 EvictedStateVectors
-BucketManager::resolveBackgroundEvictionScan(AbstractLedgerTxn& ltx,
-                                             uint32_t ledgerSeq,
-                                             LedgerKeySet const& modifiedKeys,
-                                             uint32_t ledgerVers)
+BucketManager::resolveBackgroundEvictionScan(
+    AbstractLedgerTxn& ltx, uint32_t ledgerSeq,
+    LedgerKeySet const& modifiedKeys, uint32_t ledgerVers,
+    SorobanNetworkConfig const& networkConfig)
 {
     ZoneScoped;
-    releaseAssert(threadIsMain());
     releaseAssert(mEvictionStatistics);

     if (!mEvictionFuture.valid())
     {
-        startBackgroundEvictionScan(ledgerSeq, ledgerVers);
+        startBackgroundEvictionScan(ledgerSeq, ledgerVers, networkConfig);
     }

     auto evictionCandidates = mEvictionFuture.get();

-    auto const& networkConfig =
-        mApp.getLedgerManager().getSorobanNetworkConfigForApply();
-
     // If eviction related settings changed during the ledger, we have to
     // restart the scan
     if (!evictionCandidates.isValid(ledgerSeq,
                                     networkConfig.stateArchivalSettings()))
     {
-        startBackgroundEvictionScan(ledgerSeq, ledgerVers);
+        startBackgroundEvictionScan(ledgerSeq, ledgerVers, networkConfig);
         evictionCandidates = mEvictionFuture.get();
     }

@@ -1229,6 +1225,7 @@ BucketManager::assumeState(HistoryArchiveState const& has,
                            uint32_t maxProtocolVersion, bool restartMerges)
 {
     ZoneScoped;
+    releaseAssert(threadIsMain());
     releaseAssertOrThrow(mConfig.MODE_ENABLES_BUCKETLIST);

     // TODO: Assume archival bucket state

@@ -1277,7 +1274,7 @@
         mLiveBucketList->restartMerges(mApp, maxProtocolVersion,
                                        has.currentLedger);
     }
-    cleanupStaleFiles();
+    cleanupStaleFiles(has);
 }

 void

@@ -1378,7 +1375,7 @@ std::shared_ptr<LiveBucket>
 BucketManager::mergeBuckets(HistoryArchiveState const& has)
 {
     ZoneScoped;
-
+    releaseAssert(threadIsMain());
     std::map<LedgerKey, LedgerEntry> ledgerMap = loadCompleteLedgerState(has);
     BucketMetadata meta;
     MergeCounters mc;

@@ -1568,9 +1565,11 @@ BucketManager::visitLedgerEntries(
 }

 std::shared_ptr<BasicWork>
-BucketManager::scheduleVerifyReferencedBucketsWork()
+BucketManager::scheduleVerifyReferencedBucketsWork(
+    HistoryArchiveState const& has)
 {
-    std::set<Hash> hashes = getAllReferencedBuckets();
+    releaseAssert(threadIsMain());
+    std::set<Hash> hashes = getAllReferencedBuckets(has);
     std::vector<std::shared_ptr<BasicWork>> seq;
     for (auto const& h : hashes)
     {

src/bucket/BucketManager.h (+16 -6)

@@ -36,6 +36,7 @@ class BucketSnapshotManager;
 class SearchableLiveBucketListSnapshot;
 struct BucketEntryCounters;
 enum class LedgerEntryTypeAndDurability : uint32_t;
+class SorobanNetworkConfig;

 struct HistoryArchiveState;

@@ -70,6 +71,11 @@ class BucketManager : NonMovableOrCopyable

     static std::string const kLockFilename;

+    // NB: ideally, BucketManager should have no access to mApp, as it's too
+    // dangerous in the context of parallel application. BucketManager is quite
+    // bloated, with lots of legacy code, so to ensure safety, annotate all
+    // functions using mApp with `releaseAssert(threadIsMain())` and avoid
+    // accessing mApp in the background.
     Application& mApp;
     std::unique_ptr<LiveBucketList> mLiveBucketList;
     std::unique_ptr<HotArchiveBucketList> mHotArchiveBucketList;

@@ -124,7 +130,7 @@ class BucketManager : NonMovableOrCopyable

     std::atomic<bool> mIsShutdown{false};

-    void cleanupStaleFiles();
+    void cleanupStaleFiles(HistoryArchiveState const& has);
     void deleteTmpDirAndUnlockBucketDir();
     void deleteEntireBucketDir();

@@ -260,7 +266,7 @@ class BucketManager : NonMovableOrCopyable
     // not immediately cause the buckets to delete themselves, if someone else
     // is using them via a shared_ptr<>, but the BucketManager will no longer
     // independently keep them alive.
-    void forgetUnreferencedBuckets();
+    void forgetUnreferencedBuckets(HistoryArchiveState const& has);

     // Feed a new batch of entries to the bucket list. This interface expects to
     // be given separate init (created) and live (updated) entry vectors. The

@@ -290,7 +296,8 @@ class BucketManager : NonMovableOrCopyable
     // Scans BucketList for non-live entries to evict starting at the entry
     // pointed to by EvictionIterator. Evicts until `maxEntriesToEvict` entries
     // have been evicted or maxEvictionScanSize bytes have been scanned.
-    void startBackgroundEvictionScan(uint32_t ledgerSeq, uint32_t ledgerVers);
+    void startBackgroundEvictionScan(uint32_t ledgerSeq, uint32_t ledgerVers,
+                                     SorobanNetworkConfig const& cfg);

     // Returns a pair of vectors representing entries evicted this ledger, where
     // the first vector constains all deleted keys (TTL and temporary), and

@@ -300,7 +307,8 @@ class BucketManager : NonMovableOrCopyable
     EvictedStateVectors
     resolveBackgroundEvictionScan(AbstractLedgerTxn& ltx, uint32_t ledgerSeq,
                                   LedgerKeySet const& modifiedKeys,
-                                  uint32_t ledgerVers);
+                                  uint32_t ledgerVers,
+                                  SorobanNetworkConfig const& networkConfig);

     medida::Meter& getBloomMissMeter() const;
     medida::Meter& getBloomLookupMeter() const;

@@ -325,7 +333,8 @@ class BucketManager : NonMovableOrCopyable

     // Return the set of buckets referenced by the BucketList, LCL HAS,
     // and publish queue.
-    std::set<Hash> getAllReferencedBuckets() const;
+    std::set<Hash>
+    getAllReferencedBuckets(HistoryArchiveState const& has) const;

     // Check for missing bucket files that would prevent `assumeState` from
     // succeeding

@@ -382,7 +391,8 @@ class BucketManager : NonMovableOrCopyable

     // Schedule a Work class that verifies the hashes of all referenced buckets
     // on background threads.
-    std::shared_ptr<BasicWork> scheduleVerifyReferencedBucketsWork();
+    std::shared_ptr<BasicWork>
+    scheduleVerifyReferencedBucketsWork(HistoryArchiveState const& has);

     Config const& getConfig() const;

src/bucket/BucketSnapshotManager.cpp (-3)

@@ -98,7 +98,6 @@ BucketSnapshotManager::recordBulkLoadMetrics(std::string const& label,
 {
     // For now, only keep metrics for the main thread. We can decide on what
     // metrics make sense when more background services are added later.
-    releaseAssert(threadIsMain());

     if (numEntries != 0)
     {

@@ -153,8 +152,6 @@ BucketSnapshotManager::updateCurrentSnapshot(
     SnapshotPtrT<LiveBucket>&& liveSnapshot,
     SnapshotPtrT<HotArchiveBucket>&& hotArchiveSnapshot)
 {
-    releaseAssert(threadIsMain());
-
     auto updateSnapshot = [numHistoricalSnapshots = mNumHistoricalSnapshots](
                               auto& currentSnapshot, auto& historicalSnapshots,
                               auto&& newSnapshot) {

src/bucket/PublishQueueBuckets.cpp (-55)

This file was deleted.
