-
Notifications
You must be signed in to change notification settings - Fork 15
feat: improve ledger metrics gathering + updated ledger size management #410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
5 files reviewed, no comments
Edit PR Review Bot Settings | Greptile
…te-using-compaction-filter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
20 files reviewed, 7 comments
Edit PR Review Bot Settings | Greptile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested some improvements, but overall looks good.
Also would suggest a quick merge with my initial branch, I see some pieces in ledger are missing. It should be fairly easy to resolve to your changes in most of the cases
// If we restarted with an existing ledger we need to make sure that the | ||
// ledger size is not far above the max size before we can | ||
// start using the watermark strategy. | ||
// NOTE: that watermarks are set during the first tick |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this comment says Watermarks are set during first tick. Let's move this out the tick.
The code would be
tokio::spawn(async move {
let mut interval = interval(size_check_interval);
let mut watermarks = prepare_watermarksI(necessary_args);
loop {
tokio::select! {
_ = cancellation_token.cancelled() => {
return;
}
_ = interval.tick() => {
let ledger_size = Self::tick(
&ledger,
&finality_provider,
&mut watermarks,
&resize_percentage,
max_ledger_size,
&mut existing_ledger_state,
).await;
if let Some(ledger_size) = ledger_size {
metrics::set_ledger_size(ledger_size);
}
}
}
}
})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we could get rid of ugly Option in
fn tick(..., watermarks: &mut Option<Watermarks>)
just fn tick(..., watermarks: &mut Watermarks)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Option cannot be removed because watermarks need to be obtained during the first tick. Here's why:
-
Lazy Initialization: Watermarks are initialized differently depending on whether we have an existing ledger state or are starting fresh. This decision can only be made during the first tick when we check
watermarks.is_none()
. -
Existing Ledger State Handling: When restarting with an existing ledger, we need to:
- Check if the current ledger size exceeds max size
- Potentially truncate before setting up watermarks
- Initialize watermarks with the adjusted size and proper slot boundaries
This complex initialization logic requires the watermarks to start as None and be populated during the first tick.
-
Fresh Ledger Handling: For new ledgers, watermarks are initialized with
Watermarks::new(resize_percentage, max_ledger_size, None)
during the first call toget_or_insert_with()
.
The Option pattern here is not "ugly" but necessary for the correct initialization sequence. Removing it would require moving complex initialization logic elsewhere or passing uninitialized watermarks, which would be less safe and more error-prone.
* master: feat: committor service (#366)
Amp-Thread: https://ampcode.com/threads/T-ac49f6d4-4ea8-465a-a7c1-0f5a72ccde2c Co-authored-by: Amp <[email protected]>
Amp-Thread: https://ampcode.com/threads/T-ac49f6d4-4ea8-465a-a7c1-0f5a72ccde2c Co-authored-by: Amp <[email protected]>
Amp-Thread: https://ampcode.com/threads/T-ac49f6d4-4ea8-465a-a7c1-0f5a72ccde2c Co-authored-by: Amp <[email protected]>
Amp-Thread: https://ampcode.com/threads/T-ac49f6d4-4ea8-465a-a7c1-0f5a72ccde2c Co-authored-by: Amp <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks over-engineered, but well, not much can be done about it rn, if it solves the resource overconsumption issue, we can merge this PR.
Summary
Ledger metrics no longer rely on counting entries in ledger columns.
Ledger size management was redone to not rely on column counting either.
Supersedes: #413 (changes from there were cherry picked here)
Details
Metrics
Previously metrics relied on counting entries in ledger columns which is very slow for larger
ledgers.
We also had tried to improve perf by caching counts, but that proved not sufficient.
Instead now we update prometheus counters directly when an entry is added. We no longer use any
methods that count entries in ledger columns in order to update metrics.
The only piece that still relies on these methods is the
ledger_stats
command. However thisonly runs during inspection/diagnostics and thus performance is not as much of a concern. For
that reason I also removed all counter caching in order to not incurr any extra overhead while
the validator is running.
Renaming Metrics
Since ledger metrics where previously tracked as gauges and now are counters, the metric names
are changed as follows:
Ledger Size Management
We now use watermarks to keep track of the ledger size at particular slots and thus predict
more accurately how many slots to truncate to bring the ledger size below max size.
The default strategy truncates to 75% max size whenever we go above max size.
The code ended up more complex than the original truncator also due to handling lots of
edgecases related to restarts with lower max size and the finality slot.
Unit tests for each of those ensure that they are all handled correctly.
The original truncator and tests were removed.
NOTE: we track the accounts mod id at the watermark boundaries which we can use to truncate
these columns correctly as well, however we do not do that yet.
Truncation via Compaction
The updated truncation implementation was cherry picked from the
fix/ledger/delete-using-compaction-filter
branch.
The main difference is that we don't create tombstones anymore since we use a filter in order
to delete rows during manual (triggered by the ledger size manager) or automatic compaction.
After cherry picking I added a guard to avoid deleting account mod ids since they don't have a
key starting with a slot. Otherwise we'd potentially delete account mod ids that are actually
still needed.
Greptile Summary
Major overhaul of ledger metrics and size management system, replacing slow column entry counting with direct Prometheus counter updates and introducing a watermark-based ledger size management system.
ledger_blocktimes_gauge
→ledger_blocktimes_count
) for more efficient trackingLedgerColumn
to eliminate memory overhead