Skip to content

feat: improve ledger metrics gathering + updated ledger size management #410

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 40 commits into
base: dev
Choose a base branch
from

Conversation

thlorenz
Copy link
Collaborator

@thlorenz thlorenz commented Jun 24, 2025

Summary

Ledger metrics no longer rely on counting entries in ledger columns.

Ledger size management was redone to not rely on column counting either.

Supersedes: #413 (changes from there were cherry picked here)

Details

Metrics

Previously metrics relied on counting entries in ledger columns which is very slow for larger
ledgers.
We also had tried to improve perf by caching counts, but that proved not sufficient.

Instead now we update prometheus counters directly when an entry is added. We no longer use any
methods that count entries in ledger columns in order to update metrics.

The only piece that still relies on these methods is the ledger_stats command. However this
only runs during inspection/diagnostics and thus performance is not as much of a concern. For
that reason I also removed all counter caching in order to not incurr any extra overhead while
the validator is running.

Renaming Metrics

Since ledger metrics where previously tracked as gauges and now are counters, the metric names
are changed as follows:

  • ledger_blocktimes_gauge -> ledger_blocktimes_count
  • ledger_blockhashes_gauge -> ledger_blockhashes_count
  • ledger_slot_signatures_gauge -> ledger_slot_signatures_count
  • ledger_address_signatures_gauge -> ledger_address_signatures_count
  • ledger_transaction_status_gauge -> ledger_transaction_status_count
  • ledger_transaction_successful_status_gauge -> ledger_transaction_successful_status_count
  • ledger_transaction_failed_status_gauge -> ledger_transaction_failed_status_count
  • ledger_transactions_gauge -> ledger_transactions_count
  • ledger_transaction_memos_gauge -> ledger_transaction_memos_count
  • ledger_perf_samples_gauge -> ledger_perf_samples_count
  • ledger_account_mod_data_gauge -> ledger_account_mod_data_count

Ledger Size Management

We now use watermarks to keep track of the ledger size at particular slots and thus predict
more accurately how many slots to truncate to bring the ledger size below max size.

The default strategy truncates to 75% max size whenever we go above max size.
The code ended up more complex than the original truncator also due to handling lots of
edgecases related to restarts with lower max size and the finality slot.

Unit tests for each of those ensure that they are all handled correctly.

The original truncator and tests were removed.

NOTE: we track the accounts mod id at the watermark boundaries which we can use to truncate
these columns correctly as well, however we do not do that yet.

Truncation via Compaction

The updated truncation implementation was cherry picked from the
fix/ledger/delete-using-compaction-filter
branch
.

The main difference is that we don't create tombstones anymore since we use a filter in order
to delete rows during manual (triggered by the ledger size manager) or automatic compaction.
After cherry picking I added a guard to avoid deleting account mod ids since they don't have a
key starting with a slot. Otherwise we'd potentially delete account mod ids that are actually
still needed.

Greptile Summary

Major overhaul of ledger metrics and size management system, replacing slow column entry counting with direct Prometheus counter updates and introducing a watermark-based ledger size management system.

  • Replaced gauge-based metrics with counters (e.g. ledger_blocktimes_gaugeledger_blocktimes_count) for more efficient tracking
  • Implemented watermark-based ledger size tracking to accurately predict truncation points and target 75% max size
  • Removed entry counter caching from LedgerColumn to eliminate memory overhead
  • Added RocksDB compaction filter system to efficiently purge old slots without tombstones
  • Added safeguards in compaction to protect account mod IDs from accidental deletion

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

5 files reviewed, no comments
Edit PR Review Bot Settings | Greptile

@thlorenz thlorenz marked this pull request as draft June 24, 2025 04:55
@thlorenz thlorenz changed the title feat: improve ledger metrics gathering feat: improve ledger metrics gathering + updated ledger size management Jul 2, 2025
@thlorenz thlorenz marked this pull request as ready for review July 2, 2025 03:17
* master:
  release: v0.1.7 (#409)
  Feat: simpler gasless implementation (#406)
  perf: run ensure accounts concurrently
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

20 files reviewed, 7 comments
Edit PR Review Bot Settings | Greptile

@thlorenz thlorenz requested review from bmuddha and taco-paco July 2, 2025 03:39
thlorenz added 2 commits July 2, 2025 11:43
* master:
  chore: issues were displayed incorrectly in README (#438)
  Fixed graceful shutdown (#414)
Copy link
Contributor

@taco-paco taco-paco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested some improvements, but overall looks good.

Also would suggest a quick merge with my initial branch, I see some pieces in ledger are missing. It should be fairly easy to resolve to your changes in most of the cases

// If we restarted with an existing ledger we need to make sure that the
// ledger size is not far above the max size before we can
// start using the watermark strategy.
// NOTE: that watermarks are set during the first tick
Copy link
Contributor

@taco-paco taco-paco Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this comment says Watermarks are set during first tick. Let's move this out the tick.
The code would be

tokio::spawn(async move {
                    let mut interval = interval(size_check_interval);
                    let mut watermarks = prepare_watermarksI(necessary_args);
                    loop {
                        tokio::select! {
                            _ = cancellation_token.cancelled() => {
                                return;
                            }
                            _ = interval.tick() => {
                                    let ledger_size = Self::tick(
                                        &ledger,
                                        &finality_provider,
                                        &mut watermarks,
                                        &resize_percentage,
                                        max_ledger_size,
                                        &mut existing_ledger_state,
                                    ).await;
                                    if let Some(ledger_size) = ledger_size {
                                        metrics::set_ledger_size(ledger_size);
                                    }
                                }
                        }
                    }
                })

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we could get rid of ugly Option in
fn tick(..., watermarks: &mut Option<Watermarks>)
just fn tick(..., watermarks: &mut Watermarks)

Copy link
Collaborator Author

@thlorenz thlorenz Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Option cannot be removed because watermarks need to be obtained during the first tick. Here's why:

  1. Lazy Initialization: Watermarks are initialized differently depending on whether we have an existing ledger state or are starting fresh. This decision can only be made during the first tick when we check watermarks.is_none().

  2. Existing Ledger State Handling: When restarting with an existing ledger, we need to:

    • Check if the current ledger size exceeds max size
    • Potentially truncate before setting up watermarks
    • Initialize watermarks with the adjusted size and proper slot boundaries

    This complex initialization logic requires the watermarks to start as None and be populated during the first tick.

  3. Fresh Ledger Handling: For new ledgers, watermarks are initialized with Watermarks::new(resize_percentage, max_ledger_size, None) during the first call to get_or_insert_with().

The Option pattern here is not "ugly" but necessary for the correct initialization sequence. Removing it would require moving complex initialization logic elsewhere or passing uninitialized watermarks, which would be less safe and more error-prone.

Copy link
Contributor

@bmuddha bmuddha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks over-engineered, but well, not much can be done about it rn, if it solves the resource overconsumption issue, we can merge this PR.

* dev:
  Cleanup and refactor accountsdb index (#419)
  chore: create dev branch
  fix: don't block the accounts update loop with subscription (#420)
  chore: increase compute budget for table mania create lookup table (#450)
@thlorenz thlorenz changed the base branch from master to dev July 15, 2025 07:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants