TBS: Replace badger with pebble #15235

Merged
merged 190 commits into from
Jan 29, 2025


carsonip
Member

@carsonip carsonip commented Jan 14, 2025

Motivation/summary

This PR replaces badger with pebble as the database for tail-based sampling.

// The database of choice is Pebble, which does not have TTL handling built-in,
// and we implement our own TTL handling on top of the database:
// - TTL is divided up into N parts, where N is partitionsPerTTL.
// - A database holds N + 1 + 1 partitions.
// - Every TTL/N we will discard the oldest partition, so we keep a rolling window of N+1 partitions.
// - Writes will go to the most recent partition, and we'll read across N+1 partitions
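
The rolling window described in the comment above can be sketched with an in-memory map standing in for the database. This is only an illustration, not the code in this PR: `partitionedStore`, `prefixed`, `Write`, `Read` and `Rotate` are hypothetical names, and the one-byte prefix encoding is an assumption.

```go
package main

// partitionedStore is a hypothetical in-memory stand-in for the partitioned
// database; the names here are illustrative and not taken from the PR.
type partitionedStore struct {
	n      int               // partitionsPerTTL (N)
	active int               // partition ID currently receiving writes
	data   map[string]string // prefixed key -> value
}

func newPartitionedStore(n int) *partitionedStore {
	return &partitionedStore{n: n, data: make(map[string]string)}
}

// total is the number of partition IDs in use: N + 1 + 1.
func (s *partitionedStore) total() int { return s.n + 2 }

// prefixed builds the stored key: one byte of partition ID, then the user key.
// The one-byte encoding is an assumption made for this sketch.
func prefixed(id int, key string) string {
	return string([]byte{byte(id)}) + key
}

// Write stores the value under the most recent (active) partition.
func (s *partitionedStore) Write(key, value string) {
	s.data[prefixed(s.active, key)] = value
}

// Read consults the N+1 partitions in the rolling window, newest first.
func (s *partitionedStore) Read(key string) (string, bool) {
	for i := 0; i <= s.n; i++ {
		id := (s.active - i + s.total()) % s.total()
		if v, ok := s.data[prefixed(id, key)]; ok {
			return v, true
		}
	}
	return "", false
}

// Rotate is what a GC loop would call every TTL/N: it advances the active
// partition, then deletes every entry in the partition that left the window.
func (s *partitionedStore) Rotate() {
	s.active = (s.active + 1) % s.total()
	expired := (s.active + 1) % s.total()
	for k := range s.data {
		if k[0] == byte(expired) {
			delete(s.data, k)
		}
	}
}
```

With N=1 in this sketch, an entry survives the first `Rotate` call and is discarded by the second, so it lives for between TTL and 2*TTL depending on when in the interval it was written.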

Benchmarks

TL;DR: +3000% indexed events/s and +76% intakev2 event rate, with -75% memory usage and -44% disk usage.

See comment for details.

Major design changes

  • As pebble does not support TTL, this PR introduces a partitioning method to enforce it. A rotating partition key (not timestamp based) is used as the database key prefix, and a background thread runs a "TTL GC Loop": every TTL, the key prefix is rotated, and the expired prefix is deleted and compacted. This explicit TTL handling enables precise deletion, and the lifetime of an entry in the database is strictly bounded by 2*TTL. A knob, partitionsPerTTL, adjusts the number of available prefixes to trade storage overhead against read amplification: e.g. partitionsPerTTL=1 keeps up to 2*TTL worth of entries with 2 partition reads per key read, while partitionsPerTTL=2 keeps up to 1.5*TTL worth of entries with 3 partition reads per key read.
  • Not using timestamp based prefix enables TTL adjustments without data loss on EA hot reload / apm-server restart, as prefixes are fixed but TTL-truncated timestamps aren't.
  • Sampling decisions and events are stored in separate pebble databases, as they have vastly different characteristics and therefore different optimal pebble options. Compression is enabled for events but disabled for decisions.
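
The partitionsPerTTL trade-off in the first bullet comes down to two small formulas: a key read touches N+1 partitions, and the retained data is bounded by (N+1)/N times the TTL. A minimal sketch of that arithmetic, where the function names are hypothetical and not part of the PR:

```go
package main

// readsPerLookup returns how many partitions a single key read consults
// for a given partitionsPerTTL (N): the rolling window of N+1 partitions.
func readsPerLookup(partitionsPerTTL int) int {
	return partitionsPerTTL + 1
}

// maxRetentionFactor returns the upper bound on retained data as a multiple
// of TTL: N+1 partitions, each covering TTL/N, gives (N+1)/N.
func maxRetentionFactor(partitionsPerTTL int) float64 {
	return float64(partitionsPerTTL+1) / float64(partitionsPerTTL)
}
```

For partitionsPerTTL=1 this gives a factor of 2 with 2 reads, and for partitionsPerTTL=2 a factor of 1.5 with 3 reads, matching the figures quoted above. Larger values move the storage bound closer to 1*TTL at the cost of more partition reads per lookup.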


Other implied changes

  • As TTL is enforced by deleting entries and compacting actively in the TTL GC Loop, there is no TTL validation on stored entries at read time, i.e. it is possible to run apm-server TBS, stop it, wait for 1 year, then restart apm-server, and old entries will be respected. In contrast, badger checks for expiry at read time. As trace IDs are supposed to be unique, the downside of not doing read-time TTL validation should be minimal.
  • Pebble does not use a vlog like badger. Together with the event and decision DB separation, this requires changes to the structure of the monitoring metrics. In the draft PR, the table size and WAL size across the 2 DBs are summed and reported in sampling.tail.storage.lsm_size, while sampling.tail.storage.vlog_size is always 0. This change is not decided yet.
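
The draft metrics mapping described in the last bullet could look roughly like the following. This is a sketch only: `dbSizes`, `storageMetrics` and `aggregate` are hypothetical names, and how the per-database sizes are obtained from pebble is left out.

```go
package main

// dbSizes is a hypothetical snapshot of one database's on-disk footprint.
type dbSizes struct {
	tableBytes uint64 // total size of the tables
	walBytes   uint64 // size of the write-ahead log
}

// storageMetrics holds the two values reported by the draft.
type storageMetrics struct {
	lsmSize  uint64 // reported as sampling.tail.storage.lsm_size
	vlogSize uint64 // reported as sampling.tail.storage.vlog_size
}

// aggregate sums table and WAL sizes across all databases into lsmSize.
// vlogSize is never written, so it stays 0, since pebble has no vlog.
func aggregate(dbs ...dbSizes) storageMetrics {
	var m storageMetrics
	for _, db := range dbs {
		m.lsmSize += db.tableBytes + db.walBytes
	}
	return m
}
```

Calling `aggregate` with the event DB and decision DB snapshots yields a single lsm_size value covering both databases.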

TODO:

Useful but not necessary, out of scope of this PR:

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

Enable TBS, try various sampling policies, send events, and keep it running for over 2 * TTL. Ensure that disk usage is bounded and memory usage is as expected.

Related issues

Fixes #15246

Contributor

mergify bot commented Jan 14, 2025

This pull request does not have a backport label. Could you fix it @carsonip? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-8.x is the label to automatically backport to the 8.x branch.

Contributor

mergify bot commented Jan 14, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 14, 2025
axw
axw previously approved these changes Jan 29, 2025
1pkg
1pkg previously approved these changes Jan 29, 2025
Member

@1pkg 1pkg left a comment


LGTM!

One request from me: could we please add a description to all structs that implement the RW interface? There are a couple of them and it takes time to understand how they interact. A light description of their purpose will save time in the future.

@carsonip carsonip dismissed stale reviews from 1pkg and axw via e0857a5 January 29, 2025 00:16
@carsonip carsonip requested review from simitt, axw and 1pkg January 29, 2025 11:24
Contributor

@simitt simitt left a comment


Amazing work!

@carsonip carsonip merged commit 0ca58b8 into main Jan 29, 2025
16 checks passed
@carsonip carsonip deleted the tbs-pebble-rebase branch January 29, 2025 14:58
mergify bot pushed a commit that referenced this pull request Jan 29, 2025
This PR replaces badger with pebble as the database for tail-based sampling. Significant performance gains.

The database of choice is Pebble, which does not have TTL handling built-in,
and we implement our own TTL handling on top of the database:
- TTL is divided up into N parts, where N is partitionsPerTTL.
- A database holds N + 1 + 1 partitions.
- Every TTL/N we will discard the oldest partition, so we keep a rolling window of N+1 partitions.
- Writes will go to the most recent partition, and we'll read across N+1 partitions

(cherry picked from commit 0ca58b8)

# Conflicts:
#	go.mod
#	go.sum
#	internal/beater/monitoringtest/opentelemetry.go
#	x-pack/apm-server/main.go
#	x-pack/apm-server/main_test.go
#	x-pack/apm-server/sampling/processor.go
#	x-pack/apm-server/sampling/processor_bench_test.go
#	x-pack/apm-server/sampling/processor_test.go
Labels
backport-8.x Automated backport to the 8.x branch with mergify
Development

Successfully merging this pull request may close these issues.

TBS: Explore replacing badger with pebble
5 participants