-
Notifications
You must be signed in to change notification settings - Fork 528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TBS: Replace badger with pebble #15235
Conversation
This pull request does not have a backport label. Could you fix it @carsonip? 🙏
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
One request from me, could we please add a description to all struct that implement RW
interface. There are couple of them and it takes time to understand how they interact, it'll save time in the future if we have some light description to explain their purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work!
This PR replaces badger with pebble as the database for tail-based sampling. Significant performance gains. The database of choice is Pebble, which does not have TTL handling built-in, and we implement our own TTL handling on top of the database: - TTL is divided up into N parts, where N is partitionsPerTTL. - A database holds N + 1 + 1 partitions. - Every TTL/N we will discard the oldest partition, so we keep a rolling window of N+1 partitions. - Writes will go to the most recent partition, and we'll read across N+1 partitions (cherry picked from commit 0ca58b8) # Conflicts: # go.mod # go.sum # internal/beater/monitoringtest/opentelemetry.go # x-pack/apm-server/main.go # x-pack/apm-server/main_test.go # x-pack/apm-server/sampling/processor.go # x-pack/apm-server/sampling/processor_bench_test.go # x-pack/apm-server/sampling/processor_test.go
Motivation/summary
This PR replaces badger with pebble as the database for tail-based sampling.
Benchmarks
TLDR: +3000% in indexed events/s, +76% intakev2 event rate, while on -75% memory usage and -44% disk usage
See comment for details.
Major design changes
2*TTL
. In fact, there's a knobpartitionsPerTTL
to adjust the available prefixes to trade between storage overhead and read amplification. e.g. partitionsPerTTL=1 keeps2*TTL
entries with 2 partition reads per key read, while partitionsPerTTL=2 keeps1.5*TTL
entries with 3 partition reads per key read.Other implied changes
sampling.tail.storage.lsm_size
, whilesampling.tail.storage.vlog_size
is always 0. The change is not decided yet.TODO:
sampling.tail.storage_limit
and storage limit handling #14933Check for memory leak by running it for a long time, as pebble does manual memory managementevent loss on upgrade:see Migration path from badger to pebble #15423Useful but not necessary, out of scope of this PR:
testing/infra/terraform/modules/standalone_apm_server
for reproducible benchmark with moxy and without using ESS: terraform: support TBS in standalone_apm_server #15337Checklist
For functional changes, consider:
How to test these changes
Enable TBS, try various sampling policies, send events, keep it running for over 2 * TTL, ensure that disk usage is bounded, and memory usage is expected.
Related issues
Fixes #15246