Revisit default TBS storage size limit sampling.tail.storage_limit and storage limit handling #14933

Open
Tracked by #14931
carsonip opened this issue Dec 12, 2024 · 6 comments

Comments

@carsonip
Member

carsonip commented Dec 12, 2024

sampling.tail.storage_limit defaults to 3GB, which is usually insufficient for high-throughput use cases. A storage limit that is too low causes TBS to be bypassed, along with other downstream effects. We should reconsider the default in the 9.0 release.

@carsonip
Member Author

I have 2 ideas for the new default.

  • to allow infinite disk usage
  • to detect available disk space (on startup / periodically) and set the limit dynamically

The argument for both is that if you are running, e.g., a MySQL server, you expect it to use storage space, but you don't expect to have to update a "storage limit" every time it gets hit.

The argument for detection specifically is that it avoids making a node unhealthy in any environment.
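
To make the detection idea concrete, here is a minimal sketch of how a limit could be derived from free disk space, at startup or on a periodic check. The function name limitFromAvailable, the formula (current database size plus a fraction of the free space), and the 0.5 fraction are all assumptions for illustration, not something decided in this issue.

```go
package main

import "fmt"

// limitFromAvailable returns a storage limit that lets the TBS database grow
// from its current size into a fraction of the space still free on the volume.
// Both the formula and the fraction are illustrative assumptions.
func limitFromAvailable(dbSize, avail uint64, fraction float64) uint64 {
	// Clamp the fraction so a misconfiguration cannot exceed the free space.
	if fraction < 0 {
		fraction = 0
	}
	if fraction > 1 {
		fraction = 1
	}
	return dbSize + uint64(float64(avail)*fraction)
}

func main() {
	const gib = 1 << 30
	// Startup: empty database, 10 GiB free, TBS may take half of it.
	fmt.Println(limitFromAvailable(0, 10*gib, 0.5)/gib, "GiB")
	// Later check: database at 4 GiB, only 2 GiB still free on the volume.
	fmt.Println(limitFromAvailable(4*gib, 2*gib, 0.5)/gib, "GiB")
}
```

Recomputing the limit from the current database size keeps it from dropping below what is already stored when other processes consume disk space.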

@carsonip carsonip changed the title Revisit default TBS storage size limit sampling.tail.storage_size Revisit default TBS storage size limit sampling.tail.storage_limit Jan 13, 2025
@simitt
Contributor

simitt commented Jan 15, 2025

@carsonip the solution to detect available disk space would be more elegant. We would need to ensure that it works on all supported operating systems and in Docker, and have a sensible fallback if disk size cannot be detected.
Do you have capacity to look into this?

@carsonip
Member Author

Do you have capacity to look into this?

Yes, it is in scope of this task. I'll write down some possible solutions soon.

@carsonip carsonip changed the title Revisit default TBS storage size limit sampling.tail.storage_limit Revisit default TBS storage size limit sampling.tail.storage_limit and storage limit handling Jan 15, 2025
@carsonip
Member Author

carsonip commented Jan 22, 2025

This task depends heavily on the outcome of #15235.

Assuming #15235 goes well, the pebble storage usage profile will be very different from badger's. It should be much more predictable, and with that we will be able to settle on one of the following:

  1. keep storage limit handling, but set a much more reasonable and usable default, and add documentation on how to set a good storage limit.
  2. (on top of point 1) default to disk space detection, but fall back to a constant default. There may be complexity in a cross-platform solution.

@simitt what do you think of the other alternatives, are they no-go?

  3. keep storage limit handling but default to unlimited (and maybe always set it to a number on ESS)
  4. remove storage limit handling altogether. This will simplify #15235.
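
As a rough illustration of option 2, the precedence could look like the sketch below: an explicitly configured limit wins, then a detected value, then a constant. Everything here is hypothetical: resolveLimit, detectFunc, and the fallbackLimit value are placeholders, and the thread has not settled on an actual fallback number.

```go
package main

import (
	"errors"
	"fmt"
)

// fallbackLimit is a placeholder constant; the thread does not settle on a value.
const fallbackLimit uint64 = 3_000_000_000

// detectFunc is a hypothetical hook returning the bytes usable for TBS storage.
type detectFunc func() (uint64, error)

// resolveLimit picks the effective limit and reports where it came from:
// explicit configuration first, then detection, then the constant fallback.
func resolveLimit(configured uint64, detect detectFunc) (uint64, string) {
	if configured > 0 {
		return configured, "configured"
	}
	if detect != nil {
		if n, err := detect(); err == nil && n > 0 {
			return n, "detected"
		}
	}
	return fallbackLimit, "fallback"
}

func main() {
	ok := func() (uint64, error) { return 50_000_000_000, nil }
	unsupported := func() (uint64, error) {
		return 0, errors.New("disk usage not supported on this platform")
	}

	fmt.Println(resolveLimit(0, ok))              // detection succeeds
	fmt.Println(resolveLimit(0, unsupported))     // detection fails, constant is used
	fmt.Println(resolveLimit(10_000_000_000, ok)) // explicit setting wins
}
```

Keeping detection behind a function value means platforms where it cannot work simply return an error and land on the constant.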

@simitt
Contributor

simitt commented Jan 23, 2025

  3. keep storage limit handling but default to unlimited (and maybe always set it to a number on ESS)
  4. removing storage limit handling altogether. This will simplify TBS: Replace badger with pebble #15235.

These options do not sound customer-friendly to me.
Options 1 and 2 both sound reasonable. If option 2 is risky or very complex, I wouldn't see a big problem with starting with option 1 and then potentially switching to option 2 later.

@carsonip
Member Author

Option 1 is the baseline and should be trivial.
I'm also optimistic about option 2 as we move to pebble in #15246, because pebble appears to provide cross-platform disk usage stats (available, used, total): https://github.com/cockroachdb/pebble/blob/master/vfs/disk_usage_linux.go
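
A sketch of how such stats could feed a storage check. The DiskUsage struct and GetDiskUsage method below are local stand-ins modeled on what pebble's vfs package appears to expose; the exact shape should be verified against the pebble source. The diskNearlyFull helper, the fakeFS type, the "/data" path, and the 0.9 threshold are hypothetical.

```go
package main

import "fmt"

// DiskUsage mirrors the (available, total, used) stats mentioned above.
// It is a local stand-in, not imported from pebble.
type DiskUsage struct {
	AvailBytes, TotalBytes, UsedBytes uint64
}

// diskUsageFS is the single method this sketch needs from a filesystem abstraction.
type diskUsageFS interface {
	GetDiskUsage(path string) (DiskUsage, error)
}

// fakeFS returns canned stats so the sketch runs without touching a real disk.
type fakeFS struct {
	usage DiskUsage
	err   error
}

func (f fakeFS) GetDiskUsage(_ string) (DiskUsage, error) { return f.usage, f.err }

// diskNearlyFull reports whether the used share of the volume holding dir
// has reached maxUsedFraction.
func diskNearlyFull(fs diskUsageFS, dir string, maxUsedFraction float64) (bool, error) {
	u, err := fs.GetDiskUsage(dir)
	if err != nil {
		return false, err
	}
	if u.TotalBytes == 0 {
		return false, fmt.Errorf("no total size reported for %q", dir)
	}
	return float64(u.UsedBytes)/float64(u.TotalBytes) >= maxUsedFraction, nil
}

func main() {
	fs := fakeFS{usage: DiskUsage{AvailBytes: 5, TotalBytes: 100, UsedBytes: 95}}
	full, err := diskNearlyFull(fs, "/data", 0.9)
	fmt.Println(full, err)
}
```

Depending only on a one-method interface would let the same check run against pebble's filesystem in production and against a fake in tests.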


2 participants