TBS: set default sampling.tail.storage_limit to 0 but limit disk usage to 90% #15467
Conversation
This pull request does not have a backport label. Could you fix it @carsonip? 🙏
usage, err := vfs.Default.GetDiskUsage(sm.storageDir)
if err != nil {
	sm.rateLimitedLogger.With(logp.Error(err)).Warn("failed to get disk usage")
what is the fallback in this case?
In the current implementation it effectively becomes unlimited. That is by design, because disk usage and storage_limit are 2 separate checkers with separate read/write paths. But I think it is possible to overwrite dbStorageLimit with a fallback value if we get an error, so that the storage_limit checker takes over.
Added fallback handling. A bit complex, but it should cover all cases. All of it is written under the assumption that if GetDiskUsage ever returns an error, it will keep returning an error.
LGTM!
usage, err := sm.getDiskUsage()
if err != nil {
	sm.logger.With(logp.Error(err)).Warn("failed to get disk usage")
	sm.getDiskUsageFailed.Store(true)
nit: While I can't think of what could lead to transient failures, I am not sure about the always-failing-on-error bit - something to think about in a future PR.
This gives me a headache as well. What should the existing disk usage threshold checks do when getDiskUsage has transient failures: use a stale number, or become unlimited? The assumption that failures always persist simplifies the implementation.
TBS: set default sampling.tail.storage_limit to 0 but limit disk usage to 90% (#15467)

This is a breaking change to the default storage_limit to enable more user-friendly TBS disk usage handling. The new default automatically scales with a larger disk. Change sampling.tail.storage_limit default to 0. While 0 means unlimited local tail-sampling database size, it now enforces a max 90% disk usage on the disk where the data directory is located. Any tail-sampling writes after this threshold will be rejected, similar to what happens when the tail-sampling database size exceeds a non-0 storage limit. Setting sampling.tail.storage_limit to non-0 maintains the existing behavior, which limits the tail-sampling database size to sampling.tail.storage_limit and does not have the new disk usage threshold check.

(cherry picked from commit d019277)

Conflicts: changelogs/9.0.asciidoc
TBS: set default sampling.tail.storage_limit to 0 but limit disk usage to 90% (backport #15467) (#15501)

Backport of #15467 (same message as above), plus a changelog fix for the cherry-pick conflict in changelogs/9.0.asciidoc.

Co-authored-by: Carson Ip <[email protected]>
Fix a missing colon in logs (typo from #15235), and remove "storage" from the "configured storage limit reached" message to make way for #15467 and avoid confusion. (cherry picked from commit 28068bd) Co-authored-by: Carson Ip <[email protected]>
sampling.tail.storage_limit is 0 by default in 9.0. See elastic/apm-server#15467. As UI validation requires a unit (e.g. GB), set the APM integration default storage limit to 0GB, which carries the same meaning.
Motivation/summary
This is a breaking change to the default storage_limit to enable more user-friendly TBS disk usage handling. The new default automatically scales with a larger disk.

Change sampling.tail.storage_limit default to 0. While 0 means unlimited local tail-sampling database size, it now enforces a max 90% disk usage on the disk where the data directory is located. Any tail-sampling writes after this threshold will be rejected, similar to what happens when the tail-sampling database size exceeds a non-0 storage limit.

Setting sampling.tail.storage_limit to non-0 maintains the existing behavior, which limits the tail-sampling database size to sampling.tail.storage_limit and does not have the new disk usage threshold check.
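The two modes described above can be summarized in one decision function. This is a minimal sketch under stated assumptions: `shouldRejectWrite` and its parameters are hypothetical names, not apm-server's actual API, but the branching mirrors the behavior the summary describes.

```go
package main

import "fmt"

// shouldRejectWrite sketches the write-admission decision: with
// storageLimit == 0 (the new default), writes are rejected once usage of
// the disk holding the data directory reaches 90%; with a non-zero limit,
// only the tail-sampling database size is checked, as before.
func shouldRejectWrite(storageLimit, dbSize, diskUsed, diskTotal uint64) bool {
	if storageLimit == 0 {
		// New default: unlimited DB size, but cap disk usage at 90%.
		return float64(diskUsed) >= 0.9*float64(diskTotal)
	}
	// Existing behavior: limit DB size only, no disk usage check.
	return dbSize >= storageLimit
}

func main() {
	fmt.Println(shouldRejectWrite(0, 1<<40, 91, 100))  // disk over 90%: reject
	fmt.Println(shouldRejectWrite(0, 1<<40, 50, 100))  // disk under 90%: allow
	fmt.Println(shouldRejectWrite(100, 150, 99, 100))  // DB over limit: reject
}
```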
Checklist
For functional changes, consider:
How to test these changes
Create a tmpfs with various sizes, and check the logs as the disk usage threshold is hit.
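For the manual test above, a small helper like this can make the threshold easy to watch (the tmpfs mount itself, e.g. `mount -t tmpfs -o size=100m ...`, needs root and is omitted here; `usage_pct` is a hypothetical helper, not part of apm-server):

```shell
# Print a filesystem's used percentage as a bare integer, so it is obvious
# when the 90% disk usage threshold is crossed while filling the tmpfs.
usage_pct() {
    # -P forces POSIX single-line output; field 5 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

usage_pct /
```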
Related issues
Part of #15450
EA-managed apm-server needs elastic/integrations#12543 to change the default.