TBS: set default sampling.tail.storage_limit to 0 but limit disk usage to 90% #15467
Conversation
This pull request does not have a backport label. Could you fix it @carsonip? 🙏
usage, err := vfs.Default.GetDiskUsage(sm.storageDir)
if err != nil {
	sm.rateLimitedLogger.With(logp.Error(err)).Warn("failed to get disk usage")
what is the fallback in this case?
In the current implementation it effectively becomes unlimited. That is by design, because disk usage and storage_limit are 2 separate checkers with separate read/write paths. But I think it is possible to overwrite dbStorageLimit with a fallback value if we get an error, so that the storage_limit checker takes over.
Added fallback handling. A bit complex, but it should cover all cases. All of it is written under the assumption that if GetDiskUsage ever returns an error, it will keep returning an error.
LGTM!
usage, err := sm.getDiskUsage()
if err != nil {
	sm.logger.With(logp.Error(err)).Warn("failed to get disk usage")
	sm.getDiskUsageFailed.Store(true)
nit: While I can't think of what could lead to transient failures, I am not sure about the always-failing-on-error bit - something to think about in a future PR.
This gives me a headache as well. What should the existing disk usage threshold checks do when getDiskUsage has transient failures: use a stale number, or become unlimited? The assumption that failures always persist simplifies the implementation.
TBS: set default sampling.tail.storage_limit to 0 but limit disk usage to 90% (#15467)

This is a breaking change to the default storage_limit to enable more user-friendly TBS disk usage handling. The new default automatically scales with a larger disk. Change sampling.tail.storage_limit default to 0. While 0 means unlimited local tail-sampling database size, it now enforces a max 90% disk usage on the disk where the data directory is located. Any tail-sampling writes after this threshold will be rejected, similar to what happens when the tail-sampling database size exceeds a non-0 storage limit. Setting sampling.tail.storage_limit to non-0 maintains the existing behavior, which limits the tail-sampling database size to sampling.tail.storage_limit and does not have the new disk usage threshold check.

(cherry picked from commit d019277)

Conflicts: changelogs/9.0.asciidoc
TBS: set default sampling.tail.storage_limit to 0 but limit disk usage to 90% (backport #15467) (#15501)

Backport of #15467 (same message as above), plus a changelog fix for the cherry-pick conflict in changelogs/9.0.asciidoc.

Co-authored-by: Carson Ip <[email protected]>
Fix a missing colon in logs (typo from #15235), and remove "storage" from the "configured storage limit reached" message to make way for #15467 and avoid confusion. (cherry picked from commit 28068bd) Co-authored-by: Carson Ip <[email protected]>
sampling.tail.storage_limit is 0 by default in 9.0. See elastic/apm-server#15467. As UI validation requires a unit (e.g. GB), set the APM integration default storage limit to 0GB, which carries the same meaning.
Motivation/summary
This is a breaking change to the default storage_limit to enable more user-friendly TBS disk usage handling. The new default automatically scales with a larger disk.

Change sampling.tail.storage_limit default to 0. While 0 means unlimited local tail-sampling database size, it now enforces a max 90% disk usage on the disk where the data directory is located. Any tail-sampling writes after this threshold will be rejected, similar to what happens when the tail-sampling database size exceeds a non-0 storage limit.

Setting sampling.tail.storage_limit to non-0 maintains the existing behavior, which limits the tail-sampling database size to sampling.tail.storage_limit and does not have the new disk usage threshold check.
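The two modes described above can be summarized in one decision function. This is a minimal sketch under stated assumptions: `shouldRejectWrite` and its parameters are hypothetical names, not apm-server's actual API, but the branching mirrors the behavior the summary describes.

```go
package main

import "fmt"

// shouldRejectWrite sketches the write-admission decision: with
// storageLimit == 0 (the new default), writes are rejected once usage of
// the disk holding the data directory reaches 90%; with a non-zero limit,
// only the tail-sampling database size is checked, as before.
func shouldRejectWrite(storageLimit, dbSize, diskUsed, diskTotal uint64) bool {
	if storageLimit == 0 {
		// New default: unlimited DB size, but cap disk usage at 90%.
		return float64(diskUsed) >= 0.9*float64(diskTotal)
	}
	// Existing behavior: limit DB size only, no disk usage check.
	return dbSize >= storageLimit
}

func main() {
	fmt.Println(shouldRejectWrite(0, 1<<40, 91, 100))  // disk over 90%: reject
	fmt.Println(shouldRejectWrite(0, 1<<40, 50, 100))  // disk under 90%: allow
	fmt.Println(shouldRejectWrite(100, 150, 99, 100))  // DB over limit: reject
}
```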
Checklist
For functional changes, consider:
How to test these changes
Create a tmpfs with various sizes, and check the logs as the disk usage threshold is hit.
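For the manual test above, a small helper like this can make the threshold easy to watch (the tmpfs mount itself, e.g. `mount -t tmpfs -o size=100m ...`, needs root and is omitted here; `usage_pct` is a hypothetical helper, not part of apm-server):

```shell
# Print a filesystem's used percentage as a bare integer, so it is obvious
# when the 90% disk usage threshold is crossed while filling the tmpfs.
usage_pct() {
    # -P forces POSIX single-line output; field 5 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

usage_pct /
```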
Related issues
Part of #15450
EA-managed apm-server needs elastic/integrations#12543 to change the default.