
[SMP] Increase the DogStatsD default stringInterner size #19882

Closed
wants to merge 3 commits into from

Conversation

blt
Contributor

@blt blt commented Oct 2, 2023

What does this PR do?

This commit increases the default DogStatsD string interner size from 4096 keys to 20_480 keys. Per #19852 we believe that this will benefit Agent memory consumption and allocation pressure while nominally increasing DogStatsD throughput. Details in this comment.

Additional Notes

We will know if this change is specific to a single scenario or more generally applicable after the Regression Detector runs. I will comment with an analysis on this PR once that happens.

Possible Drawbacks / Trade-offs

Given the allocation pressure already present in this portion of the codebase I am unaware of any. The string interner does not pre-allocate its storage -- although #19179 suggests that would be an interesting follow-up -- so smaller target deploys will not see higher baseline memory consumption.

Motivation

See this page for details

REF SMPTNG-12

@blt blt added the changelog/no-changelog and [deprecated] qa/skip-qa labels Oct 2, 2023
@blt blt added this to the 7.50.0 milestone Oct 2, 2023
@blt blt requested a review from a team October 2, 2023 16:52
@blt blt requested a review from a team as a code owner October 2, 2023 16:52
@pr-commenter

pr-commenter bot commented Oct 2, 2023

Bloop Bleep... Dogbot Here

Regression Detector Results

Run ID: 277ece38-0649-4a73-93cc-47bdf7a276c4
Baseline: 9610a26
Comparison: eeb5415
Total datadog-agent CPUs: 7

Explanation

A regression test is an integrated performance test for datadog-agent in a repeatable rig, with varying configuration for datadog-agent. What follows is a statistical summary of a brief datadog-agent run for each configuration across the SHAs given above. The goal of these tests is to determine quickly if datadog-agent performance is changed, and to what degree, by a pull request.

Because a target's optimization goal performance in each experiment will vary somewhat each time it is run, we can only estimate mean differences in optimization goal relative to the baseline target. We express these differences as a percentage change relative to the baseline target, denoted "Δ mean %". These estimates are made to a precision that balances accuracy and cost control. We represent this precision as a 90.00% confidence interval denoted "Δ mean % CI": there is a 90.00% chance that the true value of "Δ mean %" is in that interval.

We decide whether a change in performance is a "regression" -- a change worth investigating further -- if both of the following two criteria are true:

  1. The estimated |Δ mean %| ≥ 5.00%. This criterion intends to answer the question "Does the estimated change in mean optimization goal performance have a meaningful impact on your customers?". We assume that when |Δ mean %| < 5.00%, the impact on your customers is not meaningful. We also assume that a performance change in optimization goal is worth investigating whether it is an increase or decrease, so long as the magnitude of the change is sufficiently large.

  2. Zero is not in the 90.00% confidence interval "Δ mean % CI" about "Δ mean %". This statement is equivalent to saying that there is at least a 90.00% chance that the mean difference in optimization goal is not zero. This criterion intends to answer the question, "Is there a statistically significant difference in mean optimization goal performance?". It also means there is no more than a 10.00% chance this criterion reports a statistically significant difference when the true difference in mean optimization goal is zero -- a "false positive". We assume you are willing to accept a 10.00% chance of inaccurately detecting a change in performance when no true difference exists.

The table below, if present, lists those experiments that have experienced a statistically significant change in mean optimization goal performance between baseline and comparison SHAs with 90.00% confidence OR have been detected as newly erratic. Negative values of "Δ mean %" mean that baseline is faster, whereas positive values of "Δ mean %" mean that comparison is faster. Results that do not exhibit more than a ±5.00% change in their mean optimization goal are discarded. An experiment is erratic if its coefficient of variation is greater than 0.1. The abbreviated table will be omitted if no interesting change is observed.
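Expressed as code, the decision rule described above amounts to roughly the following (a minimal Go sketch with illustrative names; this is not the Regression Detector's actual implementation):

```go
package main

import (
	"fmt"
	"math"
)

// isRegression sketches the two criteria above: the estimated |Δ mean %| must be
// at least 5.00% AND the 90% confidence interval for Δ mean % must not contain zero.
func isRegression(deltaMeanPct, ciLow, ciHigh float64) bool {
	largeEnough := math.Abs(deltaMeanPct) >= 5.0
	ciExcludesZero := ciLow > 0 || ciHigh < 0
	return largeEnough && ciExcludesZero
}

// isErratic sketches the erratic check: coefficient of variation greater than 0.1.
func isErratic(coefficientOfVariation float64) bool {
	return coefficientOfVariation > 0.1
}

func main() {
	// Example row from the table below: Δ mean % = -0.53, CI = [-0.66, -0.40].
	// The interval excludes zero, but the magnitude is under 5%, so it is not flagged.
	fmt.Println(isRegression(-0.53, -0.66, -0.40)) // false
}
```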

No interesting changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%.

Fine details of change detection per experiment.
| experiment | goal | Δ mean % | Δ mean % CI | confidence |
|---|---|---|---|---|
| otel_to_otel_logs | ingress throughput | +1.30 | [-0.29, +2.88] | 82.15% |
| file_tree | egress throughput | +0.77 | [-1.09, +2.63] | 50.49% |
| process_agent_standard_check_with_stats | egress throughput | +0.13 | [-1.87, +2.13] | 8.52% |
| uds_dogstatsd_to_api | ingress throughput | +0.03 | [-0.13, +0.19] | 26.95% |
| trace_agent_json | ingress throughput | +0.02 | [-0.11, +0.14] | 16.48% |
| uds_dogstatsd_to_api_nodist_1KiB | ingress throughput | +0.01 | [-0.46, +0.48] | 3.84% |
| dogstatsd_string_interner_8MiB_50k | ingress throughput | +0.01 | [-0.03, +0.04] | 29.45% |
| dogstatsd_string_interner_8MiB_100k | ingress throughput | +0.01 | [-0.03, +0.05] | 25.25% |
| dogstatsd_string_interner_8MiB_100 | ingress throughput | +0.01 | [-0.12, +0.13] | 6.93% |
| dogstatsd_string_interner_8MiB_10k | ingress throughput | +0.00 | [-0.00, +0.01] | 55.24% |
| dogstatsd_string_interner_64MiB_100 | ingress throughput | +0.00 | [-0.14, +0.14] | 0.08% |
| dogstatsd_string_interner_64MiB_1k | ingress throughput | -0.00 | [-0.13, +0.13] | 0.60% |
| dogstatsd_string_interner_128MiB_100 | ingress throughput | -0.00 | [-0.14, +0.14] | 0.59% |
| tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.06, +0.06] | 3.60% |
| dogstatsd_string_interner_128MiB_1k | ingress throughput | -0.00 | [-0.14, +0.14] | 2.25% |
| idle | egress throughput | -0.01 | [-2.44, +2.42] | 0.44% |
| dogstatsd_string_interner_8MiB_1k | ingress throughput | -0.01 | [-0.11, +0.09] | 16.12% |
| trace_agent_msgpack | ingress throughput | -0.01 | [-0.14, +0.11] | 15.03% |
| uds_dogstatsd_to_api_nodist_nomulti_1KiB | ingress throughput | -0.02 | [-0.51, +0.48] | 4.19% |
| process_agent_standard_check | egress throughput | -0.28 | [-3.79, +3.24] | 10.29% |
| file_to_blackhole | egress throughput | -0.39 | [-0.83, +0.05] | 85.26% |
| process_agent_real_time_mode | egress throughput | -0.39 | [-2.90, +2.11] | 20.38% |
| tcp_syslog_to_blackhole | ingress throughput | -0.53 | [-0.66, -0.40] | 100.00% |

@blt
Contributor Author

blt commented Oct 2, 2023

Observations from the Regression Detector:

  • change is throughput neutral
  • presence of distribution metrics drowns out all other effects
  • even in _nodist_100MiB the CPU percentage drop is notable, -50% utilization

We now have profiles for this run. Results, main vs this PR:

| Experiment | CPU time | Lifetime Allocations | Heap Live Size | Mutex Time |
|---|---|---|---|---|
| uds_dogstatsd_to_api_nodist_100MiB | -27 sec | ~ -52 M | -63Mb | ~ -3 sec |
| uds_dogstatsd_to_api_nodist_200MiB | -25 sec | ~ -65 M | -85Mb | ~ -3 sec |
| uds_dogstatsd_to_api_nodist_nomulti_100MiB | -32 sec | ~ -53 M | +21Mb | ~ -1 sec |
| uds_dogstatsd_to_api_nodist_nomulti_200MiB | -25 sec | ~ -64 M | -85Mb | +12 sec |

Consider _nodist_nomulti_100MiB. Added live heap appears to be present in both this and the _200MiB variant in sendEvents. This is consistent with the increase in throughput in these experiments, below the Regression Detector threshold in relative terms but worth ~5MiB/sec in absolute terms.

In general I consider this change a desirable one. This obsoletes #19852.

@sgnn7 sgnn7 requested a review from a team October 2, 2023 19:30
@blt
Contributor Author

blt commented Oct 3, 2023

Some noted concerns and questions:

  • @scottopell : Will this increase memory consumption in low-throughput scenarios?
  • @GeorgeHahn : Does this work because it is not populating the string intern cache?

Apologies if I muddled the paraphrases. To the first, I think we could introduce a new variant experiment in this PR that does, say, 100KiB/sec throughput, all other settings being the same. It's certainly something we can probe. To the second, we can collect the interner's self-telemetry; I just need to wire that in.

@blt blt force-pushed the string_interner_default_sz branch from 88b6446 to 69fcb49 Compare October 3, 2023 23:30
@scottopell
Contributor

I'm trying to establish an intuition for what factors play into this; I don't think it's as simple as "high bytes per second vs low bytes per second".

The string interner is really about the number of unique strings that are processed (i.e., components of a dogstatsd message).
So in a high-churn environment where new metric names and tags are constantly seen, this is nearly useless. We'll just continue to flush the cache and we get no real benefit.

But "every metric is unique" is not the case for the Agent. We see many tags and metric names repeated, which brings us back to the idea of a metric context.

In a scenario where we have a high number of metric contexts, we correspondingly have a high number of unique strings; we want to re-use existing allocations for all of them, so increasing dogstatsd_string_interner_size is useful.

In a scenario where we have a low number of metric contexts, if the dogstatsd_string_interner_size is too big, then we end up wasting space holding onto strings we may have only seen once and will never see again.
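
To make the size/flush trade-off concrete, here is a minimal sketch of an interner with the flush-when-full behavior described above (illustrative Go; not the Agent's actual implementation or identifiers):

```go
// stringInternerSketch caches at most maxSize unique strings. When the cache is
// full, the entire map is dropped and rebuilt -- the "flush the cache" behavior
// discussed in this thread. Illustrative only.
type stringInternerSketch struct {
	strings map[string]string
	maxSize int
}

func newStringInternerSketch(maxSize int) *stringInternerSketch {
	return &stringInternerSketch{strings: make(map[string]string), maxSize: maxSize}
}

// Intern returns the cached copy of key if present; otherwise it stores a new
// copy, flushing everything first if the cache is at capacity.
func (i *stringInternerSketch) Intern(key []byte) string {
	if s, ok := i.strings[string(key)]; ok {
		return s
	}
	if len(i.strings) >= i.maxSize {
		i.strings = make(map[string]string) // full flush: every cached string is discarded
	}
	s := string(key)
	i.strings[s] = s
	return s
}
```

With a small maxSize and a high-context workload the flush branch runs constantly; with a large maxSize and a low-context workload, rarely seen strings can sit in the map for the life of the process. That is the trade-off under discussion.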

I'm leaning towards leaving this default alone and maybe updating our docs for high throughput dogstatsd to mention this as a good thing to tune.

Longer/medium term it would be great to improve the tunables here, as it's rather hard to say what a good size would be for a given workload. Or make it automatic so nobody ever has to tune this (though I have no idea how we'd do that).

Sorry for the big wall of text, but I lean towards a docs change vs a default change.

@blt blt force-pushed the string_interner_default_sz branch from 69fcb49 to c9359b2 Compare October 6, 2023 13:51
@blt
Contributor Author

blt commented Oct 6, 2023

Fundamentally, I do not believe that our customers should need to be domain experts in the internals of our Agent to get acceptable performance. I believe that documenting toggles is a kinder way of pushing demand for expertise on customers, but it's still pushing that demand. If a DogStatsD-heavy customer is not using distribution metrics, this is the bottleneck and memory allocation center they'll hit even for what I would argue is a moderate number of contexts; see below.

I'm trying to establish an intuition for what factors play into this; I don't think it's as simple as "high bytes per second vs low bytes per second".

It's not, no. It's also not as simple as contexts per second. If we consider the distribution problem, Agent performance is clearly contingent on throughput, on the distribution of input through time -- smooth, spiky, etc. -- and also contingent on factors that impact the internals of a component (contexts, here). That pattern is repeated here. I strongly believe that optimizing the Agent to remove allocations per second is the major heuristic available to us today; these results are driven entirely by following that heuristic. Allocating and then freeing memory is not cheap.

The string interner is really about the number of unique strings that are processed (i.e., components of a dogstatsd message). So in a high-churn environment where new metric names and tags are constantly seen, this is nearly useless. We'll just continue to flush the cache and we get no real benefit.

But "every metric is unique" is not the case for the Agent. We see many tags and metric names repeated, which brings us back to the idea of a metric context.

These experiments are not exercising unique metric names. The maximum number of contexts is 10k; the minimum is 1,000. I do agree that, as of this writing, we lack -- as a project -- a notion of what a reasonable number of contexts for a single Agent to handle is, which makes it hard to frame 'reasonable' here.

In a scenario where we have a high number of metric contexts, we correspondingly have a high number of unique strings; we want to re-use existing allocations for all of them, so increasing dogstatsd_string_interner_size is useful.

In a scenario where we have a low number of metric contexts, if the dogstatsd_string_interner_size is too big, then we end up wasting space holding onto strings we may have only seen once and will never see again.

This is not correct. If we have a low number of input contexts through the lifetime of the Agent, the interner will size up proportionally to that number. We are not pre-allocating storage. The worst behavior here is what happens when we overwhelm the cache. As of this PR we continue to dump the whole storage, reallocate it, and then restart the cache. It is true that we do not clear the cache without dumping the whole thing.

I have added experiments to this PR to probe low-throughput, same-context situations. I've re-triggered it with a rebase from main so we can speak from more up-to-date data. If you would like to see other parameters changed, do let me know.

I'm leaning towards leaving this default alone and maybe updating our docs for high throughput dogstatsd to mention this as a good thing to tune.

Longer/medium term it would be great to improve the tunables here, as it's rather hard to say what a good size would be for a given workload. Or make it automatic so nobody ever has to tune this (though I have no idea how we'd do that).

That is this work. I believe a non-trivial improvement is possible here in the near term. As mentioned, I believe we might also profitably explore re-introduction of #19179 as a follow-up, but more importantly we really should not be invalidating the whole cache in one shot when it's full. We could explore lumping more work into this PR if our follow-up discussion from new data still leaves concerns.

Sorry for the big wall of text, but I lean towards a docs change vs a default change.

@scottopell
Contributor

scottopell commented Oct 6, 2023

Fundamentally, I do not believe that our customers should need to be domain experts in the internals of our Agent to get acceptable performance. I believe that documenting toggles is a kinder way of pushing demand for expertise on customers, but it's still pushing that demand.

I agree; this is a great goal. I'll pick one nit, which is that we're talking about getting acceptable performance under an abnormally high workload.

This is not correct. If we have a low number of input contexts through the lifetime of the Agent, the interner will size up proportionally to that number. We are not pre-allocating storage. The worst behavior here is what happens when we overwhelm the cache. As of this PR we continue to dump the whole storage, reallocate it, and then restart the cache. It is true that we do not clear the cache without dumping the whole thing.

I understand that we don't pre-allocate. When I say "we end up wasting space and holding onto strings that we may have only seen once and will never see again", I'm referring to a scenario where there is some application and/or codepath that is hit early in the Agent's lifecycle which emits some large strings as part of a dogstatsd payload.
Those metrics never get sent again, and therefore the space they occupy in the string interner cache is wasted, and because they're in the interner, they never get GC'd.

This risk is always present with low-context workloads, but the risk is higher with a larger cache size, which is why I don't support this as a new default value.

That is this work. I believe a non-trivial improvement is possible here in the near term. As mentioned, I believe we might also profitably explore re-introduction of #19179 as a follow-up, but more importantly we really should not be invalidating the whole cache in one shot when it's full. We could explore lumping more work into this PR if our follow-up discussion from new data still leaves concerns.

I think this was poor phrasing on my part. When I say:

Longer/medium term it would be great to improve the tunables here

I'm referring to what "tunable"/"variable" is available to be changed. Right now we only have a variable that says "size of string cache", but as this discussion highlights, it's hard to reason about the impact of changing this.

If we want to improve the string interner for high-context/throughput workloads, we should be making changes to the way the string interner works, not just bumping up the default.

@blt
Contributor Author

blt commented Oct 6, 2023

It doesn't look like the pr-commenter ran. 0b810c87-3bd9-4b9a-b4e1-aad012cd9639 is the most recent job-id. CI results: here.

Results: https://app.datadoghq.com/notebook/6635384/datadog-agent-string-interner-optimization?range=575458&tpl_var_experiment%5B0%5D=uds_dogstatsd_to_api_nodist_nomulti_100mib&tpl_var_experiment%5B1%5D=uds_dogstatsd_to_api_nodist_nomulti_1kib&tpl_var_experiment%5B2%5D=uds_dogstatsd_to_api_nodist_nomulti_200mib&tpl_var_run_id=0b810c87-3bd9-4b9a-b4e1-aad012cd9639&view=view-mode&start=1696607177068&live=false

Fundamentally, I do not believe that our customers should need to be domain experts in the internals of our Agent to get acceptable performance. I believe that documenting toggles is a kinder way of pushing demand for expertise on customers, but it's still pushing that demand.

I agree; this is a great goal. I'll pick one nit, which is that we're talking about getting acceptable performance under an abnormally high workload.

For context, the Workload Checks experiment app_server targets 32MiB/s and demonstrates a similar bottleneck, once the distribution issue is removed. We could run an exact copy of that experiment to reconfirm if you'd like but I'd argue the setup here is less abnormal than it might seem. That said, I agree with the concern of not wanting to pessimize low-throughput use cases and see that as important.

This is not correct. If we have a low number of input contexts through the lifetime of the Agent, the interner will size up proportionally to that number. We are not pre-allocating storage. The worst behavior here is what happens when we overwhelm the cache. As of this PR we continue to dump the whole storage, reallocate it, and then restart the cache. It is true that we do not clear the cache without dumping the whole thing.

I understand that we don't pre-allocate. When I say "we end up wasting space and holding onto strings that we may have only seen once and will never see again", I'm referring to a scenario where there is some application and/or codepath that is hit early in the Agent's lifecycle which emits some large strings as part of a dogstatsd payload. Those metrics never get sent again, and therefore the space they occupy in the string interner cache is wasted, and because they're in the interner, they never get GC'd.

Understood, but that problem is not addressed with the project's existing default either.

Consider also that the memory consumption under discussion is small, on the order of 2.6MiB in the high-end throughput experiments. The existing default does indicate an allocation-led throughput bottleneck across a range of throughputs. My argument for raising the default is that we end up in a no-worse state, set ourselves up for further improvements, and address an existing bottleneck for very little effort.

I certainly think that eviction of the cache needs to be revisited. I also think that the cache is far too small by default.

This risk is always present with low-context workloads, but the risk is higher with a larger cache size, which is why I don't support this as a new default value.

I think we should be clear about what constitutes low/high contexts, else we might be thinking about numbers that are way, way off. Context data from the experiments. (Note this indicates the context filter in lading is buggy, or potentially surprising if throughput is low. Something to investigate.) Which is to say, if you'd toss some numbers out I'd consider it a favor and I'll get the experiments up and running for them.

Some thoughts on the risk being higher, noting the experimental numbers above:

  • In the 1KiB case we see no difference in the bytes held in the interner between baseline and comparison.
  • In 100MiB and 200MiB cases we see most replicates thrash around 0.6MiB with two replicates holding steady at 2.62MiB.
  • Even with the higher default we spend a lot of time resetting the cache.

It seems to me that:

  • The amount of memory under discussion worst-case is small relative to the total amount consumed by Agent at runtime,
  • the behavior of low-throughput users is not pessimized compared to baseline,
  • the proposed default improves the behavior of higher-throughput users and
  • the reset counter for 100MiB/200MiB cases suggests that resets happen multiple times per second as-is.

This suggests to me that the proposed risk in cases where it does happen in practice:

  • exists in main today,
  • is a rounding error amount of memory and
  • will not happen in practice for high-throughput clients which demonstrably reset the cache frequently.

Further, correctly setting the tunable flag requires:

  • a detailed understanding of Agent internals and of one's workload, and
  • potentially access to a profiled view of the live Agent.

That is this work. I believe a non-trivial improvement is possible here in the near term. As mentioned, I believe we might also profitably explore re-introduction of #19179 as a follow-up, but more importantly we really should not be invalidating the whole cache in one shot when it's full. We could explore lumping more work into this PR if our follow-up discussion from new data still leaves concerns.

I think this was poor phrasing on my part. When I say:

Longer/medium term it would be great to improve the tunables here

I'm referring to what "tunable"/"variable" is available to be changed. Right now we only have a variable that says "size of string cache", but as this discussion highlights, it's hard to reason about the impact of changing this.

If we want to improve the string interner for high-context/throughput workloads, we should be making changes to the way the string interner works, not just bumping up the default.

To be clear, I would like to follow this work up with:

  • a PR to pre-allocate the storage for the cache (potentially undesirable for reasons discussed above)
  • a PR to introduce FIFO eviction to the cache to avoid flushing the whole thing, since resetting is such a frequent operation.

This work is a good, cheap win -- that is my argument -- and we can follow up with improvements on the low end of things too.
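
On the FIFO-eviction follow-up, a minimal sketch of the direction (illustrative Go, assuming a simple insertion-order queue; not a committed design):

```go
// fifoInternerSketch evicts only the oldest entry when full, instead of
// resetting the whole cache as the current interner does. Illustrative only.
type fifoInternerSketch struct {
	strings map[string]string
	order   []string // insertion order; the oldest entry is order[0]
	maxSize int
}

func newFIFOInternerSketch(maxSize int) *fifoInternerSketch {
	return &fifoInternerSketch{strings: make(map[string]string, maxSize), maxSize: maxSize}
}

func (i *fifoInternerSketch) Intern(key []byte) string {
	if s, ok := i.strings[string(key)]; ok {
		return s
	}
	if len(i.strings) >= i.maxSize {
		// Evict a single entry rather than flushing everything.
		oldest := i.order[0]
		i.order = i.order[1:]
		delete(i.strings, oldest)
	}
	s := string(key)
	i.strings[s] = s
	i.order = append(i.order, s)
	return s
}
```

A real implementation would also need to keep the interner's self-telemetry accurate and avoid leaking the backing array behind the order slice, but the shape of the change is small.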

@scottopell
Contributor

#19990 related

blt added a commit that referenced this pull request Oct 10, 2023
This commit extracts experiments from #19882 and #19990, expanding the
throughput parameter across the range of concern. Note that at the lower end of
the throughput range the full contexts will not be explored, meaning that
throughput per second is the dominant metric here. Contexts per second we will
need to derive from captured target telemetry.

Intended to back ongoing experiments with the stringInterner.

REF SMPTNG-12

Signed-off-by: Brian L. Troutwine <[email protected]>
blt added a commit that referenced this pull request Oct 10, 2023
This commit extracts experiments from #19882 and #19990, expanding the
throughput parameter across the range of concern. Note that at the lower end of
the throughput range the full contexts will not be explored, meaning that
throughput per second is the dominant metric here. Contexts per second we will
need to derive from captured target telemetry.

Intended to back ongoing experiments with the stringInterner.

REF SMPTNG-12

Signed-off-by: Brian L. Troutwine <[email protected]>
blt added a commit that referenced this pull request Oct 11, 2023
This commit extracts experiments from #19882 and #19990, expanding the
throughput parameter across the range of concern. Note that at the lower end of
the throughput range the full contexts will not be explored, meaning that
throughput per second is the dominant metric here. Contexts per second we will
need to derive from captured target telemetry.

Intended to back ongoing experiments with the stringInterner.

REF SMPTNG-12

Signed-off-by: Brian L. Troutwine <[email protected]>
@blt blt force-pushed the string_interner_default_sz branch from c9359b2 to 55a4897 Compare October 11, 2023 02:01
@blt
Contributor Author

blt commented Oct 11, 2023

See #19990 (comment)

@blt blt force-pushed the string_interner_default_sz branch from 55a4897 to 965a4f4 Compare October 12, 2023 15:30
blt added a commit that referenced this pull request Oct 18, 2023
This commit replaces the _nodist variants of the `uds_dogstatsd_to_api`
experiment with a set of parameterized string interner experiments, varying both
total contexts and throughput along two scales. The goal here is to understand
how, at 8MiB/sec throughput, the variation of contexts impacts the
stringInterner hotspot and along throughput to achieve coverage on a small range
of contexts.

REF SMPTNG-12
REF #19852
REF #19882
REF #19990

Signed-off-by: Brian L. Troutwine <[email protected]>
blt added a commit that referenced this pull request Oct 18, 2023
* Add experiments focused on dogstatsd's string interner

This commit replaces the _nodist variants of the `uds_dogstatsd_to_api`
experiment with a set of parameterized string interner experiments, varying both
total contexts and throughput along two scales. The goal here is to understand
how, at 8MiB/sec throughput, the variation of contexts impacts the
stringInterner hotspot and along throughput to achieve coverage on a small range
of contexts.

REF SMPTNG-12
REF #19852
REF #19882
REF #19990

Signed-off-by: Brian L. Troutwine <[email protected]>

* correct target metrics path for Agent

Signed-off-by: Brian L. Troutwine <[email protected]>

---------

Signed-off-by: Brian L. Troutwine <[email protected]>
@blt blt force-pushed the string_interner_default_sz branch from 965a4f4 to 5672d60 Compare October 18, 2023 23:36
@blt blt force-pushed the string_interner_default_sz branch from 5672d60 to 76a992f Compare October 23, 2023 17:39
blt added 3 commits October 30, 2023 16:07
This commit increases the default DogStatsD string interner size from 4096 keys
to 20_480 keys. Per #19852 we believe that this will benefit Agent memory
consumption and allocation pressure while nominally increasing DogStatsD
throughput. Details [in this
comment](#19852 (comment)). We
will know if this change is specific to a single scenario or more generally
applicable after the Regression Detector runs. I will comment with an analysis
on this PR once that happens.

REF SMPTNG-12

Signed-off-by: Brian L. Troutwine <[email protected]>
Signed-off-by: Brian L. Troutwine <[email protected]>
@blt blt force-pushed the string_interner_default_sz branch from 76a992f to eeb5415 Compare October 30, 2023 23:07
@kacper-murzyn kacper-murzyn modified the milestones: 7.50.0, 7.51.0 Nov 11, 2023
@blt
Contributor Author

blt commented Nov 28, 2023

Supplanted by #20943

@blt blt closed this Nov 28, 2023
@dd-devflow dd-devflow bot deleted the string_interner_default_sz branch May 1, 2024 00:00