[AMLII-2019] Max samples per context for Histogram, Distribution and Timing metrics (Experimental Feature) #863
base: master
Conversation
* add buffered_metrics object type
* update metric_types to include histogram, distribution, timing
* Run tests on any branch
…py into add-extended-aggregation
def should_sample(self, rate):
    """Determine if a sample should be kept based on the specified rate."""
    with self.random_lock:
        return self.random.random() < rate
🔴 Code Vulnerability
do not use random
Make sure to use values that are actually random. The random module in Python should generally not be used and should be replaced with the secrets module, as noted in the official Python documentation.
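For rate-based metric sampling the random module is arguably not a security concern, but if a cryptographically secure source is preferred as the scanner suggests, random.SystemRandom offers a drop-in interface backed by os.urandom. A minimal sketch, with should_sample written as a standalone stand-in for the flagged method:

```python
import random

# SystemRandom draws from os.urandom, so no seeding/locking of the
# shared Mersenne Twister state is involved.
_sys_random = random.SystemRandom()

def should_sample(rate):
    """Keep a sample with probability `rate` using an OS-backed RNG."""
    return _sys_random.random() < rate
```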
…are just buffered
… for Histogram, Distribution and Timing metrics
datadog/dogstatsd/base.py
Outdated
@@ -268,6 +274,10 @@ def __init__(
            depending on the connection type.
        :type max_buffer_len: integer

        :param max_metric_samples: Maximum number of metric samples for buffered
            metrics (Histogram, Distribution, Timing)
        :type max_metric_samples: integer
This param seems to be an older version of the per_context one; I don't think it's actually defined.
sampled_metrics = self.aggregator.flush_aggregated_sampled_metrics()
if not self._enabled:
    return
for m in sampled_metrics:
Nit: While this is definitely fully functional, I would be a bit more comfortable if _report was modified to allow the option to not internally sample. That way we only have one location to worry about changes to "are we disabled", "how do we handle constant tags", "how do we handle telemetry", etc.
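The reviewer's suggestion could look roughly like the toy sketch below: _report grows a sampling flag so pre-sampled aggregated metrics bypass the internal rate check while still flowing through the shared enabled/tags/telemetry path. StatsdSketch and its method signatures are hypothetical stand-ins, not dogstatsd's actual code.

```python
import random

class StatsdSketch:
    """Toy stand-in for the client, illustrating a sampling opt-out."""
    def __init__(self):
        self._enabled = True
        self.sent = []

    def _report(self, metric, value, sample_rate=1.0, sampling=True):
        if not self._enabled:
            return
        # Only apply client-side sampling when the caller asks for it;
        # aggregated sampled metrics were already down-sampled upstream.
        if sampling and sample_rate != 1.0 and random.random() > sample_rate:
            return
        self.sent.append((metric, value, sample_rate))

    def flush_sampled(self, sampled_metrics):
        for name, value, rate in sampled_metrics:
            self._report(name, value, sample_rate=rate, sampling=False)
```

With sampling=False, even a rate of 0.0 is forwarded, since the rate is only reported for server-side scaling.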
self.metric_type = metric_type
self.max_metric_samples = max_metric_samples
self.specified_rate = specified_rate
self.data = []
Likely makes sense to pre-allocate data in the case of non-zero max_metric_samples
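A minimal sketch of the pre-allocation suggestion, using a simplified stand-in for the PR's metric class (the attribute names mirror the diff; the class itself is hypothetical):

```python
class MaxSampleMetric:
    """Simplified metric buffer that pre-allocates its sample storage."""
    def __init__(self, max_metric_samples=0):
        self.max_metric_samples = max_metric_samples
        # Pre-allocate fixed-size storage when a cap is configured, so the
        # reservoir replacement path never has to grow the list; fall back
        # to a growable list when there is no cap (max_metric_samples == 0).
        if max_metric_samples > 0:
            self.data = [None] * max_metric_samples
        else:
            self.data = []
        self.stored_metric_samples = 0
```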
self.total_metric_samples += 1

def maybe_keep_sample(self, value):
    if self.max_metric_samples > 0:
All or most of this function will need lock protections, a number of internal variables like stored metrics and the data array are potentially being updated by multiple threads.
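One way the locking could look, sketched against a simplified stand-in class (names mirror the diff, but this is not the PR's code): the counters and the data list are only touched while holding a per-metric lock.

```python
import random
import threading

class MaxSampleMetric:
    """Simplified metric buffer whose shared state is lock-protected."""
    def __init__(self, max_metric_samples=0):
        self.lock = threading.Lock()
        self.max_metric_samples = max_metric_samples
        self.data = []
        self.stored_metric_samples = 0
        self.total_metric_samples = 0

    def maybe_keep_sample(self, value):
        # Guard data and both counters: multiple threads may sample at once.
        with self.lock:
            self.total_metric_samples += 1
            if (self.max_metric_samples > 0
                    and self.stored_metric_samples >= self.max_metric_samples):
                # Reservoir-style replacement once the buffer is full.
                i = random.randint(0, self.total_metric_samples - 1)
                if i < self.max_metric_samples:
                    self.data[i] = value
            else:
                self.data.append(value)
                self.stored_metric_samples += 1
```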
def maybe_keep_sample(self, value):
    if self.max_metric_samples > 0:
        if self.stored_metric_samples >= self.max_metric_samples:
            i = random.randint(0, self.total_metric_samples - 1)
This looks like an off-by-one - if max_metric_samples is five, then the sixth sample should have a chance to not be persisted to the data array. It looks like we're persisting it 100% of the time here.
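The fix the reviewer describes is the classic reservoir sampling step: once the buffer is full, the incoming sample is kept only with probability max_samples / total_seen. A standalone sketch (hypothetical helper, not the PR's code):

```python
import random

def reservoir_add(data, max_samples, total_seen, value):
    """One reservoir-sampling step; total_seen already counts `value`.

    E.g. with max_samples=5, the sixth sample survives only 5/6 of the
    time, replacing a uniformly chosen existing slot when it does.
    """
    if len(data) < max_samples:
        data.append(value)
    else:
        i = random.randint(0, total_seen - 1)
        if i < max_samples:      # the missing bound check
            data[i] = value      # otherwise the new sample is dropped
    return data
```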
if self.specified_rate != 1.0:
    rate = self.specified_rate
else:
    rate = self.stored_metric_samples / total_metric_samples
With the current structure it is highly likely that we can always use this value, pending confirmation that we can safely increment all skip_samples behind a lock without too heavy a performance hit.
metrics = []
"""Flush the metrics and reset the stored values."""
with self.lock:
    copiedValues = self.values.copy()
It doesn't look to me like the copy and the clear are necessary - the subsequent assignment of values to an empty dict should be generally equivalent to the clear for our use case, and without a clear call there's no need to shallow copy the dict.
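The dict-swap the reviewer alludes to can be sketched like this (simplified stand-in Aggregator, not the PR's class): swapping a fresh dict in under the lock hands the old one exclusively to the flushing thread, so neither copy() nor clear() is needed.

```python
import threading

class Aggregator:
    """Simplified aggregator illustrating flush-by-swap."""
    def __init__(self):
        self.lock = threading.Lock()
        self.values = {}

    def flush_aggregated_sampled_metrics(self):
        with self.lock:
            # Swap in an empty dict; the old one is now private to us.
            flushed, self.values = self.values, {}
        # No copy/clear: nothing else can reach `flushed` anymore.
        return list(flushed.values())
```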
for _, metric in copiedValues.items():
    metrics.append(metric.flush())

self.nb_context += len(copiedValues)
nb_context doesn't seem to be used
# Create a new metric if it doesn't exist
self.values[context_key] = self.max_sample_metric_type(name, tags, rate, max_samples_per_context)
metric = self.values[context_key]
if keeping_sample:
It looks like we're likely racing the flush to persist this correctly. Operating flow:
1. A thread in sample pulls a metric from the values dict and gets pre-empted after dropping the metric_context's lock.
2. A separate thread enters flush, pulls the metric from values and removes further reference to it, then flushes the metric.
3. The original thread wakes back up and continues operating on the metric object it had pulled, unaware that the metric has already been flushed; any additional samples added will be ignored.
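One possible way to close that window, sketched with hypothetical stand-in classes (not the PR's code): a metric marks itself flushed under its own lock, and the sampling thread retries against the values dict if it lost the race.

```python
import threading

class SampleMetric:
    def __init__(self):
        self.lock = threading.Lock()
        self.flushed = False
        self.data = []

class Aggregator:
    """Simplified aggregator whose sample path retries after a lost race."""
    def __init__(self):
        self.metrics_lock = threading.Lock()
        self.values = {}

    def sample(self, key, value):
        while True:
            with self.metrics_lock:
                metric = self.values.setdefault(key, SampleMetric())
            with metric.lock:
                if not metric.flushed:
                    metric.data.append(value)
                    return
            # Metric was flushed between the two locks: loop and get a
            # fresh one from the values dict.

    def flush(self):
        with self.metrics_lock:
            flushed, self.values = self.values, {}
        for metric in flushed.values():
            with metric.lock:
                metric.flushed = True    # late samplers will retry
        return flushed
```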
if sample_rate != 1 and random() > sample_rate:
    return
if sample_rate != 1 and random() > sample_rate:
🔴 Code Vulnerability
do not use random
Make sure to use values that are actually random. The random module in Python should generally not be used and should be replaced with the secrets module, as noted in the official Python documentation.
Requirements for Contributing to this repository
What does this PR do?
This experimental feature allows the user to limit the number of samples per context for histogram, distribution, and timing metrics.
This can be enabled with the statsd_max_samples_per_context flag. When enabled, up to n samples will be kept per context for Histogram, Distribution, and Timing metrics when aggregation is enabled. The default value is 0, which means no limit.
This is already implemented for the Go client: Go Client Docs
Description of the Change
Verification Process
For local testing, follow the steps here to set up local testing for the Python client.
Replace testapp/main.py with
Additional Notes
Release Notes
Review checklist (to be filled by reviewers)
- changelog/ label attached. If applicable it should have the backward-incompatible label attached.
- do-not-merge/ label attached.
- kind/ and severity/ labels attached at least.