
[AMLII-2019] Max samples per context for Histogram, Distribution and Timing metrics (Experimental Feature) #863

Open
wants to merge 77 commits into base: master

Conversation

@andrewqian2001datadog (Contributor) commented Oct 28, 2024

Requirements for Contributing to this repository

  • Fill out the template below. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion.
  • The pull request must only fix one issue, or add one feature, at a time.
  • The pull request must update the test suite to demonstrate the changed functionality.
  • After you create the pull request, all status checks must pass before a maintainer reviews your contribution. For more details, please see CONTRIBUTING.

What does this PR do?

This experimental feature allows the user to limit the number of samples per context for histogram, distribution, and timing metrics.

This can be enabled with the statsd_max_samples_per_context flag. When enabled, up to n samples are kept per context for Histogram, Distribution, and Timing metrics while aggregation is enabled. The default value is 0, which means no limit.

This is already implemented for the Go client. Go Client Docs

Description of the Change

Verification Process

For local testing, follow the steps here to set up local testing for the Python client.

Replace testapp/main.py with:

from datadog import initialize, statsd
import time

options = {
    "statsd_host": "127.0.0.1",
    "statsd_port": 8125,
    "statsd_disable_buffering": True,
    "statsd_disable_aggregation": False,
    "statsd_max_samples_per_context": 1,
    "statsd_aggregation_flush_interval": 15,
}

initialize(**options)

x = 0
name = "andrew_q_maxsample2"
sleep_time = 3
while True:
    print("-------------------------------------------------")
    print("running :)", x)
    statsd.histogram(name, 1)
    time.sleep(sleep_time)
    x += 1

Additional Notes

Release Notes

Review checklist (to be filled by reviewers)

  • Feature or bug fix MUST have appropriate tests (unit, integration, etc...)
  • PR title must be written as a CHANGELOG entry (see why)
  • Files changed must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have one changelog/ label attached. If applicable, it should have the backward-incompatible label attached.
  • PR should not have a do-not-merge/ label attached.
  • If applicable, the issue must have at least kind/ and severity/ labels attached.

* add buffered_metrics object type

* update metric_types to include histogram, distribution, timing

* Run tests on any branch
@andrewqian2001datadog andrewqian2001datadog self-assigned this Oct 28, 2024
def should_sample(self, rate):
    """Determine if a sample should be kept based on the specified rate."""
    with self.random_lock:
        return self.random.random() < rate


🔴 Code Vulnerability

do not use random

Make sure to use values that are actually random. The random module in Python should generally not be used and replaced with the secrets module, as noted in the official Python documentation.


@github-actions github-actions bot added the stale Stale - Bot reminder label Nov 29, 2024
@andrewqian2001datadog andrewqian2001datadog removed the stale Stale - Bot reminder label Dec 4, 2024
@DataDog DataDog deleted a comment from github-actions bot Dec 10, 2024
@andrewqian2001datadog andrewqian2001datadog changed the title Max samples per context for Histogram, Distribution and Timing metrics Max samples per context for Histogram, Distribution and Timing metrics (Experimental Feature) Jan 14, 2025
@andrewqian2001datadog andrewqian2001datadog changed the title Max samples per context for Histogram, Distribution and Timing metrics (Experimental Feature) [AMLII-2019] Max samples per context for Histogram, Distribution and Timing metrics (Experimental Feature) Jan 14, 2025
@@ -268,6 +274,10 @@ def __init__(
            depending on the connection type.
        :type max_buffer_len: integer

        :param max_metric_samples: Maximum number of metric samples for buffered
            metrics (Histogram, Distribution, Timing)
        :type max_metric_samples: integer
Contributor:
This param seems to be an older version of the per_context one, I don't think it's actually defined

sampled_metrics = self.aggregator.flush_aggregated_sampled_metrics()
if not self._enabled:
    return
for m in sampled_metrics:
Contributor:

Nit: While this is definitely fully functional, I would be a bit more comfortable if _report was modified to allow the option to not internally sample. That way we only have one location to worry about changes to "are we disabled", "how do we handle constant tags", "how do we handle telemetry", etc.
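The suggestion above could look roughly like the sketch below (the `internal_sample` parameter name is hypothetical, not the PR's actual code): `_report` keeps the disabled/tag/telemetry handling in one place, and the aggregated-sample path calls it with client-side sampling turned off, since those samples were already selected upstream.

```python
import random

class Client:
    """Minimal stand-in for the statsd client, for illustration only."""

    def __init__(self):
        self._enabled = True
        self.sent = []

    def _report(self, metric, value, sample_rate=1, internal_sample=True):
        # Single place for "are we disabled" and similar checks.
        if not self._enabled:
            return
        # The aggregated-sample path passes internal_sample=False so
        # already-sampled values are never dropped a second time.
        if internal_sample and sample_rate != 1 and random.random() > sample_rate:
            return
        self.sent.append((metric, value, sample_rate))
```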

self.metric_type = metric_type
self.max_metric_samples = max_metric_samples
self.specified_rate = specified_rate
self.data = []
Contributor:
Likely makes sense to pre-allocate data in the case of non-zero max_metric_samples

self.total_metric_samples += 1

def maybe_keep_sample(self, value):
    if self.max_metric_samples > 0:
Contributor:
All or most of this function will need lock protections, a number of internal variables like stored metrics and the data array are potentially being updated by multiple threads.
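One way to address this (a sketch under hypothetical names, not the PR's code) is to mutate the counters and the data list together behind a single lock, so concurrent callers cannot interleave between the count update and the store:

```python
import random
import threading

class LockedReservoir:
    """Reservoir of at most max_metric_samples values, safe across threads."""

    def __init__(self, max_metric_samples):
        self.max_metric_samples = max_metric_samples  # 0 means no limit
        self.stored_metric_samples = 0
        self.total_metric_samples = 0
        self.data = []
        self._lock = threading.Lock()

    def maybe_keep_sample(self, value):
        # Counters and data mutate together, atomically.
        with self._lock:
            self.total_metric_samples += 1
            if (self.max_metric_samples <= 0
                    or self.stored_metric_samples < self.max_metric_samples):
                self.data.append(value)
                self.stored_metric_samples += 1
                return
            # Replace a stored sample with probability max / total.
            i = random.randrange(self.total_metric_samples)
            if i < self.max_metric_samples:
                self.data[i] = value
```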

def maybe_keep_sample(self, value):
    if self.max_metric_samples > 0:
        if self.stored_metric_samples >= self.max_metric_samples:
            i = random.randint(0, self.total_metric_samples - 1)
Contributor:
This looks like an off-by-one - if max_metric_samples is five, then the sixth sample should have a chance to not be persisted to the data array. It looks like we're persisting it 100% of the time here.
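For reference, the classic reservoir-sampling step gives the n-th sample only a max/n chance of being stored, including the first sample past the cap. A minimal sketch (a hypothetical free function, not the PR's code):

```python
import random

def reservoir_step(data, max_samples, total_seen, value):
    """Decide whether value (the total_seen-th sample) is kept in data.

    Every sample, including those past the cap, ends up in data with
    probability max_samples / total_seen.
    """
    if len(data) < max_samples:
        data.append(value)
    else:
        i = random.randrange(total_seen)  # 0 .. total_seen - 1 inclusive
        if i < max_samples:               # true with probability max/total
            data[i] = value
```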

if self.specified_rate != 1.0:
    rate = self.specified_rate
else:
    rate = self.stored_metric_samples / total_metric_samples
Contributor @ddrthall commented Jan 15, 2025:
With the current structure it is highly likely that we can always use this value, pending confirmation that we can safely increment all skip_samples behind a lock without too heavy a performance hit.

metrics = []
"""Flush the metrics and reset the stored values."""
with self.lock:
    copiedValues = self.values.copy()
Contributor:
It doesn't look to me like the copy and the clear are necessary - the subsequent assignment of values to an empty dict should be generally equivalent to the clear for our use case, and without a clear call there's no need to shallow copy the dict.
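The suggested simplification amounts to rebinding the dict under the lock and iterating the detached one afterwards; no copy() or clear() is needed, since the old dict then belongs exclusively to the flushing thread. A sketch with hypothetical names:

```python
import threading

class MetricStore:
    """Illustrative store: flush detaches the whole dict instead of copying it."""

    def __init__(self):
        self.values = {}
        self.lock = threading.Lock()

    def flush(self):
        # Swap under the lock; samplers start filling a fresh dict while
        # this thread alone holds the old one.
        with self.lock:
            flushed, self.values = self.values, {}
        return list(flushed.values())
```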

for _, metric in copiedValues.items():
    metrics.append(metric.flush())

self.nb_context += len(copiedValues)
Contributor:
nb_context doesn't seem to be used

# Create a new metric if it doesn't exist
self.values[context_key] = self.max_sample_metric_type(name, tags, rate, max_samples_per_context)
metric = self.values[context_key]
if keeping_sample:
Contributor:
It looks like we're likely racing the flush to persist this correctly. Operating flow:
  1. A thread in sample pulls a metric from the values dict and gets pre-empted after dropping the metric_context's lock.
  2. A separate thread enters flush, pulls the metric from values, removes further reference to it, then flushes the metric.
  3. The original thread wakes back up and continues operating on the metric object it had pulled, unaware that the metric has already been flushed; any additional samples added will be ignored.
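One way to close a race of this shape (a sketch under hypothetical names, not necessarily the fix the PR takes) is to do the lookup and the sample under the same lock that flush() uses to detach the dict, so a flushed metric can never receive late samples:

```python
import threading

class MaxSampleStore:
    """Illustrative store: sampling and flushing share one lock."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def sample(self, context_key, value):
        # Lookup/create and append happen atomically with respect to flush(),
        # so a metric cannot be detached between the two steps.
        with self._lock:
            self._values.setdefault(context_key, []).append(value)

    def flush(self):
        # Detach the whole dict under the lock; samplers start a fresh one.
        with self._lock:
            flushed, self._values = self._values, {}
        return flushed
```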


if sample_rate != 1 and random() > sample_rate:
    return

🔴 Code Vulnerability

do not use random

Make sure to use values that are actually random. The random module in Python should generally not be used and replaced with the secrets module, as noted in the official Python documentation.


Labels
changelog/Added Added features results into a minor version bump
3 participants