
[AMLII-2019] Max samples per context for Histogram, Distribution and Timing metrics (Experimental Feature) #863

Open
wants to merge 77 commits into base: master

Conversation

@andrewqian2001datadog (Contributor) commented Oct 28, 2024

Requirements for Contributing to this repository

  • Fill out the template below. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion.
  • The pull request must only fix one issue, or add one feature, at a time.
  • The pull request must update the test suite to demonstrate the changed functionality.
  • After you create the pull request, all status checks must pass before a maintainer reviews your contribution. For more details, please see CONTRIBUTING.

What does this PR do?

This experimental feature allows the user to limit the number of samples per context for histogram, distribution, and timing metrics.

This can be enabled with the statsd_max_samples_per_context flag. When enabled, up to n samples are kept per context for Histogram, Distribution, and Timing metrics while aggregation is enabled. The default value is 0, which means no limit.

This is already implemented for the Go client. Go Client Docs

Description of the Change

Verification Process

For local testing, follow the steps here to set up local testing for the Python client.

Replace testapp/main.py with:

from datadog import initialize, statsd
import time

options = {
    "statsd_host": "127.0.0.1",
    "statsd_port": 8125,
    "statsd_disable_buffering": True,
    "statsd_disable_aggregation": False,
    "statsd_max_samples_per_context": 1,
    "statsd_aggregation_flush_interval": 15,
}

initialize(**options)

x = 0
name = "andrew_q_maxsample2"
sleep_time = 3
while True:
    print("-------------------------------------------------")
    print("running :)", x)
    statsd.histogram(name, 1)
    time.sleep(sleep_time)
    x += 1

Additional Notes

Release Notes

Review checklist (to be filled by reviewers)

  • Feature or bug fix MUST have appropriate tests (unit, integration, etc...)
  • PR title must be written as a CHANGELOG entry (see why)
  • Files changed must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have one changelog/ label attached. If applicable, it should have the backward-incompatible label attached.
  • PR should not have a do-not-merge/ label attached.
  • If applicable, the issue must have at least kind/ and severity/ labels attached.

* add buffered_metrics object type

* update metric_types to include histogram, distribution, timing

* Run tests on any branch
@andrewqian2001datadog andrewqian2001datadog self-assigned this Oct 28, 2024
def should_sample(self, rate):
    """Determine if a sample should be kept based on the specified rate."""
    with self.random_lock:
        return self.random.random() < rate


🔴 Code Vulnerability

do not use random

Make sure to use values that are actually random. The random module in Python should generally not be used and replaced with the secrets module, as noted in the official Python documentation.


@github-actions github-actions bot added the stale Stale - Bot reminder label Nov 29, 2024
@andrewqian2001datadog andrewqian2001datadog removed the stale Stale - Bot reminder label Dec 4, 2024
@DataDog DataDog deleted a comment from github-actions bot Dec 10, 2024
@andrewqian2001datadog andrewqian2001datadog changed the title Max samples per context for Histogram, Distribution and Timing metrics Max samples per context for Histogram, Distribution and Timing metrics (Experimental Feature) Jan 14, 2025
@andrewqian2001datadog andrewqian2001datadog changed the title Max samples per context for Histogram, Distribution and Timing metrics (Experimental Feature) [AMLII-2019] Max samples per context for Histogram, Distribution and Timing metrics (Experimental Feature) Jan 14, 2025
@@ -268,6 +274,10 @@ def __init__(
            depending on the connection type.
        :type max_buffer_len: integer

        :param max_metric_samples: Maximum number of metric samples for buffered
            metrics (Histogram, Distribution, Timing)
        :type max_metric_samples: integer
Contributor:
This param seems to be an older version of the per_context one, I don't think it's actually defined

sampled_metrics = self.aggregator.flush_aggregated_sampled_metrics()
if not self._enabled:
    return
for m in sampled_metrics:
Contributor:

Nit: While this is definitely fully functional, I would be a bit more comfortable if _report was modified to allow the option to not internally sample. That way we only have one location to worry about changes to "are we disabled", "how do we handle constant tags", "how do we handle telemetry", etc.
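The suggestion above could look roughly like the sketch below (the `internal_sample` parameter name is hypothetical, not the PR's actual code): `_report` keeps the disabled/tag/telemetry handling in one place, and the aggregated-sample path calls it with client-side sampling turned off, since those samples were already selected upstream.

```python
import random

class Client:
    """Minimal stand-in for the statsd client, for illustration only."""

    def __init__(self):
        self._enabled = True
        self.sent = []

    def _report(self, metric, value, sample_rate=1, internal_sample=True):
        # Single place for "are we disabled" and similar checks.
        if not self._enabled:
            return
        # The aggregated-sample path passes internal_sample=False so
        # already-sampled values are never dropped a second time.
        if internal_sample and sample_rate != 1 and random.random() > sample_rate:
            return
        self.sent.append((metric, value, sample_rate))
```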

self.metric_type = metric_type
self.max_metric_samples = max_metric_samples
self.specified_rate = specified_rate
self.data = []
Contributor:
Likely makes sense to pre-allocate data in the case of non-zero max_metric_samples

self.total_metric_samples += 1

def maybe_keep_sample(self, value):
    if self.max_metric_samples > 0:
Contributor:
All or most of this function will need lock protections, a number of internal variables like stored metrics and the data array are potentially being updated by multiple threads.
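One way to address this (a sketch under hypothetical names, not the PR's code) is to mutate the counters and the data list together behind a single lock, so concurrent callers cannot interleave between the count update and the store:

```python
import random
import threading

class LockedReservoir:
    """Reservoir of at most max_metric_samples values, safe across threads."""

    def __init__(self, max_metric_samples):
        self.max_metric_samples = max_metric_samples  # 0 means no limit
        self.stored_metric_samples = 0
        self.total_metric_samples = 0
        self.data = []
        self._lock = threading.Lock()

    def maybe_keep_sample(self, value):
        # Counters and data mutate together, atomically.
        with self._lock:
            self.total_metric_samples += 1
            if (self.max_metric_samples <= 0
                    or self.stored_metric_samples < self.max_metric_samples):
                self.data.append(value)
                self.stored_metric_samples += 1
                return
            # Replace a stored sample with probability max / total.
            i = random.randrange(self.total_metric_samples)
            if i < self.max_metric_samples:
                self.data[i] = value
```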

def maybe_keep_sample(self, value):
    if self.max_metric_samples > 0:
        if self.stored_metric_samples >= self.max_metric_samples:
            i = random.randint(0, self.total_metric_samples - 1)
Contributor:
This looks like an off-by-one - if max_metric_samples is five, then the sixth sample should have a chance to not be persisted to the data array. It looks like we're persisting it 100% of the time here.
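For reference, the classic reservoir-sampling step gives the n-th sample only a max/n chance of being stored, including the first sample past the cap. A minimal sketch (a hypothetical free function, not the PR's code):

```python
import random

def reservoir_step(data, max_samples, total_seen, value):
    """Decide whether value (the total_seen-th sample) is kept in data.

    Every sample, including those past the cap, ends up in data with
    probability max_samples / total_seen.
    """
    if len(data) < max_samples:
        data.append(value)
    else:
        i = random.randrange(total_seen)  # 0 .. total_seen - 1 inclusive
        if i < max_samples:               # true with probability max/total
            data[i] = value
```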

if self.specified_rate != 1.0:
    rate = self.specified_rate
else:
    rate = self.stored_metric_samples / total_metric_samples
Contributor @ddrthall commented Jan 15, 2025:
With the current structure it is highly likely that we can always use this value, pending confirmation that we can safely increment all skip_samples behind a lock without too heavy a performance hit.

metrics = []
"""Flush the metrics and reset the stored values."""
with self.lock:
    copiedValues = self.values.copy()
Contributor:
It doesn't look to me like the copy and the clear are necessary - the subsequent assignment of values to an empty dict should be generally equivalent to the clear for our use case, and without a clear call there's no need to shallow copy the dict.
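The suggested simplification amounts to rebinding the dict under the lock and iterating the detached one afterwards; no copy() or clear() is needed, since the old dict then belongs exclusively to the flushing thread. A sketch with hypothetical names:

```python
import threading

class MetricStore:
    """Illustrative store: flush detaches the whole dict instead of copying it."""

    def __init__(self):
        self.values = {}
        self.lock = threading.Lock()

    def flush(self):
        # Swap under the lock; samplers start filling a fresh dict while
        # this thread alone holds the old one.
        with self.lock:
            flushed, self.values = self.values, {}
        return list(flushed.values())
```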

for _, metric in copiedValues.items():
    metrics.append(metric.flush())

self.nb_context += len(copiedValues)
Contributor:
nb_context doesn't seem to be used

# Create a new metric if it doesn't exist
self.values[context_key] = self.max_sample_metric_type(name, tags, rate, max_samples_per_context)
metric = self.values[context_key]
if keeping_sample:
Contributor:
It looks like we're likely racing the flush to persist this correctly. Operating flow:
  1. A thread in sample pulls a metric from the values dict and gets pre-empted after dropping the metric_context's lock.
  2. A separate thread enters flush, pulls the metric from values, removes further reference to it, then flushes the metric.
  3. The original thread wakes back up and continues operating on the metric object it had pulled, unaware that the metric has already been flushed; any additional samples added will be ignored.
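One way to close a race of this shape (a sketch under hypothetical names, not necessarily the fix the PR takes) is to do the lookup and the sample under the same lock that flush() uses to detach the dict, so a flushed metric can never receive late samples:

```python
import threading

class MaxSampleStore:
    """Illustrative store: sampling and flushing share one lock."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def sample(self, context_key, value):
        # Lookup/create and append happen atomically with respect to flush(),
        # so a metric cannot be detached between the two steps.
        with self._lock:
            self._values.setdefault(context_key, []).append(value)

    def flush(self):
        # Detach the whole dict under the lock; samplers start a fresh one.
        with self._lock:
            flushed, self._values = self._values, {}
        return flushed
```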


if sample_rate != 1 and random() > sample_rate:
    return

🔴 Code Vulnerability

do not use random

Make sure to use values that are actually random. The random module in Python should generally not be used and replaced with the secrets module, as noted in the official Python documentation.


Labels
changelog/Added Added features results into a minor version bump
3 participants