[sdk-metrics] Promote overflow attribute from experimental to stable #5909

Merged: 14 commits, Oct 25, 2024
25 changes: 9 additions & 16 deletions docs/metrics/README.md
@@ -10,11 +10,13 @@
* [Instruments](#instruments)
* [MeterProvider Management](#meterprovider-management)
* [Memory Management](#memory-management)
* [Example](#example)
* [Pre-Aggregation](#pre-aggregation)
* [Cardinality Limits](#cardinality-limits)
* [Memory Preallocation](#memory-preallocation)
* [Metrics Correlation](#metrics-correlation)
* [Metrics Enrichment](#metrics-enrichment)
* [Common issues that lead to missing metrics](#common-issues-that-lead-to-missing-metrics)

</details>

@@ -386,22 +388,13 @@ and the `MetricStreamConfiguration.CardinalityLimit` setting. Refer to this
 [doc](../../docs/metrics/customizing-the-sdk/README.md#changing-the-cardinality-limit-for-a-metric)
 for more information.

-Given a metric, once the cardinality limit is reached, any new measurement which
-cannot be independently aggregated because of the limit will be dropped or
-aggregated using the [overflow
-attribute](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#overflow-attribute)
-(if enabled). When NOT using the overflow attribute feature a warning is written
-to the [self-diagnostic log](../../src/OpenTelemetry/README.md#self-diagnostics)
-the first time an overflow is detected for a given metric.
-
-> [!NOTE]
-> Overflow attribute was introduced in OpenTelemetry .NET
-[1.6.0-rc.1](../../src/OpenTelemetry/CHANGELOG.md#160-rc1). It is currently an
-experimental feature which can be turned on by setting the environment
-variable `OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE=true`. Once
-the [OpenTelemetry
-Specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#overflow-attribute)
-become stable, this feature will be turned on by default.
+Given a metric, once the cardinality limit is reached, any new measurement
+that could not be independently aggregated will be aggregated using the
+[overflow attribute](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#overflow-attribute).
+In versions prior to 1.10.0, the default behavior when cardinality limit was
+reached was to drop the measurement. Users had the ability to opt-in to use
+overflow attribute instead, but this behavior is the default and the only
+allowed behavior starting with version 1.10.0.

 When [Delta Aggregation
 Temporality](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/data-model.md#temporality)
25 changes: 25 additions & 0 deletions src/OpenTelemetry/CHANGELOG.md
@@ -6,6 +6,31 @@ Notes](../../RELEASENOTES.md).

## Unreleased

* Promoted overflow attribute from experimental to stable and removed the
`OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE` environment variable.

**Previous Behavior:**
By default, when the cardinality limit was reached, measurements were dropped,
and an internal log was emitted the first time this occurred. Users could
opt-in to experimental overflow attribute feature with
`OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE=true`.
With this setting, the SDK would use an overflow attribute
(`otel.metric.overflow = true`) to aggregate measurements instead of dropping
measurements. No internal log was emitted in this case.

**New Behavior:**
The SDK now always uses the overflow attribute (`otel.metric.overflow = true`)
to aggregate measurements when the cardinality limit is reached. The previous
approach of dropping measurements has been removed. No internal logs are
emitted when the limit is hit.
Comment on lines +22 to +25
Member

This feels very misleading to me. The measurements are still dropped. The overflow attribute is not a fix for that; it is just a detection mechanism to alert users when they have an issue.

Member

What's your definition of "dropped" vs. not?

Member

That's not the right question. I'm not the audience. The question is: what is the user's definition of dropped, and what do they expect when something is recorded? I promise you it is not "my value and dimensions get folded into a flag and thrown away" 🤣

Member

From the specification perspective, "dropped" means the value is ignored and has no effect on the final metrics. Since the overflow attribute keeps the accurate total sum for Counter (which means the user will always get the correct total sum, whether overflow happened or not), and the correct percentile for Histogram, it shouldn't be considered "dropped".
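
To make the distinction in this comment concrete, here is a minimal sketch, assuming a toy counter store. `ToyCounterStore` and its members are hypothetical names for illustration only; this is not the SDK's `AggregatorStore`, and the string keys stand in for real attribute sets. Once the limit is reached, the attributes of a new series are lost, but its value still contributes to the exported total.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy model of overflow folding; not the SDK implementation.
public sealed class ToyCounterStore
{
    public const string OverflowKey = "otel.metric.overflow=true";

    private readonly int cardinalityLimit;
    private readonly Dictionary<string, long> sums = new Dictionary<string, long>();
    private long overflowSum;
    private bool overflowed;

    public ToyCounterStore(int cardinalityLimit)
    {
        this.cardinalityLimit = cardinalityLimit;
    }

    public void Add(string attributeSet, long value)
    {
        if (this.sums.ContainsKey(attributeSet))
        {
            // Known series: aggregate independently.
            this.sums[attributeSet] += value;
        }
        else if (this.sums.Count < this.cardinalityLimit)
        {
            // New series and still under the limit: start tracking it.
            this.sums[attributeSet] = value;
        }
        else
        {
            // Limit reached: the attributes are lost, the value is not.
            this.overflowed = true;
            this.overflowSum += value;
        }
    }

    // Exported view: one entry per tracked series plus the overflow entry, if any.
    public IReadOnlyDictionary<string, long> Export()
    {
        var result = new Dictionary<string, long>(this.sums);
        if (this.overflowed)
        {
            result[OverflowKey] = this.overflowSum;
        }

        return result;
    }

    // Sum over everything that was recorded, including folded values.
    public long Total()
    {
        return this.sums.Values.Sum() + this.overflowSum;
    }
}
```

With a limit of 2, recording 1, 2, 4, 8 for `route=a`, `route=b`, `route=c`, `route=a` exports `route=a` as 9, `route=b` as 2, and the overflow entry as 4. The total is 15 either way, but nothing in the export says the 4 belonged to `route=c`.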

Member

I see your point. My perspective is... I'm some user and I look at my backend and I see all my metrics, I'm happy. Then a day later I look and everything is gone. Or suddenly everything is empty? Certainly not what I saw before, right? So naturally I'm going to think data is missing and must be getting dropped somewhere. But I read the SDK CHANGELOG and it said it no longer drops anything! Now I'm upset and declare the SDK bugged and open an issue to vent my anger 🔥

I'm not saying the spec is wrong. But I don't think the users will understand the nuance 🤔

> Given a metric, once the cardinality limit is reached, any new measurement that could not be independently aggregated will be aggregated using the overflow attribute.

IMO this is going to be like lawyer jargon for users 😄 I think we should break it down in more layman/accessible terms.

Member

> Then a day later I look and everything is gone. Or suddenly everything is empty? Certainly not what I saw before, right? So naturally I'm going to think data is missing and must be getting dropped somewhere.

I'm very confused. Would you explain more? E.g. could you give a concrete example about "everything is empty" or "everything is gone"?

Member

Agree that the wording may not be easily understood by all end users. The https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#cardinality-limits doc is very good, and perhaps we can make some additions to it to explain how the SDK behaves when the limit is hit, and how to interpret the overflow attribute correctly.

One important thing I'd like called out: if a user has a query like sum of all requests where route=foo, and an overflow exists, then that query is no longer trustworthy, as there is no way to tell whether a route=foo measurement was folded into overflow. The only thing that can be trusted when an overflow exists is the total (i.e., queries that do not filter on any dimension).

If no one volunteers to make this change in the doc, I can cover it. (I am implementing a similar thing for OTel Rust right now, so I can hopefully steal some wording! Ideally this should be covered on the OTel docs website, so every language can benefit.)
https://github.com/utpilla/MetricOverflowAttribute?tab=readme-ov-file can be a good starting point.
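
The point about filtered queries can be sketched as follows. This is only an illustration under stated assumptions: `OverflowAwareQuery` and `SumWhere` are hypothetical helpers, and exported points are modeled as plain (tags, value) tuples rather than the SDK's `MetricPoint`. The helper sums the points matching a filter and reports whether an overflow point was present, in which case the result is only a lower bound.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical query helper; models exported points as (tags, value) tuples.
public static class OverflowAwareQuery
{
    public static (long Sum, bool IsLowerBound) SumWhere(
        IEnumerable<(IReadOnlyDictionary<string, object> Tags, long Value)> points,
        string key,
        string expected)
    {
        long sum = 0;
        bool overflowSeen = false;

        foreach (var (tags, value) in points)
        {
            if (tags.TryGetValue("otel.metric.overflow", out var flag) && flag is true)
            {
                // Some measurements lost their attributes; they may or may not
                // have matched the filter, so the filtered sum is a lower bound.
                overflowSeen = true;
            }
            else if (tags.TryGetValue(key, out var actual) && object.Equals(actual, expected))
            {
                sum += value;
            }
        }

        return (sum, overflowSeen);
    }
}
```

For points `route=foo` with 10, `route=bar` with 5, and an overflow point with 7, `SumWhere(points, "route", "foo")` returns a sum of 10 flagged as a lower bound: the true value for `route=foo` could be anywhere from 10 to 17. The unfiltered total, 22, remains exact.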

Contributor Author

Opened an issue #5939 to track this.


The default cardinality limit remains 2000 per metric. To set the cardinality
limit for an individual metric, use the [changing cardinality limit for a
Metric](../../docs/metrics/customizing-the-sdk/README.md#changing-the-cardinality-limit-for-a-metric).

There is NO ability to revert to old behavior.
([#5909](https://github.com/open-telemetry/opentelemetry-dotnet/pull/5909))
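
For the per-metric limit mentioned in this entry, a minimal configuration sketch follows. It assumes the `OpenTelemetry` package in a version where `MetricStreamConfiguration.CardinalityLimit` is available in your build; the meter name `MyCompany.MyProduct.MyLibrary`, the instrument name `MyFruitCounter`, and the limit of 10 are placeholders, not values taken from this PR.

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Placeholder meter and instrument names; replace with your own.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyProduct.MyLibrary")
    // Apply a view that overrides the cardinality limit for one instrument only.
    .AddView(
        instrumentName: "MyFruitCounter",
        new MetricStreamConfiguration { CardinalityLimit = 10 })
    .Build();
```

Instruments without a matching view keep the provider-wide default described above.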

## 1.10.0-beta.1

Released 2024-Sep-30
47 changes: 10 additions & 37 deletions src/OpenTelemetry/Metrics/AggregatorStore.cs
@@ -22,13 +22,11 @@ internal sealed class AggregatorStore
     internal readonly bool OutputDelta;
     internal readonly bool OutputDeltaWithUnusedMetricPointReclaimEnabled;
     internal readonly int NumberOfMetricPoints;
-    internal readonly bool EmitOverflowAttribute;
     internal readonly ConcurrentDictionary<Tags, LookupData>? TagsToMetricPointIndexDictionaryDelta;
     internal readonly Func<ExemplarReservoir?>? ExemplarReservoirFactory;
     internal long DroppedMeasurements = 0;

     private const ExemplarFilterType DefaultExemplarFilter = ExemplarFilterType.AlwaysOff;
-    private static readonly string MetricPointCapHitFixMessage = "Consider opting in for the experimental SDK feature to emit all the throttled metrics under the overflow attribute by setting env variable OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE = true. You could also modify instrumentation to reduce the number of unique key/value pair combinations. Or use Views to drop unwanted tags. Or use MeterProviderBuilder.SetMaxMetricPointsPerMetricStream to set higher limit.";
     private static readonly Comparison<KeyValuePair<string, object?>> DimensionComparisonDelegate = (x, y) => x.Key.CompareTo(y.Key);

     private readonly Lock lockZeroTags = new();
@@ -42,7 +40,6 @@ internal sealed class AggregatorStore
         new();

     private readonly string name;
-    private readonly string metricPointCapHitMessage;
     private readonly MetricPoint[] metricPoints;
     private readonly int[] currentMetricPointBatch;
     private readonly AggregationType aggType;
@@ -56,7 +53,6 @@ internal sealed class AggregatorStore

     private int metricPointIndex = 0;
     private int batchSize = 0;
-    private int metricCapHitMessageLogged;
     private bool zeroTagMetricPointInitialized;
     private bool overflowTagMetricPointInitialized;

@@ -65,7 +61,6 @@ internal AggregatorStore(
         AggregationType aggType,
         AggregationTemporality temporality,
         int cardinalityLimit,
-        bool emitOverflowAttribute,
         bool shouldReclaimUnusedMetricPoints,
         ExemplarFilterType? exemplarFilter = null,
         Func<ExemplarReservoir?>? exemplarReservoirFactory = null)
@@ -77,7 +72,6 @@ internal AggregatorStore(
         // Previously, these were included within the original cardinalityLimit, but now they are explicitly added to enhance clarity.
         this.NumberOfMetricPoints = cardinalityLimit + 2;

-        this.metricPointCapHitMessage = $"Maximum MetricPoints limit reached for this Metric stream. Configured limit: {cardinalityLimit}";
         this.metricPoints = new MetricPoint[this.NumberOfMetricPoints];
         this.currentMetricPointBatch = new int[this.NumberOfMetricPoints];
         this.aggType = aggType;
@@ -105,8 +99,6 @@ internal AggregatorStore(
             this.tagsKeysInterestingCount = hs.Count;
         }

-        this.EmitOverflowAttribute = emitOverflowAttribute;
-
         this.exemplarFilter = exemplarFilter ?? DefaultExemplarFilter;
         Debug.Assert(
             this.exemplarFilter == ExemplarFilterType.AlwaysOff
@@ -245,17 +237,14 @@ internal void SnapshotDeltaWithMetricPointReclaim()
             this.batchSize++;
         }

-        if (this.EmitOverflowAttribute)
+        // TakeSnapshot for the MetricPoint for overflow
+        ref var metricPointForOverflow = ref this.metricPoints[1];
+        if (metricPointForOverflow.MetricPointStatus != MetricPointStatus.NoCollectPending)
         {
-            // TakeSnapshot for the MetricPoint for overflow
-            ref var metricPointForOverflow = ref this.metricPoints[1];
-            if (metricPointForOverflow.MetricPointStatus != MetricPointStatus.NoCollectPending)
-            {
-                this.TakeMetricPointSnapshot(ref metricPointForOverflow, outputDelta: true);
+            this.TakeMetricPointSnapshot(ref metricPointForOverflow, outputDelta: true);

-                this.currentMetricPointBatch[this.batchSize] = 1;
-                this.batchSize++;
-            }
+            this.currentMetricPointBatch[this.batchSize] = 1;
+            this.batchSize++;
         }

         // Index 0 and 1 are reserved for no tags and overflow
@@ -994,16 +983,8 @@ private void UpdateLongMetricPoint(int metricPointIndex, long value, ReadOnlySpa
         if (metricPointIndex < 0)
         {
             Interlocked.Increment(ref this.DroppedMeasurements);
-
-            if (this.EmitOverflowAttribute)
-            {
-                this.InitializeOverflowTagPointIfNotInitialized();
-                this.metricPoints[1].Update(value);
-            }
-            else if (Interlocked.CompareExchange(ref this.metricCapHitMessageLogged, 1, 0) == 0)
-            {
-                OpenTelemetrySdkEventSource.Log.MeasurementDropped(this.name, this.metricPointCapHitMessage, MetricPointCapHitFixMessage);
-            }
+            this.InitializeOverflowTagPointIfNotInitialized();
+            this.metricPoints[1].Update(value);

             return;
         }
@@ -1049,16 +1030,8 @@ private void UpdateDoubleMetricPoint(int metricPointIndex, double value, ReadOnl
         if (metricPointIndex < 0)
         {
             Interlocked.Increment(ref this.DroppedMeasurements);
-
-            if (this.EmitOverflowAttribute)
-            {
-                this.InitializeOverflowTagPointIfNotInitialized();
-                this.metricPoints[1].Update(value);
-            }
-            else if (Interlocked.CompareExchange(ref this.metricCapHitMessageLogged, 1, 0) == 0)
-            {
-                OpenTelemetrySdkEventSource.Log.MeasurementDropped(this.name, this.metricPointCapHitMessage, MetricPointCapHitFixMessage);
-            }
+            this.InitializeOverflowTagPointIfNotInitialized();
+            this.metricPoints[1].Update(value);

             return;
         }
10 changes: 1 addition & 9 deletions src/OpenTelemetry/Metrics/MeterProviderSdk.cs
@@ -13,7 +13,6 @@ namespace OpenTelemetry.Metrics;

 internal sealed class MeterProviderSdk : MeterProvider
 {
-    internal const string EmitOverFlowAttributeConfigKey = "OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE";
     internal const string ReclaimUnusedMetricPointsConfigKey = "OTEL_DOTNET_EXPERIMENTAL_METRICS_RECLAIM_UNUSED_METRIC_POINTS";
     internal const string ExemplarFilterConfigKey = "OTEL_METRICS_EXEMPLAR_FILTER";
     internal const string ExemplarFilterHistogramsConfigKey = "OTEL_DOTNET_EXPERIMENTAL_METRICS_EXEMPLAR_FILTER_HISTOGRAMS";
@@ -22,7 +21,6 @@ internal sealed class MeterProviderSdk : MeterProvider
     internal readonly IDisposable? OwnedServiceProvider;
     internal int ShutdownCount;
     internal bool Disposed;
-    internal bool EmitOverflowAttribute;
     internal bool ReclaimUnusedMetricPoints;
     internal ExemplarFilterType? ExemplarFilter;
     internal ExemplarFilterType? ExemplarFilterForHistograms;
@@ -75,7 +73,7 @@ internal MeterProviderSdk(
         this.viewConfigs = state.ViewConfigs;

         OpenTelemetrySdkEventSource.Log.MeterProviderSdkEvent(
-            $"MeterProvider configuration: {{MetricLimit={state.MetricLimit}, CardinalityLimit={state.CardinalityLimit}, EmitOverflowAttribute={this.EmitOverflowAttribute}, ReclaimUnusedMetricPoints={this.ReclaimUnusedMetricPoints}, ExemplarFilter={this.ExemplarFilter}, ExemplarFilterForHistograms={this.ExemplarFilterForHistograms}}}.");
+            $"MeterProvider configuration: {{MetricLimit={state.MetricLimit}, CardinalityLimit={state.CardinalityLimit}, ReclaimUnusedMetricPoints={this.ReclaimUnusedMetricPoints}, ExemplarFilter={this.ExemplarFilter}, ExemplarFilterForHistograms={this.ExemplarFilterForHistograms}}}.");

         foreach (var reader in state.Readers)
         {
@@ -86,7 +84,6 @@ internal MeterProviderSdk(
             reader.ApplyParentProviderSettings(
                 state.MetricLimit,
                 state.CardinalityLimit,
-                this.EmitOverflowAttribute,
                 this.ReclaimUnusedMetricPoints,
                 this.ExemplarFilter,
                 this.ExemplarFilterForHistograms);
@@ -486,11 +483,6 @@ protected override void Dispose(bool disposing)

     private void ApplySpecificationConfigurationKeys(IConfiguration configuration)
     {
-        if (configuration.TryGetBoolValue(OpenTelemetrySdkEventSource.Log, EmitOverFlowAttributeConfigKey, out this.EmitOverflowAttribute))
-        {
-            OpenTelemetrySdkEventSource.Log.MeterProviderSdkEvent("Overflow attribute feature enabled via configuration.");
-        }
-
         if (configuration.TryGetBoolValue(OpenTelemetrySdkEventSource.Log, ReclaimUnusedMetricPointsConfigKey, out this.ReclaimUnusedMetricPoints))
         {
             OpenTelemetrySdkEventSource.Log.MeterProviderSdkEvent("Reclaim unused metric point feature enabled via configuration.");
2 changes: 0 additions & 2 deletions src/OpenTelemetry/Metrics/Metric.cs
@@ -70,7 +70,6 @@ internal Metric(
         MetricStreamIdentity instrumentIdentity,
         AggregationTemporality temporality,
         int cardinalityLimit,
-        bool emitOverflowAttribute,
         bool shouldReclaimUnusedMetricPoints,
         ExemplarFilterType? exemplarFilter = null,
         Func<ExemplarReservoir?>? exemplarReservoirFactory = null)
@@ -193,7 +192,6 @@ internal Metric(
             aggType,
             temporality,
             cardinalityLimit,
-            emitOverflowAttribute,
             shouldReclaimUnusedMetricPoints,
             exemplarFilter,
             exemplarReservoirFactory);
5 changes: 0 additions & 5 deletions src/OpenTelemetry/Metrics/Reader/MetricReaderExt.cs
@@ -22,7 +22,6 @@ public abstract partial class MetricReader
     private Metric?[]? metrics;
     private Metric[]? metricsCurrentBatch;
     private int metricIndex = -1;
-    private bool emitOverflowAttribute;
     private bool reclaimUnusedMetricPoints;
     private ExemplarFilterType? exemplarFilter;
     private ExemplarFilterType? exemplarFilterForHistograms;
@@ -82,7 +81,6 @@ internal virtual List<Metric> AddMetricWithNoViews(Instrument instrument)
                     metricStreamIdentity,
                     this.GetAggregationTemporality(metricStreamIdentity.InstrumentType),
                     this.cardinalityLimit,
-                    this.emitOverflowAttribute,
                     this.reclaimUnusedMetricPoints,
                     exemplarFilter);
             }
@@ -164,7 +162,6 @@ internal virtual List<Metric> AddMetricWithViews(Instrument instrument, List<Met
                         metricStreamIdentity,
                         this.GetAggregationTemporality(metricStreamIdentity.InstrumentType),
                         metricStreamConfig?.CardinalityLimit ?? this.cardinalityLimit,
-                        this.emitOverflowAttribute,
                         this.reclaimUnusedMetricPoints,
                         exemplarFilter,
                         metricStreamConfig?.ExemplarReservoirFactory);
@@ -184,7 +181,6 @@ internal virtual List<Metric> AddMetricWithViews(Instrument instrument, List<Met
     internal void ApplyParentProviderSettings(
         int metricLimit,
         int cardinalityLimit,
-        bool emitOverflowAttribute,
         bool reclaimUnusedMetricPoints,
         ExemplarFilterType? exemplarFilter,
         ExemplarFilterType? exemplarFilterForHistograms)
@@ -193,7 +189,6 @@ internal void ApplyParentProviderSettings(
         this.metrics = new Metric[metricLimit];
         this.metricsCurrentBatch = new Metric[metricLimit];
         this.cardinalityLimit = cardinalityLimit;
-        this.emitOverflowAttribute = emitOverflowAttribute;
         this.reclaimUnusedMetricPoints = reclaimUnusedMetricPoints;
         this.exemplarFilter = exemplarFilter;
         this.exemplarFilterForHistograms = exemplarFilterForHistograms;
29 changes: 4 additions & 25 deletions test/OpenTelemetry.Tests/Metrics/AggregatorTestsBase.cs
@@ -16,16 +16,14 @@ public abstract class AggregatorTestsBase
     private static readonly ExplicitBucketHistogramConfiguration HistogramConfiguration = new() { Boundaries = Metric.DefaultHistogramBounds };
     private static readonly MetricStreamIdentity MetricStreamIdentity = new(Instrument, HistogramConfiguration);

-    private readonly bool emitOverflowAttribute;
     private readonly bool shouldReclaimUnusedMetricPoints;
     private readonly AggregatorStore aggregatorStore;

-    protected AggregatorTestsBase(bool emitOverflowAttribute, bool shouldReclaimUnusedMetricPoints)
+    protected AggregatorTestsBase(bool shouldReclaimUnusedMetricPoints)
     {
-        this.emitOverflowAttribute = emitOverflowAttribute;
         this.shouldReclaimUnusedMetricPoints = shouldReclaimUnusedMetricPoints;

-        this.aggregatorStore = new(MetricStreamIdentity, AggregationType.HistogramWithBuckets, AggregationTemporality.Cumulative, 1024, emitOverflowAttribute, this.shouldReclaimUnusedMetricPoints);
+        this.aggregatorStore = new(MetricStreamIdentity, AggregationType.HistogramWithBuckets, AggregationTemporality.Cumulative, 1024, this.shouldReclaimUnusedMetricPoints);
     }

     [Fact]
@@ -253,7 +251,6 @@ public void HistogramBucketsDefaultUpdatesForSecondsTest(string meterName, strin
             AggregationType.Histogram,
             AggregationTemporality.Cumulative,
             cardinalityLimit: 1024,
-            this.emitOverflowAttribute,
             this.shouldReclaimUnusedMetricPoints);

         KnownHistogramBuckets actualHistogramBounds = KnownHistogramBuckets.Default;
@@ -330,7 +327,6 @@ internal void ExponentialHistogramTests(AggregationType aggregationType, Aggrega
             aggregationType,
             aggregationTemporality,
             cardinalityLimit: 1024,
-            this.emitOverflowAttribute,
             this.shouldReclaimUnusedMetricPoints,
             exemplarsEnabled ? ExemplarFilterType.AlwaysOn : null);

@@ -440,7 +436,6 @@ internal void ExponentialMaxScaleConfigWorks(int? maxScale)
             AggregationType.Base2ExponentialHistogram,
             AggregationTemporality.Cumulative,
             cardinalityLimit: 1024,
-            this.emitOverflowAttribute,
             this.shouldReclaimUnusedMetricPoints);

         aggregatorStore.Update(10, Array.Empty<KeyValuePair<string, object?>>());
@@ -525,31 +520,15 @@ public ThreadArguments(MetricPoint histogramPoint, ManualResetEvent mreToEnsureA
 public class AggregatorTests : AggregatorTestsBase
 {
     public AggregatorTests()
-        : base(emitOverflowAttribute: false, shouldReclaimUnusedMetricPoints: false)
-    {
-    }
-}
-
-public class AggregatorTestsWithOverflowAttribute : AggregatorTestsBase
-{
-    public AggregatorTestsWithOverflowAttribute()
-        : base(emitOverflowAttribute: true, shouldReclaimUnusedMetricPoints: false)
+        : base(shouldReclaimUnusedMetricPoints: false)
     {
     }
 }

 public class AggregatorTestsWithReclaimAttribute : AggregatorTestsBase
 {
     public AggregatorTestsWithReclaimAttribute()
-        : base(emitOverflowAttribute: false, shouldReclaimUnusedMetricPoints: true)
-    {
-    }
-}
-
-public class AggregatorTestsWithBothReclaimAndOverflowAttributes : AggregatorTestsBase
-{
-    public AggregatorTestsWithBothReclaimAndOverflowAttributes()
-        : base(emitOverflowAttribute: true, shouldReclaimUnusedMetricPoints: true)
+        : base(shouldReclaimUnusedMetricPoints: true)
     {
     }
 }