[sdk-metrics] Promote overflow attribute from experimental to stable #5909

xiang17 · 2024-10-18T00:07:03Z

Fixes #5641.
Design discussion issue #

Changes

Remove the OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE variable and make the feature enabled by default.

Merge requirement checklist

CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)
Unit tests added/updated
Appropriate CHANGELOG.md files updated for non-trivial changes
Changes in public API reviewed (if applicable)

codecov · 2024-10-18T00:13:36Z

Codecov Report

Attention: Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 86.27%. Comparing base (6250307) to head (e4e21ca).
Report is 360 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/OpenTelemetry/Metrics/AggregatorStore.cs | 77.77% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5909      +/-   ##
==========================================
+ Coverage   83.38%   86.27%   +2.88%     
==========================================
  Files         297      260      -37     
  Lines       12531    11437    -1094     
==========================================
- Hits        10449     9867     -582     
+ Misses       2082     1570     -512

| Flag | Coverage Δ |
|---|---|
| unittests | `?` |
| unittests-Project-Experimental | `86.23% <80.00%> (?)` |
| unittests-Project-Stable | `86.21% <80.00%> (?)` |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/OpenTelemetry/Metrics/MeterProviderSdk.cs | `93.38% <100.00%> (+5.54%)` | ⬆️ |
| src/OpenTelemetry/Metrics/Metric.cs | `97.27% <ø> (+0.75%)` | ⬆️ |
| ...rc/OpenTelemetry/Metrics/Reader/MetricReaderExt.cs | `92.03% <ø> (ø)` | |
| src/OpenTelemetry/Metrics/AggregatorStore.cs | `86.15% <77.77%> (+5.76%)` | ⬆️ |

... and 233 files with indirect coverage changes

cijothomas

Changes look good overall.
I am requesting a couple of changes:

Remove the internal log for the cardinality limit hit.
Rephrase the changelog.

(I left a detailed comment about each of the above in the PR.)

Also, we could use some refactoring on the docs, but this can be followed up separately. If you need help, I can help with the doc part.

reyang

LGTM with a minor suggestion.

cijothomas · 2024-10-23T01:56:11Z

docs/metrics/README.md

-  the [OpenTelemetry
-  Specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#overflow-attribute)
-  become stable, this feature will be turned on by default.
+Starting with version 1.10.0, given a metric, once the cardinality limit is


I think it's best to mention the pre-1.10.0 behavior here too.
Versions prior to 1.10.0 behaved differently when the cardinality limit was hit. The measurements were either dropped (the default), with an internal log emitted once, or aggregated using the overflow attribute (on an opt-in basis).

(We have some places in this repo where we use this style in documents.)

I feel it's better if the README reflects the current scenario. Adding prior behavior may increase the length of the document.

I think the whole thing is a bit confusing 🤣

Here is my stab at improving it:

Given a metric, once the cardinality limit is reached, any new measurement which cannot be independently aggregated because of the limit will be dropped. Starting with version 1.10.0, when a measurement is dropped, the overflow attribute is updated automatically.

I think what is important is the first statement "Given a metric, once the cardinality limit is reached, any new measurement which cannot be independently aggregated because of the limit will be dropped." That is a hard stop thing. The way it is currently written, it seems this is new to 1.10.0 😄

I feel this is tricky. One is the previous behavior, with a default and an experimental opt-in overflow attribute. The other is the new behavior, with the always-on overflow attribute. It would also be too long if we explained everything in the README.

I've changed the README to include minimal info mentioning that things have changed a few times, and put more details in the CHANGELOG in case anyone is really interested and wants to understand it better.

> I think what is important is the first statement "Given a metric, once the cardinality limit is reached, any new measurement which cannot be independently aggregated because of the limit will be dropped." That is a hard stop thing. The way it is currently written, it seems this is new to 1.10.0 😄

That's a good point. I've updated it to make that first statement more prominent, followed by how our SDK handles it and how the approach has changed.

One small thing, the spec seems to not see it as a "drop" when it's done with the overflow attribute, but rather a "synthetic aggregation". So it's still "aggregated", but not "independently aggregated".

> An overflow attribute set is defined, containing a single attribute otel.metric.overflow having (boolean) value true, which is used to report a synthetic aggregation of the Measurements that could not be independently aggregated because of the limit.
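To make the "synthetic aggregation" wording concrete, here is a minimal toy model of a sum aggregator with a cardinality limit. It is only a sketch of the behavior described in the spec quote, not the SDK's `AggregatorStore` code. `ToyCounter`, `cardinality_limit`, and `collect` are hypothetical names invented for this illustration.

```python
# Toy model only: names here are hypothetical and do not mirror the SDK.
OVERFLOW_ATTRIBUTES = (("otel.metric.overflow", True),)


class ToyCounter:
    """Sum aggregator that tracks at most `cardinality_limit` attribute sets."""

    def __init__(self, cardinality_limit):
        self.cardinality_limit = cardinality_limit
        self.independent = {}      # attribute set -> running sum
        self.overflow_sum = 0      # synthetic aggregation of everything else
        self.overflowed = False

    def add(self, value, **attributes):
        key = tuple(sorted(attributes.items()))
        if key in self.independent or len(self.independent) < self.cardinality_limit:
            # Still room (or already tracked): aggregate independently.
            self.independent[key] = self.independent.get(key, 0) + value
        else:
            # Limit reached: fold the value into the overflow attribute set.
            self.overflow_sum += value
            self.overflowed = True

    def collect(self):
        points = dict(self.independent)
        if self.overflowed:
            points[OVERFLOW_ATTRIBUTES] = self.overflow_sum
        return points


counter = ToyCounter(cardinality_limit=2)
counter.add(1, route="/a")
counter.add(2, route="/b")
counter.add(3, route="/c")  # third distinct attribute set: exceeds the limit
counter.add(4, route="/a")  # already tracked, so still aggregated independently
points = counter.collect()
```

In this model the `/c` measurement keeps its value but loses its own attributes: it is "aggregated", just not "independently aggregated".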

cijothomas

I have some suggestions on changelog and readme/doc part, rest looks good.
They can be a follow up PR too.

Kielek

Nice to see this stabilized.
I like the current CHANGELOG entry.

CodeBlanch · 2024-10-25T16:45:34Z

src/OpenTelemetry/CHANGELOG.md

+  The SDK now always uses the overflow attribute (`otel.metric.overflow = true`)
+  to aggregate measurements when the cardinality limit is reached. The previous
+  approach of dropping measurements has been removed. No internal logs are
+  emitted when the limit is hit.


This feels very misleading to me. The measurements are still dropped. The overflow attribute is not a fix for that; it is just a detection mechanism to alert users when they have an issue.

What's your definition of "dropped" vs. not?

That's not the right question. I'm not the audience. The question is, what is the user's definition of dropped, and what do they expect when something is recorded? I promise you it is not "my value and dimensions get folded into a flag and thrown away" 🤣

From the specification's perspective, "dropped" means the value is ignored: it has no effect on the final metrics. Since the overflow attribute keeps the accurate total sum for Counter (which means the user will always get the correct total sum, whether overflow happened or not), and the correct percentile for Histogram, it shouldn't be considered "dropped".
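The difference between the two readings of "dropped" can be sketched with a small simulation. This is a toy model under stated assumptions, not SDK code: `aggregate` and its `use_overflow` flag are hypothetical, with `use_overflow=False` standing in for the pre-1.10.0 default described in this thread and `use_overflow=True` for the new always-on behavior.

```python
# Toy comparison of the two policies discussed in this thread.
def aggregate(measurements, limit, use_overflow):
    """Sum (route, value) pairs per route, tracking at most `limit` routes."""
    independent = {}
    overflow = 0
    for route, value in measurements:
        if route in independent or len(independent) < limit:
            independent[route] = independent.get(route, 0) + value
        elif use_overflow:
            overflow += value  # folded into the overflow series
        # else: the value is ignored and affects nothing
    return independent, overflow


measurements = [("/a", 1), ("/b", 2), ("/c", 3), ("/a", 4), ("/d", 5)]
true_total = sum(value for _, value in measurements)

dropped_series, _ = aggregate(measurements, limit=2, use_overflow=False)
total_when_dropping = sum(dropped_series.values())

kept_series, overflow = aggregate(measurements, limit=2, use_overflow=True)
total_with_overflow = sum(kept_series.values()) + overflow
```

With the drop policy the exported total falls short of the recorded total. With the overflow policy the totals match, while the `/c` and `/d` dimensions are lost in both cases, which is the point both sides of this thread are making.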

I see your point. My perspective is... I'm some user and I look at my backend and I see all my metrics, I'm happy. Then a day later I look and everything is gone. Or suddenly everything is empty? Certainly not what I saw before, right? So naturally I'm going to think data is missing and must be getting dropped somewhere. But I read the SDK CHANGELOG and it said it no longer drops anything! Now I'm upset and declare the SDK bugged and open an issue to vent my anger 🔥

I'm not saying the spec is wrong. But I don't think the users will understand the nuance 🤔

Given a metric, once the cardinality limit is reached, any new measurement that could not be independently aggregated will be aggregated using the overflow attribute.

IMO this is going to be like lawyer jargon for users 😄 I think we should break it down in more layman/accessible terms.

> Then a day later I look and everything is gone. Or suddenly everything is empty? Certainly not what I saw before, right? So naturally I'm going to think data is missing and must be getting dropped somewhere.

I'm very confused. Would you explain more? E.g. could you give a concrete example about "everything is empty" or "everything is gone"?

Agree that the wording may not be easily understood by all end users. The https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#cardinality-limits doc is very good, and perhaps we can make some additions to it to explain how the SDK behaves when the limit is hit, and how to interpret the overflow attribute correctly.

One important thing I'd like called out: if a user has a query like "sum of all requests where route=foo", and an overflow exists, then that query is no longer trustworthy, as there is no way to tell whether a route=foo measurement was folded into the overflow. The only thing that can be trusted when an overflow exists is the total metrics (i.e., the ones which do not filter on any dimension).

If no one volunteers to make this change in the doc, I can cover it. (I am implementing a similar thing for OTel Rust right now, so I can hopefully steal some wording! Ideally this should be covered on the OTel docs website, so every language can benefit.)
https://github.com/utpilla/MetricOverflowAttribute?tab=readme-ov-file can be a good starting point.
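The query concern above can be illustrated with a toy export and two queries. This is a sketch under assumptions, not a real backend or SDK API: `export`, `query_sum`, and `has_overflow` are hypothetical helpers, and the exported points are modeled as simple `(attributes, value)` pairs.

```python
# Toy model of why filtered queries are unreliable once an overflow point exists.
OVERFLOW = "otel.metric.overflow"


def export(measurements, limit):
    """Aggregate (route, value) pairs, tracking at most `limit` routes."""
    series, overflow, overflowed = {}, 0, False
    for route, value in measurements:
        if route in series or len(series) < limit:
            series[route] = series.get(route, 0) + value
        else:
            overflow += value
            overflowed = True
    points = [({"route": route}, total) for route, total in series.items()]
    if overflowed:
        points.append(({OVERFLOW: True}, overflow))
    return points


def query_sum(points, **filters):
    """Sum the points whose attributes match every filter (no filters: sum all)."""
    return sum(
        value
        for attrs, value in points
        if all(attrs.get(key) == wanted for key, wanted in filters.items())
    )


def has_overflow(points):
    return any(attrs.get(OVERFLOW) is True for attrs, _ in points)


measurements = [("bar", 10), ("baz", 20), ("foo", 7), ("bar", 1)]
points = export(measurements, limit=2)

foo_sum = query_sum(points, route="foo")  # undercounts: foo went to overflow
grand_total = query_sum(points)           # unfiltered total is still exact
filtered_queries_reliable = not has_overflow(points)
```

Here the `route="foo"` query reports nothing even though 7 was recorded, while the unfiltered total still equals everything recorded. A check like `has_overflow` is one way a dashboard could flag that filtered results may be incomplete.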

Opened an issue #5939 to track this.

Promote overflow attribute from experimental to stable

7070393

github-actions bot added the documentation and pkg:OpenTelemetry labels Oct 18, 2024

xiang17 added 2 commits October 17, 2024 18:06

update wording in markdowns

a349acf

update readme regarding the warning the first time an overflow is detected for a given metric

737c3c9

xiang17 marked this pull request as ready for review October 18, 2024 01:33

xiang17 requested a review from a team as a code owner October 18, 2024 01:33

cijothomas reviewed Oct 18, 2024

docs/metrics/README.md (outdated, resolved)

cijothomas reviewed Oct 18, 2024

docs/metrics/README.md (outdated, resolved)

cijothomas reviewed Oct 18, 2024

src/OpenTelemetry/CHANGELOG.md (resolved)

cijothomas reviewed Oct 18, 2024

src/OpenTelemetry/Metrics/AggregatorStore.cs (outdated, resolved)

cijothomas reviewed Oct 18, 2024

src/OpenTelemetry/Metrics/AggregatorStore.cs (outdated, resolved)

cijothomas requested changes Oct 18, 2024

rajkumar-rangaraj reviewed Oct 21, 2024

docs/metrics/README.md (outdated, resolved)

xiang17 added 3 commits October 22, 2024 17:04

Remove MetricPointCaptHitMessage

9e4fe9b

Update comments related to overflow attribute being optional

bc41d1a

Update README and CHANGELOG

127025f

reyang reviewed Oct 23, 2024

docs/metrics/README.md (outdated, resolved)

The measurement won't be dropped. It will be aggregated with the overflow attribute and this behavior won't be able to be turned off.

f3d1257

reyang reviewed Oct 23, 2024

src/OpenTelemetry/CHANGELOG.md (outdated, resolved)

Update src/OpenTelemetry/CHANGELOG.md

0823e2d

Co-authored-by: Reiley Yang <[email protected]>

reyang reviewed Oct 23, 2024

src/OpenTelemetry/CHANGELOG.md (outdated, resolved)

xiang17 added 2 commits October 22, 2024 18:49

reword

aeceb18

Merge branch 'main' into xiang17/OverflowAttribute

ba83250

reyang reviewed Oct 23, 2024

src/OpenTelemetry/CHANGELOG.md (outdated, resolved)

typo

b72090e

reyang approved these changes Oct 23, 2024

cijothomas reviewed Oct 23, 2024

cijothomas approved these changes Oct 23, 2024

rajkumar-rangaraj approved these changes Oct 23, 2024

CodeBlanch reviewed Oct 23, 2024

src/OpenTelemetry/Metrics/AggregatorStore.cs (resolved)

improve README and CHANGELOG

6c67e3f

cijothomas reviewed Oct 24, 2024

src/OpenTelemetry/CHANGELOG.md (outdated, resolved)

cijothomas reviewed Oct 24, 2024

docs/metrics/README.md (resolved)

cijothomas reviewed Oct 24, 2024

docs/metrics/README.md (resolved)

update wording

a118842

cijothomas approved these changes Oct 25, 2024

Kielek approved these changes Oct 25, 2024

CodeBlanch changed the title ~~Promote overflow attribute from experimental to stable~~ [sdk-metrics] Promote overflow attribute from experimental to stable Oct 25, 2024

CodeBlanch reviewed Oct 25, 2024

Merge branch 'main' into xiang17/OverflowAttribute

e4e21ca

CodeBlanch merged commit 9f41ead into open-telemetry:main Oct 25, 2024
39 checks passed

xiang17 deleted the xiang17/OverflowAttribute branch October 29, 2024 23:40

xiang17 mentioned this pull request Oct 30, 2024

[doc] Improve wording regarding to overflow attribute once cardinality limit is reached for easier understanding for end-users #5939

Open
