telemetry: periodic reporting #2721
base: master
Changes from 1 commit
@@ -261,6 +261,21 @@ message Telemetry {
  // Optional. AccessLogging configures the access logging behavior for all
  // selected workloads.
  repeated AccessLogging access_logging = 4;
  // Reporting interval allows configuration of the time between the access log
  // reports for the individual streams. A duration set to 0 disables the periodic
  // reporting, and only the ends of the streams are reported.
  message ReportingInterval {
    // Optional. Reporting interval for TCP streams. The default value is 5s.
    google.protobuf.Duration tcp = 1;

    // Optional. Reporting interval for HTTP streams. Not enabled by default.
    google.protobuf.Duration http = 2;
  }

[Review thread on the `http` field]

Reviewer: Both HTTP/1 and HTTP/2?

Reviewer: Not sure why this would be here or why it matters what protocol is used. You want 30 seconds, using whatever protocol is available.

Author: Any HTTP; Envoy treats them the same. Specialization by protocol is for backwards compatibility only - Istio does periodic reporting for TCP only right now. 30s is too long: users consider 10s of extra latency unacceptable, and without any telemetry during that time we can't debug.
  // Configuration for the interval reporting for the access logging and
  // metrics.
  ReportingInterval reporting_interval = 5;
}

[Review thread on the `reporting_interval` field]

Reviewer: What does it mean to "report" an access log at an interval?

Author: Yes, that's what Istio does today?

Reviewer: As far as I know access logs are only printed once per connection...?

Author: Only the demo stdout log. The production Stackdriver log reports periodically.

Reviewer: Can we note this in the proto? Stdout is not seen as "demo"; it's pretty commonly used in production (https://12factor.net/logs).

Author: Does any vendor rely on stdout for production? It's "demo" because you need another system to collect the logs, and we don't include one in Istio. The point of this PR is to generalize what we do for Stackdriver, as a good production practice, to every other logging sink, including OTel and stdout. Is that what you want me to say?

Reviewer: Yes, we do 🙂 I want the PR to explain what the API does and give guidance to a user on how they can configure it, which is the standard for API documentation. I am still not sure I even understand. "Including OTel and stdout" - I thought we were saying this does not apply to stdout?

Author: Google doesn't depend on stdout logging; not sure why you claim so. It shouldn't, since stdout will hurt performance. The setting is global - it applies to all Envoy telemetry sinks, including stdout and OTel. The Stackdriver implementation is an oddball, but that's Google's problem and will be fixed later. The API is about enabling periodic reporting for long-lived streams, and it's opt-in. What is not clear about it? Are you questioning the value of the periodic reporting, or the shape of the API?

Reviewer: Depends - many apps report logs to stdout, and this gets collected and pushed to Stackdriver. I think we need to be compatible with (and adopt) OTel, including integration with their collectors if available, and not expose APIs that would not be implementable or needed. Maybe in mesh config for the other integrations - but even for that I doubt we should expose them. All APIs require testing and support, and are hard to deprecate.

Reviewer: Should this be at the top level or inside each struct? You may want different settings for each. For stdout logs we set the flush interval to 1s, for example, but that would be aggressive for a write to a remote server. Does this overwrite ...

Author: That's not supported. Most people have only one sink enabled (telemetry costs $$$), and we should cater to them rather than complicate things for imaginary use cases. This setting has nothing to do with IO flush.

Author: I think you might be misunderstanding periodic reporting. It's reporting per data stream, not per sink stream.

Reviewer: If I am misunderstanding, then surely our users will 🙂 Can we clarify the comments a bit so it's understandable for someone who doesn't have prior context?

Reviewer: Also, if it's not related to IO flush, that seems like something to call out in the comments. I guess that means if I use Telemetry to turn on stdout logging and set this field, it has no impact?

Author: I mentioned that the IO flush rate is completely independent. Istio reports telemetry per stream, not per sink.

Reviewer: Reading the .proto file, I still have no clue what this does. But you are saying it's NOT related to the IO flush rate, which is, by my understanding, the interval at which access logs are written to stdout? In Envoy, is this config global or something? What happens if I have 2 metrics backends configured by 2 Telemetry resources with 2 different configs? Or 2 Telemetry resources, 1 configuring metrics and 1 configuring access logs? TBH, I have read this proto quite a few times and I don't think I am any closer to understanding what it does, or how and why I should set it. Can we write the comments from the perspective of a user, to help guide them?

Author: This is not a new feature - it has existed in Istio since 1.0 for TCP. The IO flush rate has nothing to do with telemetry production - it's the rate of delivering telemetry, not of producing it. Users never complained about this being confusing :/ It is a global setting per listener.

Reviewer: Do we have a list of telemetry providers, and how many can implement this feature? From what I've seen, if we use OTel this will not be configurable per app - the per-node or cluster collector will handle most of this. For metric push, I see OTel has it, so I don't mind adding it; for the others I would not.

Author: I think all of them would be supported, but they need to be modified to properly report mid-stream telemetry. OTel would need a special field to indicate whether a record is end-of-stream or mid-stream. Metrics are somewhat irrelevant, actually; that's an artifact of Istio's quirky Stackdriver/Prometheus setup.
  // Tracing configures tracing behavior for workloads within a mesh.
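For readers without prior context, the proposed field would be set on a Telemetry resource. The sketch below is hypothetical and only grounded in the proto above: the mesh-wide `istio-system` placement and the `otel` provider name are assumptions following existing Telemetry API conventions, not something this PR confirms.

```yaml
# Hypothetical usage sketch of the proposed reporting_interval field.
# Field names follow the proto above with the usual proto-to-YAML
# camelCase mapping; resource placement is an assumption.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system     # mesh-wide scope (assumed)
spec:
  accessLogging:
    - providers:
        - name: otel          # assumed provider name
  # Proposed in this PR: emit access log / metric reports for long-lived
  # streams periodically, instead of only once at end of stream.
  reportingInterval:
    tcp: 5s      # default per the proto comment
    http: 10s    # opt-in; HTTP periodic reporting is off by default
```

Per the proto comment, setting a duration to `0` disables periodic reporting, so only the ends of streams are reported.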
@@ -401,6 +416,7 @@ message Metrics {
  // Optional. Reporting interval allows configuration of the time between calls out to for metrics reporting.
  // This currently only supports TCP metrics but we may use this for long duration HTTP streams in the future.
  // The default duration is `5s`.
  // DEPRECATED. Please use Telemetry.reporting_interval.tcp field instead.
  google.protobuf.Duration reporting_interval = 3;
}
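The deprecation above implies a migration for configs that set the interval inside the metrics override. A hedged before/after sketch (the `prometheus` provider name is an assumption for illustration):

```yaml
# Before (deprecated): interval configured per metrics override.
spec:
  metrics:
    - providers:
        - name: prometheus    # assumed provider name
      reportingInterval: 5s   # Metrics.reporting_interval, now deprecated
---
# After (proposed): the top-level Telemetry.reporting_interval field.
spec:
  reportingInterval:
    tcp: 5s
```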
[Review thread on placement of `reporting_interval`]

Reviewer: Should this be part of AccessLogging? IIUC, this config is only applicable to access logs. If access logging is not enabled, this config would not be useful, correct?

Reviewer: Also, it's worth explaining why a user would want to customize this, for what types of workloads, and why they would consider disabling it.

Reviewer: I think we need to have a discussion about what we plan to do about OTel and all the standards and integrations they are driving. https://opentelemetry.io/docs/reference/specification/sdk-environment-variables/ has the set of options OTel supports - OTEL_METRIC_EXPORT_INTERVAL is the equivalent setting that an app using OTel natively would use. In OTel it applies to all push-based metric exporters - logs and tracers don't seem to have this (but have batch and other settings). IMO we should consider the API surface of OTel and not try to have more complicated settings.

Author: @linsun AccessLogging is a list, and the setting is global in Envoy; hence I pulled it up to the top level. Users shouldn't generally touch this unless they want to disable it (because they want to limit telemetry). It cannot be made OTel-specific, since the log production is driven by Envoy while the push is driven by OTel.

@costinm That env var seems unrelated - it's about metric push, not production, again. See the discussion about IO flush; they are not related at all. The closest semantic convention I could find is modeling stream events in OTel: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/semantic_conventions/events.md. We'd define an `event.name` for stream open/mid/close, and report the generic attributes associated with the stream.
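The stream-event modeling the author alludes to might look roughly like the following OTel log record. Everything here is invented for illustration: the `event.name` values (`istio.stream.mid`, etc.) and the attribute names are not defined by any spec or by this PR.

```yaml
# Illustrative only: a hypothetical mid-stream report expressed as an
# OTel log record per the events semantic convention linked above.
# Event names and attributes are assumptions, not part of the PR.
attributes:
  event.domain: istio              # assumed domain
  event.name: istio.stream.mid    # one of: istio.stream.open / mid / close
  source.address: 10.1.2.3
  destination.address: 10.4.5.6
  # counters accumulated so far on the still-open stream
  bytes_sent: 10240
  bytes_received: 2048
```

The distinguishing idea, per the author's comment, is that the sink can tell a periodic mid-stream report apart from the final end-of-stream report.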