This repository has been archived by the owner on Jul 12, 2023. It is now read-only.

OpenCensus metric export fail sometimes #474

Closed
yegle opened this issue Sep 3, 2020 · 10 comments
Labels: kind/bug (Something is malfunctioning.)

@yegle
Contributor

yegle commented Sep 3, 2020

TL;DR

Example log:

jsonPayload: {
  build_id: "6a80e574-0d10-4adc-b5da-ef210c03f4f8"   
  build_tag: "v0.6.0-19-gd4b5ec21081d"   
  caller: "observability/stackdriver.go:55"   
  error: "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: timeSeries[0-16]"   
  logger: "stackdriver"   
  message: "failed to export metric"   
  resource: {
  }
  stacktrace: "github.com/google/exposure-notifications-server/pkg/observability.NewStackdriver.func1
	github.com/google/exposure-notifications-server@v0.6.2-0.20200901223640-ce4572602269/pkg/observability/stackdriver.go:55
contrib.go.opencensus.io/exporter/stackdriver.Options.handleError
	contrib.go.opencensus.io/exporter/[email protected]/stackdriver.go:474
contrib.go.opencensus.io/exporter/stackdriver.(*statsExporter).handleMetricsUpload
	contrib.go.opencensus.io/exporter/[email protected]/metrics.go:68
contrib.go.opencensus.io/exporter/stackdriver.newStatsExporter.func2
	contrib.go.opencensus.io/exporter/[email protected]/stats.go:126
google.golang.org/api/support/bundler.(*Bundler).handle
	google.golang.org/[email protected]/support/bundler/bundler.go:324"   
  timestamp: "2020-09-03T22:53:43.266468382Z"   
 }

This log is from the apiserver service, but the same error also happens a lot on the e2e-runner service.

@yegle yegle added the kind/bug Something is malfunctioning. label Sep 3, 2020
@yegle
Contributor Author

yegle commented Sep 3, 2020

/kind bug

@sethvargo
Member

/assign @icco

@icco
Contributor

icco commented Sep 4, 2020

Yup, we've been looking into this. The current theory is that it's a bug in OpenCensus. We have verified that we aren't losing any data, as uploads are retried. Why data is getting sent more than once a minute, we have no idea.

@yegle
Contributor Author

yegle commented Sep 12, 2020

https://github.com/census-instrumentation/opencensus-go/blob/v0.22.4/stats/view/worker.go#L117

Looks like we did not set a reporting period and are in fact reporting every 10 seconds.

yegle added a commit to yegle/exposure-notifications-server that referenced this issue Sep 12, 2020
@yegle
Contributor Author

yegle commented Sep 12, 2020

Hmm, my previous comment (#474 (comment)) is incorrect. I was looking at an old commit; we do set the reporting period to 2 minutes right now.

yegle added a commit to yegle/exposure-notifications-server that referenced this issue Sep 15, 2020
Making stackdriver exporter options configurable in environment variable.

This allows me to fine-tune these values and see if adjusting them
can eliminate
google/exposure-notifications-verification-server#474

The value used should be backward compatible. The value for
BundleDelayThreshold/BundleCountThreshold is from
https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/blob/db101e30979316cca594a74b3181a9a3b6094086/trace.go#L72-L81
google-oss-robot pushed a commit to google/exposure-notifications-server that referenced this issue Sep 16, 2020
* Making stackdriver exporter options configurable in environment
variable.

This allows me to fine-tune these values and see if adjusting them
can eliminate
google/exposure-notifications-verification-server#474

The value used should be backward compatible. The value for
BundleDelayThreshold/BundleCountThreshold is from
https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/blob/db101e30979316cca594a74b3181a9a3b6094086/trace.go#L72-L81

* fixup! Making stackdriver exporter options configurable in environment variable.

* Add document explaining the additional knobs.

* Change BundleCountThreshold type to uint.

* fixup! Change BundleCountThreshold type to uint.
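
For readers who want a picture of what "configurable in environment variable" can look like, here is a minimal sketch using only the Go standard library. It is not the code from the commit above: the `exporterConfig` struct, the `loadExporterConfig` helper, and the `EXAMPLE_...` variable names are hypothetical, and the defaults in `main` are placeholders rather than the values from the linked trace.go. Only the option names BundleDelayThreshold and BundleCountThreshold come from the commit message.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// exporterConfig holds the two tunable knobs. The field names mirror the
// option names in the commit message; the struct itself is hypothetical.
type exporterConfig struct {
	BundleDelayThreshold time.Duration
	BundleCountThreshold uint
}

// loadExporterConfig starts from defaults and overrides a field only when the
// corresponding variable is set, so an unset environment keeps the old
// behavior. The variable names are made up for this sketch.
func loadExporterConfig(lookup func(string) (string, bool), defaults exporterConfig) (exporterConfig, error) {
	cfg := defaults
	if v, ok := lookup("EXAMPLE_BUNDLE_DELAY_THRESHOLD"); ok {
		d, err := time.ParseDuration(v)
		if err != nil {
			return defaults, fmt.Errorf("parsing delay threshold %q: %w", v, err)
		}
		cfg.BundleDelayThreshold = d
	}
	if v, ok := lookup("EXAMPLE_BUNDLE_COUNT_THRESHOLD"); ok {
		n, err := strconv.ParseUint(v, 10, 32)
		if err != nil {
			return defaults, fmt.Errorf("parsing count threshold %q: %w", v, err)
		}
		cfg.BundleCountThreshold = uint(n)
	}
	return cfg, nil
}

func main() {
	// Placeholder defaults, chosen only for illustration.
	defaults := exporterConfig{BundleDelayThreshold: 2 * time.Second, BundleCountThreshold: 50}
	cfg, err := loadExporterConfig(os.LookupEnv, defaults)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("delay=%s count=%d\n", cfg.BundleDelayThreshold, cfg.BundleCountThreshold)
}
```

Passing the lookup function in, instead of calling `os.LookupEnv` directly, keeps the helper easy to exercise with a plain map while experimenting with different values.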
@icco
Contributor

icco commented Sep 16, 2020

/assign @yegle as he's going deep on this :D

@yegle
Contributor Author

yegle commented Sep 22, 2020

I haven't made much progress by just reading the code. I'm adding some more debug logging, which should help me understand the bundler behavior better.

@yegle
Contributor Author

yegle commented Sep 22, 2020

The previous attempt to add debug logging won't work as the OpenCensus Stackdriver exporter uses gRPC instead of HTTP.

Working on binary logging support to gain more insight.

@yegle
Contributor Author

yegle commented Sep 29, 2020

After tweaking the source code and adding a lot more logging statements, I think I finally figured out why...

TLDR: we misused the stackdriver exporter by registering it twice to export metrics, so the same data was exported multiple times and got throttled by the Cloud Monitoring API.

The correct way to start the exporter, per the example, is to call the StartMetricsExporter method: https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/blob/0fc2674ae49bc91ffa3b9880ad66f56155ff5616/examples/stats/main.go#L53

In our code, we called StartMetricsExporter and view.RegisterExporter: https://github.com/google/exposure-notifications-server/blob/f0596149c8380dfc38abfbc42629e764885a7be6/pkg/observability/stackdriver.go#L96-L113

Why did we call the latter? Probably because 1) the OpenCensus Agent exporter requires it (see https://github.com/google/exposure-notifications-server/blob/f0596149c8380dfc38abfbc42629e764885a7be6/pkg/observability/opencensus.go#L56), and 2) implementing view.Exporter is supposed to be the way to write a custom exporter (https://opencensus.io/exporters/custom-exporter/go/metrics/), and the stackdriver exporter implements it too, which caused the confusion.

The stackdriver exporter project is also considering removing the ExportView method: census-ecosystem/opencensus-go-exporter-stackdriver#193
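
To make the failure mode concrete, here is a toy model of it in plain Go. It deliberately does not use the OpenCensus or Stackdriver libraries: `backend`, `exporter`, `pipeline`, and `simulate` are hypothetical stand-ins, the one-minute minimum period is an assumption based on the "more than once a minute" remark earlier in this thread, and the two-minute tick matches the reporting period mentioned above. Registering the exporter once corresponds to calling only StartMetricsExporter; registering it twice corresponds to also calling view.RegisterExporter.

```go
package main

import (
	"fmt"
	"time"
)

// point is one sample for a single time series.
type point struct {
	series string
	at     time.Time
}

// backend is a toy stand-in for a monitoring API that rejects points written
// more often than minPeriod per series. It is not the Cloud Monitoring API.
type backend struct {
	minPeriod time.Duration
	last      map[string]time.Time
	accepted  int
	rejected  int
}

func (b *backend) write(p point) {
	if prev, ok := b.last[p.series]; ok && p.at.Sub(prev) < b.minPeriod {
		b.rejected++ // written too soon after the previous point
		return
	}
	b.last[p.series] = p.at
	b.accepted++
}

// exporter forwards every point it receives to the backend.
type exporter struct{ b *backend }

func (e *exporter) export(p point) { e.b.write(p) }

// pipeline models the reporting loop: on each tick it hands the point to
// every registered export path.
type pipeline struct{ paths []func(point) }

func (pl *pipeline) register(f func(point)) { pl.paths = append(pl.paths, f) }

func (pl *pipeline) tick(p point) {
	for _, f := range pl.paths {
		f(p)
	}
}

// simulate registers the same exporter `registrations` times, runs `ticks`
// two-minute reporting periods, and returns the backend's counters.
func simulate(registrations, ticks int) (accepted, rejected int) {
	b := &backend{minPeriod: time.Minute, last: make(map[string]time.Time)}
	e := &exporter{b: b}
	pl := &pipeline{}
	for i := 0; i < registrations; i++ {
		pl.register(e.export)
	}
	start := time.Date(2020, 9, 3, 22, 0, 0, 0, time.UTC)
	for i := 0; i < ticks; i++ {
		at := start.Add(time.Duration(i) * 2 * time.Minute)
		pl.tick(point{series: "requests", at: at})
	}
	return b.accepted, b.rejected
}

func main() {
	a, r := simulate(1, 5)
	fmt.Printf("single registration: accepted=%d rejected=%d\n", a, r)
	a, r = simulate(2, 5)
	fmt.Printf("double registration: accepted=%d rejected=%d\n", a, r)
}
```

In this model the double registration rejects one write per tick while still accepting one, which is consistent with the earlier observation that errors were logged but no data was lost.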

@icco
Contributor

icco commented Sep 30, 2020

This is fixed!

@icco icco closed this as completed Sep 30, 2020
@google google deleted a comment Oct 6, 2020
@google google deleted a comment Oct 6, 2020
@google google locked as resolved and limited conversation to collaborators Oct 6, 2020