What is the S2S-Proxy?
- The S2S proxy is a new Temporal component that connects two instances of Temporal together. With a pair of them, two Temporal clusters can communicate as if they shared the same network, the same Temporal version (!), and the same namespace and search-attribute configuration.
```mermaid
architecture-beta
    group clientCluster(cloud)[Client Cluster]
    service clientTemporal(server)[Temporal OSS frontend] in clientCluster
    service clientS2S(server)[S2S Proxy] in clientCluster
    service clientPort(internet)[Client multiplexed port] in clientCluster

    group temporalCloud(cloud)[Temporal Cloud]
    service serviceTemporal(server)[Temporal Cloud frontend] in temporalCloud
    service serviceS2S(server)[S2S Proxy] in temporalCloud
    service servicePort(internet)[Temporal Cloud migration port] in temporalCloud

    %% Client connections
    clientTemporal:T <--> B:clientS2S
    clientS2S:R <--> L:clientPort

    %% Temporal Cloud connections
    serviceTemporal:T <--> B:serviceS2S
    serviceS2S:L <--> R:servicePort

    %% Client<->Cloud connections
    servicePort:L <--> R:clientPort
```
What is Prometheus and how does it work?
- Prometheus is a monitoring framework that hosts a reporting server. The reporting server responds with statistics when a scraper pulls metrics from it. These statistics are point-in-time values: one poll of the reporting server returns the current value of every metric available at the moment of the request.
- This is different from something like AWS CloudWatch, which emits time-series logs to a central location, where each log instance has a timestamp, scope, etc. attached to it. With Prometheus, all the metrics are returned together, once.
- After the scraper scrapes the reporting node, the counter values are persisted into a time-series database where you can retrieve them later using a tool like Grafana.
- From your code, emitting a metric sets a value in memory that your reporting server will read. Each “Metric” object is really just an atomic counter; think about using them in terms of “AtomicLong.Increment()”.
- In Prometheus, metrics are organized in “Registry” objects, and each Registry is attached to a reporting server. When the reporting server is queried, each Registry will iterate its attached metrics, gather values, and return.
- Prometheus likes to use free-running counters to represent rates. If you want to measure the number of times something happened, use a Counter or CounterVec and increment it per instance. The query endpoint has built-in rate measuring that will convert that into counts-per-second.
- To give you better flexibility in reporting metrics, Prometheus supports “vector” metrics, which are parameterized along one or more labels. You can use this to report the same metric, such as `grpc_server_handled_total`, for each of multiple sources, for example per gRPC method.
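As a concrete illustration (a minimal sketch, not code from the s2s-proxy; the metric and label names are invented), a `CounterVec` in the Go Prometheus client gives you one free-running counter per label value, and the query side converts the totals into rates:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// A free-running counter, parameterized by the "method" label.
// Each label value gets its own independent counter.
var requestsHandled = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_requests_handled_total", // hypothetical metric name
		Help: "Number of requests handled, by method.",
	},
	[]string{"method"},
)

func init() {
	// Attach the metric to the default registry so the reporting
	// server can find it when scraped.
	prometheus.MustRegister(requestsHandled)
}

func handleDescribeNamespace() {
	// Incrementing is just an atomic add on the counter for this label value.
	requestsHandled.WithLabelValues("DescribeNamespace").Inc()
}
```

On the query side, something like `rate(example_requests_handled_total{method="DescribeNamespace"}[5m])` turns the free-running total into counts per second.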
Why raw Prometheus over Tally/OpenTelemetry?
- Many parts of Temporal OSS are still using Tally to report metrics, which runs using the AWS CloudWatch-style paradigm. It has an adapter that converts it back to Prometheus, but given we’re reading our metrics using Prometheus only, there’s no real reason to keep doing this!
- OpenTelemetry is a tracing framework that added support for metrics on the side. This makes Otel kind of difficult to emit metrics from, since you have to do all the unnecessary work to set up tracing contexts everywhere. In Prometheus, you can just emit to a global (or non-global!) counter and you’re done!
- If for some reason you start managing multiple registries and passing them around, you’ve now done nearly the same work required to run OpenTelemetry. Consider using the DefaultRegistry unless you really, really need completely separate reporters.
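As an example of that trade-off (a hedged sketch; the metric names are invented), the Go client’s `promauto` package makes the default-registry path a one-liner, while a separate registry has to be threaded through to everything that emits and to its own scrape handler:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Registered against the global DefaultRegisterer: declare it and you're done.
var easyCounter = promauto.NewCounter(prometheus.CounterOpts{
	Name: "example_easy_total", // hypothetical name
	Help: "A counter on the default registry.",
})

func main() {
	// The separate-registry path: the registry now has to be passed to every
	// component that emits metrics AND wired up to its own scrape handler.
	custom := prometheus.NewRegistry()
	customCounter := promauto.With(custom).NewCounter(prometheus.CounterOpts{
		Name: "example_custom_total", // hypothetical name
		Help: "A counter on a custom registry.",
	})

	easyCounter.Inc()
	customCounter.Inc()

	http.Handle("/metrics", promhttp.Handler()) // serves the default registry
	http.Handle("/custom-metrics", promhttp.HandlerFor(custom, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":9090", nil)
}
```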
Prometheus setup in s2s-proxy
- Starting from zero in the s2s-proxy, we added these two libraries. The first is the Prometheus client; the second contains a gRPC interceptor that sets up many convenient metrics automagically. (A sketch of the wiring appears after this list.)
- We opted to follow Temporal OSS’s example and created one big file with the Prometheus gauge definitions
- A downside of this is that you can’t see the definitions of the metrics used in a particular file alongside the code that uses them
- But, the upside is that it condenses the global-ness into a single place and makes it easy to find the list of all metrics emitted by the service.
- Note the `init` function at the bottom, which registers all the counters to the default registry. Counters really are just atomic values; the registry is where the logic required to scan them lives.
- We already had a midweight local integration test that ran a server, so we branched that test out to test the Prometheus endpoint on the local port. This test shows how we did that
- It’s also possible to run against a test registry, but you’d need to stub out the Default registry, which we decided we didn’t need to do
- If you drop a breakpoint or a print statement in a test like this, you can take a look at the raw output you get from Prometheus! It looks like this: Example Prometheus output
Way down at the bottom, we get to see our metric:

```
# HELP temporal_s2s_proxy_health_check_success s2s-proxy service is healthy
# TYPE temporal_s2s_proxy_health_check_success gauge
temporal_s2s_proxy_health_check_success 1
```
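For orientation, here is a rough sketch of this setup, assuming the standard Go client (`github.com/prometheus/client_golang`) and the grpc-ecosystem gRPC interceptor (`github.com/grpc-ecosystem/go-grpc-prometheus`); the proxy’s actual file layout and library choices may differ. First, the single metric-definitions file with its `init` registration:

```go
// metrics.go: one big file holding every metric definition, following the
// Temporal OSS convention (sketch only).
package metrics

import "github.com/prometheus/client_golang/prometheus"

// HealthCheckSuccess is the gauge that appears in the scrape output above.
var HealthCheckSuccess = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "temporal_s2s_proxy_health_check_success",
	Help: "s2s-proxy service is healthy",
})

func init() {
	// Register everything with the default registry so the single
	// reporting endpoint can scan all of it at scrape time.
	prometheus.MustRegister(HealthCheckSuccess)
}
```

And the serving side, where the gRPC interceptor populates metrics such as `grpc_server_handled_total` automatically and the default registry is exposed on a local port:

```go
package main

import (
	"net"
	"net/http"

	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
)

func main() {
	// The interceptors populate per-RPC counters automagically.
	server := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
		grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	)
	grpc_prometheus.Register(server)

	// Expose the default registry so a scraper (or the integration test)
	// can pull /metrics on the local port.
	http.Handle("/metrics", promhttp.Handler())
	go func() { _ = http.ListenAndServe(":9090", nil) }()

	lis, _ := net.Listen("tcp", ":7233")
	_ = server.Serve(lis)
}
```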
Add metrics everywhere!
- Now that we can add a new metric in about 4 lines, we should make use of our super-power and emit a metric for anything that feels interesting.
- Keep in mind that the overhead for updating a metric is exclusively what’s needed to build the strings for its labels (frequently, this overhead can be made static!), plus a single atomic integer operation. If you’re willing to touch any kind of atomic operation, incrementing a metric is ok!
- On the scrape+query side, the boundaries for “too many metrics” are at around 10k entries for a single metric, or 100k generally. Metric labels should therefore not have super-high-cardinality things in them like request UUIDs or unbounded sequences.
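One cheap pattern (a sketch; the metric and label names are invented) is to resolve a metric’s label strings once, up front, so the hot path pays for nothing but the atomic add:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var bytesForwarded = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_bytes_forwarded_total", // hypothetical name
		Help: "Bytes forwarded through the proxy, by direction.",
	},
	// Bounded label values ("inbound"/"outbound"), never UUIDs or sequences.
	[]string{"direction"},
)

// Resolve the label strings once; these children are plain counters.
var (
	inboundBytes  = bytesForwarded.WithLabelValues("inbound")
	outboundBytes = bytesForwarded.WithLabelValues("outbound")
)

func init() { prometheus.MustRegister(bytesForwarded) }

func onInboundChunk(n int) {
	inboundBytes.Add(float64(n)) // effectively just an atomic add
}
```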
Appendix: Why use metrics to solve problems?
- Graphed metrics are one of the only ways for us to observe the behavior of many servers at once. They’re also extremely cheap to create, process, and store relative to text logs, traces, and other forms of observation. This means you can take them into high-throughput locations without sampling, and you can observe the behavior of hundreds of hosts at once.
- With some basic statistics, you can use metrics to identify:
- The slowest request your services are handling, and how many slow requests there were
- Which of your servers are running too hot in memory, CPU, or network
- How many times something is happening, and where it happened in your fleet
- Especially errors, and important decisions!
- If you’re trying to figure out how to get started with metrics, start by adding a Counter next to every `log.Info` in your code. Name the counter after what that log is describing, and set the metric labels to the parameters you would have passed to your log. This gives you the same thing as a log-scanning tool like AWS CloudWatch or X-Ray, at a fraction of the cost and time.
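A sketch of that pattern (the logger call and all names here are illustrative, not taken from the proxy code):

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// Named after what the log line describes; the label mirrors a log parameter.
var namespaceMigrated = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_namespace_migrated_total", // hypothetical name
		Help: "A namespace finished migrating.",
	},
	[]string{"direction"},
)

func init() { prometheus.MustRegister(namespaceMigrated) }

func finishMigration(namespace, direction string) {
	log.Printf("namespace migrated: namespace=%s direction=%s", namespace, direction)
	// The counter answers "how many times, and where" without scanning logs.
	// The namespace itself stays in the log line only: it would be too
	// high-cardinality to use as a metric label.
	namespaceMigrated.WithLabelValues(direction).Inc()
}
```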
Appendix: Writing great dashboards
https://link.excalidraw.com/readonly/7FnmAzvf6JxlGSsshExI
- Your dashboard should put the most important graph in the first location someone looks. We read English left-to-right top-to-bottom, so the graph in the top section on the left should be your single MOST important graph.
- Graphs exist to answer questions. Each graph on your dashboard should answer a question that someone reading the dashboard would have. If the question a graph answers is complicated, add documentation to the graph as a caption or subtitle explaining it, “This graph shows…”
- Use your graph title, legend, and axis label to communicate! A graph titled “gRPC call rate per instance” with a bunch of IPs in the legend is far more readable than that same graph with no title and a legend full of `{cluster="xyz", container="abc", grpc_code=...}`
- All dashboards should have sections for critical service health, system resources, and application state. Some dashboards may also want to add things like business-critical metrics, drilldown/investigation dashboards, and topic-specific dashboards like client quota trackers.
- Your dashboard is a living document! If you investigate something and you think it might happen again, make a metric and put it on the dashboard. If something hasn’t happened for years and you’re not sure it exists anymore, take it OFF the dashboard!
- Every language runtime has a metrics source you can poll to ask questions about the current application. Know what your language’s source is, and make sure it appears in your system resources. (For Go, this is the `runtime/metrics` package. For Java, it’s JMX.)
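For Go, a minimal sketch of polling that source directly (in practice the Prometheus Go client’s Go collector exports much of this as `go_*` metrics, which is usually what lands on the dashboard):

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Ask the runtime which metrics it can report, then read them all once.
	descs := metrics.All()
	samples := make([]metrics.Sample, len(descs))
	for i := range samples {
		samples[i].Name = descs[i].Name
	}
	metrics.Read(samples)

	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
		case metrics.KindFloat64:
			fmt.Printf("%s = %f\n", s.Name, s.Value.Float64())
		default:
			// Histogram-valued metrics are skipped in this sketch.
		}
	}
}
```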