docs(meshmetric): meshmetric profiles MADR #9269

Merged 21 commits on Mar 4, 2024
Changes from 8 commits
Commits (21)
cfc5f3f
docs(meshmetric): tmp
slonka Jan 29, 2024
1f5330c
docs(meshmetric): add combining metrics
slonka Feb 15, 2024
4a218a2
docs(meshmetric): add paragraph about the need for dynamic metrics in…
slonka Feb 15, 2024
a08b8cd
docs(meshmetric): suggest 6 predefined profiles
slonka Feb 15, 2024
731ab7f
docs(meshmetric): add schema change and implementation note
slonka Feb 15, 2024
f043765
docs(meshmetric): move section
slonka Feb 15, 2024
021fccb
docs(meshmetric): polish up madr
slonka Feb 15, 2024
6e934e9
docs(meshmetric): try ai to generate profiles
slonka Feb 16, 2024
9180fa7
Update docs/madr/decisions/038-meshmetric-profiles.md
slonka Feb 19, 2024
752a835
docs(meshmetric): add feature based profiles
slonka Feb 23, 2024
0c0014e
docs(meshmetric): rename minimal to golden
slonka Feb 28, 2024
56a56a9
docs(meshmetric): move assets to gist and link to it in links
slonka Feb 28, 2024
d1a1804
docs(meshmetric): post pr review
slonka Feb 28, 2024
1a86665
docs(meshmetric): can not dynamically configure stats_filter in Envoy
slonka Feb 28, 2024
3bc625e
docs(meshmetric): adjust to last change
slonka Feb 28, 2024
5373986
docs(meshmetric): answer one more question
slonka Feb 28, 2024
3542191
Update docs/madr/decisions/038-meshmetric-profiles.md
slonka Feb 28, 2024
b75da04
docs(meshmetric): add schema change
slonka Feb 29, 2024
44348b6
docs(meshmetric): make include / exclude a list
slonka Feb 29, 2024
938f6b1
Apply suggestions from code review
slonka Feb 29, 2024
6ae7470
docs(meshmetric): post review changes
slonka Mar 4, 2024
228 changes: 228 additions & 0 deletions docs/madr/decisions/038-meshmetric-profiles.md
@@ -0,0 +1,228 @@
# MeshMetric profiles (predefined subsets of all metrics)

* Status: accepted

Technical Story: https://github.com/kumahq/kuma/issues/8845

## Context and Problem Statement

There are a lot of default Envoy metrics, even with `usedOnly` enabled.
Users might be overwhelmed by the number of metrics and not know what is important and what is less so.
Hosted providers usually charge based on the number of metrics ingested.
When using trial versions of hosted metrics services, users run out of free credits pretty fast.

To make metrics more digestible for users and cheaper to ingest, we should introduce profiles that contain only subsets of all Envoy metrics.

## Decision Drivers

* minimize the number of metrics without loss of quality (ideally we want to reduce the number of metrics while still providing access to the ones that are the most valuable)

## Considered Options

### Use AI to generate Profiles

This turned out to be useless: after a couple of metrics the model just prints "envoy_cluster_manager_warming_clusters_total" indefinitely.

### Base profiles on expert knowledge, external dashboards and our grafana dashboards

#### Where to get data from?

We built our dashboards to show what is important to look at, so we can extract the list of metrics from these dashboards like this:

```bash
cat app/kumactl/data/install/k8s/metrics/grafana/kuma-dataplane.json | jq 'try .panels[] | try .targets[] | try .expr' | grep -E -o '\benvoy_[a-zA-Z0-9_]+\b' | sort | uniq
cat app/kumactl/data/install/k8s/metrics/grafana/kuma-gateway.json | jq 'try .panels[] | try .targets[] | try .expr' | grep -E -o '\benvoy_[a-zA-Z0-9_]+\b' | sort | uniq
cat app/kumactl/data/install/k8s/metrics/grafana/kuma-service-to-service.json | jq 'try .panels[] | try .targets[] | try .expr' | grep -E -o '\benvoy_[a-zA-Z0-9_]+\b' | sort | uniq
cat app/kumactl/data/install/k8s/metrics/grafana/kuma-mesh.json | jq 'try .panels[] | try .targets[] | try .expr' | grep -E -o '\benvoy_[a-zA-Z0-9_]+\b' | sort | uniq
cat app/kumactl/data/install/k8s/metrics/grafana/kuma-service.json | jq 'try .panels[] | try .targets[] | try .expr' | grep -E -o '\benvoy_[a-zA-Z0-9_]+\b' | sort | uniq
```

and put this in a profile.

We can do a similar thing for other dashboards, like the official Envoy Datadog dashboard:

```bash
cat docs/madr/decisions/assets/038/envoy-datadog-dashboard.json | jq 'try .widgets[] | try .definition | try .widgets[] | try .definition | try .requests[] | try .queries[] | .query' | grep -E -o '\benvoy[\._a-zA-Z0-9]+{' | tr -d '{'
```

Or for Consul Grafana dashboards:

```bash
cat docs/madr/decisions/assets/038/consul-grafana.json | jq 'try .panels[] | try .targets[] | try .expr' | grep -E -o '\benvoy_[a-zA-Z0-9_]+\b' | sort | uniq
```

All of these sources combined result in 100 metrics.

By default, Envoy starts with 378 metrics (unfortunately this is not a complete list of everything Envoy can emit):

```bash
docker run --rm -it -p 9901:9901 -p 10000:10000 envoyproxy/envoy:v1.29.1
curl -s localhost:10000 > /dev/null
curl -s localhost:9901/stats/prometheus | grep -E -o '\benvoy_[a-zA-Z0-9_]+\b' | sort | uniq | wc -l
# 378
```

Datadog lists 990 metrics in total (https://github.com/DataDog/integrations-core/blob/master/envoy/metadata.csv) and 329 non-legacy ones:

```bash
curl -s https://raw.githubusercontent.com/DataDog/integrations-core/master/envoy/metadata.csv | grep -v Legacy | wc -l
```

We can build automation on top of this, so that when we update our dashboards or Envoy changes the metrics it emits, we know about it and can adjust accordingly
(right now the official Envoy dashboards reference metrics that no longer exist, e.g. `envoy_cluster_upstream_rq_time_99percentile`).

As you can see, there is no easy way to track everything (Envoy does not print all possible metrics by default),
so I strongly suggest adding a feature to dynamically add/remove metrics to/from existing profiles (by regex, for example).

#### Profiles suggested:

Suggestions on merging / splitting the profiles and on naming are welcome.

##### All

Nothing is removed, everything Envoy emits is included; people can manually remove things.

##### Extensive

- All available dashboards + [Charly's demo regexes](https://github.com/lahabana/demo-scene/blob/a48ec6e0079601d340f79613549e1b2a4ea715a1/mesh-localityaware/k8s/otel-collectors.yaml#L174)
  - `envoy_cluster_upstream_cx_.*`
  - `envoy_cluster_upstream_rq_.*`
  - `envoy_cluster_circuit_breakers_.*`
  - `envoy_http_downstream_.*`
  - `envoy_listener_downstream_.*`
  - `envoy_listener_http_.*`

##### Comprehensive

- All available dashboards

##### Default

- Our dashboards

##### Minimal

Only the 4 golden signals (by regex or exact name):
- Latency
  - `.*_rq_time_.*` which is:
    - envoy_cluster_internal_upstream_rq_time_bucket
    - envoy_cluster_internal_upstream_rq_time_count
    - envoy_cluster_internal_upstream_rq_time_sum
    - envoy_cluster_external_upstream_rq_time_bucket
    - envoy_cluster_external_upstream_rq_time_count
    - envoy_cluster_external_upstream_rq_time_sum
    - envoy_cluster_upstream_rq_time_bucket
    - envoy_cluster_upstream_rq_time_count
    - envoy_cluster_upstream_rq_time_sum
    - envoy_http_downstream_rq_time_bucket
    - envoy_http_downstream_rq_time_count
    - envoy_http_downstream_rq_time_sum
  - or just `envoy_cluster_upstream_rq_time` and `envoy_http_downstream_rq_time`
  - `.*cx_length_ms.*`
    - envoy_cluster_upstream_cx_length_ms_bucket
    - envoy_cluster_upstream_cx_length_ms_count
    - envoy_cluster_upstream_cx_length_ms_sum
    - envoy_http_downstream_cx_length_ms_bucket
    - envoy_http_downstream_cx_length_ms_count
    - envoy_http_downstream_cx_length_ms_sum
    - envoy_listener_admin_downstream_cx_length_ms_bucket
    - envoy_listener_admin_downstream_cx_length_ms_count
    - envoy_listener_admin_downstream_cx_length_ms_sum
    - envoy_listener_downstream_cx_length_ms_bucket
    - envoy_listener_downstream_cx_length_ms_count
    - envoy_listener_downstream_cx_length_ms_sum
  - or just `envoy_cluster_upstream_cx_length_ms`
- Traffic
  - `.*cx_count.*` (total connections)
  - `.*cx_active.*` (active connections)
  - `.*_rq` (upstream/downstream requests broken down by specific response codes, e.g. 200/201)
  - `.*bytes*` (bytes sent/received/not sent)
- Errors (I just went over the combined stats from all dashboards and picked the ones that had anything to do with errors)
  - we get 5xx from `*_rq`
  - `.*timeout.*`
  - `.*health_check.*`
  - `.*lb_healthy_panic.*`
  - `.*cx_destroy.*`
  - `envoy_cluster_membership_degraded`
  - `envoy_cluster_membership_healthy`
  - `envoy_cluster_ssl_connection_error`
  - `.*error.*`
  - `.*fail.*` (also covers envoy_cluster_ssl_fail, envoy_cluster_update_failure)
  - `.*reset.*`
  - `.*outlier_detection_ejections.*`
  - `envoy_cluster_upstream_cx_pool_overflow_count`
  - `protocol_error`
  - `envoy_cluster_upstream_rq_cancelled`
  - `envoy_cluster_upstream_rq_max_duration_reached`
  - `envoy_cluster_upstream_rq_pending_failure_eject`
  - `.*overflow.*`
  - `.*no_cluster.*`
  - `.*no_route.*`
  - `.*reject.*`
  - `envoy_listener_no_filter_chain_match`
  - `.*denied.*` (envoy_rbac_denied, envoy_rbac_shadow_denied)
  - `envoy_server_days_until_first_cert_expiring`
- Saturation
  - `.*memory.*` (allocated, heap, physical)

##### Nothing

- Just an empty profile; people can manually add things to it.

#### Schema

I suggest changing the current schema of

```yaml
sidecar:
  usedOnly: true # true or false
  profile: minimal # one of minimal, default, full
  regex: http2_act.* # only profile or regex can be defined
```

to

```yaml
sidecar:
  usedOnly: true # true or false
  profile:
    name: default # one of `nothing`, `minimal`, `default`, `comprehensive`, `all`
    exclude: regex2.* # first exclude
    include: regex1.* # then include (include takes over)
```
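
To make the ordering concrete, here is a minimal sketch of the intended semantics (in Go, purely illustrative; the `filterMetrics` helper and the sample metric names are made up, this is not the actual Kuma implementation): start from the chosen profile's set of metrics, drop everything matched by `exclude`, then re-add everything matched by `include`, so include wins over exclude.

```go
package main

import (
	"fmt"
	"regexp"
)

// filterMetrics applies the proposed semantics to a list of metric names:
// start from the chosen profile's set, drop `exclude` matches, then re-add
// any name matching `include`, so include takes precedence over exclude.
func filterMetrics(all, profile []string, exclude, include *regexp.Regexp) []string {
	inProfile := map[string]bool{}
	for _, m := range profile {
		inProfile[m] = true
	}
	var out []string
	for _, m := range all {
		keep := inProfile[m]
		if exclude != nil && exclude.MatchString(m) {
			keep = false
		}
		if include != nil && include.MatchString(m) {
			keep = true
		}
		if keep {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	all := []string{
		"envoy_cluster_upstream_rq_time_bucket",
		"envoy_http_downstream_rq_time_bucket",
		"envoy_cluster_manager_warming_clusters",
	}
	// Pretend the chosen profile contains only the first metric.
	profile := all[:1]
	exclude := regexp.MustCompile(`envoy_cluster_.*`)
	include := regexp.MustCompile(`.*_rq_time_.*`)
	// Exclude drops the cluster metrics, include re-adds the rq_time ones.
	fmt.Println(filterMetrics(all, profile, exclude, include))
	// [envoy_cluster_upstream_rq_time_bucket envoy_http_downstream_rq_time_bucket]
}
```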

#### Implementation

Just like we [mutate responses for the metrics hijacker](https://github.com/kumahq/kuma/blob/d6c9ce64ac5e7ba1f5dbb9fb410e7d9410b67815/app/kuma-dp/pkg/dataplane/metrics/server.go#L348),
we can add a filter mutator to reduce the number of metrics (the same applies to [OTEL](https://github.com/kumahq/kuma/blob/d6c9ce64ac5e7ba1f5dbb9fb410e7d9410b67815/app/kuma-dp/pkg/dataplane/metrics/metrics_producer.go#L106)).
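
For illustration, a rough sketch of what such a filtering mutator could look like, assuming it operates on the Prometheus text exposition format; `FilterPrometheusText` and `metricName` are hypothetical names, not existing Kuma functions:

```go
package metrics

import (
	"bufio"
	"io"
	"regexp"
	"strings"
)

// FilterPrometheusText copies Prometheus text-format metrics from in to out,
// dropping every sample and # HELP / # TYPE line whose metric name does not
// match the allowed regex (compiled from the chosen profile plus include/exclude).
func FilterPrometheusText(in io.Reader, out io.Writer, allowed *regexp.Regexp) error {
	scanner := bufio.NewScanner(in)
	w := bufio.NewWriter(out)
	defer w.Flush()
	for scanner.Scan() {
		line := scanner.Text()
		name := metricName(line)
		// Lines we cannot classify (empty lines, other comments) are kept as-is.
		if name == "" || allowed.MatchString(name) {
			if _, err := w.WriteString(line + "\n"); err != nil {
				return err
			}
		}
	}
	return scanner.Err()
}

// metricName extracts the metric name from a sample line ("name{...} value")
// or a "# HELP name ..." / "# TYPE name ..." comment; "" means "unknown".
func metricName(line string) string {
	if strings.HasPrefix(line, "# HELP ") || strings.HasPrefix(line, "# TYPE ") {
		if fields := strings.Fields(line); len(fields) >= 3 {
			return fields[2]
		}
		return ""
	}
	if strings.HasPrefix(line, "#") || strings.TrimSpace(line) == "" {
		return ""
	}
	if end := strings.IndexAny(line, "{ "); end > 0 {
		return line[:end]
	}
	return line
}
```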

#### Validation

After all profiles are compiled from regexes, make sure that each one includes the ones on the lower levels (all includes default, default includes minimal, etc.).
Make sure that with the `default` profile (or the profile chosen for dashboards) all dashboards are populated.
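
As a sketch of how the nesting check could be automated, assuming profiles can be materialized into sets of metric names; `compileProfile` is a hypothetical helper, stubbed here with hard-coded data so the example is self-contained:

```go
package profiles

import "testing"

// compileProfile is a hypothetical helper: in the real check it would expand a
// profile's regexes against the full list of known Envoy metric names.
// It is stubbed with hard-coded sets here so the sketch is self-contained.
func compileProfile(name string) map[string]bool {
	sets := map[string][]string{
		"nothing":       {},
		"minimal":       {"envoy_cluster_upstream_rq_time"},
		"default":       {"envoy_cluster_upstream_rq_time", "envoy_cluster_upstream_cx_active"},
		"comprehensive": {"envoy_cluster_upstream_rq_time", "envoy_cluster_upstream_cx_active", "envoy_listener_downstream_cx_total"},
		"all":           {"envoy_cluster_upstream_rq_time", "envoy_cluster_upstream_cx_active", "envoy_listener_downstream_cx_total", "envoy_server_memory_allocated"},
	}
	out := map[string]bool{}
	for _, m := range sets[name] {
		out[m] = true
	}
	return out
}

// TestProfilesAreNested checks that every profile is a superset of the one
// below it: nothing, then minimal, default, comprehensive, all.
func TestProfilesAreNested(t *testing.T) {
	ordered := []string{"nothing", "minimal", "default", "comprehensive", "all"}
	for i := 1; i < len(ordered); i++ {
		lower, higher := compileProfile(ordered[i-1]), compileProfile(ordered[i])
		for m := range lower {
			if !higher[m] {
				t.Errorf("profile %q is missing metric %q from %q", ordered[i], m, ordered[i-1])
			}
		}
	}
}
```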

Can we somehow track if users are happy with the defined profiles?

#### Additional work

We could adjust our dashboards to reflect which graphs are going to be populated in which profile.

## Decision Outcome

Chosen option: "Base profiles on expert knowledge, external dashboards and our grafana dashboards", because it provides a good mix of inputs (our recommended metrics, others' recommended metrics, expert knowledge).

### Positive Consequences

* should cover all typical scenarios
* non-typical scenarios can be handled by include/exclude
* allows us to build some automation to make it more resilient to changes

### Negative Consequences

* some processing power needed to filter metrics

## Links

* https://github.com/lahabana/demo-scene/blob/a48ec6e0079601d340f79613549e1b2a4ea715a1/mesh-localityaware/k8s/otel-collectors.yaml#L174
* https://docs.datadoghq.com/integrations/envoy
* https://docs.datadoghq.com/integrations/istio