This repository has been archived by the owner on Sep 3, 2024. It is now read-only.

Metric Expressions for K8s

This document collects frequently requested metric expressions for typical K8s use cases. You can use the listed metric expressions for charting, or for alerting via custom metric events.

Important note on alerting

If your main use case is to alert on common Kubernetes issues, please do not set up metric events using any of the metric expressions below. Instead, use our product's built-in alerts, which we introduced with version 1.254. You can find them under Settings > Anomaly detection > Kubernetes. More information about this feature is available in our documentation and in our community.

Prerequisites

This document assumes basic familiarity with metric selectors. If you don't yet feel comfortable with simple metric selectors, consider reading through Query by Example first.

You can try the examples listed here through the Web UI via Data Explorer, or you can send selectors to the /v2/metrics/query API for evaluation.
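
For the API route, the following Python sketch shows how such a request could be assembled. It assumes the usual environment API layout (`/api/v2/metrics/query` with an `Api-Token` authorization header). The environment URL, the helper names `build_query_url` and `auth_header`, and the `now-2h` timeframe are placeholders chosen for illustration, not part of the original examples.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def build_query_url(base_url: str, selector: str,
                    time_from: str = "now-2h",
                    resolution: Optional[str] = None) -> str:
    """Return a GET URL for the metrics query endpoint (hypothetical helper)."""
    params = {"metricSelector": selector, "from": time_from}
    if resolution is not None:
        params["resolution"] = resolution
    # urlencode percent-encodes quotes, colons and parentheses in the selector.
    return f"{base_url.rstrip('/')}/api/v2/metrics/query?{urlencode(params)}"


def auth_header(api_token: str) -> Dict[str, str]:
    """Header for token authentication; the token itself is supplied by you."""
    return {"Authorization": f"Api-Token {api_token}"}


# Placeholder environment URL; replace it with your own environment.
url = build_query_url(
    "https://example.live.dynatrace.com/",
    'builtin:host.cpu.usage:splitBy("dt.entity.host")',
)
print(url)
```

Encoding the selector matters because the expressions in this document contain quotes, colons, and parentheses, which are not safe to paste into a URL unescaped.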

Important note

Depending on the size of your environment, the following examples might put extreme load on your Dynatrace environment. Please use with care and try using filters whenever possible to narrow the scope of these queries.

Node utilization

Let's start off with some simple metric expressions to measure relative node utilization in terms of usage, requests, and limits over time.

CPU

For CPU usage, we can use the builtin:host.cpu.usage metric in combination with an entity selector that keeps only hosts that are part of a Kubernetes cluster.

builtin:host.cpu.usage
:filter(and(in("dt.entity.host",entitySelector("type(host),softwaretechnologies(~"KUBERNETES~")"))))
:splitBy("dt.entity.host")

For requests, we can divide the requested CPU by the allocatable CPU. The following expression sums both values over all nodes of each cluster, so the result is one percentage per cluster.

builtin:kubernetes.node.requests_cpu:avg:splitBy("dt.entity.kubernetes_cluster"):sum
/ builtin:kubernetes.node.cpu_allocatable:avg:splitBy("dt.entity.kubernetes_cluster"):sum
* 100

The same can be done for limits as follows:

builtin:kubernetes.node.limits_cpu:avg:splitBy("dt.entity.kubernetes_cluster"):sum
/ builtin:kubernetes.node.cpu_allocatable:avg:splitBy("dt.entity.kubernetes_cluster"):sum
* 100
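
To make the arithmetic behind these two expressions concrete, here is a small Python sketch that sums per-node values for one cluster and divides by the summed allocatable CPU. The function name and the millicore numbers are made up for illustration; only the formula (values / allocatable * 100) comes from the expressions above.

```python
from typing import Optional, Sequence


def percent_of_allocatable(per_node_values: Sequence[float],
                           per_node_allocatable: Sequence[float]) -> Optional[float]:
    """Sum both inputs over all nodes, then return values / allocatable * 100."""
    total_allocatable = sum(per_node_allocatable)
    if total_allocatable == 0:
        return None  # no allocatable CPU reported, so the ratio is undefined
    return 100 * sum(per_node_values) / total_allocatable


# Made-up millicore values for a three-node cluster.
allocatable = [4000, 2000, 4000]
requests = [1500, 500, 2000]
limits = [4000, 3000, 5000]

print(percent_of_allocatable(requests, allocatable))  # 40.0
print(percent_of_allocatable(limits, allocatable))    # 120.0
```

Note that the limits percentage can exceed 100 when limits are overcommitted, so a fixed 100% ceiling is not a safe assumption for charts or thresholds on that expression.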

Let's reduce the scope of these queries to only the nodes of a given cluster. This is especially important if you want to set up an alert in the scope of a single cluster. To narrow the scope, attach a filter for the cluster entity. For usage, this results in the following metric expression. Note: Replace KUBERNETES_CLUSTER-44D2F1E49BE901AF with the entity ID of your Kubernetes cluster. The easiest way to get the entity ID is to navigate to the cluster in the Dynatrace Web UI; the ID appears in the URL in your browser's address bar.

Usage

builtin:host.cpu.usage:avg
:filter(in("dt.entity.host", entitySelector("type(HOST),toRelationships.IS_KUBERNETES_CLUSTER_OF_HOST(type(KUBERNETES_CLUSTER),entityId(KUBERNETES_CLUSTER-44D2F1E49BE901AF))")))
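
If you need this filter for several clusters, it can be generated instead of edited by hand. The sketch below is one way to do that in Python. The helper names are hypothetical, and the ID check assumes that cluster entity IDs consist of the `KUBERNETES_CLUSTER-` prefix followed by 16 upper-case hexadecimal characters, as in the sample ID used here.

```python
import re

# Assumed ID shape: type prefix plus 16 upper-case hex characters.
_CLUSTER_ID = re.compile(r"KUBERNETES_CLUSTER-[0-9A-F]{16}")


def cluster_host_filter(cluster_id: str) -> str:
    """Build the :filter(...) suffix that keeps only hosts of one cluster."""
    if not _CLUSTER_ID.fullmatch(cluster_id):
        raise ValueError(f"unexpected cluster ID: {cluster_id!r}")
    selector = (
        "type(HOST),toRelationships.IS_KUBERNETES_CLUSTER_OF_HOST("
        f"type(KUBERNETES_CLUSTER),entityId({cluster_id}))"
    )
    return f':filter(in("dt.entity.host", entitySelector("{selector}")))'


def cluster_cpu_usage(cluster_id: str) -> str:
    """Return the full usage expression scoped to the given cluster."""
    return "builtin:host.cpu.usage:avg\n" + cluster_host_filter(cluster_id)


print(cluster_cpu_usage("KUBERNETES_CLUSTER-44D2F1E49BE901AF"))
```

Validating the ID before interpolating it keeps stray quotes or parentheses from silently producing a malformed selector.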

We can further expand this query to count only the nodes of a cluster whose CPU usage is above a certain threshold. In other words: "How many of my cluster's nodes have a CPU usage over 80%?" Note: Replace "KUBERNETES_CLUSTER-44D2F1E49BE901AF" with the entity ID of your Kubernetes cluster and 80 with your desired CPU usage threshold.

builtin:host.cpu.usage:avg
:filter(in("dt.entity.host", entitySelector("type(HOST),toRelationships.IS_KUBERNETES_CLUSTER_OF_HOST(type(KUBERNETES_CLUSTER),entityId(KUBERNETES_CLUSTER-44D2F1E49BE901AF))")))
:filter(series(avg,gt,80)):splitBy():count

We can modify this query to also answer the reverse question: "How many of my cluster's nodes have a CPU usage below 80%?" Just replace gt with lt.

builtin:host.cpu.usage:avg
:filter(in("dt.entity.host", entitySelector("type(HOST),toRelationships.IS_KUBERNETES_CLUSTER_OF_HOST(type(KUBERNETES_CLUSTER),entityId(KUBERNETES_CLUSTER-44D2F1E49BE901AF))")))
:filter(series(avg,lt,80)):splitBy():count
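
As a rough local model of what the two counting expressions compute, the Python sketch below averages each host's series over the queried timeframe, keeps the hosts on the requested side of the threshold, and counts them. It assumes that the series condition compares the per-series average; the function name and the sample values are invented for illustration.

```python
from statistics import mean
from typing import Dict, List, Optional


def count_series_matching(series_by_host: Dict[str, List[Optional[float]]],
                          threshold: float, above: bool = True) -> int:
    """Count hosts whose average value is strictly above or below the threshold."""
    count = 0
    for values in series_by_host.values():
        points = [v for v in values if v is not None]  # ignore gaps in the data
        if not points:
            continue  # a host without any data matches neither condition
        average = mean(points)
        if (average > threshold) if above else (average < threshold):
            count += 1
    return count


# Made-up CPU usage percentages per host.
cpu = {
    "HOST-A": [85.0, 90.0, 95.0],
    "HOST-B": [40.0, 50.0, None],
    "HOST-C": [79.0, 82.0, 81.0],
}
print(count_series_matching(cpu, 80))               # 2
print(count_series_matching(cpu, 80, above=False))  # 1
```

Because both comparisons are strict, a node sitting exactly at the threshold is counted by neither query, so the two counts do not necessarily add up to the number of nodes.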

By replacing the usage part of this query, we can solve the same use cases for requests and limits. For example, for requests the last metric expression above would result in the following query:

(
  builtin:cloud.kubernetes.node.cpuRequested:avg
  / builtin:cloud.kubernetes.node.cores:avg
  * 100
)
:filter(in("dt.entity.kubernetes_node", entitySelector("type(KUBERNETES_NODE),toRelationships.IS_KUBERNETES_CLUSTER_OF_NODE(type(KUBERNETES_CLUSTER),entityId(KUBERNETES_CLUSTER-44D2F1E49BE901AF))")))
:filter(series(avg,lt,80)):splitBy():count

Memory

The metric expressions used for CPU can easily be adapted to memory by replacing the CPU-related metrics with the corresponding memory-related metrics:

  • builtin:cloud.kubernetes.node.cores -> builtin:cloud.kubernetes.node.memory
  • builtin:cloud.kubernetes.node.cpuAvailable -> builtin:cloud.kubernetes.node.memoryAvailable
  • builtin:cloud.kubernetes.node.cpuRequested -> builtin:cloud.kubernetes.node.memoryRequested
  • builtin:cloud.kubernetes.node.cpuLimit -> builtin:cloud.kubernetes.node.memoryLimit

Node conditions

For node conditions, it's important to understand that a single node can have multiple conditions at the same time, such as DiskPressure and MemoryPressure. Consequently, the metric offers a dimension for each condition (node_condition), which can be either true or false (condition_status) at any given point in time. Hence, we can chart which nodes have any condition other than Ready set to true. Of course, we can also use this metric expression to set up metric events for alerting. Again, you can reduce the scope of this query to a single cluster by adding a proper filter.

builtin:cloud.kubernetes.node.conditions
:filter(and(ne(node_condition,Ready),eq(condition_status,True)))
:splitBy("dt.entity.kubernetes_cluster","dt.entity.kubernetes_node","node_condition")
:count
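
The filter logic of this expression can be tried out locally. The following Python sketch applies the same two conditions (node_condition is not Ready, condition_status is True) to a hand-written list of records. The record layout and node names are hypothetical and only mirror the dimensions named above.

```python
from typing import Dict, List, Tuple


def problematic_conditions(records: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (node, condition) pairs for non-Ready conditions that are True."""
    return sorted(
        (r["node"], r["node_condition"])
        for r in records
        if r["node_condition"] != "Ready" and r["condition_status"] == "True"
    )


# Hand-written sample data, one record per node and condition.
records = [
    {"node": "node-1", "node_condition": "Ready", "condition_status": "True"},
    {"node": "node-1", "node_condition": "MemoryPressure", "condition_status": "False"},
    {"node": "node-2", "node_condition": "Ready", "condition_status": "False"},
    {"node": "node-2", "node_condition": "DiskPressure", "condition_status": "True"},
    {"node": "node-2", "node_condition": "MemoryPressure", "condition_status": "True"},
]
print(problematic_conditions(records))
# [('node-2', 'DiskPressure'), ('node-2', 'MemoryPressure')]
```

One consequence of excluding the Ready condition is that a node whose only problem is Ready being False would not appear in the result; covering that case would need a separate filter on Ready with status False.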

Workload health

Let's use metric expressions to monitor the health of our workloads. We strongly recommend scoping such expressions to a single workload by adding a filter. In the following examples, all expressions are filtered to a sample workload by appending the following filter: :filter(and(in("dt.entity.cloud_application",entitySelector("type(cloud_application),entityId(~"CLOUD_APPLICATION-A26E32FC302257AB~")"))))

To adapt this to your scenario, simply replace "CLOUD_APPLICATION-A26E32FC302257AB" with the ID of your workload. Again, you can find the ID in the URL in your browser while viewing the workload in the Dynatrace Web UI.

Number of pods not running

Using the following query, you can find how many pods are not running compared to the number of desired pods of this workload.

( 
  builtin:kubernetes.workload.pods_desired:avg:splitBy("dt.entity.cloud_application"):sum
  - builtin:kubernetes.pods:avg:filter(eq("pod_phase","Running")):splitBy("dt.entity.cloud_application"):sum
)
:filter(and(in("dt.entity.cloud_application",entitySelector("type(cloud_application),entityId(~"CLOUD_APPLICATION-A26E32FC302257AB~")"))))

Number of pods not ready

Using the following query, you can find how many pods are not ready compared to the number of desired pods of this workload.

( 
  builtin:kubernetes.workload.pods_desired:avg:splitBy("dt.entity.cloud_application"):sum
  - builtin:kubernetes.pods:avg:filter(eq("current_pod_condition","Ready")):splitBy("dt.entity.cloud_application"):sum
)
:filter(and(in("dt.entity.cloud_application",entitySelector("type(cloud_application),entityId(~"CLOUD_APPLICATION-A26E32FC302257AB~")"))))

Number of containers not running

Using the following query, you can find how many containers are not running compared to the number of desired containers of this workload.

( 
  builtin:kubernetes.workload.containers_desired:avg:splitBy("dt.entity.cloud_application"):sum
  - builtin:kubernetes.containers:avg:filter(eq("container_state","running")):splitBy("dt.entity.cloud_application"):sum
)
:filter(and(in("dt.entity.cloud_application",entitySelector("type(cloud_application),entityId(~"CLOUD_APPLICATION-A26E32FC302257AB~")"))))

Workloads resource utilization and optimization

Let's use some metric expressions to understand whether we have proper requests and limits in place for our workloads. We start by looking at the relation between usage and requests. Usually, one tries to keep usage just below requests. So which workloads request far more CPU or memory than they actually use? This difference is usually referred to as slack. The following examples only walk you through CPU-related topics. They can easily be adapted to memory by replacing the CPU metrics with their memory counterparts.

Slack

(
  builtin:kubernetes.workload.requests_cpu:avg:splitBy("dt.entity.cloud_application"):sum
  - builtin:containers.cpu.usageMilliCores:avg:parents:parents:splitBy("dt.entity.cloud_application"):sum
)
:splitBy("dt.entity.cloud_application"):avg
:filter(and(in("dt.entity.cloud_application",entitySelector("type(cloud_application),entityId(~"CLOUD_APPLICATION-A26E32FC302257AB~")"))))

Note: We are using :parents to split and filter by entities that are higher in the Kubernetes entity hierarchy (container -> pod -> workload -> namespace -> cluster). You can also widen the scope of this query to multiple workloads by broadening the filter or removing it. Be aware that in very large environments this can put a lot of stress on your Dynatrace environment.

Usage above requests

The following query shows the difference between a workload's CPU usage and its CPU requests. For workloads with a result higher than zero, consider increasing the requests to improve the stability of the workload and of the Kubernetes cluster it runs on.

(
  builtin:containers.cpu.usageMilliCores:avg:parents:parents:splitBy("dt.entity.cloud_application"):sum
  - builtin:kubernetes.workload.requests_cpu:avg:splitBy("dt.entity.cloud_application"):sum
)
:splitBy("dt.entity.cloud_application"):avg
:filter(and(in("dt.entity.cloud_application",entitySelector("type(cloud_application),entityId(~"CLOUD_APPLICATION-A26E32FC302257AB~")"))))
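
The slack query and the usage-above-requests query are the same subtraction with the operands swapped, so one is the negation of the other. The short Python sketch below illustrates this with made-up millicore values; the function name and the workload names are hypothetical.

```python
from typing import Dict, Tuple


def classify_workload(requests_mc: float, usage_mc: float) -> Tuple[float, str]:
    """Return the slack (requests - usage) and a short verdict for one workload."""
    slack = requests_mc - usage_mc
    if slack > 0:
        return slack, "requests exceed usage (slack)"
    if slack < 0:
        return slack, "usage exceeds requests"
    return slack, "usage equals requests"


# Made-up (requests, usage) pairs in millicores.
workloads: Dict[str, Tuple[float, float]] = {
    "frontend": (500, 120),
    "batch": (250, 400),
}
for name, (requests_mc, usage_mc) in workloads.items():
    slack, verdict = classify_workload(requests_mc, usage_mc)
    print(f"{name}: slack={slack} mCores, {verdict}")
```

In this sample, `frontend` has 380 millicores of slack, while `batch` uses 150 millicores more than it requests and would show up with a positive value in the usage-above-requests query.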

Usage in terms of limits

Users who are afraid of throttling often want to alert on the percentage of limits being used. However, we suggest alerting on throttling relative to usage instead. The reason is that usage relative to limits is not the only deciding factor for when Kubernetes starts to throttle containers; we'll cover that in the next example. Also note the use of the setUnit(Percent) operation in this example, which provides nicer formatting of values in charts and alerts.

(
  builtin:containers.cpu.usageMilliCores:avg:parents:parents:splitBy("dt.entity.cloud_application"):sum
  / builtin:kubernetes.workload.limits_cpu:avg:splitBy("dt.entity.cloud_application"):sum
  * 100
)
:splitBy("dt.entity.cloud_application"):avg
:setUnit(Percent)
:filter(and(in("dt.entity.cloud_application",entitySelector("type(cloud_application),entityId(~"CLOUD_APPLICATION-A26E32FC302257AB~")"))))

High throttling

As mentioned, it makes more sense to alert on the outcome rather than only on one of multiple potential triggers for throttling. In other words, track throttling relative to usage instead of usage in terms of limits.

(
    builtin:containers.cpu.throttledMilliCores:avg:parents:parents:splitBy("dt.entity.cloud_application_instance","dt.entity.cloud_application"):sum
    / builtin:containers.cpu.usageMilliCores:avg:parents:parents:splitBy("dt.entity.cloud_application_instance","dt.entity.cloud_application"):sum
    * 100
)
:splitBy("dt.entity.cloud_application"):avg
:filter(and(in("dt.entity.cloud_application",entitySelector("type(cloud_application),entityId(~"CLOUD_APPLICATION-A26E32FC302257AB~")"))))
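
This expression aggregates in two steps: it first relates throttled CPU to used CPU per pod, then averages the resulting percentages over the pods of the workload. The Python sketch below mimics that order of operations on invented numbers. The data layout and the function name are assumptions made for illustration, and pods without any CPU usage are skipped to avoid dividing by zero.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def workload_throttling_percent(
        pods: Dict[str, List[Tuple[float, float]]]) -> Optional[float]:
    """Average, over pods, of throttled / usage * 100 computed per pod.

    Each pod maps to a list of (throttled_millicores, usage_millicores)
    tuples, one per container.
    """
    per_pod = []
    for containers in pods.values():
        throttled = sum(t for t, _ in containers)
        usage = sum(u for _, u in containers)
        if usage > 0:
            per_pod.append(100 * throttled / usage)
    return mean(per_pod) if per_pod else None


# Made-up values: pod-1 has two containers, pod-2 has one.
pods = {
    "pod-1": [(50, 200), (0, 300)],
    "pod-2": [(150, 500)],
}
print(workload_throttling_percent(pods))  # 20.0
```

Here pod-1 is throttled at 10% of its usage and pod-2 at 30%, which averages to 20% for the workload.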

Memory

Most of the metric expressions shown above can be adapted for memory related use cases by simply replacing the CPU metrics with the proper memory metrics. However, there is one important difference: While high CPU usage leads to throttling, high memory usage leads to out-of-memory kills, and thus container restarts. Consequently, it makes sense to keep track of container restarts.

High container restart rate

builtin:kubernetes.container.restarts
:splitBy("dt.entity.cloud_application"):sum
:filter(and(in("dt.entity.cloud_application",entitySelector("type(cloud_application),entityId(~"CLOUD_APPLICATION-A26E32FC302257AB~")"))))

Note: The restart metric for containers is only available once at least one restart has been observed. Consequently, there won't be any data for pods without container restarts.
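
When you post-process query results yourself, this means a missing series should be read as "no restarts" rather than as an error. The Python sketch below shows one way to handle that. It assumes a response shaped like the output of the metrics query API (a `result` list whose entries carry `data` items with `dimensionMap` and `values`); the sample payload is hand-written, and the second workload ID is a made-up placeholder.

```python
from typing import Dict, Iterable


def restarts_per_workload(response: dict, workload_ids: Iterable[str]) -> Dict[str, float]:
    """Sum restart values per workload, defaulting to 0 for workloads without data."""
    totals: Dict[str, float] = {wid: 0 for wid in workload_ids}
    for result in response.get("result", []):
        for series in result.get("data", []):
            wid = series.get("dimensionMap", {}).get("dt.entity.cloud_application")
            if wid in totals:
                # None marks a time slot without data; skip it when summing.
                totals[wid] += sum(v for v in series.get("values", []) if v is not None)
    return totals


# Hand-written sample payload containing a single workload with restarts.
response = {
    "result": [{
        "metricId": "builtin:kubernetes.container.restarts",
        "data": [{
            "dimensionMap": {"dt.entity.cloud_application": "CLOUD_APPLICATION-A26E32FC302257AB"},
            "timestamps": [1, 2, 3],
            "values": [1, None, 2],
        }],
    }]
}
workloads = ["CLOUD_APPLICATION-A26E32FC302257AB", "CLOUD_APPLICATION-0000000000000001"]
print(restarts_per_workload(response, workloads))
```

The second workload does not appear in the payload at all, yet it is reported with zero restarts instead of being dropped.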

Further Reading

Please refer to the documentation on metric expressions on dynatrace.com for additional technical details, such as the precedence of operators, semantics for combinations of point/series results and literals, null handling, time alignment, and more.