rpk: Improve generated Grafana dashboard with public_metrics #6382

r-vasquez · 2022-09-13T14:51:23Z

Description

Follow-up to #6165 (comment): improve the generated Grafana dashboard with public_metrics.

1. Refactor the dashboard to include:

This is an example of an ops dashboard that we've produced in CS, which covers some of the things we'd encourage people to monitor. Unfortunately, some of these aren't available in public_metrics, so we may need a separate issue to add them.
The list is as follows:

Nodes Up
Uptime
No. Partitions
No. Topics
Leadership transfer rate (not present)
Under replicated partitions
Leaderless partitions
CPU Utilisation (not present) -> Check additional info
Allocated Memory
Leadership balance
Currently active connections (not present)
Cluster info (build numbers, versions, etc) (not present)
Produce latency
Consumer latency
Storage bytes written (not present)
Storage bytes read (not present)
Network bytes received
Network bytes sent
Under-replicated partitions (by topic) (not present) -> Check additional info
Leaderless partitions (list)
Under replicated partitions by cluster (not present) -> Can be derived from redpanda_kafka_under_replicated_replicas
Number of groups for which a node is a leader
Partition leadership per broker
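
As a rough illustration of the "can be derived" note for the cluster-level under-replicated panel, the sketch below mimics what a PromQL `sum(redpanda_kafka_under_replicated_replicas)` would compute. The sample values and label names are made up for the example, and the query string is an assumption about how the panel could be written, not the query rpk generates.

```python
# Assumed panel query: sum every series of the metric into one cluster-wide value.
UNDER_REPLICATED_BY_CLUSTER = "sum(redpanda_kafka_under_replicated_replicas)"

# Hypothetical scrape result: (labels, value) pairs. Label names are assumptions.
SAMPLES = [
    ({"redpanda_topic": "orders", "redpanda_partition": "0"}, 1.0),
    ({"redpanda_topic": "orders", "redpanda_partition": "1"}, 0.0),
    ({"redpanda_topic": "payments", "redpanda_partition": "0"}, 2.0),
]


def cluster_total(samples):
    """Mimic PromQL sum() with no grouping: add up every series value."""
    return sum(value for _labels, value in samples)


total_under_replicated = cluster_total(SAMPLES)
```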

2. Replace Memory and Storage Section:

Replace the panels in the storage section with two new panels that display the ratio of disk available:
Disk Usage per Broker (the percentage of disk currently in use): 1 - (redpanda_storage_disk_free_bytes / redpanda_storage_disk_total_bytes)

Replace the memory section with: Memory Usage per Broker (the percentage of memory currently in use): redpanda_memory_allocated_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory)
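
To make the two ratios concrete, here is a minimal sketch that keeps the panel expressions from the text as strings and evaluates the same arithmetic on sample numbers. The function names and the sample byte counts are hypothetical; only the metric names and formulas come from the description above.

```python
# Panel expressions as given in this issue.
DISK_USAGE_EXPR = (
    "1 - (redpanda_storage_disk_free_bytes / redpanda_storage_disk_total_bytes)"
)
MEMORY_USAGE_EXPR = (
    "redpanda_memory_allocated_memory / "
    "(redpanda_memory_free_memory + redpanda_memory_allocated_memory)"
)


def disk_usage_ratio(free_bytes, total_bytes):
    """Fraction of disk in use: 1 - free / total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1 - free_bytes / total_bytes


def memory_usage_ratio(allocated_bytes, free_bytes):
    """Fraction of memory in use: allocated / (free + allocated)."""
    denominator = free_bytes + allocated_bytes
    if denominator <= 0:
        raise ValueError("free + allocated must be positive")
    return allocated_bytes / denominator


GIB = 2**30
# Hypothetical broker: 250 of 1000 GiB disk free, 6 GiB allocated, 2 GiB free memory.
disk_usage = disk_usage_ratio(250 * GIB, 1000 * GIB)
memory_usage = memory_usage_ratio(6 * GIB, 2 * GIB)
```

Both values are ratios between 0 and 1; multiply by 100 (or set the Grafana panel unit to a percent format) to display a percentage.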

3. Aggregate the rest proxy and schema registry errors by redpanda_status

The queries should change like this: sum(...) by ($aggr_criteria, redpanda_status). Note the new label we aggregate by.
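
A small sketch of that query change, assuming the existing panels wrap an inner expression in `sum(...) by ($aggr_criteria)`. The helper name and the inner expression (including the metric name `redpanda_rest_proxy_request_errors_total` and the `5m` window) are assumptions for illustration only.

```python
def aggregate_by_status(inner_expr, aggr_criteria="$aggr_criteria"):
    """Build 'sum(<expr>) by (<criteria>, redpanda_status)'."""
    return f"sum({inner_expr}) by ({aggr_criteria}, redpanda_status)"


# Assumed inner expression; substitute whatever the panel currently sums.
rest_proxy_errors = aggregate_by_status(
    "rate(redpanda_rest_proxy_request_errors_total[5m])"
)
```

Here `rest_proxy_errors` holds `sum(rate(redpanda_rest_proxy_request_errors_total[5m])) by ($aggr_criteria, redpanda_status)`, so each status value gets its own series in the panel.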

Additional Info:

#6165 (comment)

This will require rpk to handle /public_metrics differently from /metrics, to keep backward compatibility with the old dashboard.

r-vasquez · 2023-04-11T19:42:37Z

Fixed by: #9662

r-vasquez added the kind/enhance and area/rpk labels Sep 13, 2022

r-vasquez closed this as not planned Sep 13, 2022

r-vasquez reopened this Sep 13, 2022

r-vasquez mentioned this issue Sep 13, 2022

rpk: grafana-generate - support public metrics #6165

Merged

1 task

tmgstevens mentioned this issue Sep 22, 2022

Wire enhanced Grafana Dashboards into RPK #6502

Closed

r-vasquez closed this as completed Apr 11, 2023
