Follow-up to #6165 (comment): improve the generated Grafana dashboard that is built from public_metrics:
1. Refactor the dashboard to include:
This is an example of an ops dashboard that we've produced in CS which covers some of the things we'd be encouraging people to monitor. Unfortunately, it seems that some of these things aren't available in public_metrics, so we may need a separate issue to add those things.
The list is as follows ("not present" means the metric is not exposed in public_metrics):
Nodes Up
Uptime
No. Partitions
No. Topics
Leadership transfer rate (not present)
Under replicated partitions
Leaderless partitions
CPU Utilisation (not present) -> Check additional info
Allocated Memory
Leadership balance
Currently active connections (not present)
Cluster info (build numbers, versions, etc) (not present)
Produce latency
Consumer latency
Storage bytes written (not present)
Storage bytes read (not present)
Network bytes received
Network bytes sent
Under-replicated partitions (by topic) (not present) -> Check additional info
Leaderless partitions (list)
Under replicated partitions by cluster (not present) -> Can be derived from redpanda_kafka_under_replicated_replicas
Number of groups for which a node is a leader
Partition leadership per broker
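As a rough illustration of how the by-topic and cluster-wide under-replicated counts could both be derived from redpanda_kafka_under_replicated_replicas, here is a minimal sketch. It assumes each sample carries topic and partition labels; the label names and the sample values below are assumptions made up for the example, not data from a real cluster.

```python
from collections import defaultdict

# Hypothetical samples of redpanda_kafka_under_replicated_replicas as
# (labels, value) pairs. Label names and values are assumptions.
SAMPLES = [
    ({"redpanda_topic": "orders", "redpanda_partition": "0"}, 1.0),
    ({"redpanda_topic": "orders", "redpanda_partition": "1"}, 0.0),
    ({"redpanda_topic": "payments", "redpanda_partition": "0"}, 2.0),
]


def under_replicated_partitions_by_topic(samples):
    """Count partitions per topic that have at least one under-replicated replica."""
    counts = defaultdict(int)
    for labels, value in samples:
        if value > 0:
            counts[labels["redpanda_topic"]] += 1
    return dict(counts)


def under_replicated_partitions_in_cluster(samples):
    """Count all partitions in the cluster with at least one under-replicated replica."""
    return sum(1 for _, value in samples if value > 0)
```

In a dashboard the same aggregation would normally be done in the query itself rather than in client code; the sketch only shows which grouping each panel needs.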
2. Replace Memory and Storage Section:
Replace the panels in the storage section with two new panels that display the ratio of disk available:
Disk Usage per Broker (the percentage of disk currently in use): 1 - (redpanda_storage_disk_free_bytes / redpanda_storage_disk_total_bytes)
Replace the memory section with: Memory Usage per Broker (the percentage of memory currently in use): redpanda_memory_allocated_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory)
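The two ratios above can be checked with a small sketch. The function names are hypothetical, and the inputs stand in for the values of the redpanda_storage_* and redpanda_memory_* gauges for a single broker.

```python
def disk_usage_ratio(free_bytes, total_bytes):
    """Fraction of disk in use: 1 - (free / total)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1 - free_bytes / total_bytes


def memory_usage_ratio(allocated_bytes, free_bytes):
    """Fraction of memory in use: allocated / (free + allocated)."""
    total = free_bytes + allocated_bytes
    if total <= 0:
        raise ValueError("free + allocated must be positive")
    return allocated_bytes / total
```

Both functions return a fraction between 0 and 1; a panel that should show a percentage would need to scale the value or use a ratio-style unit.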
3. Aggregate the rest proxy and schema registry errors by redpanda_status
The queries should change like this: sum(...) by ($aggr_criteria, redpanda_status). Note the new label we aggregate by.
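A minimal sketch of the query change, assuming the dashboard generator assembles query strings from an inner expression. The helper name and the example inner expression are hypothetical; only the `sum(...) by ($aggr_criteria, redpanda_status)` shape comes from the description above.

```python
def aggregate_by_status(inner_expr, aggr_criteria="$aggr_criteria"):
    """Wrap an expression so it is summed by the dashboard variable and redpanda_status."""
    return f"sum({inner_expr}) by ({aggr_criteria}, redpanda_status)"


# Hypothetical inner expression; substitute the real error-rate expression.
EXAMPLE_QUERY = aggregate_by_status("rate(example_request_errors_total[5m])")
```

Keeping redpanda_status in the `by` clause is what lets a panel show one series per status class instead of a single combined error rate.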
Additional Info:
#6165 (comment)
This will require rpk to handle /public_metrics differently from how we handle /metrics, to keep backward compatibility with the old dashboard.
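One way to picture the split is a small dispatch on the endpoint path. This is only a sketch under assumptions: rpk itself is not written in Python, and the function name, return values, and example URLs are hypothetical, not rpk's actual behavior.

```python
from urllib.parse import urlparse


def dashboard_kind(metrics_endpoint):
    """Pick which dashboard to generate based on the metrics endpoint path."""
    path = urlparse(metrics_endpoint).path.rstrip("/")
    if path.endswith("/public_metrics"):
        return "public"  # new dashboard built from public_metrics
    if path.endswith("/metrics"):
        return "legacy"  # old dashboard, kept unchanged for compatibility
    raise ValueError(f"unrecognized metrics endpoint: {metrics_endpoint}")
```

Under this sketch the existing /metrics path keeps producing the old dashboard, and only /public_metrics opts in to the new panels.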