rpk: grafana-generate - support public metrics #6165
389d455 to 580ef79
I'll defer to someone with better Go knowledge (mine's non-existent) for the code review.
Functionally it looks nice. There are a couple of improvements we could make:
- Remove the labels that never have a value from the legend. For all the panels in "Redpanda Summary" we can remove the `redpanda_request` label from the legend, as it never has a value. The same goes for the panels in the "Internal RPC Latency" section. The panel in the "Throughput" section should keep only the `redpanda_request` label in the legend.
- Can we change the Y-axis unit for panels that represent time (latency) to seconds?
@BenPope, @travisdowns could you take a look too? I generated some load between 14:20 and 14:25 UTC so the panels should display something around that time.
- The intervals for rates should prefer `[$__rate_interval]` over a fixed period such as `[2m]`, but you may need to set a min step of `1m`.
- The units should always be set; e.g., it looks like the latency is off by some orders of magnitude, and should be set to seconds.
- I'd be tempted to also aggregate schema registry and HTTP Proxy errors by `redpanda_status` - it's in the legend, after all.
- In general I'd prefer `rate` over `irate`, at least for simple queries - I don't want to get into subqueries here.
- Memory stats have shard in the legend, but not in the aggregation criteria (probably worth adding a new option that is `cluster,instance,shard` in the dropdown).
- The unit for scheduler stats is `seconds per second` - it should be percent (0..1), not `ops/s`.
- The storage section has rate queries, and I don't know what we're trying to measure. Probably worth combining them into a ratio of disk used %.
- Internal latency of schema registry / http proxy isn't internal, it's the request latency.
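To illustrate the first and fourth points, here is a minimal Go sketch of how a dashboard generator could build a rate query that uses Grafana's `$__rate_interval` variable instead of `irate` with a fixed `[2m]` window. The helper name `rateExpr` is hypothetical and not taken from rpk; the label selectors simply mirror the ones in the generated dashboard JSON.

```go
package main

import "fmt"

// rateExpr is a hypothetical helper (not part of rpk). It builds a PromQL
// expression that uses rate() with Grafana's $__rate_interval variable
// rather than irate() with a fixed window such as [2m].
func rateExpr(metric string) string {
	return fmt.Sprintf(
		`sum(rate(%s{instance=~"$node",shard=~"$node_shard"}[$__rate_interval])) by ($aggr_criteria)`,
		metric,
	)
}

func main() {
	// Metric name taken from the expected dashboard JSON in the test.
	fmt.Println(rateExpr("vectorized_vectorized_internal_rpc_corrupted_headers"))
}
```

The panel would still need its min step (the `interval` field) set, for example to `1m`, as suggested above.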
Nice to see support for public metrics. I had some feedback but Ben got all of it in his comments above (particularly important to fix the units for latency before it goes out). Minor concern about how we detect the endpoint: we look for the string |
The new metrics all start with |
2e1243a to b6e96bf
New Grafana example can be found here |
@@ -65,7 +65,7 @@ vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{le="20.000000 | |||
vectorized_memory_allocated_memory_bytes{shard="0",type="bytes"} 40837120 | |||
vectorized_memory_allocated_memory_bytes{shard="1",type="bytes"} 36986880 | |||
` | |||
expected := `{"title":"Redpanda","templating":{"list":[{"name":"node","datasource":"prometheus","label":"Node","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(instance)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"node_shard","datasource":"prometheus","label":"Shard","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(shard)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"aggr_criteria","datasource":"prometheus","label":"Aggregate by","type":"custom","refresh":1,"options":[{"text":"Cluster","value":"","selected":false},{"text":"Instance","value":"instance,","selected":false},{"text":"Instance, Shard","value":"instance,shard,","selected":false}],"includeAll":false,"allFormat":"","allValue":"","multi":false,"multiFormat":"","query":"Cluster : cluster,Instance : instance,Instance\\,Shard : instance\\,shard","current":{"text":"Cluster","value":""},"hide":0,"sort":1}]},"panels":[{"type":"text","id":1,"title":"","editable":true,"gridPos":{"h":2,"w":24,"x":0,"y":0},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Redpanda Summary</h1>","mode":"html"},{"type":"singlestat","id":2,"title":"Nodes Up","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":0,"y":2},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count by (app) (vectorized_application_uptime)","intervalFactor":1,"step":40,"legendFormat":"Nodes Up"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to 
text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"singlestat","id":3,"title":"Partitions","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":2,"y":8},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count(count by (topic,partition) (vectorized_storage_log_partition_size{namespace=\"kafka\"}))","legendFormat":"Partition count"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 
0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"text","id":5,"title":"","editable":true,"gridPos":{"h":2,"w":12,"x":12,"y":14},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Throughput</h1>","mode":"html"},{"type":"row","collapsed":true,"id":7,"title":"memory","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":20},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":6,"title":"Rate - Allocated memory size in bytes","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":20},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_memory_allocated_memory_bytes{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"Bps"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false}]},{"type":"row","collapsed":true,"id":9,"title":"vectorized_internal_rpc","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":21},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":8,"title":"Amount of memory consumed for requests 
processing","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(vectorized_vectorized_internal_rpc_consumed_mem{instance=~\"$node\",shard=~\"$node_shard\"}) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"short"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":true},{"type":"graph","id":10,"title":"Rate - Number of requests with corrupted headers","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":8,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_vectorized_internal_rpc_corrupted_headers{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"ops"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false},{"type":"graph","id":11,"title":"Latency of service handler 
dispatch (p95)","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":16,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"A","expr":"histogram_quantile(0.95, sum(rate(vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by (le, $aggr_criteria))","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"µs"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null as zero","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"individual","msResolution":true},"aliasColors":{},"steppedLine":true}]}],"editable":true,"timezone":"utc","refresh":"10s","time":{"from":"now-1h","to":"now"},"timepicker":{"refresh_intervals":["5s","10s","30s","1m","5m","15m","30m","1h","2h","1d"],"time_options":["5m","15m","1h","6h","12h","24h","2d","7d","30d"]},"annotations":{"list":null},"links":null,"schemaVersion":12}` | |||
expected := `{"title":"Redpanda","templating":{"list":[{"name":"node","datasource":"prometheus","label":"Node","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(instance)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"node_shard","datasource":"prometheus","label":"Shard","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(shard)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"aggr_criteria","datasource":"prometheus","label":"Aggregate by","type":"custom","refresh":1,"options":[{"text":"Cluster","value":"","selected":false},{"text":"Instance","value":"instance,","selected":false},{"text":"Instance, Shard","value":"instance,shard,","selected":false}],"includeAll":false,"allFormat":"","allValue":"","multi":false,"multiFormat":"","query":"Cluster : cluster,Instance : instance,Instance\\,Shard : instance\\,shard","current":{"text":"Cluster","value":""},"hide":0,"sort":1}]},"panels":[{"type":"text","id":1,"title":"","editable":true,"gridPos":{"h":2,"w":24,"x":0,"y":0},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Redpanda Summary</h1>","mode":"html"},{"type":"singlestat","id":2,"title":"Nodes Up","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":0,"y":2},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count by (app) (vectorized_application_uptime)","intervalFactor":1,"step":40,"legendFormat":"Nodes Up"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to 
text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"singlestat","id":3,"title":"Partitions","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":2,"y":8},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count(count by (topic,partition) (vectorized_storage_log_partition_size{namespace=\"kafka\"}))","legendFormat":"Partition count"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 
0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"text","id":5,"title":"","editable":true,"gridPos":{"h":2,"w":12,"x":12,"y":14},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Throughput</h1>","mode":"html"},{"type":"row","collapsed":true,"id":7,"title":"memory","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":20},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":6,"interval":"1m","title":"Rate - Allocated memory size in bytes","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":20},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_memory_allocated_memory_bytes{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"Bps"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false}]},{"type":"row","collapsed":true,"id":9,"title":"vectorized_internal_rpc","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":21},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":8,"title":"Amount of memory consumed for requests 
processing","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(vectorized_vectorized_internal_rpc_consumed_mem{instance=~\"$node\",shard=~\"$node_shard\"}) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"short"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":true},{"type":"graph","id":10,"interval":"1m","title":"Rate - Number of requests with corrupted headers","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":8,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_vectorized_internal_rpc_corrupted_headers{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: 
{{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"ops"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false},{"type":"graph","id":11,"interval":"1m","title":"Latency of service handler dispatch (p95)","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":16,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"A","expr":"histogram_quantile(0.95, sum(rate(vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by (le, $aggr_criteria))","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"µs"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null as zero","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"individual","msResolution":true},"aliasColors":{},"steppedLine":true}]}],"editable":true,"timezone":"utc","refresh":"10s","time":{"from":"now-1h","to":"now"},"timepicker":{"refresh_intervals":["5s","10s","30s","1m","5m","15m","30m","1h","2h","1d"],"time_options":["5m","15m","1h","6h","12h","24h","2d","7d","30d"]},"annotations":{"list":null},"links":null,"schemaVersion":12}` |
Single-line change: the expected output changes because I'm now adding a default interval of [1m].
In hindsight, I think it would have been better to create a golden file instead of putting all this in a single-line JSON string. It would make reviewing these changes easier.
Looking good. I think a few of Ben's comments still need addressing:
- Rename "Internal latency of request for schema_registry" to "Schema Registry Request Latency" and change the unit to seconds.
- Rename "Internal latency of rest_proxy" to "REST Proxy Request Latency" and change the unit to seconds.
- Replace the panels in the storage section with two new panels that display the ratio of disk available:
  - Disk Usage per Broker (the percentage of disk currently in use): `1 - (redpanda_storage_disk_free_bytes / redpanda_storage_disk_total_bytes)`
- Aggregate the rest proxy and schema registry errors by `redpanda_status`. The queries should change like this: `sum(...) by ($aggr_criteria, redpanda_status)`. Note the new label we aggregate by.
Also, I would replace the panels in the memory section with the following:
- Memory Usage per Broker (the percentage of memory currently in use): `redpanda_memory_allocated_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory)`
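The suggested queries could be expressed in the generator along these lines. This is a sketch under assumptions: `sumBy` is a hypothetical helper, and `example_request_errors_total` is a placeholder metric name, not a real Redpanda metric. The two ratio expressions are copied from the suggestions above.

```go
package main

import "fmt"

// Ratio expressions copied from the review suggestions above.
const (
	diskUsageExpr   = `1 - (redpanda_storage_disk_free_bytes / redpanda_storage_disk_total_bytes)`
	memoryUsageExpr = `redpanda_memory_allocated_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory)`
)

// sumBy is a hypothetical helper: it wraps an inner expression in a sum
// aggregated by $aggr_criteria plus any extra labels, e.g. redpanda_status.
func sumBy(inner string, extraLabels ...string) string {
	by := "$aggr_criteria"
	for _, l := range extraLabels {
		by += ", " + l
	}
	return fmt.Sprintf("sum(%s) by (%s)", inner, by)
}

func main() {
	// example_request_errors_total is a placeholder metric name.
	inner := `rate(example_request_errors_total[$__rate_interval])`
	fmt.Println(sumBy(inner, "redpanda_status"))
	fmt.Println(diskUsageExpr)
	fmt.Println(memoryUsageExpr)
}
```

Both ratios evaluate to a value between 0 and 1, so the panels would need a percent (0.0-1.0) unit rather than bytes.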
@VladLazar We take the name of the panel from the HELP of each metric, in the case of
To avoid hardcoding this, can we change the response from
Is this just for the 2 panels or can we add the aggregation criteria for every other panel?
Is it helpful to leave the ones that we have and add the 2 new panels?
Same as above. I'm asking all this because we will have a lot of custom hardcoded logic in rpk for |
If that's the case then let's leave it as is for now and we can change the description of the metrics in redpanda.
Just for those two panels. The others don't have this label attached to them.
I would just replace them. They'd display the same information, but in a different way. I think what a user actually |
panel = newCounterPanel(family)
// hack around redpanda_storage_* metrics: these should be gauge
// panels but the metrics type come as COUNTER
if family.GetType() == dto.MetricType_COUNTER && !strings.Contains(name, "redpanda_storage") {
Should this be !strings.HasPrefix?
Also, is the counter vs. gauge thing a redpanda issue?
Yeah, this looks like a bug. I think all of the redpanda_storage metrics in public_metrics should be gauges. Filed: #6316
Two non-blocking questions
@tmgstevens may have thoughts on the default panels we show on /public_metrics. I think CS found other panels to yield higher information/better signal.
This is an example of an ops dashboard that we've produced in CS which covers some of the things we'd be encouraging people to monitor. Unfortunately it seems that some of these things aren't available in public_metrics, so we may need a separate issue to add those things.
@r-vasquez ^^ see Tristan's comments above.
@emaxerrno would it be a good idea to decouple the fix (this PR) and the improvements suggested by @tmgstevens?
Yep, not available.
You may find these useful:
I thought we had this, but apparently not.
I thought we had this, but apparently not.
Yep, not available.
Can be derived from |
I would, however, suggest going through the list I mentioned and making sure that we've got as many of those things as possible on the generated dashboard. For example, IIRC under-replicated partitions isn't anywhere near the top of the page, but it should be, as it's really important to monitor.
We need to separate further dashboard improvements into a separate PR -- the scope of this PR was originally to migrate from our overly excessive The best long term fix would be to remove the dashboard generation from pure-Go, rpk-only code and separate it to something that can be maintained and extended by all teams. If we are ok with the current dashboards in this PR, we should merge this PR. I think the comments above indicate that the current dashboards are good, pending any disagreements, I think we should merge this by EOD. We should have two followup issues, one tracking how to separate these dashboards from pure-Go code so that the dashboards can be maintained more broadly, and one to further improve the dashboards per @tmgstevens's suggestions above. Lmk if there are any disagreements here, otherwise the plan is to merge EOD. |
lgtm besides unresolved comments.
@@ -65,7 +65,7 @@ vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{le="20.000000 | |||
vectorized_memory_allocated_memory_bytes{shard="0",type="bytes"} 40837120 | |||
vectorized_memory_allocated_memory_bytes{shard="1",type="bytes"} 36986880 | |||
` | |||
expected := `{"title":"Redpanda","templating":{"list":[{"name":"node","datasource":"prometheus","label":"Node","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(instance)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"node_shard","datasource":"prometheus","label":"Shard","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(shard)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"aggr_criteria","datasource":"prometheus","label":"Aggregate by","type":"custom","refresh":1,"options":[{"text":"Cluster","value":"","selected":false},{"text":"Instance","value":"instance,","selected":false},{"text":"Instance, Shard","value":"instance,shard,","selected":false}],"includeAll":false,"allFormat":"","allValue":"","multi":false,"multiFormat":"","query":"Cluster : cluster,Instance : instance,Instance\\,Shard : instance\\,shard","current":{"text":"Cluster","value":""},"hide":0,"sort":1}]},"panels":[{"type":"text","id":1,"title":"","editable":true,"gridPos":{"h":2,"w":24,"x":0,"y":0},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Redpanda Summary</h1>","mode":"html"},{"type":"singlestat","id":2,"title":"Nodes Up","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":0,"y":2},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count by (app) (vectorized_application_uptime)","intervalFactor":1,"step":40,"legendFormat":"Nodes Up"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to 
text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"singlestat","id":3,"title":"Partitions","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":2,"y":8},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count(count by (topic,partition) (vectorized_storage_log_partition_size{namespace=\"kafka\"}))","legendFormat":"Partition count"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 
0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"text","id":5,"title":"","editable":true,"gridPos":{"h":2,"w":12,"x":12,"y":14},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Throughput</h1>","mode":"html"},{"type":"row","collapsed":true,"id":7,"title":"memory","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":20},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":6,"title":"Rate - Allocated memory size in bytes","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":20},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_memory_allocated_memory_bytes{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"Bps"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false}]},{"type":"row","collapsed":true,"id":9,"title":"vectorized_internal_rpc","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":21},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":8,"title":"Amount of memory consumed for requests 
processing","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(vectorized_vectorized_internal_rpc_consumed_mem{instance=~\"$node\",shard=~\"$node_shard\"}) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"short"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":true},{"type":"graph","id":10,"title":"Rate - Number of requests with corrupted headers","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":8,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_vectorized_internal_rpc_corrupted_headers{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"ops"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false},{"type":"graph","id":11,"title":"Latency of service handler 
dispatch (p95)","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":16,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"A","expr":"histogram_quantile(0.95, sum(rate(vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by (le, $aggr_criteria))","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"µs"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null as zero","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"individual","msResolution":true},"aliasColors":{},"steppedLine":true}]}],"editable":true,"timezone":"utc","refresh":"10s","time":{"from":"now-1h","to":"now"},"timepicker":{"refresh_intervals":["5s","10s","30s","1m","5m","15m","30m","1h","2h","1d"],"time_options":["5m","15m","1h","6h","12h","24h","2d","7d","30d"]},"annotations":{"list":null},"links":null,"schemaVersion":12}` | |||
expected := `{"title":"Redpanda","templating":{"list":[{"name":"node","datasource":"prometheus","label":"Node","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(instance)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"node_shard","datasource":"prometheus","label":"Shard","type":"query","refresh":1,"options":[],"includeAll":true,"allFormat":"","allValue":".*","multi":true,"multiFormat":"","query":"label_values(shard)","current":{"text":"","value":null},"hide":0,"sort":1},{"name":"aggr_criteria","datasource":"prometheus","label":"Aggregate by","type":"custom","refresh":1,"options":[{"text":"Cluster","value":"","selected":false},{"text":"Instance","value":"instance,","selected":false},{"text":"Instance, Shard","value":"instance,shard,","selected":false}],"includeAll":false,"allFormat":"","allValue":"","multi":false,"multiFormat":"","query":"Cluster : cluster,Instance : instance,Instance\\,Shard : instance\\,shard","current":{"text":"Cluster","value":""},"hide":0,"sort":1}]},"panels":[{"type":"text","id":1,"title":"","editable":true,"gridPos":{"h":2,"w":24,"x":0,"y":0},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Redpanda Summary</h1>","mode":"html"},{"type":"singlestat","id":2,"title":"Nodes Up","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":0,"y":2},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count by (app) (vectorized_application_uptime)","intervalFactor":1,"step":40,"legendFormat":"Nodes Up"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to 
text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"singlestat","id":3,"title":"Partitions","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":2,"x":2,"y":8},"transparent":true,"span":1,"error":false,"targets":[{"refId":"","expr":"count(count by (topic,partition) (vectorized_storage_log_partition_size{namespace=\"kafka\"}))","legendFormat":"Partition count"}],"format":"none","prefix":"","postfix":"","maxDataPoints":100,"valueMaps":[{"value":"null","op":"=","text":"N/A"}],"mappingTypes":[{"name":"value to text","value":1},{"name":"range to text","value":2}],"rangeMaps":[{"from":"null","to":"null","text":"N/A"}],"mappingType":1,"nullPointMode":"connected","valueName":"current","valueFontSize":"200%","prefixFontSize":"50%","postfixFontSize":"50%","colorBackground":false,"colorValue":true,"colors":["#299c46","rgba(237, 129, 40, 0.89)","#d44a3a"],"thresholds":"","sparkline":{"show":false,"full":false,"ymin":null,"ymax":null,"lineColor":"rgb(31, 120, 193)","fillColor":"rgba(31, 118, 189, 
0.18)"},"gauge":{"show":false,"minValue":0,"maxValue":100,"thresholdMarkers":true,"thresholdLabels":false},"links":[],"interval":null,"timeFrom":null,"timeShift":null,"nullText":null,"cacheTimeout":null,"tableColumn":""},{"type":"text","id":5,"title":"","editable":true,"gridPos":{"h":2,"w":12,"x":12,"y":14},"transparent":true,"links":null,"span":1,"error":false,"content":"<h1 style=\"color:#87CEEB; border-bottom: 3px solid #87CEEB;\">Throughput</h1>","mode":"html"},{"type":"row","collapsed":true,"id":7,"title":"memory","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":20},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":6,"interval":"1m","title":"Rate - Allocated memory size in bytes","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":20},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_memory_allocated_memory_bytes{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"Bps"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false}]},{"type":"row","collapsed":true,"id":9,"title":"vectorized_internal_rpc","editable":true,"gridPos":{"h":6,"w":24,"x":0,"y":21},"transparent":false,"links":null,"span":0,"error":false,"panels":[{"type":"graph","id":8,"title":"Amount of memory consumed for requests 
processing","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":0,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(vectorized_vectorized_internal_rpc_consumed_mem{instance=~\"$node\",shard=~\"$node_shard\"}) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"short"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":true},{"type":"graph","id":10,"interval":"1m","title":"Rate - Number of requests with corrupted headers","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":8,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"","expr":"sum(irate(vectorized_vectorized_internal_rpc_corrupted_headers{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by ($aggr_criteria)","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: 
{{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"ops"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"cumulative","msResolution":true},"aliasColors":{},"steppedLine":false},{"type":"graph","id":11,"interval":"1m","title":"Latency of service handler dispatch (p95)","datasource":"prometheus","editable":true,"gridPos":{"h":6,"w":8,"x":16,"y":21},"transparent":false,"links":null,"renderer":"flot","span":4,"error":false,"targets":[{"refId":"A","expr":"histogram_quantile(0.95, sum(rate(vectorized_vectorized_internal_rpc_dispatch_handler_latency_bucket{instance=~\"$node\",shard=~\"$node_shard\"}[2m])) by (le, $aggr_criteria))","intervalFactor":2,"step":10,"legendFormat":"node: {{instance}}, shard: {{shard}}","format":"time_series"}],"xaxis":{"format":"","logBase":0,"show":true,"mode":"time"},"yaxes":[{"label":null,"show":true,"logBase":1,"min":0,"format":"µs"},{"label":null,"show":true,"logBase":1,"min":0,"format":"short"}],"legend":{"show":true,"max":false,"min":false,"values":false,"avg":false,"current":false,"total":false},"fill":1,"linewidth":2,"nullPointMode":"null as zero","thresholds":null,"lines":true,"bars":false,"tooltip":{"shared":true,"value_type":"individual","msResolution":true},"aliasColors":{},"steppedLine":true}]}],"editable":true,"timezone":"utc","refresh":"10s","time":{"from":"now-1h","to":"now"},"timepicker":{"refresh_intervals":["5s","10s","30s","1m","5m","15m","30m","1h","2h","1d"],"time_options":["5m","15m","1h","6h","12h","24h","2d","7d","30d"]},"annotations":{"list":null},"links":null,"schemaVersion":12}` |
In hindsight, I think it would have been better to create a golden file instead of putting all this in a single-line JSON string. It would make reviewing these changes easier.
I'm cool with that. I haven't managed to actually run the code anywhere; if someone can generate a dashboard for me, I'll happily test it. Generally, bar the comments above, LGTM.
b6e96bf
to
f61cc7c
Compare
rpk generate grafana-dashboard has a summary section that uses some metrics that don't exist in the new /public_metrics endpoint.
f61cc7c
to
0e65d05
Compare
Latest Grafana dashboard example: here
The improvements to the generated dashboard will be addressed in #6382, so we can separate the fix that makes /public_metrics work from the dashboard improvements. 😄
/backport v22.1.x : my mistake.
Branch name "v22.1.x" not found.
/backport v22.2.x
Branch name "v22.2.x" not found.
Cover letter
rpk generate grafana-dashboard has a summary section that uses some metrics that don't exist in the new /public_metrics endpoint.
This PR creates a new summary section with the metrics available in the /public_metrics endpoint.
Fixes #5646
Grafana dashboard example: here
Backport Required
UX changes
Old:

Now:

We are keeping backward compatibility for the /metrics endpoint; the new panels and modified aggregation criteria apply to the new endpoint only:
Node Up and Partitions panels
These panels now query redpanda_cluster_brokers and redpanda_cluster_partitions.
Latency of Kafka Request
We are splitting produce and consume requests for each percentile. The panels now query redpanda_kafka_request_latency_seconds_bucket.
Throughput
Throughput is now calculated by sum(rate(redpanda_kafka_request_bytes_total[2m])) by (redpanda_request).
Changes that affect all panels
We are removing the shard label, along with its filtering and aggregation criteria, for the /public_metrics endpoint because the new metrics don't have a shard label.
Release notes
Improvements
rpk generate grafana-dashboard now supports the /public_metrics endpoint.
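As an illustration of the queries described under "UX changes", here is a minimal Go sketch that builds the throughput and latency expressions as strings. It is not the code from this PR. The metric names and the throughput expression come from the cover letter; the function names, the "produce" / "consume" label values, the chosen percentiles, and aggregating the latency only by le are assumptions made for the example.

```go
package main

import "fmt"

// Metric names as given in the cover letter.
const (
	latencyBucketMetric = "redpanda_kafka_request_latency_seconds_bucket"
	throughputMetric    = "redpanda_kafka_request_bytes_total"
)

// throughputExpr returns the throughput query quoted in the cover letter,
// parameterized on the rate window (the cover letter uses 2m).
func throughputExpr(window string) string {
	return fmt.Sprintf("sum(rate(%s[%s])) by (redpanda_request)", throughputMetric, window)
}

// latencyExpr builds a histogram_quantile query for a single request type.
// Assumption: the request type is selected via the redpanda_request label,
// and the series are aggregated by le only.
func latencyExpr(quantile float64, request, window string) string {
	return fmt.Sprintf(
		"histogram_quantile(%g, sum(rate(%s{redpanda_request=%q}[%s])) by (le))",
		quantile, latencyBucketMetric, request, window,
	)
}

func main() {
	fmt.Println(throughputExpr("2m"))

	// One latency query per percentile and request type; the percentiles
	// and label values below are illustrative only.
	for _, q := range []float64{0.95, 0.99} {
		for _, req := range []string{"produce", "consume"} {
			fmt.Println(latencyExpr(q, req, "2m"))
		}
	}
}
```

Because the public metrics have no shard label, none of these expressions filter or aggregate on shard.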