feat(#144): additional configuration for cht-user-management #146

kennsippell · 2025-03-12T10:21:37Z

cc @ernestoteo since he was working on UMT grafana monitoring

mrjones-plip · 2025-03-13T19:11:18Z

exporters/cht-user-management/scrape.yml

@@ -0,0 +1,7 @@
+scrape_configs:
+  - job_name: cht-user-management-metrics
+    metrics_path: /metrics


The default path is /metrics so we can drop this for simplicity:

Suggested change: remove this line

metrics_path: /metrics

Per the other comment, I'm still seeing metrics_path: /metrics in there, so I assume local changes weren't pushed.

mrjones-plip · 2025-03-13T19:23:28Z

@kennsippell - yay! Love the simplicity of the request here - just enough, not too much. Thanks for opening the PR!

That said, there's a disconnect: the config assumes metrics are on /metrics, but they're actually on two different URLs, /bullmq-metrics and /fastify-metrics. If it's not too late (I know, the UMT code has already shipped), I suggest we make two changes to the upstream UMT repo to complement this watchdog PR:

concat both outputs to one URL instead of having two separate ones
make that URL be /metrics

The reason is that the two URLs introduce a lot of complexity in how you monitor them. Further, Prometheus assumes that you're going to use /metrics, and it's a minor hassle to keep track of which custom URLs apps use when they deviate from the norm.
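
To illustrate the complexity: a Prometheus scrape job takes a single metrics_path (short of relabeling), so scraping both UMT URLs needs roughly two jobs per setup, as in this sketch. The job names are hypothetical, and the target address is borrowed from the example later in this thread.

```yaml
# Sketch only: job names are hypothetical, not taken from the PR.
scrape_configs:
  # Each custom URL needs its own job because metrics_path is set per job.
  - job_name: cht-user-management-bullmq
    metrics_path: /bullmq-metrics
    static_configs:
      - targets:
          - 172.17.0.1:3500
  - job_name: cht-user-management-fastify
    metrics_path: /fastify-metrics
    static_configs:
      - targets:
          - 172.17.0.1:3500
```

With one combined endpoint on /metrics, this would collapse to a single job with no metrics_path line at all.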

(sorry - forgot to include this summary on my request for changes)

kennsippell · 2025-03-14T08:41:56Z

concat both outputs to one URL instead of having two separate ones

I didn't realise this was the best practice for Prometheus, but you're right. I'll make that change.

mrjones-plip

Hey again! I wanted to actually test the dashboard you created in the PR. Once installed, the dashboard works great - nice work!

That said, I had some issues getting the compose file to work with the new dashboard json file and made a suggestion on how to simplify things.

grafana/provisioning/dashboards/UMT/umt_server.json

mrjones-plip

Hello again! I did some more testing to ensure two things:

system scales with N instances
all panels are getting data

system scales with N instances

This works great! I stood up a second UMT instance on port 3501 and added it to scrape.yml like this:

scrape_configs:
  - job_name: cht-user-management-metrics
    metrics_path: /fastify-metrics
    static_configs:
      - targets:
          - 172.17.0.1:3500
          - 172.17.0.1:3501

And I was able to see the two discretely in watchdog:

However, this made me realize that, by definition, we're going to be editing scrape.yml to add whichever instances we want to monitor. This means that if the file changes upstream (like metrics_path: /fastify-metrics -> metrics_path: /metrics), there will be a conflict in this file and you won't be able to do a clean git pull origin or similar call.

I suggest we rename scrape.yml to something like ./exporters/cht-user-management/scrape-dist.yml and then instruct people to copy it.
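
A rough sketch of what such a scrape-dist.yml could contain, assuming the metrics end up on the default /metrics path and that the local copy is kept out of version control. The placeholder target and the comments are illustrative, not taken from the PR.

```yaml
# exporters/cht-user-management/scrape-dist.yml (proposed name)
# Copy this file to scrape.yml and edit the copy, so that upstream changes
# to this template do not conflict with local edits.
scrape_configs:
  - job_name: cht-user-management-metrics
    # metrics_path is omitted because Prometheus defaults to /metrics.
    static_configs:
      - targets:
          # Placeholder: replace with one host:port entry per UMT instance.
          - umt.example.org:3500
```

The template stays under revision control, while each deployment's list of targets lives only in the copied file.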

all panels are getting data

There are a number of panels that are not getting data. The top half is missing data for:

Transferred
Transfer Rate
Number of Open connections

On the bottom half:

Response Latency (90th Percentile), Top 5 Request Duration and Top 5 Request Count only ever showed /fastify-metrics. In the worst case, this could mean that all the stats gathered via the /fastify-metrics URL only cover the /fastify-metrics endpoint itself. In the best case, it's just these 3 panels. We should double check that all these stats monitor the entire UMT app and not just the one Fastify endpoint.
Top Error Duration has no data - this may be correct because there are no errors?
Top Error Count has no data - this may be correct because there are no errors?
Response Size has no data
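
One way to check whether the request metrics cover more than the metrics endpoint is to compute the latency quantile per route in Prometheus itself. The recording rule below is only a sketch: it assumes fastify-metrics exposes a histogram named http_request_duration_seconds with a route label, which should be confirmed against the actual /fastify-metrics output. The group and rule names are hypothetical.

```yaml
groups:
  - name: umt-latency-check   # hypothetical group name
    rules:
      # 90th percentile request latency per route over the last 5 minutes.
      # The metric and label names are assumptions, see the note above.
      - record: umt:http_request_duration_seconds:p90
        expr: |
          histogram_quantile(
            0.90,
            sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

If /fastify-metrics is the only route in the result, that would suggest the instrumentation is missing the app's other routes, rather than the panels being misconfigured.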

kennsippell · 2025-03-17T11:06:18Z

I suggest we rename scrape.yml to something like ./exporters/cht-user-management/scrape-dist.yml and then instruct people to copy it.

I had been proposing scrape-custom.yml here, to match the naming and setup procedure from the postgres exporter.

Further, Prometheus assumes that you're going to use /metrics and it's a minor hassle to keep track of which custom URLs apps use

Ready here: medic/cht-user-management#278

all panels are getting data

There appears to be a bug in fastify-metrics around default argument values. The workaround is to set the defaults explicitly. I also think we were setting summary: false, which I don't think is necessary. Good catch, thanks!

mrjones-plip · 2025-03-17T16:59:01Z

I had been proposing scrape-custom.yml here, to match the naming and setup procedure from the postgres exporter.

Ah, sure enough - we do indeed tell folks to edit the revision-controlled files. OK - let's fix this another time then. OK to leave as is for now!

mrjones-plip

I think you have local changes not yet pushed! Spot checking, I see nothing has changed.

mrjones-plip · 2025-03-17T17:00:31Z

exporters/cht-user-management/scrape.yml

@@ -0,0 +1,7 @@
+scrape_configs:
+  - job_name: cht-user-management-metrics
+    metrics_path: /metrics


Per the other comment, I'm still seeing metrics_path: /metrics in there, so I assume local changes weren't pushed.

mrjones-plip

Looking good! Just one quick file name change to match things up.

exporters/cht-user-management/scrape.yml

mrjones-plip

I think this is good to go! Thanks for iterating with me.

We should merge this and the other UMT PR!

medic-ci · 2025-03-19T08:58:03Z

🎉 This PR is included in version 1.18.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

kennsippell added 2 commits March 12, 2025 12:13

Scrape config and a first dashboard

519903b

Renaming

8992fc2

kennsippell requested a review from mrjones-plip March 12, 2025 10:21

kennsippell mentioned this pull request Mar 12, 2025

Monitoring for cht-user-management instances #144

Open

kennsippell changed the title ~~Additional Configuration - cht-user-management~~ feat(#144): additional configuration for cht-user-management Mar 12, 2025

mrjones-plip requested changes Mar 13, 2025

mrjones-plip requested changes Mar 15, 2025

grafana/provisioning/dashboards/UMT/umt_server.json (outdated)

mrjones-plip requested changes Mar 16, 2025

kennsippell requested a review from mrjones-plip March 17, 2025 11:06

mrjones-plip mentioned this pull request Mar 17, 2025

Single endpoint for all prometheus metrics medic/cht-user-management#278

Merged

mrjones-plip requested changes Mar 17, 2025

Missing <3

c3c82b9

mrjones-plip requested changes Mar 17, 2025

exporters/cht-user-management/scrape.yml

mrjones-plip approved these changes Mar 18, 2025

kennsippell merged commit 8cf360b into main Mar 19, 2025
5 checks passed

kennsippell deleted the 144-umt branch March 19, 2025 08:57

medic-ci added the released label Mar 19, 2025
