
Missing service definition for metrics/prometheus endpoint #121

Closed
samos123 opened this issue Feb 11, 2023 · 8 comments

@samos123
Contributor

Prometheus metrics are exposed on port 2112 by default when env.PROMETHEUS_MONITORING_ENABLED: true is set. However, the Weaviate Service definition has a hardcoded targetPort of 8080. So either the Weaviate Service should also expose the Prometheus port, or we should create a separate Service definition for Prometheus.
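
For reference, a minimal values.yaml sketch of how this gets enabled, based only on the env key mentioned above (everything else about the chart is omitted):

env:
  PROMETHEUS_MONITORING_ENABLED: true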

@samos123
Contributor Author

Actually, it makes sense the way it is, because each Weaviate replica will have its own metrics endpoint accessible under weaviate-$X:2112/metrics. I will close the bug.

@jfcmartins

Hey @samos123 👋 I came across the same issue and I don't understand what you mean when you say it makes sense the way it is. The targetPort is hardcoded to 8080, so the metrics will not be available for Prometheus to scrape through the Weaviate Service.

What solves this issue is to change the Service ports spec to this:

spec:
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 8080
    - name: metrics
      protocol: TCP
      port: 2112
      targetPort: 2112

Am I missing something?

@samos123
Contributor Author

samos123 commented Dec 7, 2023

Each Weaviate instance (when there is more than one replica) needs to have its own accessible metrics endpoint. The Weaviate replicas each expose the metrics on 2112. So you should access the metrics for your first replica using curl http://weaviate-0:2112/metrics

Give it a try. It should work.
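
For example, one way to check a single replica's endpoint locally (pod name and port taken from the thread above; adjust to your namespace):

kubectl port-forward pod/weaviate-0 2112:2112
curl http://localhost:2112/metrics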

@jfcmartins

Each Weaviate instance (when there is more than one replica) needs to have its own accessible metrics endpoint. The Weaviate replicas each expose the metrics on 2112. So you should access the metrics for your first replica using curl http://weaviate-0:2112/metrics

Give it a try. It should work.

The issue lies not in the availability of the metrics endpoint within the Pods but rather in the capability of the corresponding Service to expose the metrics generated by these Pods. To address this, the Service needs to be configured to expose two ports, as demonstrated in the example I provided in the initial comment.

@samos123
Contributor Author

samos123 commented Dec 12, 2023

My view is that the metrics endpoint should be removed from the Service definition. That should remove the confusion that you and I both shared about thinking we should use the Service endpoint to query metrics. Note that you should not use the Service endpoint; you should use the metrics endpoint of each individual pod.

I'm clearly not doing a good job of explaining this either. Maybe a question will help: if you have 3 replicas and you query metrics using the Service endpoint, how would you know whether the metrics are from replica 1, replica 2, or replica 3?

@jfcmartins

If you have 3 replicas and you query metrics using the Service endpoint, how would you know whether the metrics are from replica 1, replica 2, or replica 3?

To distinguish metrics from different replicas when querying through the Service endpoint, we would use Prometheus labels.

Example:

weaviate_requests_total{instance="weaviate-0", method="GET"}  100
weaviate_requests_total{instance="weaviate-1", method="GET"}  150
weaviate_requests_total{instance="weaviate-2", method="GET"}  120

In this example, the instance label identifies each Weaviate replica. In the Prometheus stack currently in use, these metrics are typically collected at the Service level rather than directly from individual pods.
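
As an illustration, a hypothetical PromQL query against the example counter above that keeps the per-replica breakdown:

sum by (instance) (rate(weaviate_requests_total[5m]))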

@samos123
Contributor Author

I guess that does make sense as well if you have 3 replicas. What if you have 1000 replicas?

Say you pull metrics from the Service endpoint every 10 seconds. Then every 10 seconds only 1 of the 1000 replicas will have its metrics pulled into Prometheus. That would be problematic. What you would want instead is that every 10 seconds all 1000 replicas have their metrics endpoints scraped. If you use the Service endpoint, it will do round-robin load balancing and you will probably miss out on metrics if you have 1000 replicas.

@jfcmartins

I guess that does make sense as well if you have 3 replicas. What if you have 1000 replicas?

Say you pull metrics from the Service endpoint every 10 seconds. Then every 10 seconds only 1 of the 1000 replicas will have its metrics pulled into Prometheus. That would be problematic. What you would want instead is that every 10 seconds all 1000 replicas have their metrics endpoints scraped. If you use the Service endpoint, it will do round-robin load balancing and you will probably miss out on metrics if you have 1000 replicas.

You're correct. In a scenario with 1000 replicas, relying on the Service endpoint alone for metric collection might lead to missed data due to round-robin load balancing. To ensure that metrics are collected from all replicas within the 10-second window, a more scalable approach would be needed. One solution in that case is to set up the Prometheus server to scrape metrics from each replica, as you mentioned; a sketch of what that could look like is below.
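
For illustration only, a minimal Prometheus scrape_configs sketch that discovers every pod and scrapes it directly; the job name, the app label, and the relabeling rules are assumptions for this example, not what the chart or the PR actually ships:

scrape_configs:
  - job_name: weaviate-pods
    kubernetes_sd_configs:
      - role: pod   # discover individual pods, not the Service VIP
    relabel_configs:
      # keep only Weaviate pods; label name/value are illustrative
      - action: keep
        source_labels: [__meta_kubernetes_pod_label_app]
        regex: weaviate
      # point the scrape target at the metrics port (2112) on each pod IP
      - source_labels: [__address__]
        regex: ([^:]+)(?::\d+)?
        replacement: $1:2112
        target_label: __address__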

Anyway, I've added support for this feature in PR #197.
