Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support experimental querier component #267

Open
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

kavirajk
Copy link

This shouldn't impact any of the existing uses cases of helm chart.
You can enable querier (Deployment) by have overrides values.experimental.querier.enabled flag.
You can find example overrides here.

Querier takes mainly three dependencies

  1. Addrs to talk to core weaviate (for schema info v1/schema)
  2. Object storage (e.g: minio, s3 or gcs address)
  3. Vectorizer address (to vectorize query string for example)

Changes

  1. Querier deployment
  2. Querier service that exposes it's gPRC and http (currently only metrics server) servers.

set `experimental.querier.enabled=true`

Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
@kavirajk kavirajk marked this pull request as ready for review October 15, 2024 16:01
kind: Deployment
metadata:
name: querier
labels:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that this is the first of many new "Weaviate components", have you though on adding a label to each of these components that will allow performing actions on all of them at once? Let's say, if we will have the querier, the batcher, the schema_handler (all invented :-D), if you want to monitor or perform some action on them at once (they are independent deployments) you can leverage labels by assigning a common label to the three deployments, for example "weaviate: serverless" and print all those pods with:

kubectl get pods -l weaviate=serverless

Of course ,this is something you can add later, it's just a suggestion, not anything blocking for the review.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very good idea. using labels to group these new components. I'm happy to add later as we add more components. Just don't want to add anything that's is not absolute necessary for now. Just to keep things simple (plus not a big fan of the name serverless tbh :P and things can change in the future ).

app: querier
app.kubernetes.io/name: querier
spec:
replicas: {{ .Values.replicas }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will always the number of replicas for the querier be associated to the number of total replicas? I mean, couldn't it happen that there will be more queriers than any other subcomponent? If that's the case, it might be interesting allowing overriding with a :
{{.Values.experimental.querier.replicas}}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely. That's copy-paste error. We should have separate replicas for querier. Will add it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

name: {{ $secret }}
key: {{ $key }}
{{- end }}
{{- end }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unable to match this end to any other opening statement, is that correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        {{- if or $.Values.experimental.querier.env $.Values.experimental.querier.envSecrets }}
          {{- range $key, $value := $.Values.env }}
          - name: {{ $key }}
            value: {{ $value | quote }}
          {{- end }}
          {{- range $key, $secret := $.Values.experimental.querier.envSecrets }}
          - name: {{ $key }}
            valueFrom:
              secretKeyRef:
                name: {{ $secret }}
                key: {{ $key }}
          {{- end }}
          {{- end }}

two end for two range and one end for outer most if. I copy pasted this from existing statefulset btw :)

{{- end }}
resources:
{{ toYaml .Values.resources | indent 10 }}
env:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antas-marcin isn't there any way to isolate such a long env section? maybe moving it to _helpers? we do that for Weaviate's sts, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copy pasted this from Weaviate sts. Don't see any helpers there as well.

But this should be actually .Values.experimental.querier.resources instead. I will update it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfrancoa yeah, good idea, since now we will have 2 weaviate deployments we can think of unifying those settings. I will add this task to our board @jfrancoa

replicas: {{ .Values.replicas }}
updateStrategy:
{{ toYaml .Values.updateStrategy | indent 4}}
serviceName: {{ .Values.experimental.querier.service.name }}-headless
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked GPT:

In Kubernetes, the serviceName field is not typically used directly in a Deployment object. Instead, it is more commonly found in the context of StatefulSets or Headless Services and some other resources.

However, here are some scenarios where serviceName is relevant:

1. StatefulSet

In the context of a StatefulSet, the serviceName field is used to associate a Headless Service with the StatefulSet. This ensures that the StatefulSet pods are discoverable by a stable DNS name based on their ordinal index (e.g., pod-0, pod-1, etc.). Each pod gets a unique DNS entry such as <pod-name>.<serviceName>, which is useful for stateful applications like databases.

Was it added intentionally or was it just a copy-paste from the Weaviate's sts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right. It's copy pasted. I don't think we need this. Will remove.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

storage_size: 10Gi
pullPolicy: IfNotPresent
command: ["/bin/weaviate"]
args:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we don't pass these args? I mean, could we provide some default values so that the querier would work out of the box? Mostly to improve the user experience and avoid requesting the user who deploys to know about the weaviate's architecture internals. I know this is a first PR to test the new querier, but for example, the contextionary endpoint should be added conditionally if contextionary is configured only. Otherwise it will fail when it won't be enabled as a module.
Also, the monitoring seems a bit redundant, don't we have an env var when we want monitoring to be working (PROMETHEUS_BLABLA_ENABLED)? we should enable this only if that env var is enabled, imho

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we provide some default values so that the querier would work out of the box?

Absolutely we should. I thought I had some sane default values. But that doesn't work well with our helm chart values. I will update it 👍

I know this is a first PR to test the new querier, but for example, the contextionary endpoint should be added conditionally if contextionary is configured only. Otherwise it will fail when it won't be enabled as a module.

Currently querier has hardcoded dependency with contextionary unfortunately. We will remove it in the future and I will update the conditional loading of modules. Very good catch.

Also, the monitoring seems a bit redundant, don't we have an env var when we want monitoring to be working (PROMETHEUS_BLABLA_ENABLED)? we should enable this only if that env var is enabled, imho

Good point. Currently querier doesn't respect the ENV PROMETHEUS_MONITORING_ENABLED. I think we should (I can take it up in the follow up PR if that's ok?).

Having said that I have some opinions about configs :) I like to keep this cli-flags as well. IMHO cli-flags should take highest priority than ENVs (and every config should be able to pass via cli-flags in addition to ENVs). Not a big fan of ENV based configs tbh, in my experience it's hard to manage at scale (10s is fine, 100s is crazy). Prefer cli or file based config instead. Main rationale is I want my configs to be "local" and "explicit". ENVs are by design "global" and "implicit" (any other process that share same pod, shell can set different values).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely we should. I thought I had some sane default values. But that doesn't work well with our helm chart values. I will update it 👍

I think we still cannot provide sane default for some flags like schema_url, contextionary_url and minio_url. Because this is totally different when running in k8s, all those will be local service name.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed some args to make it work with defaults.

This PR also changes port mapping 9091 (used by prometheus server in our setup) to 7071
weaviate/weaviate#6043

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@weaviate-git-bot
Copy link

To avoid any confusion in the future about your contribution to Weaviate, we work with a Contributor License Agreement. If you agree, you can simply add a comment to this PR that you agree with the CLA so that we can merge.

beep boop - the Weaviate bot 👋🤖

PS:
Are you already a member of the Weaviate Slack channel?

Signed-off-by: Kaviraj <[email protected]>
@kavirajk kavirajk requested a review from jfrancoa October 17, 2024 09:57
apiVersion: apps/v1
kind: Deployment
metadata:
name: querier
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefix all of our services with weaviate so the name that I suggest here would be then weaviate-querier and all of the labels also should have that prefix to easily find "our" services / deployments

metadata:
name: {{ .Values.experimental.querier.service.name }}
labels:
app.kubernetes.io/name: weaviate
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we agree on weaviate-querier naming? to be used throughout all of those files?

{{- end }}
{{- end }}
- name: CLUSTER_JOIN
value: {{ .Values.service.name }}-headless.{{ .Release.Namespace }}.svc.{{ .Values.clusterDomain }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we use a separate headless service for weaviate-querier deployment? how about scaling this service? and the raft response from this service? should querier be aware of all Weaviate pods?

@@ -0,0 +1,456 @@
{{ if .Values.experimental.querier.enabled }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kavirajk can we add some unit tests? for you deployment?

querier:
enabled: false
replicas: 3
image: semitechnologies/weaviate-experimental:preview-chore-change-default-configs-on-querier-23d39eb
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we follow the same pattern for defining the Weaviate image as we do it in the Weaviate STS:
Screenshot 2024-10-17 at 15 33 27

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants