-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add kv-store connection check to readiness probe #135
base: main
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Legion2 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me and the changes look fine, but I wonder if @tjohnson31415 or @njhill could chime in to confirm the idea before merging especially since it doesn't seem like there are any tests related to readiness probe.
FYI @ckadner
Hmm, I'm not sure about this behavior being the default. As stated, ModelMesh is intended to handle downtime in etcd by continuing to serve requests for existing models (adding models and scaling would fail). If all pods reported as Unready when etcd goes down, then no requests would succeed. Even if the downtime extends over a longer period, I think it would be better to serve the requests it can until the system recovers. Where this change would help is in the situation described in the issue where one or two of the pods are disconnected from the KV store, but this situation is not easily distinguishable from a complete etcd downtime at the pod level. Merging this PR would prevent partial service in the case that etcd does go down. 🤔 That said, depending on the failure mode, perhaps reporting as |
Thanks for the feedback, I will add a feature flag for it. |
Signed-off-by: Leon Kiefer <[email protected]>
c1eda40
to
5aae970
Compare
How should I implement the feature flag, as command line argument or as environment variable? I had a quick look at the code and I don't understand how to add a new command line argument or if there is any util for using features flags from the environment variables. |
Motivation
When the modelmesh is not able to connect to the kv store to update its instance recording or sync with the other instances it can not reliably serve inference requests. For a short time a disconnect can be tolerated and the cached values can be used to serve requests. However after some time the data may be stale and the routing of requests may result in errors. For example with instance A and B, if A has connection issues while B leaves the mesh, A still have the outdated instance record and will still route inference requests to B, which fail. To prevent this, A should be marked unready if it can not connect to the kv store, to inform upstream proxies to not route traffic to A.
Modifications
Add existing
verifyKvStoreConnection
check toisReady
check.Result
isReady
will returnfalse
when the model mesh instance lost connection to the kv store. Allowing systems such as kubernetes to react to this condition.