K8s worker fails monitoring flow and sets it to crashed. #14954
Comments
Thanks for the issue @meggers! I'm trying to reproduce it, but I haven't been able to do so yet. Do you have any additional logs or tracebacks that you can share?
Thank you @desertaxle! The worker logs don't have anything more than what I added under Additional context. If there is a setting that would expose more information from the logs, such as a stack trace, I'm definitely happy to try to get more info!
Hmmm, it looks like we're light on logging in the worker. I'll need to add some logging to get more info on the failure. While I'm in there, I'll see if I can add some error handling and retries to the watch logic. I'll respond here once there's a new version to try!
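For reference, a rough sketch of what retrying the watch loop could look like with kubernetes_asyncio. Everything here is illustrative: the function name, label selector, retry policy, and logging are assumptions, not the worker's actual implementation.

```python
import asyncio
import logging

import aiohttp
from kubernetes_asyncio import client, watch
from kubernetes_asyncio.client import ApiException

logger = logging.getLogger(__name__)


async def watch_job_pod(namespace: str, job_name: str, max_retries: int = 5) -> str:
    """Watch the pods of a job and return the terminal phase.

    Assumes kube config has already been loaded (e.g. via
    kubernetes_asyncio.config.load_incluster_config()).
    """
    v1 = client.CoreV1Api()
    retries = 0
    while True:
        try:
            w = watch.Watch()
            async for event in w.stream(
                v1.list_namespaced_pod,
                namespace=namespace,
                label_selector=f"job-name={job_name}",
            ):
                retries = 0  # a successful event means the connection is healthy
                pod = event["object"]
                if pod.status.phase in ("Succeeded", "Failed"):
                    return pod.status.phase
            # The stream can also end cleanly when the server closes the
            # connection; loop around and re-establish the watch instead of
            # treating that as a crash.
        except (ApiException, aiohttp.ClientError, asyncio.TimeoutError) as exc:
            retries += 1
            if retries > max_retries:
                raise
            # Log the disconnect instead of silently marking the flow run crashed
            logger.warning("Pod watch dropped (%r); retry %d/%d", exc, retries, max_retries)
            await asyncio.sleep(min(2 ** retries, 30))
```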
Thank you! We are eager to resolve this, so let us know!
Other users are reporting the same error. We might have to handle this ourselves. See.
I wonder if it is something like this: tomplus/kubernetes_asyncio#235 (comment). I am following up internally to see what we might have in the middle, but it could be something like Istio. I wonder whether it is possible to configure the k8s aiohttp client with keepalives? Example in aio-libs/aiohttp#3904 (comment)
@meggers That's a good callout: the sync version of the worker did have this, but it doesn't seem like the async version does.
Makes sense. I was able to confirm we have a timeout of ~5 minutes in our mesh, which aligns with the timing we see with this issue. I think a keepalive would be helpful.
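For what it's worth, here is a minimal sketch of the socket-level TCP keepalive settings the aiohttp thread above discusses. The option values are illustrative, and how these would get applied to the sockets opened by the aiohttp connector underneath the Kubernetes client is version-dependent, so treat this as an assumption rather than a supported configuration knob.

```python
import socket

# Linux TCP keepalive knobs; the values below are examples, chosen so that
# probes are sent well inside a ~5 minute idle timeout in a service mesh.
KEEPALIVE_OPTIONS = (
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # enable keepalive
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),   # seconds idle before first probe
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),  # seconds between probes
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),     # failed probes before the OS drops the connection
)


def enable_keepalive(sock: socket.socket) -> None:
    """Apply TCP keepalive options to an already-connected socket.

    To have any effect, this would need to be called on the sockets the HTTP
    connector opens, e.g. via transport.get_extra_info("socket") once a
    connection is established.
    """
    for level, option, value in KEEPALIVE_OPTIONS:
        sock.setsockopt(level, option, value)
```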
I'm not sure about all of the teams experiencing this, but at least my team sees the following error immediately after the
Which looks like it might be related to this bit here:
When we run our job pods there are 3 containers inside a given pod, and it appears that on occasion this is checking on the Istio-Proxy container instead of the job container... For my jobs, when we encounter
Previously, when my team was running versions
While this error showed up in the worker logs and the UI, the flow was still marked as Running, though the UI would stop updating Subflow/Task status. Then, when the job container completed, the flow would be marked as Successful/Failed as appropriate and the UI would fill in. I'm not sure if the Istio-Proxy container being referenced in the new error message is actually relevant to the issue, or if it's something else and the worker is just further tripping up while trying to write out an error message.
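To illustrate the sidecar point: if the monitoring code inspects container statuses on the pod, it would need to look up the job container by name rather than taking the first entry, since Istio injects an istio-proxy container into the same pod. A hypothetical sketch follows; the default container name "prefect-job" and the helper names are assumptions, not a reading of the worker's code.

```python
from kubernetes_asyncio.client import V1Pod


def job_container_status(pod: V1Pod, container_name: str = "prefect-job"):
    """Return the status of the job container, ignoring injected sidecars.

    container_name is an assumption here; a real lookup would need to use
    whatever name the worker gave the job container in the job manifest.
    """
    for status in pod.status.container_statuses or []:
        if status.name == container_name:
            return status
    return None


def job_container_exit_code(pod: V1Pod, container_name: str = "prefect-job"):
    """Exit code of the job container if it has terminated, else None."""
    status = job_container_status(pod, container_name)
    if status and status.state and status.state.terminated:
        return status.state.terminated.exit_code
    return None
```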
Bug summary
We are seeing an issue with the latest k8s worker monitoring flow runs. After some amount of time (5-10 minutes?), we see the worker fail to monitor the flow run with the message provided in Additional context. The worker then marks the flow as crashed. The flow continues to run until it tries to update its task status, at which point the update fails because the run is already in a terminal state.
Version info (`prefect version` output)

Additional context