"Job not found" errors #37
I intermittently get errors like this that stop the worker from processing other jobs:

16:19:38.904 [warn] [faktory] fail failure: Job not found a82e98fb13746eca74755f5d -- retrying in 32.138s

I suspect it's because I have a large backlog of jobs, and weird things happen when I deploy new versions of the worker app (new versions of my code, not new versions of deps) and it has to reconnect to the manager.

So how can a job not be found? I thought there was only one canonical reference to a job. And why should such an error completely stop the worker from doing anything else, even though its concurrency is set to 30 or so?

I'm using the Faktory Docker image; the faktory binary there reports version 1.0.1. I'm using the latest faktory_worker_ex client, v0.7.0.

Comments
Hi @tombh. I was not aware that the Faktory protocol was RESP. It would be super nice to have a "protocol" wiki page for all the commands!

@cjbottaro Faktory uses the RESP protocol. You should handle any error replies from the server rather than treating every failure as a network problem.

@mperham Thanks for the explanation! I wouldn't mind taking a stab at writing up the protocol docs if you could link me the Go code.

@tombh, as a quick fix, I'll publish 0.7.2 tonight which handles that case (and I guess just logs a warning when it happens). But if you don't think your jobs are taking over 30 mins, then it's a bit worrisome.

Are there circumstances in which a job can be perceived to have taken longer than 30 minutes? Of course there's a chance that the job itself takes that long, say if it's doing something very hard, but I'm wondering if jobs might somehow be seen to have taken a long time if I stop a queue for DB maintenance or something. But of course the main point is still that the worker is just choking on this error and not doing anything else.
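(For anyone following along: a minimal sketch of the distinction @mperham is making, in Elixir since this is an Elixir client. The module and socket setup here are hypothetical, not faktory_worker_ex's actual internals; it just assumes a socket opened in line mode.)

```elixir
# Hypothetical sketch: telling RESP error replies apart from transport errors.
# Assumes: {:ok, socket} = :gen_tcp.connect(host, port,
#   [:binary, packet: :line, active: false])
defmodule RespSketch do
  def read_reply(socket) do
    case :gen_tcp.recv(socket, 0) do
      # "+OK\r\n": a successful simple-string reply.
      {:ok, "+" <> rest} -> {:ok, String.trim_trailing(rest)}
      # "-ERR Job not found ...\r\n": the server answered, so the connection
      # is healthy; blindly retrying the same command won't help.
      {:ok, "-" <> rest} -> {:server_error, String.trim_trailing(rest)}
      # Only this branch is a networking problem worth a reconnect-and-retry.
      {:error, reason} -> {:network_error, reason}
    end
  end
end
```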
Awesome, thank you!
@tombh Can you try out the branch with the fix?
Sure! I've got a queue of 1.9 million jobs; I'll get your branch deployed and leave it running all day.
Wow, that's a lot of jobs.
For posterity, it's because I was not aware that Faktory's protocol was RESP. The code assumes any error talking to the Faktory server is due to a networking issue. And since the code uses Connection, it also assumes the connection will self-heal, and thus it retries any failed communication with the Faktory server indefinitely (with capped exponential backoff).
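(Aside: "capped exponential backoff" as described there fits in a few lines. This is a generic sketch; the base, cap, and jitter constants are made up, not the library's actual values.)

```elixir
# Generic capped exponential backoff; all constants are hypothetical.
defmodule BackoffSketch do
  @base_ms 1_000
  @cap_ms 30_000

  # The delay doubles per attempt up to the cap; random jitter keeps a fleet
  # of workers from hammering the server in lockstep after an outage.
  def delay_ms(attempt) when attempt >= 0 do
    capped = min(@base_ms * Integer.pow(2, attempt), @cap_ms)
    capped + :rand.uniform(div(capped, 4))
  end
end
```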
Ah, I see, that makes sense. Thanks so much for the quick fix. I've got it deployed now and it's handled a few thousand jobs already without a problem. I'd say there were about 4 of these "Job not found" errors in the last 24 hours, so by this time tomorrow we should know if your branch is a good fix. I forgot to answer this:
I haven't looked into this closely, but out of all these jobs running, all involving HTTP requests, it's as good as certain that the 30-minute limit is being hit at some point.
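(For reference: the 30 minutes is Faktory's default job reservation, and the job payload accepts a reserve_for field, in seconds, to extend it for long-running work. A sketch of such a payload follows; the jobtype and args are hypothetical, and whether faktory_worker_ex 0.7.x exposes this option is an assumption.)

```elixir
# A Faktory job payload with an extended reservation. reserve_for is part of
# Faktory's documented payload; everything else here is a made-up example.
job = %{
  "jid" => "a82e98fb13746eca74755f5d",
  "jobtype" => "MyHttpWorker",
  "args" => ["https://example.com"],
  "queue" => "default",
  # 2 hours instead of the 1800-second (30-minute) default, so slow HTTP
  # jobs aren't treated as dead while still running.
  "reserve_for" => 7_200
}
```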
OK, so after 24 hours and a few hundred thousand jobs, I haven't seen any problems :)