
"Job not found" errors #37

Open
tombh opened this issue Jun 25, 2019 · 12 comments

Comments

@tombh

tombh commented Jun 25, 2019

I intermittently get errors like this that stop the worker from processing other jobs:
16:19:38.904 [warn] [faktory] fail failure: Job not found a82e98fb13746eca74755f5d -- retrying in 32.138s

I suspect it's because I have a large backlog of jobs and weird things happen when I deploy new versions of the worker app (I mean new versions of my code, not new versions of its deps) and it has to reconnect to the manager.

So how can a job not be found? I thought there was only one canonical reference to a job. And why should such an error completely stop the worker from doing anything else even though its concurrency is set to 30 or so?

I'm using the Faktory Docker image, the faktory binary there reports version 1.0.1. I'm using the latest faktory_worker_ex client, v0.7.0.

@mperham

mperham commented Jun 25, 2019

"Job not found" is returned if a worker calls FAIL <jid> for a job but Faktory does not have an existing reservation for that JID. That can happen if the job takes longer than the reservation timeout, which is 30 minutes by default, so the reservation expired and was garbage collected.
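If a job can legitimately run longer than 30 minutes, the documented fix is the `reserve_for` field in the job payload, which extends the reservation window (in seconds). A rough sketch of building such a payload; the `build_job` helper and the `SlowHttpJob` job type are made up for illustration, but `reserve_for` and the payload shape are Faktory's documented fields:

```python
import json
import secrets

def build_job(jobtype, args, reserve_for=7200):
    """Build a Faktory job payload with an extended reservation.

    reserve_for is the number of seconds the server reserves the job
    for a worker before considering it lost; the server default is
    1800 (30 minutes).
    """
    return {
        "jid": secrets.token_hex(12),  # 24-char hex job id, like Faktory clients generate
        "jobtype": jobtype,
        "args": args,
        "reserve_for": reserve_for,
    }

payload = build_job("SlowHttpJob", ["https://example.com"])
print(json.dumps(payload))
```

With a payload like this, a job that runs up to two hours would no longer have its reservation garbage collected at the 30-minute mark.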

@tombh
Author

tombh commented Jun 25, 2019

Are there circumstances in which a job can be perceived to have taken longer than 30 minutes? Of course there's a chance that the job itself genuinely takes that long, say if it's doing something very hard, but I'm wondering whether jobs might appear to have taken a long time if I stop a queue for DB maintenance or something.

But of course the main point is still that the worker is just choking on this error and not doing anything else.

@cjbottaro
Owner

Hi @tombh. I was not aware that the FAIL command could return "Job not found"; there isn't much documentation on the Faktory protocol: https://github.com/contribsys/faktory/wiki/Worker-Lifecycle#report-result . So I thought there were only two possible outcomes: "OK" or a network error.

It would be super nice to have a "protocol" wiki page for all the commands!

@mperham Thanks for the explanation! I wouldn't mind taking a stab at writing up the protocol docs if you could link me the Go code.

@tombh, as a quick fix, I'll publish 0.7.2 tonight which handles that case (and I guess just logs a warning when it happens). But if you don't think your jobs are taking over 30 mins, then it's a bit worrisome.

@mperham

mperham commented Jun 25, 2019

@cjbottaro Faktory uses the RESP protocol. You should handle any -ERR response as a protocol error.

@mperham

mperham commented Jun 25, 2019

-ERR indicates a generic error; there can also be more specific error codes which indicate specific conditions that the worker might want to respond to. One example is -NOTUNIQUE, returned when pushing a new job that violates Faktory Pro's job uniqueness feature.
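A client that wants to react to specific conditions can split an error reply into its code and message by peeling off the leading `-` and the first token. A minimal sketch in Python; `parse_error_reply` is a hypothetical helper, not part of any Faktory client:

```python
def parse_error_reply(line):
    """Split a RESP error reply such as '-ERR Job not found\r\n'
    into a (code, message) tuple. A generic error has code 'ERR';
    specific codes like 'NOTUNIQUE' signal conditions the client
    may want to handle rather than treating them as failures."""
    if not line.startswith("-"):
        raise ValueError("not a RESP error reply")
    code, _, message = line[1:].rstrip("\r\n").partition(" ")
    return code, message

print(parse_error_reply("-ERR Job not found a82e98fb13746eca74755f5d\r\n"))
print(parse_error_reply("-NOTUNIQUE Job not unique\r\n"))
```

Dispatching on the code rather than on the whole message keeps the client robust if the server's error text changes.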

@cjbottaro
Owner

Awesome, thank you!

@cjbottaro
Owner

@tombh Can you try out the fail-and-ack-errors branch? It's kind of hard to test this since Elixir's mocking and stubbing isn't as loose as in some other languages. The current test suite passes on that branch though.

@tombh
Author

tombh commented Jun 26, 2019

Sure! I've got a queue of 1.9 million jobs, I'll get your branch deployed and leave it running all day.

@cjbottaro
Owner

cjbottaro commented Jun 26, 2019

Sure! I've got a queue of 1.9 million jobs, I'll get your branch deployed and leave it running all day.

Wow, that's a lot of jobs.

And why should such an error completely stop the worker from doing anything else even though its concurrency is set to 30 or so?

For posterity, it's because I was not aware that Faktory's protocol was RESP.

The code assumes any error talking to the Faktory server is due to a networking issue. And since the code uses Connection, it also assumes the connection will self-heal, and thus it retries any failed communication with the Faktory server indefinitely (with capped exponential backoff).
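Capped exponential backoff of that sort can be sketched as follows; the base and cap values here are illustrative, not the library's actual constants:

```python
def backoff_ms(attempt, base_ms=500, cap_ms=32_000):
    """Delay in milliseconds before retry number `attempt` (0-based):
    the delay doubles with each failed attempt but is capped, so a
    long outage never produces unboundedly long waits."""
    return min(cap_ms, base_ms * (2 ** attempt))

for attempt in range(8):
    print(attempt, backoff_ms(attempt))
```

With these values the delays run 500 ms, 1 s, 2 s, 4 s, and so on, flattening out at 32 s once the cap is reached, which matches the "retrying in 32.138s" seen in the original log line (the fractional part suggests some jitter is added on top).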

@tombh
Author

tombh commented Jun 26, 2019

Ah, I see, that makes sense. Thanks so much for the quick fix.

I've got it deployed now and it's handled a few thousand jobs already without a problem. I'd say there were about 4 of these "Job not found" errors in the last 24 hours, so by this time tomorrow we should know if your branch is a good fix.

I forgot to answer this:

But if you don't think your jobs are taking over 30 mins, then it's a bit worrisome

I haven't looked into this closely, but given that all of these jobs involve HTTP requests, it's as good as certain that the 30-minute limit is being hit at some point.

@tombh
Author

tombh commented Jun 27, 2019

Ok, so after 24 hours and a few hundred thousand jobs I haven't seen any problems :)
