This document describes some high-level design concepts.
The following diagram shows (almost) all the processes involved in a Worker.
```
                                +----------+
                                |          |
                     +----------|  runner  |<---------+
                     |          |          |          |
                     |          +----------+          |
                     v                                |
+---------+     +---------+     +----------+     +----------+
|         |     |         |     |          |     |          |
| faktory |<----| fetcher |<----|  runner  |<----| reporter |
|         |     |         |     |          |     |          |
+---------+     +---------+     +----------+     +----------+
                     ^                                |
                     |          +----------+          |
                     |          |          |          |
                     +----------|  runner  |<---------+
                                |          |
                                +----------+
```
- The fetcher asks the Faktory server for jobs.
- Runners ask the fetcher for jobs to run and emit a report on the success or failure of each job (see the sketch after this list).
- The reporter takes these reports and sends an ACK or FAIL message back to the Faktory server.
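The arrows roughly point in the direction of demand: each stage asks the stage it points at for work. As a minimal sketch (not the library's actual implementation), a demand-driven pipeline like this can be wired up with GenStage; the `Sketch.*` module names, the job shape, and the stubbed `fetch_job/1`, `run_job/1`, `send_ack/2`, and `send_fail/3` helpers are all illustrative assumptions:

```elixir
defmodule Sketch.Fetcher do
  use GenStage

  def start_link(conn), do: GenStage.start_link(__MODULE__, conn, name: __MODULE__)

  def init(conn), do: {:producer, conn}

  # Runners send demand upstream; fetch that many jobs from the server.
  def handle_demand(demand, conn) do
    jobs = Enum.map(1..demand, fn _ -> fetch_job(conn) end)
    {:noreply, jobs, conn}
  end

  # Stub standing in for a FETCH over the fetcher's connection.
  defp fetch_job(_conn), do: %{"jid" => "abc123", "jobtype" => "MyJob", "args" => []}
end

defmodule Sketch.Runner do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, nil, name: __MODULE__)

  def init(nil), do: {:producer_consumer, nil, subscribe_to: [{Sketch.Fetcher, max_demand: 1}]}

  # Run each job and emit a success/failure report downstream.
  def handle_events(jobs, _from, state) do
    reports =
      Enum.map(jobs, fn job ->
        case run_job(job) do
          :ok -> {:ack, job}
          {:error, reason} -> {:fail, job, reason}
        end
      end)

    {:noreply, reports, state}
  end

  # Stub standing in for actually executing the job.
  defp run_job(_job), do: :ok
end

defmodule Sketch.Reporter do
  use GenStage

  def start_link(conn), do: GenStage.start_link(__MODULE__, conn, name: __MODULE__)

  def init(conn), do: {:consumer, conn, subscribe_to: [Sketch.Runner]}

  # Turn each report into an ACK or FAIL on the reporter's connection.
  def handle_events(reports, _from, conn) do
    Enum.each(reports, fn
      {:ack, job} -> send_ack(conn, job["jid"])
      {:fail, job, reason} -> send_fail(conn, job["jid"], reason)
    end)

    {:noreply, [], conn}
  end

  defp send_ack(_conn, _jid), do: :ok
  defp send_fail(_conn, _jid, _reason), do: :ok
end
```

In the real design there are multiple runner stages all pulling from the same fetcher, which is what the diagram's several runner boxes represent.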
The number of jobs that can be processed concurrently is equal to the number of runners, which is set by the `concurrency` option.
A worker makes only 3 connections to the Faktory server, no matter what `concurrency` is set to:
- Fetcher (for fetching jobs)
- Reporter (for acking or failing jobs)
- Heartbeat (for sending a required keepalive message every 15 seconds)
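As an illustration of the heartbeat connection's job (assuming a hypothetical `send_beat/1` helper that writes the BEAT command over that connection), a keepalive loop can be as simple as a GenServer that reschedules itself:

```elixir
defmodule Sketch.Heartbeat do
  use GenServer

  @interval :timer.seconds(15)

  def start_link(conn), do: GenServer.start_link(__MODULE__, conn)

  def init(conn) do
    schedule_beat()
    {:ok, conn}
  end

  def handle_info(:beat, conn) do
    # Hypothetical stand-in for sending BEAT over the dedicated connection.
    :ok = send_beat(conn)
    schedule_beat()
    {:noreply, conn}
  end

  defp schedule_beat, do: Process.send_after(self(), :beat, @interval)

  defp send_beat(_conn), do: :ok
end
```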
It's possible to scale fetchers and reporters using the `fetcher_count` and `reporter_count` options respectively, but I don't see how they would ever become bottlenecks unless your job latency is less than the time it takes to talk to the Faktory server. And if that's the case, you shouldn't be using async jobs.
Still, scaling out fetchers and reporters is implemented (for fun and learning).
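For reference, these knobs might be declared along the following lines; this is a hypothetical sketch, since the option names come from this document but the exact config shape may differ from the library's real API:

```elixir
# config/config.exs -- hypothetical shape; check the library's docs.
import Config

config :my_app, MyApp.Worker,
  concurrency: 20,    # number of runners = max jobs in flight
  fetcher_count: 1,   # rarely a bottleneck; see above
  reporter_count: 1   # rarely a bottleneck; see above
```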
Connections to the Faktory server use the Connection library, which helps with error handling and reconnecting.
Network errors cause retries with exponential backoff for fetching, acking, and failing jobs.
There is no retry logic for pushing (enqueuing) jobs.
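For illustration only (this isn't the library's actual code), a retry loop with exponential backoff along those lines looks like:

```elixir
# Illustrative only: retry a network operation with exponential backoff.
defmodule Sketch.Backoff do
  # Calls `fun` until it succeeds, sleeping 1s, 2s, 4s, ... between failures.
  def retry(fun, attempt \\ 0, max_attempts \\ 10)

  def retry(fun, attempt, max_attempts) when attempt < max_attempts do
    case fun.() do
      {:error, _reason} ->
        :timer.sleep(Integer.pow(2, attempt) * 1_000)
        retry(fun, attempt + 1, max_attempts)

      result ->
        result
    end
  end

  def retry(fun, _attempt, _max_attempts), do: fun.()
end
```

A caller would then wrap a network operation in it, e.g. `Sketch.Backoff.retry(fn -> do_fetch(conn) end)` (where `do_fetch/1` is hypothetical).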
No job should ever be lost due to Faktory's acking semantics, and if one is, it's Faktory's fault, not `faktory_worker_ex`'s... ;)
At worst, a job will be processed more than once due to a process crashing before issuing the ack.
Fetchers, runners, and reporters are all long-running processes, but each job is executed in its own process.
It's like the Resque model, but efficient, because BEAM processes are far cheaper than Unix processes.
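To make the per-job process model concrete, here is an illustrative sketch (with a hypothetical `module_for/1` lookup) of running a single job in its own monitored BEAM process and turning its exit into a report:

```elixir
# Illustrative sketch: execute one job in a fresh, monitored BEAM process.
defmodule Sketch.JobProcess do
  def run(job) do
    {pid, ref} =
      spawn_monitor(fn ->
        apply(module_for(job), :perform, job["args"])
      end)

    # A normal exit means the job succeeded; any other exit reason
    # (including a crash) becomes a FAIL report.
    receive do
      {:DOWN, ^ref, :process, ^pid, :normal} -> {:ack, job}
      {:DOWN, ^ref, :process, ^pid, reason} -> {:fail, job, reason}
    end
  end

  # Hypothetical: map the job's "jobtype" string to an Elixir module.
  defp module_for(%{"jobtype" => type}), do: String.to_existing_atom("Elixir." <> type)
end
```

The monitor fires even when the job process crashes, so the runner can always emit a success-or-failure report.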