Workers can be CRON, Delayed, Recurring or On-Demand.
- CRON: Scheduled worker that runs at a specific time of day.
- Delayed: Worker scheduled to run after a specific time.
- Recurring: Worker that runs at a specific interval.
- On-Demand: Worker that runs asynchronously in the background.
Workers can be Compute Intensive and/or Memory Intensive.
- Compute Intensive: Workers that need more CPU cycles. These workers generally involve CPU-intensive operations such as loops, arithmetic operations, logical operations (if-else) and input/output operations that involve RAM reads/writes.
- Memory Intensive: Workers that need a considerably large memory footprint in RAM to complete their execution.
- CPU-intensive and memory-intensive jobs need not be mutually exclusive.
Workers can involve third-party calls. Third-party calls refer to any external or internal services that are called inside the worker. Note that internal services do not include databases, Elasticsearch, Redis or AWS resource calls. For the sake of simplicity, we consider these services reliable enough to handle any scale.
- Example: ZomatoLocationUpdateWorker, RouterWorker, OrderTracker, VidyootWorker
- Every worker needs to fall into one of the four patterns in the diagram above. The four patterns are:
- Pattern 1: Background processing job concerned only with computation on the data set supplied to it. It has no dependency on external services, and it does not involve database/Redis/Elasticsearch/AWS resource reads or writes either. The objective of this pattern is to split workers into multiple smaller units of work when such situations arise.
- Pattern 2: Job concerned with basic operations and with making HTTP calls to external services, asynchronously (*preferred). Async calls are preferred here because they give us the additional flexibility of handling an external service failure by pausing queues, thus preventing starvation in Sidekiq.
- Pattern 3: Job concerned with read/write operations on reliable data sources, with some basic operations on the data set. Ensure this pattern does not promote CPU/memory-intensive operations inside a single job; if such cases arise, we need to combine multiple patterns so that we don't run into any such issue.
- Pattern 4: Job concerned with bulk operations requiring a high CPU or memory footprint (constraints discussed later). This pattern suggests spawning new pods so that the job is executed outside the cluster in which the original Sidekiq pods are running. This is done to ensure,
- Pattern X: This is a variant of Patterns 1, 2, 3 and 4, in which a worker of any of these patterns finishes by scheduling another worker of any of Patterns 1, 2, 3 or 4. This is useful when we want to split a worker into smaller, constraint-bound workers.
Using a memory profiler, we can find the allocated and retained memory for a worker run. By setting the parameters to the worst case, we can find the worst-case memory footprint. Before any worker is deployed to production, we need to make sure that its allocated memory does not exceed 100 MB and its retained memory does not exceed 10 MB.
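The two limits can be written down as a small check, for example in a pre-deployment spec. This is a minimal sketch: the constant and method names are hypothetical, 1 MB is taken as 1024 * 1024 bytes, and the byte counts are assumed to come from a profiler run (the memory_profiler gem's report exposes `total_allocated_memsize` and `total_retained_memsize`).

```ruby
# Hypothetical helper; names are illustrative and not from the codebase.
ALLOCATED_LIMIT_BYTES = 100 * 1024 * 1024 # 100 MB allocated per worker run
RETAINED_LIMIT_BYTES  = 10 * 1024 * 1024  # 10 MB retained per worker run

# Returns the list of violated limits; an empty list means the run is within budget.
def memory_budget_violations(allocated_bytes, retained_bytes)
  violations = []
  violations << :allocated if allocated_bytes > ALLOCATED_LIMIT_BYTES
  violations << :retained  if retained_bytes > RETAINED_LIMIT_BYTES
  violations
end

# With the memory_profiler gem, the inputs could be gathered like this
# (SomeWorker and worst_case_args are placeholders):
#   report = MemoryProfiler.report { SomeWorker.new.perform(*worst_case_args) }
#   memory_budget_violations(report.total_allocated_memsize,
#                            report.total_retained_memsize)
```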
Why these numbers?
Memory issues arise because, when jobs with a high memory footprint run in bulk, the Sidekiq pods scale up, and these pods do not get downscaled because the jobs that run afterwards have a smaller memory footprint and do not trigger garbage collection. Our Sidekiq pods run with an 8 GB memory limit, which with the defined constraint allows 80 jobs on a single pod in the worst case.
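The figure of 80 jobs follows from dividing the pod limit by the per-job allocation limit. A small sketch of that arithmetic, assuming 1 GB is counted as 1000 MB; the names are illustrative.

```ruby
# Illustrative constants, assuming 1 GB = 1000 MB.
POD_MEMORY_LIMIT_MB      = 8 * 1000 # 8 GB pod memory limit
MAX_ALLOCATED_PER_JOB_MB = 100      # per-job allocation limit

# Number of worst-case jobs that fit in one pod (integer division).
def worst_case_jobs_per_pod(pod_limit_mb = POD_MEMORY_LIMIT_MB,
                            per_job_mb = MAX_ALLOCATED_PER_JOB_MB)
  pod_limit_mb / per_job_mb
end

puts worst_case_jobs_per_pod # prints 80
```

If the limit is instead read as 8 GiB (8192 MB), the same division gives 81, so the estimate is not sensitive to the unit convention.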
The CPU constraint concerns the number of CPU cycles utilised by a Sidekiq worker. This is calculated as average response time x average throughput for a worker. This number cannot be constrained, because throughput cannot be controlled: it depends on the scale at which our system is operating. Also, the way Sidekiq works ensures that high-throughput, low-latency jobs are handled appropriately.
So, we introduce a response time constraint for Sidekiq workers instead.
- For high throughput jobs (tp >= 60 rpm), response time should under no circumstances exceed 1s. This number ensures that, with concurrency set to 40, all 60 jobs get executed in 1.5s even if only a single Sidekiq pod is operational.
- For low throughput jobs (tp < 60 rpm), response time should not exceed 5s.
- If a job takes more than 5s, we have to be sure that it is bulk in nature and that it never runs during operational hours (11 AM to 3 PM).
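The response time rules above can be sketched as a few small predicates. The method and constant names are hypothetical, and the operational window is assumed to cover the hours 11 through 14 inclusive (11 AM up to, but not including, 3 PM).

```ruby
HIGH_THROUGHPUT_RPM = 60       # threshold between low and high throughput
OPERATIONAL_HOURS   = (11...15) # 11 AM up to 3 PM, end exclusive (assumption)

# Maximum allowed average response time, in seconds, for a given throughput.
def response_time_limit(throughput_rpm)
  throughput_rpm >= HIGH_THROUGHPUT_RPM ? 1.0 : 5.0
end

def within_response_time_limit?(throughput_rpm, avg_response_time_s)
  avg_response_time_s <= response_time_limit(throughput_rpm)
end

# Load figure from the text: average response time x average throughput,
# i.e. busy worker-seconds per minute.
def worker_load(avg_response_time_s, throughput_rpm)
  avg_response_time_s * throughput_rpm
end

# A job slower than 5s is acceptable only if it is bulk in nature and
# runs outside operational hours.
def slow_job_allowed?(bulk:, hour:)
  bulk && !OPERATIONAL_HOURS.cover?(hour)
end
```

For example, `within_response_time_limit?(120, 1.2)` is false, while `slow_job_allowed?(bulk: true, hour: 16)` is true.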
Jobs that are recursively enqueued, and for which only the most recent job needs to be performed, can be marked as collapsible. In our current system we have added the StaleJobClient and StaleJobStopper middleware, which get executed before any job execution. To implement StaleJobStopper in the system, refer to GoogleOrderStatusWorker and NumberMaskingWorker.
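The idea behind collapsible jobs can be illustrated without Sidekiq. The sketch below is not the actual StaleJobClient/StaleJobStopper code; `LatestJobRegistry` is a hypothetical stand-in in which a Hash replaces the shared store and a counter replaces the job identifier.

```ruby
# Hypothetical illustration of "only the most recent job runs".
class LatestJobRegistry
  def initialize
    @latest  = {}
    @counter = 0
  end

  # Called at enqueue time: remember the newest token for this key.
  def register(key)
    @counter += 1
    @latest[key] = @counter
  end

  # Called before execution: a job is stale if a newer one was registered.
  def stale?(key, token)
    @latest[key] != token
  end
end

registry = LatestJobRegistry.new
first    = registry.register("order:42")
second   = registry.register("order:42")

executed = []
[first, second].each do |token|
  next if registry.stale?("order:42", token) # skip superseded jobs
  executed << token
end
# Only the most recent job (token `second`) ends up in `executed`.
```

In a real deployment the registry would have to live in a store shared by all pods, since enqueueing and execution happen in different processes.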
This strategy can be used when we want to limit concurrent execution of a worker with the same parameters. A lock is acquired on a set of conditions, and no other worker with the same conditions gets enqueued until the lock is released. RouterWorker works on this principle.
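A minimal sketch of this locking behaviour, using an in-process table guarded by a Mutex. `ConditionLock` is a hypothetical name and this is not how RouterWorker is implemented; a real deployment would need a shared store so that all pods see the same locks.

```ruby
# Hypothetical in-memory lock table keyed by the worker's conditions.
class ConditionLock
  def initialize
    @held  = {}
    @mutex = Mutex.new
  end

  # Returns true if the lock was acquired, false if it is already held.
  def acquire(key)
    @mutex.synchronize do
      return false if @held[key]
      @held[key] = true
    end
  end

  # Frees the lock so that the next worker with the same key can be enqueued.
  def release(key)
    @mutex.synchronize { @held.delete(key) }
    nil
  end
end

lock  = ConditionLock.new
queue = []

# Enqueue only when no worker with the same conditions holds the lock.
enqueue = ->(key) { queue << key if lock.acquire(key) }

enqueue.call("route:city-7")
enqueue.call("route:city-7") # rejected: lock still held
lock.release("route:city-7")
enqueue.call("route:city-7") # accepted again after release
```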
In this strategy, we consider pulling the logic that makes third-party API calls out into another worker. This gives us more granular control over the API calls and over handling fallbacks.
Consider using a webhook-based architecture when the response is not required synchronously. This can be used in our inter-service communication, where we sync data and then update the successful-sync flags at the source upon a successful response. Example: ProcurementOrderStatusUpdateWorker. For external third-party services, we would need to confirm whether a webhook architecture is possible at their end.
But before doing this, we need to consider the amount of data we will need to store in Redis to perform this action. Moving the API calls of bulk operations into another worker should be avoided, as it might hit Redis limits.
We can allow batching for bulk operations only if the batches perform within the constraints discussed above. If the constraints are not satisfied, we need to reduce the batch size and schedule another worker.
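A sketch of the batching rule, assuming the 5s response time limit is the constraint being checked and that the batch size is halved when a batch exceeds it. Both helper names and the halving policy are illustrative choices, not something the guideline prescribes.

```ruby
# Split a list of record ids into batches of at most batch_size.
def split_batches(ids, batch_size)
  ids.each_slice(batch_size).to_a
end

# Pick the size for the next batch: halve it (never below 1) when the
# previous batch exceeded the time limit, otherwise keep it unchanged.
def next_batch_size(current_size, elapsed_s, limit_s: 5.0)
  elapsed_s > limit_s ? [current_size / 2, 1].max : current_size
end

batches = split_batches((1..5).to_a, 2)
# batches is [[1, 2], [3, 4], [5]]; each batch after the first would be
# handed to a separately scheduled worker.
```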
Sidekiq defaults to 25 retries with back-off between each retry. 25 retries means that the last retry would happen around three weeks after the first attempt (assuming all 24 prior retries failed).
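The three-week figure can be checked against the delay formula given in Sidekiq's error handling documentation, `(retry_count ** 4) + 15` seconds plus a random jitter term. The sketch below sums only the deterministic part; the method names are illustrative.

```ruby
# Deterministic part of the documented retry delay, in seconds.
# The real delay also adds random jitter, which is left out here.
def base_retry_delay(retry_count)
  (retry_count**4) + 15
end

# Total wait before the last of max_retries retries, ignoring jitter.
def total_base_retry_seconds(max_retries = 25)
  (0...max_retries).sum { |count| base_retry_delay(count) }
end

total = total_base_retry_seconds
puts format("%d seconds, about %.1f days", total, total / 86_400.0)
# prints "1763395 seconds, about 20.4 days"
```

Jitter adds well under an hour in total, so the last retry lands roughly 20 to 21 days after the first attempt.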
- For a non-idempotent worker with no API calls, set the retry count to 0.
- For an idempotent worker with no API calls, we will go ahead with 1 retry.
- For a non-idempotent worker with API calls, retries are to be handled by Faraday only, with a fallback and a count of 2, and a uniqueness mechanism needs to be implemented for Faraday failures and for worker uniqueness.
- For an idempotent worker with API calls, retries are to be handled by Faraday only, with a count of 2.
Note that an idempotent worker is one whose repeated execution does not lead to inconsistency in the system's state.
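The four retry rules can be summarised as a lookup. This is a sketch with hypothetical names; it reads "handled by Faraday only" as meaning that Sidekiq-level retries are set to 0 for workers with API calls, which is an interpretation rather than something stated explicitly.

```ruby
# Hypothetical summary of the retry rules.
# :sidekiq_retries  - value for the worker's Sidekiq retry option
# :http_retries     - retry count to configure on the HTTP client (Faraday)
# :needs_uniqueness - whether a uniqueness mechanism must accompany retries
def retry_policy(idempotent:, api_calls:)
  if api_calls
    # Assumption: Sidekiq retries are disabled when Faraday handles retries.
    { sidekiq_retries: 0, http_retries: 2, needs_uniqueness: !idempotent }
  else
    { sidekiq_retries: idempotent ? 1 : 0, http_retries: 0, needs_uniqueness: false }
  end
end

retry_policy(idempotent: true, api_calls: false)
# returns { sidekiq_retries: 1, http_retries: 0, needs_uniqueness: false }
```

In a worker class, the `:sidekiq_retries` value would typically be applied through `sidekiq_options retry: ...`.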

