[Deferred] Improve default retry strategy #251
Description
Currently, `Rage::Deferred` retries failed tasks up to 5 times using the formula `rand(5 * 2**attempt) + 1`. This prioritises the distribution of failed tasks, but the delays are too short. On average, a task exhausts all 5 retries in under 3 minutes:
| Attempt | Avg delay |
|---|---|
| 1 | ~6s |
| 2 | ~11s |
| 3 | ~21s |
| 4 | ~41s |
| 5 | ~81s |
| Total | ~2.7 minutes |
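The averages in the table can be reproduced from the formula. The following is a minimal sketch, assuming `rand(n)` returns an integer drawn uniformly from `0...n` (as Ruby's `Kernel#rand` does for an integer argument). The method names are hypothetical and are not part of Rage:

```ruby
# Current formula quoted in this issue: a random delay (in seconds) for an attempt.
def current_delay(attempt)
  rand(5 * 2**attempt) + 1
end

# Expected value of the formula: rand(n) averages (n - 1) / 2.0.
def current_avg_delay(attempt)
  (5 * 2**attempt - 1) / 2.0 + 1
end

averages = (1..5).map { |attempt| current_avg_delay(attempt) }
total_seconds = averages.sum

averages.each_with_index do |avg, index|
  puts "attempt #{index + 1}: ~#{avg.round}s"
end
puts "total: ~#{(total_seconds / 60).round(1)} minutes"
```

The exact expected total is 157.5 seconds, a little over 2.6 minutes. The table's ~2.7 minutes comes from summing the rounded per-attempt values.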
The problem is that most transient failures (service outages, deployment issues, configuration errors) take much longer than 3 minutes to resolve. By the time a developer discovers the issue, the task has already exhausted all its retries and won't be attempted again.
The goal of this issue is to increase the default retry count and use a formula that spaces retries further apart with each attempt, giving developers time to discover and fix the underlying issue.
Suggested formula
`(attempt**4) + 10 + (rand(15) * attempt)`
This uses polynomial growth (attempt⁴) instead of exponential (2ⁿ), combined with a small jitter component. The delays look like this:
| Attempt | Avg delay |
|---|---|
| 1 | ~18s |
| 2 | ~41s |
| 3 | ~2m |
| 4 | ~5m |
| 5 | ~11m |
| 10 | ~2.8h |
| 15 | ~14.1h |
| 20 | ~44.5h |
With 15 retries, the total time to exhaust all attempts is roughly 2 days. With 20 retries, it's roughly 8 days.
The default retry count should be increased to either 15 or 20 (see design considerations below).
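The totals quoted above for 15 and 20 retries can be checked with a short script. This is a sketch under stated assumptions: `rand(15)` yields integers from 0 to 14, so its mean is taken as 7, and the method names are hypothetical rather than existing Rage APIs:

```ruby
# Suggested formula from this issue: a random delay (in seconds) for an attempt.
def suggested_delay(attempt)
  (attempt**4) + 10 + (rand(15) * attempt)
end

# Expected value of the formula, using 7 as the mean of rand(15).
def suggested_avg_delay(attempt)
  (attempt**4) + 10 + (7 * attempt)
end

# Expected time (in seconds) needed to exhaust the given number of retries.
def suggested_total(max_retries)
  (1..max_retries).sum { |attempt| suggested_avg_delay(attempt) }
end

[15, 20].each do |max_retries|
  days = suggested_total(max_retries) / 86_400.0
  puts "#{max_retries} retries: ~#{days.round(1)} days"
end
```

Per-attempt averages computed this way can differ from the table by about a second, depending on how the jitter term is averaged and rounded.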
Design considerations
- 15 vs 20 retries. 15 retries (~2 days total) may be enough for issues caught during the work week, but tight if something breaks on a Friday evening. 20 retries (~8 days total) is more forgiving and covers weekend scenarios, but means failed tasks occupy memory for longer. Worth discussing which default makes more sense.
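To make the Friday evening scenario concrete, here is a rough sketch that estimates when the last retry would run under each default. It assumes the suggested formula with 7 as the mean of `rand(15)`, ignores task execution time, and treats each delay as starting right after the previous failure. The date and method name are hypothetical:

```ruby
# Expected seconds to exhaust `max_retries` with the suggested formula,
# using 7 as the mean of rand(15).
def expected_exhaustion_seconds(max_retries)
  (1..max_retries).sum { |attempt| attempt**4 + 10 + 7 * attempt }
end

# Hypothetical scenario: a task first fails on a Friday at 18:00 UTC.
first_failure = Time.utc(2024, 1, 5, 18, 0, 0)
monday_morning = Time.utc(2024, 1, 8, 9, 0, 0)

last_retry_at = {
  15 => first_failure + expected_exhaustion_seconds(15),
  20 => first_failure + expected_exhaustion_seconds(20)
}

last_retry_at.each do |max_retries, time|
  status = time < monday_morning ? "before" : "after"
  puts "#{max_retries} retries: last attempt on #{time.strftime('%A %H:%M')} (#{status} Monday 09:00)"
end
```

Under these assumptions, 15 retries run out during the weekend, while 20 retries leave attempts remaining well into the following week.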
Tips
- Look at the existing retry logic in the codebase to find where the current formula and retry count are defined.
- Check the Deferred docs to understand how `Rage::Deferred` works.
- Check the architecture doc that shows how Rage's core components interact with each other and outlines the design principles.
- Feel free to ask any questions or request help in the comments below!