RFC 0014: Cancelable timers #14
Draft: ysbaddaden wants to merge 13 commits into main from general-timeouts
Commits by ysbaddaden:
- 5fd88f3 RFC XXXX: General Timeouts [DRAFT]
- 65cec99 Update and rename 0000-general-timeout.md to 0014-general-timeouts.md
- 8209ceb Update the draft with the API and such
- 4cf5119 Add comments to mutex example, move callouts as unresolved questions,…
- f365034 Rename + clarify the intent + reorganize the sections
- 18c48cb Update text/0014-cancelable-timers.md
- ae5efb5 Update text/0014-cancelable-timers.md
- 0c5553e Include suggestions from @straight-shoota
- 9d05392 Add PoC PR link
- b157742 Update text/0014-cancelable-timers.md
- 29eaeb0 Updated proposal (CancelationToken, sleep, ...)
- d925b76 Fix: cancellable instead of cancelable
- 1f4bc7e Fix: #cancel_suspension? instead of #resolve_timer?
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,313 @@ | ||
| --- | ||
| Feature Name: general-timeouts | ||
| Start Date: 2025-08-02 | ||
| RFC PR: "https://github.com/crystal-lang/rfcs/pull/0014" | ||
| Issue: N/A | ||
| --- | ||
|
|
||
| # Summary | ||
|
|
||
| Introduce a general `#timeout` feature, ideally as simple as `#sleep`, that | ||
| returns whether the timer expired or was manually canceled. | ||
|
|
||
|
|
||
| # Motivation | ||
|
|
||
| Synchronization primitives, such as mutexes, condition variables, pools, or | ||
| event channels, could take advantage of a general timeout mechanism. | ||
|
|
||
| Crystal has a mechanism in the event loop to suspend the execution of a fiber | ||
| for a set amount of time (`#sleep`). It also has a couple of mechanisms to add | ||
| timeouts: one associated with IO operations, and another tailored to `Channel` | ||
| and `select` to support the timeout branch of select actions. | ||
|
|
||
| > [!CAUTION] | ||
| > Verify if the select action timeout mechanism can resume *twice*. | ||
|
|
||
| Adding timeouts to all the synchronization primitives in the stdlib, and | ||
| possibly to custom ones in shards and applications, shouldn't be much harder | ||
| than calling `#sleep`, and shouldn't require hacking into the private | ||
| `Fiber#timeout_event`. | ||
|
|
||
|
|
||
| # Guide-level explanation | ||
|
||
|
|
||
| What makes a timeout more complex than a sleep is that it can be canceled, | ||
| which leads to synchronization issues. For example, multiple threads may try to | ||
| cancel the timeout at the same time, while yet another thread tries to process | ||
| the expired timer. | ||
|
|
||
| We want the fiber to be resumed *once* and to report whether the timeout expired | ||
| or was canceled. We need synchronization and a means to decide which thread | ||
| resolves the timeout and enqueues the fiber. There can of course be only one. | ||
|
|
||
| A straightforward solution is to use an atomic: the fiber sets the atomic before | ||
| suspending itself, and only the party that succeeds in unsetting the atomic may | ||
| enqueue the fiber. This can be achieved with an `Atomic(Bool)`, for example | ||
| (`#set(true)` then `#swap(false) == true`). | ||
|
|
||
| The atomic could be a property on `Fiber`, where it's easily accessible to | ||
| anything that already knows about the fiber, be it a waiting fiber in a mutex or | ||
| a waiting timer event in the event loop. The problem is that there is no telling | ||
| whether the fiber resumed and set a new timeout in between. This is more | ||
| commonly known as the ABA problem. | ||
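To make the ABA problem concrete, here is a minimal sketch. It assumes a flag-only scheme stored in an `Atomic(UInt32)` (1 for a pending timeout, 0 for resolved); the variable names are illustrative and not part of the proposal.

```crystal
# Hypothetical flag-only scheme: 1 means "a timeout is pending", 0 "resolved".
flag = Atomic(UInt32).new(0_u32)

# Timeout A: the fiber sets the flag, then suspends.
flag.set(1_u32)

# The timer for A expires: the event loop wins the swap and resumes the fiber.
timer_a_won = flag.swap(0_u32) == 1_u32

# The resumed fiber immediately sets timeout B.
flag.set(1_u32)

# A canceller that still refers to timeout A runs late. Its swap succeeds,
# although it has just resolved timeout B by mistake: the ABA problem.
stale_canceller_won = flag.swap(0_u32) == 1_u32

puts "timer A won: #{timer_a_won}"
puts "stale canceller won: #{stale_canceller_won}"
```

Both swaps report success because the flag alone cannot tell timeout A from timeout B.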
|
|
||
| The common solution is to increment the atomic value on every attempt to change | ||
| the value using compare-and-set (CAS). If the atomic value didn't change, the | ||
| CAS succeeds, and the timeout is successfully resolved. Other attempts will fail | ||
| because the value changed, so even if the timeout is set again, the value will | ||
| have been incremented and a CAS with a stale value will always fail. Admittedly, | ||
| the counter will eventually wrap, but only after so many iterations that it's | ||
| practically impossible to hit the ABA issue. For example, a UInt32 word with a | ||
| 1-bit flag would need 2,147,483,648 iterations! | ||
|
|
||
| The downside is that every waiter and timer event must know the current | ||
| atomic value (hereafter called the `cancelation token`) to be able to resolve | ||
| the timeout. I believe this is an acceptable trade-off. | ||
|
|
||
| > [!NOTE] | ||
| > Alternatively, we could allocate the atomic on the heap, but then every | ||
| > timeout would depend on the GC, and we'd still need to save the pointer for | ||
| > every waiter and timer. We can't store it on `Fiber` because we'd run straight | ||
| > back into the ABA problem. | ||
|
|
||
| For our use case, we can rely on a few properties to choose the best atomic | ||
| operation: | ||
|
|
||
| 1. A fiber can only be suspended once at a time, and can thus only wait on a | ||
| single timeout (or a sleep) at a time. Moreover, only the current fiber can | ||
| create a timeout for itself (a third-party fiber can't). Hence we can merely | ||
| set the value when setting the timeout (no need for CAS). | ||
|
|
||
| 2. The flag is enough to prevent multiple threads from unsetting the timeout in | ||
| parallel, however we must increment the value when we *set the timeout again* | ||
| to prevent another thread from mistakenly unsetting the new timeout. Hence | ||
| setting the timeout must increment the value, but resolving can merely unset | ||
| the flag. | ||
|
|
||
| > [!TIP] | ||
| > To keep the flag and the increment in the same atomic value, we can use a | ||
| > UInt32 value and use bit 1 for the flag, so 0 and 1 denote the unset and set | ||
| > states respectively, and use bits 2..32 for the counter, thus adding 2 to do | ||
| > the increment. Each additional flag would shift the increment by 1 more bit. | ||
| > | ||
| > Setting the timeout must get the atomic value, set the first bit and increment | ||
| > by 2 (wrapping on overflow): `token = (atm.get | 1) &+ 2` while resolving the | ||
| > timeout shall unset the first bit: `token & ~1`. | ||
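The two expressions from the tip can be checked in isolation. The sketch below assumes two hypothetical helper names (`arm` and `disarm`) that are not part of the proposal, and only shows the bit arithmetic, without any atomics.

```crystal
# Hypothetical helpers; the names are illustrative, not part of the RFC.

# Value to store when setting a timeout: set the flag (lowest bit), then bump
# the counter held in the upper 31 bits by adding 2, wrapping on overflow.
def arm(value : UInt32) : UInt32
  (value | 1_u32) &+ 2_u32
end

# Value to store when resolving a timeout: clear the flag, keep the counter.
def disarm(token : UInt32) : UInt32
  token & ~1_u32
end

first = arm(0_u32)         # 3: flag set, counter 1
resolved = disarm(first)   # 2: flag unset, counter 1
second = arm(resolved)     # 5: flag set, counter 2
wrapped = arm(UInt32::MAX) # 1: the counter wrapped to 0, the flag is set

puts({first, resolved, second, wrapped})
```

Note that `second` differs from `first`, so anything still holding the first token can no longer match the atomic once a new timeout is set.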
|
|
||
| ## Synchronization primitives | ||
|
|
||
| Let's take a mutex as an example. | ||
|
|
||
| On failure to acquire the lock, the current fiber will want to add itself into a | ||
| waiting list (private to the mutex). As explained in the previous section, we | ||
| must memorize the cancelation token to be able to cancel it, so it must set the | ||
| timeout, get the token, and then add the fiber and the token to the mutex | ||
| waiting list, then delegate the timeout to the event loop implementation. | ||
|
|
||
| When another fiber unlocks, the mutex must try to wake a waiting fiber. While | ||
| doing this it must resolve the timeout using its cancelation token. On success | ||
| it must enqueue the fiber, on failure it must skip the fiber because another | ||
| fiber or thread has already enqueued the fiber or will enqueue it, and the mutex | ||
| shall try to wake another waiting fiber. | ||
|
|
||
| The resumed fiber will then know whether the timer expired or was canceled, and | ||
| can act accordingly: | ||
|
|
||
| - If the timeout expired, the fiber must remove itself from the waiting list and | ||
| return an error or raise an exception for example. | ||
|
|
||
| - If the timeout was canceled, then the fiber was manually resumed by an unlock | ||
| and shall try to acquire the lock again. On failure it would add itself back | ||
| into the waiting list and set a timeout for the remaining time (if any). | ||
|
|
||
| ## Fiber | ||
|
|
||
| The `Fiber` object shall hold the atomic value, and provide methods to create | ||
| the cancelation token, to start waiting, and to resolve the timeout. | ||
|
|
||
| ## Event loop | ||
|
|
||
| Each event loop shall provide a method to suspend the calling fiber for a | ||
| duration and needs to be given the cancelation token so it can resolve the | ||
| timeout when the timer expires. On success, it shall mark the event as expired | ||
| and enqueue the fiber. On failure, it must skip the fiber (it was canceled). | ||
|
|
||
| When the suspended fiber resumes, it must check the state of the timer. When | ||
| expired, the method can simply return, otherwise it must cancel the timer event. | ||
|
|
||
| > [!NOTE] | ||
| > We could cancel the timer event sooner, but that would require a method on | ||
| > Fiber to cancel the timeout, another method on every event loop to cancel the | ||
| > timer, and the event loops would have to memorize the timer event for every | ||
| > fiber in a timeout, for example using a hashmap. | ||
| > | ||
| > By delaying the timer cancelation to when the fiber is resumed, we can avoid | ||
| > all that, at the expense of keeping the timer event a bit longer than | ||
| > necessary in the timers' data structure. | ||
|
|
||
| # Reference-level explanation | ||
|
|
||
| ## Public API (`Fiber`) | ||
|
|
||
| We introduce an enum: `Fiber::TimeoutResult` with two values `CANCELED` and | ||
| `EXPIRED`. We could return a `Bool` instead, but then we'd be left to wonder | ||
| whether `true` means expired or canceled. | ||
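As a rough illustration of why the enum reads better than a `Bool`, here is a minimal sketch. `Sketch::TimeoutResult` is a hypothetical stand-in for the proposed `Fiber::TimeoutResult`, and `explain` is an illustrative helper. Crystal generates the `#expired?` and `#canceled?` predicates from the member names.

```crystal
# Hypothetical stand-in for the proposed Fiber::TimeoutResult.
module Sketch
  enum TimeoutResult
    CANCELED
    EXPIRED
  end
end

# Illustrative caller: the predicate names say which outcome is tested.
def explain(result : Sketch::TimeoutResult) : String
  if result.expired?
    "gave up waiting"
  else
    "woken up before the deadline"
  end
end

puts explain(Sketch::TimeoutResult::EXPIRED)
puts explain(Sketch::TimeoutResult::CANCELED)
```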
|
|
||
| We introduce a `TimeoutToken` alias for the atomic value type. This abstracts | ||
| the underlying type as an 'opaque' type. We could introduce a wrapper struct | ||
| with only a `#value` method to make it fully opaque. | ||
|
|
||
| The public API can do with a couple of methods: | ||
|
|
||
| - `Fiber.timeout(duration : Time::Span, & : Fiber::TimeoutToken ->) : Fiber::TimeoutResult` | ||
|
|
||
| Sets the flag and increments the value of the atomic. Yields the cancelation | ||
| token (aka the new atomic value) so the caller can record it, then delegates | ||
| to the event loop to suspend the calling fiber and eventually resume it when | ||
| the timeout expires, unless it has already been canceled. | ||
|
|
||
| Returns `Fiber::TimeoutResult::CANCELED` if the timeout was canceled, and | ||
| `Fiber::TimeoutResult::EXPIRED` if the timeout expired. | ||
|
|
||
| All the details to add, trigger and remove the timer are fully delegated to | ||
| the event loop implementations. | ||
|
|
||
| - `Fiber#resolve_timeout?(token : Fiber::TimeoutToken) : Bool` | ||
|
|
||
| Tries to unset the flag of the timeout atomic value for `fiber`. It must fail | ||
| if the atomic value isn't `token` anymore (the flag has already been unset or | ||
| the value got incremented). Returns true if and only if the atomic value was | ||
| successfully updated, otherwise returns false. | ||
|
|
||
| On success, the caller must enqueue the fiber. On failure, the caller mustn't. | ||
|
|
||
| Code that wants to cancel a timeout set with `Fiber.timeout` must call | ||
| `Fiber#resolve_timeout?` with the cancelation token and act according to the | ||
| result: enqueue the fiber if and only if the call returned true. | ||
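The contract of `Fiber#resolve_timeout?` can be sketched outside of `Fiber`. The `TimeoutState` class below is a hypothetical stand-in for the state the RFC proposes to keep on the fiber, with `UInt32` assumed as the token type. It is a sketch of the described behavior, not the actual implementation.

```crystal
# Hypothetical stand-in for the state that the RFC proposes to keep on Fiber.
class TimeoutState
  @value = Atomic(UInt32).new(0_u32)

  # Sets the flag and increments the counter. Only the owning fiber calls
  # this, so a plain set is enough (no CAS needed).
  def arm : UInt32
    token = (@value.get | 1_u32) &+ 2_u32
    @value.set(token)
    token
  end

  # Succeeds only if the value still equals `token`, that is, if the flag is
  # still set and no newer timeout was armed in the meantime.
  def resolve?(token : UInt32) : Bool
    _, success = @value.compare_and_set(token, token & ~1_u32)
    success
  end
end

state = TimeoutState.new
token_a = state.arm
first = state.resolve?(token_a)  # wins: the caller must enqueue the fiber
second = state.resolve?(token_a) # loses: the flag is already unset
token_b = state.arm
stale = state.resolve?(token_a)  # loses: the counter changed
fresh = state.resolve?(token_b)  # wins

puts({first, second, stale, fresh})
```

Exactly one call per armed timeout returns true, which is what guarantees that the fiber is enqueued once.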
|
|
||
| ## Internal API (`Crystal::EventLoop`) | ||
|
|
||
| We introduce one method: | ||
|
|
||
| - `Crystal::EventLoop#timeout(duration : Time::Span, token : Fiber::TimeoutToken) : Bool` | ||
|
|
||
| The event loop shall suspend the current fiber for `duration` and return | ||
| `true` if the timeout expired, and `false` otherwise. | ||
|
|
||
| When processing the timer event, the event loop must resolve the timeout by | ||
| calling `fiber.resolve_timeout?(token)` and enqueue the fiber if and only if it | ||
| returned true. It must skip the fiber otherwise. | ||
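The timer-processing rule can be simulated without a real event loop. In the sketch below, `Waiter`, `timers` and `run_queue` are hypothetical stand-ins for a suspended fiber, the pending timer events and the scheduler queue. The point is only to show the "enqueue on success, skip on failure" decision.

```crystal
# Hypothetical miniature of a suspended fiber: a name and the timeout atomic.
class Waiter
  getter name : String
  @timeout = Atomic(UInt32).new(0_u32)

  def initialize(@name : String)
  end

  def arm_timeout : UInt32
    token = (@timeout.get | 1_u32) &+ 2_u32
    @timeout.set(token)
    token
  end

  def resolve_timeout?(token : UInt32) : Bool
    @timeout.compare_and_set(token, token & ~1_u32)[1]
  end
end

a = Waiter.new("a")
b = Waiter.new("b")
timers = [{a, a.arm_timeout}, {b, b.arm_timeout}] # pending timer events

# A mutex unlock cancels b's timeout before its timer fires.
canceled_b = b.resolve_timeout?(timers[1][1])

# Later, the event loop processes the expired timers.
run_queue = [] of String
timers.each do |(waiter, token)|
  # Enqueue only when this call resolved the timeout; skip otherwise.
  run_queue << waiter.name if waiter.resolve_timeout?(token)
end

puts run_queue
```

Only `a` ends up in the queue: `b` was already resolved by the unlock, so the event loop skips it.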
|
|
||
| ## Example | ||
|
|
||
| Following is a potential implementation for the mutex example from the guide | ||
| section above. | ||
|
|
||
| ```crystal | ||
| # lock_impl?, unlock_impl, enqueue_waiter and dequeue_waiter? are left out | ||
| class CancelableMutex | ||
|   def lock : Nil | ||
|     loop do | ||
|       break if lock_impl? | ||
|       enqueue_waiter(Fiber.current, nil) | ||
|       # no timeout: suspend until an unlock enqueues this fiber | ||
|       Fiber.suspend | ||
|     end | ||
|   end | ||
|
|
||
|   def lock?(timeout : Time::Span) : Bool | ||
|     limit = Time.monotonic + timeout | ||
|
|
||
|     loop do | ||
|       return true if lock_impl? | ||
|
|
||
|       # 1. set the timeout (for the remaining time) | ||
|       res = Fiber.timeout(limit - Time.monotonic) do |token| | ||
|         # 2. save the cancelation token | ||
|         enqueue_waiter(Fiber.current, token) | ||
|
|
||
|         # 3. the fiber will now be suspended... | ||
|       end | ||
|       # 4. and, the fiber has resumed | ||
|
|
||
|       return false if res.expired? | ||
|     end | ||
|   end | ||
|
|
||
|   def unlock : Nil | ||
|     unlock_impl | ||
|
|
||
|     while waiter = dequeue_waiter? | ||
|       fiber, token = waiter | ||
|
|
||
|       if token.nil? || fiber.resolve_timeout?(token) | ||
|         # the waiter had no timeout or we canceled the timeout | ||
|         fiber.enqueue | ||
|         break | ||
|       end | ||
|
|
||
|       # the timeout was already resolved: try the next waiting fiber | ||
|     end | ||
|   end | ||
| end | ||
| ``` | ||
|
|
||
|
|
||
| # Drawbacks | ||
|
|
||
| None that I can think of. | ||
|
|
||
| # Rationale and alternatives | ||
|
|
||
| The feature is designed to be low level yet simple and efficient, allowing | ||
| higher level abstractions to easily implement a timeout. Ideally this feature | ||
| might be usable to implement all the different timeouts: select action timeouts, | ||
| IO timeouts for some event loops (this needs to be investigated). | ||
|
|
||
| An alternative could be to introduce lightweight abstract channels. One such | ||
| channel could have a delayed send that would be triggered after some duration. A | ||
| select action could merely wait on this. Yet, such a delayed channel might be | ||
| implementable on top of the timeout feature presented here. It actually feels | ||
| more like a potential evolution for `select`. | ||
|
|
||
| # Prior art | ||
|
|
||
| Most runtimes expose cancelable timers, directly (for example [timer_create(2)] | ||
| on POSIX) or indirectly through synchronization primitives (for example | ||
| [pthread_cond_timedwait(3)] on POSIX). | ||
|
|
||
| # Unresolved questions | ||
|
|
||
| 1. Instead of `Fiber#resolve_timeout?` we could have `TimeoutToken` be a struct | ||
| with the fiber reference and the cancelation token, and have a `#resolve?` | ||
| method to resolve the token. That would be more OOP and leave more room for | ||
| evolution, though it would make the token larger (pointer + u32 + padding) and | ||
| duplicate the fiber reference that is likely to be kept already. | ||
|
|
||
| 2. We may want to use *absolute time instead of relative duration*, so that use | ||
| cases that need to retry wouldn't have to deal with calculating the remaining | ||
| time on each iteration, and would only need to calculate the absolute limit | ||
| (now + timeout). | ||
|
|
||
| The problem is that monotonic times and durations are represented using the | ||
| same type (`Time::Span`), and `#sleep` uses relative time. Using absolute | ||
| time would be confusing. Maybe we can support *both* and add an `absolute` | ||
| argument, that would default to `false`? | ||
|
||
|
|
||
| Only `Fiber.timeout()` would need the argument. The event loop API can be | ||
| fixed to use either absolute or relative times only, since implementations | ||
| may already use absolute times internally—the epoll and kqueue event loops | ||
| use absolute time for example. | ||
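The bookkeeping that a relative-duration API pushes onto callers can be sketched as follows. `retry_until` is a hypothetical helper, not part of the proposal. It computes the absolute limit once and recomputes the remaining duration on each attempt, much like the `lock?` example does.

```crystal
# Hypothetical helper: retries the block until it returns true or the
# deadline passes, handing it the remaining time on each attempt.
def retry_until(timeout : Time::Span, & : Time::Span -> Bool) : Bool
  limit = Time.monotonic + timeout # absolute limit, computed once

  loop do
    remaining = limit - Time.monotonic # relative duration, recomputed each time
    return false if remaining <= Time::Span::ZERO
    return true if yield remaining
  end
end

# Succeeds on the third attempt, well before the deadline.
attempts = 0
succeeded = retry_until(1.second) do |remaining|
  attempts += 1
  attempts == 3
end

# Never succeeds: the helper gives up once the deadline has passed.
never_ok = retry_until(5.milliseconds) do |remaining|
  sleep 1.millisecond
  false
end

puts "succeeded: #{succeeded}, gave up: #{!never_ok}"
```

With an absolute-time variant of `Fiber.timeout`, the caller would pass `limit` directly and the subtraction would disappear from the retry loop.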
|
|
||
| # Future possibilities | ||
|
|
||
| As a low-level feature, we can use it to add timeouts wherever they'd make | ||
| sense. For example, we could implement alternative methods that can time out in | ||
| every synchronization primitive (`Mutex`, `WaitGroup`, ...), including those in | ||
| the [sync shard]. | ||
|
|
||
| We might be able to use it in the IOCP event loop, where we currently need to | ||
| sleep and yield in a specific timeout case, in other event loops for IO | ||
| timeouts (to be verified), and for the select action timeout that currently | ||
| relies on a custom solution. | ||
|
|
||
| [timer_create(2)]: https://www.man7.org/linux/man-pages/man2/timer_create.2.html | ||
| [pthread_cond_timedwait(3)]: https://www.man7.org/linux/man-pages/man3/pthread_cond_timedwait.3p.html | ||
| [sync shard]: https://github.com/ysbaddaden/sync | ||