Adds priority-inheritance futexes #131584

Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @cuviper (or someone else) some time within the next two weeks. Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label is kept up to date.
After some investigation, I think it is not worth implementing PI futexes using …
Given that pthread mutexes on FreeBSD implement priority inheritance, a possible solution is to switch to the pthread backend on FreeBSD.
cc @joboet
The Miri subtree was changed. cc @rust-lang/miri
Thanks for implementing this in Miri! However, the implementation is unfortunately quite hard to follow -- this definitely needs more comments. You cannot assume that the reader of this code knows the futex API by heart.
// It's not uncommon for `addr` to be passed as another type than `*mut i32`, such as `*const AtomicI32`.
let futex_val = this.read_scalar_atomic(&addr, AtomicReadOrd::Relaxed)?.to_i32()?;
if val == futex_val {
let futex_val = this.read_scalar_atomic(&addr, AtomicReadOrd::SeqCst)?.to_u32()?;
There is a huge comment above why we are doing a fence here, and now you just replaced the fence by something else. Why?
Please stick to the original implementation. It will be very hard to review this if you make deep fundamental changes like this. SeqCst writes and SeqCst fences are very much not equivalent.
> SeqCst writes and SeqCst fences are very much not equivalent.
I have no idea. Could you please explain it or provide some materials?
I'm afraid a full introduction to the C++ memory model is beyond the scope of this thread. Mara wrote a book about it, available at https://marabos.nl/atomics/, but I don't know if it goes into the fact that SeqCst fences + relaxed accesses are not equivalent to SeqCst accesses -- that is really advanced, and I don't know any place that thoroughly explains it.
Please keep the fence, and make all the reads/writes Relaxed, like it was before. Carefully read the comment to ensure the fence is put in the right place, given that there are some new accesses being added here.
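As a rough illustration of the distinction being discussed, here is a hand-written sketch (not Miri's actual shim code; the function names and the exact fence placement are illustrative, the real code's comment pins down where the fence must go):

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};

// Pattern A: keep the atomic access Relaxed and use a separate SeqCst fence
// for ordering.
fn wait_check_with_fence(futex: &AtomicU32, expected: u32) -> bool {
    fence(Ordering::SeqCst);
    futex.load(Ordering::Relaxed) == expected
}

// Pattern B: a SeqCst access with no fence. Under the C++/Rust memory model
// this is NOT equivalent to pattern A.
fn wait_check_with_seqcst_load(futex: &AtomicU32, expected: u32) -> bool {
    futex.load(Ordering::SeqCst) == expected
}
```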
What about `write_scalar_atomic` to `&addr`? Do I need to put a fence after it?
I don't know. The old logic was carefully figured out by a bunch of people together. Any adjustment of it will require someone to really dig into this, understand the logic, and make sure the adjustment makes sense.
🤯
Yeah sorry, it's complicated. :/ I can try to help, but my time is very limited.
If there is some way to land this without doing the Miri change, that would also work. E.g. we could keep using `futex::Mutex` instead of `pi_futex::Mutex` when `cfg(miri)` is set.
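A minimal sketch of that fallback idea (the module layout here is illustrative, not the actual library/std paths):

```rust
mod futex {
    pub struct Mutex; // stand-in for the existing futex-based Mutex
}
mod pi_futex {
    pub struct Mutex; // stand-in for the new PI-futex-based Mutex
}

// Under Miri, keep using the plain futex implementation; natively, use the
// priority-inheritance one.
#[cfg(miri)]
pub use futex::Mutex;
#[cfg(not(miri))]
pub use pi_futex::Mutex;
```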
I think it's correct now in my latest commit. Someone can report it if it's not, but for now I assume it is, since it passes the tests.
Tests passing doesn't mean it is correct, it just means it is good enough to not blow up. ;) Concurrency primitives are prone to subtle bugs.
If you want to land the Miri changes, we'll definitely need a test as requested here, in particular a version of concurrent_wait_wake for PI futexes.
> If you want to land the Miri changes, we'll definitely need a test as requested #131584 (comment), in particular a version of concurrent_wait_wake for PI futexes.

I've added one in the latest commit. Please have a look 😃
I don't think I'm the best person to review this... maybe: r? m-ou-se
I modified the (internal) interface of pal
When does this poisoning happen? Is that also something Miri should (eventually) emulate?
From `man futex(2)`:

> [If] the owner of the futex/RT-mutex dies unexpectedly, then the kernel cleans up the RT-mutex and hands it over to the next waiter. This in turn requires that the user-space value is updated accordingly. To indicate that this is required, the kernel sets the FUTEX_OWNER_DIED bit in the futex word along with the thread ID of the new owner. User space can detect this situation via the presence of the FUTEX_OWNER_DIED bit and is then responsible for cleaning up the stale state left over by the dead owner.

Linux automatically unlocks the futex and sets the `FUTEX_OWNER_DIED` bit if the owner of the futex dies. Sure, we have `MutexGuard` that poisons the `Mutex` when the thread panics and dies:

https://github.com/rust-lang/rust/blob/e200c7f2e1a1ec7691a24539cf191f4c4d9d2a2c/library/std/src/sync/mutex.rs#L546-L554

However, `self.poison` is a no-op when `panic = "abort"` (IDK why; this is another topic). And the thread can die between `poison.done()` and `inner.unlock()` (`poison.done()` stores the flag using `Relaxed`, so it's possible more operations are shuffled in between; IDK why it is implemented in this way; this is also another topic). So there are cases where the `Mutex` is not properly poisoned and we have to rely on the `FUTEX_OWNER_DIED` bit returned by Linux.

> Is that also something Miri should (eventually) emulate?

It's hard. BTW Linux also implements deadlock detection (`EDEADLK`). I have no idea how to implement them in Miri.

P.S. The simplest way to kill a thread at an arbitrary point might be to set up a signal handler and call …
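To make the "thread can die between the two steps" point concrete, here is a self-contained sketch of the drop-path shape being described (made-up `RawMutex`/`Guard` types, not std's actual `MutexGuard` code):

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

struct RawMutex {
    futex: AtomicU32,     // 0 = unlocked, non-zero = locked
    poisoned: AtomicBool, // best-effort poison flag
}

struct Guard<'a> {
    lock: &'a RawMutex,
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        // Step 1: record poisoning if we are unwinding from a panic.
        if std::thread::panicking() {
            self.lock.poisoned.store(true, Ordering::Relaxed);
        }
        // If the thread dies *here*, between the two steps, the mutex stays
        // locked but unpoisoned; only the kernel's FUTEX_OWNER_DIED bit can
        // tell later lockers what happened.
        // Step 2: actually release the lock.
        self.lock.futex.store(0, Ordering::Release);
    }
}
```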
What does it mean for the "owner to die"? Does it mean the thread finishes, the process gets killed, or what?
Anyway this does not have to be in the first implementation, but if this lands we should open an issue to track filling the gaps.
For now, I will wait for a sign from t-libs that they are planning to accept a change like this before I spend the time doing a more thorough review.
The thread terminates (normally, killed, or whatever) with the mutex held.
FWIW I'm also working on a … Futex requeue is also available on OpenBSD and Fuchsia.
One thing that would definitely be good for the Miri side is extending …
If the kernel unlocked the mutex because the owner died, regular lock poisoning is not enough, as it is safe to ignore lock poisoning. Instead you would have to consider the mutex permanently locked and make all future attempts at locking it either block forever or panic, rather than return a poison error. If a mutex guard is forgotten, it should never be exposed in an unlocked state again, as unsafe code may depend on it staying locked permanently. Also, rustc itself for example has a place where it leaks an RwLock reader to ensure nobody locks it with write permissions again, as doing that could cause miscompilations or other bugs.
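A hedged sketch of that policy (the constant value is from the Linux uapi headers; the function name and surrounding structure are made up, not std's code):

```rust
// If the kernel set FUTEX_OWNER_DIED when handing us the lock, refuse to
// expose the mutex as usable instead of reporting ordinary poisoning.
const FUTEX_OWNER_DIED: u32 = 0x4000_0000;

fn check_owner_died(futex_word_after_lock: u32) {
    if futex_word_after_lock & FUTEX_OWNER_DIED != 0 {
        // Unsafe code may rely on a forgotten guard keeping the mutex locked
        // forever, so panic (or block forever) rather than return Poisoned.
        panic!("previous owner of the mutex died while holding it");
    }
}
```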
Makes sense 👍. I've updated my code.
☔ The latest upstream changes (presumably #131727) made this pull request unmergeable. Please resolve the merge conflicts.
Seems like you figured it out, but please add comments next to the …
I haven't had the time to look at the new PI code yet (also still waiting for a signal from t-libs that they want to pursue this), but here's some comment on the other part. As style comments they also apply to the PI code, if similar patterns occur there.
If it comes down to it, we don't have to block this PR on finessing the PI shims, we can always improve them later.
@@ -145,13 +175,21 @@ pub fn futex<'tcx>(
// It's not uncommon for `addr` to be passed as another type than `*mut i32`, such as `*const AtomicI32`.
let futex_val = this.read_scalar_atomic(&addr, AtomicReadOrd::Relaxed)?.to_i32()?;
if val == futex_val {
// Check that the top waiter (if exists) is waiting using FUTEX_WAIT_*.
Please explain why we do this. Is there a test covering this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the manpage:
EINVAL (FUTEX_WAKE, FUTEX_WAKE_OP, FUTEX_WAKE_BITSET,
FUTEX_REQUEUE, FUTEX_CMP_REQUEUE) The kernel detected an
inconsistency between the user-space state at uaddr and
the kernel state—that is, it detected a waiter which waits
in FUTEX_LOCK_PI or FUTEX_LOCK_PI2 on uaddr.
EINVAL (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI,
FUTEX_UNLOCK_PI) The kernel detected an inconsistency
between the user-space state at uaddr and the kernel
state. This indicates either state corruption or that the
kernel found a waiter on uaddr which is waiting via
FUTEX_WAIT or FUTEX_WAIT_BITSET.
It results in EINVAL if these two families of operations are mixed, so I added detection for that scenario.
if this.futex_waiter_count(addr_usize) != 0 {
if this.futex_top_waiter_extra(addr_usize).is_some() {
Please use `&&`. Also, like above: please explain why, and make sure we have a test.
Given that this wakes up N waiters, why are we only testing the top waiter?
If the top waiter has no `extra`, no waiter has `extra`; if it does, all of them do. This is checked each time we add a waiter, so it is guaranteed that all waiters are the same with respect to whether they have `extra` or not.
Ah, okay. Please document this invariant in the `FutexWaiter` type.
// Ok(None) for EINVAL set, Ok(Some(None)) for no timeout (infinity), Ok(Some(Some(...))) for a timeout.
// Forgive me, I don't want to create an enum for this return value.
I think this becomes cleaner if you don't pass in `dest`, and return an `InterpResult<'tcx, Result<Option<(...)>, IoError>>`. Then the caller should do `set_last_error`.
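A simplified sketch of the suggested shape, in plain Rust rather than Miri's `InterpResult`/`IoError` types (the function and type names here are made up):

```rust
use std::time::Duration;

// The error case is an explicit Err instead of being encoded as Ok(None).
struct Einval;

fn parse_timeout(raw_nanos: Option<u64>) -> Result<Option<Duration>, Einval> {
    match raw_nanos {
        None => Ok(None),                             // no timeout (wait forever)
        Some(0) => Err(Einval),                       // invalid timeout value
        Some(n) => Ok(Some(Duration::from_nanos(n))), // finite timeout
    }
}

fn caller(raw_nanos: Option<u64>) -> i64 {
    match parse_timeout(raw_nanos) {
        // In the real shim this is where the caller would set EINVAL and
        // write -1 to dest, e.g. via a set_last_error-style helper.
        Err(Einval) => -1,
        Ok(_timeout) => 0, // proceed with the wait
    }
}
```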
this.set_last_error(LibcError("EINVAL"))?;
this.write_scalar(Scalar::from_target_isize(-1, this), dest)?;
return interp_ok(());
This can become one line via `return this.set_last_error_and_return(...)`.
@@ -128,6 +128,8 @@ struct FutexWaiter {
    thread: ThreadId,
    /// The bitset used by FUTEX_*_BITSET, or u32::MAX for other operations.
    bitset: u32,
    /// Extra info stored for this waiter.
    extra: Option<u32>,
So this field encodes whether the waiter is a PI waiter? Just calling it `extra` doesn't make that very clear.
If it is a PI waiter, this field stores `Some(tid)`, where `tid` is the result of `gettid()`; if it is not, this field stores `None`.
We need to store the `tid` here because we need to write the tid of the top waiter to the futex when waking it.
Yes, the naming is not clear. I'll change it later.
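One way the renamed field could carry both the meaning and the invariant discussed above (an illustrative sketch with a made-up name `pi_tid`, not the PR's final code):

```rust
struct FutexWaiterSketch {
    /// For PI waiters, the `gettid()` of the waiting thread, which gets
    /// written into the futex word when this waiter is woken; `None` for
    /// non-PI waiters.
    ///
    /// Invariant: all waiters on a given futex agree on whether this is
    /// `Some`, because mixing PI and non-PI operations on one futex is
    /// rejected with EINVAL.
    pi_tid: Option<u32>,
}
```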
> We need to store tid here because we need to write the tid of the top waiter to futex when waking it.
Why can't the thread-that-is-woken-up do that itself in the wakeup callback?
Miri isn't an OS kernel so some things are a bit nicer. When a thread blocks it registers a callback that will be invoked on wakeup so it can do whatever it needs to at that moment, "atomically" as part of the wakeup.
> Why can't the thread-that-is-woken-up do that itself in the wakeup callback?

It's possible; however, it requires some tightly coupled logic in `this.futex_wait()` that checks whether we are a PI waiter, grabs the tid, and writes the futex. I'd prefer the current implementation.
I'll try to keep this in mind for the review of the PI paths, to see if I can find an elegant alternative.
I reviewed the implementation of PI futexes in the kernel. It seems to be using a fair unlock protocol where, on unlock, ownership of the mutex is passed directly to the next waiting thread.
While fair unlocking has theoretical benefits, in practice it tends to be much slower than unfair locking which leaves the mutex in an unlocked state and just wakes up a waiter. The reason is that waking up a thread has a long latency but it's very common for a single thread to repeatedly lock and unlock the same mutex. If lock ownership is forcibly transferred to a waiting thread then this prevents any other thread from acquiring the mutex until that thread wakes up.
As such, I don't think PI futexes are suitable for use as the default mutex implementation for Rust. They should be provided in a crate for specialized use cases where PI is needed and the performance cost of fair unlocking is acceptable.
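For contrast, a minimal sketch of the *unfair* strategy described above (not std's exact code; the constants and the `wake_one` helper are illustrative):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const UNLOCKED: u32 = 0;
const LOCKED: u32 = 1;
const CONTENDED: u32 = 2;

// Unfair unlock: the mutex is marked unlocked *before* any waiter runs, so
// whichever thread gets there first (often the one that just unlocked) can
// take it again without waiting for the woken thread to be scheduled.
fn unfair_unlock(futex: &AtomicU32, wake_one: impl Fn(&AtomicU32)) {
    if futex.swap(UNLOCKED, Ordering::Release) == CONTENDED {
        wake_one(futex); // e.g. FUTEX_WAKE with val = 1
    }
}
```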
}

pub fn locked() -> State {
    (unsafe { libc::gettid() }) as _
There are 2 issues with `gettid`:
- It always performs a syscall, which we really don't want in the uncontended fast path.
- It's only available from glibc 2.30, which is newer than our minimum (2.17).
> It always performs a syscall, which we really don't want in the uncontended fast path.

Do you mean to use thread-local storage to cache the tid? I'm afraid that it may cause bugs in some corner cases.
We can store it in the `Thread` structure.
We could additionally install a `MADV_WIPEONFORK` page containing an atomic function pointer that resolves either to "get the cached result" or "do a syscall" functions... or just a flag.
Though that requires 4.14. An `atfork` handler would work too, but adding one might have other exciting consequences.
> Though that requires 4.14. An `atfork` handler would work too but adding one might have other exciting consequences.

We cannot use `atfork`. E.g. `clone` and `clone3` do not call `atfork` handlers. We cannot assume that programmers always use the syscall wrappers from glibc.

> We can store it in the `Thread` structure.

Same issue. Programmers can call raw syscalls and leave `Thread` not updated to reflect a new thread. I have no idea whether this is defined to be UB in Rust. Given that calling raw syscalls is `unsafe`, maybe we can assume the tid stored in `Thread` is valid?

> We could additionally install a `MADV_WIPEONFORK` page containing an atomic function pointer that resolves either to "get the cached result" or "do a syscall" functions... or just a flag.

Somewhat complicated. I can implement this, but I'm afraid it's not zero-cost (we need one memory page per thread to store the tid).
> We cannot use atfork. e.g. clone and clone3 do not call atfork handlers. We cannot assume the programmers always use syscall wrappers from glibc.

libc authors have said that not using the wrappers and then calling libc functions after the clone is UB. Since the standard library on those targets relies on libc, that's also Rust UB.

> Somewhat complicated. I can implement this, but I'm afraid it's not zero-cost (we need one memory page per thread to store tid.)

I don't think so. A shared flag or a sort of generation marker for all threads should be sufficient.
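A hedged sketch of the thread-local caching idea (not the PR's actual code; it uses `libc::syscall(SYS_gettid)` so it stays within the glibc 2.17 baseline):

```rust
use std::cell::Cell;

thread_local! {
    // 0 means "not fetched yet"; real tids are never 0.
    static CACHED_TID: Cell<u32> = Cell::new(0);
}

fn current_tid() -> u32 {
    CACHED_TID.with(|tid| {
        let cached = tid.get();
        if cached != 0 {
            return cached;
        }
        // One syscall per thread; gettid() cannot fail.
        let fresh = unsafe { libc::syscall(libc::SYS_gettid) } as u32;
        tid.set(fresh);
        fresh
    })
}
```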
The thread id is cached in thread-local storage in the latest commit. Would you please review it?

Yes, there may be overhead. However, fair unlocking only happens when the futex is in the contended state:

pub unsafe fn unlock(&self) {
    if self.futex.compare_exchange(pi::locked(), pi::unlocked(), Release, Relaxed).is_err() {
        // We only wake up one thread. When that thread locks the mutex,
        // the kernel will mark the mutex as contended automatically
        // (futex != pi::locked() in this case),
        // which makes sure that any other waiting threads will also be
        // woken up eventually.
        self.wake();
    }
}

So "waking up a thread has a long latency but it's very common for a single thread to repeatedly lock and unlock the same mutex" is not a problem in the uncontended case: there, we do not enter kernel space.
The macOS lock we use is fair, right? How does it deal with this?
Yes. And as a result it is relatively slower than on other platforms. I think @joboet is more familiar with the macOS implementation.
macOS uses unfair locking by default and the thread IDs are stored in a thread-local variable, so it avoids most of the cost of priority-inheritance.
Is it mainlined? I see that the current std still uses pthread on macOS.
#122408 (which hasn't been merged yet) switches macOS to futexes.
Right. If macOS uses a thread-local variable to store the thread ID, we can do the same on Linux as well to avoid fetching the tid every time.
This uses FUTEX_LOCK_PI and FUTEX_UNLOCK_PI on Linux.
I dropped the Miri implementation due to merge conflicts. It can be added back later.
The contended case is specifically the one that I am concerned about. If you have 2 threads both contending on a single lock, an unfair mutex allows one thread to re-acquire the lock while the other thread is still in the process of waking up. A fair mutex keeps the mutex locked and transfers ownership to the other thread directly. The problem is that waking up a sleeping thread is slow, and it may take a while until it is scheduled if the system is contended. It's very common for threads to repeatedly lock & unlock the same lock in a sequence (e.g. calling a method that acquires a lock in a loop). If that happens with a fair mutex then neither thread is making progress for the full duration of the wakeup, and this delay is incurred on every unlock. This is the reason why today most mutex implementations are unfair. For example, quoting from https://webkit.org/blog/6161/locking-in-webkit/ (which is the post that inspired the creation of
PI futexes as currently implemented in the Linux kernel enforce the use of fair unlocking and thus suffer from this performance penalty. This is the reason why they are not used in glibc by default. As such, I think they are unsuitable as the default implementation of
Makes sense. I hope Linux can provide an unfair PI futex in the future.
Linked: #131514, #128231

This PR uses `FUTEX_LOCK_PI` and `FUTEX_UNLOCK_PI` on Linux for priority-inheritance futexes to implement `Mutex`. Quoted from `man 2 futex`: …

I'm still working on an implementation of PI futexes on FreeBSD.
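For readers unfamiliar with the PI futex protocol, here is a minimal sketch of how these two operations are typically used, based on the description in `man 2 futex` (a hand-written illustration, not this PR's code):

```rust
use std::sync::atomic::{AtomicU32, Ordering::{Acquire, Relaxed, Release}};

// The futex word holds 0 when unlocked and the owner's tid when locked;
// the kernel is only entered on contention and manages the waiter bits.
fn pi_lock(futex: &AtomicU32, my_tid: u32) {
    if futex.compare_exchange(0, my_tid, Acquire, Relaxed).is_err() {
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                futex.as_ptr(),
                libc::FUTEX_LOCK_PI | libc::FUTEX_PRIVATE_FLAG,
                0,                                  // val is ignored for LOCK_PI
                std::ptr::null::<libc::timespec>(), // no timeout
            );
        }
    }
}

fn pi_unlock(futex: &AtomicU32, my_tid: u32) {
    if futex.compare_exchange(my_tid, 0, Release, Relaxed).is_err() {
        // Waiters exist: the kernel hands the lock to the top waiter and
        // rewrites the futex word with the new owner's tid.
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                futex.as_ptr(),
                libc::FUTEX_UNLOCK_PI | libc::FUTEX_PRIVATE_FLAG,
            );
        }
    }
}
```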