
Adds priority-inheritance futexes #131584


Closed
wants to merge 3 commits

Conversation

ruihe774

Linked: #131514, #128231

This PR uses FUTEX_LOCK_PI and FUTEX_UNLOCK_PI on Linux to implement Mutex on top of priority-inheritance futexes.

Quoted from man 2 futex:

Priority inversion is the problem that occurs when a high-priority task is blocked waiting to acquire a lock held by a low-priority task, while tasks at an intermediate priority continuously preempt the low-priority task from the CPU. Consequently, the low-priority task makes no progress toward releasing the lock, and the high-priority task remains blocked.

Priority inheritance is a mechanism for dealing with the priority-inversion problem. With this mechanism, when a high-priority task becomes blocked by a lock held by a low-priority task, the priority of the low-priority task is temporarily raised to that of the high-priority task, so that it is not preempted by any intermediate level tasks, and can thus make progress toward releasing the lock. To be effective, priority inheritance must be transitive, meaning that if a high-priority task blocks on a lock held by a lower-priority task that is itself blocked by a lock held by another intermediate-priority task (and so on, for chains of arbitrary length), then both of those tasks (or more generally, all of the tasks in a lock chain) have their priorities raised to be the same as the high-priority task.

I'm still working on an implementation of PI-futex on FreeBSD.
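For orientation, here is a minimal sketch (not this PR's actual code) of the userspace side of the FUTEX_LOCK_PI / FUTEX_UNLOCK_PI protocol described in futex(2): the futex word holds 0 when unlocked and the owner's TID when locked, and the kernel is only entered on contention. The tid() helper is a stand-in using a raw SYS_gettid syscall; see the later discussion about gettid() availability and caching.

use std::ptr;
use std::sync::atomic::{AtomicU32, Ordering::{Acquire, Relaxed, Release}};

// Stand-in helper: fetch our TID via a raw syscall.
fn tid() -> u32 {
    (unsafe { libc::syscall(libc::SYS_gettid) }) as u32
}

pub fn lock(futex: &AtomicU32) {
    // Fast path: 0 (unlocked) -> our TID, entirely in user space.
    if futex.compare_exchange(0, tid(), Acquire, Relaxed).is_err() {
        // Contended: the kernel queues us with priority inheritance and
        // writes our TID into the futex word once we own the lock.
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                futex.as_ptr(),
                libc::FUTEX_LOCK_PI | libc::FUTEX_PRIVATE_FLAG,
                0,                             // val: ignored by FUTEX_LOCK_PI
                ptr::null::<libc::timespec>(), // no timeout
            );
        }
    }
}

pub fn unlock(futex: &AtomicU32) {
    // Fast path: our TID -> 0. This fails if the kernel set FUTEX_WAITERS,
    // in which case the kernel hands the lock to the next waiter.
    if futex.compare_exchange(tid(), 0, Release, Relaxed).is_err() {
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                futex.as_ptr(),
                libc::FUTEX_UNLOCK_PI | libc::FUTEX_PRIVATE_FLAG,
            );
        }
    }
}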

@rustbot
Collaborator

rustbot commented Oct 12, 2024

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @cuviper (or someone else) some time within the next two weeks.

Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (S-waiting-on-review and S-waiting-on-author) stays updated, invoking these commands when appropriate:

  • @rustbot author: the review is finished, PR author should check the comments and take action accordingly
  • @rustbot review: the author is ready for a review, this PR will be queued again in the reviewer's queue

@rustbot rustbot added O-unix Operating system: Unix-like S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 12, 2024
@rust-log-analyzer

This comment has been minimized.

@ruihe774
Author

ruihe774 commented Oct 12, 2024

After some investigation, I think it is not worth implementing pi futexes using UMUTEX_PRIO_INHERIT on FreeBSD. Drawbacks are:

Given that pthread mutexes on FreeBSD implement priority inheritance, a possible solution is to switch to the pthread backend on FreeBSD.

@slanterns
Contributor

cc @joboet

@rust-log-analyzer

This comment has been minimized.

@ruihe774 ruihe774 marked this pull request as ready for review October 12, 2024 11:16
@rustbot
Collaborator

rustbot commented Oct 12, 2024

The Miri subtree was changed

cc @rust-lang/miri

Member

@RalfJung RalfJung left a comment

Thanks for implementing this in Miri! However, the implementation is unfortunately quite hard to follow -- this definitely needs more comments. You cannot assume that the reader of this code knows the futex API by heart.

// It's not uncommon for `addr` to be passed as another type than `*mut i32`, such as `*const AtomicI32`.
let futex_val = this.read_scalar_atomic(&addr, AtomicReadOrd::Relaxed)?.to_i32()?;
if val == futex_val {
let futex_val = this.read_scalar_atomic(&addr, AtomicReadOrd::SeqCst)?.to_u32()?;
Member

@RalfJung RalfJung Oct 12, 2024

There is a huge comment above why we are doing a fence here, and now you just replaced the fence by something else. Why?

Please stick to the original implementation. It will be very hard to review this if you make deep fundamental changes like this. SeqCst writes and SeqCst fences are very much not equivalent.

Author

SeqCst writes and SeqCst fences are very much not equivalent.

I have no idea. Could you please explain it or provide some materials?

Member

I'm afraid a full introduction to the C++ memory model is beyond the scope of this thread. Mara wrote a book about it, available at https://marabos.nl/atomics/, but I don't know if it goes into the fact that SeqCst fences + relaxed accesses are not equivalent to SeqCst accesses -- that is really advanced, and I don't know any place that thoroughly explains it.

Please keep the fence, and make all the reads/writes Relaxed, like it was before. Carefully read the comment to ensure the fence is put in the right place, given that there are some new accesses being added here.
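For illustration, here are the two patterns being contrasted (a generic sketch, not Miri's code):

use std::sync::atomic::{fence, AtomicU32, Ordering};

// Pattern the review asks to keep: a Relaxed access combined with an
// explicit SeqCst fence.
fn load_with_fence(word: &AtomicU32) -> u32 {
    fence(Ordering::SeqCst);      // fences synchronize with other SeqCst fences
    word.load(Ordering::Relaxed)  // the access itself stays Relaxed
}

// A SeqCst *access* gives different guarantees and is not a drop-in
// replacement for the fence-plus-Relaxed pattern above.
fn load_seqcst(word: &AtomicU32) -> u32 {
    word.load(Ordering::SeqCst)
}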

Author

What about write_scalar_atomic to &addr? Do I need to put a fence after it?

Member

I don't know. The old logic was carefully figured out by a bunch of people together. Any adjustment of it will require someone to really dig into this, understand the logic, and make sure the adjustment makes sense.

Author

🤯

Member

Yeah sorry it's complicated. :/ I can try to help, but my time is very limited.

If there is some way to land this without doing the Miri change, that would also work. E.g. we could keep using futex::Mutex instead of pi_futex::Mutex when cfg(miri) is set.

Author

I think it's correct now in my latest commit. Someone can report it if it's not, but for now I assume it is, as it passes the tests.

Member

@RalfJung RalfJung Oct 13, 2024

Tests passing doesn't mean it is correct, it just means it is good enough to not blow up. ;) Concurrency primitives are prone to subtle bugs.

If you want to land the Miri changes, we'll definitely need a test as requested here, in particular a version of concurrent_wait_wake for PI futexes.

Author

If you want to land the Miri changes, we'll definitely need a test as requested #131584 (comment), in particular a version of concurrent_wait_wake for PI futexes.

I've added one in the latest commit. Plz have a look 😃

@cuviper
Member

cuviper commented Oct 12, 2024

I don't think I'm the best person to review this... maybe:

r? m-ou-se

@rustbot rustbot assigned m-ou-se and unassigned cuviper Oct 12, 2024
@ruihe774
Author

ruihe774 commented Oct 12, 2024

I modified the (internal) interface of the pal sys::sync::Mutex to return a MutexState. Linux poisons PI futexes by itself, so a return value is needed to propagate the poison state to the outer wrapper. Most line changes in source files of unrelated platforms are caused by this.

@RalfJung
Member

RalfJung commented Oct 12, 2024 via email

@ruihe774
Author

ruihe774 commented Oct 12, 2024

When does this poisoning happen? Is that also something Miri should (eventually) emulate?

From man futex(2):

[If] the owner of the futex/RT-mutex dies unexpectedly, then the kernel cleans up the RT-mutex and hands it over to the next waiter. This in turn requires that the user-space value is updated accordingly. To indicate that this is required, the kernel sets the FUTEX_OWNER_DIED bit in the futex word along with the thread ID of the new owner. User space can detect this situation via the presence of the FUTEX_OWNER_DIED bit and is then responsible for cleaning up the stale state left over by the dead owner.

Linux automatically unlocks the futex and sets the FUTEX_OWNER_DIED bit if the owner of the futex dies. Sure, we have MutexGuard, which poisons the Mutex when panicking:

impl<T: ?Sized> Drop for MutexGuard<'_, T> {
    #[inline]
    fn drop(&mut self) {
        unsafe {
            self.lock.poison.done(&self.poison);
            self.lock.inner.unlock();
        }
    }
}

However, self.poison is a no-op when panic = "abort" (IDK why; this is another topic); a thread can die when panic = "abort" as well. And the thread can die between poison.done() and inner.unlock() (poison.done() stores the flag using Relaxed, so it's possible that more operations are shuffled in between; IDK why it is implemented this way; this is also another topic). So there are cases where the Mutex is not properly poisoned and we have to rely on the FUTEX_OWNER_DIED bit returned by Linux.
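As a sketch of how the kernel's signal can be consumed (the MutexState variant names here are hypothetical, not this PR's actual names):

// Value from <linux/futex.h>.
const FUTEX_OWNER_DIED: u32 = 0x4000_0000;

// Hypothetical state returned by the pal lock routine.
enum MutexState {
    Ok,
    OwnerDied,
}

fn state_after_lock(futex_word: u32) -> MutexState {
    // After a successful FUTEX_LOCK_PI the futex word holds our TID, possibly
    // with FUTEX_OWNER_DIED set if the previous owner died while holding the lock.
    if futex_word & FUTEX_OWNER_DIED != 0 {
        MutexState::OwnerDied
    } else {
        MutexState::Ok
    }
}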

P.S. the simplest way to kill a thread at an arbitrary point might be to set up a signal handler and call pthread_exit() in it.

Is that also something Miri should (eventually) emulate?

It's hard. BTW, Linux also implements deadlock detection (EDEADLK). I have no idea how to implement these in Miri.

@RalfJung
Member

RalfJung commented Oct 12, 2024 via email

@ruihe774
Author

What does it mean for the "owner to die"? Does it mean the thread finishes, the process gets killed, or what?

The thread terminates (normally, killed, or whatever) with the mutex held.

@ruihe774
Author

ruihe774 commented Oct 12, 2024

FWIW I'm also working on a Condvar implementation (ruihe774@53155b2) based on futex requeue to avoid thundering-herd formation.

Futex requeue is also available on OpenBSD and Fuchsia.

@RalfJung
Member

One thing that would definitely be good for the Miri side is extending src/tools/miri/tests/pass-dep/concurrency/linux-futex.rs to invoke the new APIs. See in particular the concurrent_wait_wake test there which is related to the SeqCst fence.

@bjorn3
Member

bjorn3 commented Oct 12, 2024

If the kernel unlocked the mutex because the owner died, regular lock poisoning is not enough. It is safe to ignore lock poisoning. Instead, you'd have to consider the mutex permanently locked and make all future attempts at locking it either block forever or panic rather than return a poison error. If a mutex guard is forgotten, it should never be exposed in an unlocked state again, as unsafe code may depend on it staying locked permanently. Also, rustc itself has a place where it leaks an RwLock reader to ensure nobody locks it with write permissions again, as doing that could cause miscompilations or other bugs.

@ruihe774
Author

If the kernel unlocked the mutex because the owner died, regular lock poisoning is not enough. It is safe to ignore lock poisoning. Instead, you'd have to consider the mutex permanently locked and make all future attempts at locking it either block forever or panic rather than return a poison error. If a mutex guard is forgotten, it should never be exposed in an unlocked state again, as unsafe code may depend on it staying locked permanently. Also, rustc itself has a place where it leaks an RwLock reader to ensure nobody locks it with write permissions again, as doing that could cause miscompilations or other bugs.

Makes sense 👍. I've updated my code.

@ruihe774
Author

@m-ou-se @joboet I'm looking forward to hearing from you 😃

@bors
Collaborator

bors commented Oct 15, 2024

☔ The latest upstream changes (presumably #131727) made this pull request unmergeable. Please resolve the merge conflicts.

@ruihe774

This comment was marked as resolved.

@rust-log-analyzer

This comment has been minimized.

@RalfJung
Member

I wonder how I can have different Miri test case expected outputs for different platforms.

Seems like you figured it out, but please add comments next to the ignore/only explaining that there is another test covering this case, and where that test can be found.

Member

@RalfJung RalfJung left a comment

I haven't had the time to look at the new PI code yet (also still waiting for a signal from t-libs that they want to pursue this), but here are some comments on the other part. As style comments they also apply to the PI code, if similar patterns occur there.

If it comes down to it, we don't have to block this PR on finessing the PI shims, we can always improve them later.

@@ -145,13 +175,21 @@ pub fn futex<'tcx>(
// It's not uncommon for `addr` to be passed as another type than `*mut i32`, such as `*const AtomicI32`.
let futex_val = this.read_scalar_atomic(&addr, AtomicReadOrd::Relaxed)?.to_i32()?;
if val == futex_val {
// Check that the top waiter (if exists) is waiting using FUTEX_WAIT_*.
Member

Please explain why we do this. Is there a test covering this case?

Author

According to the manpage:

       EINVAL (FUTEX_WAKE, FUTEX_WAKE_OP, FUTEX_WAKE_BITSET,
              FUTEX_REQUEUE, FUTEX_CMP_REQUEUE) The kernel detected an
              inconsistency between the user-space state at uaddr and
              the kernel state—that is, it detected a waiter which waits
              in FUTEX_LOCK_PI or FUTEX_LOCK_PI2 on uaddr.

       EINVAL (FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI,
              FUTEX_UNLOCK_PI) The kernel detected an inconsistency
              between the user-space state at uaddr and the kernel
              state.  This indicates either state corruption or that the
              kernel found a waiter on uaddr which is waiting via
              FUTEX_WAIT or FUTEX_WAIT_BITSET.

It results in EINVAL if these two families of ops are mixed, so I added detection for that scenario.
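A sketch of the shape of that check (illustrative only, not Miri's actual helpers):

// Reject mixing PI and non-PI operations on one futex word, mirroring the
// kernel's EINVAL behaviour quoted above.
fn check_no_mixed_waiters(existing_waiter_is_pi: Option<bool>, op_is_pi: bool) -> Result<(), i32> {
    match existing_waiter_is_pi {
        // A waiter of the other kind is already queued on this futex word.
        Some(is_pi) if is_pi != op_is_pi => Err(libc::EINVAL),
        _ => Ok(()),
    }
}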

Comment on lines 236 to 237
if this.futex_waiter_count(addr_usize) != 0 {
if this.futex_top_waiter_extra(addr_usize).is_some() {
Member

Please use &&. Also, like above: please explain why, and make sure we have a test.

Given that this takes up N waiters, why are we only testing the top waiter?

Author

If the top waiter has no extra, all waiters have no extra; if it has one, all waiters have one. This is checked each time we add a waiter, so it is guaranteed that all waiters are the same with respect to whether they have an extra or not.

Member

Ah, okay. Please document this invariant in the FutexWaiter type.
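One possible way to write that invariant down (a sketch; the field name, the ThreadId stand-in, and the doc wording are illustrative, not the PR's actual code):

type ThreadId = u32; // stand-in for Miri's ThreadId

struct FutexWaiter {
    thread: ThreadId,
    /// The bitset used by FUTEX_*_BITSET, or u32::MAX for other operations.
    bitset: u32,
    /// `Some(tid)` if this waiter blocked via a PI operation (tid being its
    /// `gettid()` result), `None` otherwise.
    /// Invariant: all waiters on a given futex word are either all PI or all
    /// non-PI; this is checked whenever a waiter is added.
    pi_tid: Option<u32>,
}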

Comment on lines 45 to 46
// Ok(None) for EINVAL set, Ok(Some(None)) for no timeout (infinity), Ok(Some(Some(...))) for a timeout.
// Forgive me, I don't want to create an enum for this return value.
Member

I think this becomes cleaner if you don't pass in dest, and return an InterpResult<'tcx, Result<Option<(...)>, IoError>>. Then the caller should do set_last_error.

Comment on lines 180 to 182
this.set_last_error(LibcError("EINVAL"))?;
this.write_scalar(Scalar::from_target_isize(-1, this), dest)?;
return interp_ok(());
Member

This can become one line via return this.set_last_error_and_return(...).

@@ -128,6 +128,8 @@ struct FutexWaiter {
thread: ThreadId,
/// The bitset used by FUTEX_*_BITSET, or u32::MAX for other operations.
bitset: u32,
/// Extra info stored for this waiter.
extra: Option<u32>,
Member

So this field encodes whether the waiter is PI waiter? Just calling it extra doesn't make that very clear.

Author

If it is a PI waiter, this field stores Some(tid), where tid is the result of gettid(); if it is not, this field stores None.

We need to store tid here because we need to write the tid of the top waiter to futex when waking it.

Yes, the naming is not clear. I'll change it later.

Member

@RalfJung RalfJung Oct 16, 2024

We need to store tid here because we need to write the tid of the top waiter to futex when waking it.

Why can't the thread-that-is-woken-up do that itself in the wakeup callback?

Miri isn't an OS kernel so some things are a bit nicer. When a thread blocks it registers a callback that will be invoked on wakeup so it can do whatever it needs to at that moment, "atomically" as part of the wakeup.

Author

Why can't the thread-that-is-woken-up do that itself in the wakeup callback?

It's possible; however, it requires some tightly coupled logic in this.futex_wait() that checks whether we are a PI waiter, grabs the tid, and writes the futex. I'd prefer the current implementation.

Member

I'll try to keep this in mind for the review of the PI paths, to see if I can find an elegant alternative.

Member

@Amanieu Amanieu left a comment

I reviewed the implementation of PI futexes in the kernel. It seems to be using a fair unlock protocol where, on unlock, ownership of the mutex is passed directly to the next waiting thread.

While fair unlocking has theoretical benefits, in practice it tends to be much slower than unfair locking which leaves the mutex in an unlocked state and just wakes up a waiter. The reason is that waking up a thread has a long latency but it's very common for a single thread to repeatedly lock and unlock the same mutex. If lock ownership is forcibly transferred to a waiting thread then this prevents any other thread from acquiring the mutex until that thread wakes up.

As such, I don't think PI futexes are suitable for use as the default mutex implementation for Rust. They should be provided in a crate for specialized use cases where PI is needed and the performance cost of fair unlocking is acceptable.

}

pub fn locked() -> State {
(unsafe { libc::gettid() }) as _
Member

There are 2 issues with gettid:

  • It always performs a syscall, which we really don't want in the uncontended fast path.
  • It's only available from glibc 2.30, which is newer than our minimum (2.17).
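One direction that addresses both points is to cache the result of a raw SYS_gettid syscall in a thread-local (a sketch under that assumption, not this PR's code; the fork-related caveats are discussed below):

use std::cell::Cell;

thread_local! {
    // 0 means "not fetched yet"; a real TID is never 0.
    static CACHED_TID: Cell<u32> = Cell::new(0);
}

fn cached_tid() -> u32 {
    CACHED_TID.with(|cell| {
        let mut t = cell.get();
        if t == 0 {
            // Raw syscall: available on any glibc, unlike gettid() (glibc >= 2.30),
            // and only paid once per thread instead of on every lock.
            t = (unsafe { libc::syscall(libc::SYS_gettid) }) as u32;
            cell.set(t);
        }
        t
    })
}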

Author

It always performs a syscall, which we really don't want in the uncontended fast path.

Do you mean to use a thread-local storage to cache the tid?

I'm afraid that it may cause bugs in some corner cases.

Member

@the8472 the8472 Nov 21, 2024

We can store it in the Thread structure.

We could additionally install a MADV_WIPEONFORK page containing an atomic function pointer that resolves either to "get the cached result" or "do a syscall" functions... or just a flag.
Though that requires 4.14. An atfork handler would work too but adding one might have other exciting consequences.

Author

@ruihe774 ruihe774 Nov 21, 2024

Though that requires 4.14. An atfork handler would work too but adding one might have other exciting consequences.

We cannot use atfork. e.g. clone and clone3 do not call atfork handlers. We cannot assume the programmers always use syscall wrappers from glibc.

We can store it in the Thread structure.

Same issue. The programmers can call raw syscalls and have Thread not updated to reflect a new thread. I have no idea whether this is defined to be UB in Rust. Given that calling raw syscalls is "unsafe", maybe we can assume the tid stored in Thread is valid?

We could additionally install a MADV_WIPEONFORK page containing an atomic function pointer that resolves either to "get the cached result" or "do a syscall" functions... or just a flag.

Somewhat complicated. I can implement this, but I'm afraid it's not zero-cost (we need one memory page per thread to store the tid).

Member

@the8472 the8472 Nov 21, 2024

We cannot use atfork. e.g. clone and clone3 do not call atfork handlers. We cannot assume the programmers always use syscall wrappers from glibc.

libc authors have said that not using the wrappers and then calling libc functions after the clone is UB. Since the standard library on those targets relies on libc that's also Rust UB.

Somewhat complicated. I can implement this, but I'm afraid it's not zero-cost (we need one memory page per thread to store tid.)

I don't think so. A shared flag or a sort of generation marker for all threads should be sufficient.

Author

The thread ID is cached in thread-local storage in the latest commit. Would you please review it?

@ruihe774
Author

I reviewed the implementation of PI futexes in the kernel. It seems to be using a fair unlock protocol where, on unlock, ownership of the mutex is passed directly to the next waiting thread.

While fair unlocking has theoretical benefits, in practice it tends to be much slower than unfair locking which leaves the mutex in an unlocked state and just wakes up a waiter. The reason is that waking up a thread has a long latency but it's very common for a single thread to repeatedly lock and unlock the same mutex. If lock ownership is forcibly transferred to a waiting thread then this prevents any other thread from acquiring the mutex until that thread wakes up.

As such, I don't think PI futexes are suitable for use as the default mutex implementation for Rust. They should be provided in a crate for specialized use cases where PI is needed and the performance cost of fair unlocking is acceptable.

Yes, there may be overhead. However, fair unlocking only happens when the futex is in the contended state.

pub unsafe fn unlock(&self) {
    if self.futex.compare_exchange(pi::locked(), pi::unlocked(), Release, Relaxed).is_err() {
        // We only wake up one thread. When that thread locks the mutex,
        // the kernel will mark the mutex as contended automatically
        // (futex != pi::locked() in this case),
        // which makes sure that any other waiting threads will also be
        // woken up eventually.
        self.wake();
    }
}

So "waking up a thread has a long latency but it's very common for a single thread to repeatedly lock and unlock the same mutex" does not apply here: in the uncontended case, we do not enter kernel space.

@RalfJung
Member

RalfJung commented Nov 21, 2024 via email

@ruihe774
Author

The macos lock we use is fair, right? How does it deal with this?

Yes. And as a result it is relatively slower than on other platforms.

I think @joboet is more familiar with the macOS implementation.

@joboet
Member

joboet commented Nov 21, 2024

macOS uses unfair locking by default and the thread IDs are stored in a thread-local variable, so it avoids most of the cost of priority-inheritance.

@ruihe774
Author

macOS uses unfair locking by default and the thread IDs are stored in a thread-local variable, so it avoids most of the cost of priority-inheritance.

Is it mainlined? I see that the current std still uses pthread on macOS.

@bjorn3
Member

bjorn3 commented Nov 21, 2024

#122408 (which hasn't been merged yet) switches macOS to futexes.

@ruihe774
Author

#122408 (which hasn't been merged yet) switches macOS to futexes.

Right. If macOS uses a thread-local variable to store the thread ID, we can do the same on Linux to avoid fetching the tid every time.

This uses FUTEX_LOCK_PI and FUTEX_UNLOCK_PI
on Linux.
@ruihe774
Author

I dropped the miri implementation due to merge conflicts. It can be added back later.

@rust-log-analyzer

This comment has been minimized.

@Amanieu
Member

Amanieu commented Nov 21, 2024

Yes, there may be overhead. However, fair unlocking only happens when the futex is in the contended state.

So "waking up a thread has a long latency but it's very common for a single thread to repeatedly lock and unlock the same mutex" does not apply here: in the uncontended case, we do not enter kernel space.

The contended case is specifically the one that I am concerned about. If you have 2 threads both contending on a single lock, an unfair mutex allows one thread to re-acquire the lock while the other thread is still in the process of waking up. A fair mutex keeps the mutex locked and transfers ownership to the other thread directly.

The problem is that waking up a sleeping thread is slow and it may take a while until it is scheduled if the system is contended. It's very common for threads to repeatedly lock and unlock the same lock in a sequence (e.g. calling a method that acquires a lock in a loop). If that happens with a fair mutex then neither thread is making progress for the full duration of the wakeup, and this delay is incurred on every unlock.

This is the reason why today most mutex implementations are unfair. For example, quoting from https://webkit.org/blog/6161/locking-in-webkit/ (which is the post that inspired the creation of parking_lot):

However, allowing barging instead of enforcing FIFO allows for much higher throughput when a lock is heavily contended. Heavy contention in systems like WebKit that use very fine-grained locks implies that multiple threads are repeatedly locking and unlocking the same lock. In the worst case, a thread will make very little progress between two critical sections protected by the same lock. In a barging lock, if a thread unlocks a lock that had threads parked then it is still eligible to immediately reacquire it if it gets to the next critical section before the unparked thread gets scheduled. Barging permits threads engaged in microcontention to take turns acquiring the lock many times per turn. On the other hand, FIFO locks force contenders to form a convoy where they only get to hold the lock once per turn. This makes the program run much slower than with a barging lock because of the huge number of context switches – one per lock acquisition!

PI futexes as currently implemented in the Linux kernel enforce the use of fair unlocking and thus suffer from this performance penalty. This is the reason why they are not used in glibc by default. As such, I think they are unsuitable as the default implementation of Mutex in Rust.

@rust-log-analyzer
Collaborator

The job x86_64-gnu-tools failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
tests/pass/float_nan.rs ... ok
tests/pass/0weak_memory_consistency.rs ... ok

FAILED TEST: tests/pass/concurrency/sync.rs (revision `stack`)
command: MIRI_ENV_VAR_TEST="0" MIRI_TEMP="/tmp/miri-uitest-mvfqeD" RUST_BACKTRACE="1" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage1/bin/miri" "--error-format=json" "--sysroot=/checkout/obj/build/x86_64-unknown-linux-gnu/miri-sysroot" "-Dwarnings" "-Dunused" "-Ainternal_features" "-Zui-testing" "--out-dir" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage1-tools/miri_ui/tests/pass/concurrency" "tests/pass/concurrency/sync.rs" "--cfg=stack" "-Zmiri-disable-isolation" "-Zmiri-strict-provenance" "-Zmiri-preemption-rate=0" "--edition" "2021"
error: test got exit status: 1, but expected 0
 = note: compilation failed, but was expected to succeed

error: actual output differed from expected
Execute `./miri test --bless` to update `tests/pass/concurrency/sync.stack.stderr` to the actual output
--- tests/pass/concurrency/sync.stack.stderr
+++ <stderr output>
+error: unsupported operation: Miri does not support `futex` syscall with op=6
+  --> RUSTLIB/std/src/sys/pal/PLATFORM/pi_futex.rs:LL:CC
+   |
+LL | /                 libc::syscall(
+LL | |                     libc::SYS_futex,
+LL | |                     ptr::from_ref(futex.deref()),
+LL | |                     libc::FUTEX_LOCK_PI | libc::FUTEX_PRIVATE_FLAG,
+LL | |                     // remaining args are unused
+LL | |                 )
+   | |_________________^ Miri does not support `futex` syscall with op=6
+   |
+   |
+   = help: this is likely not a bug in the program; it indicates that the program performed an operation that Miri does not support
+   = note: BACKTRACE on thread `unnamed-ID`:
+   = note: inside `std::sys::pal::PLATFORM::pi_futex::linux::futex_lock` at RUSTLIB/std/src/sys/pal/PLATFORM/pi_futex.rs:LL:CC
+   = note: inside `std::sys::sync::mutex::pi_futex::Mutex::lock_contended` at RUSTLIB/std/src/sys/sync/mutex/pi_futex.rs:LL:CC
+   = note: inside `std::sys::sync::mutex::pi_futex::Mutex::lock` at RUSTLIB/std/src/sys/sync/mutex/pi_futex.rs:LL:CC
+   = note: inside `std::sync::Mutex::<i32>::lock` at RUSTLIB/std/src/sync/mutex.rs:LL:CC
+  --> tests/pass/concurrency/sync.rs:LL:CC
+   |
+LL |             let mut data = data.lock().unwrap();
+   |                            ^^^^^^^^^^^
---

Location:
   /cargo/registry/src/index.crates.io-6f17d22bba15001f/ui_test-0.26.5/src/lib.rs:357

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
error: test failed, to rerun pass `--test ui`
Caused by:
  process didn't exit successfully: `/checkout/obj/build/x86_64-unknown-linux-gnu/stage1-tools/x86_64-unknown-linux-gnu/release/deps/ui-de50b20aa7a9761c --quiet` (exit status: 1)
Command has failed. Rerun with -v to see more details.
  local time: Thu Nov 21 22:28:06 UTC 2024
  network time: Thu, 21 Nov 2024 22:28:06 GMT
##[error]Process completed with exit code 1.
Post job cleanup.

@ruihe774
Author

Yes, there may be overhead. However, fair unlocking only happens when the futex is in the contended state.
So "waking up a thread has a long latency but it's very common for a single thread to repeatedly lock and unlock the same mutex" does not apply here: in the uncontended case, we do not enter kernel space.

The contended case is specifically the one that I am concerned about. If you have 2 threads both contending on a single lock, an unfair mutex allows one thread to re-acquire the lock while the other thread is still in the process of waking up. A fair mutex keeps the mutex locked and transfers ownership to the other thread directly.

The problem is that waking up a sleeping thread is slow and it may take a while until it is scheduled if the system is contended. It's very common for threads to repeatedly lock and unlock the same lock in a sequence (e.g. calling a method that acquires a lock in a loop). If that happens with a fair mutex then neither thread is making progress for the full duration of the wakeup, and this delay is incurred on every unlock.

This is the reason why today most mutex implementations are unfair. For example, quoting from https://webkit.org/blog/6161/locking-in-webkit/ (which is the post that inspired the creation of parking_lot):

However, allowing barging instead of enforcing FIFO allows for much higher throughput when a lock is heavily contended. Heavy contention in systems like WebKit that use very fine-grained locks implies that multiple threads are repeatedly locking and unlocking the same lock. In the worst case, a thread will make very little progress between two critical sections protected by the same lock. In a barging lock, if a thread unlocks a lock that had threads parked then it is still eligible to immediately reacquire it if it gets to the next critical section before the unparked thread gets scheduled. Barging permits threads engaged in microcontention to take turns acquiring the lock many times per turn. On the other hand, FIFO locks force contenders to form a convoy where they only get to hold the lock once per turn. This makes the program run much slower than with a barging lock because of the huge number of context switches – one per lock acquisition!

PI futexes as currently implemented in the Linux kernel enforce the use of fair unlocking and thus suffer from this performance penalty. This is the reason why they are not used in glibc by default. As such, I think they are unsuitable as the default implementation of Mutex in Rust.

Makes sense.

I hope Linux can provide an unfair PI futex in the future.
