pthread_cond_wait: Add mutex to protect the waiter count #15566

Open: wants to merge 3 commits into base: master

pussuw
Contributor

@pussuw pussuw commented Jan 15, 2025

Summary

The load/compare and read-modify-write (RMW) accesses to wait_count need protection. Incrementing the counter can be done with atomic_fetch_add, as there is no race there. However, reading the counter when signaling needs to take a lock, so that if multiple threads signal at the same time, only one of them has exclusive access to the counter.

NOTE:
The assumption that the user will call pthread_cond_signal / pthread_cond_broadcast with the mutex given to pthread_cond_wait held is simply not true. It MAY hold it, but it is not forced. Thus, using the user space lock for protecting the wait counter as well is not valid!
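
As a rough illustration of the approach described in the summary, the sketch below registers a waiter with a single atomic increment and performs the check-and-decrement on the signal side under an internal lock that is independent of the user's mutex. The names `demo_cond`, `count_lock`, `demo_cond_add_waiter` and `demo_cond_claim_waiter` are hypothetical and use C11 `<stdatomic.h>` and pthreads for portability; this is not the NuttX implementation or its data layout.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a condition variable's bookkeeping.
 * This is NOT the NuttX struct pthread_cond_s layout.
 */

struct demo_cond
{
  atomic_int wait_count;       /* Number of threads currently waiting */
  pthread_mutex_t count_lock;  /* Internal lock, not the user's mutex */
};

static void demo_cond_init(struct demo_cond *cond)
{
  atomic_init(&cond->wait_count, 0);
  pthread_mutex_init(&cond->count_lock, NULL);
}

/* Waiter side: registering is a single atomic RMW, so no lock is taken. */

static void demo_cond_add_waiter(struct demo_cond *cond)
{
  atomic_fetch_add(&cond->wait_count, 1);
}

/* Signal side: the load/compare and the decrement must happen as one
 * unit, so they are done under the internal lock. Returns true if a
 * waiter was claimed and should be woken by the caller.
 */

static bool demo_cond_claim_waiter(struct demo_cond *cond)
{
  bool claimed = false;

  pthread_mutex_lock(&cond->count_lock);
  if (atomic_load(&cond->wait_count) > 0)
    {
      atomic_fetch_sub(&cond->wait_count, 1);
      claimed = true;
    }

  pthread_mutex_unlock(&cond->count_lock);
  return claimed;
}
```

Because the lock belongs to the condition variable itself, the sketch behaves the same whether or not the signaling thread holds the mutex that was passed to pthread_cond_wait.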

Impact

This fixes a regression introduced by #14581 and #14786.

Testing

MPFS with multiple threads using pthread_cond.
rv-virt:smp64

Direct reference from POSIX:

The pthread_cond_signal() or pthread_cond_broadcast() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behaviour is required, then that mutex is locked by the thread calling pthread_cond_signal() or pthread_cond_broadcast().

[1] https://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_cond_signal.html
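
For illustration, here is a minimal user-space sketch of the usage pattern the POSIX text allows: the predicate is updated under the mutex, but pthread_cond_signal is called after the mutex has been released. The names `g_lock`, `g_cond`, `g_ready`, `waiter` and `signal_without_mutex_demo` are made up for this example.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t g_cond = PTHREAD_COND_INITIALIZER;
static bool g_ready = false;

/* Waiter: blocks until the predicate becomes true. */

static void *waiter(void *arg)
{
  (void)arg;

  pthread_mutex_lock(&g_lock);
  while (!g_ready)
    {
      pthread_cond_wait(&g_cond, &g_lock);
    }

  pthread_mutex_unlock(&g_lock);
  return NULL;
}

/* Returns 0 if the waiter was started, woken and joined. */

static int signal_without_mutex_demo(void)
{
  pthread_t tid;
  int ret;

  ret = pthread_create(&tid, NULL, waiter, NULL);
  if (ret != 0)
    {
      return ret;
    }

  /* The predicate is changed under the mutex so the waiter cannot
   * miss the update.
   */

  pthread_mutex_lock(&g_lock);
  g_ready = true;
  pthread_mutex_unlock(&g_lock);

  /* The signal itself is sent WITHOUT holding g_lock, which POSIX
   * permits. The implementation therefore cannot rely on g_lock to
   * protect its own waiter bookkeeping.
   */

  pthread_cond_signal(&g_cond);

  return pthread_join(tid, NULL);
}
```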

@github-actions bot added the "Area: OS Components" and "Size: S" labels Jan 15, 2025
@pussuw
Contributor Author

pussuw commented Jan 15, 2025

@hujun260 and @xiaoxiang781216, can you please check this?

@pussuw
Contributor Author

pussuw commented Jan 15, 2025

Does the NuttX atomic library not offer atomic_load()?

Review comment on sched/pthread/pthread_condsignal.c (outdated, resolved)
@hujun260
Contributor

atomic_load

use atomic_read

@pussuw
Contributor Author

pussuw commented Jan 16, 2025

atomic_read requires the atomic_t type, which is volatile int. Are we supposed to use atomic_t inside the kernel, even if the toolchain implements atomic_load et al.?

The atomic.h API is a bit unclear to me after the recent modifications.

@xiaoxiang781216
Contributor

The recent improvements ensure the atomic API can be used on all architectures, even without toolchain support.

sem_getvalue returns ERROR and sets errno if it fails, we don't want to
return OK in this case, we want to return the non-negated error number.
@pussuw
Contributor Author

pussuw commented Jan 16, 2025

Should we add atomic_load, since atomic_read does not ensure memory ordering?

@xiaoxiang781216
Contributor

The interface follows the Linux kernel design.

@pussuw
Contributor Author

pussuw commented Jan 16, 2025

Ok. In my case using atomic_read is OK since reading cond->wait_count does not need to be ordered, so I'll just change to atomic_read.

In any case, if ever needed / wanted, atomic_load can be added very simply:

#define atomic_load(obj)                      ATOMIC_FUNC(load, 4)(obj, __ATOMIC_SEQ_CST)
#define atomic_load_acquire(obj)              atomic_read_acquire(obj)
#define atomic64_load(obj)                    ATOMIC_FUNC(load, 8)(obj, __ATOMIC_SEQ_CST)
#define atomic64_load_acquire(obj)            atomic_read64_acquire(obj)
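
To make the ordering distinction concrete, here is a small sketch in standard C11 `<stdatomic.h>` terms. It is only an analogy for the discussion above and does not use the NuttX atomic.h macros; the variable `g_wait_count` and both helper functions are hypothetical. The assumption is that an unordered read corresponds to a relaxed load, while the proposed atomic_load would correspond to a sequentially consistent one.

```c
#include <stdatomic.h>

/* Hypothetical counter standing in for cond->wait_count. */

static atomic_int g_wait_count;

/* Relaxed load: the read itself is atomic (no torn values), but it
 * imposes no ordering on surrounding memory accesses. This is enough
 * when only the counter value matters.
 */

static int read_count_relaxed(void)
{
  return atomic_load_explicit(&g_wait_count, memory_order_relaxed);
}

/* Sequentially consistent load: the default for C11 atomic_load(),
 * and the ordering named by __ATOMIC_SEQ_CST in the macros above.
 */

static int read_count_seq_cst(void)
{
  return atomic_load(&g_wait_count);
}
```

Both functions return the same value in a single-threaded test; they differ only in what other memory operations the compiler and CPU may reorder around the load.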

@pussuw force-pushed the fix_pthread_cond branch 2 times, most recently from 216afde to b7e7f21 on January 16, 2025 10:21
Member

@lupyuen lupyuen left a comment

Tested OK on rv-virt:knsh64. Thanks :-)
https://gist.github.com/lupyuen/03e520876f9ba15a64918d5faba663b4

nsh> uname -a
NuttX 10.4.0 b7e7f216c2 Jan 16 2025 18:29:57 risc-v rv-virt
nsh> ostest
ostest_main: Exiting with status 0

@pussuw
Contributor Author

pussuw commented Jan 16, 2025

Is someone capable of interpreting the sim target error? I tried running the target / citest locally but it's giving me unrecognized arguments: --json=/....

@lupyuen
Member

lupyuen commented Jan 16, 2025

I'm running with Docker, might take a while...

sudo docker run \
  -it \
  ghcr.io/apache/nuttx/apache-nuttx-ci-linux:latest \
  /bin/bash
cd
git clone https://github.com/tiiuae/nuttx --branch fix_pthread_cond
git clone https://github.com/apache/nuttx-apps apps
pushd nuttx ; echo NuttX Source: https://github.com/apache/nuttx/tree/$(git rev-parse HEAD) ; popd
pushd apps  ; echo NuttX Apps: https://github.com/apache/nuttx-apps/tree/$(git rev-parse HEAD) ; popd
cd nuttx/tools/ci
./cibuild.sh -c -A -N -R testlist/sim-01.dat 

## TODO: Dump the CI Test Log
ls -l ~/nuttx/boards/sim/sim/sim/configs/citest/logs/sim/sim
cat ~/nuttx/boards/sim/sim/sim/configs/citest/logs/sim/sim/*

## Repeat for risc-v-05
./cibuild.sh -c -A -N -R testlist/risc-v-05.dat 
ls -l ~/nuttx/boards/risc-v/qemu-rv/rv-virt/configs/citest/logs/rv-virt/qemu
cat ~/nuttx/boards/risc-v/qemu-rv/rv-virt/configs/citest/logs/rv-virt/qemu/*

@pussuw
Contributor Author

pussuw commented Jan 16, 2025

Looks like some sort of deadlock, let me figure it out.

@lupyuen
Member

lupyuen commented Jan 16, 2025

Hmmm strange...

  • I ran both sim-01/citest and risc-v-05/citest
  • Both CI Test Log files are blank
  • Except risc-v-05/citest shows "ABC". Maybe NuttX hung during startup?
  • Here are the JSON files from both CI tests, which are harder to read:
      • sim-01-pytest.json
      • risc-v-05-pytest.json

@pussuw
Contributor Author

pussuw commented Jan 16, 2025

The problem is related to cond->mutex somehow. If I remove locking / unlocking it from pthread_cond_broadcast, sim:citest boots and runs.

Another thing I noticed is that if I remove C++ support, the system boots and the tests pass. The first thing the system does in flat mode is run the static C++ constructors:

CHelloWorld: Constructor: mSecret=42

So the issue must be somewhere there. I still don't understand what's wrong with the lock, nothing even calls pthread_cond_broadcast as far as I can see.

@pussuw
Contributor Author

pussuw commented Jan 17, 2025

Using compare-and-exchange seems to have done the trick. The last thing that remains is to verify this is still POSIX compliant. I guess CI (sim:citest) runs the LTP test cases?
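
The patch itself is not shown in this conversation, so the following is only a sketch of the general compare-and-exchange technique, written with C11 `<stdatomic.h>` rather than the NuttX atomic.h API. The function name `claim_waiter` is hypothetical. The idea is that the "is the count above zero?" check and the decrement succeed or fail as a single atomic step, so two concurrent signalers can never both claim the same waiter and no lock is needed.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Try to claim one waiter. Returns true if the count was above zero
 * and this caller decremented it, false if there was nobody to wake.
 */

static bool claim_waiter(atomic_int *wait_count)
{
  int count = atomic_load(wait_count);

  while (count > 0)
    {
      /* Succeeds only if *wait_count still equals count. On failure
       * (including a spurious one), count is reloaded with the current
       * value and the loop condition is checked again.
       */

      if (atomic_compare_exchange_weak(wait_count, &count, count - 1))
        {
          return true;
        }
    }

  return false;
}
```

A signaler would wake one waiter only when this returns true; a broadcast could keep calling it until it returns false.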

@lupyuen
Member

lupyuen commented Jan 17, 2025

@pussuw
Contributor Author

pussuw commented Jan 17, 2025

Yep. LTP runs on rv-virt:citest too:

* https://github.com/NuttX/nuttx/actions/runs/12822145965/job/35754623191#step:7:149

Ok, if the tests pass I consider this issue resolved. All my local tests show green (and my original problem is now gone).

Member

@lupyuen lupyuen left a comment

Tested OK on rv-virt:knsh64. Thanks :-)
https://gist.github.com/lupyuen/8f8aa014ddaf9b3623e6c834fe861764

nsh> uname -a
NuttX 10.4.0 c58794a955 Jan 17 2025 17:13:46 risc-v rv-virt
nsh> ostest
ostest_main: Exiting with status 0

The load/compare and RMW to wait_count need protection. Using atomic
operations should resolve both issues.

NOTE:
The assumption that the user will call pthread_cond_signal /
pthread_cond_broadcast with the mutex given to pthread_cond_wait held is
simply not true. It MAY hold it, but it is not forced. Thus, using the
user space lock for protecting the wait counter as well is not valid!


5 participants