Internal monitor impl not using coop mutex causing deadlocks on Android. #112358

Merged

lateralusX
Member

@lateralusX lateralusX commented Feb 10, 2025

On Android we have seen ANR issues, like the one described in #111485. After investigating several dumps that included all threads, it turns out that we could end up in a deadlock when initializing a monitor: that code path didn't use a coop mutex, so the owner of the lock could end up in GC code while still holding it, leading to a deadlock if another thread was about to take the same monitor init lock. In several dumps we see the following two threads:

Thread 1:

syscall+28
__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+14
NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+384
sgen_gc_lock+105
mono_gc_wait_for_bridge_processing_internal+70
sgen_gchandle_get_target+288
alloc_mon+358
ves_icall_System_Threading_Monitor_Monitor_wait+452
ves_icall_System_Threading_Monitor_Monitor_wait_raw+583

Thread 2:

syscall+28
__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144
NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+652
alloc_mon+105
ves_icall_System_Threading_Monitor_Monitor_wait+452
ves_icall_System_Threading_Monitor_Monitor_wait_raw+583

So in this scenario Thread 1 holds monitor_mutex, which is not a coop mutex, and ends up trying to take the GC lock because it calls mono_gc_wait_for_bridge_processing_internal. Since a GC has already started (and is waiting on STW to complete), Thread 1 will block while holding monitor_mutex.

Thread 2 will try to lock monitor_mutex as well, and since it's not a coop mutex it will block in the OS on __futex_wait_ex without changing its Mono thread state to blocking, preventing the STW from making progress.
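
To make the stall concrete, here is a minimal toy model of the condition a cooperative STW waits for. It is a sketch, not runtime code: the `toy_state` enum and `toy_stw_can_complete` are hypothetical names invented for illustration, and it assumes that Thread 1's wait on the GC lock is visible to the runtime as a blocking state while Thread 2's wait on the plain mutex is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical thread states for illustration; not Mono's real state machine. */
typedef enum {
    TOY_RUNNING,   /* executing runtime code; must reach a safepoint first */
    TOY_BLOCKING,  /* declared itself blocked; the GC may proceed without it */
    TOY_SUSPENDED  /* parked at a safepoint */
} toy_state;

/* Under cooperative suspend, STW completes only when no thread is RUNNING. */
static bool toy_stw_can_complete(const toy_state *threads, size_t count)
{
    assert(threads != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (threads[i] == TOY_RUNNING)
            return false; /* this thread never told the runtime it is blocked */
    }
    return true;
}
```

In this model, the dumps above correspond to `{ TOY_BLOCKING, TOY_RUNNING }`: Thread 2 sleeps in the kernel but still looks like it is running, so the check never succeeds and Thread 1 never gets the GC lock.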

The fix is to switch monitor_mutex to a coop-aware implementation.
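
The difference between a plain and a coop-aware lock can be sketched with pthreads alone. This is an illustrative model under stated assumptions, not the change in this PR: `demo_thread`, `demo_plain_lock`, `demo_coop_lock` and `demo_observe_contended_state` are hypothetical names, and the published `state` field merely stands in for the Mono thread state that the STW inspects.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef enum { DEMO_RUNNING, DEMO_BLOCKING } demo_state;

typedef struct {
    pthread_mutex_t *mutex;    /* stand-in for monitor_mutex */
    atomic_int state;          /* demo_state published for the "GC" to read */
    atomic_bool about_to_lock; /* set just before the worker calls its lock */
    bool coop;                 /* true: coop-aware lock, false: plain OS lock */
} demo_thread;

/* Plain lock: the thread may sleep in the OS while still marked RUNNING. */
static void demo_plain_lock(demo_thread *t)
{
    pthread_mutex_lock(t->mutex);
}

/* Coop-aware lock: publish BLOCKING before a potentially long OS wait. */
static void demo_coop_lock(demo_thread *t)
{
    atomic_store(&t->state, DEMO_BLOCKING);
    pthread_mutex_lock(t->mutex);
    atomic_store(&t->state, DEMO_RUNNING);
}

static void *demo_worker(void *arg)
{
    demo_thread *t = arg;
    atomic_store(&t->about_to_lock, true);
    if (t->coop)
        demo_coop_lock(t);
    else
        demo_plain_lock(t);
    pthread_mutex_unlock(t->mutex);
    return NULL;
}

/* Returns the state a worker publishes while it contends for a held mutex. */
static demo_state demo_observe_contended_state(bool coop)
{
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    demo_thread t;
    pthread_t tid;

    t.mutex = &mutex;
    t.coop = coop;
    atomic_init(&t.state, DEMO_RUNNING);
    atomic_init(&t.about_to_lock, false);

    pthread_mutex_lock(&mutex); /* the caller plays the lock owner (Thread 1) */
    int rc = pthread_create(&tid, NULL, demo_worker, &t);
    assert(rc == 0);
    (void)rc;

    while (!atomic_load(&t.about_to_lock))
        sched_yield();
    if (coop) {
        /* The worker cannot get past the lock while we hold the mutex. */
        while (atomic_load(&t.state) != DEMO_BLOCKING)
            sched_yield();
    }
    demo_state seen = (demo_state)atomic_load(&t.state);

    pthread_mutex_unlock(&mutex);
    pthread_join(tid, NULL);
    pthread_mutex_destroy(&mutex);
    return seen;
}
```

Calling `demo_observe_contended_state(false)` models the behavior before the fix (the waiter still looks like it is running), while `demo_observe_contended_state(true)` models a coop-aware lock, where the waiter has already declared itself blocked and an STW would not need to wait for it.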

Normally this should not have been a problem on Android, since we run hybrid suspend: we should be able to run a signal handler on the blocked thread that suspends it, letting the STW continue. For some reason the signal could not be executed in this case, which put the app under coop suspend limitations.

This fix takes care of the deadlock, but if there are issues running signals on Android, then threads not attached to the runtime using the coop attach methods could end up in similar situations, blocking STW.

@steveisok steveisok merged commit fe9e119 into dotnet:main Feb 10, 2025
71 of 73 checks passed
@steveisok
Member

/backport to release/9.0

@steveisok
Member

/backport to release/8.0

Contributor

Started backporting to release/9.0: https://github.com/dotnet/runtime/actions/runs/13251250955

Contributor

Started backporting to release/8.0: https://github.com/dotnet/runtime/actions/runs/13251252723

grendello added a commit to grendello/runtime that referenced this pull request Feb 11, 2025
* main:
  Code clean up in AP for NonNull* (dotnet#112027)
  JIT: Invalidate LSRA's DFS tree if we aren't running new layout phase (dotnet#112364)
  Update dependencies from https://github.com/dotnet/source-build-reference-packages build 20250204.2 (dotnet#112339)
  Add doc on OS onboarding (dotnet#112026)
  Add `TypeName` APIs to simplify metadata lookup. (dotnet#111598)
  Internal monitor impl not using coop mutex causing deadlocks on Android. (dotnet#112358)
  Do not run NAOT arm64 OSX testing on all PRs (dotnet#112342)
  Special-case empty enumerables in AsyncEnumerable (dotnet#112321)
  Have mono handle ConvertToIntegerNative for Double and Single (dotnet#112206)
  Update dependencies from https://github.com/dotnet/arcade build 20250206.4 (dotnet#112338)
  System.Configuration.ConfigurationManager.Tests: use Assembly.Location to determine ThisApplicationPath. (dotnet#112231)
  Force write of local file header when "version needed to extract" changes (dotnet#112032)
  JIT: Don't reorder handler blocks (dotnet#112292)
  [RISC-V] Synthesize some floating constants inline (dotnet#111529)
  Enable `SA1000`: Spacing around keywords (dotnet#112302)
  Fix relocs for linux-riscv64 AOT (dotnet#112331)
steveisok pushed a commit that referenced this pull request Feb 12, 2025
…id. (#112373)

Backport of #112358 to release/9.0

Co-authored-by: lateralusX <[email protected]>
steveisok pushed a commit that referenced this pull request Feb 12, 2025
…id. (#112374)

Backport of #112358 to release/8.0

Co-authored-by: lateralusX <[email protected]>
@github-actions github-actions bot locked and limited conversation to collaborators Mar 13, 2025