Internal monitor impl not using coop mutex causing deadlocks on Android. #112358

lateralusX · 2025-02-10T17:54:05Z

On Android we have seen ANR issues, like the one described in #111485. After investigating several dumps that include all threads, it turns out that we can end up in a deadlock when initializing a monitor: that code path did not use a coop mutex, so the owner of the lock could end up in GC code while holding it, leading to a deadlock if another thread was about to take the same monitor init lock. In several dumps we see the following two threads:

Thread 1:

syscall+28
__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+14
NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+384
sgen_gc_lock+105
mono_gc_wait_for_bridge_processing_internal+70
sgen_gchandle_get_target+288
alloc_mon+358
ves_icall_System_Threading_Monitor_Monitor_wait+452
ves_icall_System_Threading_Monitor_Monitor_wait_raw+583

Thread 2:

syscall+28
__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144
NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+652
alloc_mon+105
ves_icall_System_Threading_Monitor_Monitor_wait+452
ves_icall_System_Threading_Monitor_Monitor_wait_raw+583

So in this scenario Thread 1 holds monitor_mutex, which is not a coop mutex, and ends up trying to take the GC lock because it calls mono_gc_wait_for_bridge_processing_internal. Since a GC has already started (and is waiting for STW to complete), Thread 1 blocks while holding monitor_mutex.

Thread 2 tries to lock monitor_mutex as well, and since it is not a coop mutex, the thread blocks in the OS in __futex_wait_ex without changing its Mono thread state to blocking, which prevents the STW from proceeding.
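
To make the stall concrete, here is a minimal model in C of the condition the STW is waiting on. It is only a sketch under simplifying assumptions: the state names and stw_can_complete are hypothetical and not Mono's actual state machine, and it assumes purely cooperative suspend, where the GC can proceed only once no thread is still marked as running. It also assumes Thread 1, parked on the GC lock, is already in the blocking state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified thread states; not Mono's real state machine. */
typedef enum {
    STATE_RUNNING,   /* may touch managed memory; must reach a safepoint */
    STATE_BLOCKING,  /* announced that it is blocked; GC may treat it as stopped */
    STATE_SUSPENDED  /* parked at a safepoint */
} model_state_t;

/*
 * Assumption of this model: under purely cooperative suspend, the
 * stop-the-world phase finishes only when no thread is left RUNNING.
 * A thread stuck in an OS wait while still marked RUNNING never
 * reaches a safepoint, so this keeps returning false.
 */
static bool stw_can_complete(const model_state_t *states, size_t count)
{
    assert(states != NULL);
    for (size_t i = 0; i < count; i++) {
        if (states[i] == STATE_RUNNING)
            return false;
    }
    return true;
}
```

In this model, Thread 2 waiting on a plain mutex corresponds to an entry that stays STATE_RUNNING, so stw_can_complete never becomes true. With a coop-aware mutex the same wait would be recorded as STATE_BLOCKING and the STW could finish.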

The fix is to switch to a coop-aware implementation of monitor_mutex.
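
As an illustration of what "coop aware" means here, the following is a rough sketch of such a lock built on pthreads. It is not the code changed in this PR: coop_mutex_t, thread_info_t and the state constants are hypothetical names for this example. The idea it tries to show is that a contended lock publishes a blocking state before parking in the OS, so a suspend initiator does not have to wait for the thread to reach a safepoint.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical thread states for this sketch only. */
enum { THREAD_RUNNING = 0, THREAD_BLOCKING = 1 };

/* Hypothetical per-thread record; a real runtime tracks far more. */
typedef struct {
    atomic_int state;      /* THREAD_RUNNING or THREAD_BLOCKING */
    int blocking_entries;  /* how many times the slow path was taken */
} thread_info_t;

typedef struct {
    pthread_mutex_t m;
} coop_mutex_t;

static void coop_mutex_init(coop_mutex_t *mutex)
{
    int rc = pthread_mutex_init(&mutex->m, NULL);
    assert(rc == 0);
    (void)rc;
}

static void coop_mutex_lock(coop_mutex_t *mutex, thread_info_t *self)
{
    /* Fast path: uncontended, so no thread state switch is needed. */
    if (pthread_mutex_trylock(&mutex->m) == 0)
        return;

    /* Slow path: announce BLOCKING before waiting in the OS. */
    atomic_store(&self->state, THREAD_BLOCKING);
    self->blocking_entries++;
    pthread_mutex_lock(&mutex->m);

    /* A real runtime would also honor any pending suspend request
     * here before letting the thread return to RUNNING. */
    atomic_store(&self->state, THREAD_RUNNING);
}

static void coop_mutex_unlock(coop_mutex_t *mutex)
{
    pthread_mutex_unlock(&mutex->m);
}
```

The try-lock fast path keeps the uncontended case as cheap as a plain mutex; only a thread that is actually about to wait pays for the state transition.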

Normally this should have been resolved on Android, since we run hybrid suspend: we should be able to run a signal handler on the blocked thread that suspends it, letting the STW continue. For some reason the signal does not appear to have been executed in this case, which puts the app under coop suspend limitations.

This fix takes care of the deadlock, but if there are issues running signals on Android, then threads not attached to the runtime using the coop attach methods could end up in similar situations, blocking STW.


steveisok · 2025-02-10T21:59:02Z

/backport to release/9.0

steveisok · 2025-02-10T21:59:10Z

/backport to release/8.0

github-actions · 2025-02-10T21:59:11Z

Started backporting to release/9.0: https://github.com/dotnet/runtime/actions/runs/13251250955

github-actions · 2025-02-10T21:59:23Z

Started backporting to release/8.0: https://github.com/dotnet/runtime/actions/runs/13251252723



lateralusX requested review from BrzVlad, steveisok and vitek-karas as code owners February 10, 2025 17:54

dotnet-issue-labeler bot added the area-VM-meta-mono label Feb 10, 2025

dotnet-policy-service bot assigned lateralusX Feb 10, 2025

BrzVlad approved these changes Feb 10, 2025


steveisok merged commit fe9e119 into dotnet:main Feb 10, 2025
71 of 73 checks passed

This was referenced Feb 10, 2025

[release/9.0] Internal monitor impl not using coop mutex causing deadlocks on Android. #112373

Merged

[release/8.0] Internal monitor impl not using coop mutex causing deadlocks on Android. #112374

Merged

lateralusX mentioned this pull request Feb 12, 2025

Android Not Responding with [split_config.arm64_v8a.apk!libmonosgen-2.0.so] mono_threads_attach_coop #111485

Open

github-actions bot locked and limited conversation to collaborators Mar 13, 2025
