From 624400877ec9f64889159532325620b4e4a66349 Mon Sep 17 00:00:00 2001 From: lateralusX Date: Mon, 10 Feb 2025 18:52:24 +0100 Subject: [PATCH] Internal monitor impl not using coop mutex causing deadlocks on Android. On Android we have seen ANR issues, like the one described in https://github.com/dotnet/runtime/issues/111485. After investigating several different dumps including all threads it turns out that we could end up in a deadlock when init a monitor since that code path didn't use a coop mutex and owner of lock could end up in GC code while holding that lock, leading to deadlock if another thread was about to lock the same monitor init lock. In several dumps we see the following two threads: Thread 1: syscall+28 __futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+14 NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+384 sgen_gc_lock+105 mono_gc_wait_for_bridge_processing_internal+70 sgen_gchandle_get_target+288 alloc_mon+358 ves_icall_System_Threading_Monitor_Monitor_wait+452 ves_icall_System_Threading_Monitor_Monitor_wait_raw+583 Thread 2: syscall+28 __futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144 NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+652 alloc_mon+105 ves_icall_System_Threading_Monitor_Monitor_wait+452 ves_icall_System_Threading_Monitor_Monitor_wait_raw+583 So in this scenario Thread 1 holds monitor_mutex that is not a coop mutex and end up trying to take GC lock, since it calls, mono_gc_wait_for_bridge_processing_internal, but since a GC is already started (waiting on STW to complete), Thread 1 will block holding monitor_mutex. Thread 2 will try to lock monitor_mutex as well, and since its not a coop mutex it will block on OS __futex_wait_ex without changing Mono thread state to blocking, preventing the STW from processing. Fix is to switch to coop aware implementation of monitor_mutex. Normally this should have been resolved on Android since we run hybrid suspend meaning we should be able to run a signal handler on the blocking thread that would suspend it meaning that STW would continue, but for some reason the signal can't have been executed in this case putting the app under coop suspend limitations. This fix will take care of the deadlock, but if there are issues running Signals on Android, then threads not attached to runtime using coop attach methods could end up in similar situations blocking STW. --- src/mono/mono/metadata/monitor.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/mono/mono/metadata/monitor.c b/src/mono/mono/metadata/monitor.c index 6a72695e9fbedc..2e08f94d3361e7 100644 --- a/src/mono/mono/metadata/monitor.c +++ b/src/mono/mono/metadata/monitor.c @@ -82,9 +82,9 @@ struct _MonitorArray { MonoThreadsSync monitors [MONO_ZERO_LEN_ARRAY]; }; -#define mono_monitor_allocator_lock() mono_os_mutex_lock (&monitor_mutex) -#define mono_monitor_allocator_unlock() mono_os_mutex_unlock (&monitor_mutex) -static mono_mutex_t monitor_mutex; +#define mono_monitor_allocator_lock() mono_coop_mutex_lock (&monitor_mutex) +#define mono_monitor_allocator_unlock() mono_coop_mutex_unlock (&monitor_mutex) +static MonoCoopMutex monitor_mutex; static MonoThreadsSync *monitor_freelist; static MonitorArray *monitor_allocated; static int array_size = 16; @@ -255,7 +255,7 @@ lock_word_new_flat (gint32 owner) void mono_monitor_init (void) { - mono_os_mutex_init_recursive (&monitor_mutex); + mono_coop_mutex_init_recursive (&monitor_mutex); } static int