Increased amount of ANRs after disabling concurrent GC #9365

TimBurik · 2024-10-03T16:26:20Z

Android framework version

net8.0-android

Affected platform version

.NET 8.0.303

Description

After switching from Xamarin.Android to .Net8 we used to get a lot of native crashes in monosgen, for example 222003f52fa3496385d14a89c778a6e4-symbolicated.txt

After long investigation (and a hint from the issue dotnet/runtime#100311) It turns out that concurrent SGen is actually enabled by default in .net-android

android/src/Xamarin.Android.Build.Tasks/Microsoft.Android.Sdk/targets/Microsoft.Android.Sdk.DefaultProperties.targets

Line 9 in df9111d

    
           <AndroidEnableSGenConcurrent Condition=" '$(AndroidEnableSGenConcurrent)' == '' ">true</AndroidEnableSGenConcurrent>

so we explicitly disable it - and now the amount of native crashes in monosgen is minimal, but instead we are getting a lot of ANR reports in Sentry and GooglePlay Console.

ANRs seems to be reported using Android's ApplicationExitInfo mechanism, according to stacktraces main thread seems to be blocked by awaiting native mutex somewhere in the monosgen (example: anr_stacktrace.txt)

Additional information, which might be relevant:

according to GooglePlay Console, majority of the ANRs related to the broadcast of either Intent { act=android.intent.action.SCREEN_OFF } or Intent { act=android.intent.action.SCREEN_ON };
majority of ANRs seems to be happening when application is in the background;

Steps to Reproduce

Unfortunately, we don't have exact steps to reproduce.
The only thing that is sure that it is happening when targeting .net-android34.0 (version for Xamarin.Android doesn't have this issue) and issue started happening after adding the following to the csproj:
<AndroidEnableSGenConcurrent>false</AndroidEnableSGenConcurrent>

Did you find any workaround?

No workaround found yet

Relevant log output

No response

The text was updated successfully, but these errors were encountered:

grendello · 2024-10-04T08:23:26Z

@TimBurik I'm afraid we can't do anything for you here, this appears to be an issue split between MonoVM and the Sentry Native SDK. @lambdageek do you think someone on your side could take a look to see if anything can be done here?

There's an issue being worked on related to Sentry, #9055, perhaps this here issue is also caused by the same problems?

@supervacuus, sorry for tagging you out of the blue, but would you be able to look into the Sentry side of things here?

I realize we have precious little information, but perhaps someone will be able to spot something in their respective areas and help out.

TimBurik · 2024-10-04T08:56:15Z

@grendello thank you for the response.

Regarding #9055 - we are aware of this issue, and we have a separate native crash group, which is most likely related to this issue. But the ANRs doesn't seem to be related to this issue and to Sentry, because we are seeing exactly the same picture in the GooglePlay Console - increased amount of ANRs, all containing libmonosgen-2.0 in stack traces

grendello · 2024-10-04T09:17:36Z

The GC ANRs might be related to something inside the GC bridge, hopefully @lambdageek will be able to help out here. However, this ANR looks familiar:

#00  pc 0x0000000000089cf0  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+32)
#01  pc 0x000000000008e6d8  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+148)
#02  pc 0x000000000009c3dc  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+112)
#03  pc 0x00000000001eeb20  /data/app/~~x6iHlhee06D2KSrJ3_D8mw==/<app.bundle.id>-UM4qcDy_vV_DZ6DnDt2oCA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so

You are most likely not using marshal methods, since they weren't enabled by default in NET8, but we've fixed an issue recently which was related to them and one of the repros had the above stack trace as well. The problem was related to Java ("native") threads being incorrectly attached to the MonoVM runtime and putting the GC in a bad state. You can find more info about it in this and the following comments. @filipnavara, does this look familiar?

Perhaps something in your app causes a similar corruption of the GC state?

@TimBurik if you're able to reproduce the ANRs locally, would you be able to test with NET9 rc2?

TimBurik · 2024-10-04T10:36:04Z

@grendello thank you for the pointers!

We are planning to test the app on .NET 9 in the nearest future, we would make sure to use RC2 for those tests (when RC2 would be available).

Also, we have added debug symbols for libmonosgen-2.0 into GooglePlay Console and now we have more detailed logs. For example, this is the reported main thread stack trace from one of the ANRs:

  #00  pc 0x0000000000078dec  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
  #01  pc 0x000000000007d7e0  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
  #02  pc 0x000000000008ba8c  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+108)
  #03  pc 0x00000000001eeb20  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204) 
  #04  pc 0x00000000001ee844  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+323) 
  #05  pc 0x00000000002a50f4  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349) 
  #06  pc 0x00000000002a4e84  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159) 
  #07  pc 0x00000000002c5350  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3974) 
  #08  pc 0x00000000002c21bc  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637) 
  #09  pc 0x00000000002c20ec  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616) 
  #10  pc 0x00000000002be8b8  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279) 
  #11  pc 0x00000000002bed9c  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj+454) 
  #12  pc 0x00000000002a686c  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_obj+690) 
  #13  pc 0x0000000000004438

grendello · 2024-10-04T10:47:17Z

@TimBurik this more and more looks like the issue fixed by @filipnavara, he declared he'd take a look next week - it would be great if you were able to test with NET9 RC2 before then :)

grendello · 2024-10-04T10:57:30Z

@TimBurik if RC2 isn't available by next week, do try with RC1 - it's to see if the issues you're seeing still even exist in NET9

supervacuus · 2024-10-04T11:13:10Z

I've been following the thread since you tagged me. It seems this isn't sentry-related. I've informed sentry peeps, just in case.

grendello · 2024-10-04T11:15:07Z

@supervacuus thanks, I tagged you just in case - this is all low-level stuff, it might be affecting Sentry as well in some ways.

lambdageek · 2024-10-04T13:27:37Z

/cc @vitek-karas

TimBurik · 2024-10-22T09:00:56Z

@grendello sorry for late response
we've just released .Net9-rc2 build to a small population in production (we still don't have a reliable way to reproduce it), and unfortunately we see the same ANR reports in the GooglePlay Console.
Here are some more examples of the main thread stack traces from those reports:

#00  pc 0x000000000004f85c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
#01  pc 0x0000000000054330  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
#02  pc 0x0000000000062d4c  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+108)
#03  pc 0x00000000001faecc  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204)
#04  pc 0x00000000001fadf4  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+329)
#05  pc 0x00000000002b0e24  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349)
#06  pc 0x00000000002b0bcc  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159)
#07  pc 0x00000000002c6ec8  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3988)
#08  pc 0x00000000002c3e44  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637)
#09  pc 0x00000000002c3d74  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616)
#10  pc 0x00000000002c0618  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279)
#11  pc 0x00000000002b2f84  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_vector+1119)
#12  pc 0x000000000000cb00

#00  pc 0x000000000001daec  /system/lib64/libc.so (syscall+28)
#01  pc 0x0000000000028a88  /system/lib64/libc.so (sem_wait+88)
#02  pc 0x00000000001fac20  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_thread_info_wait_for_resume+204)
#03  pc 0x000000000020035c  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_threads_enter_gc_unsafe_region_unbalanced_with_info+503)
#04  pc 0x0000000000283944  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_threads_attach_coop_internal+4712)
#05  pc 0x0000000000283978  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_threads_attach_coop+4731)
#06  pc 0x000000000000b5d8

#00  pc 0x00000000000902fc  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
#01  pc 0x00000000000950d0  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
#02  pc 0x00000000000a3670  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+112)
#03  pc 0x00000000001faecc  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204)
#04  pc 0x00000000001fadf4  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+329)
#05  pc 0x00000000002b0e24  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349)
#06  pc 0x00000000002b0bcc  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159)
#07  pc 0x00000000002c6ec8  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3988)
#08  pc 0x00000000002c3e44  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637)
#09  pc 0x00000000002c3d74  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616)
#10  pc 0x00000000002c0618  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279)
#11  pc 0x00000000002c0afc  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj+454)
#12  pc 0x00000000002b2588  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_obj+690)
#13  pc 0x0000000000274ea0  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_object_new_specific_checked+5494)
#14  pc 0x00000000002746e4  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (extract_this_ptr+5341)
#15  pc 0x0000000000274b70  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_runtime_try_invoke_byrefs+5202)
#16  pc 0x000000000022fa28  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (ves_icall_InternalInvoke+3629)
#17  pc 0x0000000000238e40  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (ves_icall_InternalInvoke_raw+370)
#18  pc 0x0000000000004494

#00  pc 0x000000000008921c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
#01  pc 0x000000000008db00  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
#02  pc 0x000000000009c5a8  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+104)
#03  pc 0x00000000001faecc  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204)
#04  pc 0x00000000001fadf4  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+329)
#05  pc 0x00000000002b0e24  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349)
#06  pc 0x00000000002b0bcc  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159)
#07  pc 0x00000000002c6ec8  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3988)
#08  pc 0x00000000002c3e44  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637)
#09  pc 0x00000000002c3d74  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616)
#10  pc 0x00000000002c0618  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279)
#11  pc 0x00000000002b3168  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_string+1208)
#12  pc 0x000000000026da44  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_string_new_size_checked+6308)
#13  pc 0x000000000027690c  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_string_new_size_handle+6282)
#14  pc 0x000000000023d97c  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (ves_icall_System_String_FastAllocateString_raw+550)
#15  pc 0x0000000000008024

BrzVlad · 2024-10-22T11:16:23Z

I don't know how the ANR system works, I would imagine they happen if the app is non responsive for a while. If this is the case, I'm not sure whether this issue can actually be considered to be a bug. We seem to have a problem with the concurrent GC in the runtime and this issue needs to be investigated somehow. Disabling concurrent GC will just lead to longer GC pause times which seem to trigger this ANR. Is a GC pause time of 0.5 - 1s supposed to trigger this ANR ? If this is true, it seems to me that the best way to move forward would be to share the application so we can see if the serial GC takes longer than expected for whatever reason and mainly to try some debug flags to see if we detect some problems with the concurrent collector.

TimBurik · 2024-10-22T12:01:05Z

@BrzVlad according to the Android documentation, one of the reasons for ANR is when BroadcastReceiver has failed to process an event within 5 seconds - this seems to be our case, as there's an additional description in the GooglePlay Console like "Broadcast of Intent { act=android.intent.action.SCREEN_ON }"

By the way, let me share a full report from the GooglePlay Console, which contains stack traces from all the threads: anr_all_stacktraces.txt
Maybe it contains some hints to what the main thread may be waiting for.

I could also try to gather additional information regarding the GC performance, but in general we didn't notice any significant performance degradation after disabling concurrent GC

grendello · 2024-10-22T12:12:19Z

@TimBurik is the BroadcastReceiver in your case running in an isolated process?

@BrzVlad if it's a problem with BroadcastReceiver, then perhaps it's "simply" a matter of a Java thread incorrectly attached to MonoVM?

TimBurik · 2024-10-22T13:22:27Z

@grendello in our case most of the ANRs are related to the SystemEventsBroadcastReceiver from Sentry, which I believe is run in the same process by default
https://github.com/getsentry/sentry-java/blob/4988d5bf9e0a5be708853cf53d07b8503e4c77be/sentry-android-core/src/main/java/io/sentry/android/core/SystemEventsBreadcrumbsIntegration.java#L124-L142

But this is not the only source for ANRs for us, as we also have reports:

from other BroadcastReceivers, for example FirebaseInstanceIdReceiver: FirebaseInstanceIdReceiver_ANR.txt;
from JobScheduleServicer: JobInfoSchedulerService_ANR.txt;
and even due to input dispatching timeout: Input_Dispatching_timed_out_ANR.txt;

All of those reports have very similar main thread stack traces, related to GC.

It's even seems that those broadcast events, service calls and input events are not really causing the issue, and the main thread is already blocked by the time these events occur.

TimBurik · 2024-10-23T15:34:45Z

@grendello @BrzVlad is there anything else we could do to help you shed the light on this issue?
Unfortunately, I don't believe we would be able to share the app, and, as I've mentioned before, we don't really have a reliable way to reproduce the issue yet.. But we sure could check other theories or gather additional information, if you have any ideas.

BrzVlad · 2024-10-23T16:33:25Z

Seing that this issue happens after disabling concurrent GC I was biased into assuming that the GCs simply take longer to run and this is causing issues. After a second look at the stacktraces, it indeed looks like the suspend machinery is hanging. It makes no sense to me why this issue would only happen with serial GC though. We have an env var MONO_SLEEP_ABORT_LIMIT which makes it such that the application will abort if the suspend takes longer than the provided limit in ms, while also printing each thread state to the console. I think you should set this var to something that wouldn't trigger ANRs, like 1000ms maybe (suspending threads shouldn't take more than a few milliseconds). If we get explicit crashes in the runtime then this should confirm that the issue is in our runtime and maybe also make it easier to reproduce if we set a more aggressive suspend limit locally.

TimBurik · 2024-11-07T14:10:41Z

@BrzVlad could you also give a hint to where the threads dump could be found?
We did manage to find a pretty consistent way to reproduce the "stop the world hang" (although with an extreme abort limit of 40ms). Hang is always happening after a few minutes when application is backgrounded, but never happens while the app is in use.

In logcat we see the message:
E suspend_thread suspend took 40 ms, which is more than the allowed 40 ms
but the threads dump, which should be printed before, is nowhere to be found.

We are using the following values as Android environment variables:

MONO_SLEEP_ABORT_LIMIT=40
MONO_LOG_LEVEL=info
MONO_LOG_MASK=gc

and also we are using the adb shell setprop debug.mono.log gc command.

BrzVlad · 2024-11-07T14:57:06Z

I would have expected that the suspend state of each thread to be printed but it looks like that logging path doesn't go through the android log but rather to the console. It would be useful for us to try to reproduce this failure but given the abort limit is not excessive and it only happens while in background, it makes me wonder whether it just the os scheduling the app less frequently and the suspend process just taking a bit longer than usual, because threads reach GC safepoint location slower.

TimBurik · 2024-11-07T15:43:55Z

This issue doesn't seem to affect every GC in the background, in the log cat we also see a couple of GC-related logs:

2024-11-07 16:08:11.722 19003-19011 <app.id>  I  Background concurrent copying GC freed 5559KB AllocSpace bytes, 57(1168KB) LOS objects, 49% free, 11MB/23MB, paused 3.069ms,1.275ms total 242.239ms
2024-11-07 16:08:55.162 19003-19003 <app.id>  I  Explicit concurrent copying GC freed 12MB AllocSpace bytes, 20(400KB) LOS objects, 49% free, 8040KB/15MB, paused 442us,230us total 150.742ms
.......
2024-11-07 16:09:31.504 19003-19011 <app.id>  I  Background young concurrent copying GC freed 6048KB AllocSpace bytes, 0(0B) LOS objects, 38% free, 9960KB/15MB, paused 577us,289us total 106.764ms
2024-11-07 16:09:58.109 19003-19011 <app.id>  I  Background concurrent copying GC freed 4932KB AllocSpace bytes, 0(0B) LOS objects, 49% free, 10MB/21MB, paused 196us,144us total 100.312ms
.......
2024-11-07 16:10:18.513 19003-19003 <app.id>  E  suspend_thread suspend took 40 ms, which is more than the allowed 40 ms

I'm also gonna try to reproduce this issue with bigger values for MONO_SLEEP_ABORT_LIMIT, just to make sure that this is a proper "hanging", not just performance degradation spike of the background process. But it would be sure useful to collect as much information as possible when this reproduction happens. Do you know if there is any way to extract those threads dump messages from the Android device?

TimBurik · 2024-11-08T13:09:54Z

@BrzVlad Just managed to reproduce the issue with limit value of 1s:

E suspend_thread suspend took 1000 ms, which is more than the allowed 1000 ms

The reproduction steps are the same - put application to the background and wait, although this time it required also several tries (foregrounding/bacgrounding app back) and considerable amount of waiting (30m+ after each backgrounding)

BrzVlad · 2024-11-12T07:53:03Z

Would it be possible to provide the sample app with some reproduction steps ?

TimBurik · 2024-11-12T15:48:04Z

Let me try to reproduce the issue on the sample app, and also discuss with the legal team possibility to share our production code.

Meanwhile, I have found a similar issue with a deadlock in GC stopping the world mono/mono#21582, which ended up being fixed by changing the value of the MONO_THREADS_SUSPEND. @BrzVlad do you know what is the default thread suspension mode for .net-android? Could this mode be related to this issue?

BrzVlad · 2024-11-12T16:14:27Z

That is a good observation. The default suspend mode should by hybrid. It definitely looks like it could be related, you should try using the preemptive mode.

TimBurik · 2024-11-27T11:00:27Z

@BrzVlad We have just tested MONO_THREADS_SUSPEND=preemptive on a small population in production.
The good news is that this option completely vanished original ANRs (example from before):

  #00  pc 0x0000000000078dec  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
  #01  pc 0x000000000007d7e0  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
  #02  pc 0x000000000008ba8c  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+108)
  #03  pc 0x00000000001eeb20  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204) 
  #04  pc 0x00000000001ee844  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+323) 
  #05  pc 0x00000000002a50f4  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349) 
  #06  pc 0x00000000002a4e84  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159) 
  #07  pc 0x00000000002c5350  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3974) 
  #08  pc 0x00000000002c21bc  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637) 
  #09  pc 0x00000000002c20ec  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616) 
  #10  pc 0x00000000002be8b8  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279) 
  #11  pc 0x00000000002bed9c  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj+454) 
  #12  pc 0x00000000002a686c  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_obj+690) 
  #13  pc 0x0000000000004438

The bad news is that instead, we now see a brand new ANR groups, related to suspend_signal_handler. Here are a couple of examples:

"main" tid=1 Native
  #00  pc 0x00000000000af72c  /apex/com.android.runtime/lib64/bionic/libc.so (__rt_sigsuspend+12)
  #01  pc 0x000000000006b3ec  /apex/com.android.runtime/lib64/bionic/libc.so (sigsuspend+56)
  #02  pc 0x00000000001ff994  /data/app/~~H3BQ2UayXApNge7gQvM5xQ==/<app.bundle.id>-2GUjyPnwyJj-TZHyg8V87g==/split_config.arm64_v8a.apk (suspend_signal_handler+200)
  #03  pc 0x00000000000005d0  [vdso] (__kernel_rt_sigreturn)
  #04  pc 0x00000000000afd28  /apex/com.android.runtime/lib64/bionic/libc.so (__epoll_pwait+8)
  #05  pc 0x000000000001874c  /system/lib64/libutils.so (android::Looper::pollInner+192)
  #06  pc 0x0000000000018628  /system/lib64/libutils.so (android::Looper::pollOnce+124)
  #07  pc 0x000000000018f184  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce+48)
  at android.os.MessageQueue.nativePollOnce (Native method)
  at android.os.MessageQueue.next (MessageQueue.java:335)
  at android.os.Looper.loopOnce (Looper.java:187)
  at android.os.Looper.loop (Looper.java:319)
  at android.app.ActivityThread.main (ActivityThread.java:8919)
  at java.lang.reflect.Method.invoke (Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run (RuntimeInit.java:578)
  at com.android.internal.os.ZygoteInit.main (ZygoteInit.java:1103)

"<app.bundle.id>" tid=7924 Native
  #00  pc 0x000000000006a0c0  /system/lib64/libc.so (__rt_sigsuspend+4)
  #01  pc 0x0000000000029684  /system/lib64/libc.so (sigsuspend+44)
  #02  pc 0x00000000001ff994  /data/app/<app.bundle.id>-_I4CSOAWam382fA8t14IEg==/lib/arm64/libmonosgen-2.0.so (suspend_signal_handler+200)
  #03  pc 0x00000000000005dc  [vdso:000000737837f000]
  #04  pc 0x0000000000069ef8  /system/lib64/libc.so (__epoll_pwait+4)
  #05  pc 0x000000000001f700  /system/lib64/libc.so (epoll_pwait+48)
  #06  pc 0x0000000000015d04  /system/lib64/libutils.so (android::Looper::pollInner(int)+144)
  #07  pc 0x0000000000015bec  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+108)
  #08  pc 0x000000000011b4d4  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44)
  #09  pc 0x0000000000829dbc  /system/framework/arm64/boot-framework.oat (android.app.NativeActivity.onWindowFocusChangedNative [DEDUPED]+140)
  #10  pc 0x0000000001023328  /system/framework/arm64/boot-framework.oat (android.os.MessageQueue.next+424)
  #11  pc 0x000000000101c480  /system/framework/arm64/boot-framework.oat (android.os.Looper.loop+528)
  #12  pc 0x00000000006fbf08  /system/framework/arm64/boot-framework.oat (android.app.ActivityThread.main+1560)
  #13  pc 0x0000000000509638  /system/lib64/libart.so (art_quick_invoke_static_stub+600)
  #14  pc 0x00000000000d80b4  /system/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+260)
  #15  pc 0x0000000000431120  /system/lib64/libart.so (art::InvokeWithArgArray(art::ScopedObjectAccessAlreadyRunnable const&, art::ArtMethod*, art::ArgArray*, art::JValue*, char const*)+104)
  #16  pc 0x0000000000432c78  /system/lib64/libart.so (art::InvokeMethod(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jobject*, _jobject*, unsigned long)+1460)
  #17  pc 0x00000000003bc968  /system/lib64/libart.so (art::Method_invoke(_JNIEnv*, _jobject*, _jobject*, _jobject*)+52)
  #18  pc 0x00000000002684a4  /system/framework/arm64/boot.oat (java.lang.Class.getDeclaredMethodInternal [DEDUPED]+180)
  #19  pc 0x0000000001778898  /system/framework/arm64/boot-framework.oat (com.android.internal.os.Zygote$MethodAndArgsCaller.run+136)
  #20  pc 0x000000000177cb00  /system/framework/arm64/boot-framework.oat (com.android.internal.os.ZygoteInit.main+3472)
  #21  pc 0x0000000000509638  /system/lib64/libart.so (art_quick_invoke_static_stub+600)
  #22  pc 0x00000000000d80b4  /system/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+260)
  #23  pc 0x0000000000431120  /system/lib64/libart.so (art::InvokeWithArgArray(art::ScopedObjectAccessAlreadyRunnable const&, art::ArtMethod*, art::ArgArray*, art::JValue*, char const*)+104)
  #24  pc 0x0000000000430d7c  /system/lib64/libart.so (art::InvokeWithVarArgs(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jmethodID*, std::__va_list)+424)
  #25  pc 0x0000000000346794  /system/lib64/libart.so (art::JNI::CallStaticVoidMethodV(_JNIEnv*, _jclass*, _jmethodID*, std::__va_list)+624)
  #26  pc 0x00000000000bbe38  /system/lib64/libandroid_runtime.so (_JNIEnv::CallStaticVoidMethod(_jclass*, _jmethodID*, ...)+120)
  #27  pc 0x00000000000be09c  /system/lib64/libandroid_runtime.so (android::AndroidRuntime::start(char const*, android::Vector<android::String8> const&, bool)+844)
  #28  pc 0x0000000000002630  /system/bin/app_process64 (main+1592)
  #29  pc 0x000000000001b890  /system/lib64/libc.so (__libc_init+88)
  #30  pc 0x0000000000001f58  /system/bin/app_process64 (do_arm64_start+80)

"main" tid=1 Native
  #00  pc 0x00000000000b170c  /apex/com.android.runtime/lib64/bionic/libc.so (__rt_sigsuspend+12)
  #01  pc 0x000000000006d2fc  /apex/com.android.runtime/lib64/bionic/libc.so (sigsuspend+56)
  #02  pc 0x00000000001ff994  /data/app/~~w0aV2GqXY3mZlb0MRI-TFQ==/<app.bundle.id>-RsAhK2Rr_Ge0eW0U79pcuA==/split_config.arm64_v8a.apk (suspend_signal_handler+200)
  #03  pc 0x00000000000005d0  [vdso] (__kernel_rt_sigreturn)
  #04  pc 0x00000000002a5034  /data/app/~~w0aV2GqXY3mZlb0MRI-TFQ==/<app.bundle.id>-RsAhK2Rr_Ge0eW0U79pcuA==/split_config.arm64_v8a.apk
  at crc64b02072ca67704022.AndroidView.n_onCreateInputConnection (Native method)
  at crc64b02072ca67704022.AndroidView.onCreateInputConnection (AndroidView.java:65)
  at android.view.inputmethod.InputMethodManager.createInputConnection (InputMethodManager.java:4540)
  at android.view.inputmethod.InputMethodManager.startInputInner (InputMethodManager.java:2790)
  at android.view.inputmethod.InputMethodManager.startInputOnWindowFocusGainInternal (InputMethodManager.java:979)
  at android.view.inputmethod.InputMethodManager.-$$Nest$mstartInputOnWindowFocusGainInternal (unavailable)
  at android.view.inputmethod.InputMethodManager$DelegateImpl.onPostWindowGainedFocus (InputMethodManager.java:828)
  at android.view.ImeFocusController.onPostWindowFocus (ImeFocusController.java:114)
  at android.view.ViewRootImpl.handleWindowFocusChanged (ViewRootImpl.java:4483)
  at android.view.ViewRootImpl.-$$Nest$mhandleWindowFocusChanged (unavailable)
  at android.view.ViewRootImpl$ViewRootHandler.handleMessageImpl (ViewRootImpl.java:6636)
  at android.view.ViewRootImpl$ViewRootHandler.handleMessage (ViewRootImpl.java:6536)
  at android.os.Handler.dispatchMessage (Handler.java:108)
  at android.os.Looper.loopOnce (Looper.java:226)
  at android.os.Looper.loop (Looper.java:328)
  at android.app.ActivityThread.main (ActivityThread.java:9224)
  at java.lang.reflect.Method.invoke (Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run (RuntimeInit.java:594)
  at com.android.internal.os.ZygoteInit.main (ZygoteInit.java:1099)

I'm also attaching full stacktraces just in case:
signal_anr_full1.txt
signal_anr_full2.txt
signal_anr_full3.txt

lateralusX · 2025-02-12T14:59:33Z

@TimBurik, having threads blocking in suspend_signal_handler as part of a stop the world (stw) when using preemptive suspend model is expected (but not forever of course). When running preemptive suspend, Mono uses two signals on Android, SIGPWR and SIGXCPU to coordinate suspend and resume of threads. When stw needs to suspend threads it will issue SIGPWR to them, that will run our suspend_signal_handler on targeted thread that will report back as suspended to the runtime and then waiting for a resume signal inside suspend_signal_handler, that is the call to sigsupend call you see blocking on the threads:

  #00  pc 0x000000000006a0c0  /system/lib64/libc.so (__rt_sigsuspend+4)
  #01  pc 0x0000000000029684  /system/lib64/libc.so (sigsuspend+44)
  #02  pc 0x00000000001ff994  /data/app/<app.bundle.id>-_I4CSOAWam382fA8t14IEg==/lib/arm64/libmonosgen-2.0.so (suspend_signal_handler+200)

When runtime restart threads, it will issue SIGXCPU that will complete sigsuspend wait putting the thread back into running state. Runtime will then wait for all threads to report back before resuming execution.

It is critical that our suspend handlers are the handlers registered for these signals for the lifetime of the process. If something alters that after the runtime has loaded, newly created threads won't have the right signal handlers and if they attach to the runtime, those threads will either block stw to suspend the thrad (won't react to our SIGPWR signal) or prevent us from resuming the thread (won't react to our SIGXCPU signal).

Do you use any native or C#/Java library inside this application that potentially could alter our signal handlers in anyway?

I have looked a several of the attached anr dump files in this issue, but they all seem to be truncated ending right in the middle of a threads callstack. For example signal__anr_full2.txt ends like this when I download it:

"AudioTrack" tid=8508 Native
  #00  pc 0x000000000001daec  /system/lib64/libc.so (syscall+28)
  #01  pc 0x0000000000066f58  /system/lib64/libc.so (pthread_cond_wait+96)
  #02  pc 0x0000000000055e08  /system/lib64/libaudioclient.so (android::AudioTrack::AudioTrackThread::threadLoop()+412)
  #03  pc 0x00000000000116f4  /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+280)
  #04  pc 0x00000000000678b4  /system/lib64/libc.so (__pthread_start(void*)+36)
  #05  pc 0x000000000001ef24  /system/lib64/libc.so (__start_thread+68)

"AudioTrack" tid=8509 Native

Is that something you see in the original files as well or do you have more content?

TimBurik · 2025-02-13T09:15:03Z

@lateralusX we are using sentry-dotnet and google-breakpad for crash analytics, which both seems to be altering SIGXCPU signal handler:
https://github.com/getsentry/sentry-native/blob/011e705f23604fd993ef3eab80c4d125a6b85a1f/src/backends/sentry_backend_crashpad.cpp#L83
https://github.com/google/breakpad/blob/41b6533e5f3dd7f0320ef58608ee32e8e4f132fb/src/client/ios/exception_handler_no_mach.cc#L61
Coincidently, right now we are are testing if disabling both of those systems would help in terms of ANRs while using preemptive suspend mode. Another idea we are planning to test is using cooperative suspend mode instead, since it had some improvements in the neighboring ticket dotnet/runtime#111485

And regarding the callstacks - this is everything that we see in GooglePlay Console, the only alteration to the original reports we did is obfuscating the package name)

lateralusX · 2025-02-13T09:58:40Z

I made a fix for dotnet/runtime#111485 that solves the deadlock that issue end up with using hybrid suspend. Coop and Android might have the same issues as you see when running hybrid, because looking at your callstacks from preemptive suspend I see that there are Mono threads blocked on Java thread pools that we have suspend (using signals), if the same threads attach to Mono runtime under hybrid/coop suspend, then they must have been attached in a way that is hybrid/coop compatible or those threads have violated Mono's threading model ending up in RUNNING state while blocking outside of managed or Mono runtime code.

In the past (Xamarin) Android ran using preemptive suspend model, so it would be good to make sure our signal handlers are not disabled by any 3'rd party code in this app. I know another library that mess with SIGXCPU, FFMPEG, but maybe you don't use that library in this app.

TimBurik · 2025-02-17T16:39:58Z

@lateralusX we got the result of our test with disabled crash reporters, and according to GooglePlay Console the total amount of ANRs and the amount of user-perceived ANRs has been halved comparing to previous release! It seems to confirm the theory that overriding signal handlers is the root cause of the ANRs. But also, interestingly enough, it did not eliminate suspend_signal_handler-related ANR groups entirely, but significantly reduced the amount of such reports. Here are couple of symbolicated stack traces just in case:

ANR_suspend_signal_handler_full1.txt
ANR_suspend_signal_handler_full2.txt

Although disabling crash reporters does reduce the amount of ANRs, it leaves us with very limited options for monitoring app health, especially in terms of native crashes. Is there a way to monitor native crashes without interfering with dotnet's signal handlers? In Xamarin, we used to call Mono.Runtime.RemoveSignalHandlers()/Mono.Runtime.InstallSignalHandlers() around initialization of the Breakpad, this is probably why Xamarin is not affected by using native reporter. Is there an alternative to this API in dotnet?
There's also interesting discussion about handlers and using third-party crash reporters here: #9055 (comment)

TimBurik · 2025-02-21T15:47:36Z

UPD: after a longer observation we see that ANR rate has increased and stabilized on the value, very similar to the value without disabling crash reporting. At the moment it is ~20% improvement, instead of ~50% we have observed initially.
Taking into consideration the fact that we also don't see any ANR groups disappeared after disabling signal handler manipulations, it doesn't seems like it helps with the root cause of those ANRs.

Still though, if you have any useful tips or insights for signal handlers usage and native crash monitoring - we could test them out.

lateralusX · 2025-03-14T12:15:28Z

@TimBurik

Re: "In Xamarin, we used to call Mono.Runtime.RemoveSignalHandlers()/Mono.Runtime.InstallSignalHandlers() around initialization of the Breakpad, this is probably why Xamarin is not affected by using native reporter. Is there an alternative to this API in dotnet?"

That part was never ported over to dotnet/runtime, see this for more discussions, dotnet/runtime#79706 and then there is a proposal here dotnet/runtime#101560 to get hold of unhandled exceptions and fatal errors without relying on overriding signals. Issue with that is that it needs 3'rd party libraries to support it and it won't handle all signals that runtime needs control of.

The underlying implementation of the InstallSignalHandlers/RemoveSignalHandlers in mono/mono can be found here, https://github.com/mono/mono/blob/0f53e9e151d92944cacab3e24ac359410c606df6/mono/mini/mini-posix.c#L451 and here, https://github.com/mono/mono/blob/0f53e9e151d92944cacab3e24ac359410c606df6/mono/mini/mini-posix.c#L509.

For dotnet/runtime mono what is most interesting in light of this issue are the signal handler's setup to handle thread suspend/resume, https://github.com/dotnet/runtime/blob/a9b700a1f6f3f524595393433bda6115b39afd71/src/mono/mono/utils/mono-threads-posix-signals.c#L227.

For the thread specific signals, we won't chain them since we assume we have exclusive access to those signals. An unsupported, none portable way is to at least make sure the signals installed in mono-threads-posix-signals.c are still around after all libraries that could affect, or hijack signals have been initialized. The process signal mask must also be intact meaning none of the mono specific thread signals can be blocked.

lateralusX · 2025-03-14T15:42:32Z

@TimBurik I also looked at your latest ANR dumps:

ANR_suspend_signal_handler_full1.txt:

This one seems to be truncated so there are threads cut out but I it still looks like we hit a rather interesting scenario. Main thread does some work that will hold loader lock, taken in mono_class_vtable_checked. Then there is "Thread-4" that runs finalizers and does a stackwalk, but it does a suspend and run to do the stackwalk, so I assume it has suspended Main thread while holding the loader lock. Then as part of the stackwalk it hits a case where it needs to create a class instance, that needs the loader lock, leading to a deadlock.

The only way to request a thread dump that will trigger the stackwalk is to explicit call mono_threads_request_thread_dump API. In mono that happens only in the SIGQUIT handler, so that signal seems to been fired and this is a side effect leading up to the deadlock. I can probably fix this issue by making sure the stackwalk done in response to mono_threads_request_thread_dump is async safe since its designed to be called from within signal handlers, so it expects to only do signal safe operations, meaning it won't try to load additional classes if not already loaded etc. Since this is a side effect of a SIGQUIT signal, it will of course not tackle why that happens, but in the dump no other threads seems to be doing anything strange.

ANR_suspend_signal_handler_full2.txt:

This one is slightly different, but it ends up in similar deadlock. In this case we have thread "AndroidAnrWatchDog" that calls into Mono runtime to get a type name, that leads up to a SIGSEGV (probably due to a heap corruption), that will call our SIGSEGV handler that does a native crash, triggering a full stackwalk of crashing thread. In parallel there is code triggering a GC, "Thread-14", it holds GC lock and wait for threads to suspend, another thread "GLThread 1071" gets suspended while holding loader lock and the thread handling the SIGSEGV is about stackwalk, but doesn't do it async safe, leading up to a deadlock on loader lock, so it deadlocks inside the signal handler, meaning that the suspend signal also send to this thread will wait, meaning that the GC thread will wait forever on this thread.

Making sure we do async safe stackwalk from within crash handler would fix this chain of events, but it won't fix the root cause that is probably a heap corruption, but at least it should make it into a crash report intsead of an ANR.

With these two dumps I have a couple of things to look at, it won't solve the underlying triggers (SIGQUIT and heap corruption) but it will at least make sure we would handle this complex error reporting scenarios better.

Mono's stack-walker can be signal/async safe when needed, making sure it won't allocate additional memory or taking internal runtime locks. It is however up to the caller of the stack walking API's to decide if it should be signal and/or async safe. Currently this is controlled using two different flags, the MONO_UNWIND_SIGNAL_SAFE as well as mono_thread_info_is_async_context. This is problematic since callers wants signal safe stack-walking but since not both are set, it will not behave fully signal safe. dotnet/android#9365 hit a couple of scenarios described here: dotnet/android#9365 (comment) that ends up deadlocking due to the fact that they did stack-walking from within a signal handler and deadlocked or dumping stack on suspended thread holding runtime loader lock, but without making the stack-walk async safe. Fix makes sure that calls to stack-walk API's can be made signal and/or async safe and that identified areas uses the correct set of flags given state of threads when stack-walking.

lateralusX · 2025-03-18T12:32:28Z

Following PR should at least fix deadlocks in above ANR's, dotnet/runtime#113645.

lateralusX · 2025-03-18T12:45:35Z

@TimBurik I looked back in history of this issue and realized that you had a local repro of the issue? Is that correct? If so, would it be possible for you to build a local runtime with some additional instrumentation (logging additional data into logcat) and use that together with your repro app? That way we could potentially get an idea around what thread that doesn't seem to suspend and then we could map that with the ANR to get what thread(s) we are waiting on. We could first try this with the original hybrid suspend mode and then potentially with preemptive suspend (with above fix) for even more additional information.

TimBurik · 2025-03-19T14:35:55Z

@lateralusX Thank you very much for taking your time to look into the dumps!

I can probably fix this issue by making sure the stackwalk done in response to mono_threads_request_thread_dump is async safe since its designed to be called from within signal handlers, so it expects to only do signal safe operations, meaning it won't try to load additional classes if not already loaded etc.

This is actually sounds great! We were suspecting that some of those ANRs are happening while processing crashes (because of the high amount of "background" ANRs in GooglePlay Console). We would very much prefer those crashes reported as crashes, and not as ANRs, as we currently have disproportionally high ANR rate comparing to the crash rate in production.

In this case we have thread "AndroidAnrWatchDog" that calls into Mono runtime to get a type name, that leads up to a SIGSEGV (probably due to a heap corruption), that will call our SIGSEGV handler that does a native crash, triggering a full stackwalk of crashing thread.

This one seems to be actually related to another ticket of ours related to crashes: #9563 It seems that our custom ANR monitor implementation ("AndroidAnrWatchDog" thread) was hitting some kind of infinite loop while collecting the context, requesting type name information over and over and over again, probably causing heap corruption in the end. Either way, after disabling this functionality the crash group has gone, and related ANRs probably as well.

I looked back in history of this issue and realized that you had a local repro of the issue? Is that correct?

Yes, we do have a somewhat stable reproduction of the GC deadlock with a precise configuration:

<AndroidEnableSGenConcurrent> set to false;
MONO_THREADS_SUSPEND unset ( set to default "hybrid"?). When set to "preemptive" deadlock cannot be reproduced with the same steps;
it seems to be reproduced only on Samsung phones, although we haven't tested this theory much;

would it be possible for you to build a local runtime with some additional instrumentation (logging additional data into logcat) and use that together with your repro app?

Sure! Do you have any specific ideas in mind of what instrumentation we could add/turn-on in the runtime? We would be able to run the test some time next week.

lateralusX · 2025-03-19T17:03:51Z

@TimBurik Sounds great. Let me write up some instructions during the week and share. Are you just using an Android SDK app (dotnet new android) or MAUI (dotnet new maui)? You will need to do a local build of the runtime enabling some additional logging in code, then you will use that version of the runtime when building your app running the reproduction step. The additional logging will hit logcat, so we will need to get some data out of logcat once the ANR reproduces, it will tell us more details on the threads that we are waiting on as part of the STW (Stop The World). When we have that data we can look in the ANR to detect what thread seems to be in wrong state and why it won't seem to reach a safe point in a timely manner.

TimBurik added Area: App Runtime Issues in `libmonodroid.so`. needs-triage Issues that need to be assigned. labels Oct 3, 2024

TimBurik assigned grendello Oct 3, 2024

grendello added Area: Mono Runtime Mono-related issues: BCL bugs, AOT issues, etc. and removed Area: App Runtime Issues in `libmonodroid.so`. needs-triage Issues that need to be assigned. labels Oct 4, 2024

grendello assigned vitek-karas Oct 22, 2024

vitek-karas assigned BrzVlad Oct 22, 2024

lateralusX mentioned this issue Feb 4, 2025

Android Not Responding with [split_config.arm64_v8a.apk!libmonosgen-2.0.so] mono_threads_attach_coop dotnet/runtime#111485

Open

lateralusX mentioned this issue Mar 18, 2025

[Mono]: Additional support for signal async safe stack-walks. dotnet/runtime#113645

Open

Increased amount of ANRs after disabling concurrent GC #9365

Increased amount of ANRs after disabling concurrent GC #9365

Comments

TimBurik commented Oct 3, 2024 • edited Loading

Android framework version

Affected platform version

Description

Steps to Reproduce

Did you find any workaround?

Relevant log output

grendello commented Oct 4, 2024

TimBurik commented Oct 4, 2024

grendello commented Oct 4, 2024

TimBurik commented Oct 4, 2024 • edited Loading

grendello commented Oct 4, 2024

grendello commented Oct 4, 2024

supervacuus commented Oct 4, 2024

grendello commented Oct 4, 2024

lambdageek commented Oct 4, 2024

TimBurik commented Oct 22, 2024

BrzVlad commented Oct 22, 2024

TimBurik commented Oct 22, 2024

grendello commented Oct 22, 2024

TimBurik commented Oct 22, 2024 • edited Loading

TimBurik commented Oct 23, 2024

BrzVlad commented Oct 23, 2024

TimBurik commented Nov 7, 2024

BrzVlad commented Nov 7, 2024

TimBurik commented Nov 7, 2024

TimBurik commented Nov 8, 2024

BrzVlad commented Nov 12, 2024

TimBurik commented Nov 12, 2024

BrzVlad commented Nov 12, 2024

TimBurik commented Nov 27, 2024

lateralusX commented Feb 12, 2025

TimBurik commented Feb 13, 2025

lateralusX commented Feb 13, 2025 • edited Loading

TimBurik commented Feb 17, 2025

TimBurik commented Feb 21, 2025

lateralusX commented Mar 14, 2025 • edited Loading

lateralusX commented Mar 14, 2025 • edited Loading

lateralusX commented Mar 18, 2025

lateralusX commented Mar 18, 2025 • edited Loading

TimBurik commented Mar 19, 2025

lateralusX commented Mar 19, 2025 • edited Loading

TimBurik commented Oct 3, 2024 •

edited

Loading

TimBurik commented Oct 4, 2024 •

edited

Loading

TimBurik commented Oct 22, 2024 •

edited

Loading

lateralusX commented Feb 13, 2025 •

edited

Loading

lateralusX commented Mar 14, 2025 •

edited

Loading

lateralusX commented Mar 14, 2025 •

edited

Loading

lateralusX commented Mar 18, 2025 •

edited

Loading

lateralusX commented Mar 19, 2025 •

edited

Loading