Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ANRs in SyncService.start/stop #4150

Open
jmartinesp opened this issue Jan 15, 2025 · 0 comments
Open

ANRs in SyncService.start/stop #4150

jmartinesp opened this issue Jan 15, 2025 · 0 comments
Labels
A-Performance O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Defect Something isn't working: bugs, crashes, hangs and other reported problems

Comments

@jmartinesp
Copy link
Member

We recently became aware of an abnormal growth of ANR (application not responding) issues, and the most common one seems to be one with the next stacktrace:

Stacktrace

  #00  pc 0x000000000004dc5c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
  #01  pc 0x000000000381f610  /data/app/~~kNW33dJeKDR24qAYlClfVw==/io.element.android.x-PLPFGHZP_N-S7DxV4AkKdQ==/split_config.arm64_v8a.apk
  #02  pc 0x00000000028a5f00  /data/app/~~kNW33dJeKDR24qAYlClfVw==/io.element.android.x-PLPFGHZP_N-S7DxV4AkKdQ==/split_config.arm64_v8a.apk
  #03  pc 0x0000000001b2897c  /data/app/~~kNW33dJeKDR24qAYlClfVw==/io.element.android.x-PLPFGHZP_N-S7DxV4AkKdQ==/split_config.arm64_v8a.apk (ffi_matrix_sdk_ffi_rust_future_poll_u32+124)
  #04  pc 0x000000000001404c  /data/app/~~kNW33dJeKDR24qAYlClfVw==/io.element.android.x-PLPFGHZP_N-S7DxV4AkKdQ==/split_config.arm64_v8a.apk (BuildId: ddeb5c7dcafc8a4ed4ba180f840e1967d8207a8b)
  #05  pc 0x0000000000010a78  /data/app/~~kNW33dJeKDR24qAYlClfVw==/io.element.android.x-PLPFGHZP_N-S7DxV4AkKdQ==/split_config.arm64_v8a.apk (BuildId: ddeb5c7dcafc8a4ed4ba180f840e1967d8207a8b)
  #06  pc 0x00000000000063c8  /data/app/~~kNW33dJeKDR24qAYlClfVw==/io.element.android.x-PLPFGHZP_N-S7DxV4AkKdQ==/split_config.arm64_v8a.apk (BuildId: ddeb5c7dcafc8a4ed4ba180f840e1967d8207a8b)
  #07  pc 0x00000000000082dc  /data/app/~~kNW33dJeKDR24qAYlClfVw==/io.element.android.x-PLPFGHZP_N-S7DxV4AkKdQ==/split_config.arm64_v8a.apk (Java_com_sun_jna_Native_invokeVoid+32) (BuildId: ddeb5c7dcafc8a4ed4ba180f840e1967d8207a8b)
  at com.sun.jna.Native.invokeVoid (Native.java)
  at com.sun.jna.Function.invoke (Function.java:418)
  at com.sun.jna.Function.invoke (Function.java:364)
  at com.sun.jna.Library$Handler.invoke (Library.java:270)
  at java.lang.reflect.Proxy.invoke (Proxy.java:1006)
  at org.matrix.rustcomponents.sdk.UniffiLib.ffi_matrix_sdk_ffi_rust_future_poll_void (matrix_sdk_ffi.kt)
  at org.matrix.rustcomponents.sdk.SyncService$start$3.invoke (SyncService.java:20015)
  at org.matrix.rustcomponents.sdk.SyncService$start$3.invoke (SyncService.java:20008)
  at org.matrix.rustcomponents.sdk.Matrix_sdk_ffiKt.uniffiRustCallAsync (matrix_sdk_ffi.kt:5532)
  at org.matrix.rustcomponents.sdk.Matrix_sdk_ffiKt$uniffiRustCallAsync$1.invokeSuspend (matrix_sdk_ffi.kt:17)
  at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith (ContinuationImpl.kt:33)
  at kotlinx.coroutines.DispatchedTaskKt.resume (DispatchedTask.kt:221)
  at kotlinx.coroutines.CancellableContinuationImpl.cancelCompletedResult$kotlinx_coroutines_core (CancellableContinuationImpl.kt:177)
  at kotlinx.coroutines.DispatchedTaskKt.resumeUnconfined (DispatchedTask.kt:177)
  at kotlinx.coroutines.DispatchedTaskKt.dispatch (DispatchedTask.kt:149)
  at kotlinx.coroutines.CancellableContinuationImpl.dispatchResume (CancellableContinuationImpl.kt:470)
  at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core (CancellableContinuationImpl.kt:504)
  at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core$default (CancellableContinuationImpl.kt:493)
  at kotlinx.coroutines.CancellableContinuationImpl.resumeWith (CancellableContinuationImpl.kt:359)
  at org.matrix.rustcomponents.sdk.uniffiRustFutureContinuationCallbackImpl.callback (matrix_sdk_ffi.kt:5517)
  at java.lang.reflect.Method.invoke (Native method)
  at com.sun.jna.CallbackReference$DefaultCallbackProxy.invokeCallback (CallbackReference.java:585)
  at com.sun.jna.CallbackReference$DefaultCallbackProxy.callback (CallbackReference.java:613)
  at com.sun.jna.Native.invokeVoid (Native.java)
  at com.sun.jna.Function.invoke (Function.java:418)
  at com.sun.jna.Function.invoke (Function.java:364)
  at com.sun.jna.Library$Handler.invoke (Library.java:270)
  at java.lang.reflect.Proxy.invoke (Proxy.java:1006)
  at org.matrix.rustcomponents.sdk.UniffiLib.ffi_matrix_sdk_ffi_rust_future_poll_void (matrix_sdk_ffi.kt)
  at org.matrix.rustcomponents.sdk.SyncService$stop$3.invoke (SyncService.java:20049)
  at org.matrix.rustcomponents.sdk.SyncService$stop$3.invoke (SyncService.java:20042)
  at org.matrix.rustcomponents.sdk.Matrix_sdk_ffiKt.uniffiRustCallAsync (matrix_sdk_ffi.kt:5532)
  at org.matrix.rustcomponents.sdk.Matrix_sdk_ffiKt$uniffiRustCallAsync$1.invokeSuspend (matrix_sdk_ffi.kt:17)
  at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith (ContinuationImpl.kt:33)
  at kotlinx.coroutines.DispatchedTask.run (DispatchedTask.kt:101)
  at android.os.Handler.handleCallback (Handler.java:938)
  at android.os.Handler.dispatchMessage (Handler.java:99)
  at android.os.Looper.loopOnce (Looper.java:346)
  at android.os.Looper.loop (Looper.java:475)
  at android.app.ActivityThread.main (ActivityThread.java:7889)
  at java.lang.reflect.Method.invoke (Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run (RuntimeInit.java:548)
  at com.android.internal.os.ZygoteInit.main (ZygoteInit.java:1009)

Which points to a SyncService.stop() call being shortly followed by a SyncService.start() one. This only happens at 3 points in the app:

  • In LoggedInFlowNode.observeSyncStateAndNetworkStatus(), which will stat the sync service both when the main activity of the app starts (think of it as the main navigation flow) and when the network is available if the service wasn't already running. The service will be stopped when the activity is no longer active too (as in, visible to the user).
  • In CallScreenPresenter.HandleMatrixClientSyncState(): the Element Call screen runs on a different activity so the callbacks in LoggedInFlowNode won't start the sync service when needed and we need to manually do it here. When the call is finished and the screen removed, the service will be stopped.
  • In SyncOnNotifiableEvent.startSyncIfNeeded()/stopSyncIfNeeded(): when we receive a push notification we start an opportunistic sync, then wait until either the user joins the call if it was a ringing call notification or we receive the event related to the notification in any other case, or a timeout happens, then stops the service again unless the app is now visible again.

The implementation of the flows above aren't perfect, and we may be calling SyncService.start/stop more than needed, which may be one of the root causes of this issue... centralising all these calls in a single component would probably make more sense.

The calls to these methods are all done in a separate coroutine, but all of them running on the Main dispatcher, which means if the underlying thread gets blocked somehow, the whole app freezes since it's the same thread that runs the UI. While this shouldn't happen because UniFFI seems to translate async methods to suspending functions, it seems to be happening somehow.

I wasn't able to reproduce this issue without a little help: I had to artificially add some delay inside SyncService.start() on the SDK so the app froze when I foregrounded/backgrounded it several times, but I imagine the real env would be similar. With these changes to force more frequent app freezes, the issue could be mitigated by running the SyncService.stop() calls in a background dispatcher (surprisingly, doing the same for SyncService.start() had no effect AFAICT): I'm not sure if this meant we now had a blocked background thread or if by being on background this meant the apparent deadlock we saw before just couldn't happen.

As for what might have caused this, the issue apparently started at v0.7.6, and the related SDK changes are these if I'm not mistaken. Nothing seems really related to this other than this PR in EXA but that doesn't seem to be the cause, so maybe it's something coming from a UniFFI internal change.

@jmartinesp jmartinesp added A-Performance O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Defect Something isn't working: bugs, crashes, hangs and other reported problems labels Jan 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Performance O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Defect Something isn't working: bugs, crashes, hangs and other reported problems
Projects
None yet
Development

No branches or pull requests

1 participant