Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a rare CME in LatencyUtils #1984

Open
wants to merge 1 commit into
base: 2.0
Choose a base branch
from

Conversation

Axionize
Copy link
Contributor

@Axionize Axionize commented Jan 30, 2025

Problem

The original implementation of handleNettySyncTransaction could throw a ConcurrentModificationException if a runnable task modified the transactionMap (either directly or by calling addRealTimeTask). This was observed on Folia servers but could theoretically occur in any environment.

Solution

Refactored handleNettySyncTransaction to use a two-pass approach:

  1. First pass: Collect tasks and remove them from transactionMap using the iterator
  2. Second pass: Execute the collected tasks

Added a class-level ArrayList<Runnable> to store tasks between passes, which is cleared and reused each call to avoid unnecessary allocations.

Implementation Details

  • Added private final ArrayList<Runnable> tasksToRun field
  • Split the original single loop into collection and execution phases
  • Maintained synchronization around both phases to preserve thread safety
  • Added detailed comments explaining the approach and rationale
  • Chose not to use a ConcurrnetLinkedQueue because iteration may not reflect state of data structure and working to ensure this is always correct for our use case was too complicated

Performance Considerations

  • Using ArrayList for tasksToRun as it only needs sequential adds and iteration
  • Keeping it as a class field to reuse the backing array
  • Maintained LinkedList for transactionMap due to its frequent size changes and removal patterns

Testing

  • Verified on local server
  • Seems to be working, used in prod by lifestealsmp.com who originally reported the issues and hasn't happened again.

This change makes the code more robust by preventing CMEs rather than relying on documentation to prevent them.

[14:31:50 WARN]: [ac.grim.grimac.shaded.com.github.retrooper.packetevents.PacketEventsAPI] PacketEvents caught an unhandled exception while calling your listener.
java.util.ConcurrentModificationException: null
        at java.base/java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:978) ~[?:?]
        at java.base/java.util.LinkedList$ListItr.remove(LinkedList.java:933) ~[?:?]
        at grimac-733.jar/ac.grim.grimac.utils.latency.LatencyUtils.handleNettySyncTransaction(LatencyUtils.java:62) ~[grimac-733.jar:?]
        at grimac-733.jar/ac.grim.grimac.player.GrimPlayer.addTransactionResponse(GrimPlayer.java:381) ~[grimac-733.jar:?]
        at grimac-733.jar/ac.grim.grimac.events.packets.PacketPingListener.onPacketReceive(PacketPingListener.java:62) ~[grimac-733.jar:?]
        at grimac-733.jar/ac.grim.grimac.shaded.com.github.retrooper.packetevents.event.PacketReceiveEvent.call(PacketReceiveEvent.java:44) ~[grimac-733.jar:?]
        at grimac-733.jar/ac.grim.grimac.shaded.com.github.retrooper.packetevents.event.EventManager.callEvent(EventManager.java:84) ~[grimac-733.jar:?]
        at grimac-733.jar/ac.grim.grimac.shaded.com.github.retrooper.packetevents.util.PacketEventsImplHelper.handleServerBoundPacket(PacketEventsImplHelper.java:101) ~[grimac-733.jar:?]
        at grimac-733.jar/ac.grim.grimac.shaded.io.github.retrooper.packetevents.injector.handlers.PacketEventsDecoder.read(PacketEventsDecoder.java:57) ~[grimac-733.jar:?]
        at grimac-733.jar/ac.grim.grimac.shaded.io.github.retrooper.packetevents.injector.handlers.PacketEventsDecoder.decode(PacketEventsDecoder.java:64) ~[grimac-733.jar:?]
        at grimac-733.jar/ac.grim.grimac.shaded.io.github.retrooper.packetevents.injector.handlers.PacketEventsDecoder.decode(PacketEventsDecoder.java:41) ~[grimac-733.jar:?]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88) ~[netty-codec-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[netty-codec-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.handler.flow.FlowControlHandler.dequeue(FlowControlHandler.java:202) ~[netty-handler-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.handler.flow.FlowControlHandler.channelRead(FlowControlHandler.java:164) ~[netty-handler-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) ~[netty-handler-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.handler.flush.FlushConsolidationHandler.channelRead(FlushConsolidationHandler.java:152) ~[netty-handler-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) ~[netty-transport-classes-epoll-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509) ~[netty-transport-classes-epoll-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407) ~[netty-transport-classes-epoll-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.97.Final.jar:4.1.97.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.97.Final.jar:4.1.97.Final]
        at java.base/java.lang.Thread.run(Thread.java:1575) ~[?:?]

@Axionize Axionize added the status: review required The pull request is awaiting review label Jan 30, 2025
@SamB440
Copy link
Contributor

SamB440 commented Jan 30, 2025

But WHY does it occur on folia?

@SamB440
Copy link
Contributor

SamB440 commented Jan 30, 2025

I looked and I only see this method being called from a single thread so I don't understand how it's throwing a CME

@Axionize
Copy link
Contributor Author

But WHY does it occur on folia?

I have no clue all I know how to do is to make it stop happening. Quite frankly I don't really care why its happening I just want to fix it and move on.

@Pablete1234
Copy link

Pablete1234 commented Jan 30, 2025

We've seen this occur on a 1.8 modified paper server too a few days ago

I only see this method being called from a single thread so I don't understand how it's throwing a CME

If it was from different threads, the locks that are already in the code would prevent it. This issue has to be the same thread calling to add a transaction in some codepath, while iterating thru them.

@Axionize
Copy link
Contributor Author

Axionize commented Jan 30, 2025

We've seen this occur on a 1.8 modified paper server too a few days ago

I only see this method being called from a single thread so I don't understand how it's throwing a CME

If it was from different threads, the locks that are already in the code would prevent it. This issue has to be the same thread calling to add a transaction in some codepath, while iterating thru them.

It you trace all possible iterations of this code, this is impossible.

I looked and I only see this method being called from a single thread so I don't understand how it's throwing a CME

addRealTimeTask() executes runnables outside of the synchronized block. Its possible for another thread to call addRealTimeTask() and run a runnable on another thread that modifies the transactionMap or finishes executing after the lock has released in another thread or part of the code.

If you get really unlucky and modify transactionMap while its being iterated over in handleNettySyncTransaction() you can get a CME.

@Pablete1234
Copy link

image
It was happening basically non stop, but we haven't seen it happen again, so my guess is some task got stuck and it was impossible to cleanup until restart

Comment on lines +60 to +66
* Future considerations:
* If this becomes a performance bottleneck, we may revisit using a single-pass approach
* on non-Folia servers. We could also explore concurrent data structures or parallel
* execution, but this would lose the guarantee that transactions are processed in order.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative solution if this is a performance issue, is to do the queuing on the add method instead.

If any of the add task methods are called while we're iterating, add them to a list, and once the iteration is done add everything from that list to the transaction map.
If everything is working ideally the list will always be empty and be a non-issue, if something breaks the list may see some use

@Axionize Axionize force-pushed the feat/fix-Folia-LatencyUtils-CME branch from 59eb669 to 96a5e00 Compare February 3, 2025 15:20
@ManInMyVan
Copy link
Contributor

could using a ConcurrentLinkedDeque instead of a LinkedList fix this?

Index: src/main/java/ac/grim/grimac/utils/latency/LatencyUtils.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/main/java/ac/grim/grimac/utils/latency/LatencyUtils.java b/src/main/java/ac/grim/grimac/utils/latency/LatencyUtils.java
--- a/src/main/java/ac/grim/grimac/utils/latency/LatencyUtils.java	(revision e329171b113351ef91cfad0312031dbe7417b27f)
+++ b/src/main/java/ac/grim/grimac/utils/latency/LatencyUtils.java	(date 1738622032295)
@@ -7,11 +7,10 @@
 import ac.grim.grimac.utils.data.Pair;
 import com.github.retrooper.packetevents.netty.channel.ChannelHelper;
 
-import java.util.LinkedList;
-import java.util.ListIterator;
+import java.util.concurrent.ConcurrentLinkedDeque;
 
 public class LatencyUtils {
-    private final LinkedList<Pair<Integer, Runnable>> transactionMap = new LinkedList<>();
+    private final ConcurrentLinkedDeque<Pair<Integer, Runnable>> transactionMap = new ConcurrentLinkedDeque<>();
     private final GrimPlayer player;
 
     public LatencyUtils(GrimPlayer player) {
@@ -42,9 +41,7 @@
 
     public void handleNettySyncTransaction(int transaction) {
         synchronized (this) {
-            for (ListIterator<Pair<Integer, Runnable>> iterator = transactionMap.listIterator(); iterator.hasNext(); ) {
-                Pair<Integer, Runnable> pair = iterator.next();
-
+            for (Pair<Integer, Runnable> pair : transactionMap) {
                 // We are at most a tick ahead when running tasks based on transactions, meaning this is too far
                 if (transaction + 1 < pair.first())
                     return;
@@ -64,7 +61,7 @@
                     }
                 }
                 // We ran a task, remove it from the list
-                iterator.remove();
+                transactionMap.remove(pair);
             }
         }
     }

@Axionize
Copy link
Contributor Author

Axionize commented Feb 5, 2025

could using a ConcurrentLinkedDeque instead of a LinkedList fix this?

As I mentioned in my PR description:

  • Chose not to use a ConcurrnetLinkedQueue because iteration may not reflect state of data structure and working to ensure this is always correct for our use case was too complicated

ConcurrentLinkedQueue isn't magic unfortunately, its possible for iteration of the data structure to not reflect its underlying state which could lead to subtle bugs. For example if a runnable deletes an item in the list (not head or tail) it could lead to iteration skipping some runnables. Even if our own code doesn't do this we have no guarantee when we open up Grim with the check API that other checks won't trigger these issues (and that is something I'm unsure about).

CopyOnWriteArrayList or some variant of it might be what we're looking for but that would be even slower than what I'm already doing.

I would rather we fail obviously and hard than have to debug a stupid error 10 months from now with processing transactions no one can figure out.

synchronized (this) {
for (ListIterator<Pair<Integer, Runnable>> iterator = transactionMap.listIterator(); iterator.hasNext(); ) {
tasksToRun.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this happen first, and not last? This is preventing it being a local variable. Am I missing something?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could totally be a local variable, but by having it non-local you save on the memory allocation overhead and just use already allocated memory
at least that's my assumption of why its done this way

Copy link
Contributor Author

@Axionize Axionize Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a very minor optimization that saves you the cost of having to call new on every iteration (that is allocate and deallocate the ArrayList), in addition since its an ArrayList the underlying list will never be shrunk, meaning fewer O(n) resizes than if we used a local variable. If you log the size of the list you'll see that most of the time its 1 and occasionally it'll spike to 50 and every so often a few hundred.

In the case where a few hundred entries are added several resizes are required under the hood which costs time and allocations we save by clearing instead.

This was partially explained in my original PR post

Added a class-level ArrayList to store tasks between passes, which is cleared and reused each call to avoid unnecessary allocations.

  • Using ArrayList for tasksToRun as it only needs sequential adds and iteration
  • Keeping it as a class field to reuse the backing array

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: review required The pull request is awaiting review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants