Skip to content

Conversation

@codepitbull
Copy link
Contributor

Motivation

Resolves #38363

Changes

Under certai nconditions a bridge would not resume after losing its connection.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical bridge stall issue where MQTT bridges would fail to resume message forwarding after connection failures. The root cause involved inflight message markers not being properly cleared during reconnections, causing messages to be permanently stuck in persistence queues.

Key changes:

  • Introduces separate handling for initial connections vs reconnections to avoid duplicate message delivery
  • Adds removeAllInFlightMarkers method to persistence layer to reset stuck messages
  • Implements new callbacks (ResetAllInflightMarkersCallback, OnReconnectCallback) to coordinate reconnection state
  • Refactors drainQueue() to properly clear all stale state and reset inflight markers

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
legacy-modbus-adapter-minimal-config.xml Removed unused test resource file
legacy-modbus-adapter-full-config.xml Removed unused test resource file
ClientQueueMemoryLocalPersistence.java Implements new method to remove all inflight markers for a queue
ClientQueuePersistenceImpl.java Adds persistence layer method to reset all inflight markers and notify waiting consumers
ClientQueuePersistence.java Extends interface with removeAllInFlightMarkers method
ClientQueueLocalPersistence.java Extends interface with removeAllInFlightMarkers method
RemoteMqttForwarder.java Major refactoring to handle reconnection scenarios, distinguish initial connection from reconnection, and properly reset state
BridgeMqttClient.java Adds logic to distinguish initial connection from reconnection using everConnected flag
MqttForwarder.java Extends interface with new methods for flushing buffered messages and handling reconnection
MessageForwarderImpl.java Implements callbacks to reset inflight markers and trigger message polling after reconnection
FINDINGS_BRIDGE.md Documents root cause analysis of the bridge stall issue

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions
Copy link

github-actions bot commented Dec 8, 2025

Test Results

  483 files  ±0    483 suites  ±0   6m 15s ⏱️ +23s
4 053 tests ±0  4 050 ✅ ±0  3 💤 ±0  0 ❌ ±0 
4 070 runs  ±0  4 067 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 672d008. ± Comparison against base commit eabcb94.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Dec 8, 2025

Coverage Report

Overall Project 65.29% -0.35%
Files changed 13.97%

File Coverage
ClientQueueMemoryLocalPersistence.java 89.23% -2.27%
RemoteMqttForwarder.java 70.17% -16.82%
ClientQueuePersistenceImpl.java 52.99% -2.45%
MessageForwarderImpl.java 9.56% -23.77%
BridgeMqttClient.java 0.3% -3.89%

Copy link
Member

@caoccao caoccao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hold one concern.

try {
// Wait for all inflight markers to be reset before returning
// This ensures onReconnect() will see clean queues when it triggers checkBuffers()
Futures.allAsList(futuresBuilder.build()).get(30, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the 30 secs could be configurable, or a constant.
Also, on timeout, i.e. when the markers are not removed, we still do onReconnect but the markets would not be reset, they would be stuck. Might this be a potential issue?, should the wait be longer? should the exception be propagated as a RT and prevent the reconnect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned it into a constant.
Not really want to make it configurable.
The value is deliberately high. Normally the operation should finish in <100ms. If we ever reach the timeout the IO-layer is in trouble. And that's where the reconnect logic I added kicks in.

}
final ImmutableList.Builder<ListenableFuture<Void>> futuresBuilder = ImmutableList.builder();
for (final String queueIdToReset : forwarderQueueIds) {
futuresBuilder.add(queuePersistence.get().removeAllInFlightMarkers(queueIdToReset));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if removeAllInFlightMarkers() is interrupted by Futures.allAsList(futuresBuilder.build()).get(30, TimeUnit.SECONDS)? Is there a rollback logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally the operation should finish in <100ms. If we ever reach the timeout the IO-layer is in trouble. And that's where the reconnect logic I added in the last commit kicks in.

@caoccao
Copy link
Member

caoccao commented Dec 10, 2025

Better rebase to latest master. This branch has too much noise now.

@sonarqubecloud
Copy link

codepitbull and others added 10 commits December 10, 2025 11:29
  1. forceReconnect() disconnects the MQTT client
  2. BridgeMqttClient's addDisconnectedListener() detects the disconnect
  3. Auto-reconnect kicks in with exponential backoff (1s min, 2min max)
  4. On reconnect, addConnectedListener() fires
  5. This calls drainQueue() → resetAllInflightMarkersCallback again
  6. If persistence is now responsive, markers are cleared and messages flow
Copy link
Member

@caoccao caoccao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codepitbull codepitbull merged commit 1cccaeb into master Dec 11, 2025
6 of 7 checks passed
@codepitbull codepitbull deleted the fix/38363-bridge-stall branch December 11, 2025 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants