@bowenlan-amzn
Member

@bowenlan-amzn bowenlan-amzn commented Sep 24, 2025

Description

This PR hardens the circuit breaker logic in the query result consumer. The only logic change is in one place: consumeResult now calls addEstimateAndMaybeBreak(aggsSize) before performing any actual consume logic.

Because this code is important but tangled, and will likely need further improvement, the PR also includes some refactoring.

Circuit Breaker Change

The circuit breaker's addEstimateBytesAndMaybeBreak is called in two places:

  1. Before consume: a query result is received at the coordinator's transport layer and handed to the query result consumer. We estimate the heap size of the query result and check whether it would trip the REQUEST circuit breaker.
  2. During tryExecuteNext: before running a partial reduce on the buffered query results, we estimate the extra heap it will use and check whether it would trip the REQUEST circuit breaker.
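
For reviewers who want the two check points in isolation, here is a minimal sketch. It is not the code in this PR: SketchBreaker and SketchConsumer are hypothetical stand-ins, the byte limit is arbitrary, and the real breaker throws its own exception type rather than IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a REQUEST circuit breaker; not the OpenSearch class. */
final class SketchBreaker {
    private final long limitBytes;
    private long usedBytes;

    SketchBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserves bytes, or throws without reserving if the limit would be exceeded. */
    synchronized void addEstimateBytesAndMaybeBreak(long bytes, String label) {
        if (usedBytes + bytes > limitBytes) {
            throw new IllegalStateException(
                "[" + label + "] would use " + (usedBytes + bytes) + " bytes, limit is " + limitBytes);
        }
        usedBytes += bytes;
    }

    synchronized long usedBytes() {
        return usedBytes;
    }
}

/** Hypothetical consumer showing the two check points described above. */
final class SketchConsumer {
    private final SketchBreaker breaker;
    private final List<Long> bufferedSizes = new ArrayList<>();

    SketchConsumer(SketchBreaker breaker) {
        this.breaker = breaker;
    }

    /** Check 1: account for the incoming result before any consume logic runs. */
    void consumeResult(long aggsSize) {
        breaker.addEstimateBytesAndMaybeBreak(aggsSize, "consume");
        bufferedSizes.add(aggsSize); // only reached if the breaker did not trip
    }

    /** Check 2: account for the extra heap a partial reduce needs before running it. */
    void partialReduce(long estimatedExtraBytes) {
        breaker.addEstimateBytesAndMaybeBreak(estimatedExtraBytes, "partial reduce");
        bufferedSizes.clear(); // placeholder for the real reduce work
    }

    int buffered() {
        return bufferedSizes.size();
    }
}

final class BreakerSketchDemo {
    public static void main(String[] args) {
        SketchBreaker breaker = new SketchBreaker(100);
        SketchConsumer consumer = new SketchConsumer(breaker);
        consumer.consumeResult(60);
        try {
            consumer.consumeResult(50); // 60 + 50 exceeds the limit of 100
        } catch (IllegalStateException e) {
            System.out.println("tripped: " + e.getMessage());
        }
    }
}
```

The point of checking first is that a result which would trip the breaker is rejected before it is buffered, so nothing has to be unwound in the consume path.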

Some context about the partial merge logic:

  • PendingMerges is the core logic that buffers and reduces shard results in batches, without waiting for all results to arrive.
  • consume buffers the shard result, checks whether the buffer size has reached the threshold, and if so creates a merge task from the buffer.
  • tryExecuteNext polls a merge task from the queue and performs the partial reduce.
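
The buffer, task, and poll flow can be sketched roughly as follows. This is a simplified, single-threaded illustration and not the PendingMerges implementation: BatchedReducerSketch is a hypothetical name, shard results are plain longs, and the "reduce" is just a sum.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical, single-threaded sketch of buffer-then-batch-reduce. */
final class BatchedReducerSketch {
    private final int batchSize;                          // buffer threshold that triggers a task
    private final List<Long> buffer = new ArrayList<>();  // shard results not yet in a task
    private final Queue<List<Long>> tasks = new ArrayDeque<>();
    private long partial;                                 // running partially reduced value
    private int reducePasses;                             // number of partial reduces performed

    BatchedReducerSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Buffers one shard result; once the buffer is full, turns it into a queued task. */
    void consume(long shardResult) {
        buffer.add(shardResult);
        if (buffer.size() >= batchSize) {
            tasks.add(new ArrayList<>(buffer));
            buffer.clear();
            tryExecuteNext();
        }
    }

    /** Polls one task from the queue, if any, and folds it into the partial result. */
    void tryExecuteNext() {
        List<Long> task = tasks.poll();
        if (task == null) {
            return;
        }
        for (long value : task) {
            partial += value;
        }
        reducePasses++;
    }

    /** Final reduce: drain remaining tasks, then fold in whatever is still buffered. */
    long reduce() {
        while (!tasks.isEmpty()) {
            tryExecuteNext();
        }
        for (long value : buffer) {
            partial += value;
        }
        buffer.clear();
        return partial;
    }

    int reducePasses() {
        return reducePasses;
    }
}
```

With a batch size of 3 and seven consumed results, two partial reduces run as results arrive and the final reduce only has one leftover result to fold in.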

Refactoring around Failure Handling

The rule I followed is to route every failure we catch through onMergeFailure, whether it is a circuit breaker exception, a task cancellation exception, or any other exception caught during the partial reduce.

onMergeFailure stores the failure, resets the circuit breaker, and clears the merge task queue. It also sends a cancel-search-task request to the other nodes so that they can make a best-effort attempt to stop processing the shard query.
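
A rough sketch of what a single failure path like this looks like. Only the name onMergeFailure comes from this PR; FailureHandlingSketch, the cancelSearchTask callback, and the reservedBytes counter are hypothetical, and keeping the first failure while ignoring repeat calls is an assumption of the sketch rather than a statement about the real method.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of a single failure path; not the code in this PR. */
final class FailureHandlingSketch {
    private final Queue<Runnable> reduceTasks = new ArrayDeque<>();
    private final Runnable cancelSearchTask; // stands in for the cancel request sent to other nodes
    private Exception failure;               // first failure seen, if any
    private long reservedBytes;              // stands in for bytes held against the breaker

    FailureHandlingSketch(Runnable cancelSearchTask) {
        this.cancelSearchTask = cancelSearchTask;
    }

    /** Queues a reduce task and records its estimated heap use. */
    synchronized void enqueue(Runnable task, long estimatedBytes) {
        if (failure != null) {
            return; // after a failure, new work is dropped
        }
        reservedBytes += estimatedBytes;
        reduceTasks.add(task);
    }

    /** Common handler for breaker trips, cancellations, and reduce errors. */
    synchronized void onMergeFailure(Exception e) {
        if (failure != null) {
            return; // assumption: keep the first failure and make repeat calls no-ops
        }
        failure = e;
        reservedBytes = 0;      // release everything that was reserved
        reduceTasks.clear();    // pending reduce tasks will never run
        cancelSearchTask.run(); // ask the rest of the search to stop, best effort
    }

    synchronized Exception failure() {
        return failure;
    }

    synchronized int pendingTasks() {
        return reduceTasks.size();
    }

    synchronized long reservedBytes() {
        return reservedBytes;
    }
}
```

Funnelling every failure through one method means the cleanup steps (store, release, clear, cancel) cannot drift apart between the breaker, cancellation, and reduce-error paths.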

Performance Implication

TermsReduceBenchmark shows no regression.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@bowenlan-amzn bowenlan-amzn force-pushed the result-consumer-harden branch 2 times, most recently from 3d88f2c to 5802e94 Compare September 24, 2025 14:04
@github-actions
Contributor

❌ Gradle check result for 5802e94: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@jainankitk
Contributor

@kaushalmahi12 - Can you help with review for this code change?

@github-actions
Contributor

✅ Gradle check result for 89f9dde: SUCCESS

@codecov

codecov bot commented Sep 25, 2025

Codecov Report

❌ Patch coverage is 83.46457% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.98%. Comparing base (983c4d7) to head (18cbfa0).
⚠️ Report is 2 commits behind head on main.

Files with missing lines                                 Patch %   Lines
...search/action/search/QueryPhaseResultConsumer.java    83.33%    10 Missing and 11 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19396      +/-   ##
============================================
- Coverage     73.08%   72.98%   -0.11%     
+ Complexity    70491    70382     -109     
============================================
  Files          5712     5712              
  Lines        322762   322754       -8     
  Branches      46743    46744       +1     
============================================
- Hits         235879   235549     -330     
- Misses        67941    68207     +266     
- Partials      18942    18998      +56     

@bowenlan-amzn bowenlan-amzn marked this pull request as ready for review September 25, 2025 08:47
@bowenlan-amzn bowenlan-amzn requested a review from a team as a code owner September 25, 2025 08:47
@github-actions
Contributor

❌ Gradle check result for ecc930c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

@kaushalmahi12 kaushalmahi12 left a comment

Thanks @bowenlan-amzn for improving this nightmarish class. Can you also add a diagram showing how the existing class works and what we are improving in it? A diagram will definitely help reviewers get the context quickly.

@github-actions
Contributor

❕ Gradle check result for 92c0bd9: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@bowenlan-amzn bowenlan-amzn force-pushed the result-consumer-harden branch from 92c0bd9 to f21f79d Compare October 1, 2025 00:06
@github-actions
Contributor

github-actions bot commented Oct 1, 2025

❕ Gradle check result for f21f79d: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@github-actions
Contributor

github-actions bot commented Oct 1, 2025

❕ Gradle check result for 01aa6e3: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: bowenlan-amzn <[email protected]>
SearchPhaseControllerTests ran thousands of times without failure

Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
it's not idempotent and may be called from an unsynchronized path such as tryExecuteNext

Signed-off-by: bowenlan-amzn <[email protected]>
@bowenlan-amzn bowenlan-amzn force-pushed the result-consumer-harden branch from 01aa6e3 to 18cbfa0 Compare October 2, 2025 11:00
@bowenlan-amzn
Member Author

@kaushalmahi12 thanks for the review!
Pushed a commit that renames "merge" to "reduce" and improves the Javadoc: 18cbfa0 (#19396)

@github-actions
Contributor

github-actions bot commented Oct 2, 2025

❌ Gradle check result for 18cbfa0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

github-actions bot commented Oct 2, 2025

✅ Gradle check result for 18cbfa0: SUCCESS

@jainankitk jainankitk added the backport 3.3 Backport to 3.3 branch label Oct 2, 2025
@jainankitk jainankitk merged commit 7dfe238 into opensearch-project:main Oct 2, 2025
65 of 67 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap Oct 2, 2025
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 2, 2025
…onsumer (#19396)

Signed-off-by: bowenlan-amzn <[email protected]>
(cherry picked from commit 7dfe238)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
rishabhmaurya pushed a commit that referenced this pull request Oct 2, 2025
…onsumer (#19396) (#19511)

(cherry picked from commit 7dfe238)

Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@bowenlan-amzn bowenlan-amzn deleted the result-consumer-harden branch October 3, 2025 01:04
peteralfonsi pushed a commit to peteralfonsi/OpenSearch that referenced this pull request Oct 15, 2025
Labels

backport 3.3 Backport to 3.3 branch v3.3.0

Projects

Status: Done

3 participants