-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Harden the circuit breaker and failure handle logic in query result consumer #19396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Harden the circuit breaker and failure handle logic in query result consumer #19396
Conversation
3d88f2c to
5802e94
Compare
|
❌ Gradle check result for 5802e94: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
@kaushalmahi12 - Can you help with review for this code change? |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #19396 +/- ##
============================================
- Coverage 73.08% 72.98% -0.11%
+ Complexity 70491 70382 -109
============================================
Files 5712 5712
Lines 322762 322754 -8
Branches 46743 46744 +1
============================================
- Hits 235879 235549 -330
- Misses 67941 68207 +266
- Partials 18942 18998 +56 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
7c613a4 to
0f48443
Compare
|
❌ Gradle check result for ecc930c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bowenlan-amzn for improving this nightmarish class. Can you also add a diagram how does the existing class works and what are we improving in that. Diagram will definitely help the reviewers to quickly get the context.
server/src/main/java/org/opensearch/action/search/QueryPhaseResultConsumer.java
Show resolved
Hide resolved
server/src/main/java/org/opensearch/action/search/QueryPhaseResultConsumer.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/action/search/QueryPhaseResultConsumer.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/action/search/QueryPhaseResultConsumer.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/action/search/QueryPhaseResultConsumer.java
Show resolved
Hide resolved
server/src/main/java/org/opensearch/action/search/QueryPhaseResultConsumer.java
Outdated
Show resolved
Hide resolved
ecc930c to
92c0bd9
Compare
|
❕ Gradle check result for 92c0bd9: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
92c0bd9 to
f21f79d
Compare
|
❕ Gradle check result for f21f79d: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
server/src/main/java/org/opensearch/action/search/QueryPhaseResultConsumer.java
Show resolved
Hide resolved
server/src/main/java/org/opensearch/action/search/QueryPhaseResultConsumer.java
Outdated
Show resolved
Hide resolved
|
❕ Gradle check result for 01aa6e3: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
…onsumer Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
SearchPhaseControllerTests run thousands times w/o failure Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
it's not idempotent and may be called by non-synchronized thread like tryExecuteNext Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
01aa6e3 to
18cbfa0
Compare
|
@kaushalmahi12 thanks for the review! |
|
❌ Gradle check result for 18cbfa0: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
…onsumer (#19396) Signed-off-by: bowenlan-amzn <[email protected]> (cherry picked from commit 7dfe238) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…onsumer (#19396) (#19511) (cherry picked from commit 7dfe238) Signed-off-by: bowenlan-amzn <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…onsumer (opensearch-project#19396) Signed-off-by: bowenlan-amzn <[email protected]>
Description
This PR is meant for harden the circuit breaker logic in query result consumer. The main logic change is only at one place,
addEstimateAndMaybeBreak(aggsSize)inconsumeResultbefore actually perform any real "consume logic".On the other hand, considering this is a important but spaghetti code, and possibly need to be improved in the future, some refactoring works are also done.
Circuit Breaker Change
Circuit breaker
addEstimateBytesAndMaybeBreakare used twiceconsume: query result received at coordinator transport layer, and will be handled by query result consumer. We estimate the heap size of the query result, and check if it breaks the REQUEST circuit breaker.tryExecuteNext: before doing partial reduce on the buffered query results, we estimate the extra heap size that will be used, and check if it breaks the REQUEST circuit breaker.Some context about the partial merge logic:
PendingMergesare the core logic to buffer and reduce shard results in a batched manner without waiting for all results arrived.consumebuffer the shard result and check if the buffer size reaches threshold, and create merge task from buffer.tryExecuteNextpoll merge task from queue and perform the partial reduce.Refactoring around Failure Handling
The rule I followed is to use
onMergeFailureto handle any failure we captured. It could be a circuit breaker exception, a task cancellation exception or any exception caught during the partial reduce.onMergeFailurestores the failure, resets the circuit breaker, and clears the merge task queue. It also sends the cancel search task request to other nodes so they could try stop processing the shard query at their best.Performance Implication
TermsReduceBenchmark doesn't show any regression
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.