Conversation

GrantPSpencer
Contributor

Issues

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    TopState handoff can still occur in response to a node going down even while WAGED rebalance failures are occurring (assuming a previously calculated best possible assignment exists). However, if a participant is removed from the cluster during the rebalance failures, an NPE occurs that prevents topState handoff. This PR adds a test and proposes a potential fix: protecting the instanceCapacity checks from null values. A minimal sketch of the failure mode is shown below.
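
For context, a minimal, self-contained sketch of the failure mode. The class, the instance names, and the "DISK" capacity key are illustrative stand-ins (the real map is WagedInstanceCapacity's _instanceCapacityMap); this sketches the bug, not the PR's code:

    import java.util.HashMap;
    import java.util.Map;

    public class NpeSketch {
      public static void main(String[] args) {
        // Simplified stand-in for WagedInstanceCapacity's per-instance capacity map.
        Map<String, Map<String, Integer>> instanceCapacityMap = new HashMap<>();
        instanceCapacityMap.put("localhost_1", Map.of("DISK", 100));

        // "localhost_0" was removed from the cluster, so it is absent from the map,
        // but the previously computed best possible assignment still references it.
        Map<String, Integer> capacity = instanceCapacityMap.get("localhost_0"); // null

        // Without a containsKey/null guard, the dereference below is the NPE
        // that fails the entire pipeline.
        System.out.println(capacity.get("DISK")); // NullPointerException
      }
    }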

Tests

  • The following tests are written for this issue:
    Added test testNPEonRebalanceFailure to class TestWagedNPE.java and did minor refactoring of the test class.

  • The following is the result of the "mvn test" command on the appropriate module:
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 31.493 s
    [INFO] Finished at: 2025-03-13T17:02:12-07:00
    [INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:
    N/A

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Comment on lines +204 to +209
if (!_instanceCapacityMap.containsKey(instance)) {
  LOG.error("Instance: " + instance + " not found in instance capacity map. Cluster may be using previous "
      + "idealState that includes an instance that is no longer part of the cluster.");
  return false;
}

Contributor

The interesting question is whether we should skip this instance and continue, or fail the entire pipeline and fall back to the previous computation.

Contributor Author

Throwing a HelixException at computeBestPossibleStates will just lead to falling back to the previous best possible. However, throwing an exception here would prevent computeBestPossibleStateForPartition from assigning correct states based on the previous best possible we fell back to. Stack trace below.

We'd need to add error handling around the computeBestPossiblePartitionState call and have some fallback mechanism; a sketch of that pattern follows the stack trace.

3732 [HelixController-pipeline-default-TestWagedNPE_cluster-(70412709_DEFAULT)] ERROR org.apache.helix.controller.GenericHelixController [] - Exception while executing DEFAULT pipeline for cluster TestWagedNPE_cluster. Will not continue to next pipeline
org.apache.helix.HelixException: Instance: localhost_0 not found in instance capacity map. Cluster may be using previous idealState that includes an instance that is no longer part of the cluster.
	at org.apache.helix.controller.rebalancer.waged.WagedInstanceCapacity.checkAndReduceInstanceCapacity(WagedInstanceCapacity.java:209) ~[classes/:?]
	at org.apache.helix.controller.dataproviders.ResourceControllerDataProvider.checkAndReduceCapacity(ResourceControllerDataProvider.java:543) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossibleStateForPartition(DelayedAutoRebalancer.java:378) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossiblePartitionState(DelayedAutoRebalancer.java:271) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.DelayedAutoRebalancer.computeBestPossiblePartitionState(DelayedAutoRebalancer.java:54) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.lambda$computeNewIdealStates$0(WagedRebalancer.java:281) ~[classes/:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
	at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1693) ~[?:?]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290) ~[?:?]
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746) ~[?:?]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:408) ~[?:?]
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:736) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173) ~[?:?]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) ~[?:?]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) ~[?:?]
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:661) ~[?:?]
	at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:277) ~[classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:445) ~[classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:289) ~[classes/:?]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:94) ~[classes/:?]
	at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[classes/:?]
	at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:905) [classes/:?]
	at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1556) [classes/:?]
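
A minimal sketch of the fallback mechanism described above, with hypothetical names (computeWithFallback, previousBestPossible); this is not the PR's actual change, and the real per-partition computation lives in DelayedAutoRebalancer.computeBestPossiblePartitionState:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class FallbackSketch {
      // Hypothetical stand-in for the last successfully computed assignment.
      private final Map<String, Map<String, String>> previousBestPossible = new HashMap<>();

      Map<String, String> computeWithFallback(String partition) {
        try {
          return computeStateForPartition(partition);
        } catch (RuntimeException e) { // would be HelixException in the real code path
          // Fall back to the previous assignment for this partition so one stale
          // instance reference does not fail the whole pipeline.
          return previousBestPossible.getOrDefault(partition, Collections.emptyMap());
        }
      }

      private Map<String, String> computeStateForPartition(String partition) {
        // Placeholder for the real per-partition computation.
        throw new RuntimeException("Instance not found in instance capacity map");
      }
    }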

Contributor

Then we should have a way to get the state change applied (the urgent rebalance pipeline should handle this). There should be no skipping in the urgent rebalance pipeline.

Contributor

Simply ignoring it in the main logic can cause placement miscomputation for the full / partial rebalance.

Contributor Author

Then we should have a way to get the state change applied (the urgent rebalance pipeline should handle this). There should be no skipping in the urgent rebalance pipeline.

I'm not sure I understand what you mean. As long as rebalance failures persist, WAGED will fall back to the previous best possible assignment, which may include nodes that were removed from the cluster while the failures were occurring. Subsequent pipeline runs will hit the same issue until the root cause of the rebalance failure is addressed. The instanceCapacityMap should be the source of truth for assignable nodes in WagedInstanceCapacity.

If we fail the pipeline entirely (the current behavior, due to the NPE), then we will not distribute states across the available nodes from the previous IdealState.
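
For illustration, a hypothetical sketch of the skip-rather-than-fail alternative, treating the capacity map as the source of truth for assignable nodes; all names here are illustrative, not the PR's actual code:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class SkipRemovedNodesSketch {
      static List<String> assignableInstances(List<String> preferenceList,
          Map<String, Map<String, Integer>> instanceCapacityMap) {
        // Keep only instances still present in the capacity map, so states are
        // distributed across the surviving nodes instead of aborting the pipeline.
        return preferenceList.stream()
            .filter(instanceCapacityMap::containsKey)
            .collect(Collectors.toList());
      }
    }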
