Conversation

@ritegarg ritegarg commented Oct 14, 2025

Summary of changes

  1. Event Reactor for Failover: the reactor observes state transitions and makes state updates accordingly.
    a. Add peer event reactor (reacts to peer state transitions)
    b. Add local event reactor (reacts to local state transitions)

  2. New flow to sync between ZK and the HA System Table
    a. ZK is the source of truth for both config and state, but only local state is maintained in ZK.
    b. Updating the System Table is best-effort.
    c. A regular job with jitter syncs from ZK to the System Table in case of a missed update.
    d. Introduced adminVersion to handle any admin updates to config.

  3. Removed Degraded for reader/writer in favor of Degraded Standby

  4. In update calls, check whether the state is already updated, double-checking once with the local cache and once with direct ZK. This is needed because all the HAGroupStoreManagers will try to update the state on the same trigger.

  5. Added fromState to event subscription callback. Note that fromState can be null when ZNode is initialized.

  6. No longer blocking transitions to the HA_GROUP table in IndexRegionObserver, even when we are in ACTIVE_TO_STANDBY state.
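As a rough illustration of item 2c, the jittered schedule keeps the many HAGroupStoreManager instances from syncing at the same instant. The sketch below is hypothetical (class and method names are not the actual Phoenix code); it only shows how a base interval plus bounded random jitter might be computed:

```java
import java.util.Random;

// Hypothetical sketch: base sync interval plus random jitter, so that the
// HAGroupStoreManagers on different region servers spread out their
// ZK -> System Table sync runs instead of firing on the same tick.
public class JitteredSyncDelay {
    private final long baseIntervalMs;
    private final long maxJitterMs;
    private final Random random;

    public JitteredSyncDelay(long baseIntervalMs, long maxJitterMs, Random random) {
        this.baseIntervalMs = baseIntervalMs;
        this.maxJitterMs = maxJitterMs;
        this.random = random;
    }

    // Delay for the next sync run: always at least the base interval,
    // never more than base + maxJitter.
    public long nextDelayMs() {
        return baseIntervalMs + (long) (random.nextDouble() * maxJitterMs);
    }
}
```

Each manager would schedule its next run with `nextDelayMs()`, so concurrent writers naturally spread out over the jitter window.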

Testing

  1. Event Reactor
    a. E2E Failover Happy Path
    b. E2E Failover with Abort
    c. Peer Reaction when ACTIVE moves from Sync to Store&Forward mode and back
  2. New flow to sync between ZK and HA System Table
    a. Existing tests passing
    b. Added a test for the regular job with jitter that syncs from ZK to the System Table in case of a missed update.

* @param fromState the previous state before the transition;
* can be null for the initial state.
* It can also be inaccurate when there is a
* connection loss to ZK and multiple state changes happen in between.
Contributor

Is this true? Aren't we using persistent watchers?

Contributor Author

Yes, in case of connection loss, events can be missed. Although this doesn't affect failover IMO, clients need to be aware of this limitation.
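To make the limitation concrete: a subscriber should treat fromState as advisory only, since it is null on ZNode initialization and possibly stale after a ZK connection loss, and key its logic off toState. A minimal sketch, with a hypothetical callback shape rather than the actual subscription API:

```java
// Hypothetical sketch: a state-transition callback that tolerates a null or
// stale fromState, as discussed above. Only toState is treated as reliable.
public class TransitionSubscriber {
    enum HAGroupState { ACTIVE, STANDBY, ACTIVE_TO_STANDBY, STANDBY_TO_ACTIVE }

    public String onTransition(HAGroupState fromState, HAGroupState toState) {
        // fromState is null when the ZNode was just initialized, and may be
        // inaccurate if intermediate events were missed during connection loss.
        String from = (fromState == null) ? "UNKNOWN" : fromState.name();
        return from + " -> " + toState.name();
    }
}
```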

* Syncs data from ZooKeeper (source of truth) to the system table.
* This method is called periodically to ensure consistency.
*/
private void syncZKToSystemTable() {
Contributor

Instead of syncing every time, can we check whether a sync is needed by first reading the HA record from the HA group system table, and then sync only if needed?

Contributor Author

Done, and also added an IT to check the same.
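The agreed flow — read the record from the HA group system table first, and write only on a mismatch — can be sketched as follows. The record shape and writer callback here are hypothetical stand-ins, not the actual Phoenix internals:

```java
import java.util.Objects;
import java.util.function.Consumer;

// Hypothetical sketch of "sync only if needed": ZK is the source of truth,
// and the best-effort system-table write is skipped when the table already
// matches the ZK record.
public class ConditionalSync {
    record HAGroupRecord(String haGroupName, String state, long adminVersion) {}

    // Returns true only if a system-table write was actually performed.
    static boolean syncIfNeeded(HAGroupRecord zkRecord, HAGroupRecord tableRecord,
                                Consumer<HAGroupRecord> tableWriter) {
        if (Objects.equals(zkRecord, tableRecord)) {
            return false; // already consistent; skip the write
        }
        tableWriter.accept(zkRecord);
        return true;
    }
}
```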

@ritegarg ritegarg requested a review from kadirozde October 17, 2025 18:04
= systemTableRecord.getClusterRole().getDefaultHAGroupState();
Long lastSyncTimeInMs
= defaultHAGroupState.equals(HAGroupStoreRecord.HAGroupState.ACTIVE_NOT_IN_SYNC)
? System.currentTimeMillis()
Contributor

If the ZNode for the HA group is not present, should we set lastSyncTimeInMs to zero always? I do not think it is safe to set to System.currentTimeMillis(). Also what does setting to null mean here? Is it equivalent to setting to zero?

Contributor Author

This is because the ACTIVE cluster will start in the ANIS state. The writer might not be active, so we set this to the current timestamp.
Also, setting null means we are current; it's more of a convention that we adopt. @Himanshu-g81 LMK if you have any strong preference, since you are the primary user of this field.

Contributor

@Himanshu-g81 Himanshu-g81 Oct 21, 2025

This is because the ACTIVE cluster will start in ANIS state. The writer might not be active so setting this to current timestamp

I think in that case it would be 0, i.e. the cluster is not in sync but we don't have an exact timestamp for when it was last in sync; we will not purge anything during compaction on the standby side, and as soon as it's in sync, this timestamp will also be updated.

Also, setting null means we are current, it's more of a convention that we adopt.

@ritegarg if the cluster is in sync, can we keep it currentTime (and not null)? (i.e. the last timestamp when the cluster was in sync, at the time the API is called, which is the current time). On the reader it's handled with the same assumption.

Contributor Author

i.e. last timestamp when cluster was in sync - at the time when API is called

This might be hard to maintain, as some time can elapse between calling and returning the response. It is better to keep it as 0.

I propose we adopt this convention:
0 (long default value) -> Sync
-1 (explicitly added when the HAGroup ZNode is initialized) -> Unknown, as the cluster started in this mode
Any other timestamp (long) -> Last known sync time

Standby will observe any updates sent from Active (via ZK listener) and will just copy the timestamp into its HAGroupStoreRecord on the standby ZK cluster. E.g. when the cluster moves from AIS to ANIS, standby will listen to this change and update the local HAGroupStoreRecord in standby ZK with the same value. When the record moves back from ANIS to AIS, the standby peer will then detect the transition and update local ZK with the same value. WDYT?

Contributor

I am not sure I understand the value of setting anything other than zero when we do not know the value for lastSyncTimeInMs. What are the consumers supposed to do when they see the value is -1 (instead of zero)? Regardless of whether it is zero or -1, the behavior will be the same. So let us set it to zero in all such cases.

Contributor Author

Discussed offline and finally decided to use 0 as the default value when the ZNode is initialized.
When we move from AIS -> ANIS, the value is updated to the beginning of the last round.
When we move from ANIS -> AIS after that, the value is retained.
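The convention settled on above can be expressed compactly. This sketch (enum and method names hypothetical) sets the value to 0 on ZNode initialization, moves it to the start of the last sync round on AIS -> ANIS, and retains it otherwise, including ANIS -> AIS:

```java
// Hypothetical sketch of the final lastSyncTimeInMs convention:
// 0 on ZNode initialization, the beginning of the last sync round on
// AIS -> ANIS, and the previous value retained on ANIS -> AIS.
public class LastSyncTime {
    enum State { ACTIVE_IN_SYNC, ACTIVE_NOT_IN_SYNC }

    static final long INITIAL = 0L; // default when the ZNode is initialized

    static long onTransition(State from, State to,
                             long currentValueMs, long lastRoundStartMs) {
        if (from == State.ACTIVE_IN_SYNC && to == State.ACTIVE_NOT_IN_SYNC) {
            return lastRoundStartMs; // went out of sync: record round start
        }
        return currentValueMs; // ANIS -> AIS and all other cases: retain
    }
}
```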

* check the state again and retry if the use case still needs it.
* @throws SQLException when there is an error with the database operation
*/
public void initiateFailoverOnActiveCluster(final String haGroupName)
Contributor

@ritegarg are you planning to add the API to mark the cluster from STANDBY_TO_ACTIVE to ACTIVE in this PR?

Contributor Author

@Himanshu-g81 you can use setHAGroupStatusToSync

throw new IOException("HAGroupStoreManager is null "
+ "for current cluster, check configuration");
}
String tableName
Contributor

I assume the slowness will be observed for the very first mutation on the specific HA group. When do we initialize an HA group store client? I am not worried about slowness in updating the HA group system table. However, this would be a concern if this slowness is also observed for user mutations. I think we need to initialize an HA group store client before receiving the first mutation. This may mean that we need to initiate initialization when the region server level Phoenix coproc starts, possibly asynchronously.
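One way to avoid paying the initialization cost on the first user mutation, as the comment suggests, is to warm the client asynchronously when the coproc starts and fall back to lazy creation if the warm-up has not finished. A sketch under those assumptions — the holder class and method names are illustrative, not the actual Phoenix code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch: warm up the HA group store client at coproc start so
// the first mutation does not block on initialization; getOrInit falls back
// to lazy creation if the background warm-up has not completed yet.
public class ClientHolder<T> {
    private final AtomicReference<T> client = new AtomicReference<>();

    // Called when the region-server coproc starts; runs off the hot path.
    public CompletableFuture<Void> warmUpAsync(Supplier<T> factory) {
        return CompletableFuture.runAsync(() -> client.compareAndSet(null, factory.get()));
    }

    // Called on the mutation path; cheap once the client exists.
    public T getOrInit(Supplier<T> factory) {
        T existing = client.get();
        if (existing != null) {
            return existing;
        }
        client.compareAndSet(null, factory.get());
        return client.get();
    }
}
```

`compareAndSet` ensures the warm-up and a racing mutation-path call cannot install two different clients.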

@ritegarg ritegarg requested a review from kadirozde October 28, 2025 00:55
Contributor

@kadirozde kadirozde left a comment

+1. Thanks!

@kadirozde kadirozde merged commit 520f48b into apache:PHOENIX-7562-feature Oct 28, 2025
ritegarg added a commit to ritegarg/phoenix that referenced this pull request Oct 28, 2025
PHOENIX-7566 ZK to SystemTable Sync and Event Reactor for Failover (apache#2302)