Update role label irrespective of connection failure from operator by pratheep-kumar · Pull Request #38 · freshworks-oss/redis-operator

pratheep-kumar · 2025-09-19T15:43:14Z

Fix: Redis operator fails to update pod role labels during leader election

Fixes #34

🐛 Problem

When a Redis pod becomes unresponsive (e.g., due to redis-cli debug sleep 300 or network issues), the Redis operator fails to update pod role labels after Sentinel completes a successful leader election. This results in:

All pods showing wrong labels even after successful leader election
Empty service endpoints (no master found)

📋 Steps to Reproduce

Install a RedisFailover resource with 3 replicas (quorum setup)
Manually hang the current master: redis-cli debug sleep 300
Wait for Sentinel to complete leader election (new master promoted)
Check pod labels: kubectl get pods -o yaml | yq .metadata.labels.redisfailovers-role
Expected: New master pod shows master, others show slave
Actual: Labels are not updated and service endpoints are empty

🔍 Root Cause

The CheckAllSlavesFromMaster function in operator/redisfailover/service/check.go had a critical flaw:

Sequential processing: Updates labels and checks Redis connectivity in the same loop
Early exit on timeout: When hitting an unresponsive pod, function returns error immediately
Incomplete processing: Remaining pods never get their labels updated
Stale state: Service selectors can't find master pod due to incorrect labels

Error logs showing the issue:

time="2025-08-13T10:43:22Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.111.36.121"
time="2025-08-13T10:47:47Z" level=info msg="Update pod label, namespace: infra-dev, pod name: rfr-pigredis-1, labels: map[redisfailovers-role:master]"

✅ Solution

Separated label updates from connectivity checks into two distinct phases:

Phase 1: Always Update Labels

Update ALL pod labels based on known master IP (from Sentinel)
This phase always succeeds regardless of individual pod connectivity issues
Ensures service endpoints work immediately

Phase 2: Check Connectivity (Non-blocking)

Verify Redis replication consistency
Log errors but don't block label updates
Only fail if ALL pods are unresponsive (backward compatibility)

🔧 Changes Made

Modified: operator/redisfailover/service/check.go

CheckAllSlavesFromMaster() function refactored
Added two-phase processing approach

🧪 Testing

✅ All existing unit tests pass

📈 Expected Behavior After Fix

Before: Pod hangs → Timeout → Exit early → No labels updated → Service down
After:  Pod hangs → Labels updated first → Connectivity check fails → Continue → Service works

Scenario walkthrough:

Redis pod hangs (debug sleep 300)
Sentinel elects new master
Operator detects new master via responsive pods
Updates ALL pod labels immediately
Service endpoints work → No downtime! 🎯

Copilot

Pull Request Overview

This PR fixes a critical issue where the Redis operator fails to update pod role labels during leader election when Redis pods become unresponsive. The solution separates label updates from connectivity checks to ensure service availability is maintained even when some Redis instances are temporarily unreachable.

Refactored CheckAllSlavesFromMaster to use a two-phase approach: always update labels first, then check connectivity
Enhanced error handling in GetMasterIP and GetNumberMasters to be more resilient to connectivity issues
Minor import reordering and whitespace cleanup

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
service/redis/client.go	Minor import reordering and whitespace cleanup
operator/redisfailover/service/check.go	Core fix implementing two-phase label updates and improved error handling for Redis connectivity issues

Copilot · 2025-09-19T18:12:40Z

+	// This ensures labels are always updated regardless of Redis connectivity issues
 	for _, rp := range rps.Items {
 		if rp.Status.PodIP == master {
+			r.logger.Infof("Update pod label, namespace: %s, pod name: %s, labels: %v", rf.Namespace, rp.Name, generateRedisMasterRoleLabel())


These debug log messages could generate excessive noise in production. Consider using Debug level instead of Info for label update operations, or add a condition to only log when labels actually change.
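
One way to act on the second suggestion, sketched under assumptions: `labelsDiffer` is a hypothetical helper, not part of the operator, and the caller is assumed to have the pod's current labels at hand. Only the label key `redisfailovers-role` comes from the PR.

```go
package main

import "fmt"

// labelsDiffer reports whether applying desired on top of current would
// change at least one label value.
func labelsDiffer(current, desired map[string]string) bool {
	for k, v := range desired {
		if cur, ok := current[k]; !ok || cur != v {
			return true
		}
	}
	return false
}

func main() {
	current := map[string]string{"redisfailovers-role": "slave"}
	desired := map[string]string{"redisfailovers-role": "master"}

	// Log (and update) only when the role label would actually change.
	if labelsDiffer(current, desired) {
		fmt.Printf("Update pod label, labels: %v\n", desired)
	}
}
```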

Copilot · 2025-09-19T18:12:40Z

 				return err
 			}
 		} else {
+			r.logger.Infof("Update pod label, namespace: %s, pod name: %s, labels: %v", rf.Namespace, rp.Name, generateRedisSlaveRoleLabel())


These debug log messages could generate excessive noise in production. Consider using Debug level instead of Info for label update operations, or add a condition to only log when labels actually change.

Copilot · 2025-09-19T18:12:41Z

+	// If we had some connectivity errors but labels were updated successfully,
+	// log a warning but allow the healing process to continue
+	if len(connectivityErrors) > 0 {
+		r.logger.Warningf("Some Redis instances were unresponsive during replication check, but pod labels were updated successfully. Connectivity errors: %d", len(connectivityErrors))


The warning message should be more specific about which Redis instances were unresponsive. Consider logging the pod IPs or names that had connectivity issues to aid in troubleshooting.

samof76 · 2025-09-19T18:23:06Z

 		if err != nil {
 			r.logger.Errorf("Get slave of master failed, maybe this node is not ready, pod ip: %s", rp.Status.PodIP)
-			return err
+			connectivityErrors = append(connectivityErrors, err)


Can this be exposed as a metric, e.g. rf_redis_connectivity_error?

samof76 · 2025-09-19T18:24:30Z

 		}
 	}
+
+	// If ALL pods had connectivity errors, return an error


I did not get this. What's happening here?

samof76 · 2025-09-19T18:29:29Z

+
+	// Log connectivity status for debugging
+	if connectivityErrors > 0 {
+		r.logger.Warningf("Found %d masters out of %d responsive nodes (%d nodes had connectivity issues)", nMasters, len(rips)-connectivityErrors, connectivityErrors)


What does this log warning even mean?

D13410N3 · 2025-10-14T11:02:21Z

Hi, any news on this PR?

pratheep-kumar force-pushed the fixing-master-role-update branch from 6861937 to 0f76552 on September 19, 2025 17:27

Fix - Updating role label to pods even during connection failure from…

448734d

… operator Signed-off-by: Pratheep Kumar <pratheepkumar0403@gmail.com>

pratheep-kumar force-pushed the fixing-master-role-update branch from 0f76552 to 448734d on September 19, 2025 17:29

pratheep-kumar changed the title ~~Update role label irrespective of connection failure from operator #34~~ Update role label irrespective of connection failure from operator Sep 19, 2025

pratheep-kumar marked this pull request as ready for review September 19, 2025 18:11

pratheep-kumar requested review from Sasidharan-Gopal, dinesh-murugiah and samof76 as code owners September 19, 2025 18:11

pratheep-kumar requested review from Kaushik-Vijayakumar-1 and Copilot September 19, 2025 18:11

Copilot AI reviewed Sep 19, 2025

samof76 reviewed Sep 19, 2025

Comment thread service/redis/client.go

samof76 reviewed Sep 19, 2025

pratheep-kumar self-assigned this Nov 14, 2025

pratheep-kumar wants to merge 1 commit into master from fixing-master-role-update

4 participants
