Update role label irrespective of connection failure from operator #38

Open
pratheep-kumar wants to merge 1 commit into master from fixing-master-role-update


pratheep-kumar (Collaborator) commented Sep 19, 2025

Fix: Redis operator fails to update pod role labels during leader election

Fixes #34

🐛 Problem

When a Redis pod becomes unresponsive (e.g., due to redis-cli debug sleep 300 or network issues), the Redis operator fails to update pod role labels after Sentinel completes a successful leader election. This results in:

  • All pods showing wrong labels even after successful leader election
  • Empty service endpoints (no master found)

📋 Steps to Reproduce

  1. Install a RedisFailover resource with 3 replicas (quorum setup)
  2. Manually hang the current master: redis-cli debug sleep 300
  3. Wait for Sentinel to complete leader election (new master promoted)
  4. Check pod labels: kubectl get pods -o yaml | yq .metadata.labels.redisfailovers-role

Expected: New master pod shows master, others show slave
Actual: Labels are not updated and service endpoints are empty

🔍 Root Cause

The CheckAllSlavesFromMaster function in operator/redisfailover/service/check.go had a critical flaw:

  1. Sequential processing: Updates labels and checks Redis connectivity in the same loop
  2. Early exit on timeout: When hitting an unresponsive pod, function returns error immediately
  3. Incomplete processing: Remaining pods never get their labels updated
  4. Stale state: Service selectors can't find master pod due to incorrect labels
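
The failure mode above can be illustrated with a minimal sketch. This is not the operator's code: the pod type, the ping callback, the IPs, and checkSingleLoop are hypothetical stand-ins, and the label update is modelled as an in-memory assignment rather than a Kubernetes API call.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical stand-in for a Kubernetes pod; only the fields
// needed for the illustration are modelled.
type pod struct {
	Name  string
	IP    string
	Label string // value of the redisfailovers-role label
}

// errTimeout simulates an unresponsive Redis instance.
var errTimeout = errors.New("redis: i/o timeout")

// checkSingleLoop mimics the flawed shape: the connectivity check and the
// label update share one loop, and the first failed check aborts it.
func checkSingleLoop(pods []*pod, masterIP string, ping func(ip string) error) error {
	for _, p := range pods {
		if err := ping(p.IP); err != nil {
			return err // early exit: this pod and all later pods keep stale labels
		}
		if p.IP == masterIP {
			p.Label = "master"
		} else {
			p.Label = "slave"
		}
	}
	return nil
}

func main() {
	// 10.0.0.1 is the hung former master; Sentinel has promoted 10.0.0.2.
	pods := []*pod{
		{Name: "rfr-0", IP: "10.0.0.1", Label: "master"},
		{Name: "rfr-1", IP: "10.0.0.2", Label: "slave"},
		{Name: "rfr-2", IP: "10.0.0.3", Label: "slave"},
	}
	ping := func(ip string) error {
		if ip == "10.0.0.1" {
			return errTimeout
		}
		return nil
	}

	err := checkSingleLoop(pods, "10.0.0.2", ping)
	fmt.Println("error:", err)
	for _, p := range pods {
		fmt.Printf("%s role=%s\n", p.Name, p.Label)
	}
}
```

In this sketch the hung pod is visited first, so the promoted pod rfr-1 still carries the slave label when the function returns, which is the stale state a master service selector cannot match.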

Error logs showing the issue:

time="2025-08-13T10:43:22Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.111.36.121"
time="2025-08-13T10:47:47Z" level=info msg="Update pod label, namespace: infra-dev, pod name: rfr-pigredis-1, labels: map[redisfailovers-role:master]"

Solution

Separated label updates from connectivity checks into two distinct phases:

Phase 1: Always Update Labels

  • Update ALL pod labels based on known master IP (from Sentinel)
  • This phase always succeeds regardless of individual pod connectivity issues
  • Ensures service endpoints work immediately

Phase 2: Check Connectivity (Non-blocking)

  • Verify Redis replication consistency
  • Log errors but don't block label updates
  • Only fail if ALL pods are unresponsive (backward compatibility)
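
The two phases can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in check.go: pod, ping, and updateLabelsThenCheck are hypothetical names, the label update is an in-memory assignment (the real Kubernetes call can itself return an error), and the warning goes to stdout instead of the operator's logger.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical stand-in for a Kubernetes pod.
type pod struct {
	Name  string
	IP    string
	Label string // value of the redisfailovers-role label
}

// errTimeout simulates an unresponsive Redis instance.
var errTimeout = errors.New("redis: i/o timeout")

// updateLabelsThenCheck labels every pod first, then checks connectivity
// without letting a single failure abort the pass.
func updateLabelsThenCheck(pods []*pod, masterIP string, ping func(ip string) error) error {
	// Phase 1: label all pods from the master IP reported by Sentinel.
	for _, p := range pods {
		if p.IP == masterIP {
			p.Label = "master"
		} else {
			p.Label = "slave"
		}
	}

	// Phase 2: check connectivity and collect failures instead of returning early.
	var failed []string
	for _, p := range pods {
		if err := ping(p.IP); err != nil {
			failed = append(failed, p.IP)
		}
	}

	// Only report an error when every pod is unresponsive.
	if len(pods) > 0 && len(failed) == len(pods) {
		return fmt.Errorf("all %d pods are unresponsive", len(pods))
	}
	if len(failed) > 0 {
		fmt.Printf("warning: unresponsive pods: %v\n", failed)
	}
	return nil
}

func main() {
	pods := []*pod{
		{Name: "rfr-0", IP: "10.0.0.1", Label: "master"}, // hung former master
		{Name: "rfr-1", IP: "10.0.0.2", Label: "slave"},  // promoted by Sentinel
		{Name: "rfr-2", IP: "10.0.0.3", Label: "slave"},
	}
	ping := func(ip string) error {
		if ip == "10.0.0.1" {
			return errTimeout
		}
		return nil
	}

	err := updateLabelsThenCheck(pods, "10.0.0.2", ping)
	fmt.Println("error:", err)
	for _, p := range pods {
		fmt.Printf("%s role=%s\n", p.Name, p.Label)
	}
}
```

Collecting the failed IPs rather than only counting errors also makes it straightforward to name the unresponsive pods in the warning.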

🔧 Changes Made

Modified: operator/redisfailover/service/check.go

  • CheckAllSlavesFromMaster() function refactored
  • Added two-phase processing approach

🧪 Testing

  • ✅ All existing unit tests pass

📈 Expected Behavior After Fix

Before: Pod hangs → Timeout → Exit early → No labels updated → Service down
After:  Pod hangs → Labels updated first → Connectivity check fails → Continue → Service works

Scenario walkthrough:

  1. Redis pod hangs (debug sleep 300)
  2. Sentinel elects new master
  3. Operator detects new master via responsive pods
  4. Operator updates ALL pod labels immediately
  5. Service endpoints work → No downtime! 🎯

pratheep-kumar force-pushed the fixing-master-role-update branch from 6861937 to 0f76552 on September 19, 2025 17:27
… operator

Signed-off-by: Pratheep Kumar <pratheepkumar0403@gmail.com>
pratheep-kumar force-pushed the fixing-master-role-update branch from 0f76552 to 448734d on September 19, 2025 17:29
pratheep-kumar changed the title from "Update role label irrespective of connection failure from operator #34" to "Update role label irrespective of connection failure from operator" on Sep 19, 2025
pratheep-kumar marked this pull request as ready for review on September 19, 2025 18:11

Copilot AI left a comment

Pull Request Overview

This PR fixes a critical issue where the Redis operator fails to update pod role labels during leader election when Redis pods become unresponsive. The solution separates label updates from connectivity checks to ensure service availability is maintained even when some Redis instances are temporarily unreachable.

  • Refactored CheckAllSlavesFromMaster to use a two-phase approach: always update labels first, then check connectivity
  • Enhanced error handling in GetMasterIP and GetNumberMasters to be more resilient to connectivity issues
  • Minor import reordering and whitespace cleanup

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File                                       Description
service/redis/client.go                    Minor import reordering and whitespace cleanup
operator/redisfailover/service/check.go    Core fix implementing two-phase label updates and improved error handling for Redis connectivity issues


// This ensures labels are always updated regardless of Redis connectivity issues
for _, rp := range rps.Items {
	if rp.Status.PodIP == master {
		r.logger.Infof("Update pod label, namespace: %s, pod name: %s, labels: %v", rf.Namespace, rp.Name, generateRedisMasterRoleLabel())

Copilot AI Sep 19, 2025

These debug log messages could generate excessive noise in production. Consider using Debug level instead of Info for label update operations, or add a condition to only log when labels actually change.

		return err
	}
} else {
	r.logger.Infof("Update pod label, namespace: %s, pod name: %s, labels: %v", rf.Namespace, rp.Name, generateRedisSlaveRoleLabel())

Copilot AI Sep 19, 2025

These debug log messages could generate excessive noise in production. Consider using Debug level instead of Info for label update operations, or add a condition to only log when labels actually change.

// If we had some connectivity errors but labels were updated successfully,
// log a warning but allow the healing process to continue
if len(connectivityErrors) > 0 {
	r.logger.Warningf("Some Redis instances were unresponsive during replication check, but pod labels were updated successfully. Connectivity errors: %d", len(connectivityErrors))

Copilot AI Sep 19, 2025

The warning message should be more specific about which Redis instances were unresponsive. Consider logging the pod IPs or names that had connectivity issues to aid in troubleshooting.

  if err != nil {
  	r.logger.Errorf("Get slave of master failed, maybe this node is not ready, pod ip: %s", rp.Status.PodIP)
- 	return err
+ 	connectivityErrors = append(connectivityErrors, err)

Can this be exposed as a metric, e.g. rf_redis_connectivity_error?

}
}

// If ALL pods had connectivity errors, return an error

I did not get this. What's happening here?

Comment thread service/redis/client.go

// Log connectivity status for debugging
if connectivityErrors > 0 {
	r.logger.Warningf("Found %d masters out of %d responsive nodes (%d nodes had connectivity issues)", nMasters, len(rips)-connectivityErrors, connectivityErrors)

What does this log warning even mean?

@D13410N3

Hi, any news on this PR?

pratheep-kumar self-assigned this on Nov 14, 2025

Development

Successfully merging this pull request may close these issues.

Label redisfailovers-role is not updated on redis pod hang

4 participants