Update role label irrespective of connection failure from operator#38
Update role label irrespective of connection failure from operator#38pratheep-kumar wants to merge 1 commit into
Conversation
6861937 to
0f76552
Compare
… operator Signed-off-by: Pratheep Kumar <pratheepkumar0403@gmail.com>
0f76552 to
448734d
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a critical issue where the Redis operator fails to update pod role labels during leader election when Redis pods become unresponsive. The solution separates label updates from connectivity checks to ensure service availability is maintained even when some Redis instances are temporarily unreachable.
- Refactored
CheckAllSlavesFromMasterto use a two-phase approach: always update labels first, then check connectivity - Enhanced error handling in
GetMasterIPandGetNumberMastersto be more resilient to connectivity issues - Minor import reordering and whitespace cleanup
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| service/redis/client.go | Minor import reordering and whitespace cleanup |
| operator/redisfailover/service/check.go | Core fix implementing two-phase label updates and improved error handling for Redis connectivity issues |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| // This ensures labels are always updated regardless of Redis connectivity issues | ||
| for _, rp := range rps.Items { | ||
| if rp.Status.PodIP == master { | ||
| r.logger.Infof("Update pod label, namespace: %s, pod name: %s, labels: %v", rf.Namespace, rp.Name, generateRedisMasterRoleLabel()) |
There was a problem hiding this comment.
These debug log messages could generate excessive noise in production. Consider using Debug level instead of Info for label update operations, or add a condition to only log when labels actually change.
| return err | ||
| } | ||
| } else { | ||
| r.logger.Infof("Update pod label, namespace: %s, pod name: %s, labels: %v", rf.Namespace, rp.Name, generateRedisSlaveRoleLabel()) |
There was a problem hiding this comment.
These debug log messages could generate excessive noise in production. Consider using Debug level instead of Info for label update operations, or add a condition to only log when labels actually change.
| // If we had some connectivity errors but labels were updated successfully, | ||
| // log a warning but allow the healing process to continue | ||
| if len(connectivityErrors) > 0 { | ||
| r.logger.Warningf("Some Redis instances were unresponsive during replication check, but pod labels were updated successfully. Connectivity errors: %d", len(connectivityErrors)) |
There was a problem hiding this comment.
The warning message should be more specific about which Redis instances were unresponsive. Consider logging the pod IPs or names that had connectivity issues to aid in troubleshooting.
| if err != nil { | ||
| r.logger.Errorf("Get slave of master failed, maybe this node is not ready, pod ip: %s", rp.Status.PodIP) | ||
| return err | ||
| connectivityErrors = append(connectivityErrors, err) |
There was a problem hiding this comment.
Can this be exposed as metrics, rf_redis_connectivity_error
| } | ||
| } | ||
|
|
||
| // If ALL pods had connectivity errors, return an error |
There was a problem hiding this comment.
Did not get this! Whats happening here?
|
|
||
| // Log connectivity status for debugging | ||
| if connectivityErrors > 0 { | ||
| r.logger.Warningf("Found %d masters out of %d responsive nodes (%d nodes had connectivity issues)", nMasters, len(rips)-connectivityErrors, connectivityErrors) |
|
Hi, any news on this PR? |
Fix: Redis operator fails to update pod role labels during leader election
Fixes #34
🐛 Problem
When a Redis pod becomes unresponsive (e.g., due to
redis-cli debug sleep 300or network issues), the Redis operator fails to update pod role labels after Sentinel completes a successful leader election. This results in:📋 Steps to Reproduce
redis-cli debug sleep 300kubectl get pods -o yaml | yq .metadata.labels.redisfailovers-rolemaster, others showslave🔍 Root Cause
The
CheckAllSlavesFromMasterfunction inoperator/redisfailover/service/check.gohad a critical flaw:Error logs showing the issue:
✅ Solution
Separated label updates from connectivity checks into two distinct phases:
Phase 1: Always Update Labels
Phase 2: Check Connectivity (Non-blocking)
🔧 Changes Made
Modified:
operator/redisfailover/service/check.goCheckAllSlavesFromMaster()function refactored🧪 Testing
📈 Expected Behavior After Fix
Scenario walkthrough: