Collaborator

MarcusSorealheis commented Nov 22, 2025

Description

Implements a self-healing Redis connection pool (RecoverablePool) that automatically detects and replaces failed connections without disrupting ongoing operations. When a connection fails, the pool creates a new client, establishes its connection, and swaps it into the pool, so Redis connection failures stay invisible to the caller.
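
The swap itself can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: `FakeClient`, `SwappablePool`, `checkout`, and `replace` are hypothetical names, and a real client would also need to connect before being swapped in.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a Redis client; this is not fred's API.
#[derive(Debug)]
pub struct FakeClient {
    pub id: usize,
}

/// Minimal pool whose slots can be swapped while callers keep old handles.
pub struct SwappablePool {
    slots: Vec<RwLock<Arc<FakeClient>>>,
    next_slot: AtomicUsize,
    next_id: AtomicUsize,
}

impl SwappablePool {
    pub fn new(size: usize) -> Self {
        assert!(size > 0, "pool needs at least one slot");
        let slots = (0..size)
            .map(|id| RwLock::new(Arc::new(FakeClient { id })))
            .collect();
        Self {
            slots,
            next_slot: AtomicUsize::new(0),
            next_id: AtomicUsize::new(size),
        }
    }

    /// Round-robin pick: returns the slot index and a cheap handle clone.
    pub fn checkout(&self) -> (usize, Arc<FakeClient>) {
        let idx = self.next_slot.fetch_add(1, Ordering::Relaxed) % self.slots.len();
        let client = Arc::clone(&*self.slots[idx].read().unwrap());
        (idx, client)
    }

    /// Build a fresh client, swap it into slot `idx`, and return the old one.
    pub fn replace(&self, idx: usize) -> Arc<FakeClient> {
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        let fresh = Arc::new(FakeClient { id });
        let mut slot = self.slots[idx].write().unwrap();
        std::mem::replace(&mut *slot, fresh)
    }
}

/// Walks through one failure: returns (failed id, retired id, current id).
pub fn demo() -> (usize, usize, usize) {
    let pool = SwappablePool::new(2);
    let (idx, failed) = pool.checkout(); // slot 0, client 0
    let retired = pool.replace(idx); // pretend `failed` hit a connection error
    let (_, _untouched) = pool.checkout(); // slot 1 is unaffected
    let (_, current) = pool.checkout(); // slot 0 again, now the fresh client
    (failed.id, retired.id, current.id)
}
```

Calling `demo()` from a `main` function shows that callers holding the old handle keep a valid `Arc`, while later checkouts of the same slot see the replacement.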

Important

This PR extends #2067 and should be merged after it. It needs to be tested against real-world customer workloads, and it may break. A few obvious risks in this implementation are:

  • a thundering herd of replacement attempts
  • potential for silent client leaks (grep for "slow proc" in this blog)
  • losing fred's built-in pool capabilities. As I write this, I'm starting to think there was some tuning we could have done in fred to prevent these issues (e.g. increasing the pool size), but I'm not sure.
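
One way a silent leak could arise, assuming clients are handed out as `Arc` handles (an assumption; the description does not say how handles are shared), is that a slow caller keeps a clone of a replaced client alive. The sketch below uses a hypothetical `FakeClient` and `swap_and_track` helper to show how a `Weak` handle can reveal that a retired client has not been dropped yet.

```rust
use std::sync::{Arc, Weak};

/// Hypothetical stand-in for a Redis client; this is not fred's API.
pub struct FakeClient;

/// Swap `fresh` into `slot` and return a weak handle to the retired client.
/// The pool's own strong reference is dropped when this function returns,
/// so the retired client is freed only if no caller still holds a clone.
pub fn swap_and_track(slot: &mut Arc<FakeClient>, fresh: Arc<FakeClient>) -> Weak<FakeClient> {
    let old = std::mem::replace(slot, fresh);
    Arc::downgrade(&old)
}

/// Returns (alive while a slow caller holds it, alive after release).
pub fn demo() -> (bool, bool) {
    let mut slot = Arc::new(FakeClient);
    // A slow task that is still using the client that is about to be replaced.
    let slow_caller = Arc::clone(&slot);
    let retired = swap_and_track(&mut slot, Arc::new(FakeClient));
    let alive_while_held = retired.upgrade().is_some();
    drop(slow_caller);
    let alive_after_release = retired.upgrade().is_some();
    (alive_while_held, alive_after_release)
}
```

Tracking retired clients this way would only surface the problem. Releasing server-side resources still requires an explicit shutdown of the replaced client.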

Key changes:

  • Replace fred's built-in Pool with a custom RecoverablePool that supports on-demand client replacement
  • Implement double-checked locking to prevent redundant replacements in concurrent scenarios
  • Add graceful shutdown of replaced clients to release server-side resources
  • Improve error messages with connection details for easier debugging
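
The double-checked locking idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `Slot`, `FakeClient`, and `replace_if_current` are hypothetical names, and the "connect" step is reduced to constructing a struct. Many callers that saw the same failed client race to replace it, and only the first one does the work.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, RwLock};
use std::thread;

/// Hypothetical client tagged with the generation that created it.
pub struct FakeClient {
    pub generation: usize,
}

pub struct Slot {
    current: RwLock<Arc<FakeClient>>,
    replace_lock: Mutex<()>,
    pub replacements: AtomicUsize,
}

impl Slot {
    pub fn new() -> Self {
        Self {
            current: RwLock::new(Arc::new(FakeClient { generation: 0 })),
            replace_lock: Mutex::new(()),
            replacements: AtomicUsize::new(0),
        }
    }

    pub fn get(&self) -> Arc<FakeClient> {
        Arc::clone(&*self.current.read().unwrap())
    }

    /// Replace the client only if `failed` is still the one installed.
    pub fn replace_if_current(&self, failed: &Arc<FakeClient>) -> Arc<FakeClient> {
        // First check (read lock only): another caller may have swapped already.
        {
            let cur = self.current.read().unwrap();
            if !Arc::ptr_eq(&*cur, failed) {
                return Arc::clone(&*cur);
            }
        }
        // Serialize would-be replacers.
        let _guard = self.replace_lock.lock().unwrap();
        // Second check: the swap may have happened while we waited for the lock.
        {
            let cur = self.current.read().unwrap();
            if !Arc::ptr_eq(&*cur, failed) {
                return Arc::clone(&*cur);
            }
        }
        // A real pool would connect here, before taking the write lock.
        let fresh = Arc::new(FakeClient { generation: failed.generation + 1 });
        *self.current.write().unwrap() = Arc::clone(&fresh);
        self.replacements.fetch_add(1, Ordering::SeqCst);
        fresh
    }
}

/// Eight threads report the same failed client; returns the number of
/// replacements performed and the generation each thread ended up with.
pub fn demo() -> (usize, Vec<usize>) {
    let slot = Arc::new(Slot::new());
    let failed = slot.get();
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let slot = Arc::clone(&slot);
            let failed = Arc::clone(&failed);
            thread::spawn(move || slot.replace_if_current(&failed).generation)
        })
        .collect();
    let generations: Vec<usize> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    (slot.replacements.load(Ordering::SeqCst), generations)
}
```

Comparing handles with `Arc::ptr_eq` is one design choice for detecting a stale failure report; a generation counter on the slot would work as well. Either way, this limits redundant replacements per slot but does not by itself prevent a herd across many slots failing at once.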

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

  1. Unit tests updated to use RecoverablePool
  2. Connection error tests verify failover behavior
  3. Health check tests confirm proper error reporting
  4. bazel test //nativelink-store:integration_tests/redis_store_test_test passes

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend (see some docs)
