Eliminate redundant dial mutex causing unbounded connection queue contention #3088
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The immediate symptom this PR attempts to address: during periods of transient server connectivity errors, go-redis commands time out after upwards of 60s (or more), even though the socket read/write timeouts are 3s and the context timeout on the commands is 1s. We have root caused this bug to lock contention in the connection pool's lazy dialer.
There is a mutual exclusion lock in
DialHook
, which allows only one server dial to occur at the same time. In the event of server connectivity errors, this causes unbounded connection queueing under highly concurrent workloads.Consider, for example, a concurrent workload with the default dial timeout of 5s, and an unresponsive server endpoint. During this period, all dials are timing out.
DialHook
.DialHook
itself does not mutate any state in thehooksMixin
. I believe the mutex is redundant, and can be eliminated. The original commit that introduced the lock attempts to fix a race condition encountered inAddHook
. The mutex that guardschain
should be sufficient for this purpose.We have validated that this change fixes the unbounded queueing, and prevents the system from entering a prolonged state of not serving any useful throughput, during these periods.
I have also added a unit test to capture the regression. Without this patch, the unit test correctly fails;
Ping
takes successively longer on each invocation (1s, 2s, 3s, etc.).