Yes, Redis cares about your load balancing strategy
The funny thing about distributed faults is that they're largely created by several autonomous systems doing exactly what they're supposed to do. More often than not, the problem is that what each system is supposed to do in isolation doesn't translate into how the systems need to behave together as a cohesive whole.
Consider the following problem:
Periodically, and without warning, a client socket to Redis goes away and you receive an Errno::EAGAIN exception. You haven't explicitly closed it and the server is working exactly as expected. You immediately retry it and suddenly everything works again. What happened?
This is exactly the kind of distributed problem I alluded to. The actual problem is a combination of a load balancing strategy further upstream and an aggressive server-side timeout on Redis. Let's talk about load balancing real quick:
On load balancing
Consider the simple behavior of a round-robin load balancing strategy. As new requests arrive, we hand them to each node in the cluster in turn. This makes it very predictable where the next connection will go, but service can degrade if there's a broad spread of response times.
Suppose you have 5 machines, and every 5th connection you receive requires an operation that takes 1 second. The other 4 can be handled in 100ms each. Further, a new connection comes in every 100ms. The first connection goes to the first machine, which starts processing it. The next 4 are distributed to the remaining nodes. When the rotation comes back around, the first machine is still processing its first 1-second operation, but it receives the next 1-second operation regardless. If this continues, machine 1 will always be working at capacity while all the other nodes are underutilized.
Now consider the same setup as before, but with a least-connection strategy. In this case our load balancer simply tracks how many connections each server is currently maintaining and chooses the one with the fewest.
The first 1-second request comes into the first machine. However, since the connections that follow only require 100ms and they're spaced 100ms apart, only machines 2 and 3 will see this traffic.
Why? Because machines 2 and 3 will be closing their connections out as fast as new requests are coming in. When the long 1-second request concludes, the first machine will be available again and it will receive the next connection. Rather than exercising all machines in the cluster equally, low-traffic periods only exercise a small segment of it. The advantage, however, is that we can better accommodate a heterogeneous workload like the one described here during higher-traffic periods.
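To see the starvation effect, here is a small simulation of the same workload under least-connection. It is a sketch under stated assumptions: the method name `least_connection_counts` is hypothetical, a connection counts as closed the instant its work finishes, and ties go to the lowest-numbered machine. Real load balancers break ties differently, so the exact split will vary.

```ruby
# Toy model of least-connection: each request goes to the machine with the
# fewest open connections. Ties go to the lowest-numbered machine, which is
# an assumption made for this sketch.
def least_connection_counts(machines: 5, requests: 50, interval_ms: 100,
                            slow_every: 5, slow_ms: 1000, fast_ms: 100)
  open_until = Array.new(machines) { [] } # finish times of open connections
  handled = Array.new(machines, 0)        # requests handled per machine

  requests.times do |i|
    now = i * interval_ms
    # Close every connection that has finished by the time this one arrives.
    open_until.each { |finishes| finishes.reject! { |t| t <= now } }

    target = (0...machines).min_by { |m| [open_until[m].size, m] }
    cost = (i % slow_every).zero? ? slow_ms : fast_ms
    open_until[target] << now + cost
    handled[target] += 1
  end
  handled
end

p least_connection_counts
```

Under these assumptions the last two machines never receive a single request: the first three absorb everything, and the other two sit idle for the whole run.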
Downstream effects & Redis
When Redis is configured with aggressive server-side timeouts, underutilized machines run the risk of having their sockets closed out from under them whenever traffic is too light to reach every node under a least-connection strategy. In this particular distributed fault, we would observe periods of low traffic that left certain processes and machines in the cluster starving for work. Their idle Redis connections would time out in the meantime, so by the time work came back they no longer had live sockets and we'd see the exception. Luckily, this kind of socket error raises an EAGAIN exception, which essentially tells you the solution: "try again."
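Taking "try again" literally, a bounded retry wrapper might look like the sketch below. The helper `with_retry` and its `attempts` parameter are hypothetical names, not part of any Redis client. The block at the bottom only simulates a socket that fails on first use, and the sketch assumes your client re-establishes its connection on the next call.

```ruby
# Run the block, retrying a bounded number of times if it raises EAGAIN.
# `with_retry` is an illustrative helper, not a library API.
def with_retry(attempts: 3)
  tries = 0
  begin
    tries += 1
    yield
  rescue Errno::EAGAIN
    retry if tries < attempts # run the block again from the top
    raise                     # out of attempts: surface the original error
  end
end

# Simulated call: fails once, as if the idle socket had been closed,
# then succeeds on the second attempt.
calls = 0
result = with_retry do
  calls += 1
  raise Errno::EAGAIN if calls == 1
  "PONG"
end

puts "#{result} after #{calls} calls"
# prints "PONG after 2 calls"
```

Capping the number of attempts matters: if Redis really is unreachable, you want the error to surface rather than loop forever.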
Never underestimate how something as "trivial" as a load balancing strategy can affect other segments of the environment. Ruby was doing what it was supposed to, Redis was doing what it was supposed to, and the load balancer was doing what it was supposed to. Distributed system faults can be tricky to track down, but assuming anything can fail at any time for any reason is a great starting place.