I became interested in finding The Perfect Load Balancer when we had a series of incidents at work involving a service talking to a database that was behaving erratically. While our first focus was on making the database more stable, it was clear to me that the impact on the service could have been vastly reduced if we had been able to load-balance requests more effectively between the database's several read endpoints.

The more I looked into the state of the art, the more surprised I was to discover that this is far from being a solved problem. There are plenty of load balancers, but many use algorithms that only work for one or two failure modes, and in these incidents we had seen a variety of failure modes.

This post describes what I learned about the current state of load balancing for high availability, my understanding of the problematic dynamics of the most common tools, and where I think we should go from here.

(Disclaimer: This is based primarily on thought experiments and casual observations, and I have not had much luck in finding relevant academic literature. Critiques are very welcome!)

## TL;DR

Points I'd like you to take away from this:

- Server health can only be understood in the context of the cluster's health.
- Load balancers that use active healthchecks to kick out servers may unnecessarily lose traffic when healthchecks fail to be representative of real traffic health.
- Passive monitoring of actual traffic allows latency and failure rate metrics to participate in equitable load distribution.
- If small differences in server health produce large differences in load balancing, the system may oscillate wildly and unpredictably.
- Randomness can inhibit mobbing and other unwanted correlated behaviors.

## Foundations

A quick note on terminology: In this post, I'll refer to clients talking to servers, with no references to "connections," "nodes," etc. While a given piece of software can function as both a client and a server, even at the same time or in the same request flow, in the scenario I described the app servers are clients of the database servers, and I'll be focusing on this client-server relationship.

So in the general case, we have N clients talking to M servers.

I'm also going to ignore the specifics of the requests. For simplicity, I'll say that the client's request is not optional and that fallback is not possible; if the call fails, the client experiences a degradation of service.

The big question, then, is: When a client receives a request, how should it pick a server to call?

(Note that I'm looking at requests, not long-lived connections, which might carry steady streams, bursts of traffic, or requests at varying intervals. It also shouldn't particularly matter to the overall conclusions whether a connection is made per request or whether connections are re-used.)
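To make that question concrete, here is a minimal sketch of the decision point this post is about. The names (`ServerPicker`, `pick`, `note_result`) are hypothetical, not any particular library's API; the `note_result` hook is there because later sections lean on passively observed traffic rather than active healthchecks.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class ServerPicker(ABC):
    """The decision a client-side load balancer makes for every request."""

    @abstractmethod
    def pick(self, servers: Sequence[str]) -> str:
        """Given the currently known servers, choose one for this request."""

    def note_result(self, server: str, latency_s: float, ok: bool) -> None:
        """Optional hook: record what actually happened to a request,
        for strategies that learn from passively observed traffic."""
```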
### Sidebar: Client-side vs. dedicated

You might be wondering why I have every client talking to every server, commonly called "client-side load balancing" (although in this post's terminology, the load balancer is also a client). Why make the clients do this work? It's quite common to put all the servers behind a dedicated load balancer.

The catch is that if you only have one dedicated load balancer node, you now have a single point of failure. That's why it's traditional to stand up at least three such nodes. But notice that the clients now need to choose which load balancer to talk to... and each load balancer node still needs to choose which server to send each request to! This doesn't even relocate the problem; it just doubles it. ("Now you have two problems.")

I'm not saying that dedicated load balancers are bad. The problem of which load balancer to talk to is conventionally solved with DNS load balancing, which is usually fine, and there's a lot to be said for using a more centralized point for routing, logging, metrics, etc. But they don't really allow you to bypass the problem, since they can still fall prey to certain failure modes, and they're generally less flexible than client-side load balancing.

## Values

So what do we value in a load balancer? What are we optimizing for? In some order, depending on our needs:

- Reduce the impact of server or network failures on our overall service availability.
- Keep service latency low.
- Spread load evenly between servers:
  - Don't overly stress a server if the others have spare capacity.
  - Predictability: it's easier to see how much headroom the service has.
- Spread load unevenly if servers have varying capacities, which may vary in time or by server (equitable distribution, rather than equal distribution):
  - A sudden spike, or a large amount of traffic right after server startup, might not give a server time to warm up, while a gradual increase to the same traffic level might be just fine.
  - Non-service CPU loads, such as installing updates, might reduce the amount of CPU available on a single server.

## Naïve solutions

Before trying to solve everything, let's look at some simplistic solutions. How do you distribute requests evenly when all is well?

- **Round-robin**
  - The client cycles through the servers.
  - Guaranteed even distribution.
- **Random selection**
  - Statistically approaches an even distribution, without keeping track of state (a coordination/CPU tradeoff).
- **Static choice**
  - Each client just chooses one server for all requests.
  - DNS load balancing effectively does this: clients resolve the service's domain name to one or more addresses, and the client's network stack picks one and caches it. This is how incoming traffic is balanced for most dedicated load balancers; their clients don't need to know there are multiple servers.
  - Sort of like random; works OK when 1) DNS TTLs are respected and 2) there are significantly more clients than servers (with similar request rates).

(The first two strategies are sketched in code at the end of this section.)

And what happens if one of the servers goes down in such a configuration? If there are 3 servers, then 1 in 3 requests fail. A 67% success rate is pretty bad. (Not even a single "nine"!) The best possible success rate in this scenario, assuming a perfect load balancer and sufficient capacity on the two remaining servers, is 100%. How can we get there?
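Here is a sketch of round-robin and random selection in terms of the hypothetical `ServerPicker` interface from earlier; again, this is illustrative code, not any particular library.

```python
import itertools
import random
from typing import Sequence


class RoundRobinPicker(ServerPicker):  # ServerPicker: the sketch interface above
    """Cycle through the servers in order: guaranteed even distribution,
    at the cost of keeping a counter per client."""

    def __init__(self) -> None:
        self._counter = itertools.count()

    def pick(self, servers: Sequence[str]) -> str:
        return servers[next(self._counter) % len(servers)]


class RandomPicker(ServerPicker):
    """Pick uniformly at random: no state to keep, and the distribution
    statistically approaches even as request volume grows."""

    def pick(self, servers: Sequence[str]) -> str:
        return random.choice(servers)


# Hypothetical read endpoints, for illustration only.
servers = ["db-read-1:5432", "db-read-2:5432", "db-read-3:5432"]
picker = RoundRobinPicker()
print(picker.pick(servers))  # first call returns db-read-1; later calls cycle onward
```

Note that neither picker consults `note_result` at all, which is exactly why they keep sending a third of the traffic to a dead server.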
## Defining health

The usual solution is healthchecks. Healthchecks allow a load balancer to detect certain server or network failures and avoid sending requests to servers that fail the check.

In general, we wish to know how "healthy" each server is, whatever that means, because it may have predictive value in answering the core question: "Is this server likely to give a bad response if I send it this request?" There's a higher-level question, too: "Is this server likely to become unhealthy if I send it more traffic?" (Or to return to health, if I send it less.) Another way of saying this is that some cases of unhealthiness may be dependent on load, while others are load-independent; knowing the difference is essential to predicting how to route traffic when unhealthiness is observed.

So broadly speaking, "health" is really a way of modeling external state in service of prediction. But what counts as unhealthy? And how do we measure it?

### Choosing a vantage point

Before going into details, it's important to note that there are two very different viewpoints we can use:

- **The intrinsic health of the server:** Whether the server application is running, responding, able to talk to all of its own dependencies, and not under severe resource contention.
- **The client's observed health of the server:** The health of the server, but also the health of the server's host, the health of the intervening network, and even whether the client is configured with a valid address for the server.

From a practical point of view, the server's intrinsic health doesn't matter if the client can't even reach it. Therefore, we'll mostly be looking at server health as observed from the client.

There's some subtlety here, though: As the request rate to the server increases, the server application is likely to be the bottleneck, not the network or the host. If we start seeing increased latency or an increased failure rate from the server, that might mean the server is suffering under request load, implying that an additional request burden could make its health worse. Alternatively, the server might have plenty of capacity, and the client is only observing a transient, load-independent network issue, perhaps due to some non-optimal routing. If that's the case, then additional traffic load is unlikely to change the situation. Given that in the general case it can be difficult to distinguish between these two situations, we'll generally use the client's observations as the standard of health.

### What is the measure of health?

So, what can a client learn about a server's health from the calls it is making?

- **Latency:** How long does it take for responses to come back? This can be broken down further: connection establishment time, time to first byte of response, time to complete response; minimum, average, maximum, various percentiles. Note that this conflates network conditions and server load, which are (in the majority of cases) load-independent and load-dependent sources, respectively.
- **Failure rate:** What fraction of requests end in failure? (More on what "failure" means in a bit.)
- **Concurrency:** How many requests are currently in flight? This conflates effects from server and client behavior: there may be more in-flight requests to one server either because the server is backed up or because the client has decided to give it a larger proportion of requests for some reason.
- **Queue size:** If the client maintains a queue per server rather than a unified queue, a longer queue may be an indicator of either bad health or (again) unequal loading by the client.

With queue size and concurrent request count we see that not all measurements are of health per se; they can also be indicative of loading. These are not directly comparable, but clients presumably want to give more requests to healthier and less-loaded servers, so these metrics can be used alongside more intrinsic ones such as latency and failure rate.

These are all measurements made from the client's perspective. It's also