Here’s a typical scenario they mentioned: a bunch of servers, randomly connected to multiple leaf switches, is offering a service on the same IP address (that’s where anycast comes from).
Before going into the details, let’s ask a simple question: Does that work outside of PowerPoint? Absolutely. It’s a perfect design for a scale-out UDP service like DNS, and large DNS server farms are usually built that way.
TCP Anycast Challenges
The really interesting question: Does it work for TCP services? Now we’re coming to the really hard part – as the spine and leaf switches do ECMP or UCMP toward the anycast IP address, someone must keep track of session-to-server assignments, or all hell would break loose.
It’s easy to figure out that the design works in a steady-state situation. Data center switches do 5-tuple load balancing; every session is thus consistently forwarded to one of the servers. Problem solved… until you get a link or node failure.
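A minimal Python sketch of that steady-state behavior, assuming a switch that hashes the 5-tuple and takes the result modulo the number of next hops. The server names, the addresses, and the use of SHA-256 are illustrative assumptions; hardware typically uses something much cheaper, such as a CRC.

```python
import hashlib

# Hypothetical next hops: servers advertising the same anycast address
SERVERS = ["server-a", "server-b", "server-c", "server-d"]

def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    """Hash the 5-tuple and map the result onto one of the equal-cost next hops."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

# One TCP session (client -> anycast address), hashed for 1000 packets
session = ("198.51.100.7", "192.0.2.1", 6, 49152, 443)
choices = {pick_next_hop(*session, SERVERS) for _ in range(1000)}
print(choices)  # a single server, as long as SERVERS does not change
```

The hash is a pure function of the 5-tuple, so every packet of the session lands on the same server. The catch is the `len(next_hops)` term: the result depends on the current set of next hops.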
Dealing with Link or Node Loss
Most production-grade hardware ECMP implementations use hash buckets (more details). If the number of next hops changes due to a topology change, the hash buckets are reassigned, sending most of the traffic to a server that has no idea what to do with it. Modern ECMP implementations avoid that with consistent hashing, which tries to avoid recomputing the hash buckets after a topology change[1]:
- Hash buckets for valid next hops are not touched.
- Invalid hash buckets (those pointing to an invalid next hop) are reassigned to valid next hops.
Obviously we’ll get some misdirected traffic, but those sessions are hopelessly lost anyway – they were connected to a server that is no longer reachable.
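The two rules can be sketched in Python as follows. This is a toy model under stated assumptions, not any vendor's implementation: the table size, the round-robin initial fill, and the function names are all hypothetical.

```python
NUM_BUCKETS = 64  # hypothetical hash table size

def build_table(next_hops, num_buckets=NUM_BUCKETS):
    """Naive approach: (re)compute every bucket from the current next-hop list."""
    return [next_hops[i % len(next_hops)] for i in range(num_buckets)]

def remove_next_hop(table, failed):
    """Consistent approach: rewrite only the buckets that pointed to the failed next hop."""
    survivors = sorted(set(table) - {failed})
    new_table = list(table)
    orphaned = [i for i, hop in enumerate(table) if hop == failed]
    for n, bucket in enumerate(orphaned):
        new_table[bucket] = survivors[n % len(survivors)]
    return new_table

def moved_live_buckets(old, new, failed):
    """Count buckets that changed although their old next hop is still alive."""
    return sum(1 for o, n in zip(old, new) if o != failed and o != n)

before = build_table(["s1", "s2", "s3", "s4"])
naive = build_table(["s1", "s2", "s4"])        # full recomputation after s3 fails
resilient = remove_next_hop(before, "s3")      # only s3's buckets are touched

print(moved_live_buckets(before, naive, "s3"))      # greater than zero
print(moved_live_buckets(before, resilient, "s3"))  # zero
```

In the naive case, buckets belonging to healthy servers change owners, so established sessions end up on the wrong server. In the consistent case, the only sessions that move are the ones that were already lost with the failed server.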
Adding New Servers
The really fun part starts when you try to add a server. To do that, the last-hop switch has to take a few buckets from every valid next hop and assign them to the new server. That’s really hard to do without disrupting something[2]. Even waiting for a bucket to go idle (the flowlet load balancing approach) doesn’t help – an idle bucket does not mean there’s no active TCP session using it.
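A small Python sketch shows why adding a server cannot be free. It assumes a 64-bucket table shared evenly by four servers and a hypothetical `add_next_hop` helper that gives the newcomer a fair share by taking buckets from whichever server currently owns the most.

```python
from collections import Counter

def add_next_hop(table, new_hop):
    """Give new_hop a fair share of buckets, taken from the current owners."""
    new_table = list(table)
    share = len(table) // (len(set(table)) + 1)  # buckets the newcomer should own
    stolen = []                                  # (bucket index, previous owner)
    for _ in range(share):
        counts = Counter(hop for hop in new_table if hop != new_hop)
        donor = counts.most_common(1)[0][0]      # an owner with the most buckets
        bucket = new_table.index(donor)          # first bucket that donor still owns
        new_table[bucket] = new_hop
        stolen.append((bucket, donor))
    return new_table, stolen

table = [f"s{i % 4 + 1}" for i in range(64)]     # s1..s4, 16 buckets each
new_table, stolen = add_next_hop(table, "s5")

print(len(stolen))            # 12 buckets changed owner
print(Counter(new_table))     # s1..s4 own 13 buckets each, s5 owns 12
```

Unlike the failure case, every reassigned bucket previously pointed to a server that is still alive. Any long-lived TCP session hashing into one of those buckets is now delivered to the new server, which has never seen it.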
Oh, and finally there’s ICMP: ICMP error messages include the original TCP/UDP port numbers, but no hardware switch is able to dig that far into the packet, so the ICMP message is usually sent to some random server that has no idea what to do with it. Welcome to PMTUD hell.
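To illustrate how deep "that far" is, here is a Python sketch that builds a hand-crafted ICMP Fragmentation Needed message and extracts the session it refers to. The addresses, ports, and function names are made up for the example, checksums are left at zero, and only IPv4 is handled.

```python
import socket
import struct

def sample_frag_needed():
    """Build an ICMP type 3 / code 4 message quoting a server-to-client TCP packet."""
    inner_ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 1500, 0, 0x4000, 64, 6, 0,      # IPv4, DF set, protocol 6 (TCP)
        socket.inet_aton("192.0.2.1"),           # source: the anycast address
        socket.inet_aton("198.51.100.7"),        # destination: the client
    )
    inner_tcp = struct.pack("!HHI", 443, 49152, 0)         # ports + sequence number
    icmp_header = struct.pack("!BBHHH", 3, 4, 0, 0, 1400)  # next-hop MTU 1400
    return icmp_header + inner_ip + inner_tcp

def embedded_flow(icmp_message):
    """Return the client-to-server 5-tuple of the session the ICMP error refers to."""
    if icmp_message[0] != 3:
        raise ValueError("not an ICMP Destination Unreachable message")
    inner = icmp_message[8:]                     # skip the 8-byte ICMP header
    ihl = (inner[0] & 0x0F) * 4                  # length of the quoted IPv4 header
    proto = inner[9]
    src_ip = socket.inet_ntoa(inner[12:16])
    dst_ip = socket.inet_ntoa(inner[16:20])
    src_port, dst_port = struct.unpack("!HH", inner[ihl:ihl + 4])
    # The quoted packet travelled server -> client; reverse it to get the
    # direction the switches hashed on when forwarding the session.
    return (dst_ip, src_ip, proto, dst_port, src_port)

print(embedded_flow(sample_frag_needed()))
# ('198.51.100.7', '192.0.2.1', 6, 49152, 443)
```

The outer header of the ICMP message carries the reporting router's address as its source, protocol 1, and no ports at all, so a hash over the outer header has nothing in common with the hash of the original session. The fields that identify the session sit behind the outer IP header, the ICMP header, and the quoted IP header.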
Making Local TCP Anycast Work
Does that mean that it’s impossible to do local TCP anycast load balancing? Of course not – every hyperscaler uses that trick to implement scale-out network load balancing. Microsoft engineers wrote about their solution in 2013, Fastly documented their solution in 2016[3], Google has Maglev, and Facebook open-sourced Katran. We know AWS has Hyperplane, but all we got from re:Invent videos was that it’s awesome magic. They provided a few more details during the Networking @Scale 2018 conference, but it was still at Kármán line level.
You could do something similar at a much smaller scale with a cluster of firewalls or load balancers (assuming your vendor manages to count beyond two active nodes), but the performance of network services clusters is usually far from linear – the more boxes you add to the cluster, the less performance you gain with each additional box – due to cluster-wide state maintenance.
There are at least some open-source software solutions out there that you can use to build large-scale anycast TCP services. If you don’t feel comfortable using the hot-off-the-press gizmos like XDP, there’s Demonware’s BalanceD using Linux IPVS.
More to Explore
- Data Center Infrastructure for Networking Engineers webinar has a long load balancing section.
- I described Microsoft’s approach to scale-out load balancing and its implications in SDN Use Cases and in load balancing part of Microsoft Azure Networking webinar.
- The user-facing part of AWS load balancing is described in Amazon Web Services Networking webinar.
- Added links to Katran, Hyperplane, BalanceD, Cheetah, and Multipath TCP. Thanks a million to Hugo Slabbert, Scott O’Brien, Lincoln Dale, Minh Ha, and Olivier Bonaventure for sending me the relevant links.
And even harder if you want to solve it in hardware at terabit speeds ↩︎
Take your time and read the whole article. They went into intricate details I briefly touched upon in this blog post. ↩︎