Category: load balancing
One of my readers got an interesting idea: he’s trying to make the most of his WAN links by doing per-packet load balancing between a 30 Mbps and a 50 Mbps link. Not exactly surprisingly, the results are not what he expected.
Cloud builders are often using my ExpertExpress service to validate their designs. Tenant onboarding into a multi-tenant (private or public) cloud infrastructure is a common problem, and tenants frequently want to retain the existing network services appliances (firewalls and load balancers).
The Combine Physical and Virtual Appliances in a Private Cloud case study describes a typical solution that combines per-tenant virtual appliances with frontend physical appliances.
After discussing the load sharing intricacies we briefly dabbled with the concept of entropy labels.
Every time I write about unequal traffic distribution across a link aggregation group (LAG, aka Etherchannel or Port Channel) or ECMP fabric, someone asks a simple question “is there no way to reshuffle the traffic to make it more balanced?”
TL&DR summary: there are ways to do it, and some vendors already implemented them.
The algorithm that spreads the traffic across a group of outbound links (LAG or set of ECMP next hops) has to satisfy a few requirements:
- It has to work reasonably well in typical environments;
- It should not reorder packets of the same flow (here’s why);
- It has to be simple enough to be implementable in reasonably cheap ASICs;
The second and third requirement result in what the chipset manufacturers (and subsequently the hardware vendors) are offering today: hash-based distribution of packets. In case you need a step-by-step overview of this process, here’s how it works:
- Create an array of buckets and assign each outgoing link to one or more buckets. The bucket size is the number you see in marketing papers as “we support N-way ECMP” or “we have N-way LAG”.
- Take N fields from the outgoing packet header. The fields could be MAC addresses (source and/or destination), IP addresses (source and/or destination), IP port numbers, or even some other fixed-position fields in the packet header. Some vendors – for example Arista – allow you to configure which fields you want to use (assuming the platform chipset supports this functionality).
- Hash the fields from the packet header to get an integer between 0 and bucket size – 1. Example: for bucket sizes that are power of two take the low-order N bits of the hash.
- Enqueue the packet into the output queue of the interface that is associated with the bucket selected by the packet hash.
Have you noticed that the algorithm never checks the size of the output queue? If the hashing algorithm decides to send the packet through Interface#1, the switch will send the packet through Interface#1 even though that interface might be dropping packets like crazy due to continuous congestion, and all the other interfaces sit idle.
The reason the load-balancing algorithm never checks the load on the outbound interface is simple: the typical environment mentioned above is usually assumed to be a healthy mix of numerous independent mice flows. Throw a few elephants in the mix and the assumptions start breaking down.
The only vendor that was always able to cope with the elephants in the mix is Brocade due to the fact that their traditional typical environment (storage networks) consists mainly of elephants.
Can We Solve the Problem?
Here’s an intriguingly simple question: Why can’t we change the mix of outgoing interfaces in the N-way ECMP table to reflect the actual interface load? Wouldn’t that allow us to push the mice flows away from elephants crowding some of the interfaces?
In principle, the answer is “Sure, we could do that”, but we have to solve three challenges:
- Coarse-grained reshuffling could make matters worse. If your hardware supports 8-way ECMP and you have four uplinks, you might shift a large proportion of the traffic when you reassign the buckets to less-loaded interfaces, resulting in a nasty oscillation. Modern chipsets support at least 256-way ECMP, so that shouldn’t be a problem.
- The hardware you use has to support per-bucket counters. All hardware supports per-interface counters, but while they help you identify the congested interfaces, the won’t help you reshuffle the traffic – if the control-plane software cannot see how much traffic goes through each bucket, it makes no sense to randomly reshuffle the buckets hoping for the best.
- We shall not reorder the packets (at least within the data center), which means that we cannot reshuffle active buckets, but it’s relatively safe to change the outgoing interface of a currently inactive bucket. You could still reorder packets within a TCP session under an unlikely set of circumstances (figuring out what those circumstances are is left as an exercise for the reader), but we just might have to accept that slight risk of temporary performance degradation if we want to get better link utilization.
Would the reshuffle inactive buckets idea work in practice? Are there inactive buckets in a typical high-volume data center environment? Welcome to the weird world of flowlets.
What Are Flowlets?
It seems the idea of flowlets first appeared in the Harnessing TCP’s Burstiness with Flowlet Switching paper (see also corresponding PPT) – due to the bursty nature of TCP, you might be able to do pretty reliable bucket reshuffling with 256 or more buckets, as some buckets always tend to be empty.
Microsoft started using flowlets in Windows Server 2012 R2, and recently Cisco implemented flowlet-based dynamic load balancing in the ACI leaf-and-spine fabrics. Juniper is doing something similar (adaptive load balancing) on MX routers in Junos 14.1, and did Adaptive Flowlet Splicing within a Virtual Chassis Fabric (a nice rehash of the topic).
Need more information?
Olivier Hault sent me an interesting challenge:
I cannot find any simple network-layer solution that would allow me to use total available bandwidth between a Hypervisor with multiple uplinks and a Network Attached Storage (NAS) box.
TL&DR summary: you cannot find it because there’s none.
If the rest of the blog post feels like Latin, you SHOULD watch the Load Balancing and Scale-Out Application Architecture webinar.
The beginning of the story resembles traditional enterprise solutions:
I carried out an interesting quiz during one of my Interop workshop:
- How many use Linux-based servers? Almost everyone raised their hands;
- How many use Apache or Tomcat web servers? Yet again, almost everyone.
- How many run applications written in PHP, Python, Ruby…? Same crowd (probably even a bit more).
- How many use Nginx, Squid or HAProxy for load balancing? Very few.
Is there a rational explanation for this seemingly nonsensical result?
In a previous blog post I explained how load sharing across LDP-controlled MPLS core works. Now let’s focus on another detail: how are the packets assigned to individual paths across the core?
2014-08-14: Additional information was added to the blog post based on comments from Nischal Sheth, Frederic Cuiller and Tiziano Tofoni. Thank you!
Here’s a question that bothered me for years till I finally gave up and labbed it: does ECMP load sharing work in an MPLS core? More specifically, will an LSP split into multiple LSPs?
Recently I was discussing the benefits and drawbacks of virtual appliances, software-defined data centers, and self-service approach to application deployment with a group of extremely smart networking engineers.
After the usual set of objections, someone said “but if we won’t become more flexible, the developers will simply go to Amazon. In fact, they already use Amazon Web Services.”
One of my readers sent me this question:
I have a data center with huge L2 domains. I would like to move routing down to the top of the rack, however I’m stuck with a load-balancing question: how do load-balancers work if you have routed network and pool members that are multiple hops away? How is that possible to use with Direct Return?
There are multiple ways to make load balancers work across multiple subnets:
When OpenFlow was still fresh and exciting, someone made quite a name for himself by proposing a global load-balancing solution that would install per-session OpenFlow entries in every core switch around the world. Clearly a great idea, mimicking the best experiences we had with ATM SVCs.
Meanwhile some people started using OpenFlow in real-life networks for coarse-grained load balancing that improves the scalability of stateful network services. For more details, watch the video recorded during the Real Life OpenFlow-based SDN Use Cases webinar.
When Apple launched the new release of iOS last autumn, networking gurus realized the new iOS uses MP-TCP, a recent development that allows a single TCP socket (as presented to the higher layers of the application stack) to use multiple parallel TCP sessions. Does that mean we’re getting closer to fixing the TCP/IP stack?
TL&DR summary: Unfortunately not.
Every time I talk about small (per-application) virtual appliances, someone inevitably cries “And who will manage thousands of appliances?” Guess what – I’ve heard similar cries from the mainframe engineers when we started introducing Windows and Unix servers. In the meantime, some sysadmins manage more than 10.000 servers, and we’re still discussing the “benefits” of humongous monolithic firewalls.
A while ago I had a discussion with someone who wanted to be able to move whole application stacks between different private cloud solutions (VMware, Hyper-V, OpenStack, Cloud Stack) and a variety of public clouds.
Not surprisingly, there are plenty of startups working on the problem – if you’re interested in what they’re doing, I’d strongly recommend you add CloudCast.net to your list of favorite podcasts – but the only correct way to solve the problem is to design the applications in a cloud-friendly way.