Improving ECMP Load Balancing with Flowlets

Monday, January 19, 2015 07:42 +0100

Every time I write about unequal traffic distribution across a link aggregation group (LAG, aka Etherchannel or Port Channel) or ECMP fabric, someone asks a simple question: “Is there no way to reshuffle the traffic to make it more balanced?”

TL&DR summary: there are ways to do it, and some vendors have already implemented them.

The Problem

The algorithm that spreads the traffic across a group of outbound links (LAG or set of ECMP next hops) has to satisfy a few requirements:

It has to work reasonably well in typical environments;
It should not reorder packets of the same flow (here’s why);
It has to be simple enough to be implemented in reasonably cheap ASICs.

The second and third requirements result in what the chipset manufacturers (and subsequently the hardware vendors) are offering today: hash-based distribution of packets. In case you need a step-by-step overview of this process, here’s how it works:

Create an array of buckets and assign each outgoing link to one or more buckets. The number of buckets in the array is the number you see in marketing papers as “we support N-way ECMP” or “we have N-way LAG”.
Take a few fields from the header of the outgoing packet. The fields could be MAC addresses (source and/or destination), IP addresses (source and/or destination), TCP or UDP port numbers, or even some other fixed-position fields in the packet header. Some vendors – for example Arista – allow you to configure which fields you want to use (assuming the platform chipset supports this functionality).
Hash the selected fields to get an integer between 0 and the number of buckets minus one. Example: when the number of buckets is a power of two (2^k), take the low-order k bits of the hash.
Enqueue the packet into the output queue of the interface that is associated with the bucket selected by the packet hash.
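
The four steps above can be sketched in a few lines of Python. This is only an illustration, not what any particular ASIC does: the link names, the 256-bucket table, the round-robin bucket assignment and the use of CRC32 as the hash function are all assumptions made for the example.

```python
import zlib

# Assumptions for illustration only: four uplinks and a 256-entry bucket table.
NUM_BUCKETS = 256          # must be a power of two for the bit mask below
LINKS = ["eth1", "eth2", "eth3", "eth4"]

# Step 1: assign each outgoing link to one or more buckets (round-robin here).
bucket_to_link = [LINKS[i % len(LINKS)] for i in range(NUM_BUCKETS)]


def select_bucket(src_ip, dst_ip, proto, src_port, dst_port):
    """Steps 2 and 3: hash the selected header fields into a bucket index."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    hash_value = zlib.crc32(key)            # stand-in for the hardware hash
    return hash_value & (NUM_BUCKETS - 1)   # keep the low-order bits


def select_link(src_ip, dst_ip, proto, src_port, dst_port):
    """Step 4: the packet goes to whatever link owns the selected bucket."""
    return bucket_to_link[select_bucket(src_ip, dst_ip, proto, src_port, dst_port)]


if __name__ == "__main__":
    flow = ("10.0.0.1", "10.0.1.1", 6, 49152, 80)
    # Every packet of the same flow lands on the same link, whatever its load.
    print(select_link(*flow), select_link(*flow))
```

Note that nothing in `select_link` looks at queue depth or link utilization; the outgoing link depends only on the header fields and the bucket table.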

Have you noticed that the algorithm never checks the size of the output queue? If the hashing algorithm decides to send the packet through Interface#1, the switch will send the packet through Interface#1 even though that interface might be dropping packets like crazy due to continuous congestion while all the other interfaces sit idle.

The reason the load-balancing algorithm never checks the load on the outbound interface is simple: the typical environment mentioned above is usually assumed to be a healthy mix of numerous independent mice flows. Throw a few elephants in the mix and the assumptions start breaking down.

The only vendor that was always able to cope with the elephants in the mix is Brocade, because their traditional typical environment (storage networks) consists mainly of elephants.

Can We Solve the Problem?

Here’s an intriguingly simple question: Why can’t we change the mix of outgoing interfaces in the N-way ECMP table to reflect the actual interface load? Wouldn’t that allow us to push the mice flows away from elephants crowding some of the interfaces?

In principle, the answer is “Sure, we could do that”, but we have to solve three challenges:

Coarse-grained reshuffling could make matters worse. If your hardware supports 8-way ECMP and you have four uplinks, you might shift a large proportion of the traffic when you reassign the buckets to less-loaded interfaces, resulting in a nasty oscillation. Modern chipsets support at least 256-way ECMP, so that shouldn’t be a problem.
The hardware you use has to support per-bucket counters. All hardware supports per-interface counters, but while they help you identify the congested interfaces, they won’t help you reshuffle the traffic – if the control-plane software cannot see how much traffic goes through each bucket, it makes no sense to randomly reshuffle the buckets hoping for the best.
We shall not reorder the packets (at least within the data center), which means that we cannot reshuffle active buckets, but it’s relatively safe to change the outgoing interface of a currently inactive bucket. You could still reorder packets within a TCP session under an unlikely set of circumstances (figuring out what those circumstances are is left as an exercise for the reader), but we just might have to accept that slight risk of temporary performance degradation if we want to get better link utilization.
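
To make the “reshuffle only inactive buckets” idea concrete, here is a minimal control-plane sketch. Everything in it is hypothetical: the function name, the per-bucket `last_seen` timestamps (standing in for per-bucket counters), the per-link load figures, the idle timeout and the `max_moves` limit are assumptions chosen for illustration, not a description of any vendor's implementation.

```python
def reshuffle_idle_buckets(bucket_to_link, last_seen, link_load,
                           now, idle_timeout, max_moves=1):
    """Move idle buckets from the most loaded link to the least loaded one.

    bucket_to_link: list mapping bucket index -> link name (modified in place)
    last_seen:      list of timestamps of the last packet seen in each bucket
    link_load:      dict mapping link name -> measured load (any unit)
    max_moves:      cap on buckets moved per run, to limit oscillation
    """
    busiest = max(link_load, key=link_load.get)
    quietest = min(link_load, key=link_load.get)
    moved = []
    if busiest == quietest:
        return moved                      # nothing to rebalance
    for bucket, link in enumerate(bucket_to_link):
        if len(moved) >= max_moves:
            break
        is_idle = (now - last_seen[bucket]) >= idle_timeout
        # Active buckets are never touched, so packets in flight keep their path.
        if link == busiest and is_idle:
            bucket_to_link[bucket] = quietest
            moved.append(bucket)
    return moved


if __name__ == "__main__":
    buckets = ["A", "A", "B", "B"]
    last_seen_ms = [0, 9990, 0, 9990]     # buckets 0 and 2 have been idle
    load = {"A": 900, "B": 100}
    print(reshuffle_idle_buckets(buckets, last_seen_ms, load,
                                 now=10000, idle_timeout=500))
    print(buckets)
```

The granularity argument is simple arithmetic: assuming the hash spreads flows evenly, moving one bucket out of 8 shifts roughly 12.5% of the hash space, while moving one out of 256 shifts about 0.4%.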

Would the “reshuffle inactive buckets” idea work in practice? Are there inactive buckets in a typical high-volume data center environment? Welcome to the weird world of flowlets.

What Are Flowlets?

It seems the idea of flowlets first appeared in the Harnessing TCP’s Burstiness with Flowlet Switching paper (see also corresponding PPT) – due to the bursty nature of TCP, you might be able to do pretty reliable bucket reshuffling with 256 or more buckets, as some buckets always tend to be empty.
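
As a rough illustration of what a flowlet is, the sketch below splits the packet arrival times of a single flow into bursts separated by idle gaps. The function name and the gap threshold are assumptions made for this example; the usual argument is that the gap has to be larger than the delay difference between the candidate paths, so that a flowlet sent over a new path cannot overtake the previous one.

```python
def split_into_flowlets(timestamps_us, idle_gap_us):
    """Group packet timestamps (microseconds) of ONE flow into flowlets.

    A new flowlet starts whenever the gap since the previous packet is at
    least idle_gap_us. Both parameters are illustrative assumptions.
    """
    flowlets = []
    for ts in sorted(timestamps_us):
        if flowlets and ts - flowlets[-1][-1] < idle_gap_us:
            flowlets[-1].append(ts)       # still inside the current burst
        else:
            flowlets.append([ts])         # idle gap seen: start a new flowlet
    return flowlets


if __name__ == "__main__":
    arrivals = [0, 100, 200, 5000, 5100, 12000]
    for flowlet in split_into_flowlets(arrivals, idle_gap_us=1000):
        print(flowlet)
```

Each flowlet boundary is a moment when the flow's bucket is idle, which is exactly when its outgoing interface could be changed with little risk of reordering.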

Microsoft started using flowlets in Windows Server 2012 R2, and recently Cisco implemented flowlet-based dynamic load balancing in the ACI leaf-and-spine fabrics. Juniper is doing something similar (adaptive load balancing) on MX routers in Junos 14.1, and did Adaptive Flowlet Splicing within a Virtual Chassis Fabric (a nice rehash of the topic).

Need more information?

Data Center Fabrics webinar describes data center solutions from leading networking vendors;
ipSpace.net webinar subscription also gives you access to further data center topics, including leaf-and-spine fabric architectures and design scenarios.

7 comments:

Olivier Bonaventure 19 January 2015 11:12

Flowbender, described in a recent paper published at Conext, leverages ECN and the smart hash feature of Broadcom chipsets. It could be useful in datacenter, see http://conferences2.sigcomm.org/co-next/2014/CoNEXT_papers/p149.pdf

Ivan Pepelnjak 19 January 2015 19:48

Thank you. Highly interesting. It's amazing how much you can do with simple tweaks on the end nodes.

Anonymous 20 January 2015 03:07

Quite interesting. A combination of flowlets and Flowbender could practically remove the small amount of reordering that exists with Flowbender. After changing the TTL, make sure to wait for RTT/2 before transmitting.

Unknown 26 January 2015 17:37

What options are available for load balancing when distributing multicast (and thus UDP) traffic from a reduced set of sources to a reduced set of receivers? I mean multicast used as a transport technology, not as a way to distribute content to end users like IPTV.

Michael Kashin 29 January 2015 03:49

Does anyone know of any flowlet load balancing of overlay (e.g. VXLAN) traffic? I could only find an Internet-Draft proposing to use some additional bits in the VXLAN header to signal the different flowlets inside (https://tools.ietf.org/html/draft-chen-nvo3-load-banlancing-00).

jtarrio 18 March 2015 18:37

Ivan, do you know if AFS on Juniper VCF is already available? Their blog post is dated July 31 2014 and it only says that AFS *will* be a feature in VCF, as in sometime in the future. Do you know if it's currently shipping? Thanks!!

Ivan Pepelnjak 19 March 2015 08:00

Whenever in doubt, check the single-source-of-truth: release notes. Some vendors (Cisco and Juniper) actually publish them, others are not so easy to work with.

Anyway, from the latest Junos 14.1 release notes: "Adaptive load balancing support (Virtual Chassis Fabric) — Starting with Junos OS Release 14.1X53-D10, adaptive load balancing (ALB) is supported in Virtual Chassis Fabric (VCF)."
