
vSphere Does Not Need LAG Bandaids – the Network Does

Chris Wahl claimed in one of his recent blog posts that vSphere doesn't need LAG band-aids. He's absolutely right: vSphere's loop prevention logic removes the need for STP-blocked links, allowing you to use the full server uplink bandwidth without the complexity of link aggregation. Now let’s consider the networking perspective.

The conclusions Chris reached are perfectly valid in a classic data center or VDI environment where the majority of the VM-generated traffic leaves the data center; the situation is drastically different in data centers with predominantly east-west traffic (be it inter-server or IP-based storage traffic).

Let’s start with a simple scenario:

  • Data center is small enough to have only two switches (for a total of 4 Tbps or more of throughput – that should be enough for most use cases).
  • Two vSphere servers are connected to the two data center switches in a fully redundant setup (each server has one uplink to each ToR switch).
  • Load-Based Teaming (LBT) is used within vSphere instead of IP-based hashing (the vSphere load-balancing policy used with a Link Aggregation Group / LAG).
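The difference between the two policies in the last bullet can be sketched in a few lines. This is an illustrative simplification (hypothetical function names, toy hash), not VMware's actual implementation: IP hash pins every flow between an IP pair to one uplink regardless of load, while LBT moves a VM port off a congested uplink.

```python
import zlib

def ip_hash_uplink(src_ip, dst_ip, num_uplinks):
    """IP-hash policy (sketch): a flow between a given IP pair is
    pinned to one uplink by the hash, no matter how loaded it is."""
    return zlib.crc32(f"{src_ip}-{dst_ip}".encode()) % num_uplinks

def lbt_uplink(current_uplink, utilization, threshold=0.75):
    """Load-Based Teaming (sketch): keep the VM port on its current
    uplink until utilization crosses the threshold, then move the
    port to the least-loaded uplink."""
    if utilization[current_uplink] <= threshold:
        return current_uplink
    return min(range(len(utilization)), key=utilization.__getitem__)

# With IP hash, repeated flows between the same hosts share one uplink...
assert ip_hash_uplink("10.0.0.1", "10.0.0.2", 2) == \
       ip_hash_uplink("10.0.0.1", "10.0.0.2", 2)
# ...while LBT moves a VM port off an uplink running at 90% load.
print(lbt_uplink(0, [0.9, 0.2]))  # moves to uplink 1
```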

The two ToR switches are not aware of the exact VM placement, resulting in traffic flowing across the inter-switch link even when it could be exchanged locally (yes, I wrote about this issue almost exactly three years ago).


Physical connectivity


VM MAC reachability as seen by the switches


Traffic flow between VMs on adjacent hypervisor hosts

Can we fix it?

You can fix this problem by making most endpoints equidistant. You could introduce a second layer of switches (resulting in a full-blown leaf-and-spine fabric) or you could connect the servers to a layer of fabric extenders, which would also ensure the traffic between any two endpoints gets equal treatment.

Traffic between VMs appearing near to each other in a leaf-and-spine fabric gets better treatment than any other traffic (no leaf-to-spine oversubscription); in the FEX scenario all traffic gets identical treatment as FEX still doesn’t support local switching.

The leaf-and-spine (or FEX) solution obviously costs way more than a simple two-switch solution, so you just might consider using link aggregation and LACP with vSphere.

But LBT works so much better than IP hash mechanisms

Sure it does (and Chris provided some very good arguments for that claim in his blog post), but there’s nothing in the 802.1AX standard dictating the traffic distribution mechanism on a LAG. VMware could have decided to use LBT with a LAG, but they didn’t (because deep inside the virtual switch they tie the concept of a link aggregation group to the IP hash load balancing mechanism). Don’t blame standards and concepts for suboptimal implementations ;)

But aren’t static port channels unreliable?

Of course they are; I wouldn’t touch them with a 10-foot pole. You should always use LACP to form a LAG, but VMware supports LACP only in the distributed switch, which requires Enterprise Plus license. Yet again, don’t blame the standards or network design requirements, blame a vendor that wants to charge extra for baseline layer-2 functionality.

Is there another way out of this morass?

Buying extra switches (or fabric extenders) is too expensive. Buying an Enterprise Plus license for every vSphere host just to get LACP support is clearly out of the question. Is there something else we could do? Of course – you can make sure the inter-switch link gets enough bandwidth.

Typical high-end ToR switches (from almost any vendor) have 48 10GE ports and four 40GE ports. Using the 40GE ports for inter-switch links results in worst-case 3:1 oversubscription (48 10GE ports on the left-hand switch communicate exclusively with the 48 10GE ports on the right-hand switch over an equivalent of 16 10GE ports). Problem solved (until you have to add more servers).
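The arithmetic behind that worst-case ratio is easy to double-check (port counts as assumed in the paragraph above):

```python
# Per-switch edge capacity: 48 x 10GE ports facing the servers.
edge_gbps = 48 * 10   # 480 Gbps

# Inter-switch link: 4 x 40GE ports, the equivalent of 16 x 10GE.
isl_gbps = 4 * 40     # 160 Gbps

# Worst case: every edge port talks only to the other switch,
# so all 480 Gbps must cross the 160 Gbps inter-switch link.
oversubscription = edge_gbps / isl_gbps
print(f"{oversubscription:.0f}:1")  # 3:1
```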

Need more?

I’ll discuss typical challenges of building reasonably sized data centers in my Building a Small Private Cloud webinar.

15 comments:

  1. Hi Ivan,

    What's the max reasonable oversubscription you recommend? Would Chassis Trunking from Brocade be interesting as Aggregation L3 Switch Solution ?

    Thanks

    ReplyDelete
  2. There's no "reasonable" number - you should know your traffic (and use 3:1 when you're clueless).

    ReplyDelete
  3. Okay, so just set all hosts to prefer port 1, problem solved..

    ReplyDelete
    Replies
    1. ... and waste half the ports and half the switching bandwidth. Congratulations.

      Delete
  4. Thanks for taking the time to go deeper into this topic, Ivan. Your 5K/2K architecture seems most common in the field, although many 10 Gb installs opt to use a 5K as the ToR switch. Perhaps this gives a further nod to the 1000v's vPC pinning using the "free" edition?

    ReplyDelete
  5. LACP is included in the free edition of Nexus 1000v. There is no reason NOT to use LACP for your vhost uplinks, except if you "hate" Cisco, which isn't an argument.

    Personally I would be much happier if the Nexus 1000v implemented FabricPath, alleviating the need for any kind of LAG on the vhost uplinks. But that would be a logical and efficient way to connect vhosts to a network, so I'm not expecting Cisco to go that way!

    ReplyDelete
  6. The ToRs can get connected with a cheap 40 Gbps twinax cable. The extra switch hop adds less than a microsecond of latency. Who cares about optimizing that?

    ReplyDelete
    Replies
    1. That's perfectly fine - as long as the 40 Gbps E-W bandwidth is enough, and you have the extra ports. Once you get closer to the physical limits, optimization becomes more important.

      Delete
  7. We bumped into a problem that caused a 30-second outage, probably spanning tree.
    The vSwitch is dual-connected to N4Ks that have multiple uplinks to a vPC N5K pair.

    When we reload an N4K, the VM failover is perfect, but when the N4K comes back we lose half the VMs for 30 seconds.

    My understanding is that the vSwitch starts using the new uplink immediately as the physical port comes up, but spanning tree on the uplink ports is still learning the topology.

    ReplyDelete
    Replies
    1. Thanks for sharing this one. You need "portfast" or an equivalent on the switch to prevent the ports from going through the "listening" phase.

      Delete
    2. Pieter, check your vSwitch Failback setting. To avoid such a failure scenario you may want to set it to No.

      To understand the issue, you may want to read this blog post: http://frankdenneman.nl/2010/10/22/vswitch-failover-and-high-availability/

      Delete
  8. What about iSCSI traffic?
    Isn't MPIO preferred over LACP, since MPIO can use all the links simultaneously?

    ReplyDelete
    Replies
    1. MPIO is definitely better than LACP. Separate (logical) interfaces with Adapter FEX might be the answer.

      Delete
  9. Hi Ivan,

    I'm currently working on an environment where VMware ESXi servers are dual-homed (and statically LAGged) to a pair of FEX 2232. These extenders are currently single-homed to a pair of Nexus 5548 on vPC. I want to move away from the LAGs at the server level but by doing this I'll start running into the problems you describe above (potential over-subscription of the vPC peer-link, etc.). So, dual-homing the FEXes seems like the logical thing to do.

    I've been doing some research into the reasons why I would like to avoid dual-homing the FEXes and so far I've come up with three:

    a) Config of the FEX's interfaces needs to be defined on both Nexus switches... for instance port 100/1/25 exists on both parent Nexus switches, so the config needs to be consistent across both --> not a big deal in my book... plus using config-sync and switch profiles should help simplify or avoid this problem

    b) The total number of FEXes supported by the vPC pair is cut in half, since every parent Nexus switch will be connected to all the FEXes --> not a big deal either for the environment I'm working on

    c) On a single-homed deployment every FEX is dependent upon a single parent Nexus. If the parent Nexus "A" fails or crashes, only the "A" FEXes will be affected. On a dual-homed deployment all the FEXes are dependent upon both parent Nexus switches... under certain circumstances this could be defined as a SPOF. Of course an argument could be had regarding the vPC concept as being a SPOF itself... at some point you have to trust the vendor and its code, right?

    What's your take on these problems, particularly on the third one. Do you think there are any other limitations worth considering before migrating from a single-homed to a dual-homed config?

    Thanks!

    ReplyDelete
    Replies
    1. Dual-homed FEXes might be the best option, as they ensure the load from server uplinks is (somewhat) evenly distributed across both switches.

      Never went into the details of how dual-homed FEXes handle upstream switch failure, but AFAIK it's not a SPOF.

      Delete

You don't have to log in to post a comment, but please do provide your real name/URL. Anonymous comments might get deleted.