Multi-chassis link aggregation (MLAG) basics

If you ask any data center networking engineer about their worst pains, I’m positive Spanning Tree Protocol (STP) will be very high on the shortlist. In a well-designed, fully redundant hierarchical network where every device connects to at least two devices higher in the hierarchy, you lose half the bandwidth to the whims of STP loop prevention.

Of course you can try to dance around the problem:

... or you could decide to use a more humble approach and deploy multi-chassis link aggregation.

Link Aggregation Basics

Link aggregation is an ancient technology that allows you to bond multiple parallel links into a single virtual link (from the STP perspective). With the parallel links replaced by a single logical link, STP detects no loops and all the physical links can be fully utilized.
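
To make the "single logical link" idea concrete, here is a minimal sketch (in Python, purely illustrative; the member names and the hash are assumptions, not any vendor's algorithm) of how a LAG typically keeps every frame of a given flow on the same member link while spreading different flows across all members, so the bundle looks like one link to STP yet uses all the physical capacity:

```python
import hashlib

MEMBER_LINKS = ["eth1", "eth2", "eth3", "eth4"]  # hypothetical bundle members

def select_member(src_mac: str, dst_mac: str, links=MEMBER_LINKS) -> str:
    """Pick the egress member link by hashing the frame's MAC address pair."""
    digest = hashlib.md5(f"{src_mac}-{dst_mac}".encode()).digest()
    return links[digest[0] % len(links)]   # same flow always maps to the same link

if __name__ == "__main__":
    # Eight different flows from one server toward different destinations
    flows = [("00:11:22:33:44:01", f"00:aa:bb:cc:dd:{i:02x}") for i in range(1, 9)]
    for src, dst in flows:
        print(f"{src} -> {dst} uses {select_member(src, dst)}")
```

Real switches hash on various field combinations (MAC addresses, IP addresses, L4 ports), but the principle is the same: per-flow hashing avoids packet reordering while still utilizing every member link.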

For whatever reason, vendors like to use every term but link aggregation. You’ll hear about port channels, EtherChannel, link bonding or multi-link trunking.

Multi-Chassis Link Aggregation

Imagine you could make two physical boxes share a single control plane and coordinate their switching fabrics ... the links terminated on the two physical boxes would then terminate within the same control plane, and you could aggregate them. Welcome to the wonderful world of Multi-Chassis Link Aggregation (MLAG).
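
One way to see why the shared control plane matters: a downstream device running LACP will only bundle ports whose partner advertises the same system ID and operational key, so the two MLAG chassis have to present one shared identity on their member ports. A small Python sketch of that aggregation decision (illustrative assumptions only, not vendor code):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class LacpPartner:
    system_id: str   # system ID the far-end device advertises in its LACPDUs
    key: int         # operational key (aggregation group) it advertises

def build_bundles(ports):
    """Group local ports into LAGs keyed by the advertised (system ID, key) pair."""
    bundles = defaultdict(list)
    for port, partner in ports.items():
        bundles[(partner.system_id, partner.key)].append(port)
    return dict(bundles)

# Hypothetical downstream switch with two uplinks landing on an MLAG pair that
# presents one shared system ID -- both ports therefore fall into the same bundle.
uplinks = {
    "eth1": LacpPartner("02:00:00:00:00:aa", 10),   # physically on MLAG peer A
    "eth2": LacpPartner("02:00:00:00:00:aa", 10),   # physically on MLAG peer B
}
print(build_bundles(uplinks))
```

If the two chassis advertised different system IDs, the downstream switch would end up with two separate (or partially inactive) bundles, which is exactly what the MLAG control-plane coordination prevents.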

MLAG nicely solves the STP problem: no bandwidth is wasted and close-to-full redundancy is retained (make sure you always read the small print to understand what happens if the switch hosting the control plane fails).

Standardization? No, thanks

MLAG is obviously a highly desirable tool in your design/deployment toolbox ... but no vendor (including those that promote their standards-based open approach) has taken the pains to start a standardization effort. Proprietary technology lock-in is obviously still a lucrative approach.

The architectural approaches used by individual vendors differ widely: complete separation of the control plane from the switching matrix (Juniper's high-end solution), turning one of the control planes into a half-comatose state (Cisco VSS), cooperative control planes (Cisco vPC), or a stacking solution, preferably called distributed or intelligent switching (Cisco, HP and Juniper).

More information

You’ll get a high-level overview of virtualization technologies, LAN reference architectures, multi-chassis link aggregation, port extenders and large-scale bridging (including TRILL and FabricPath) in my Data Center 3.0 for Networking Engineers webinar (buy a recording or yearly subscription).

27 comments:

  1. I think Nortel was the first to provide this feature with what they called "Split Multi-Link Trunking" on the 8600.

    The concept is easy to sell to management and it works very well - until something goes wrong, that is - then all hell breaks loose. As you rightly said, "understand what happens if the switch hosting the control plane fails", or even a downstream switch for that matter. A number of years ago I rolled it out on a large campus and, a few catastrophic failures later, removed it all.

  2. Peter John Hill (01 October 2010, 10:03)

    You hit all the big points perfectly. Network Engineers like to push routing down to the TOR. Spanning tree is annoying. VSS is a hot mess. VPC isn't horrible, but it isn't a standard. There isn't an RFC or IEEE standard to read to make sure it is working correctly. The Juniper matrix is awesome, but expensive.

    Maybe I am a curmudgeon. I'd like to think that it's just healthy paranoia. :)

  3. Still, NOBODY does it as well and as fast as Alcatel-Lucent when you run it over MPLS.

  4. Please help me understand: how does MPLS fit into this picture?

  5. First off, you have to think of it as if you are the SP providing a redundant service (VPLS/PW) through MCLA (or MC-LAG in ALU language). ALU uses a proprietary protocol to sync the two nodes (acting as one node). This protocol, which runs over IP, should run over a protected (redundant and FRR-protected) network, as the nodes must always be in sync (this is true for every MCLA solution), so MPLS fits nicely into the picture here as well. There are many more details; try looking for MC-LAG.

  6. While I by no means think that I have a large environment, Nortel's Split Multilink Trunking has been working just fine on my main campus: 17 data closets, a handful of physical servers, and a dozen VMware hosts are all link-aggregated to two 8600 switches with two to eight gig links in a Split Multilink Trunk.

    I feel that someone needs to go to bat for Nortel / Avaya, since they were doing SMLT way before Cisco's VSS came out to play.

  7. And it would seem that 3Com (then H3C, now HP) has been doing this for a long time; it's called IRF, I believe. If we take the Cisco blindfold off, it would be interesting to see whether there is enough collective experience to compare the different MCLA solutions...

  8. Could you point me to a (hopefully deeply) technical paper explaining how it works? I would love to compare solutions from various vendors.

  9. As above, would you have a link to a technical document describing IRF? I got a whitepaper during Tech Field Day, but based on its technical level, it was probably targeted at Gartner & Co.

  10. Here's a start:
    http://www.trcnetworks.com/nortel/Data/Swiches/8600/pdfs/Split_multi_Link_Trunking.pdf

    ...and a highlight:
    "Spanning Tree Protocol is disabled on the SMLT ports"

    Not relying on STP for redundancy is one thing. Switching it off is a whole other thing. Deploy SMLT and hope that nobody ever loops two edge switches? No thanks.

    Nortel has an extra twist: R(outed)SMLT, which is kind of like vPC + HSRP, except there's no virtual router IP. Give your routers x.x.x.1 and x.x.x.2. Configure .1 as the gateway on end systems. If .1 fails, .2 assumes the dead router's address.

    If there's a power outage, and only .2 boots back up? You're done. (though there's a write-status-to-nvram fix for this)

    Come to think of it, it's a lot like vPC in that regard!
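
    A quick sketch of that RSMLT takeover behavior (Python, illustrative only; the router names and addresses are made up and this is not Avaya/Nortel code):

```python
# There is no virtual gateway address: if one peer dies, the surviving peer
# simply starts answering for the dead peer's interface IP, so hosts whose
# default gateway is .1 keep working.
owned = {"rtr-A": {"10.0.0.1"}, "rtr-B": {"10.0.0.2"}}   # hypothetical peers
alive = {"rtr-A", "rtr-B"}

def fail(router: str) -> None:
    """Model one peer failing; a surviving peer adopts its addresses."""
    alive.discard(router)
    orphaned = owned.pop(router)
    if alive:
        survivor = next(iter(alive))
        owned[survivor] |= orphaned          # .2 now also answers for .1
    # else: the power-outage corner case -- nobody takes over .1 until a peer
    # boots back up (hence the write-status-to-NVRAM fix mentioned above)

fail("rtr-A")
print(owned)     # {'rtr-B': {'10.0.0.2', '10.0.0.1'}}
```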

  11. The best that I can do right now is user guides, but they have done a good job for me in explaining how the process works.


    Basic deployment scenarios:
    http://www116.nortel.com/docs/bvdoc/ene_tech_pubs/SMLT_and_RSMLT_Deployment_Guide_V1.1.pdf

    Campus design guide (outlines link aggregation and loop detection deployment):
    http://www142.nortelnetworks.com/mdfs_app/enterprise/TCGs/pdf/NN48500-575_2.0_Large_Campus_TSG.pdf

    Configuration Guide for SMLT (includes some of the better technical information):
    http://www142.nortelnetworks.com/mdfs_app/enterprise/ers8600/5.1/pdf/NN46205-518_02.01_Configuration-Link-Aggregation.pdf

    Configuration Guide for RSMLT (both chassis share layer 3 information like OSPF/BGP state):
    http://www142.nortelnetworks.com/mdfs_app/enterprise/ers8600/5.1/pdf/NN46205-523_02.02_Configuration_IP_Routing.pdf

  12. A question about VSS-style connections: is there any limitation on the distance (latency) between the two chassis? Say you had a GigE ring between three sites and wanted not to route, but instead have one site use MCLA toward the other two sites (which would form a VSS pair); would that work?

  13. Don't even think about that. If the link between the VSS sites goes down, one of the boxes is dead. vPC would be somewhat usable for something like that (and OTV would be perfect), but definitely not VSS.

  14. Indeed. I have been hit by that very vPC design, umm, "choice". Power failure, only one Nexus 5K came back up, no vPC. I've been told that NX-OS v5, coming in 2011, will resolve this.

  15. Hm, I'm not used to getting around the HP site; it sounds like they have not rebranded H3C yet, so a lot of guides are still on H3C sites. Probably the best place to start (according to Google :-) is
    http://h3c.com/portal/Technical_Support___Documents/Technical_Documents/
    for all equipment, then move to each switch model if needed. It seems IRF is supported on many models (12k, 9500E, 7500E, 58xx's); I could not see whether IRF between different models is possible.

    It's fairly thin on
    http://h3c.com/portal/Products___Solutions/Technology/IRF/

    One of the configuration guides can be found at
    http://h3c.com/portal/download.do?id=1038276

    There seems to be no restriction on STP; in fact, it seems this supports even MPLS and many other features. I haven't had a chance to lay my hands on any of these products; the above is based only on reading the documents :-)

  16. +1 for SMLT. Having said that, it has taken a very long time to make it work properly between all the different products in the Avaya (née Nortel) lineup. For a while, there were many bugs affecting one MLT flavour or another. Likewise, VSS has many things to improve upon and fix.

    There was a prolonged catfight between marketing guys at Network World over Cisco's VSS versus Nortel's SMLT/RSMLT.

  17. You mention that STP loop prevention mechanisms can suck up half of the bandwidth. I almost spit out my coffee when I read this. Really? That seems like an extremely high number.

  18. Can I ask what the issue is with VMotion in top of rack? When you point ESX/i at a default gateway you can VMotion over different subnets just fine. Is there something I'm missing?

    Thanks!

  19. I said you lose half of your bandwidth, not that STP sucks it up 8-)

    STP loop prevention turns off half of the links (sets them to the "blocking" state) in the dual-tree design displayed in the first diagram (the blocked links are grayed out in the diagram). STP itself uses very little bandwidth.
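
    For example (hypothetical numbers), take an access switch with two 10 Gbps uplinks:

```python
uplinks = [10, 10]                 # Gbps per uplink (hypothetical)
stp_usable = uplinks[0]            # STP blocks one of the two uplinks
lag_usable = sum(uplinks)          # MLAG bundles both into a single active LAG
print(f"{stp_usable} Gbps usable with STP vs {lag_usable} Gbps with MLAG")
```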

  20. vMotion works fine across subnets, but a VM you move across an L3 boundary (from the perspective of the VM NIC) has to acquire a different IP address (or you have to use routing tricks). See also the "Routing implications" part of my vMotion post:

    http://blog.ioshints.info/2010/09/vmotion-elephant-in-data-center-room.html

  21. Hrmm, I see your point. I mistook the issue to be vMotion itself. Assuming no need for layer-2 adjacency, couldn't VRFs or some other technology that solves overlapping IP address space work? Perhaps limiting it to one cluster per rack, that kind of thing?

  22. Can't do a thing if you want to retain established sessions. Without that requirement, you don't need vMotion either.

  23. Nice article. STP loops can be a nightmare and worrisome. That's why I love EtherChannels, aka port channels. It's a great feature: take multiple physical links and make them logically one; no bandwidth is wasted and redundancy is achieved!

    Cisco VSS is the way of the future as it will do away with STP. ;)

  24. VSS, vPC (or any other MCLA technology) can only solve the STP problems in dual-tree designs. If you have a less nicely-structured network (or uplinks to more than two boxes), you need TRILL or an equivalent to get rid of STP.

    BTW, VSS is just stacking-on-steroids; I prefer vPC.

  25. Downstream loops can be prevented with two features:

    1) Simple Loop Prevention Protocol (SLPP) - the SMLT switches send probes down their SMLT links to the closet switch. If an SLPP hello packet returns on another interface, you know that you have a loop condition.

    2) Control-plane limiting for broadcast/multicast traffic can be configured on a per-port basis. I configure it on my SMLT links to shut down the interface and/or VLAN that is sending excess broadcast or multicast traffic.

    In both cases, this is configured on both SMLT switches with the trigger thresholds set to different levels (5 SLPP hello probes vs. 50), so that only one side of the SMLT should be disabled during a downstream loop.
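
    A rough sketch of that loop-detection logic (Python, illustrative only; this is not Nortel's implementation, and the names and thresholds are made up):

```python
from collections import Counter

class SlppSketch:
    """Each switch floods hello probes carrying its own ID; if its own probe
    comes back on any port, a loop exists, and that port is shut down once a
    per-switch threshold is exceeded."""

    def __init__(self, switch_id: str, shutdown_threshold: int):
        # Peers use different thresholds (e.g. 5 vs. 50) so that only one side
        # of the SMLT disables its port during a downstream loop.
        self.switch_id = switch_id
        self.shutdown_threshold = shutdown_threshold
        self.rx_count = Counter()           # looped-back probes seen per port
        self.disabled_ports = set()

    def build_probe(self) -> dict:
        return {"type": "slpp-hello", "origin": self.switch_id}

    def receive(self, port: str, probe: dict) -> None:
        if probe.get("origin") != self.switch_id:
            return                          # someone else's probe; not our loop
        self.rx_count[port] += 1            # our own probe returned: loop!
        if self.rx_count[port] >= self.shutdown_threshold:
            self.disabled_ports.add(port)
            print(f"{self.switch_id}: loop detected, disabling {port}")

# The lower-threshold peer triggers first; the other keeps forwarding.
primary, secondary = SlppSketch("smlt-A", 5), SlppSketch("smlt-B", 50)
for _ in range(5):
    primary.receive("port1/1", primary.build_probe())
```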

  26. Hey, with regards to IRF, basically you can cluster any switch with a 10Gb/s interface on it, with the following general guidelines:
    Chassis (12500, 9500, 7500): Currently up to 2 devices can be clustered. Rumor is that will be increased to 4 in the future.
    5820: Up to 9 devices can be clustered
    5800, 5500, 5120: Up to 8 devices can be clustered

    Certain mixed devices can be clustered using IRF, specifically 5820s and 5800s.

    IRF clustering is fully stateful and supports basically all the regular switch feature sets. With regard to STP, an environment that uses IRF on all core or aggregation devices can remove STP from the environment and use LACP to provide path redundancy instead.

    I believe that the MSR routers support IRF as well, but I haven't configured it myself.

  27. I believe that Arista's MLAG works in a very similar fashion to Cisco's vPC, right down to aligning L3 gateway selection to avoid hairpinning routed traffic. It's supposed to work with any LACP-capable host or switch downstream, but I don't know whether the control-plane communication between the MLAG peers is proprietary or not. In practice, though, I can't see a big advantage to a standards-based approach there, as you're unlikely to ever have an MLAG/vPC/etc. landing on dissimilar switches from the same vendor, let alone different vendors.



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.