Brocade VCS fabric has almost-perfect load balancing
Short summary for the differently attentive: the proprietary load balancing Brocade uses over ISL trunks in its VCS fabric is almost perfect (and way better for high-throughput sessions than what you get with other link aggregation methods).
During the Data Center Fabrics Packet Pushers Podcast we’ve been discussing load balancing across aggregated inter-switch links and Brocade’s claims that its “chip-based balancing” performs better than standard link aggregation group (LAG) load balancing. Ever skeptical, I said all LAG load balancing is chip-based (every vendor does high-speed switching in hardware). I also added that I would be mightily impressed if they’d actually solved intra-flow packet scheduling.
A few days ago Greg (@etherealmind) Ferro, the Packet Pushers host, received a nice e-mail from Michael Schipp containing two slides from a Brocade presentation and “Ivan owes a WOW” PS. I do owe a huge WOW ... but it takes a bit more than just a few slides to impress me (after all, Brook Reams published most of the information contained on those slides a while ago). However, Brook got in touch with me a few days after the podcast was published and provided enough in-depth information to make me a believer (thank you, Brook!).
The first thing Brocade did right (and it should have been standardized and implemented in all switches a long time ago) is automatic trunk discovery: whenever two VDX switches are connected with parallel physical links, those links are automatically trunked. To make use of the advanced load-balancing methods, they also have to be in the same port group (connected to the same chipset), which does reduce the resilience, but if that’s a concern, you can always have two (or more) parallel trunks; TRILL will provide per-MAC-address load balancing across the trunks.
Within each port group, Brocade’s hardware is able to perform per-packet round-robin frame scheduling with guaranteed in-order delivery. It does seem like magic and it’s not documented anywhere (another painful difference between Brocade and Cisco: Cisco would be more than happy to flaunt its technology wonders), but Brook told me the magic sauce is hidden somewhere within Brocade’s patents and was also kind enough to point me to the most relevant patent.
Based on what’s in that patent (after stripping away all the “we might also be patenting the spread of high-pressure water flows over coffee beans in an espresso machine” stuff), it seems that Brocade’s hardware measures link delay and inter-link skew and combines them to schedule frame transmissions in a way that guarantees the frames will always be received in order by the remote switch. They don’t do receiver-side reordering (which is hard), but transmit-side delaying. A very clever solution deserving a huge WOOOOW.
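To make the idea a bit more tangible, here’s a minimal Python sketch of transmit-side scheduling as I understand it from the patent. The link delays, frame sizes and data structures are made up for illustration; they have nothing to do with Brocade’s actual ASIC implementation.

```python
# Toy model of transmit-side scheduling across a trunk of unequal-delay links.
# All numbers are invented; the real mechanism lives in Brocade's ASICs.

LINK_DELAY_NS = [500, 520, 610, 590]   # measured one-way delay per member link
LINE_RATE_GBPS = 10                    # Gbps == bits per nanosecond

def serialization_ns(frame_bytes):
    return frame_bytes * 8 / LINE_RATE_GBPS

def schedule(frames):
    """Round-robin frames over the links, delaying transmission just enough
    so that every frame arrives after the previously scheduled one."""
    link_free = [0.0] * len(LINK_DELAY_NS)   # when each link can transmit again
    last_arrival = 0.0                       # arrival time of the previous frame
    plan = []
    for i, size in enumerate(frames):
        link = i % len(LINK_DELAY_NS)        # per-packet round robin
        tx = link_free[link]
        arrival = tx + serialization_ns(size) + LINK_DELAY_NS[link]
        if arrival <= last_arrival:          # would overtake an earlier frame
            tx += last_arrival - arrival + 1 # hold it back on the sender side
            arrival = last_arrival + 1
        link_free[link] = tx + serialization_ns(size)
        last_arrival = arrival
        plan.append((i, link, round(tx), round(arrival)))
    return plan

for frame, link, tx, arrival in schedule([1500] * 12):
    print(f"frame {frame:2d} -> link {link}  tx at {tx:5d} ns  arrives at {arrival:5d} ns")
```

The interesting part is the `if arrival <= last_arrival` branch: instead of letting the receiver reorder frames, the sender simply delays a frame on a faster link until it can no longer overtake anything already in flight.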
You might wonder how Brocade, a company with a historical focus on Fibre Channel, managed to solve one of the tough LAN networking problems almost a decade ago. As you probably know, the networking industry has been in just-good-enough-to-sell mode for decades. The link aggregation load balancing problem was always way below the pain threshold, as a high-speed LAG (port channel) trunk usually carries many flows; doing per-flow (or even per-IP-address) load balancing across a LAG is most often good enough. Storage networking is different: a server servicing hundreds or thousands of users (with at least as many LAN sessions) has only a few storage sessions. Perfect load balancing was thus always a critical component of a SAN network ... and it just happens to be a great solution in LAN environments using iSCSI or NFS storage connectivity.
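Here’s a toy illustration of why per-flow hashing is usually good enough for LAN traffic but falls apart with a handful of fat storage sessions. The flows and the generic CRC32 hash are hypothetical; real switches use their own vendor-specific hash polynomials.

```python
import random
import zlib
from collections import Counter

LINKS = 4

def member_link(src_ip, dst_ip, src_port, dst_port):
    # Generic 5-tuple-ish hash as a stand-in for a real switch hash polynomial.
    key = f"{src_ip}:{dst_ip}:{src_port}:{dst_port}".encode()
    return zlib.crc32(key) % LINKS

# Thousands of LAN sessions spread almost evenly across the LAG members ...
lan = Counter(
    member_link("10.0.0.1", f"10.1.{random.randrange(256)}.{random.randrange(256)}",
                random.randrange(1024, 65535), 80)
    for _ in range(10_000))
print("10k LAN flows per member link:", sorted(lan.values()))

# ... but with only three storage sessions the outcome is pure luck: count how
# often at least two of them hash onto the same member link.
collisions, trials = 0, 10_000
for _ in range(trials):
    links = {member_link("10.0.0.1", "10.2.0.10", random.randrange(1024, 65535), 3260)
             for _ in range(3)}
    collisions += len(links) < 3
print(f"3 iSCSI sessions share a member link in {collisions / trials:.0%} of trials")
```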
More information
To learn more about storage networking protocols, including Fibre Channel, iSCSI and NFS, and emerging Data Center technologies including TRILL, Multi-chassis Link Aggregation, FabricPath, FCoE and others, watch my Data Center 3.0 for Networking Engineers webinar (buy a recording or yearly subscription).
The other vendors have all been focused on constantly cutting latencies. Good to see Brocade recognising that in some cases artificially increasing latency to match effective circuit lengths can improve overall performance.
It would be interesting to know whether there is a maximum link latency difference that this can cope with. I have seen a production network where a link with 4ms latency was paired with one with 15ms. I am guessing this platform might have trouble balancing in this situation...
I would only use it on short (intra-DC) links and it probably works only over physical links with microsecond-level skew.
BTW, are you telling me the 4ms/15ms links were both P2P physical links (or lambdas) using different fiber runs?
This specific ("special") example was several years ago and I am working from memory, but I think those latency figures are in the right ballpark. The circuits were fully transparent Ethernet, but not raw L1; I believe they were both EoSDH. When it was ordered, we thought it was primary/backup and the distance difference wouldn't be an issue; it wasn't until after it went in that we realised the customer was using LACP across both links.
I have some information, in good confidence, that what you describe is what Brocade does on their FC switches. On the VDX, I was told that they do the load balancing a bit differently to achieve perfect load balancing. Perhaps they have learnt a few tricks from their FC SAN experience to improve things for the LAN folks. You may want to circle back with Brocade on this.
This is why simple inverse-multiplexing solutions have always been efficient enough to begin with. As soon as the number of endpoints is above some threshold, there is no significant advantage to be gained from clever per-packet load sharing. Per-packet solutions increase complexity and add marketing buzz, but seem to have little real use in decent-scale networks.
Of course you may get, say, 25%/25%/25%/25% on a 4-link port channel as opposed to 20%/30%/15%/35%, but does it really matter if you are only using a fraction of the bundle capacity? One may say that a 35%-utilized link adds more latency, but so does the Brocade solution, and the delay is not predictable either. The solution Brocade uses might be seen as "inverse reassembly", where the sending side needs to buffer packets to equalize arrival timing. Instead of receive-side reassembly buffers we now have "shaping" buffers that ensure in-order packet delivery. The complexity did not vanish; it just got pushed around.
On a per-packet basis you can get the full 4 Gbps - of course, as long as both sides can push and receive that amount - as is the case with the VDX (using 10 Gbps links) up to a maximum of 8 ports per ISL (an 80 Gbps TRILL path).
Now, this is used for switch-to-switch traffic; a server connecting to two or more VDX switches still uses LACP, so we're back to flow-based load balancing from the server.
Of course, Brocade could put the ISL feature in their CNAs; however, that would mean your server could only connect to one switch. Not a good idea for HA.
You could also ask about putting two dual-port CNAs in a server; then you would have two TRILL paths of 20 Gbps to two switches in the fabric. However, four ports per server is like going back to 2 FC ports and 2 NICs.
Just my thoughts,
Michael.
As soon as you have enough flows, the packet distribution method doesn't matter as long as it's random enough.
MRP (versions 1 and 2) is indeed from Foundry.
The ASICs on the VDX are 6th-generation ASICs from the FC side, so maybe the new 16 Gbps FC gear will get the update for ISLs too (I would guess so).
Now, please remember this is for ISLs (Inter-Switch Links) in a single data center (at least at this point); therefore I would suggest this is for P2P physical-layer links only.
Also, the current maximum number of VDX switches that can form a supported fabric is 12. However, if you find a need for a larger fabric, the solution needs to be validated by Brocade (read: there is not a hard limit, but a supported limit - 12 has been tested and approved). This is up from the first release, which supported 10 units.
Hope this adds value.
Michael.
Brocade didn't; Foundry Networks did. Foundry was only recently acquired by Brocade. Foundry has been quietly providing enterprise-quality Ethernet networking equipment for years. We have used their equipment since 2002 and can attest to their technical achievements.
FD
Maybe it's time you guys get your stories straight.
Regarding the "fat single flow" example: normally, endpoints connect at a physical line rate that is the same as or below that of the "uplink" port. Therefore, a typical *single* flow cannot completely overwhelm an ISL link. Packet-level balancing would therefore be most efficient if implemented on every interconnection (host-switch, switch-switch, etc.) to effectively increase a single endpoint's transmission rate.
Link aggregation (inverse multiplexing) has always been used in over-subscription scenarios where N downstream ports send traffic to M upstream ports and N>M (compare this to circuit-switched networks, where over-subscription is not possible). This is how imuxing works in packet networks anyway. It's just different levels of granularity (packets, flows, etc.) that you can use in packet networks, with deeper granularity required to optimize for sparse source/receiver topologies.
One interesting inherent problem with packet networks is that they are always designed contrary to one of their original ideas, which was "maximizing link utilization". PSNs are bound to be "flow oriented" due to upper-layer requirements and have to be over-provisioned to support QoS needs. One might think the upper layers should have been designed to perform packet reordering in the network endpoints, but that never happened because most ULPs have been "adapted to" rather than "designed for" PSNs.
No other vendor in the entire industry (in either Ethernet or Fibre Channel networks) has or has ever had this type of technology.
This is mainly meant for intra-datacenter ISLs between adjacent switches. Obviously, spraying frames across multiple links means you need to be very careful about in-order delivery, so there are some "limitations". The ASIC controls the timing of the frames within port groups, so all ports belonging to the same frame-based trunk have to reside in the same port group. Initially we had 4-port groups and could trunk 4 x 2 Gbps links into a single 8 Gbps trunk. Today we support 8 x 8 Gbps links in a single 64 Gbps trunk with frame-level load balancing in Fibre Channel, and, as you know, 8 x 10 Gbps in our VDX 6720 switches. BTW, we've had frame-based trunking in Ethernet since we launched our Brocade 8000 top-of-rack FCoE switch, but there it's limited to 40 Gbps (4 x 10 Gbps).
Another "limitation" is that the difference in cable lengths can't be too big, and that is the main reason this is *mostly* for intra-datacenter connections. But we do support frame-based trunking over long distance (at least in FC) up to hundreds of kilometers, as long as you can keep the cable length difference between the links to a minimum (if they all go over the same lambda and physical path, for example). We've also had this for years and you can see it clearly documented in all of our product manuals.
The benefits are very clear. If you trunk 8 x 10 Gbps links, you are *guaranteed* to be able to use those 80 Gbps of bandwidth, and you won't run into scenarios where one link is congested while you have spare bandwidth on another one, as can happen with LAG (see http://packetattack.org/2010/11/27/the-scaling-limitations-of-etherchannel-or-why-11-does-not-equal-2/) and even in FC with other approaches (like exchange-based load balancing).
With frame-based trunking, you are guaranteed to have enough bandwidth for those flows as long as the aggregate bandwidth of the flows is lower than the aggregate bandwidth of the ISLs, and in this case 3 x 6 Gbps = 18 Gbps < 20 Gbps, so you wouldn't congest any of your flows.
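A quick back-of-the-envelope check of that claim, assuming the 2 x 10 Gbps bundle the arithmetic implies and three hypothetical 6 Gbps flows:

```python
from itertools import product

FLOWS_GBPS = [6, 6, 6]          # three flows, 18 Gbps aggregate
LINKS_GBPS = [10, 10]           # 2 x 10 Gbps bundle implied by 18 < 20

# Frame-based trunking: traffic is sprayed, so only the aggregate matters.
sprayed_ok = sum(FLOWS_GBPS) <= sum(LINKS_GBPS)
print("frame spraying congestion-free:", sprayed_ok)         # True (18 <= 20)

# Hash-based LAG: each flow is pinned to one link; check every possible hash outcome.
congested = 0
for placement in product(range(len(LINKS_GBPS)), repeat=len(FLOWS_GBPS)):
    load = [0] * len(LINKS_GBPS)
    for flow, link in zip(FLOWS_GBPS, placement):
        load[link] += flow
    if any(l > cap for l, cap in zip(load, LINKS_GBPS)):
        congested += 1
print(f"hash placements that congest a link: {congested} of {2**len(FLOWS_GBPS)}")
```

With three 6 Gbps flows and only two member links, every possible hash placement puts at least two flows (12+ Gbps) on one 10 Gbps link, so the LAG is congested no matter how lucky the hash is, while frame spraying stays within the 20 Gbps aggregate.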
So the question is whether the speed increase for that particular number of flows is actually worth the extra complexity.
I think the Ethernet world has been OK with wasted bandwidth for far too long, considering how we've been living with STP for this long...
Now that the hardware vendors have focused their persuasive powers on server admins who don't understand that long-distance bridging is bad, we have to deal with the fact that STP has been broken for the last few decades.
Thanks to everyone for providing interesting comments, observations and follow-up questions to this post. I decided to put together more content on the subject of Brocade ISL Trunks and just added it to the Brocade community site on VCS Technology. You will find it here:
http://community.brocade.com/community/brocadeblogs/vcs/blog/2011/04/06/brocade-isl-trunking-has-almost-perfect-load-balancing
I think it provides more color on how we extended the original "Brocade Trunking" for Fibre Channel (sometimes referred to as "frame trunking" for obvious reasons) to create "Brocade ISL Trunking", which is included in a VCS Ethernet Fabric. I also provided some additional information at the end of my blog in response to some of the questions, comments and speculations several of you posted here.
Ivan, as always, you provide sound informative content for the community.
I'm curious how this plays out today. We're an R&E network with a need for better load-balancing algorithms, and Cisco themselves are telling us the higher-throughput links simply use hash polynomials. Our testing of those polynomials has shown upwards of 15% loss at line rate across 4 x 10 Gbps links, and higher at 3 x 100 Gbps.
The Brocade hardware is long gone. Most vendors use hash-based load balancing these days, and most of them have a nerd knob to turn on dynamic reshuffling. Obviously that works only on directly-connected egress links. Beyond that, Cisco ACI might be doing something, but most everyone else cannot as they don't have visibility into congestion beyond the egress interface.
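Vendors implement that reshuffling knob differently; the sketch below just illustrates the generic flowlet-style idea (move a flow to the least-loaded local member link only after it has been idle long enough that reordering cannot occur). The gap value and data structures are invented for the example.

```python
MEMBER_LINKS = 4
FLOWLET_GAP_S = 0.0005      # idle gap after which a flow may safely be moved

link_load = [0] * MEMBER_LINKS   # recently queued bytes per member link
flow_state = {}                  # flow key -> (assigned link, last packet timestamp)

def pick_link(flow, size, now):
    """Keep a flow pinned while its packets arrive back to back; once it pauses
    longer than the flowlet gap, reassign it to the least-loaded local link.
    (A real implementation would also age out link_load over time.)"""
    link, last_seen = flow_state.get(flow, (None, 0.0))
    if link is None or now - last_seen > FLOWLET_GAP_S:
        link = min(range(MEMBER_LINKS), key=lambda l: link_load[l])
    flow_state[flow] = (link, now)
    link_load[link] += size
    return link

# A long storage flow keeps its link mid-burst, but after a pause it can be
# reassigned to whichever member link is currently least loaded.
print(pick_link(("10.0.0.1", "10.0.2.5", 3260), 1500, now=0.0000))
print(pick_link(("10.0.0.1", "10.0.2.5", 3260), 1500, now=0.0001))  # same link
print(pick_link(("10.0.0.1", "10.0.2.5", 3260), 1500, now=0.0100))  # moves to least-loaded link
```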
The right way to solve this challenge is to implement uncongested path finding at the source host. Something as simple as FlowBender or MP-TCP could do the trick.
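For illustration, here's a very rough sketch of the FlowBender idea: the sender watches ECN marks and, when a flow looks congested, perturbs a value that feeds the network's ECMP hash so the flow gets rerouted. The field choice, thresholds and hash function below are mine, not the paper's.

```python
import random

NUM_PATHS = 4

def ecmp_path(five_tuple, hash_salt):
    # Stand-in for the switch ECMP hash; real hardware hashes header fields,
    # so changing any hash-input field moves the flow to a (probably) new path.
    return hash((five_tuple, hash_salt)) % NUM_PATHS

class FlowBenderishSender:
    """Toy congestion-driven rehash: if too many packets in a window are
    ECN-marked, perturb a hash-input value so the network picks a new path
    (the new hash may occasionally map onto the same path)."""
    def __init__(self, five_tuple, ecn_threshold=0.3, window=100):
        self.five_tuple = five_tuple
        self.salt = 0                 # stands in for e.g. a TTL or flow-label tweak
        self.ecn_threshold = ecn_threshold
        self.window = window
        self.marked = 0
        self.sent = 0

    def on_ack(self, ecn_marked):
        self.sent += 1
        self.marked += int(ecn_marked)
        if self.sent >= self.window:
            if self.marked / self.sent > self.ecn_threshold:
                self.salt += 1        # rehash: flow lands on a different path
            self.marked = self.sent = 0

    def current_path(self):
        return ecmp_path(self.five_tuple, self.salt)

sender = FlowBenderishSender(("10.0.0.1", "10.0.1.1", 40000, 3260))
print("initial path:", sender.current_path())
for _ in range(100):                  # simulate a congested window of ACKs
    sender.on_ack(ecn_marked=random.random() < 0.5)
print("path after congestion:", sender.current_path())
```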