Do we need distributed switching on Nexus 2000?

Yandy sent me an interesting question:

Is it just me or do you also see the Nexus 2000 series not having any type of distributed forwarding as a major design flaw? Cisco keeps throwing in the “it's a line-card” line, but any dumb modular switch nowadays has distributed forwarding in all its line cards.

I’m at least as annoyed as Yandy is by the lack of distributed switching in the Nexus port (oops, fabric) extender product range, but let’s focus on a different question: does it matter?

Some background information first. A Nexus 2000 fabric extender (FEX) looks like a linecard in a Nexus 5000 switch, with a significant caveat: all traffic between devices connected to a FEX must pass through the controlling switch. This design allows the FEX to remain a dumb device, and while distributed switching (the ability of a FEX to forward traffic between locally attached devices independently of the controlling switch) was promised a while ago, the evolution of the FEX architecture seems to be steering away from it (the 802.1Qbh standard has no provision for distributed switching).
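
The "it's a linecard" positioning is visible in the configuration model: a FEX has no configuration of its own and is managed entirely from the parent switch. A minimal NX-OS sketch (interface, VLAN and FEX numbers are made up for illustration):

```
! On the parent Nexus 5000: enable the fabric extender feature
feature fex

! The uplink toward the FEX is a fabric port, not a regular switchport
interface Ethernet1/1
  switchport mode fex-fabric
  fex associate 100

! Server-facing FEX ports then appear as ports of the parent switch,
! numbered after the FEX ID (100 in this example)
interface Ethernet100/1/1
  switchport access vlan 10
```

Note that even a frame between Ethernet100/1/1 and Ethernet100/1/2 travels up the fabric link and back, which is exactly the behavior discussed above.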

Distributed low-latency any-to-any switching is highly desirable in most environments running distributed computations (for example, HPC or large-scale map-reduce workloads) where there’s heavy east-west traffic between hosts in the same VLAN. In such an environment, the FEX architecture will definitely increase latency and maybe even cause congestion due to the large amount of traffic going through the controlling Nexus 5000. However, most workloads I see don’t fall into that category.

Web servers exchange traffic with end-users (browser clients) or back-end application or database servers, which are usually in a different VLAN (so the traffic has to cross a layer-3 device or even a firewall). Application servers usually have the same traffic characteristics.

Virtual Desktop Infrastructure (VDI) traffic is similar. Keyboard, video and mouse (KVM) traffic is sent to the end-user, application traffic is exchanged with the Internet or web/application servers (again, in a different VLAN), and the storage traffic is usually handled separately anyway. Yet again, no east-west traffic.

Mid-range database servers (like Microsoft SQL Server) generate little intra-VLAN traffic unless you deploy a redundant setup, but even then the amount of intra-VLAN traffic is limited by the number of SQL transactions the in-memory caches and the spinning rust can support. A high-end distributed database architecture (for example, MySQL Cluster) could be a different story.

A multi-tenant environment is probably a more interesting use case, even more so if someone decides inter-VLAN L3 forwarding in virtual appliances is a good idea (hint: it is NOT). There you could get a lot of east-west traffic, because the inter-VLAN traffic is handled by a virtual machine that could reside anywhere in your network.

Summary: while distributed FEX switching could be very useful in some environments, you won’t see much difference in many others. As always in real life, you have to make tradeoffs: you can either have a Swiss army knife (which can always solve the problem ... more so if your last name is MacGyver) or a sturdy Phillips screwdriver (which works really well, but might not be a good option if you need to open a bottle of wine).

In our scenario, if you have tens (or hundreds) of servers connected to a distribution-layer switch through access-layer devices, you have to decide whether you prefer the lack of distributed switching combined with the ability to manage only two devices (a redundant pair of Nexus 5000s, which configure and manage all the FEX devices connected to them), or managing tens of access-layer devices (redundant ToR/EoR or blade chassis switches). You could also decide to wait for the QFabric magic, but remember that (at least in its first incarnation) it won’t help you solve the blade chassis switch management problem.
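
The two-device management model can be sketched as follows: a FEX can be dual-homed to both members of the redundant Nexus 5000 pair over a vPC, and its configuration still lives only on the two parents. A hedged sketch, assuming vPC is already configured between the parents (port-channel and FEX numbers are made up):

```
! On each of the two Nexus 5000 parents:
interface Ethernet1/1
  switchport mode fex-fabric
  channel-group 100

interface port-channel100
  switchport mode fex-fabric
  fex associate 100
  vpc 100
```

With this setup you still configure and monitor exactly two devices, no matter how many FEX units hang off them.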

More information

If you want to learn more about modern data center architectures, buy a recording of my Data Center 3.0 for Networking Engineers webinar. For more Data Center-related webinars, check the Data Center webinar roadmap. All those webinars are available as part of the yearly subscription package.


6 comments:

  1. Dmitri Kalintsev (11 July 2011, 07:21)

    Ivan,

    To further expand your point: with automatic workload management (for example, VMware DRS), you can't reliably say if your VMs are going to sit on the same host or on different hosts. In this case forcing east-west traffic further up actually improves latency consistency (yes, it is a bit crappier, but *consistently* so) ;)

    -- D

  2. I have a question here: when connecting a single N2K extender to its parent N7K, do you suggest connecting it to multiple M132XP cards or to a single M132XP card?
    8-)

  3. Pavel Skovajsa (13 July 2011, 14:38)

    <quote>
    Web servers exchange traffic with end-users (browser clients) or back-end application or database servers, which are usually in a different VLAN (so the traffic has to cross a layer-3 device or even a firewall). Application servers usually have the same traffic characteristics.
    </quote>

    I actually believe Yandy had a "DFC-type" distributed functionality in mind, so that even in the different-VLAN scenario (mentioned above) the traffic is still switched locally inside the "DFC". The only thing that matters is that the source and destination ports are on the "same line card" of the Nexus 5000, regardless of VLANs/routing.

    This of course does not make you wrong; I'm just pointing out that arguing with different VLANs could be misleading.

  4. Ivan Pepelnjak (14 July 2011, 08:27)

    My personal opinion on distributed L3 switching: Cisco learned the lesson with MLS in Catalyst switches :-P

  5. Best practice here would be to spread your FEX links across multiple M132 cards.

  6. Ivan, you make some interesting points. However, the choice is no longer between a full-blown dot1Q bridge and a dumbed-down FEX. Dell now offers all the simplicity of a "dumbed-down" forwarding device that also performs MAC learning and can support east-west traffic flows. The device I am speaking of is the Dell PowerEdge IO Aggregator blade module for the m1000e blade server chassis.


You don't have to log in to post a comment, but please do provide your real name/URL. Anonymous comments might get deleted.

Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.