QFabric Part 1 – Hardware Architecture

Juniper has finally released the technical documentation for the QFabric virtual switch and its components (QF/Node, QF/Interconnect and QF/Director). As expected, my speculations weren’t too far off – if anything, Juniper didn’t go far enough along those lines, but we’ll get there later.

The generic hardware architecture of the QFabric switching complex has been well known for quite a while (listening to the Juniper QFabric Packet Pushers Podcast is highly recommended) – here’s a brief summary:

Redundant connections between QFabric elements and control plane stackable switches are not shown in the network diagram. Each QF/Node has two connections to the virtual chassis switches, QF/Interconnect has four (two per control board), QF/Director has six (three per network module).

QF/Directors are x86-based devices that act like the brains for the QFabric, providing fabric services (management, configuration, control, device discovery, DNS, DHCP, NFS) and routing engines for more complex node clusters (network node groups).

Each QFabric should have at least two QF/Directors with disks; you can add diskless QF/Directors (no SKU yet) if you need more processing power (not likely in the current software release).

QF/Interconnects (QFX3008) are very-high-speed, totally proprietary switches that forward frames exchanged between QF/Nodes. Each QFX3008 provides up to 10Tbps of non-blocking bandwidth (where non-blocking is defined as “any input port can send a packet to any non-busy output port”) and uses a three-stage Clos network to get the non-blocking behavior. With up to four QF/Interconnects per QFabric, the total QFabric switching bandwidth is 40Tbps. Impressive. Try to calculate how many 64 kbps voice calls can fit into that ;)
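For the impatient, the voice-call arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope estimate only: it assumes decimal units (1 Tbps = 10^12 bps), counts nothing but the raw 64 kbps channel rate, and ignores all packetization overhead. The constant and function names are mine.

```python
# Back-of-the-envelope: how many raw 64 kbps voice channels fit into the fabric?
# Assumptions (not from Juniper documentation): decimal units, no
# Ethernet/IP/RTP overhead, every bit of switching bandwidth is usable.

INTERCONNECT_BPS = 10 * 10**12   # up to 10 Tbps per QFX3008
MAX_INTERCONNECTS = 4            # up to four QF/Interconnects per QFabric
VOICE_CALL_BPS = 64 * 10**3      # one uncompressed 64 kbps voice channel


def fabric_bandwidth_bps(interconnects: int = MAX_INTERCONNECTS) -> int:
    """Aggregate switching bandwidth of the given number of interconnects."""
    return interconnects * INTERCONNECT_BPS


def voice_calls(bandwidth_bps: int) -> int:
    """Number of whole 64 kbps channels that fit into the bandwidth."""
    return bandwidth_bps // VOICE_CALL_BPS


total = fabric_bandwidth_bps()
print(f"{total / 10**12:.0f} Tbps -> {voice_calls(total):,} channels")
```

Under these assumptions the result is 625,000,000 channels; any real-world packet overhead would lower that number considerably.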

QF/Nodes are the well-known QFX3500 L2/L3 switches. They support 10GE, 2/4/8Gb FC and up to four 40GE uplinks to the QF/Interconnect. With 48 10GE ports, you get 3:1 oversubscription if you use all four uplinks or 6:1 oversubscription if you decide to use only two uplinks (using only one uplink probably doesn’t make much sense).
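The oversubscription figures follow directly from the port counts. Here is a minimal sketch of the calculation, assuming all 48 ports run at 10GE and every uplink at 40GE; the names are mine, and the ratio is expressed as access bandwidth to uplink bandwidth.

```python
from fractions import Fraction

ACCESS_PORTS = 48        # 10GE server-facing ports on a QF/Node
ACCESS_PORT_GBPS = 10
UPLINK_GBPS = 40         # each uplink to the QF/Interconnect is 40GE


def oversubscription(uplinks: int) -> Fraction:
    """Ratio of server-facing bandwidth to fabric-facing bandwidth."""
    if uplinks < 1:
        raise ValueError("need at least one uplink")
    return Fraction(ACCESS_PORTS * ACCESS_PORT_GBPS, uplinks * UPLINK_GBPS)


for uplinks in (4, 2, 1):
    print(f"{uplinks} uplink(s): {oversubscription(uplinks)}:1")
```

With a single uplink the ratio climbs to 12:1, which is why that option is hardly worth considering.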

QFabric uses an out-of-band control-plane LAN implemented with two stacks of EX4200 switches. Each QFabric component has redundant connections to the control-plane LAN (a QF/Node has one connection to each Virtual Chassis, a QF/Interconnect has two, a QF/Director three). All control-plane traffic is exchanged on the control-plane LAN (already getting ATM/SDH/MPLS-TP flashbacks?), nicely isolating it from the user traffic. The QF/Director has separate management and control-plane ports, making the control-plane LAN totally isolated.
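To get a feeling for the size of the control-plane LAN, you can count the ports each Virtual Chassis needs from the per-component figures above. The sketch below is an estimate only: the 16-node example fabric is hypothetical, and links between the EX4200 switches themselves are not counted.

```python
# Control-plane links per Virtual Chassis, as described in the text.
LINKS_PER_VC = {"node": 1, "interconnect": 2, "director": 3}
VIRTUAL_CHASSIS = 2  # two EX4200 stacks


def control_ports_per_vc(nodes: int, interconnects: int, directors: int) -> int:
    """Ports consumed on ONE Virtual Chassis by the QFabric components."""
    counts = {"node": nodes, "interconnect": interconnects, "director": directors}
    return sum(LINKS_PER_VC[kind] * n for kind, n in counts.items())


def total_control_links(nodes: int, interconnects: int, directors: int) -> int:
    """Control-plane cables across both Virtual Chassis."""
    return VIRTUAL_CHASSIS * control_ports_per_vc(nodes, interconnects, directors)


# Hypothetical small fabric: 16 QF/Nodes, 2 QF/Interconnects, 2 QF/Directors.
print(control_ports_per_vc(16, 2, 2), total_control_links(16, 2, 2))
```

The node count dominates quickly: every additional QF/Node costs one port on each Virtual Chassis.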

1-tier? Really?

Looking at the QFabric architecture, one has to wonder why Juniper claims it’s a 1-tier architecture. Honestly, it’s as much 1-tier as every MPLS/VPN network I’ve ever seen. However, like with MPLS/VPN, there’s a trick – QFabric uses single-lookup forwarding.

The ingress QF/Node performs a full L2/L3 lookup (including ACL checks) and decides how to forward the packet to the egress QF/Node. The QF/Interconnect uses the proprietary frame forwarding information to get the user data to the egress QF/Node. The frame forwarding information likely includes enough details to allow the egress QF/Node to forward the frame to the output port.

The expensive part of the user frame/packet lookup is thus performed only once (whereas you’d get three full lookups in a traditional data center design using similar hardware architecture). Net result: 5 microsecond forwarding latency across the fabric. Not bad, considering that the QF/Interconnect itself has three hops.
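The single-lookup idea can be illustrated with a toy model. Juniper's internal frame format is proprietary and undocumented, so everything below is an assumption made for illustration: `FabricTag`, its two fields and all function names are hypothetical, and the three-stage Clos network inside the QF/Interconnect is collapsed into a single step.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FabricTag:
    """Hypothetical stand-in for the proprietary forwarding header."""
    egress_node: str
    egress_port: int


@dataclass
class Frame:
    dst_mac: str
    payload: bytes
    tag: Optional[FabricTag] = None


def ingress_lookup(frame: Frame, fib: Dict[str, FabricTag]) -> Frame:
    """The one expensive step: full L2/L3 lookup (ACL checks would go here)."""
    frame.tag = fib[frame.dst_mac]
    return frame


def interconnect_forward(frame: Frame) -> str:
    """The interconnect reads only the tag, never the user headers."""
    assert frame.tag is not None, "untagged frame inside the fabric"
    return frame.tag.egress_node


def egress_deliver(frame: Frame) -> int:
    """The egress node strips the tag and takes the output port from it."""
    assert frame.tag is not None, "untagged frame at the egress node"
    port = frame.tag.egress_port
    frame.tag = None
    return port


fib = {"00:11:22:33:44:55": FabricTag(egress_node="node-7", egress_port=12)}
frame = ingress_lookup(Frame("00:11:22:33:44:55", b"hello"), fib)
print(interconnect_forward(frame), egress_deliver(frame))
```

Only `ingress_lookup` touches the user's destination address; the remaining steps are cheap fixed-field reads, which is where the latency savings would come from.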

Summary

Once you get over the totally proprietary nature of QFabric, the initial commitment you have to make (according to this post, the minimum you’d pay for a single QF/Interconnect with two linecards and two QF/Directors would be around $450,000 ... without optics or a single QF/Node) and the amount of lock-in you’d be exposed to (with all other vendors, you can slowly phase in or out of their fabrics; with QFabric it’s all-or-nothing), QFabric is indeed a masterpiece of engineering.

Due to all the above-mentioned facts, I would expect to see it deployed primarily in very large greenfield environments; huge Hadoop/MapReduce clusters immediately come to mind.

More information

The Juniper QFabric Packet Pushers Podcast is probably still the best independent source of information on QFabric hardware architecture and its data plane.

I’ll talk about data center fabric architectures and networking requirements for cloud computing at the upcoming EuroNOG conference.

Fabric-like architectures from various vendors are the main focus of the Data Center Fabric Architectures webinar.

You’ll find in-depth discussions of various data center and network virtualization technologies in the Data Center 3.0 for Networking Engineers webinar (recording), which is also part of the Data Center Trilogy.

Both webinars (and numerous others) are included in the yearly subscription.

28 comments:

  1. Your QFabric diagram is actually overly-simplified: each QF/Node has to have dual connections to the Control Plane, and this is achieved by having separate GbE copper connections to separate EX4200-based Virtual Chassis. Similarly the QF/Directors - the "Director Group" - has multiple connections (3..?) to the same VCs, and much the same for the QF/Interconnects. This is obviously all about redundancy (good), but it does place significant restrictions on the physical topology of a QFabric Switch deployment. Everything needs to be within something like a 100m reach, and with the cable routing requirements of a real-world Data Centre I'm not sure how practical the scale-out capabilities will be... Factoring in the pricing, the high start-up costs for QF/I, QF/D, and all of the software licensing, it's difficult to see where the economies of scale will be achieved. Certainly it's not something to be considered lightly...

  2. Your QFabric diagram is actually overly-simplified: each QF/Node has to have dual connections to the Control Plane, and this is achieved by having separate GbE copper connections to separate EX4200-based Virtual Chassis. Similarly the QF/Directors - the "Director Group" - has multiple connections (3..?) to the same VCs, and much the same for the QF/Interconnects. This is obviously all about redundancy (good), but it does place significant restrictions on the physical topology of a QFabric Switch deployment. Everything needs to be within something like a 100m reach, and with the cable routing requirements of a real-world Data Centre I'm not sure how practical the scale-out capabilities will be... Factoring in the pricing, the high start-up costs for QF/I, QF/D, and all of the software licensing, it's difficult to see where the economies of scale (of many multiple QF/N) will be achieved. Certainly it's not something to be considered lightly...

  3. Thank you. Updated.

    The only mention of the 100m Node-to-Interconnect cable length I found was in the PFC section (you have to stay within 100m if you want lossless transport).

  4. 500,000,000 phone calls.

  5. ... or approximately the whole China talking at once. Impressive 8-)

  6. I'd be very nervous about putting my entire network under one control plane, even with redundancy. It just takes one memory leak in the controller plus a failover bug to take down every switch in the network. Not to mention the idea of a switch stack as critical to control-plane functioning.

    There's a reason distributed architectures have been favored for thirty years.

  7. Tell that to the OpenFlow crowd :-P

    Actually, the QFabric's control-plane architecture is pretty distributed. More in the next post.

  8. So you're saying QFabric will probably be what's deployed in the data center filtering China's traffic? Makes perfect sense.

  9. Am I missing something, or is QFabric basically a giant chassis switch with each card broken into separate 1-2U devices, with custom cabling lashing it all together? Once you get to the ingress port, it's all totally proprietary, and not at all "distributed", "scale-out" or anything else buzzword-compliant. Cisco couldn't have dreamed up a better proprietary non-solution to what customers need.

    Give me a bunch of 48+ port TOR-style switches connected with a folded-Clos (or fattened butterfly or 3D torus or whatever) topology that can be managed as one, have one unified control plane, and do ECMP across all links at layer 2 and layer 3. Bonus points for some smart adaptive routing that doesn't require global information sharing.

    Plug-and-play, scale-out networking. That's revolutionary. QFabric isn't.

  10. Although QFabric has to mature for sure, the concept seems ingeniously simple. Instead of building a flat layer 2 network out of a bunch of vendor-pushed half-baked "standards", Juniper seems to be extending upon an idea most are comfortable with, a distributed layer2/layer3 switch.

    Is it just me or does it seem ripe for eventual OpenFlow support? To me it seems that the QF/Director must act in many ways like an OpenFlow Controller.

    I agree the cost could be a prohibiting point, but I'm impressed with the premise. Maybe some day they will release a scaled-down version to gain footing in smaller deployments.

  11. It is, actually...

    See, what you're describing is more than one lookup. That's the beauty of QFabric... it's a single L2 lookup. So it's actually a brilliant technology.

  12. Time to move on from the stone age architectures.

  13. 1 tier logically, yes.

  14. They already are ;) Watch what happens in 1H2012 :)

    As far as cost goes, since the QFX3500 is just a 40x10GigE switch, you can actually deploy them as part of a migration. No need to green field! Interconnects can always be added later.

  15. Except there's nothing that prevents a scale-out architecture from doing source-routing using port numbers (or whatever headers are used on QFabric's proprietary links) instead of MACs. Many HPC interconnects work exactly this way.

    QFabric is "brilliant" in the same way the Spruce Goose was brilliant. Impressive engineering, but a solution looking for a problem, and at way too high of a cost.
    The cost of a redundant QFabric system is something like $2100 per host-facing 10 GbE port with 3:1 oversubscription for 1920 host-facing ports. A two-stage folded-Clos network of the same size, also with 3:1 osub at the edge, would only be about $600 per host-facing port based on current list pricing of 56 * (48-port 10 GbE switch) pricing from other vendors (assuming such switches could someday actually do TRILL or other L2 ECMP).

    Arista/Dell/HP: Juniper messed up, and you've got a big opportunity here.

  16. Hi Ryan,

    Is your argument that you have to deploy two fabrics for redundancy? We have had a couple of early customers consider that path and elect to go with a single fabric instead. While the fabric behaves like a single switch, it is designed to be as resilient as the network many would like to deploy, but rarely do in practice.

    As far as cost goes, we believe there is significant value that goes along with providing applications with the performance they want, infrastructure flexibility rarely found before, and what should be significant operational savings that typically overshadow the capital cost of a network. Time will tell which and to what degree those assumptions are true.

    On the size front, there are those who have accused us of building something specifically for large search engines and such and others who say a 6000 port fabric is too small for their needs. We've said we know how to scale the architecture up and down, so don't count us out yet. This game is just getting interesting.

    Cheers,
    Abner from Juniper

  17. Q-Fabric is nothing new. Geez...back in the day, Bay Networks Centillion did the same exact thing, with a complex redundant backplane that was way ahead of its time. Nice repackaging of a legacy concept...and yes, Q-Fabric is actually less intelligent than the Centillion was back in the 90's...or is it really Q-Past_Fabric? ;)

  18. I asked about a smaller version at VMworld. They had all 3 components on display just like they did at Interop in Vegas back in May. I was told that a smaller interconnect piece will be coming next year. The QFX3008 is a beast. Not that the Nexus 7k's are tiny mind you.

  19. Hi Ivan.

    Still I can't understand the "one-lookup" statement. I clearly see that the Edge Node will perform one lookup, decide which way to reach the destination QFabric port, and embed that forwarding information into the proprietary frame format for the Interconnect to use it to send the packet accordingly.

    What I see is that, if this assumption is correct, the Interconnect must itself perform another, non-standard lookup in order to read that proprietary forwarding information and switch the frame to the destination node... which I assume will somehow know which port to place it into.

    I see one "Ethernet destination MAC address" lookup, and two "proprietary forwarding information header" lookups.

    Where am I wrong? I surely must be missing something.

  20. You're absolutely right. One L2/L3 lookup, encapsulation into proprietary format and multiple tag (or whatever they call it) lookups throughout QFabric (Interconnect is a 3-stage Clos tree).

  21. OK.

    What I don't see then is convergence time information. I understand that for the edge node to attach the proper "tag", it must have visibility of the topology relevant to its connected nodes. Therefore, every topology change (like a VM move) must trigger a FIB update to edge nodes from the QF Directors.

    With potentially hundreds of edge nodes... What procedure is used to guarantee timely convergence for the distribution of this info? What are the target convergence times advertised by JNPR?

    As usual, congrats on the post Ivan.

  22. An MPLS cloud is also logically a 1-tier hop, then. It all depends on perspective. :)

  23. Well... I would not trust my network to just one huge control plane, no matter how much redundancy is thrown into the mix. Modern DC networks should, IMHO, support complete control plane separation between fabric A and B whenever you involve FCoE transport. I have a hard time figuring out how to do that with just one QFabric.

  24. Gotta agree with Ben here.

    Not that distributed, Ivan, as only first-tier protocols (LACP and such) are distributed. Forwarding information base building is delegated to a single control plane, with no visibility into how this information is propagated, what split-horizon mechanisms are implemented, how its coherence is protected... It's basically a huge black-box with regards to Control Plane.

    Plus, it relies upon the stability of a separate out-of-band control network, with its own overlapping control plane...

    Talk about stone age, Chris... ;)

  25. Hi Ivan,

    One comment on "totally proprietary." To be clear - and I don't want to put words into your mouth - you are referring to the internals of the system that actually create the fabric. Just like a chassis switch system, the internal components are proprietary, while all external-facing ports are Ethernet and/or Fibre Channel. More here: http://j.mp/ov18e2

    Our view is that creating a protocol based fabric can get you slightly better efficiency and any to any connectivity than legacy kit, but switch fabrics will deliver much better efficiency, predictability, performance, security, and manageability. We expect those benefits to outweigh the complexity protocol based fabrics will foist on customers. While some have suggested the only way to deploy a QFabric is in a greenfield DC, that would be insane on our part. Early customers are deploying in a corner of the datacenter and expanding from there. I also expect the economics to get more interesting over time.

    Cheers,
    Abner (Juniper Networks)

  26. You're absolutely correct ... but we both know that networking engineers hate large black boxes; they make troubleshooting way harder than it should be.

    However, seems like everyone is moving in the same direction, including Cisco's FEX products and OpenFlow (where although the controller-to-switch communication is standard, the real troubleshooting will have to be done in the guts of the controller, which will be proprietary).

    Already looking forward to the NFD2 discussions 8-)
    Ivan

  27. I believe it is only fair to state that I am a Cisco engineer, just in case. :-D



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.