QFabric Part 3 – Forwarding

You won’t find much about the QFabric forwarding architecture and resulting behavior in the documentation; white papers might give you more insight and I’m positive more detailed ones will start appearing on Juniper’s web site now that the product is shipping. In the meantime, let’s see how far we can get based on two simple assumptions: (A) The "one tier architecture" claim is true and (B) Juniper has some very smart engineers.

One lookup per packet. According to documentation available on Juniper’s web site and explanations given in the Juniper QFabric Packet Pushers Podcast, the whole QFabric acts as a single L2/L3 switch. Even more, only a single L2/L3 lookup is performed when a packet traverses the QFabric. Obviously, the lookup has to be performed by the ingress QF/Node.

MAC address learning. In a traditional L2 network, every L2 switch would learn the paths toward end hosts by gleaning the source MAC addresses of transit packets. This approach can no longer work in a one-lookup-per-packet environment; the ingress QF/Node can use dynamic MAC address learning, the egress ones cannot.

MAC address reachability information must thus be propagated by the control plane (similar to Cisco’s OTV). The control-plane mechanisms (most likely implemented in QF/Director) can distribute MAC addresses only to those QF/Nodes that are connected to the target VLAN; QFabric thus has the potential to scale much better than traditional L2 networks.
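Here's a minimal sketch of what such VLAN-scoped MAC distribution could look like. Everything in it (the `Director` class, its method names, the node and port names) is a hypothetical illustration of the idea, not Juniper's actual implementation: the ingress node reports a locally learned MAC address, and the central element pushes it only to nodes that participate in that VLAN.

```python
from collections import defaultdict


class Director:
    """Hypothetical control-plane element relaying MAC reachability."""

    def __init__(self):
        self.vlan_members = defaultdict(set)   # vlan -> {node names}
        self.vlan_macs = defaultdict(dict)     # vlan -> {mac: (node, port)}
        self.node_tables = defaultdict(dict)   # node -> {(vlan, mac): (node, port)}

    def join_vlan(self, node, vlan):
        """Add a node to a VLAN and sync the MACs already known in it."""
        self.vlan_members[vlan].add(node)
        for mac, location in self.vlan_macs[vlan].items():
            self.node_tables[node][(vlan, mac)] = location

    def learn(self, ingress_node, port, vlan, mac):
        """Called when an ingress node gleans a source MAC on a local port."""
        location = (ingress_node, port)
        self.vlan_macs[vlan][mac] = location
        # Only nodes with ports in this VLAN receive the entry.
        for node in self.vlan_members[vlan]:
            self.node_tables[node][(vlan, mac)] = location


director = Director()
director.join_vlan("node1", 10)
director.join_vlan("node2", 10)
director.join_vlan("node3", 20)
director.learn("node1", "xe-0/0/1", 10, "00:00:5e:00:53:01")
print(director.node_tables["node2"])
print(director.node_tables["node3"])
```

In this sketch node2 gets the entry because it has ports in VLAN 10, while node3 (VLAN 20 only) never sees it; that scoping is where the scaling benefit would come from.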

Layer-3 forwarding. Every QF/Node is an L2/L3 switch and has to be able to do full L3 forwarding between any two VLANs (even if the target VLAN is not configured on it) if you want to retain the one-lookup philosophy. The IP routing tables and ARP tables thus have to be shared between all QF/Nodes in QFabric.

To be more precise: each QF/Node needs the IP routes and ARP tables for the routing instances in which its ports (based on their VLAN membership) participate.
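To illustrate why the ingress node needs both tables, here's a rough sketch of a single ingress L3 lookup. The table contents, the `ingress_lookup` function, and the node and port names are all assumptions made up for the example; the point is only that one lookup must resolve the destination all the way down to an egress node and port.

```python
import ipaddress

# Hypothetical shared routing table: prefix -> next hop (None = directly connected).
ROUTES = {
    ipaddress.ip_network("10.0.10.0/24"): None,
    ipaddress.ip_network("10.0.20.0/24"): None,
    ipaddress.ip_network("0.0.0.0/0"): ipaddress.ip_address("10.0.20.254"),
}

# Hypothetical shared ARP table: IP -> (MAC, egress node, egress port).
ARP = {
    ipaddress.ip_address("10.0.10.5"): ("00:00:5e:00:53:05", "node1", "xe-0/0/5"),
    ipaddress.ip_address("10.0.20.7"): ("00:00:5e:00:53:07", "node3", "xe-0/0/7"),
    ipaddress.ip_address("10.0.20.254"): ("00:00:5e:00:53:fe", "node4", "xe-0/0/47"),
}


def ingress_lookup(dst_ip):
    """Resolve a destination IP to (MAC, egress node, egress port) in one pass."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [net for net in ROUTES if dst in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)  # longest-prefix match
    next_hop = ROUTES[best]
    if next_hop is None:
        next_hop = dst  # directly connected: resolve the host itself
    return ARP.get(next_hop)  # None if the ARP entry is missing


print(ingress_lookup("10.0.20.7"))   # host in another subnet inside the fabric
print(ingress_lookup("192.0.2.1"))   # external destination via the next-hop router
```

Note that the ARP entry in this sketch carries the egress node and port as well as the MAC address; without that, the egress node would need a second lookup.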

The shared layer-3 forwarding architecture has an interesting consequence: every single IP packet is forwarded along the shortest physical path toward the destination IP host (or next-hop router outside of QFabric), regardless of whether the destination is in the same or in a different IP subnet. Goodbye, L2/L3 separation and traffic trombones.

Combined with multipathing across and within QF/Interconnects, QFabric gives you optimum any-to-any L2 or L3 connectivity. Now I’m getting impressed.
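How QFabric actually spreads flows across QF/Interconnects isn't documented here, so the following is purely an assumption: a generic flow-hashing sketch of the kind commonly used for equal-cost multipathing. The interconnect names, the `pick_interconnect` function, and the choice of CRC32 over the 5-tuple are all made up for illustration.

```python
import zlib

# Hypothetical list of interconnects reachable from the ingress node.
INTERCONNECTS = ["ic0", "ic1", "ic2", "ic3"]


def pick_interconnect(src_ip, dst_ip, proto, src_port, dst_port,
                      uplinks=INTERCONNECTS):
    """Map a flow's 5-tuple to one uplink so packets of a flow stay in order."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]


flow = ("10.0.10.5", "10.0.20.7", 6, 49152, 443)
print(pick_interconnect(*flow))
```

Hashing per flow rather than per packet keeps every packet of a flow on the same path, which avoids reordering at the cost of less even load distribution.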

Shared default router MAC address. Think about the L3 forwarding facts I mentioned above, the way off-subnet forwarding works with IP, and the mandatory support for host (actually VM) mobility.

An IP host usually reaches off-subnet destinations through the default gateway – that would be the ingress QF/Node in QFabric. The IP host reaches the default gateway via the default gateway’s MAC address that it learns through ARP. A virtual machine (VM) cannot change the default gateway or its MAC address after a live migration event (that would require fixes to the guest OS TCP/IP stack); it’s still sending the off-subnet packets to the same MAC address (causing traffic trombones in traditional data center networks). To make this work with the “one lookup in ingress QF/Node” policy, all QF/Nodes must share the same MAC address for the virtual IP address of the per-subnet (per-VLAN) default gateway.

Now think about the implications – every IP packet reaches the default gateway in one hop (good), the ingress switch always does the L3 lookup (better) ... and we don’t need VRRP any more (finally), as it doesn’t matter where the server sends the off-subnet IP packets – every single QF/Node is able to intercept them and perform L3 lookup.
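A small sketch of the decision every node would have to make on an incoming frame, assuming a shared gateway MAC address. The `Node` class, the MAC values, and the node names are hypothetical; the example just contrasts a shared gateway MAC with per-switch gateway MACs after a VM move.

```python
from dataclasses import dataclass

SHARED_GATEWAY_MAC = "00:00:5e:00:53:fe"  # hypothetical fabric-wide gateway MAC


@dataclass
class Node:
    name: str
    gateway_mac: str = SHARED_GATEWAY_MAC

    def lookup_type(self, dst_mac):
        """Frames sent to the gateway MAC get an L3 lookup, all others L2."""
        return "L3" if dst_mac.lower() == self.gateway_mac else "L2"


# The VM learned the gateway MAC through ARP while attached to node1 ...
cached_gateway_mac = SHARED_GATEWAY_MAC
before_move = Node("node1").lookup_type(cached_gateway_mac)
# ... and keeps using it after a live migration to node2.
after_move = Node("node2").lookup_type(cached_gateway_mac)
print(before_move, after_move)

# With per-switch gateway MACs, the new switch would merely bridge the frame
# back toward the old gateway (the traffic trombone).
legacy = Node("legacy2", gateway_mac="00:00:5e:00:53:02")
print(legacy.lookup_type("00:00:5e:00:53:01"))
```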

Is this unique? Actually it is. Although other vendors could implement a similar solution (and it wouldn’t be impossible to make it work with traditional switching architectures), they usually prefer to play it safe and roll out incremental improvements of existing architectures (or dead ends like large-scale bridging).

More information

I’ll talk about data center fabric architectures and networking requirements for cloud computing at the upcoming EuroNOG conference.

You’ll find in-depth discussions of various data center and network virtualization technologies in the Data Center 3.0 for Networking Engineers webinar (available as a recording or as part of the Data Center Trilogy). Fabric architectures from various networking vendors are described in the Data Center Fabric Architectures webinar. Both webinars are included in the yearly subscription.

11 comments:

  1. How far did we get? Is it true and do they have smart engineers? :)

  2. They do ;)

  3. "Shared default router MAC address"

    Can we configure a couple of 7600, 3550 or 4948 switches to behave like this?
    Force the same MAC and the same IP on the VLAN interfaces, add some filters, and leave the hosts with their default routes (bonding with failover).

    Has anyone ever tried something like that?

  4. I have a few questions. I have read your post now like five times and I am for sure speaking above my pay grade so maybe you can clear this up for me.

    1) One lookup per packet.
    This sounds great, but the interconnect can't just forward frames based on nothing. It will still have to use something to determine which node to send the frame to... be it a label slapped on the front of the frame or something else. Since forwarding decisions are made in hardware, how does QFabric add anything to this area? In theoretical application, how does this differ from MPLS?

    2) Shared default router MAC address.
    You assume we're doing L3 routing at the access layer. If we're doing routing at the distribution layer this is taken care of. Yes we still have traffic trombones, but if you want to go the traffic trombone route the only way I see QFabric could help with this is if you had a QFabric split between two datacenters. I'm not even sure they would support that design. So if we assume we have two QFabric deployments in two different datacenters we're back to the same design challenges we face today.

    3) L2/L3 Same Path
    I like this... but if we're routing at the access layer why not use MPLS and set up VPLS? Would this not accomplish the same thing? I think I just made the biggest traffic trombone in the world, but if I am accurate then I am sure smart engineers at Cisco, Brocade, or HP could come up with a way to have each access switch act as the default route for hosts hanging off of it. Obviously this would need to include the distribution layer so it knows which access switch to forward to.

    Can someone break this down for me?

  5. In theory we could, but it would be fragile, as you can't enforce that (A) all VLANs are configured on all boxes and (B) there is no other L2 box in the middle that could break things.

    Also, using today's implementations, every switch would go crazy seeing its own MAC address coming from various sources (not to mention duplicate MAC addresses). ARP might even work, but you would get repeated ARP requests from various switches as the VM moves around.

  6. #1 - they need some labeling scheme in the fabric, similar (in concept) to MPLS/VPN. The ingress node has to indicate egress interface with the label.

    #2 - You're absolutely right. You get optimal forwarding within the QFabric, not between QFabric and external L3 forwarders.

    #3 - Sure you could do it, but it would be way more complex than what QFabric does. They hide the complexity of the underlying mechanisms (which are probably not too different from what you're describing) in the same way TRILL/OTV/802.1aq is hiding all the complexity of IS-IS.

  7. Well, still in theory, it would be possible to filter that traffic (MAC, ARP), wouldn't it?

  8. In theory it might be possible ;) Or, as someone famously said once "looking at the code I couldn't see why it wouldn't work" :-P

  9. Regarding #2... Optimal forwarding != shortest path, IMHO. I would like to understand what hashing algorithm is used within QFabric to determine the flow within the system. It's not clear whether ECMP is supported, load-sharing etc.

    Regarding #3... With regard to troubleshooting, I am not sure whether it's more complex to analyze a set of standard, published protocols than to troubleshoot a proprietary black box. I will need to look at QFabric in more detail to see what troubleshooting tools we are given to understand the mechanisms involved.

  10. Hello, Ivan!
    As far as I understand, the one-lookup philosophy means that all nodes have the same forwarding tables. What is the reason to divide QFXs into ServerNodeGroup/NetworkNodeGroup and to restrict a NetworkNodeGroup to only eight node devices?

  11. Control plane scalability. You can't interact with too many neighbors from a single point.

    QFabric control plane is just like a regular network (best example might be BGP with route reflectors): each node group interacts with directly attached neighbors, and they exchange the routing information through a set of central nodes.



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.