QFabric Behind the Curtain: I was spot-on

A few days ago Kurt Bales and Cooper Lees gave me access to a test QFabric environment. I always wanted to know what was really going on behind the QFabric curtain and the moment Kurt mentioned he was able to see some of those details, I was totally hooked.

Short summary: QFabric works exactly as I’d predicted three months before the user-facing documentation became publicly available (the behind-the-scenes view described in this blog post is probably still hard to find).

This post is by no means a critique of QFabric. If anything, I’m delighted there’s still a networking vendor that can create innovative solutions without unicorn tears, relying instead on field-tested technologies ... which might, among other things, make the solution more stable.

It looks like a giant switch

When you log into the QFabric management IP address (VIP), it looks exactly like a giant switch – a single configuration, a single set of interfaces, show commands, and so on. All the familiar Junos configuration components are there: the system group, interfaces, VLANs and protocols. The only really new component is the fabric object with node-group definitions (more on QFabric node groups).

However, every giant switch needs troubleshooting, which usually requires access to individual components; in QFabric's case, that's the request component login command, which unveils the really interesting world behind the curtain.

ip@test> request component login ?
Possible completions:
  <node-name>          Inventory name for the remote node
  DRE-0                Diagnostic routing engine
  IC-Left/RE0          Interconnect device control board
  IC-Left/RE1          Interconnect device control board
  IC-Right/RE0         Interconnect device control board
  IC-Right/RE1         Interconnect device control board
  FC-0                 Fabric control
  FC-1                 Fabric control
  FM-0                 Fabric manager
  NW-NG-0              Node group
  R2-19-Node0          Node device
  R2-19-Node1          Node device
  R2-7-Node4           Node device
  R2-7-Node5           Node device
  R3-12-Node6          Node device
  R3-12-Node7          Node device
  R3-19-Node2          Node device
  R3-19-Node3          Node device
  RSNG01               Node group
  RSNG02               Node group

The names of physical entities (QF/Nodes, QF/Interconnects) can be either their serial numbers (the default) or user-configured names (recommended).

As you can see, you can log into individual physical devices, node groups, and virtual components like the fabric controls and the fabric manager. These virtual components run on the QF/Directors – CentOS boxes running KVM (you can log into the QF/Director Linux shell and see the virtual machines with ps -elf).

Each QF/Director runs a number of common services, including a database (MySQL), DHCP, FTP, NTP, SSH, GFS, DLM (distributed lock manager), NFS and Syslog servers:

ip@QFabric> show fabric administration inventory director-group status 
Director Group Status Sat Aug 25 09:52:08 PDT 2012

 Member Status Role     Mgmt Address    CPU Free Memory VMs Up Time
 ------ ------ -------- --------------- --- ----------- --- -------------
 dg0    online master   xxxxxxxxxxxx    10% 17642780k   4   3 days, 16:23 hrs
 dg1    online backup   xxxxxxxxxxxx    6%  20509268k   3   3 days, 16:13 hrs

 Member Device Id/Alias  Status  Role
 ------ ---------------- ------- ---------
 dg0    xxxxxxxxxxxxxxxx online  master   

  Master Services
  ---------------
  Database Server                online    
  Load Balancer Director         online    
  QFabric Partition Address      online    

  Director Group Managed Services
  -------------------------------
  Shared File System             online    
  Network File System            online    
  Virtual Machine Server         online    
  Load Balancer/DHCP             online    

  Hard Drive Status
  ----------------
  Volume ID:4                    optimal    
  Physical ID:1                  online     
  Physical ID:0                  online     
  SCSI ID:1                      100%       
  SCSI ID:0                      100%       

  Size  Used Avail Used% Mounted on 
  ----  ---- ----- ----- ----------
  423G  6.3G 395G  2%   /          
  99M   20M  75M   21%  /boot      
  93G   2.0G 91G   3%   /pbdata    

  Director Group Processes
  ------------------------
  Director Group Manager         online    
  Partition Manager              online    
  Software Mirroring             online    
  Shared File System master      online    
  Secure Shell Process           online    
  Network File System            online    
  DHCP Server master             online     master                           
  FTP Server                     online    
  Syslog                         online    
  Distributed Management         online    
  SNMP Trap Forwarder            online    
  SNMP Process                   online    
  Platform Management            online    
[... rest deleted ...]

Lo and behold – it’s actually running BGP internally

After logging into one of the fabric control virtual machines, you can execute the show bgp summary fabric command, which clearly indicates that the control-plane protocol behind the scenes is multiprotocol BGP running numerous address families. Each fabric control VM runs BGP with all server or network node groups (not with individual QF/Nodes) and with all QF/Interconnects.

qfabric-admin@FC-0> show bgp summary fabric | no-more 
Groups: 2 Peers: 6 Down peers: 0
Unconfigured peers: 5
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
bgp.l3vpn.0          
                      42         18          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
128.0.128.4             100      10517      10602       0       0  3d 6:43:58 Establ
  bgp.l3vpn.0: 17/17/17/0
  bgp.rtarget.0: 28/31/31/0
  bgp.fabricvpn.0: 28/28/28/0
  bgp.bridgevpn.0: 8/8/8/0
  default.inet.0: 17/17/17/0
  default.fabric.0: 19/19/19/0
128.0.128.8             100      10594      10593       0       0  3d 6:44:06 Establ
  bgp.l3vpn.0: 0/18/18/0
  bgp.rtarget.0: 1/32/32/0
  bgp.fabricvpn.0: 0/103/103/0
  bgp.bridgevpn.0: 0/9/9/0
  default.inet.0: 0/18/18/0
  default.fabric.0: 0/91/91/0
128.0.130.4             100      10466      10552       0       0  3d 6:35:42 Establ
  bgp.rtarget.0: 0/4/4/0
  bgp.fabricvpn.0: 34/34/34/0
  bgp.bridgevpn.0: 0/0/0/0
  default.fabric.0: 34/34/34/0
128.0.130.10            100       9751       9636       0       0  3d 1:04:34 Establ
  bgp.rtarget.0: 0/4/4/0
  bgp.fabricvpn.0: 34/34/34/0
  bgp.bridgevpn.0: 0/0/0/0
  default.fabric.0: 34/34/34/0
128.0.130.24            100      10432      10547       0       0  3d 6:18:09 Establ
  bgp.l3vpn.0: 1/7/7/0
  bgp.rtarget.0: 0/7/7/0
  bgp.fabricvpn.0: 7/7/7/0
  bgp.bridgevpn.0: 1/1/1/0
  default.inet.0: 1/7/7/0
  default.fabric.0: 4/4/4/0
128.0.130.26            100      10410      10545       0       0  3d 6:19:11 Establ
  bgp.l3vpn.0: 0/0/0/0
  bgp.rtarget.0: 0/4/4/0
  bgp.fabricvpn.0: 0/0/0/0
  bgp.bridgevpn.0: 0/0/0/0

Every other component (for example, a QF/Interconnect) has two BGP sessions, one with each fabric control VM:

qfabric-admin@IC-Left> show bgp summary fabric 
Groups: 1 Peers: 2 Down peers: 0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
128.0.128.6             100       9663       9775       0       0  3d 1:16:27 Establ
  bgp.rtarget.0: 28/32/32/0
  bgp.fabricvpn.0: 61/61/61/0
  bgp.bridgevpn.0: 0/0/0/0
  default.fabric.0: 61/61/61/0
128.0.128.8             100       9667       9773       0       0  3d 1:16:23 Establ
  bgp.rtarget.0: 0/32/32/0
  bgp.fabricvpn.0: 0/61/61/0
  bgp.bridgevpn.0: 0/0/0/0
  default.fabric.0: 0/61/61/0

Edge node groups use six MP-BGP address families (including default.inet.0 and default.fabric.0); QF/Interconnects use just four.
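You can verify that count mechanically. Here's a small Python sketch that parses an excerpt of the summary output shown above and counts the address families per peer (the peer roles in the comments are my inference from the family counts – the CLI doesn't label them):

```python
import re

# Excerpt of the "show bgp summary fabric" output above: one edge node
# group peer and one QF/Interconnect peer.
OUTPUT = """\
128.0.128.4             100      10517      10602       0       0  3d 6:43:58 Establ
  bgp.l3vpn.0: 17/17/17/0
  bgp.rtarget.0: 28/31/31/0
  bgp.fabricvpn.0: 28/28/28/0
  bgp.bridgevpn.0: 8/8/8/0
  default.inet.0: 17/17/17/0
  default.fabric.0: 19/19/19/0
128.0.130.4             100      10466      10552       0       0  3d 6:35:42 Establ
  bgp.rtarget.0: 0/4/4/0
  bgp.fabricvpn.0: 34/34/34/0
  bgp.bridgevpn.0: 0/0/0/0
  default.fabric.0: 34/34/34/0
"""

def families_per_peer(text):
    """Map each BGP peer to the list of address families it carries."""
    peers, current = {}, None
    for line in text.splitlines():
        if re.match(r"\d+\.\d+\.\d+\.\d+\s", line):  # peer summary line
            current = line.split()[0]
            peers[current] = []
        elif line.startswith("  ") and ":" in line:  # per-RIB detail line
            peers[current].append(line.split(":")[0].strip())
    return peers

fams = families_per_peer(OUTPUT)
print(len(fams["128.0.128.4"]))  # 6 -- edge node group
print(len(fams["128.0.130.4"]))  # 4 -- QF/Interconnect
```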

The fabric control VMs act as BGP route reflectors (exactly as I predicted). You can easily verify that by inspecting any individual BGP entry on one of the node groups – you’ll see the Originator and Cluster List BGP attributes:

65534:1:192.168.13.37/32 (2 entries, 1 announced)
        *BGP    Preference: 170/-101
                Route Distinguisher: 65534:1
                Next hop type: Indirect
                Address: 0x964f49c
                Next-hop reference count: 6
                Source: 128.0.128.6
                Next hop type: Router, Next hop index: 131070
                Next hop: 128.0.130.24 via dcfabric.0, selected
                Label operation: PFE Id 7 Port Id 55
                Label TTL action: PFE Id 7 Port Id 55
                Session Id: 0x0
                Next hop: 128.0.130.24 via dcfabric.0
                Label operation: PFE Id 8 Port Id 55
                Label TTL action: PFE Id 8 Port Id 55
                Session Id: 0x0
                Protocol next hop: 128.0.130.24:49160(NE_PORT)
                Layer 3 Fabric Label 5
                Composite next hop: 964f440 1738 INH Session ID: 0x0
                Indirect next hop: 92c8d00 131072 INH Session ID: 0x0
                State: <Active Int Ext>
                Local AS:   100 Peer AS:   100
                Age: 3d 6:54:40 Metric2: 0 
                Validation State: unverified 
                Task: BGP_100.128.0.128.6+33035
                Announcement bits (1): 0-Resolve tree 1 
                AS path: I (Originator) Cluster list:  0.0.0.1
                AS path:  Originator ID: 128.0.130.24
                Communities: target:65534:117440513(L3:1)
                Import Accepted
                Timestamp: 0x116
                Route flags: arp
                Route type: Host
                Route protocol : arp
                L2domain : 5
                SNPA count: 1, SNPA length: 8
                SNPA Type: Network Element Port SNPA
                NE Port ID: 49160
                Localpref: 100
                Router ID: 128.0.128.6
                Secondary Tables: default.inet.0
                Composite next hops: 1
                        Protocol next hop: 128.0.130.24:49160(NE_PORT)
                        Layer 3 Fabric Label 5
                        Composite next hop: 964f440 1738 INH Session ID: 0x0
                        Indirect next hop: 92c8d00 131072 INH Session ID: 0x0
                        Indirect path forwarding next hops: 2
                                Next hop type: Router
                                Next hop: 128.0.130.24 via dcfabric.0
                                Session Id: 0x0
                                Next hop: 128.0.130.24 via dcfabric.0
                                Session Id: 0x0
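The Originator ID (128.0.130.24) and Cluster List (0.0.0.1) attributes in the printout above are standard RFC 4456 route-reflection machinery. A minimal Python sketch of what any reflector does with them – an illustration of the generic algorithm, not Juniper's actual code:

```python
def reflect(route, cluster_id):
    """RFC 4456 in miniature: before re-advertising an IBGP route, a
    route reflector records the original speaker as Originator ID and
    prepends its own cluster ID to the Cluster List; a route that
    already carries this cluster ID has looped and is discarded."""
    if cluster_id in route.get("cluster_list", []):
        return None  # reflection loop detected
    reflected = dict(route)
    reflected.setdefault("originator_id", route["from"])
    reflected["cluster_list"] = [cluster_id] + route.get("cluster_list", [])
    return reflected

# Host route learned from node group 128.0.130.24, reflected by FC-0:
route = {"prefix": "65534:1:192.168.13.37/32", "from": "128.0.130.24"}
out = reflect(route, "0.0.0.1")
print(out["originator_id"], out["cluster_list"])  # 128.0.130.24 ['0.0.0.1']
print(reflect(out, "0.0.0.1"))                    # None -- loop rejected
```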

Addressing

The QFabric control plane uses locally administered MAC addresses and the 128.0.0.0/16 IP address block. You can see all the MAC and IP addresses with the show arp command executed on any of the internal components. The bme interfaces are control-plane interfaces; the vlan interface is a user-facing SVI.

qfabric-admin@NW-NG-0> show arp 
MAC Address       Address     Name              Interface           Flags
00:13:dc:ff:72:01 10.73.2.9   10.73.2.9         vlan.501            none
02:00:00:00:40:01 128.0.0.1   128.0.0.1         bme0.2              permanent
02:00:00:00:40:02 128.0.0.2   128.0.0.2         bme0.2              permanent
02:00:00:00:40:05 128.0.0.4   128.0.0.4         bme0.0              permanent
02:00:00:00:40:05 128.0.0.5   128.0.0.5         bme0.1              permanent
02:00:00:00:40:05 128.0.0.5   128.0.0.5         bme0.2              permanent
02:00:00:00:40:05 128.0.0.6   128.0.0.6         bme0.0              permanent
02:00:00:00:40:07 128.0.0.7   128.0.0.7         bme0.1              permanent
02:00:00:00:40:07 128.0.0.7   128.0.0.7         bme0.2              permanent
02:00:00:00:40:08 128.0.0.8   128.0.0.8         bme0.1              permanent
02:00:00:00:40:08 128.0.0.8   128.0.0.8         bme0.2              permanent
02:00:00:00:40:09 128.0.0.9   128.0.0.9         bme0.1              permanent
02:00:00:00:40:09 128.0.0.9   128.0.0.9         bme0.2              permanent
[... rest deleted ...]
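Both addressing choices are easy to check with a few lines of Python – the 02:… MAC addresses really do have the locally-administered bit set, and all the control-plane IP addresses land inside 128.0.0.0/16:

```python
import ipaddress

# The control-plane MACs above (02:00:00:00:40:xx) have the
# locally-administered bit set: 0x02 in the first octet.
first_octet = int("02:00:00:00:40:01".split(":")[0], 16)
print(bool(first_octet & 0x02))  # True  -> locally administered
print(bool(first_octet & 0x01))  # False -> unicast, not multicast

# The control-plane addresses fall inside 128.0.0.0/16, a block that
# was still IETF-reserved at the time; it never leaves the fabric.
block = ipaddress.ip_network("128.0.0.0/16")
print(ipaddress.ip_address("128.0.128.4") in block)   # True
print(ipaddress.ip_address("128.0.130.24") in block)  # True
```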

Look Ma! There are the labels!

In my blog post I predicted that QFabric uses MPLS internally. Without a 40Gbps sniffer it's impossible to figure out whether an MPLS label stack is the exact encapsulation format QFabric uses, but it sure looks like MPLS from the outside.

The dcfabric interface uses mpls as one of the protocols:

qfabric-admin@RSNG01> show interfaces dcfabric.0 
  Logical interface dcfabric.0 (Index 64) (SNMP ifIndex 1214251262) 
    Flags: SNMP-Traps Encapsulation: ENET2
    Input packets : 0 
    Output packets: 0
    Protocol inet, MTU: 1558
      Flags: Is-Primary
    Protocol mpls, MTU: 1546, Maximum labels: 3
      Flags: Is-Primary
    Protocol eth-switch, MTU: 0
      Flags: Is-Primary

You can also see MPLS-like labels in numerous BGP entries, for example in the bridgevpn address family ...

65534:1:5.c8:e2:c3:01:78:8f/144               
         *[BGP/170] 1w3d 15:28:00, localpref 100
            AS path: I, validation-state: unverified
            to 128.0.128.4 via dcfabric.0, Push 1730, Push 1, Push 55(top)
          > to 128.0.128.4 via dcfabric.0, Push 1730, Push 2, Push 55(top)
          [BGP/170] 1w3d 15:28:00, localpref 100, from 128.0.128.8
            AS path: I, validation-state: unverified
            to 128.0.128.4 via dcfabric.0, Push 1730, Push 1, Push 55(top)
          > to 128.0.128.4 via dcfabric.0, Push 1730, Push 2, Push 55(top)

A similar three-label stack appears in a host route pointing to a host connected to another QF/Node:

65534:1:10.73.2.9/32                
           *[BGP/170] 3d 12:32:09, localpref 100
              AS path: I, validation-state: unverified
            > to 128.0.128.4 via dcfabric.0, Push 5, Push 1, Push 23(top)
            [BGP/170] 3d 12:32:09, localpref 100, from 128.0.128.8
              AS path: I, validation-state: unverified
            > to 128.0.128.4 via dcfabric.0, Push 5, Push 1, Push 23(top)

IP prefixes directly connected to the QFabric have just one label – probably a pointer to an IP forwarding table entry.

65534:1:10.73.2.0/29                
           *[BGP/170] 3d 12:31:59, localpref 101, from 128.0.128.4
              AS path: I, validation-state: unverified
            > to 128.0.128.4:129(NE_PORT), Layer 3 Fabric Label 5
            [BGP/170] 3d 12:31:59, localpref 101, from 128.0.128.8
              AS path: I, validation-state: unverified
            > to 128.0.128.4:129(NE_PORT), Layer 3 Fabric Label 5 

On the other hand, the MPLS routing and forwarding tables are empty, indicating that this is very probably not the MPLS we’re used to.
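The push operations in those entries read like a label stack. A tiny Python sketch that extracts the stack in wire order, assuming MPLS-style semantics where the last label pushed – the one Junos marks "(top)" – ends up outermost:

```python
import re

def label_stack(next_hop_line):
    """Return the pushed labels outermost-first: Junos lists push
    operations in order, so the label marked "(top)" comes last."""
    labels = [int(m) for m in re.findall(r"Push (\d+)", next_hop_line)]
    return labels[::-1]

# Next hop from the host route above: inner labels 5 and 1, top label 23.
print(label_stack("to 128.0.128.4 via dcfabric.0, Push 5, Push 1, Push 23(top)"))
# [23, 1, 5]
```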

Summary

Behind the scenes, QFabric runs like any well-designed service provider network: a cluster of central servers provides common services (including DHCP, NFS, FTP, NTP and Syslog); BGP distributes customer prefixes (IP addresses, host/ARP routes, MAC addresses) in the control plane; and an MPLS-like encapsulation that can attach a label stack to a layer-2 frame or layer-3 datagram is used in the forwarding plane.

The true magic of QFabric is the CLI VM, which presents the internal IP+MPLS-like network as a single switch without any OpenFlow or SDN magic. Wouldn't it be nice to have something similar in service provider networks?

2012-12-17: Comments are temporarily disabled, as a moron selling acne-reducing snake oil found this blog post interesting. Contact me using the 'Contact' link at the top of the page.

19 comments:

  1. Hi Ivan,
    May be this paper from Juniper will be of interest.
    http://www.juniper.net/us/en/local/pdf/whitepapers/2000443-en.pdf


  2. Very nice! Excellent post and information, as always.

  3. That is very, very, nice. I look forward to more reports regarding your experiences with it.

  4. Ivan,

    Are you perhaps being a little kind to yourself here?

    As an example....

    Ref this central comment in your original post.

    'They would likely keep the individual components in the QFabric pretty autonomous and use distributed processing while using QF/Director as the central management/configuration device (similar to UCS manager in Cisco UCS).'

    Ref the role of the Director from Juniper.

    'To draw parallels with a traditional chassis-based switch, the QFabric Director is equivalent to the supervisor module and routing engine.'

    I am sure you would agree the credit here should be going to the smart guys and girls at Juniper who created the architecture, design and code to make this happen. Like most embedded coders, they don't get to make a big noise on the internet about their rather clever work.

  5. Very timely post. I had in my "homework" list to figure out just what was going on under the hood after you pointed out to me that QFabric is a distributed (not centralized) control-plane. You did most of my homework for me, although I need to re-read this post a few times to digest it completely. As usual. ;-) Thank you.

  6. From Juniper

    The QFabric architecture subscribes to the “centralize what you can, distribute what you must” philosophy by implementing a distributed control plane in order to build scale-out networks.

    Network node group Routing Engine: NNG routing engine performs the routing engine functionality on the NNG QFabric Nodes as described earlier. It runs as an active/backup pair of VMs across the physically disjointed compute nodes in the QFabric Director cluster.

    QFabric Director compute clusters are composed of two compute nodes, each running identical software stacks.
    (Although the current implementation supports a maximum of two compute nodes per cluster, the architecture can theoretically support multiple servers.) Two of the compute nodes have a disk subsystem directly attached to them; the remaining compute nodes are diskless and therefore essentially stateless. Each disk subsystem consists of two 2 TB disks arranged in a RAID1 mirror configuration, and the contents across these subsystems are synchronously block replicated for redundancy.

  7. Can't you setup mirroring to check forwarding plane headers? As QFabric hardware is 'switch' COTS ASIC, instead of fully programmable NPU like Trio, it seems likely that forwarding plane wouldn't even support custom headers.

  8. Unbelievably awesome Ivan! The comment about having this in SP struck home... I would seriously consider selling my soul to Scratch for the capability QFabric has to manage a distributed network in my 2×10^4 node work network.

  9. Every time I look at how you must do things in J vs how you must do them in C, or every time TCAM issues kill a switch (or so-called router, in the case of the 7600 and N7K), it makes me want to grab a hedge trimmer and a screwdriver, launch a world tour of our edge peering points and data centers, rip out every C device – from peering router to L2 agg switch to leaf switch to spine switch to access router – and toss them out in the street. I'd really like to see coverage of MPLS VPN on the J MX series, and on EX and QFX series devices... Also: fw filter policy tricks that just are not possible in Cisco ACLs.

  10. http://www.ietf.org/id/draft-ietf-l2vpn-evpn-01.txt

  11. What about http://www.heise.de/netze/rfc/rfcs/rfc5735.shtml#page-10

    ->128.0/16 to be allocated

    1. The 128.0.128.0 address space is used internally by the fabric control protocol (which is based on BGP); it is not part of any external reachability information.

      You can see 128.0 addresses referenced when using the "show route fabric" command.

      But you will not see any 128.0 when using "show route" unless/until it is used on the Internet.

  12. Nice post Ivan. I had heard about the use of BGP & MPLS in QFabric; this post confirms it with more details.

    What's your view on SPB finding its way into DC fabrics from some of the vendors? It also has roots in the service provider world, with goals of achieving scale, ease of provisioning and O&M.

  13. Hi Ivan,

    I was wondering what would happen if you were to remove the MAC info from the equation. Instead of mapping L2 to L3 via a distributed ARP table... why not just remove all L2 from the equation and perform pure L3 based forwarding? Terminate ARP at the leafs and you have optimal L2/L3 any node to any node...

    1. That would be ideal, but we both know that we have to support all sorts of crazy non-IP protocols (ex: FCoE :D ) and IP-based abominations that refuse to die (ex: Microsoft NLB).

      As much as I'd like L3 forwarding everywhere, when reality hits you, you have to implement a mix of L2 and L3.

Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.