QFabric Behind the Curtain: I was spot-on

A few days ago Kurt Bales and Cooper Lees gave me access to a test QFabric environment. I always wanted to know what was really going on behind the QFabric curtain and the moment Kurt mentioned he was able to see some of those details, I was totally hooked.

Short summary: QFabric works exactly as I’d predicted three months before the user-facing documentation became publicly available (the behind-the-scenes view described in this blog post is probably still hard to find).

This post is by no means a critique of QFabric. If anything, I’m delighted there’s still a networking vendor that can create innovative solutions without unicorn tears, relying instead on field-tested technologies ... which might, among other things, make the solution more stable.

It looks like a giant switch

When you log into the QFabric management IP address (VIP), it looks exactly like a giant switch – single configuration, single set of interfaces, show commands etc. All the familiar Junos configuration components are there: system group, interfaces, VLANs and protocols. The only really new component is the fabric object with node-group definitions (more on QFabric node groups).
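For illustration, here's roughly what the node-group part of the fabric stanza looks like – a sketch reconstructed from memory using node names from this test environment, so treat the exact hierarchy as an approximation and check the QFabric documentation for the authoritative syntax:

set fabric resources node-group RSNG01 node-device R2-19-Node0
set fabric resources node-group RSNG01 node-device R2-19-Node1
set fabric resources node-group NW-NG-0 network-domain
set fabric resources node-group NW-NG-0 node-device R3-19-Node2
set fabric resources node-group NW-NG-0 node-device R3-19-Node3

RSNG01 is a redundant server node group; NW-NG-0 is the network node group that runs the routing protocols with the outside world.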

However, every giant switch needs troubleshooting, which usually requires access to individual components; in QFabric's case it's the request component login command that unveils the really interesting world behind the curtain.

ip@test> request component login ?
Possible completions:
<node-name> Inventory name for the remote node
DRE-0 Diagnostic routing engine
IC-Left/RE0 Interconnect device control board
IC-Left/RE1 Interconnect device control board
IC-Right/RE0 Interconnect device control board
IC-Right/RE1 Interconnect device control board
FC-0 Fabric control
FC-1 Fabric control
FM-0 Fabric manager
NW-NG-0 Node group
R2-19-Node0 Node device
R2-19-Node1 Node device
R2-7-Node4 Node device
R2-7-Node5 Node device
R3-12-Node6 Node device
R3-12-Node7 Node device
R3-19-Node2 Node device
R3-19-Node3 Node device
RSNG01 Node group
RSNG02 Node group

The names of the physical entities (QF/Nodes, QF/Interconnects) can be either their serial numbers (the default) or user-configurable names (recommended).
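Assigning those user-friendly names is itself part of the fabric configuration. A sketch in set-command format (the serial numbers are made up, and the exact syntax might differ slightly from the QFabric documentation):

set fabric aliases node-device AB0123456789 R2-19-Node0
set fabric aliases interconnect-device CD0123456789 IC-Left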

As you can see, you can log in to individual physical devices, node groups, and virtual components like the fabric control and fabric manager instances. These virtual components run on the QF/Directors – CentOS boxes running KVM (you can log into the QF/Director Linux shell and see the virtual machines with ps -elf).
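A few standard Linux commands are enough to poke around a QF/Director shell. A sketch – I'm assuming libvirt is used to manage the KVM guests, which I haven't verified:

cat /etc/redhat-release          # confirm the CentOS release
ps -elf | egrep -i 'kvm|qemu'    # the component VMs show up as KVM/QEMU processes
virsh list --all                 # lists the guests if libvirt manages them (assumption)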

Each QF/Director is running a number of common services, including database (MySQL), DHCP, FTP, NTP, SSH, GFS, DLM (distributed lock manager), NFS and Syslog servers:

ip@QFabric> show fabric administration inventory director-group status 
Director Group Status Sat Aug 25 09:52:08 PDT 2012

Member Status Role Mgmt Address CPU Free Memory VMs Up Time
------ ------ -------- --------------- --- ----------- --- -------------
dg0 online master xxxxxxxxxxxx 10% 17642780k 4 3 days, 16:23 hrs
dg1 online backup xxxxxxxxxxxx 6% 20509268k 3 3 days, 16:13 hrs

Member Device Id/Alias Status Role
------ ---------------- ------- ---------
dg0 xxxxxxxxxxxxxxxx online master

Master Services
---------------
Database Server online
Load Balancer Director online
QFabric Partition Address online

Director Group Managed Services
-------------------------------
Shared File System online
Network File System online
Virtual Machine Server online
Load Balancer/DHCP online

Hard Drive Status
----------------
Volume ID:4 optimal
Physical ID:1 online
Physical ID:0 online
SCSI ID:1 100%
SCSI ID:0 100%

Size Used Avail Used% Mounted on
---- ---- ----- ----- ----------
423G 6.3G 395G 2% /
99M 20M 75M 21% /boot
93G 2.0G 91G 3% /pbdata

Director Group Processes
------------------------
Director Group Manager online
Partition Manager online
Software Mirroring online
Shared File System master online
Secure Shell Process online
Network File System online
DHCP Server master online master
FTP Server online
Syslog online
Distributed Management online
SNMP Trap Forwarder online
SNMP Process online
Platform Management online
[... rest deleted ...]

Lo and behold – it’s actually running BGP internally

After logging into one of the fabric control virtual machines, you can execute the show bgp summary fabric command, which clearly indicates that the control-plane protocol behind the scenes is multiprotocol BGP running numerous address families. Each fabric control VM runs BGP with all server and network node groups (not with individual QF/Nodes) and with all QF/Interconnects.

qfabric-admin@FC-0> show bgp summary fabric | no-more 
Groups: 2 Peers: 6 Down peers: 0
Unconfigured peers: 5
Table Tot Paths Act Paths Suppressed History Damp State Pending
bgp.l3vpn.0
42 18 0 0 0 0
Peer AS InPkt OutPkt OutQ Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
128.0.128.4 100 10517 10602 0 0 3d 6:43:58 Establ
bgp.l3vpn.0: 17/17/17/0
bgp.rtarget.0: 28/31/31/0
bgp.fabricvpn.0: 28/28/28/0
bgp.bridgevpn.0: 8/8/8/0
default.inet.0: 17/17/17/0
default.fabric.0: 19/19/19/0
128.0.128.8 100 10594 10593 0 0 3d 6:44:06 Establ
bgp.l3vpn.0: 0/18/18/0
bgp.rtarget.0: 1/32/32/0
bgp.fabricvpn.0: 0/103/103/0
bgp.bridgevpn.0: 0/9/9/0
default.inet.0: 0/18/18/0
default.fabric.0: 0/91/91/0
128.0.130.4 100 10466 10552 0 0 3d 6:35:42 Establ
bgp.rtarget.0: 0/4/4/0
bgp.fabricvpn.0: 34/34/34/0
bgp.bridgevpn.0: 0/0/0/0
default.fabric.0: 34/34/34/0
128.0.130.10 100 9751 9636 0 0 3d 1:04:34 Establ
bgp.rtarget.0: 0/4/4/0
bgp.fabricvpn.0: 34/34/34/0
bgp.bridgevpn.0: 0/0/0/0
default.fabric.0: 34/34/34/0
128.0.130.24 100 10432 10547 0 0 3d 6:18:09 Establ
bgp.l3vpn.0: 1/7/7/0
bgp.rtarget.0: 0/7/7/0
bgp.fabricvpn.0: 7/7/7/0
bgp.bridgevpn.0: 1/1/1/0
default.inet.0: 1/7/7/0
default.fabric.0: 4/4/4/0
128.0.130.26 100 10410 10545 0 0 3d 6:19:11 Establ
bgp.l3vpn.0: 0/0/0/0
bgp.rtarget.0: 0/4/4/0
bgp.fabricvpn.0: 0/0/0/0
bgp.bridgevpn.0: 0/0/0/0

Any other component (example: a QF/Interconnect) has two BGP sessions, one with each fabric control VM:

qfabric-admin@IC-Left> show bgp summary fabric 
Groups: 1 Peers: 2 Down peers: 0
Peer AS InPkt OutPkt OutQ Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
128.0.128.6 100 9663 9775 0 0 3d 1:16:27 Establ
bgp.rtarget.0: 28/32/32/0
bgp.fabricvpn.0: 61/61/61/0
bgp.bridgevpn.0: 0/0/0/0
default.fabric.0: 61/61/61/0
128.0.128.8 100 9667 9773 0 0 3d 1:16:23 Establ
bgp.rtarget.0: 0/32/32/0
bgp.fabricvpn.0: 0/61/61/0
bgp.bridgevpn.0: 0/0/0/0
default.fabric.0: 0/61/61/0

Edge node groups use six MP-BGP address families (l3vpn, rtarget, fabricvpn and bridgevpn, plus the default.inet.0 and default.fabric.0 tables); QF/Interconnects use just four – they don't carry the l3vpn and default.inet.0 routes.

The fabric control VMs act as BGP route reflectors (exactly as I predicted). You can easily verify that by inspecting any individual BGP entry on one of the node groups – you’ll see the Originator and Cluster List BGP attributes:

65534:1:192.168.13.37/32 (2 entries, 1 announced)
*BGP Preference: 170/-101
Route Distinguisher: 65534:1
Next hop type: Indirect
Address: 0x964f49c
Next-hop reference count: 6
Source: 128.0.128.6
Next hop type: Router, Next hop index: 131070
Next hop: 128.0.130.24 via dcfabric.0, selected
Label operation: PFE Id 7 Port Id 55
Label TTL action: PFE Id 7 Port Id 55
Session Id: 0x0
Next hop: 128.0.130.24 via dcfabric.0
Label operation: PFE Id 8 Port Id 55
Label TTL action: PFE Id 8 Port Id 55
Session Id: 0x0
Protocol next hop: 128.0.130.24:49160(NE_PORT)
Layer 3 Fabric Label 5
Composite next hop: 964f440 1738 INH Session ID: 0x0
Indirect next hop: 92c8d00 131072 INH Session ID: 0x0
State: <Active Int Ext>
Local AS: 100 Peer AS: 100
Age: 3d 6:54:40 Metric2: 0
Validation State: unverified
Task: BGP_100.128.0.128.6+33035
Announcement bits (1): 0-Resolve tree 1
AS path: I (Originator) Cluster list: 0.0.0.1
AS path: Originator ID: 128.0.130.24
Communities: target:65534:117440513(L3:1)
Import Accepted
Timestamp: 0x116
Route flags: arp
Route type: Host
Route protocol : arp
L2domain : 5
SNPA count: 1, SNPA length: 8
SNPA Type: Network Element Port SNPA
NE Port ID: 49160
Localpref: 100
Router ID: 128.0.128.6
Secondary Tables: default.inet.0
Composite next hops: 1
Protocol next hop: 128.0.130.24:49160(NE_PORT)
Layer 3 Fabric Label 5
Composite next hop: 964f440 1738 INH Session ID: 0x0
Indirect next hop: 92c8d00 131072 INH Session ID: 0x0
Indirect path forwarding next hops: 2
Next hop type: Router
Next hop: 128.0.130.24 via dcfabric.0
Session Id: 0x0
Next hop: 128.0.130.24 via dcfabric.0
Session Id: 0x0

Addressing

The QFabric control plane uses locally administered MAC addresses (note the 02: prefix – the locally administered bit set in the first byte) and the IP address block 128.0.0.0/16. You can see all the MAC and IP addresses with the show arp command executed on any of the internal components. The bme interfaces are the control-plane interfaces; the vlan interface is a user-facing SVI.

qfabric-admin@NW-NG-0> show arp 
MAC Address Address Name Interface Flags
00:13:dc:ff:72:01 10.73.2.9 10.73.2.9 vlan.501 none
02:00:00:00:40:01 128.0.0.1 128.0.0.1 bme0.2 permanent
02:00:00:00:40:02 128.0.0.2 128.0.0.2 bme0.2 permanent
02:00:00:00:40:05 128.0.0.4 128.0.0.4 bme0.0 permanent
02:00:00:00:40:05 128.0.0.5 128.0.0.5 bme0.1 permanent
02:00:00:00:40:05 128.0.0.5 128.0.0.5 bme0.2 permanent
02:00:00:00:40:05 128.0.0.6 128.0.0.6 bme0.0 permanent
02:00:00:00:40:07 128.0.0.7 128.0.0.7 bme0.1 permanent
02:00:00:00:40:07 128.0.0.7 128.0.0.7 bme0.2 permanent
02:00:00:00:40:08 128.0.0.8 128.0.0.8 bme0.1 permanent
02:00:00:00:40:08 128.0.0.8 128.0.0.8 bme0.2 permanent
02:00:00:00:40:09 128.0.0.9 128.0.0.9 bme0.1 permanent
02:00:00:00:40:09 128.0.0.9 128.0.0.9 bme0.2 permanent
[... rest deleted ...]

Look Ma! There are the labels!

In my blog post I predicted that QFabric uses MPLS internally. Without a 40 Gbps sniffer it's impossible to figure out whether an MPLS label stack is the exact encapsulation format QFabric is using, but it sure looks like MPLS from the outside.

The dcfabric interface uses mpls as one of the protocols:

qfabric-admin@RSNG01> show interfaces dcfabric.0 
Logical interface dcfabric.0 (Index 64) (SNMP ifIndex 1214251262)
Flags: SNMP-Traps Encapsulation: ENET2
Input packets : 0
Output packets: 0
Protocol inet, MTU: 1558
Flags: Is-Primary
Protocol mpls, MTU: 1546, Maximum labels: 3
Flags: Is-Primary
Protocol eth-switch, MTU: 0
Flags: Is-Primary

You can also see MPLS-like labels in numerous BGP entries, for example in the bridgevpn address family ...

65534:1:5.c8:e2:c3:01:78:8f/144               
*[BGP/170] 1w3d 15:28:00, localpref 100
AS path: I, validation-state: unverified
to 128.0.128.4 via dcfabric.0, Push 1730, Push 1, Push 55(top)
> to 128.0.128.4 via dcfabric.0, Push 1730, Push 2, Push 55(top)
[BGP/170] 1w3d 15:28:00, localpref 100, from 128.0.128.8
AS path: I, validation-state: unverified
to 128.0.128.4 via dcfabric.0, Push 1730, Push 1, Push 55(top)
> to 128.0.128.4 via dcfabric.0, Push 1730, Push 2, Push 55(top)

A similar stack of three labels appears in a host route pointing to a host connected to another QF/Node:

65534:1:10.73.2.9/32                
*[BGP/170] 3d 12:32:09, localpref 100
AS path: I, validation-state: unverified
> to 128.0.128.4 via dcfabric.0, Push 5, Push 1, Push 23(top)
[BGP/170] 3d 12:32:09, localpref 100, from 128.0.128.8
AS path: I, validation-state: unverified
> to 128.0.128.4 via dcfabric.0, Push 5, Push 1, Push 23(top)

IP prefixes directly connected to the QFabric have just one label – probably a pointer to an IP forwarding table entry.

65534:1:10.73.2.0/29                
*[BGP/170] 3d 12:31:59, localpref 101, from 128.0.128.4
AS path: I, validation-state: unverified
> to 128.0.128.4:129(NE_PORT), Layer 3 Fabric Label 5
[BGP/170] 3d 12:31:59, localpref 101, from 128.0.128.8
AS path: I, validation-state: unverified
> to 128.0.128.4:129(NE_PORT), Layer 3 Fabric Label 5

On the other hand, the MPLS routing and forwarding tables are empty, indicating that this is very probably not the MPLS we’re used to.
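For reference, these are the standard Junos commands that display those tables (just the commands – the outputs were empty, so there's nothing interesting to show):

show route table mpls.0
show route forwarding-table family mpls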

Summary

Behind the scenes, QFabric runs like any well-designed service provider network: a cluster of central servers provides common services (including DHCP, NFS, FTP, NTP and Syslog); BGP is used in the control plane to distribute customer prefixes (IP addresses, host/ARP routes and MAC addresses); and an MPLS-like encapsulation that can attach a label stack to a layer-2 frame or layer-3 datagram is used in the forwarding plane.
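To make the service-provider analogy concrete, this is roughly what an equivalent route-reflector setup looks like in plain Junos – a generic sketch with addresses borrowed from the outputs above (the mapping of addresses to components is my guess, and the fabric-specific address families obviously aren't configurable in standard Junos):

protocols {
    bgp {
        group fabric-clients {
            type internal;
            local-address 128.0.128.6;
            /* The cluster ID turns this IBGP speaker into a route reflector;
               0.0.0.1 matches the cluster list seen in the route printout above */
            cluster 0.0.0.1;
            /* Standard address families roughly corresponding to bgp.l3vpn.0 and bgp.rtarget.0 */
            family inet-vpn unicast;
            family route-target;
            neighbor 128.0.130.24;
            neighbor 128.0.130.4;
        }
    }
}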

The true magic of QFabric is the CLI VM, which presents the internal IP+MPLS-like network as a single switch without any OpenFlow or SDN magic. Wouldn't it be nice to have something similar in service provider networks?

2012-12-17: Comments are temporarily disabled, as a moron selling acne-reducing snake oil found this blog post interesting. Contact me using the 'Contact' link at the top of the page.

19 comments:

  1. Hi Ivan,
    May be this paper from Juniper will be of interest.
    http://www.juniper.net/us/en/local/pdf/whitepapers/2000443-en.pdf


    Replies
    1. Thank you! Excellent one ;)
  2. As usual, excellent work Ivan!
  3. Very nice! excellent post and information as always
  4. That is very, very, nice. I look forward to more reports regarding your experiences with it.
  5. Ivan,

    Are you perhaps being a little kind to yourself here?

    As an example....

    Ref this central comment in your original post.

    'They would likely keep the individual components in the QFabric pretty autonomous and use distributed processing while using QF/Director as the central management/configuration device (similar to UCS manager in Cisco UCS).'

    Ref the role of the Director from Juniper.

    'To draw parallels with a traditional chassis-based switch, the QFabric Director is equivalent to the supervisor module and routing engine.'

    I am sure you would agree the credit here should be going to the smart guys and girls at Juniper who created the architecture, design and code to make this happen. Like most embedded coders, they don't get to make a big noise on the internet about their rather clever work.
  6. Very timely post. I had in my "homework" list to figure out just what was going on under the hood after you pointed out to me that QFabric is a distributed (not centralized) control-plane. You did most of my homework for me, although I need to re-read this post a few times to digest it completely. As usual. ;-) Thank you.
  7. From Juniper

    The QFabric architecture subscribes to the “centralize what you can, distribute what you must” philosophy by implementing a distributed control plane in order to build scale-out networks.

    Network node group Routing Engine: NNG routing engine performs the routing engine functionality on the NNG QFabric Nodes as described earlier. It runs as an active/backup pair of VMs across the physically disjointed compute nodes in the QFabric Director cluster.

    QFabric Director compute clusters are composed of two compute nodes, each running identical software stacks.
    (Although the current implementation supports a maximum of two compute nodes per cluster, the architecture can theoretically support multiple servers.) Two of the compute nodes have a disk subsystem directly attached to them; the remaining compute nodes are diskless and therefore essentially stateless. Each disk subsystem consists of two 2 TB disks arranged in a RAID1 mirror configuration, and the contents across these subsystems are synchronously block replicated for redundancy.
  8. Can't you set up port mirroring to check the forwarding-plane headers? As the QFabric hardware is a 'switch' COTS ASIC rather than a fully programmable NPU like Trio, it seems likely that the forwarding plane wouldn't even support custom headers.
  9. Unbelievably awesome Ivan! The comment about having this in SP struck home.... I would seriously consider selling my soul to Scratch for the capability QFabric has to manage a distributed network in my 2 x 10^4 node work network.
  10. Every time I look at how you must do things in J vs how you must do them in C, or every time TCAM issues kill a switch (or so-called router in the case of the 7600 and N7K), it makes me want to get a hedge trimmer to cut the cables and a screwdriver, launch a world tour of our edge peering points and data centers, rip out every C device from peering router to L2 agg switch to leaf switch to spine switch to access router, and toss them out in the street. I'd really like to see coverage of MPLS VPN on the J MX series, and on EX and QFX series devices. Also... firewall filter policy tricks that just are not possible in Cisco ACLs.

  11. http://www.ietf.org/id/draft-ietf-l2vpn-evpn-01.txt
  12. What about http://www.heise.de/netze/rfc/rfcs/rfc5735.shtml#page-10

    ->128.0/16 to be allocated
    Replies
    1. The 128.0.128.0 address space is used internally by the fabric control protocol (which is based on BGP); it is not part of any external reachability information.

      You can see 128.0 addresses referenced when using the "show route fabric" command.

      But you will not see any 128.0 when using "show route" unless/until it is used on the Internet.
  13. Nice post Ivan. I had heard about the use of BGP & MPLS in QFabric. This post confirms it with more details.

    What's your view on SPB finding its way into DC fabrics from some of the vendors? It also has roots in the service provider world, with goals of achieving scale, ease of provisioning and O&M.
  14. Good stuff Ivan !
  15. Hi Ivan,

    I was wondering what would happen if you were to remove the MAC info from the equation. Instead of mapping L2 to L3 via a distributed ARP table... why not just remove all L2 from the equation and perform pure L3 based forwarding? Terminate ARP at the leafs and you have optimal L2/L3 any node to any node...
    Replies
    1. That would be ideal, but we both know that we have to support all sorts of crazy non-IP protocols (ex: FCoE :D ) and IP-based abominations that refuse to die (ex: Microsoft NLB).

      As much as I'd like L3 forwarding everywhere, when reality hits you, you have to implement a mix of L2 and L3.