This Is Not the Host Route You’re Looking For

When describing Hyper-V Network Virtualization packet forwarding I briefly mentioned that the hypervisor switches create (an equivalent of) a host route for every VM they need to know about, prompting some readers to question the scalability of such an approach. As it turns out, layer-3 switches have been doing the same thing under the hood for years.

How We Think It Works

The IP forwarding process is traditionally explained along these lines:

  • The destination IP address is looked up in the IP forwarding table (FIB), resulting in an IP next hop or a connected interface (in which case the next hop is the destination IP address itself);
  • The ARP cache is looked up to find the MAC address of the IP next hop.

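In Python-flavored pseudocode (a deliberately naive sketch of the textbook model; the tables, addresses, and names are mine, not taken from any real implementation) the two-step process would look like this:

import ipaddress

# Textbook model: two independent tables, two lookups per packet
fib = {                                           # prefix -> IP next hop (None = connected)
    ipaddress.ip_network("0.0.0.0/0"): "10.11.12.1",
    ipaddress.ip_network("10.11.12.0/24"): None,
}
arp_cache = {                                     # IP next hop -> MAC address
    "10.11.12.1": "02:00:0b:0c:00:01",
    "10.11.12.4": "02:00:0b:0c:00:04",
}

def forward(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    # Step 1: longest-prefix match in the FIB
    best = max((p for p in fib if addr in p), key=lambda p: p.prefixlen)
    next_hop = fib[best] or dst                   # connected: next hop is the destination itself
    # Step 2: ARP cache lookup for the next-hop MAC address
    return arp_cache[next_hop]

print(forward("10.11.12.4"))                      # directly connected host
print(forward("192.0.2.1"))                       # follows the default route
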
According to this explanation, the IP FIB contains the prefixes copied from the IP routing table. However, this is not how most layer-3 switches work.

How It Actually Works

I’ll use a recent implementation of Cisco Express Forwarding (CEF) to illustrate what’s really going on behind the scenes. The printouts were taken from vIOS running within Cisco CML (it’s great to have cloud-based routers when you can’t access your home lab due to a 10-day-long power outage).

This is the routing table I had on the router (the static host route and the default route were installed through DHCP):

R1#show ip route 10.11.12.0 longer
[…]
Gateway of last resort is 10.11.12.1 to network 0.0.0.0

10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
C 10.11.12.0/24 is directly connected, GigabitEthernet0/0
S 10.11.12.2/32 [254/0] via 10.11.12.1, GigabitEthernet0/0
L 10.11.12.3/32 is directly connected, GigabitEthernet0/0

The CEF table closely reflects the IP routing table, but there are already a few extra entries – receive entries for the subnet and broadcast addresses, and an attached entry for the first-hop gateway:

R1#show ip cef | include 10.11.12  
10.11.12.0/24 attached GigabitEthernet0/0
10.11.12.0/32 receive GigabitEthernet0/0
10.11.12.1/32 attached GigabitEthernet0/0
10.11.12.2/32 10.11.12.1 GigabitEthernet0/0
10.11.12.3/32 receive GigabitEthernet0/0
10.11.12.255/32 receive GigabitEthernet0/0

Now let’s ping a directly connected host …

R1#ping 10.11.12.4
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.11.12.4, timeout is 2 seconds:
.!!!!

… and there’s an extra entry in the CEF table:

R1#show ip cef | include 10.11.12  
10.11.12.0/24 attached GigabitEthernet0/0
10.11.12.0/32 receive GigabitEthernet0/0
10.11.12.1/32 attached GigabitEthernet0/0
10.11.12.2/32 10.11.12.1 GigabitEthernet0/0
10.11.12.3/32 receive GigabitEthernet0/0
10.11.12.4/32 attached GigabitEthernet0/0
10.11.12.255/32 receive GigabitEthernet0/0

Wait, What?

Does that mean that the ping command created an extra entry in the CEF table? Of course not – but it did trigger the ARP process, which indirectly created a new adjacency in the CEF table (these adjacencies don’t expire because Cisco IOS keeps refreshing its ARP entries). The ARP-generated adjacency looks exactly like any other host route, although various fields in the detailed CEF printout reveal that it’s an adjacency route:

R1#show ip cef 10.11.12.4 internal
10.11.12.4/32, epoch 0, flags [att], refcnt 5, per-destination sharing
  sources: Adj
  subblocks:
    Adj source: IP adj out of GigabitEthernet0/0, addr 10.11.12.4 0D0AB300
      Dependent covered prefix type adjfib, cover 10.11.12.0/24
  ifnums:
    GigabitEthernet0/0(2): 10.11.12.4
  path list 0D14E48C, 3 locks, per-destination, flags 0x4A [nonsh, rif, hwcn]
    path 0D581C30, share 1/1, type adjacency prefix, for IPv4
      attached to GigabitEthernet0/0, IP adj out of GigabitEthernet0/0, addr 10.11.12.4 0D0AB300
  output chain:
    IP adj out of GigabitEthernet0/0, addr 10.11.12.4 0D0AB300
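
In other words, there’s a single lookup structure: a connected prefix starts out as a glean entry (punt to the CPU, trigger ARP), and a successful ARP resolution installs a /32 entry pointing straight at a complete adjacency, so a single longest-prefix match yields everything needed to rewrite the packet. Here’s a rough Python sketch of that idea (my simplification, not a description of Cisco’s actual data structures):

import ipaddress

# Single-table model: prefixes and /32 host routes share one FIB,
# and every entry resolves directly into a forwarding action
fib = {
    ipaddress.ip_network("0.0.0.0/0"):     ("adjacency", "rewrite to MAC of 10.11.12.1"),
    ipaddress.ip_network("10.11.12.0/24"): ("glean", None),    # connected, no adjacency yet
    ipaddress.ip_network("10.11.12.3/32"): ("receive", None),  # router's own address
}

def forward(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    best = max((p for p in fib if addr in p), key=lambda p: p.prefixlen)
    entry_type, rewrite = fib[best]
    if entry_type == "adjacency":
        return rewrite                            # hardware fast path
    if entry_type == "receive":
        return "deliver to CPU"
    # Glean: punt the packet, ARP for the destination, and install
    # a /32 adjacency entry (exactly what the ping above triggered)
    fib[ipaddress.ip_network(dst + "/32")] = ("adjacency", "rewrite to MAC of " + dst)
    return "punted while ARP completes"           # the '.' in the ping output

print(forward("10.11.12.4"))                      # glean: first packet lost (the '.')
print(forward("10.11.12.4"))                      # /32 adjacency installed ('!')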

How Do We Know Hardware Switches Work the Same Way?

Obviously it’s impossible to claim with any certainty how a particular switch works without seeing the hardware specs (mission impossible for vendor ASICs as well as Broadcom’s merchant silicon).

Some switch vendors still talk about IP routing entries and ARP entries; others (for example, Cisco in its Nexus 3000 documentation) already use IP prefix and IP host entry terminology. I chose the Nexus 3000 for a reason: many data center switches use the same chipset and thus probably use the same forwarding techniques.

Intel is way more forthcoming than Broadcom – the FM4000 data sheet contains plenty of details about its forwarding architecture, and if I understand it correctly, an IP forwarding table lookup must result in an ARP entry (which means that the IP forwarding table must contain host routes).
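
Reading between the lines, a plausible hardware arrangement (this split is my assumption based on common merchant-silicon patterns, not something the FM4000 data sheet states in these words) is an exact-match host table consulted before the LPM prefix table, with connected prefixes pointing at a punt/glean action:

import ipaddress

# Hypothetical two-stage hardware lookup: exact-match host table first,
# longest-prefix-match table second; every successful lookup must end
# in rewrite (ARP) information
host_table = {                                    # one entry per attached IP host
    ipaddress.ip_address("10.11.12.1"): "rewrite to MAC of 10.11.12.1",
    ipaddress.ip_address("10.11.12.4"): "rewrite to MAC of 10.11.12.4",
}
lpm_table = {
    ipaddress.ip_network("10.11.12.0/24"): "glean (punt to CPU, trigger ARP)",
    ipaddress.ip_network("0.0.0.0/0"):     "rewrite to MAC of 10.11.12.1",
}

def hw_lookup(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    if addr in host_table:                        # stage 1: exact match on hosts
        return host_table[addr]
    best = max((p for p in lpm_table if addr in p), key=lambda p: p.prefixlen)
    return lpm_table[best]                        # stage 2: longest-prefix match

print(hw_lookup("10.11.12.4"))                    # resolved by the host table
print(hw_lookup("192.0.2.1"))                     # falls through to the default route

Either way you slice it, the switch ends up needing a host-level entry for every attached host it forwards traffic to.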

Summary

Hardware layer-3 switches need an IP forwarding entry for every attached IP host, although some vendors might call these entries ARP entries. Virtual layer-3 switches are no different, and might use totally different terminology for the host route forwarding entries to confuse the casual reader.

3 comments:

  1. Many vendors call these 'glean' routes as they are the glue between L2 and L3.

    In EOS, by the way, it's explicitly shown via "show ip route hosts" (the software view of it); for the hardware view you'd use "show platform (platformtype) l3 hardware routes host" and "show platform (platformtype) l3 hardware routes host lpm".

  2. I am not sure we are comparing apples to apples. A hypervisor has to learn the ARP entries of all the VMs in the data center, whereas a hardware device keeps ARP entries only for its connected interfaces and doesn't have to learn ARP entries for hosts that aren't on a connected interface.
    Replies
    1. A hypervisor needs to know the ARP entries of all VMs in all routing domains present on that hypervisor. That might be way fewer than all VMs in the data center. The proof is left as an exercise for the reader ;)

      A ToR switch needs to know the ARP entries of "connected" devices, which might be many more than it seems in the case of optimal L3 forwarding (like Arista's VARP or Enterasys' Fabric Routing).