A while ago Greg Ferro wrote a great article describing the integration of overlay and physical networks, in which he wrote that “an overlay network tunnel has no state in the physical network”. That triggered an almost-immediate reaction from Marten Terpstra (of RIPE fame, now @ Plexxi), who argued that the network (at least the first ToR switch) knows the MAC and IP address of the hypervisor host and thus has at least some state associated with the tunnel.
Marten is correct from a purely scholastic perspective (using his argument, the network keeps some state about TCP sessions as well), but what really matters is how much state is kept, which device keeps it, how it’s created and how often it changes.
How much state does a device keep?
The end hosts have to keep the state of every single TCP and UDP session, but most transit network devices (apart from abominations like NAT) don’t care about those sessions, which is what makes the Internet as fast as it is.
Decades ago we had a truly reliable system that kept session state in every single network node; it never lost a packet, but it barely coped with 2 Mbps links (the old-timers might remember it as X.25 ;).
The state granularity should get ever coarser as you go deeper into the network core: edge switches keep MAC address tables and ARP/ND caches of adjacent end hosts, core routers know about IP subnets, routers in the public Internet know about the publicly advertised prefixes (including every prefix Bell South ever assigned to one of its single-homed customers), while the high-speed MPLS routers know about BGP next hops and other forwarding equivalence classes (FECs).
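To make the "coarser state deeper in the network" idea a bit more tangible, here is a minimal sketch using Python's standard `ipaddress` module. The 192.0.2.0/24 documentation prefix and the variable names are purely illustrative assumptions; the point is only that 256 per-host entries at the edge can be represented by a single prefix in the core.

```python
import ipaddress

# What an edge device might track: one entry per adjacent host
# (illustrative addresses from the 192.0.2.0/24 documentation range).
host_routes = [ipaddress.ip_network(f"192.0.2.{i}/32") for i in range(256)]

# What a core device needs: the covering prefix(es) only.
core_routes = list(ipaddress.collapse_addresses(host_routes))

print(f"edge entries: {len(host_routes)}, core entries: {len(core_routes)}")
for prefix in core_routes:
    print("core prefix:", prefix)
```

Real networks rarely aggregate this cleanly, but the ratio between edge and core state is the thing to watch.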
Which device keeps the state?
A well-designed architecture has complexity (and state) concentrated at the network edge. The core devices keep minimum state (example: IP subnets), while the edge devices keep session state. In the virtual network case, the hypervisors should know the VM endpoints (MAC addresses, IP addresses, virtual segments) and the physical devices just the hypervisor IP addresses, not the other way round.
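A back-of-the-envelope sketch of that split might look like the following. Everything in it (the `overlay_state` function, the segment ID 5000, the VM naming scheme and the 10.0.x.y addressing) is a hypothetical illustration, not a model of any particular overlay product: the hypervisors hold one mapping per VM endpoint, while the physical network only has to reach the hypervisor addresses.

```python
def overlay_state(num_hypervisors: int, vms_per_hypervisor: int) -> dict:
    """Count state entries when VM endpoints are known only to hypervisors."""
    # Edge (hypervisor) state: (virtual segment, VM) -> hypervisor IP address.
    vm_to_hypervisor = {}
    for h in range(num_hypervisors):
        hypervisor_ip = f"10.0.{h // 256}.{h % 256}"
        for v in range(vms_per_hypervisor):
            vm_to_hypervisor[(5000, f"vm-{h}-{v}")] = hypervisor_ip

    # Physical network state: only the tunnel endpoints (hypervisor IPs).
    physical_hosts = set(vm_to_hypervisor.values())

    return {
        "hypervisor_entries": len(vm_to_hypervisor),
        "physical_entries": len(physical_hosts),
    }


result = overlay_state(num_hypervisors=40, vms_per_hypervisor=50)
print(result)
```

With 40 hypervisors running 50 VMs each, the software edge carries 2,000 endpoint entries while the physical switches deal with 40 addresses.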
Furthermore, as much state as possible should be stored in low-speed devices using software-based forwarding. It’s pretty simple to store a million flows in the software-based Open vSwitch (updating them is a different story) and mission impossible to store 10,000 5-tuple flows in the Trident 2 chipset used by most ToR switches.
How is state created?
Systems with control-plane (proactive) state creation (example: routing table built from routing protocol information) are always more scalable than systems that have to react to data-plane events in real time (example: MAC address learning or NAT table maintenance).
Data-plane-driven state is particularly problematic for devices with hardware forwarding – packets that change state (example: TCP SYN packets creating a new NAT translation) might have to be punted to the CPU, or you might have to implement state maintenance in hardware, which is expensive.
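The difference between the two approaches can be sketched in a few lines of Python. Both classes below (`ProactiveRouter`, `LearningSwitch`) and their `state_changes` counters are hypothetical toy models written for this comparison, not how any real forwarding engine is implemented: the router's table changes only when the control plane installs a route, while the learning switch may have to modify its table on any frame it forwards.

```python
import ipaddress


class ProactiveRouter:
    """State is installed by the control plane; forwarding only reads it."""

    def __init__(self):
        self.fib = {}             # prefix -> next hop
        self.state_changes = 0

    def install(self, prefix: str, next_hop: str) -> None:
        # Called by the routing protocol, never by the data plane.
        self.fib[ipaddress.ip_network(prefix)] = next_hop
        self.state_changes += 1

    def forward(self, dst: str):
        addr = ipaddress.ip_address(dst)
        matches = [p for p in self.fib if addr in p]
        if not matches:
            return None
        # Longest-prefix match.
        return self.fib[max(matches, key=lambda p: p.prefixlen)]


class LearningSwitch:
    """State is created reactively, as a side effect of forwarding."""

    def __init__(self):
        self.mac_table = {}       # MAC address -> port
        self.state_changes = 0

    def forward(self, src_mac: str, dst_mac: str, in_port: int):
        # Any frame may create or change state (the expensive part in hardware).
        if self.mac_table.get(src_mac) != in_port:
            self.mac_table[src_mac] = in_port
            self.state_changes += 1
        return self.mac_table.get(dst_mac, "flood")


router = ProactiveRouter()
router.install("10.0.0.0/8", "A")
router.install("10.1.0.0/16", "B")
hops = [router.forward(d) for d in ("10.1.2.3", "10.2.3.4", "192.0.2.1")]

switch = LearningSwitch()
ports = [
    switch.forward("aa", "bb", 1),
    switch.forward("bb", "aa", 2),
    switch.forward("aa", "bb", 1),
]
print(hops, router.state_changes, ports, switch.state_changes)
```

In the toy router, forwarding any number of packets leaves `state_changes` untouched; in the toy switch, every previously unseen source address (or host move) is a write to the forwarding table.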
Finally, there’s the square circle aka “soft state” – cases where the protocol designers needed state in the network, but didn’t want to create a proper protocol to maintain it, so the end devices get burdened with periodic state refresh messages, and the transit devices spend CPU cycles refreshing the state. RSVP is a typical example, and everyone running large-scale MPLS/TE networks simply loves the periodic refresh messages sent by tunnel head-ends – they keep the core routers processing them cozily warm.
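To get a feeling for the refresh overhead, consider this simplified soft-state table. The `SoftStateTable` class, the 30-second refresh interval, the 90-second lifetime and the 1,000 tunnels are illustrative assumptions chosen for round numbers; the sketch does not model actual RSVP message processing.

```python
class SoftStateTable:
    """Soft state: entries disappear unless the sender keeps refreshing them."""

    def __init__(self, lifetime: float):
        self.lifetime = lifetime
        self.expires_at = {}            # tunnel id -> expiry time
        self.refreshes_processed = 0

    def refresh(self, tunnel_id: str, now: float) -> None:
        # Every refresh costs the transit device some work,
        # even though nothing has actually changed.
        self.expires_at[tunnel_id] = now + self.lifetime
        self.refreshes_processed += 1

    def purge(self, now: float) -> list:
        # Remove and return entries whose lifetime has run out.
        expired = [t for t, deadline in self.expires_at.items() if deadline <= now]
        for t in expired:
            del self.expires_at[t]
        return expired


REFRESH_INTERVAL = 30   # seconds (assumed value)
LIFETIME = 90           # seconds (assumed value)

table = SoftStateTable(LIFETIME)
tunnels = [f"tunnel-{i}" for i in range(1000)]

# One hour of steady state: nothing changes, yet every tunnel is refreshed.
for now in range(0, 3600, REFRESH_INTERVAL):
    for tunnel in tunnels:
        table.refresh(tunnel, now)
    table.purge(now)

print("refresh messages processed in one hour:", table.refreshes_processed)
```

Under these assumptions a single transit node handles 120,000 refreshes per hour just to keep 1,000 unchanged entries alive, and all of them vanish one lifetime after the head-ends stop talking.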
How often does state change?
Devices with slow-changing state (example: BGP routers) are obviously more stable than devices with fast-changing state (example: Carrier-Grade NAT). The proof is left as an exercise for the reader.
Whenever you’re evaluating a network architecture or reading a vendor whitepaper describing the next-generation unicorn-tears-blessed solution, try to identify how much state individual components keep, how it’s created and how often it changes. Hardware devices storing plenty of state tend to be complex and expensive (keep that in mind when evaluating the next application-aware fabric).
Not surprisingly, RFC 3439 (Some Internet Architectural Guidelines and Philosophy) gives you similar advice, although in a way more eloquent form.