Host-to-Network Multihoming Kludges
Continuing our routing-on-hosts discussions, Enno Rey (of Troopers and IPv6 security fame) made another interesting remark: “years ago we were so happy when we finally got rid of gated on Solaris.” I countered with “there are still people who fondly remember the days of running gated on Solaris,” because it’s a nice solution to the host-to-network multihoming problem.
To set the context: imagine a host connected to two or more network edge devices (example: server connected to two ToR switches) and offering a service that has to be reachable from the outside. How do you make it work?
The correct solution is obvious (in hindsight):
- Assign a different IP address to each interface to get small layer-2 domains (each IP subnet is confined to a single switch) and layer-3 scalability through address summarization;
- Use the session layer to establish a mapping between service names and transport/network addresses.
Unfortunately, this solution was deemed religiously incorrect and so TCP/IP stack still doesn’t have a proper session layer.
Some people decided to solve the problem in the application layer. New-age solutions use scale-out application architectures where you don’t care about the availability of any particular service instance (and consequently don’t need host-to-network multihoming) because there are always multiple instances of the same service. You could also use a nasty kludge like happy eyeballs and simply try all service endpoints advertised via DNS in parallel.
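To illustrate the happy-eyeballs approach, here’s a minimal sketch using Python’s asyncio, which can race connection attempts across the addresses a name resolves to instead of trying them one by one; the host name service.example.com and port 8080 are made-up placeholders.

```python
# Minimal happy-eyeballs-style sketch (placeholder host name and port).
import asyncio

async def connect():
    # happy_eyeballs_delay (Python 3.8+) starts the next connection attempt
    # 250 ms after the previous one instead of waiting for it to fail.
    reader, writer = await asyncio.open_connection(
        "service.example.com", 8080, happy_eyeballs_delay=0.25
    )
    print("connected to", writer.get_extra_info("peername"))
    writer.close()
    await writer.wait_closed()

asyncio.run(connect())
```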
The traditional way of “solving” this problem was to push it down the stack until it landed in networking land, and because the people writing server operating systems could get link bonding to work with a hub sitting on their desk (I’m obviously exaggerating, but you get the idea), they expected the same behavior from the network regardless of its underlying topology.
End result: VLANs spanning at least two switches, sometimes even whole data centers (to support VM mobility), resulting in expensive and brittle kludges like layer-2 fabrics or multi-chassis link aggregation.
Now imagine you’d use some service advertisement protocol to allow the host to advertise its services even in legacy environments where a service is tied to a fixed IP address.
Actually, we’ve had a solution for years: DNS SRV records. Too bad the application and middleware developers have never heard of them.
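Just to show how little code it would take, here’s a sketch of an SRV lookup using the third-party dnspython library (pip install dnspython); the service name _myapp._tcp.example.com is a made-up placeholder.

```python
# Resolve SRV records and order the endpoints by priority/weight.
import dns.resolver

answers = dns.resolver.resolve("_myapp._tcp.example.com", "SRV")
# Lower priority is preferred; weight could be used for load sharing.
for rr in sorted(answers, key=lambda r: (r.priority, -r.weight)):
    print(f"try {rr.target.to_text()} port {rr.port} "
          f"(priority {rr.priority}, weight {rr.weight})")
```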
Trying to solve the problem in the network layer, you could invent a whole new protocol to get the job done, or you could use BGP (because it’s widely available and usually works) to advertise the server’s loopback address to the network. You could even go a step further, run the latest version of FRR on the Linux server, and use unnumbered physical interfaces with the adjacent Cumulus Linux switch for truly scalable plug-and-play networking, like Dinesh Dutt described in the Leaf-and-Spine Fabric Designs webinar.
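Here’s roughly what that could look like on the server side: a hedged FRR configuration sketch, assuming two uplinks named eth0/eth1, loopback 192.0.2.1/32, and ASN 65101 (all placeholders, not taken from Dinesh’s design).

```
interface lo
 ip address 192.0.2.1/32
!
router bgp 65101
 ! BGP unnumbered: peer with whatever router answers on the physical links
 neighbor eth0 interface remote-as external
 neighbor eth1 interface remote-as external
 !
 address-family ipv4 unicast
  ! advertise only the loopback (service) address into the fabric
  network 192.0.2.1/32
 exit-address-family
```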
Introducing SCTP thus requires application-level changes, not to mention the impossibility of getting it through many firewalls at all (I'm not even talking about new firewall rules).
I wrote about the problems with SCTP a while ago (and earned quite a few "you're an idiot" accolades on Reddit or wherever it was reposted not so long ago):
http://blog.ipspace.net/2009/08/what-went-wrong-sctp.html
Now networking is totally a fashion industry, always taking the old clothes and repurposing them as something totally new... :-)
Look at Named Data Networking (NDN) if you want to be fashionable! :-)
DNS has had this architecture since its inception in 1983.
However, I would rather solve the problem by running BGP on the host. Oh, wait. I am already doing that and it just works™ ;)
A different but more powerful design is to use a loopback address on the host and regular IP addresses on the physical interfaces, combined with Multipath TCP. The loopback address serves as a rendezvous point to establish the initial subflow, but Multipath TCP advertises the addresses of the physical interfaces and traffic automatically flows over them. If there is not enough bandwidth on one interface, Multipath TCP will use the second one; if one link fails, Multipath TCP recovers within one RTT. I've heard of some people using this approach with Ceph servers, and they seemed to be pretty happy with it.
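For what it’s worth, opening an MPTCP socket needs almost no application changes; here’s a minimal sketch assuming a Linux kernel with MPTCP support (5.6+) and a placeholder rendezvous address 192.0.2.1.

```python
# Minimal MPTCP client sketch (placeholder address; Linux-only).
import socket

# Recent Python versions expose socket.IPPROTO_MPTCP on Linux;
# fall back to the raw Linux protocol number (262) otherwise.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
s.connect(("192.0.2.1", 80))   # initial subflow toward the loopback (rendezvous) address
# Additional subflows over the physical interfaces are negotiated by the
# kernel path manager; the application code doesn't change beyond this point.
s.sendall(b"GET / HTTP/1.0\r\n\r\n")
print(s.recv(1024))
s.close()
```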
The only way a server could initiate subflows would be if the middleboxes doing NAT and security at the ISPs understood MPTCP headers, so they could match new subflows with existing flow entries.