Host-to-Network Multihoming Kludges

Continuing our routing-on-hosts discussions, Enno Rey (of the Troopers and IPv6 security fame) made another interesting remark: “years ago we were so happy when we finally got rid of gated on Solaris”. I countered with “there are still people who fondly remember the days of running gated on Solaris” because it’s a nice solution to the host-to-network multihoming problem.

Quoting RFC 1925, “It’s easier to move a problem around than to solve it”, and people have been extremely good at moving this particular problem around for decades.

To set the context: imagine a host connected to two or more network edge devices (example: server connected to two ToR switches) and offering a service that has to be reachable from the outside. How do you make it work?

The correct solution is obvious (in hindsight):

  • Assign a different IP address to each interface to get small layer-2 domains (IP subnet is assigned to a single switch) and layer-3 scalability through address summarization;
  • Use session layer to establish a mapping between service names and transport/network addresses.

Unfortunately, this solution was deemed religiously incorrect and so the TCP/IP stack still doesn’t have a proper session layer.

Multipath TCP is a step in the right direction. Unfortunately, while it does provide seamless multipathing it doesn’t solve the multihomed service problem.

Some people decided to solve the problem in the application layer. New age solutions use scale-out application architecture where you don’t care about the availability of a particular service instance (and consequently don’t need host-to-network multihoming) because there are always multiple instances of the same service. You could also use a nasty kludge like happy eyeballs and simply try all service endpoints advertised via DNS in parallel.
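
To make the "try all endpoints in parallel" kludge concrete, here's a minimal Python sketch. The function name `connect_first` and the timeout value are made up for this example, and the code is the naive race described above, not the staggered algorithm that real Happy Eyeballs implementations use.

```python
import asyncio


async def connect_first(endpoints, timeout=2.0):
    """Race TCP connects to all (host, port) pairs; return the first success."""

    async def attempt(host, port):
        reader, writer = await asyncio.open_connection(host, port)
        return (host, port), reader, writer

    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    pending = {asyncio.create_task(attempt(h, p)) for h, p in endpoints}
    winner = None

    while pending and winner is None:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        done, pending = await asyncio.wait(
            pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            if task.exception() is not None:
                continue  # this endpoint failed; keep waiting for the others
            result = task.result()
            if winner is None:
                winner = result
            else:
                result[2].close()  # a later success; one connection is enough

    # Abandon the attempts that are still in flight.
    for task in pending:
        task.cancel()
    late = await asyncio.gather(*pending, return_exceptions=True)
    for result in late:
        if isinstance(result, tuple):
            result[2].close()  # connected just before the cancel landed

    if winner is None:
        raise ConnectionError("no endpoint accepted a connection")
    return winner  # ((host, port), reader, writer)
```

You'd call it with whatever addresses DNS returned for the service, for example `asyncio.run(connect_first([("192.0.2.10", 443), ("198.51.100.10", 443)]))` (documentation addresses, not real servers). Note what the sketch costs you: every client opens a connection to every endpoint just to find one that works, which is exactly why it's a kludge.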

The traditional way of “solving” this problem was to push it down the stack until it landed in the networking land, and because people writing server operating systems could get link bonding to work with a hub sitting at their desktop (I’m obviously exaggerating but you get the idea) they expected the same behavior from the network regardless of its underlying topology.

End result: VLANs spanning at least two switches, sometimes even whole data centers (to support VM mobility), resulting in expensive and brittle kludges like layer-2 fabrics or multi-chassis link aggregation.

Now imagine you’d use some service advertisement protocol to allow the host to advertise its services even in legacy environments where a service is tied to a fixed IP address.

Yeah, I know we had that in Novell IPX, and no, don’t get me started on SAP chattiness. Too bad the IPv6 people took only the address autoconfiguration idea from IPX and not the other goodies it had.

Actually, we’ve had the solution for years: DNS SRV records. Too bad the application and middleware developers never heard about them.
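
For those who never heard about them: an SRV record maps a service name (something like `_sip._tcp.example.com`) to a priority, a weight, a port and a target host, and the client is expected to try lower-priority values first and spread the load by weight within the same priority. Here's a rough Python sketch of that selection logic. The `SrvRecord` type and `order_srv` function are invented for this example, the weighted pick is a simplification of the RFC 2782 procedure, and fetching the records is left to whatever resolver library you use.

```python
import random
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int  # lower value is tried first
    weight: int    # relative share within the same priority
    port: int
    target: str


def order_srv(records, rng=random):
    """Return the records in the order a client should try them."""
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            weights = [r.weight for r in group]
            if sum(weights) == 0:
                pick = rng.choice(group)  # all weights zero: plain random pick
            else:
                # Simplification: zero-weight records are only picked once the
                # weighted ones in this priority group are used up.
                pick = rng.choices(group, weights=weights)[0]
            group.remove(pick)
            ordered.append(pick)
    return ordered


if __name__ == "__main__":
    # Hypothetical records for a service offered by a multihomed host.
    records = [
        SrvRecord(20, 0, 5060, "backup.example.com"),
        SrvRecord(10, 60, 5060, "a.example.com"),
        SrvRecord(10, 40, 5060, "b.example.com"),
    ]
    for record in order_srv(records):
        print(record.priority, record.target, record.port)
```

The point is that the client walks the list until something answers, so a service behind two addresses (or two hosts) stays reachable without any help from the network.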

Trying to solve the problem in the network layer, you could invent a whole new protocol to get the job done, or you could use BGP (because it’s widely available and usually works) to advertise the server loopback address to the network. You could even go a step further and run the latest version of FRR on the Linux server and use unnumbered physical interfaces with an adjacent Cumulus Linux switch for truly scalable plug-and-play networking, as Dinesh Dutt described in the Leaf-and-Spine Fabric Designs webinar.

Before you write a comment – I’m well aware I just described ES-IS, another great idea that was deemed religiously incorrect because it was invented by the wrong standards body.

17 comments:

  1. Aside from the fact that a lot of applications would need to be re-written, I don't know why SCTP hasn't really taken off, except within the Telecoms sector.
    Replies
    1. Because SCTP requires a new API, so all applications have to be redesigned, recompiled, and redeployed. MPTCP is transparent to the legacy applications. However, sometimes this is a disadvantage. And SCTP still does not have a widely accepted concurrent multipath solution. Normally, it can use only one active path.
    2. To be more precise: with all networking libraries I've seen so far you either have to specify which transport protocol you want to use (TCP or UDP) or can't specify it at all.

      Introducing SCTP thus causes application-level changes, not to mention the inability to get it through many firewalls (at all - I'm not even talking about new firewall rules).

      I wrote about the problems with SCTP a while ago (and earned quite a few "you're an idiot" accolades on Reddit or wherever it was reposted not so long ago):

      http://blog.ipspace.net/2009/08/what-went-wrong-sctp.html
    3. SCTP is the favorite protocol inside an IMS. It is used for all the server-to-server interfaces for SIP and Diameter. It is not used for the clients, since it requires much more CPU than UDP, and for millions of devices you might not want to pay extra money for this required capacity extension. Usually, it is good enough for a single client to call again if something went wrong. However, in the IMS core you need quick seamless failover, and this is delivered by SCTP. The API changes are not a problem for the Telco vendors, since they own the source code and always recompile their products anyhow.
    4. If you need sub-20 or sub-10 ms failover in some safety-critical domains, then you have to use active-active multiple-copy multi-path transport. Both MPTCP and mSCTP could potentially deliver it, but sometimes you would do it at the application layer instead. One example is the linked session specification in EUROCAE ED-137. The mSCTP variant of SCTP is not widely implemented yet, so you might not be able to use it easily.
    5. To reach sub 20msec or sub-10 msec failover, you probably need to actively duplicate packets over disjoint paths. MPTCP could be extended to do that, but it is also possible to do this with layer 3 solutions like segment routing and regular TCP, see http://inl.info.ucl.ac.be/publications/traffic-duplication-through-segmentable-disjoint-paths
  2. IPv6 started well by becoming v2 of IPX. And then came some fashion people and destroyed it... :-(
    Now networking is totally a fashion industry, always taking the old clothes and re-purposing as something totally new... :-)
    Look at Named Data Networking (NDN) if you want to be fashionable! :-)
  3. "New age solutions use scale-out application architecture" :

    DNS has had this architecture since its inception in 1985.
  4. Please don't ever stop doing what you do Mr. Ivan P. At times you may feel like your efforts are futile. But, some people are definitely listening and taking notes.
  5. MPTCP contemplates multihomed services. However, due to NAT they decided to focus on implementing MPTCP on the client side first. It will probably not be feasible to implement it on the server side on the IPv4 side but it might be doable on IPv6 if firewalls start to understand the new MPTCP headers and allow new subflows initiated by the server on an existing flow.

    However, I would rather solve the problem by running BGP on the host. Oh, wait. I am already doing that and it just works™ ;)
    Replies
    1. In any case, MPTCP solves two different problems with some overlapping.
    2. Apple uses Multipath TCP servers to support all the iPhones and iPads that use Siri. This is a significant deployment of Multipath TCP on the server side.
    3. I'm not convinced that combining loopback addresses on the servers with unnumbered physical interfaces is the best approach because it relies too much on the network. With this approach, the network will always forward the packets along the shortest path to the loopback address of the host. If the host is connected to two different switches for redundancy, then it means that only one link will be used. If the link fails, you'll have to wait for the IGP convergence (or worse iBGP if you put BGP on the hosts) to recover from the failure.

      A different but more powerful design is to use a loopback address on the host and regular IP addresses on the physical interfaces with Multipath TCP. The loopback address is used as a rendez-vous point to establish the initial subflow but Multipath TCP advertises the addresses of the physical interfaces and traffic automatically flows along them. If there is not enough bandwidth on one interface, Multipath TCP will use the second. If one link fails, Multipath TCP will recover within a rtt. I've heard some people using this approach with CEPH servers and they seemed to be pretty happy with it.
    4. Sure, but you forget that the servers can't initiate subflows. Only the client can and that means that even if the server is multi-homed it can't take advantage of the multiple IPs without some sort of dynamic routing.

      The only way a server could initiate sub flows would be if middle boxes sitting on ISPs doing NAT and security would understand MPTCP headers so they could match existing flow entries with new subflows.
    5. The protocol spec allows both clients and servers to create subflows. The default path managers in the Linux kernel implementation assume that only clients will create subflows, but this can be changed by writing another path manager. See http://inl.info.ucl.ac.be/publications/smapp-towards-smart-multipath-tcp-enabled-applications for a user space path manager that allows a daemon to manage the subflows used by MPTCP
  6. How about BGP in-host and peering to ToR to solve this...kind of like an ESG (vmware)
    Replies
    1. How about this sentence in one of the last paragraphs: "Trying to solve the problem in the network layer ... you could use BGP" ;)