Path Failure Detection on Multi-Homed Servers

TL;DR: Installing an Ethernet NIC with two uplinks in a server is easy1. Connecting those uplinks to two edge switches is common sense2. Detecting physical link failure is trivial in the Gigabit Ethernet world. Deciding between two independent uplinks or a link aggregation group is interesting. Detecting path failure and disabling the useless uplink that causes traffic blackholing is a living hell (more details in this Design Clinic question).

Want to know more? Let’s dive into the gory details.

Imagine you have a server with two uplinks connected to two edge switches. You want to use one or both uplinks3 but don’t want to send the traffic into a black hole, so you have to know whether the data path between your server and its peers is operational.

The most trivial scenario is a link failure. The Ethernet Network Interface Card (NIC) detects the failure and reports it to the operating system kernel; the link is disabled, and all the outgoing traffic takes the other link.
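
A rough sketch of the Linux side of that behavior, using the bonding driver in active-backup mode with MII (carrier) monitoring; the interface names and the 100 ms polling interval are made-up values:

  # Active-backup bond that polls the carrier state every 100 ms
  ip link add bond0 type bond mode active-backup miimon 100
  # Enslave both uplinks (names are hypothetical); slaves must be down first
  ip link set eth0 down; ip link set eth0 master bond0
  ip link set eth1 down; ip link set eth1 master bond0
  ip link set bond0 up
  # When eth0 loses carrier, the driver fails over to eth1 automatically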

Next is a transceiver (or NIC or switch ASIC port) failure. The link is up, but the traffic sent over it is lost. Years ago, we used protocols like UDLD to detect unidirectional links. Gigabit Ethernet (and faster technologies) includes Link Fault Signalling that can detect failures between the transceivers. You need a control-plane protocol to detect failures beyond the cable and the directly attached components.

Detecting Failures with a Control-Plane Protocol

We usually connect servers to VLANs that sometimes stretch across more than one data center (because why not) and want to use a single IP address per server. That means the only control-plane protocol one can use between a server and an adjacent switch is a layer-2 protocol, and the only choice we usually have is LACP. Welcome to the beautifully complex world of Multi-Chassis Link Aggregation (MLAG)4.
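
On the server side, LACP boils down to running the bonding driver in 802.3ad mode toward the MLAG pair; a minimal sketch with assumed interface names:

  # LACP (802.3ad) bond; lacp_rate fast asks the partner for LACPDUs every second
  ip link add bond0 type bond mode 802.3ad lacp_rate fast miimon 100
  ip link set eth0 down; ip link set eth0 master bond0
  ip link set eth1 down; ip link set eth1 master bond0
  ip link set bond0 up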

Using LACP/MLAG5 to detect path failure is a brilliant application of RFC 1925 Rule 6. Let the networking vendors figure out which switch can reach the rest of the fabric, hoping that the MLAG member that cannot will shut down its interfaces or stop participating in LACP. Guess what – they might be as clueless as you are; getting a majority vote in a cluster with two members is an exercise in futility. At least they have a peer link bundle between the switches that they can use to shuffle the traffic toward the healthy switch, but not if you use a virtual peer link. Cisco claims to have all sorts of resiliency mechanisms in its vPC Fabric Peering implementation, but I couldn’t find any details. I still don’t know whether they are implemented in the Nexus OS code or PowerPoint6.

In a World without LAG

Now let’s assume you got burned by MLAG7, want to follow the vendor design guidelines8, or want to use all uplinks for iSCSI MPIO or vMotion9. What could you do?

Some switches have uplink tracking – the switch shuts down all server-facing interfaces when it loses all uplinks – but I’m not sure this functionality is widely available in data center switches. I already mentioned Cisco’s lack of details, and Arista seems no better. All I found was a brief mention of the uplink-failure-detection keyword without further explanation.

Maybe we could solve the problem on the server? VMware has beacon probing on ESX servers, but they don’t believe in miracles in this case. You need at least three uplinks for beacon probing. Not exactly useful if you have servers with two uplinks (and few people need more than two 100GE uplinks per server).

Could we use the first-hop gateway as a witness node? The Linux bonding driver supports ARP monitoring: it sends periodic ARP requests to a specified destination IP address through all uplinks. Still, according to the engineer asking the Design Clinic question, that code isn’t exactly bug-free.
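
For reference, this is roughly how the ARP monitor is configured with iproute2 (the target address, presumably the first-hop gateway, and the probing interval are made-up values):

  # Probe 192.0.2.1 (hypothetical gateway) every second; an uplink that stops
  # seeing replies is declared down and removed from the failover rotation
  ip link add bond0 type bond mode active-backup arp_interval 1000 arp_ip_target 192.0.2.1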

Finally, you could accept the risk – if your leaf switches have four (or six) uplinks, the chance of a leaf switch becoming isolated from the rest of the fabric is pretty low, so you might just give up and stop worrying about Byzantine failures.

BGP Is the Answer. What Was the Question?

What’s left? BGP, of course. You could install FRR on your Linux servers, run BGP with the adjacent switches, and advertise the server’s loopback IP address (a minimal configuration sketch follows the list below). To be honest, properly implemented RIP would also work, and I can’t fathom why we couldn’t get a decent host-to-network protocol in the last 40 years10. All we need is a protocol that:

  • Allows a multi-homed host to advertise its addresses
  • Prevents route leaks that could cause servers to become routers11. BGP does that automatically; we’d have to use hop count to filter RIP updates sent by the servers12.
  • Bonus point: run that protocol over an unnumbered switch-to-server link.
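
A minimal sketch of that approach with FRR on the server; the AS number, interface names, and loopback prefix are made-up, and the switches would need matching unnumbered BGP sessions plus inbound filters that accept nothing but host routes:

  ! /etc/frr/frr.conf -- advertise the loopback /32 over both uplinks
  ! 192.0.2.10/32 is assumed to be configured on the loopback interface
  router bgp 65001
   bgp router-id 192.0.2.10
   no bgp ebgp-requires-policy
   neighbor eth0 interface remote-as external
   neighbor eth1 interface remote-as external
   address-family ipv4 unicast
    network 192.0.2.10/32
   exit-address-family

The BGP unnumbered sessions run over IPv6 link-local addresses, so the switch-to-server links need no IPv4 addresses at all, which also takes care of the bonus point above.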

It sounds like a great idea, but it would require OS vendor support13 and coordination between server- and network administrators. Nah, that’s never going to happen in enterprise IT.

No worries, I’m pretty sure one or the other SmartNIC14 vendor will eventually start selling “a perfect solution”: run BGP from the SmartNIC and adjust the link state reported to the server based on the routes received over such a session – another perfect example of RFC 1925 rule 6a.

  1. Bonus points if you realized a NIC could fail and installed two NICs. ↩︎

  2. I did get a “why would anyone want to do that” question from someone working at VMware, but I considered that par for the course. If you claim to be a networking engineer but can’t answer that question, OTOH, you have a bigger problem than what this blog post discusses. ↩︎

  3. Skipping the load balancing can-of-worms for the moment. ↩︎

  4. Most of MLAG’s complexity is hidden from the server administrators, but that does not mean it’s not there waiting to explode in your face. ↩︎

  5. Vendors use different names like vPC for MLAG functionality. Some also call a link aggregation group (LAG) a Port Channel or an EtherChannel. ↩︎

  6. The details are probably described in some Cisco Live presentation, but my Google-Fu is failing me. ↩︎

  7. I know about quite a few data center meltdowns caused by MLAG bugs, but I guess not everyone gets exposed to so many pathological cases. ↩︎

  8. For example, VMware recommends independent uplinks in NSX-T deployments. ↩︎

  9. Multi-interface vMotion or iSCSI MPIO needs multiple IP addresses per host with traffic for an individual IP address tied to a particular uplink. You cannot implement that with a link aggregation group. ↩︎

  10. OSI had ES-IS protocol from day one. Did the IETF community feel the urge to be different, or was everything OSI touched considered cooties? ↩︎

  11. IBM was running OSPF on mainframes, and it was perfectly possible to turn your mainframe into the most expensive core router you’ve ever seen with a dismal packet forwarding performance. ↩︎

  12. Probably doable with a route map matching on metric. ↩︎

  13. I’m looking at you, VMware. ↩︎

  14. Known as Data Processing Unit in marketese. ↩︎

5 comments:

  1. If you happen to run bare metal kubernetes/openshift with e.g. Calico or Cilium CNI, running BGP between server and ToR almost becomes the default deployment. Obviously, now we're not in enterprise IT virtualisation territory anymore (although kubevirt could functionally provide most of the desired technology...)

  2. The answer is BGP + BFD. This project adds FRR on each server and then advertises all host (and potentially also client VM) IPs as /32 routes to the leafs. No MLAG, and VLANs do not span multiple racks. There are some caveats since it does not yet support accelerated datapaths like DPDK. Maybe someone will add ES-IS to FRR in the near future...

    https://developers.redhat.com/articles/2022/09/22/learn-about-new-bgp-capabilities-red-hat-openstack-17#
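
    For illustration, enabling BFD on such a server-side FRR session could look roughly like this (not the project's actual configuration; the AS number and interface name are made up):

      ! requires the bfdd daemon to be enabled in /etc/frr/daemons
      router bgp 65001
       neighbor eth0 interface remote-as external
       neighbor eth0 bfd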

  3. Great points in this article; moving the resilience from L2 to L3 is brilliant.

  4. Amazing article, Ivan. What I ran into the other day was a switch fault resetting the config to vanilla, so the switch port is up and the NIC is up and forwarding... into nowhere. Server dies. Indeed, having the host check (with an HSRP-ish protocol) whether it can reach the DG/NH would be good. FRR looks good; we did OSPF on LANs in my BT time for stacks of pizzas and what came after that. Not so sure if BGP in the ToR boils my cookie. But great article, thanks.

  5. The routed way is definitely good, perhaps at the cost of some complexity.

    One annoyance is what IP address gets used by default by the system for outbound traffic. It would be nice to have a generic OS-level way to say "this IP on lo0 should be default for outbound IP traffic unless to the connected link subnet itself".

    Obviously some software allows you to specify the source IP to use, but again more complexity in config. And some doesn't. I've solved it before with an iptables/nft SNAT rule for everything not on the connected subnet, but again it's messier than one would like.
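
    A minimal nft sketch of that SNAT workaround (the loopback address 192.0.2.10 and the connected subnet 198.51.100.0/24 are placeholders):

      # rewrite the source address to the loopback IP for anything not staying on the connected subnet
      nft add table ip hostnat
      nft add chain ip hostnat postrouting '{ type nat hook postrouting priority 100 ; }'
      nft add rule ip hostnat postrouting ip daddr != 198.51.100.0/24 snat to 192.0.2.10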

    Replies
    1. A few other tricks (I guess I'll have to write a follow-up blog post one of these days):

      • Assign the same IP address to the loopback and all uplinks. Obviously that makes running BGP sessions a bit harder, so you have to run them over IPv6 LLA
      • Don't care about the outbound sessions (too much)
      • Use MP-TCP for outbound sessions
      • Use MPIO for iSCSI
    2. The IPv6 link-local is a very nice approach alright (even better when combined with automatic neighbor discovery). You could probably get away without the IP on each uplink in a lot of cases as the system will pick a GUA/IPv4 from the loopback if it's the only choice.
