To BFD or Not to BFD?

Omer asked a pretty common question about BFD on one of my blog posts (slightly reworded):

Would you still use BFD even if you have a direct router-to-router physical link without L2 transport in the middle to detect if there is some kind of software failure on the other side?

Sander Steffann quickly replied:

Too many things can go wrong, even on a simple point-to-point link. The latest ones I have observed are a router not properly detecting when a link goes down and one of the SFPs on a DAC cable failing while keeping the link up towards the router.

We all know that the absolutely correct answer is “it depends” (even though, in this case, I would forgo that get-out-of-jail card and lean heavily toward YES). Let’s try to quantify it using the same questions we used when discussing BGP timers:

What problem are we trying to solve? We’re trying to detect inter-router link failure faster than what would be feasible using routing protocol hellos or timeouts.

Are there better ways to solve the problem? You could detect link failure at:

  • Physical layer (carrier/light loss);
  • Data link layer (UDLD, LACP, Gigabit Ethernet signaling…);
  • Network layer (BFD, routing protocols).

Physical layer failure detection is the easiest one, assuming it’s reasonably reliable… but it cannot detect anything other than a physical medium failure (cable cut) or a drastic transceiver failure (power loss or similar).

Data link layer mechanisms might be able to detect transceiver failures (although I’m still wondering where 10GE signaling is done – pointers are highly appreciated). LACP introduces unnecessary complexity on point-to-point router links, and UDLD is vendor-specific. Furthermore, data link layer mechanisms cannot detect end-to-end failures across a layer-2 network.

BFD is perfectly positioned to solve the network path element failure detection challenge. It sits at the waist of the protocol hourglass, is standardized, and is simple enough to be easy to implement.

What’s the worst thing that could happen if I use BFD? Aggressive BFD timers might trigger false positives due to packet loss and bring down perfectly good links.
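To put rough numbers on that trade-off, here is a minimal Python sketch. The detection-time rule follows RFC 5880 asynchronous mode (remote detect multiplier times the agreed transmit interval) and ignores jitter. The loss model is my assumption, not something measured: every BFD packet is lost independently with the same probability, which real networks rarely honor because loss tends to come in bursts. All function names and timer values are made up for illustration.

```python
def detection_time_ms(local_required_min_rx_ms, remote_desired_min_tx_ms, remote_detect_mult):
    """Detection time in asynchronous mode (RFC 5880), ignoring jitter."""
    # The remote end transmits at the slower of what it wants to send
    # and what we are willing to receive.
    agreed_tx_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_tx_interval_ms


def all_lost_probability(loss_probability, detect_mult):
    """Probability that detect_mult consecutive packets are all lost.

    ASSUMPTION: independent, identically distributed packet loss.
    """
    return loss_probability ** detect_mult


def expected_lost_runs_per_day(interval_ms, loss_probability, detect_mult):
    """Rough upper estimate of false session-down events per day.

    Counts every starting position of a fully lost run, so overlapping
    runs are counted more than once.
    """
    packets_per_day = 24 * 60 * 60 * 1000 / interval_ms
    return packets_per_day * all_lost_probability(loss_probability, detect_mult)


# Hypothetical timer choices: 3 x 50 ms versus 3 x 300 ms.
aggressive_ms = detection_time_ms(50, 50, 3)
moderate_ms = detection_time_ms(300, 300, 3)

if __name__ == "__main__":
    for label, interval in (("aggressive", 50), ("moderate", 300)):
        runs = expected_lost_runs_per_day(interval, loss_probability=0.01, detect_mult=3)
        print(f"{label}: detect in {3 * interval} ms, ~{runs:.3f} lost runs/day at 1% loss")
```

Under these assumptions, shortening the interval does not change the chance that any given run of packets is lost, but it multiplies the number of runs per day, so the same link flaps more often. Raising the detect multiplier is the knob that reduces the per-run probability.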

Why would twiddling BFD timers break things? Some platforms might implement BFD in hardware at the line rate. I’m probably not rich enough to be able to afford them (or this car).

Most other platforms implement BFD in software – a major consideration on platforms that use the same general-purpose CPU for packet forwarding. If higher-priority processes use all the CPU, BFD starves and might be unable to send packets or process them fast enough.

The problem is not limited to routers and switches – Booking.com experienced it on their load balancers.

You might think that the switches performing packet forwarding in hardware wouldn’t be susceptible to the same problems. They are – unless you protect your infrastructure with strict ACLs at all the edges, it’s always possible to attack it with a DoS flood. A long while ago, I could hose a Catalyst 6500 using nothing more aggressive than ARP requests.

You can (and should) use Control Plane Policing (CoPP) to protect the central CPU of your network device. Just make sure CoPP can treat NNI traffic (BFD and routing protocols) differently than UNI traffic (ARP or ping).
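The idea behind per-class policing can be sketched as a toy token-bucket model in Python. This is not any vendor's CoPP implementation or configuration syntax; the class names, rates, and burst sizes are hypothetical, chosen only to show that a UNI flood drains its own bucket and leaves the NNI bucket untouched.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate_pps: float        # tokens (packets) added per second
    burst: float           # bucket depth in packets
    tokens: float = 0.0
    last_ts: float = 0.0

    def __post_init__(self):
        self.tokens = self.burst   # start with a full bucket

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.burst, self.tokens + (now - self.last_ts) * self.rate_pps)
        self.last_ts = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical classification: which control-plane class a protocol falls into.
TRAFFIC_CLASS = {"bfd": "nni", "bgp": "nni", "ospf": "nni", "arp": "uni", "icmp": "uni"}

# Hypothetical policer settings, one bucket per class.
policers = {
    "nni": TokenBucket(rate_pps=1000, burst=100),
    "uni": TokenBucket(rate_pps=100, burst=10),
}

# One second of traffic: a 10,000 pps ARP flood plus BFD every 50 ms.
events = [(k * 0.0001, "arp") for k in range(10_000)]
events += [(j * 0.05 + 0.00005, "bfd") for j in range(20)]
events.sort()

admitted = {"arp": 0, "bfd": 0}
for timestamp, protocol in events:
    if policers[TRAFFIC_CLASS[protocol]].allow(timestamp):
        admitted[protocol] += 1

if __name__ == "__main__":
    print(admitted)
```

In this model every BFD packet reaches the CPU while the ARP flood is clamped to roughly its configured rate plus the initial burst. With a single shared bucket, BFD packets would compete with the flood for the same tokens.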

Long story short: be moderate on BFD timers. Figure out what your real business needs are, not what everyone assumes they are, and choose the simplest possible approach that would meet them (see also what Deutsche Telekom did to simplify their network).


5 comments:

  1. Hi Ivan,

    Is there any benefit in enabling BFD on OSPF as well as on BGP?

    On a side note, I've seen cases where MPLS labels were misprogrammed into the FIB, and this could have been detected with BFD. So I'd like to make a case for LDP-OAM or RSVP-OAM, as those MPLS labels are separate from IGP routes in the FIB and those paths would need to be tested separately.

    I'm not suggesting to have 50 ms timers everywhere, but just enabling BFD in more places can be a good step for a more robust network.

    Cheers
    Replies
    1. I would say BFD makes sense for any routing protocol in environments that have to detect router-to-router failures in sub-second time.
  2. >>Furthermore, data link layer mechanisms cannot detect end-to-end failures across a layer-2 network.
    Is this true in general? What about OAM implementations like for ATM & Ethernet? I'm not sure, but I believe ATM-OAM is end-to-end...
    But besides the fact ATM is dead, you still can't be sure the PVC is ending to the right box...
  3. BFD works for sub-second convergence protocols such as link-state protocols, but not for BGP. BGP is an NLRI-heavy protocol whose convergence takes around 10-15 seconds, and it is not wise to use a 3-second hold timer in BGP, as BGP boxes carry heavy control-plane traffic and the timers vary from peer to peer depending on the number of routes.
    Always use sub-second layer 1/2 detection for faster convergence.
    Layer 3 detection can be around 5-10 seconds.

    The BFD/BGP combination works fine.

