Micro-BFD: BFD over LAG (Port Channel)
The discussion in the comments to my LAG versus ECMP post took a totally unexpected turn when someone mentioned BFD failure detection over port channels (link aggregation groups – LAGs).
What’s the big deal?
What is BFD
Bidirectional Forwarding Detection (BFD) is a lightweight protocol that “provides low-overhead, short-duration detection of failures in the path between adjacent forwarding engines” (straight from RFC 5880).
Networking engineers love BFD for several reasons:
- It’s simpler and less CPU-intensive than routing protocol adjacency messages;
- It can be implemented in smart linecards, further reducing the control-plane CPU load, and making it possible to detect forwarding failures in milliseconds even on boxes with hundreds of interfaces (try doing that with OSPF);
- It detects byzantine failures between two forwarding engines that cannot be reliably detected at physical and data-link layers.
BFD uses the principle of shared fate to do its job – it tests the actual transmission path using the same protocol (IP) as the forwarded traffic – and thus reliably detects a forwarding failure in a simple transmission path, regardless of the number of components in the path.
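To put a number on "milliseconds": in asynchronous mode, RFC 5880 defines the local detection time as the Detect Mult received from the remote system multiplied by the agreed remote transmit interval (the greater of the local Required Min RX Interval and the remote Desired Min TX Interval). The sketch below only does that arithmetic; the function and parameter names are made up for illustration and do not come from any BFD implementation.

```python
def detection_time_us(remote_detect_mult: int,
                      local_required_min_rx_us: int,
                      remote_desired_min_tx_us: int) -> int:
    """Asynchronous-mode BFD detection time in microseconds (RFC 5880, section 6.8.4).

    All names here are hypothetical; only the formula comes from the RFC.
    """
    # The remote end transmits no faster than we are willing to receive.
    agreed_remote_tx_us = max(local_required_min_rx_us, remote_desired_min_tx_us)
    # The session is declared down after this much silence from the peer.
    return remote_detect_mult * agreed_remote_tx_us


# Both ends ask for 50 ms intervals with a multiplier of 3.
fast = detection_time_us(3, 50_000, 50_000)

# The local end can only receive every 300 ms, so the slower value wins.
slow = detection_time_us(3, 300_000, 50_000)

print(fast, slow)
```

With 50 ms timers and a multiplier of 3 the failure is detected after 150 ms; if one side can only accept a packet every 300 ms, detection stretches to 900 ms.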
BFD and LAG
A LAG appears as a single layer-2 forwarding path to the layer-3 forwarding engine. BFD messages can take any one of the links in the LAG (based on the LAG load balancing algorithm), breaking the shared fate assumption – a forwarded packet traversing another LAG member link might encounter a partially failed component, and it’s impossible for BFD to detect that.
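A toy model shows why a single BFD session cannot cover the whole bundle. It assumes the LAG picks a member link by hashing the packet's 5-tuple; real load-balancing algorithms are platform-specific, and the CRC32 hash, interface names, and addresses below are purely illustrative. The destination port 3784 is the single-hop BFD control port from RFC 5881.

```python
import zlib


def pick_member(flow: tuple, members: list) -> str:
    """Hypothetical LAG hash: map a 5-tuple onto one member link."""
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]


members = ["eth1", "eth2", "eth3", "eth4"]  # illustrative member links

# (src IP, dst IP, IP protocol, src port, dst port) of one BFD session.
bfd_flow = ("10.0.0.1", "10.0.0.2", 17, 49152, 3784)

# Every packet of the session carries the same 5-tuple, so the hash
# always selects the same member link.
bfd_link = pick_member(bfd_flow, members)

# The remaining links still carry user traffic but are never probed.
untested = [m for m in members if m != bfd_link]

print("BFD session pinned to:", bfd_link)
print("Links BFD never tests:", untested)
```

Whichever link the hash selects, the other three carry traffic that the BFD session never shares fate with.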
The only solution to this problem is to run BFD across every LAG member link – the BFD code has to become LAG-aware and test end-to-end connectivity across every member link independently (see RFC 7130 for more details). Even more, LACP and BFD have to work in parallel – a member is added to a LAG only when both LACP and BFD agree it’s OK to do so.
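The "both LACP and BFD must agree" rule can be sketched as a simple per-member check. This is a minimal illustration of the idea, not the RFC 7130 state machine: the class, field names, and interface names are hypothetical, and timers, session setup, and the dedicated micro-BFD UDP port (6784) are left out.

```python
from dataclasses import dataclass


@dataclass
class MemberLink:
    """Hypothetical per-member state for a LAG running micro-BFD."""
    name: str
    lacp_ok: bool   # LACP considers the link ready to join the bundle
    bfd_up: bool    # the micro-BFD session on this link is Up

    @property
    def in_lag(self) -> bool:
        # A member forwards traffic only when both protocols agree.
        return self.lacp_ok and self.bfd_up


def active_members(links: list) -> list:
    """Return the names of the links that may carry LAG traffic."""
    return [link.name for link in links if link.in_lag]


links = [
    MemberLink("eth1", lacp_ok=True, bfd_up=True),    # healthy
    MemberLink("eth2", lacp_ok=True, bfd_up=False),   # LACP fine, forwarding broken
    MemberLink("eth3", lacp_ok=False, bfd_up=True),   # LACP not converged
]

print(active_members(links))
```

The interesting case is eth2: LACP alone would keep it in the bundle, while the per-link BFD session removes it.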
More information
I discussed the role of BFD in fast convergence in the Advanced Routing Protocol Topics part of the How Networks Really Work webinar.
And probably ASR903.
Nice post.
To bring some platform-specific material, here are some documents about BFD that could be useful for your readers (not only BoB):
Intro: https://supportforums.cisco.com/document/62656/introduction-bfd-asr9000
Architecture: https://supportforums.cisco.com/document/144626/bfd-support-cisco-asr9000
Troubleshooting for IOS-XR: https://supportforums.cisco.com/blog/12016611/bfd-configuration-troubleshooting-cisco-ios-and-xr-routers
and some scaling figures for CRS-1 and CRS-3: https://supportforums.cisco.com/document/12019081/bfd-crs
Fred
I don't think Cisco's implementation on either XR or NX-OS follows that RFC. It's a proprietary, not so efficient (to say the least), implementation...
Juniper and ALU do have an RFC-compliant implementation though.
Thanks
You are right: Cisco ASR 9000 supports RFC 7130 starting with IOS-XR 5.2.0.
Fred
I believe it might replace UDLD as a faster alternative to monitor the individual links in an LACP Port Channel.
I have not yet found out how to configure this feature, though. My N6K should support it.
1. Some implementations at some moments in time had very serious issues with BFD (BFD-based link state not properly propagated to all protocols, MPLS being the prime victim). OK, it was not Cisco and not Juniper.
2. Do people do LAG without LACP ? (OK, not the greatest protocol for removing bad links, but still effective)
3. Do people do LAG without "seeing the light" ? When you "see the light" (get the L1 signal sent from the other side - no Ethernet over SomethingElse) a number of issues disappear.
What am I missing here ?
#3 - "Not seeing the light" detects 99% of all the failures, but we're usually not worried about them (because we detect them). Faulty transceivers and similar **** is what wakes you up at 2AM (rarely, but still). Oh, and then there are people using "media converters" ;) and LAG over Metro Ethernet (don't ask!).
Finally, it's nice to be able to solve failure detection on all media using a consistent mechanism.
Like Ivan said, we sleep better that way.