Micro-BFD: BFD over LAG (Port Channel)

Monday, October 20, 2014 07:15 +0200

The discussion in the comments to my LAG versus ECMP post took a totally unexpected turn when someone mentioned BFD failure detection over port channels (link aggregation groups – LAGs).

What’s the big deal?

What is BFD

Bidirectional Forwarding Detection (BFD) is a lightweight protocol that “provides low-overhead, short-duration detection of failures in the path between adjacent forwarding engines” (straight from RFC 5880).

Networking engineers love BFD for several reasons:

It’s simpler and less CPU-intensive than routing protocol adjacency messages;
It can be implemented in smart linecards, further reducing the control-plane CPU load, and making it possible to detect forwarding failures in milliseconds even on boxes with hundreds of interfaces (try doing that with OSPF);
It detects byzantine failures between two forwarding engines that cannot be reliably detected at physical and data-link layers.
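
To put a number on "milliseconds": in asynchronous mode, RFC 5880 (section 6.8.4) defines the local detection time as the detect multiplier received from the remote system, multiplied by the larger of the local required minimum receive interval and the remote desired minimum transmit interval. A minimal sketch of that arithmetic follows; the function name, the parameter names, and the example timer values are illustrative assumptions, not taken from any particular platform.

```python
def bfd_detection_time_ms(local_required_min_rx_ms,
                          remote_desired_min_tx_ms,
                          remote_detect_mult):
    """Asynchronous-mode detection time as computed by the local system.

    Sketch of the RFC 5880 section 6.8.4 rule; the RFC carries intervals
    in microseconds, milliseconds are used here only for readability.
    """
    # The remote system may not send faster than we are willing to receive,
    # so the agreed interval is the larger of the two values.
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms


# Hypothetical timers: 50 ms in both directions, multiplier 3.
print(bfd_detection_time_ms(50, 50, 3))
```

If one side can only receive every 300 ms, the same multiplier stretches the detection time accordingly, which is why line-card implementations matter for aggressive timers.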

BFD uses the principle of shared fate to do its job – it tests the actual transmission path using the same protocol (IP) as the forwarded traffic – and thus reliably detects a forwarding failure in a simple transmission path, regardless of the number of components in the path.

BFD and LAG

A LAG appears as a single layer-2 forwarding path to the layer-3 forwarding engine. The packets of a single BFD session can take any one of the links in the LAG (whichever the LAG load-balancing algorithm picks), breaking the shared-fate assumption – a forwarded packet traversing another LAG member link might encounter a partially failed component, and it’s impossible for BFD to detect that.
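
The following toy model illustrates the problem. It assumes a LAG that picks the member link by hashing the flow 5-tuple; real platforms use vendor-specific hash functions and inputs, so the CRC32-based `pick_member` function, the interface names, and the addresses are all hypothetical. The destination port 3784 is the single-hop BFD control port from RFC 5881.

```python
import zlib


def pick_member(flow, members):
    """Pick a LAG member link by hashing the flow 5-tuple (illustrative only)."""
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]


members = ["eth1", "eth2", "eth3", "eth4"]  # hypothetical member links

# One BFD session has a fixed 5-tuple, so it always hashes to the same link.
bfd_flow = ("10.0.0.1", "10.0.0.2", "udp", 49152, 3784)
bfd_link = pick_member(bfd_flow, members)

# A partial failure on any other member link never touches the BFD packets.
unmonitored = [m for m in members if m != bfd_link]

print("BFD session rides on:", bfd_link)
print("Links BFD cannot vouch for:", unmonitored)
```

With four member links, a single BFD session exercises one of them and says nothing about the other three.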

The only solution to this problem is to run BFD across every LAG member link – the BFD code has to become LAG-aware and test end-to-end connectivity across every member link independently (see RFC 7130 for more details). Furthermore, LACP and BFD have to work in parallel – a member link is added to a LAG only when both LACP and BFD agree it’s OK to do so.
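
The "both have to agree" rule can be sketched as a simple filter over member links. This is a conceptual model under stated assumptions, not how any vendor implements RFC 7130: the `MemberLink` class, its fields, and the interface names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemberLink:
    name: str
    lacp_ok: bool  # LACP considers the link fit to join the bundle
    bfd_up: bool   # the per-link (micro-BFD) session on this link is Up


def active_members(links):
    """Return the links allowed to carry traffic: LACP and BFD must both agree."""
    return [link.name for link in links if link.lacp_ok and link.bfd_up]


links = [
    MemberLink("eth1", lacp_ok=True, bfd_up=True),
    MemberLink("eth2", lacp_ok=True, bfd_up=False),   # LACP happy, forwarding broken
    MemberLink("eth3", lacp_ok=False, bfd_up=True),   # not negotiated into the bundle
]

print(active_members(links))
```

The second link is the interesting case: LACP alone would keep it in the bundle, while the per-link BFD session takes it out.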

More information

I discussed the role of BFD in fast convergence in the Advanced Routing Protocol Topics part of the How Networks Really Work webinar.

10 comments:

Anonymous 20 October 2014 08:57

Ivan, this is also supported on ASR1000: BFD over GEC http://www.cisco.com/c/en/us/td/docs/ios-xml/ios/cether/configuration/xe-3s/asr1000/ce-xe-3s-asr1000-book/ce-ieee-link-bndl-xe.html
And probably ASR903.

Fred Cuiller 20 October 2014 09:32

Hi Ivan,

Nice post.
To bring some platform-specific material, here are some documents about BFD that could be useful for your readers (not only BoB):

Intro: https://supportforums.cisco.com/document/62656/introduction-bfd-asr9000

Architecture: https://supportforums.cisco.com/document/144626/bfd-support-cisco-asr9000

Troubleshooting for IOS-XR: https://supportforums.cisco.com/blog/12016611/bfd-configuration-troubleshooting-cisco-ios-and-xr-routers

and some scaling figures for CRS-1 and CRS-3: https://supportforums.cisco.com/document/12019081/bfd-crs

Fred

Ofer 20 October 2014 09:58

Hi Ivan,
I don't think Cisco's implementation on both XR and NX-OS follows that RFC. It's a proprietary, not so efficient (to say the least), implementation...

Juniper and ALU do have an RFC-compliant implementation though.

Thanks

Fred Cuiller 20 October 2014 10:27

Hi Ofer,

You are right, Cisco ASR 9000 implements RFC7130 support from IOS-XR 5.2.0.

Fred

Michael 20 October 2014 18:45

According to the Release Notes, Cisco NX-OS 7.x supports "Layer 2 Bidirectional Forwarding Detection".
I believe it might replace UDLD as a faster alternative to monitor the individual links in an LACP Port Channel.
I have not yet found out how to configure this feature, though. My N6K should support it.

R.-Adrian F. 20 October 2014 22:39

Some random thoughts about BFD and LAG:
1. Some implementations at some moments in time had very serious issues with BFD (BFD-based link state not properly propagated to all protocols, MPLS being the prime victim). OK, it was not Cisco and not Juniper.
2. Do people do LAG without LACP ? (OK, not the greatest protocol for removing bad links, but still effective)
3. Do people do LAG without "seeing the light" ? When you "see the light" (get the L1 signal sent from the other side - no Ethernet over SomethingElse) a number of issues disappear.

What am I missing here ?

Ivan Pepelnjak 21 October 2014 08:32

#2 - Not all boxes support fast LACP (similar to BFD: if you can't execute it on the line card, you have serious scalability issues), so you're removing bad links in tens of seconds.

#3 - "Not seeing the light" detects 99% of all the failures, but we're usually not worried about them (because we detect them). Faulty transceivers and similar **** is what wakes you up at 2AM (rarely, but still). Oh, and then there are people using "media converters" ;) and LAG over Metro Ethernet (don't ask!).

Finally, it's nice to be able to solve failure detection on all media using a consistent mechanism.

DuaneO 21 October 2014 23:28

We hope for lights out but plan for "lights on but nobody's home". We do run some lag over lambda services and I've seen failures both ways.

Like Ivan said, we sleep better that way.

Anonymous 18 December 2017 23:25

Hi Ivan, What application do you use to create these diagrams? :-)

Ivan Pepelnjak 19 December 2017 06:36

http://blog.ipspace.net/2013/07/the-tools-that-i-use-drawings.html
