
Micro-BFD: BFD over LAG (Port Channel)

The discussion in the comments to my LAG versus ECMP post took a totally unexpected turn when someone mentioned BFD failure detection over port channels (link aggregation groups – LAGs).

What’s the big deal?

What is BFD

Bidirectional Forwarding Detection (BFD) is a lightweight protocol that “provides low-overhead, short-duration detection of failures in the path between adjacent forwarding engines” (straight from RFC 5880).

Networking engineers love BFD for several reasons:

  • It’s simpler and less CPU-intensive than routing protocol adjacency messages;
  • It can be implemented in smart linecards, further reducing the control-plane CPU load, and making it possible to detect forwarding failures in milliseconds even on boxes with hundreds of interfaces (try doing that with OSPF);
  • It detects byzantine failures between two forwarding engines that cannot be reliably detected at physical and data-link layers.

BFD uses the principle of shared fate to do its job – it tests the actual transmission path using the same protocol (IP) as the forwarded traffic – and thus reliably detects a forwarding failure in a simple transmission path, regardless of the number of components in the path.
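
To put the “milliseconds” claim into numbers: RFC 5880 defines the asynchronous-mode detection time as the Detect Mult received from the remote system multiplied by the agreed transmit interval of the remote system (the greater of the local Required Min RX Interval and the remote Desired Min TX Interval). Here's a minimal Python sketch of that arithmetic. The function and parameter names are mine, and the timer values are illustrative assumptions, not defaults of any particular platform.

```python
def detection_time_ms(remote_detect_mult, local_required_min_rx_ms, remote_desired_min_tx_ms):
    """Asynchronous-mode BFD detection time as described in RFC 5880.

    The remote system transmits no faster than the slower of what it wants
    to send and what we are willing to receive, so the agreed interval is
    the greater of the two values.
    """
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms


# Illustrative values only (assumptions, not platform defaults)
fast = detection_time_ms(remote_detect_mult=3,
                         local_required_min_rx_ms=50,
                         remote_desired_min_tx_ms=50)
slow = detection_time_ms(remote_detect_mult=3,
                         local_required_min_rx_ms=300,
                         remote_desired_min_tx_ms=50)

print(fast)  # 150
print(slow)  # 900
```

The second call shows why both ends matter: a peer that wants to send every 50 ms still gets slowed down to 300 ms if that's all the local box is willing to receive.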

BFD and LAG

A LAG appears as a single layer-2 forwarding path to the layer-3 forwarding engine. BFD messages can take any one of the links in the LAG (based on the LAG load balancing algorithm), which breaks the shared fate assumption: a forwarded packet traversing another LAG member link might encounter a partially failed component, and BFD has no way of detecting that.
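
A toy Python simulation makes the broken shared fate visible. Everything in it is an assumption made for illustration: the member names, the flows, and above all the hash function (XOR of the UDP/TCP ports modulo the number of members), which is a stand-in for whatever vendor-specific load balancing algorithm a real box uses. The only real-world number is 3784, the destination port of single-hop BFD.

```python
MEMBERS = ["member-1", "member-2", "member-3", "member-4"]


def pick_member(src_port, dst_port, members=MEMBERS):
    """Toy LAG hash (assumption): XOR the L4 ports, modulo the member count."""
    return members[(src_port ^ dst_port) % len(members)]


def delivered(src_port, dst_port, broken):
    """A packet gets through only if its flow does not hash onto a broken member."""
    return pick_member(src_port, dst_port) not in broken


# A single BFD session over the whole LAG: one flow, always the same member link
BFD_FLOW = (49152, 3784)

# member-2 forwards nothing, but its physical link stays up
broken = {"member-2"}

bfd_ok = delivered(*BFD_FLOW, broken)

# Eight hypothetical user flows towards port 443
user_flows = [(50000 + i, 443) for i in range(8)]
lost = [flow for flow in user_flows if not delivered(*flow, broken)]

print(pick_member(*BFD_FLOW))  # member-1
print(bfd_ok)                  # True
print(lost)                    # [(50002, 443), (50006, 443)]
```

The BFD session happily stays up because its packets never touch member-2, while every user flow that hashes onto member-2 is silently dropped.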

The only solution to this problem is to run BFD across every LAG member link: the BFD code has to become LAG-aware and test end-to-end connectivity across every member link independently (see RFC 7130 for more details). On top of that, LACP and BFD have to work in parallel, so a member is added to a LAG only when both LACP and BFD agree it’s OK to do so.
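
The “both have to agree” rule boils down to a logical AND per member link. Here's a minimal Python sketch under stated assumptions: the dictionary layout and the `lacp_ok` / `micro_bfd_up` flags are hypothetical names I made up to model the idea, not the state machine of RFC 7130 or any vendor's implementation.

```python
def active_members(links):
    """Return the member links allowed to carry traffic.

    A member qualifies only when LACP considers it ready AND its own
    micro-BFD session is up (both flags are hypothetical, for illustration).
    """
    return [name for name, state in links.items()
            if state["lacp_ok"] and state["micro_bfd_up"]]


links = {
    "member-1": {"lacp_ok": True, "micro_bfd_up": True},
    "member-2": {"lacp_ok": True, "micro_bfd_up": False},  # link is up, forwarding is broken
    "member-3": {"lacp_ok": False, "micro_bfd_up": True},  # LACP has not agreed yet
    "member-4": {"lacp_ok": True, "micro_bfd_up": True},
}

print(active_members(links))  # ['member-1', 'member-4']
```

With a single BFD session over the whole bundle, member-2 would have stayed in the LAG; with one session per member it gets removed as soon as its own session goes down.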

Where can I get it?

Micro-BFD is available in Junos release 13.3, Cisco Nexus OS and Cisco IOS XR Release 4.0. Any other implementation? Write a comment!

More information

I wrote about BFD a long long time ago.

8 comments:

  1. Ivan, this is also supported on ASR1000: BFD over GEC http://www.cisco.com/c/en/us/td/docs/ios-xml/ios/cether/configuration/xe-3s/asr1000/ce-xe-3s-asr1000-book/ce-ieee-link-bndl-xe.html
    And probably ASR903.

  2. Hi Ivan,

    Nice post.
    To add some platform-specific material, here are some documents about BFD that could be useful for your readers (not only BoB):

    Intro: https://supportforums.cisco.com/document/62656/introduction-bfd-asr9000

    Architecture: https://supportforums.cisco.com/document/144626/bfd-support-cisco-asr9000

    Troubleshooting for IOS-XR: https://supportforums.cisco.com/blog/12016611/bfd-configuration-troubleshooting-cisco-ios-and-xr-routers

    and some scaling figures for CRS-1 and CRS-3: https://supportforums.cisco.com/document/12019081/bfd-crs

    Fred

  3. Hi Ivan,
    I don't think Cisco's implementation on either XR or NX-OS follows that RFC. It's a proprietary, not so efficient (to say the least) implementation...

    Juniper and ALU do have an RFC-compliant implementation though.

    Thanks

    Replies
    1. Hi Ofer,

      You are right. Cisco ASR 9000 implements RFC 7130 starting with IOS-XR 5.2.0.

      Fred

  4. According to the Release Notes, Cisco NX-OS 7.x supports "Layer 2 Bidirectional Forwarding Detection".
    I believe it might replace UDLD as a faster alternative to monitor the individual links in an LACP Port Channel.
    I have not yet found out how to configure this feature, though. My N6K should support it.

  5. Some random thoughts about BFD and LAG:
    1. Some implementations at some moments in time had very serious issues with BFD (BFD-based link state not properly propagated to all protocols, MPLS being the prime victim). OK, it was not Cisco and not Juniper.
    2. Do people do LAG without LACP? (OK, not the greatest protocol for removing bad links, but still effective)
    3. Do people do LAG without "seeing the light"? When you "see the light" (get the L1 signal sent from the other side, with no Ethernet over SomethingElse in between) a number of issues disappear.

    What am I missing here?

    Replies
    1. #2 - Not all boxes support fast LACP (similar to BFD: if you can't execute it on the line card, you have serious scalability issues), so you're removing bad links in tens of seconds.

      #3 - "Not seeing the light" detects 99% of all the failures, but we're usually not worried about them (because we detect them). Faulty transceivers and similar **** is what wakes you up at 2AM (rarely, but still). Oh, and then there are people using "media converters" ;) and LAG over Metro Ethernet (don't ask!).

      Finally, it's nice to be able to solve failure detection on all media using a consistent mechanism.

    2. We hope for lights out but plan for "lights on but nobody's home". We do run some LAGs over lambda services and I've seen failures both ways.

      Like Ivan said, we sleep better that way.
