Graceful Restart and BFD
The whole High Availability Switching series started with a question along the lines of “does it make sense to run BFD together with Graceful Restart”. After Non-Stop Forwarding 101, Graceful Restart 101, and Graceful Restart and Convergence Speed we finally have enough information to answer that question.
TL&DR: Most probably not.
A more nuanced answer depends (as always) on a gazillion implementation details.
BFD implemented in forwarding hardware. This is the best option – BFD detects data plane failures, and routing protocol(s) detect control plane failures. BFD failure should trigger regular routing protocol convergence, and routing protocol timeouts should trigger Graceful Restart procedures.
BFD sharing fate with the control plane. A control plane failure (which would trigger Graceful Restart) would also result in a BFD session failure. A BFD failure could be used to enter the Graceful Restart procedure (and start the Restart Timer) before the routing protocol detects a neighbor failure. However, a BFD failure should not be used to flush the forwarding tables or to start routing protocol convergence.
You’ll find more details in Generic Application of BFD (RFC 5882).
Moving from Theory to Practice
If you insist on using BFD with Graceful Restart, get reliable answers to these questions (or do the tests yourself):
- Can the helper nodes decouple BFD and routing protocol failure detection and start an unconditional convergence or Graceful Restart as needed?
- Is the behavior following a BFD failure configurable?
- Does the helper node use the Control Plane Independent bit in BFD control messages to change its behavior?
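To make the last two questions concrete: Cisco IOS XE has a check-control-plane-failure option on the fall-over bfd neighbor command which, as far as I can tell from the documentation, makes BGP take the received C-bit into account before tearing down the session on a BFD failure. The sketch below combines it with Graceful Restart; the AS numbers, addresses, and timers are made up, and you should verify the actual behavior on your platform before trusting it:

  router bgp 65000
   bgp graceful-restart
   neighbor 192.0.2.2 remote-as 65001
   ! react to BFD failures, but consider the neighbor's C-bit first
   neighbor 192.0.2.2 fall-over bfd check-control-plane-failure
  !
  interface GigabitEthernet0/0/0
   ! single-hop BFD timers: 300 ms TX/RX, multiplier 3 => ~900 ms detection
   bfd interval 300 min_rx 300 multiplier 3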
I tried to find out the implementation details of Graceful Restart and BFD interactions. The closest I got was:
- This Junos document which totally confused me
- Cisco IOS XE BGP Configuration Guide saying “Configuring both Bidirectional Forwarding Detection (BFD) and BGP graceful restart for NSF on a device running BGP may result in suboptimal routing.” which supports my TL&DR conclusions ;)
- An Arista EOS document (behind a regwall) effectively saying “Our Stateful Switchover is fast enough that a BFD session doesn’t go down. You can therefore use BFD with BGP Graceful Restart.”
Hands-on experience would be highly appreciated – please write a comment!
I'd suggest stepping back a bit and considering the bigger picture.
What is BFD good for? What is GR/NSF/NSR/SSO good for?
BFD and GR/NSF/NSR/SSO have different goals: one enables quick failover, the other prevents failover. Combining both promises to be interesting.
Reliably and quickly detecting a forwarding failure is helpful when there is a different path to fail over to. When there is no alternative path, quick failure detection seems less important.
BFD implementations often combine data plane (BFD echo mode) and control plane (BFD session) failure detection and thus assume a shared fate between data plane and control plane.
GR and NSF are based on the assumption that the data plane can still function although the control plane has (temporarily) failed.
NSR/SSO is supposed to hide control plane failures by (more or less) transparently failing over to a different processor.
Some combinations of GR/NSF/NSR/SSO can help to mask temporary control plane failures that do not affect the data plane.
NSF+GR allows forwarding to continue despite temporary control plane failures; the same holds for NSF+NSR/SSO.
[IMHO NSR/SSO should be implemented completely transparently and always be enabled when there are two or more control plane processors. Why even have hardware redundancy for the control plane when it does not work well enough to enable unconditionally?]
When routers are not able to quickly react to topology changes (think (multiple) full Internet BGP tables with weak routers), GR seems useful to avoid churn and cascading failures.
BFD is intended to detect forwarding failures quickly and reliably. Now what should one do with this information? Continue forwarding down the known failed path with the help of something like GR/NSF/NSR/SSO? Why detect the forwarding failure at all, if it is to be ignored anyway?
How can BFD be used with complex routers where the data plane can still function although the control plane has failed? How to handle a complex router with redundant control plane, e.g., two route processor modules? One idea could be to use BFD echo mode in the data plane with a short detection interval and a control plane session with a long detection interval (or no BFD control plane session at all). Combined with an additional path to fail over to if a data plane failure is detected this can help, but it does add a whole lot of complexity, which might reduce reliability in practice.
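On Cisco IOS-style boxes, the “fast echo, slow control session” idea could look roughly like the sketch below: with echo mode enabled, the configured interface timers drive the echo packets while the asynchronous control session falls back to the global slow timers. The values are arbitrary and the exact timer semantics differ between platforms, so treat this as an illustration rather than a recipe:

  ! control packets slow down to 2 seconds while echo mode is active
  bfd slow-timers 2000
  !
  interface GigabitEthernet0/0/0
   ! ~150 ms data plane failure detection via echo packets
   bfd interval 50 min_rx 50 multiplier 3
   ! echo mode is on by default on most IOS platforms; shown for clarity
   bfd echo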
A different, simpler approach to network redundancy would be to have less complex routers without NSF/NSR/SSO, but more of those to build redundant paths. Then quick and reliable failure detection, e.g., with BFD, can be used to fail over whenever a data plane or control plane failure is detected.
Well, assuming that the C-bit is set honestly (will be funny if not) and assuming that the Helper is using this bit correctly (and I think it's pretty well defined what "correctly" means - see section 4.3 in RFC 5882), the answer is pretty clear, isn't it:
IF you do not have any alternative path, then the only reason to use BFD is to show off; so grow up and stop using it.
IF you do have an alternative path, but the device in question sends C=0, then BFD on that device shares fate with the control plane. Hence, from the Helper perspective, a BFD failure means nothing: the Helper has no way of knowing whether it should flush the routes (forwarding plane failure) or start the GR procedure (control plane failure). Juniper SRX (and I think also QFX) is an example of this: BFD is handled by the Routing Engine there, so it is not control-plane independent, and they honestly set C=0. So what will the Helper choose to do when BFD goes down - will it flush the routes or will it start helping GR? A few Helper implementations I saw (including Juniper) opt for the latter, because otherwise GR would never be able to work - BFD will almost surely go down before the restarting router manages to send a new BGP Open, as it normally should. BUT... starting GR unconditionally, of course, means that now BFD will never be able to work. A lose-lose situation. The Junos document shared in the blog post describes a "dont-help-shared-fate-bfd-down" option, which was apparently added at some point (I never used it), and it seems to control exactly this: choosing between GR-that-never-works and BFD-that-never-works. One way or another, BGP + GR + BFD_C=0 sounds like a bad idea.
IF you do have an alternative path and the device in question proudly sends C=1, then you are lucky. Juniper MX is such an example, because it supports "Distributed BFD" and "Inline BFD", both of which are implemented on the line cards and can survive a Routing Engine reboot. So, any well-implemented Helper should now be able to distinguish between a forwarding plane failure (BFD goes down => flush all routes) and a control plane failure (BFD stays up => start helping GR as usual). Hence, BGP + GR + BFD_C=1 doesn't sound like a bad idea to me...
A distinction should be made between single-hop (single link) and multi-hop (multipath) BFD. Single-hop BFD runs only between directly connected neighbors, with interface IP addresses in the same subnet. It is enabled in the interface configuration and for the routing protocol, and it should be processed by the NPU, which could be on a separate line card.
Multipath / multi-hop BFD sessions run between loopback IP addresses. The BGP neighbor is configured with something like “fall-over bfd multi-hop”, which also requires a BFD map; nothing is enabled on the interfaces. These sessions would be processed by the control plane CPU.
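A rough Cisco IOS XE-style illustration of the two flavors (addresses, AS numbers, and template names are made up; other vendors use different syntax):

  ! Single-hop BFD: timers on the interface, enabled per BGP neighbor
  interface GigabitEthernet0/0/0
   ip address 10.1.1.1 255.255.255.252
   bfd interval 300 min_rx 300 multiplier 3
  !
  router bgp 65000
   neighbor 10.1.1.2 remote-as 65001
   neighbor 10.1.1.2 fall-over bfd
  !
  ! Multi-hop BFD: session between loopbacks, needs a template plus a BFD map
  bfd-template multi-hop MH-BFD
   interval min-tx 300 min-rx 300 multiplier 3
  !
  bfd map ipv4 10.0.0.2/32 10.0.0.1/32 MH-BFD
  !
  router bgp 65000
   neighbor 10.0.0.2 remote-as 65001
   neighbor 10.0.0.2 update-source Loopback0
   neighbor 10.0.0.2 fall-over bfd multi-hop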
For example, consider an access network scenario: if there are multiple interfaces (paths) between two directly connected routers, the BGP session can run between their loopbacks. Multiple static routes to the remote loopback are linked to BFD on the interfaces, while BGP itself is not tied to BFD. If some of the interfaces fail, BFD detects that quickly and removes the corresponding static routes; the BGP session and the end-to-end path stay up. If all links fail, the route to the BGP next hop is gone. Thanks to recursive next-hop lookup, all the BGP routes are invalidated even before BGP learns about the failure: they remain in the BGP table but are no longer in the RIB. You can also change the default next-hop-self behavior for routes received from eBGP neighbors and distribute their loopbacks into your IGP.
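A sketch of that design in Cisco IOS-style syntax (all addresses and AS numbers are invented): two parallel links carry BFD-tracked static routes toward the remote loopback, while the BGP session between loopbacks is not tied to BFD at all:

  ! BFD sessions tracking the static routes over the two parallel links
  ip route static bfd GigabitEthernet0/0/0 10.1.1.2
  ip route static bfd GigabitEthernet0/0/1 10.1.2.2
  !
  ! two static routes to the remote loopback; each is withdrawn when its BFD session fails
  ip route 10.0.0.2 255.255.255.255 GigabitEthernet0/0/0 10.1.1.2
  ip route 10.0.0.2 255.255.255.255 GigabitEthernet0/0/1 10.1.2.2
  !
  interface GigabitEthernet0/0/0
   bfd interval 300 min_rx 300 multiplier 3
  interface GigabitEthernet0/0/1
   bfd interval 300 min_rx 300 multiplier 3
  !
  ! BGP between loopbacks, with Graceful Restart, not tied to BFD
  router bgp 65000
   bgp graceful-restart
   neighbor 10.0.0.2 remote-as 65001
   neighbor 10.0.0.2 update-source Loopback0
   neighbor 10.0.0.2 ebgp-multihop 2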
It seems that NSF + GR can be used with BGP in this design while still reacting quickly to link/data plane failures. Using separate IP addresses for the data plane and the control plane, with BFD handled by the NPU/ASIC/line card, means their failure detection and handling can be done differently.
It is difficult to test BFD with simulators because the NPU and CPU are not separate, and sometimes BFD is even disabled on virtual routers. The exact behavior is platform dependent and can differ between boxes running the same NOS.