Graceful Restart and Routing Protocol Convergence
I’m always amazed when I encounter networking engineers who want to have a fast-converging network using Non-Stop Forwarding (which implies Graceful Restart). It’s even worse than asking for smooth-running heptagonal wheels.
As we discussed in the Fast Failover series, any decent router uses a variety of mechanisms to detect adjacent device failure:
- Physical link failure;
- Routing protocol timeouts;
- Next-hop liveliness checks (BFD, CFM…)
Dealing with physical link failures is easy: it doesn’t make sense to pretend life is good when a link is down. Either we’re dealing with a genuine link failure, or the adjacent device experienced a problem severe enough that it can no longer pretend to be alive (power outage or a linecard blowing up come to mind). The only sane way to deal with this situation is to run regular routing protocol convergence¹.
How about Hello timeouts? It could be a genuine device failure with the physical link staying up for any of a gazillion weird reasons. It could also be a planned or unplanned restart of the remote device. In that case, and assuming the failed device advertised the Non-Stop Forwarding capability, we shouldn’t panic but follow the Graceful Restart procedures.
What happens next depends on the routing protocol.
BGP advertises the Graceful Restart capability in the BGP OPEN message. If a helper device wants to play along, it should wait for the Restart Time interval (advertised in the same BGP OPEN message) before flushing the BGP routes advertised by the failed neighbor and starting the convergence process. The default value of the restart timer on Cisco IOS XE is 120 seconds; the minimum sane value is the time it takes the remote device to recover. Regardless of the restart timer value, the helper device is in routing convergence limbo until that timer expires.
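The helper-side behavior can be sketched as a tiny state machine. This is a toy model under my own assumptions – the class and method names are illustrative and don’t come from any real BGP implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GrHelperSession:
    # Toy model of a BGP Graceful Restart helper; all names are illustrative,
    # not taken from any real BGP implementation.
    restart_time: int = 120                 # seconds, from the peer's BGP OPEN
    stale_routes: set = field(default_factory=set)
    failed_at: Optional[float] = None       # when the session to the peer dropped

    def session_lost(self, routes_from_peer, now):
        # The peer advertised GR: keep its routes as stale instead of flushing them
        self.stale_routes = set(routes_from_peer)
        self.failed_at = now

    def should_flush(self, now):
        # Routes are flushed (and convergence finally starts) only after the
        # restart timer expires without the peer coming back
        return self.failed_at is not None and now - self.failed_at >= self.restart_time

helper = GrHelperSession(restart_time=120)
helper.session_lost({"10.0.0.0/24", "10.0.1.0/24"}, now=0)
print(helper.should_flush(now=60))    # False -> still in convergence limbo
print(helper.should_flush(now=120))   # True  -> flush stale routes, converge
```

Note how `should_flush` returns `False` for the whole restart interval – that’s the convergence limbo described above.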
Conclusion: When using Graceful Restart, BGP convergence could take at least as long as it takes for the slowest device participating in this scheme to restart. The time to react to any topology changes that might have occurred in the meantime is even longer due to how BGP updates are processed when undergoing a Graceful Restart (see Graceful Restart 101 blog post for details).
OSPF starts the Graceful Restart procedure with a grace LSA (an opaque LSA) that has to be sent by the restarting device. During a planned restart, the restarting device specifies the desired timeout in the grace LSA, but at least we know what’s going on – it’s a planned procedure, not a device failure.
On the other hand, the only way for an OSPF network to survive an unplanned device failure is to ensure that the OSPF Hello timeout doesn’t expire before the failed device restarts. Should you wish to support this scenario, you’ll have a ridiculously slow-converging network no matter what.
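Back-of-the-envelope arithmetic shows why. The restart and hello values below are made up for illustration; only the 4× hello-to-dead-interval ratio is the usual OSPF convention:

```python
# Assumed numbers for illustration; only the 4x hello-to-dead-interval ratio
# is the common OSPF convention.
control_plane_restart = 90            # seconds the failed device needs to come back
hello_interval = 30                   # stretched so that the dead interval...
dead_interval = 4 * hello_interval    # ...outlasts the control-plane restart

survives_unplanned_restart = dead_interval > control_plane_restart
# The price: a genuinely dead neighbor is detected only after the dead interval
worst_case_detection = dead_interval  # 120 seconds of black-holed traffic
print(survives_unplanned_restart, worst_case_detection)
```

Stretching the dead interval to cover the restart time directly inflates the worst-case failure detection time – you can’t have one without the other.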
Summary: Non-Stop Forwarding and fast convergence go together as well as oil and ~~water~~ bricks. You could have one or the other.
---
¹ Some implementations might treat physical link failure as a cause to start the Graceful Restart (at least according to this Juniper document). Why anyone thinks that forwarding packets into a failed interface/link makes sense is beyond my comprehension.
The comment about links losing signal reminded me of the “dying gasp” DSL and Metro Ethernet equipment can send from their capacitors when they lose power. For a carrier, the difference between “CPE lost power” and “someone cut the line at an undetermined place” can make a huge difference in troubleshooting and assignment of responsibility. It also helps declare links unusable much quicker.
In the implementations I’ve seen so far, Hold Timer expiration IS NOT a valid reason to start the Graceful Restart process. If it really is just a control plane restart, the remote peer must send a BGP OPEN message with the Restart State bit set, and this must happen before the Hold Timer expires. Hence, the Hold Timer must be higher than the time your remote peer needs to restart its BGP process. Since GR is heavily used today in proprietary clustering solutions (especially in stateful devices, because why would you really want clustering in a non-stateful device…), that proprietary clustering mechanism will typically detect a real failure in a matter of seconds, and the new master will (relatively) quickly send a BGP OPEN. Hence, you can hope for a Hold Timer of 15 seconds to work fine. Point being: you are not really bound by those 120 seconds of Restart Time, because the GR process will not kick in in case of a real failure.
Combining BGP with BFD might be useful here, provided that your BFD is implemented in the forwarding plane and hence does not share fate with the control plane (the C-bit is set in your BFD packets). Then you can distinguish between a Graceful Restart and a “real” failure:
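The distinction the comment describes could be sketched like this – my own toy decision logic, not any vendor’s code; the function name and its inputs are hypothetical:

```python
def classify_neighbor_failure(bfd_down: bool, c_bit_set: bool,
                              bgp_session_down: bool) -> str:
    # With forwarding-plane BFD (C-bit set), a BFD failure means the data
    # path itself is gone, so Graceful Restart must not be honored.
    if bfd_down and c_bit_set:
        return "real failure: flush routes and converge immediately"
    if bgp_session_down and not bfd_down:
        return "control-plane restart: act as Graceful Restart helper"
    return "no action"

# Forwarding plane dead -> converge now, regardless of GR
print(classify_neighbor_failure(bfd_down=True, c_bit_set=True, bgp_session_down=True))
# BGP session gone but BFD still up -> control plane restarted, help out
print(classify_neighbor_failure(bfd_down=False, c_bit_set=True, bgp_session_down=True))
```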