Do We Need LFA or FRR for Fast Failover in ECMP Designs?
One of my readers sent me a question along these lines:
Imagine you have a router with four equal-cost paths to prefix X, two toward upstream-1 and two toward upstream-2. Now let’s suppose that one of those links goes down and you want to have link protection. Do I really need Loop-Free Alternate (LFA) or MPLS Fast Reroute (FRR) to get fast (= immediate) failover or could I rely on multiple equal-cost paths to get the job done? I’m getting different answers from different vendors…
Please note that we’re talking about a very specific question: in scenarios with equal-cost layer-3 paths, do the hardware forwarding data structures get adjusted automatically on link failure (without the CPU reprogramming them), and does LFA need to be configured to make that adjustment happen?
Update history:
- 2020-11-04 10:05Z: added a few intro paragraphs to (hopefully) better explain the problem.
We know that the forwarding data structures will eventually be adjusted, and that in ECMP scenarios the adjustment happens well before the routing protocol has any chance of doing its job. There’s no doubt about that… but is the adjustment done in hardware or by the CPU?
We’re also not concerned (at the moment) with what the end result would be or how the load would be spread across the remaining links; we’d just like to know how long the temporary traffic loss might last.
This is how I always understood things should work:
- Equal-cost paths are installed in the routing and forwarding tables. They might be implemented as independent forwarding entries, or as a single forwarding entry pointing to a next-hop group (I discussed next-hop groups in the OpenFlow Deep Dive webinar).
- Once a link to a next hop fails, the corresponding entry is removed from the routing and forwarding tables (or the next-hop group is adjusted).
- If you have LFA or Fast Reroute in place, the failed next hop could be replaced with another next hop without involving a routing protocol.
- Without LFA or Fast Reroute, you just have fewer equal-cost next hops and life goes on.
After a while, the routing protocol wakes up, does its job, and adjusts the routing and forwarding table.
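Here’s a minimal Python sketch of that mechanism, purely for illustration; the NextHopGroup structure, the round-robin bucket fill, and the MD5-based flow hash are my assumptions, not any vendor’s actual data-plane implementation:

```python
# Illustrative model of an ECMP next-hop group with local "fast rehash" on member failure.
# Structure and names are hypothetical; real ASICs use fixed-size hardware bucket tables.
import hashlib

class NextHopGroup:
    def __init__(self, next_hops, bucket_count=16):
        self.next_hops = list(next_hops)      # active members, e.g. ["nh1", "nh2", ...]
        self.bucket_count = bucket_count
        self._rebuild_buckets()

    def _rebuild_buckets(self):
        # Fill the bucket table round-robin with the remaining members
        self.buckets = [self.next_hops[i % len(self.next_hops)]
                        for i in range(self.bucket_count)]

    def member_down(self, nh):
        # Local adjustment on link failure: drop the member and rebuild the buckets,
        # without waiting for the routing protocol to recompute anything
        self.next_hops.remove(nh)
        self._rebuild_buckets()

    def forward(self, flow_id):
        # Pick a bucket based on a hash of the flow (stands in for the 5-tuple hash)
        h = int(hashlib.md5(flow_id.encode()).hexdigest(), 16)
        return self.buckets[h % self.bucket_count]

group = NextHopGroup(["nh1", "nh2", "nh3", "nh4"])
print(group.forward("10.0.0.1->192.0.2.1"))    # forwarded over one of the members
group.member_down("nh3")                       # link to nh3 fails
print(group.forward("10.0.0.1->192.0.2.1"))    # still forwarded, possibly over a different member
```

The open question in the rest of this post is simply whether that member_down step happens in hardware or whether the CPU has to reprogram the buckets.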
Obviously that’s how things should work. I’m positive there are tons of implementation details involved, some of them ASIC-related, and the only way to get reliable results is to set everything up in a lab, connect a Spirent-like tester to it, and pull a cable… and then you’ll know how things work on (A) the platform you tested (B) using a specific software release with (C) specific ASICs.
However, it looks like some vendors decided LFA or FRR is the only way to go… This is what someone told my reader:
What the $vendor is telling me is that ECMP cannot provide fast-protection because the hash-buckets are reallocated by the control plane when an ECMP link fails
Wait… WTF??? The hardware can be programmed to replace an entry in a next-hop group with a standby LFA/FRR entry, but it cannot remove an entry from a next-hop group? And even if that’s true, nobody thought about using a dummy replacement entry that would point to one of the other valid next hops until the control plane wakes up… which is what LFA would end up doing anyway?
Here’s what seems to be going on in some of the platforms out there (according to my reader):
It looks like, amongst some of the high-end routers (Juniper MX960, Cisco ASR9922, Nokia 7950 XRS), the IOS-XR ECMP implementation does not provide fast-protection while Nokia and Junos do. Cisco in fact needs LFA to provide fast-protection for ECMP’ed prefixes, while Junos and Nokia don’t run LFA for ECMP’ed prefixes as they don’t need to.
LFA over ECMP’ed prefixes seems to provide (during protection, while the IGP converges) a lower grade of flow spraying compared to pure ECMP, with the uneven number of flows amongst destinations being an additional risk factor for link saturation and/or QoS threshold crossing.
Junos seems to require the forwarding policy knob ECMP-FRR (why is it in the BGP section?) on some of the platforms, while Nokia has a fast ECMP implementation by default on any platform with its uniform-failover. It’s not always clear, though, what kind of fast-protection (if any) holds for all forms of ECMP hashing:
- ECMP amongst (multipath) BGP next hops;
- ECMP amongst L3 single-links;
- L3 LAGs and/or amongst L2 links (e.g. LAG’s individual links).
It’s not always clear what their relation to LFA is either, since they could be alternatives or coexist. I have no info from Huawei as I do not have any of their boxes, but they look promising: they seem to provide the best of both worlds, where a fast IP/LDP ECMP implementation coexists with LFA, with the latter kicking in only when ECMP is no longer considered available.
To me, this area of networking needs to be polished/clarified before even thinking of moving over to SR and TI-LFA designs.
Any other real-life experience would be highly appreciated. If you KNOW the answer for any specific platform (as opposed to this is how it SHOULD be), please write a comment… and if you have a juicy dirty secret to share, send me an email and I’ll add it to this blog post as another anonymous contribution.
- ECMP links running IS-IS
- IS-IS + Segment Routing + TI-LFA
Topology: 4 x ECMP link between two routers
This is what happens:
- You lose link #1: ECMP is still available. Minimal traffic loss (<10 ms) as traffic is simply re-distributed onto the other ECMP links.
- You lose link #2: same as for link #1.
- You lose link #3: fast failover to the last remaining link. BUT since ECMP is gone, TI-LFA will now calculate a backup path.
- You lose link #4: TI-LFA will provide fast failover, assuming there is another path available.
I think the ECMP behavior is independent of the routing protocol, because it is implemented in CEF / hardware, but I have not tested it properly with LDP / BGP etc. myself.
@Jan: Thanks for the data!
However, 10 msec loss could still be caused by CPU reprogramming the ECMP buckets. Have you experienced any difference in how long the outage was based on whether you had pure ECMP or LFA on top of it?
Thank you! Ivan
One thing to note is that in a pure spine-and-leaf network, you can, or perhaps even will, get loops when a link goes down, until the routing protocol has converged.
Consider a network with two spines (S1, S2) and three leafs (L1, L2, L3). A host connected to L1 sends a packet to a host on L2. L1 decides to send it to spine S1, but unbeknownst to the leafs, the link between S1 and L2 has gone down. S1 has realized this and reprogrammed itself, so when it receives the packet that needs to go to L2, it will try to send it to one of the other leafs, L1 or L3, and hope that they will send it to the other spine, S2, since that is the remaining path to L2.
But, since L1 and L3 have not yet realized that S1 no longer has a link to L2, they may decide to send the packet back to S1. In particular, L1 will almost certainly hash the packet the same way as when it got it the first time, and send it to S1 again. Which will hash it the same way it did previously, and send it to L1. Loop.
Until the routing protocol converges a second or two later.
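Here’s a tiny simulation of that transient loop; the FIB contents are hypothetical, and the next-hop choice is deterministic to mimic a flow always hashing the same way:

```python
# Toy model of the transient loop: S1 has lost its link to L2 and reroutes via another
# leaf, but the leafs still point at S1 for L2's prefix. FIB contents are made up.
fib = {
    "L1": ["S1"],            # L1 still hashes traffic for L2's prefix towards S1
    "L3": ["S1"],
    "S1": ["L1", "L3"],      # S1 lost S1-L2 and reprogrammed itself to go via another leaf
    "S2": ["L2"],            # S2 still has its link to L2
    "L2": [],                # destination reached
}

def trace(start, max_hops=6):
    path, node = [start], start
    while fib[node] and len(path) <= max_hops:
        node = fib[node][0]              # deterministic "hash": same choice every time
        path.append(node)
    return path

print(" -> ".join(trace("L1")))          # L1 -> S1 -> L1 -> S1 ... until the IGP converges
```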
This is inherent in a pure spine-and-leaf network, and the only way to avoid it is to have a less pure network design, for example tunnels to the other spines or physical links between neighbouring spines.
Whether you need that fast rerouting in your network, or whether you can wait until the routing protocol converges, is of course a different question, and it likely depends on how often you have link failures. In the datacenter network I'm managing, where links fail almost never, we would be OK with a convergence time of a minute or more; but our OSPF converges in a second or two, so no problem there. (We still have the spines connected to each other, but for other reasons.)
@Bellman: the behavior you describe depends on whether you designed a valley-free routing topology (in which case you'll experience packet drops until the routing protocol does its job, and a few other interesting things) or not (in which case you'll experience path hunting and temporary loops).
Even LFA wouldn't help in a leaf-and-spine fabric, you'd need remote LFA to get to another spine switch.
@Jan: thanks for your contribution. It's interesting, as we're seeing a different behaviour. In our IOSXR, by default, LFA is calculated for ECMP'ed prefixes and thus it kicks in even if just one link fails (e.g. your link #1). What's your configuration? Have you somehow excluded the LFA computation for ECMP'ed prefixes via the CLI?
Yes, you can certainly configure your network so the spines drop packets instead of trying to route around the broken link.
However, will you be happier because the spine is now dropping 100% of the traffic it receives destined for L2, instead of looping (and then dropping) a fraction of that same traffic? This is in the context of wanting very quick failovers, faster than the routing protocol can react, so presumably you want outages to be as small as possible.
(And if you [generic you, not specifically you, Ivan] prefer valley-free routing, remember to consider what happens if the broken link is the one to the leaf where your monitoring and/or management stations are connected. Suddenly you have no way of managing and monitoring that spine... Unless those stations are dual-homed to two leaf routers. A physically separate management network just moves the problem to that network; or are valley-free proponents OK with a valleyed management network?)
> Even LFA wouldn't help in a leaf-and-spine fabric, you'd need remote LFA to get to another spine switch.
Yes, that was exactly the point I was trying to make! But I could obviously have been clearer about that. A pure leaf-and-spine network inherently does not have a Loop-Free Alternate from the spines. You need to break the topology somehow. Tunnels to the other spines are one way of doing that, physical links to neighbouring spines another (the one I personally like best), and so on.
(This was all a bit of a side-note to your main post. I hope I have not derailed the discussion too much.)
I have addressed some of it in one of the “between 0x2 nerds” webinars. In general, you are comparing two different techniques: IP FRR vs fast-rehash.

IP FRR (xLFA) relies on a pre-computed backup next hop that is loop-free (it could be ECMP) and is a control-plane function (the end result is eventually downloaded into hardware). It can take additional data into consideration: SRLGs, interface load, etc.

Fast-rehash is a forwarding construct, where the next hop (it could be called differently) is not a single entry but an array of entries (the ECMP bundle as downloaded by the control plane). If one of them becomes unavailable (BFD, LoS, or interface-down events), it is simply removed from the array and the hashing is updated accordingly, hence the name.

Usually you’d see LFA implemented on high-end routers; it is much more intelligent/complex and provides non-connected bypass (rLFA/TI-LFA). Fast-rehash, on the contrary, protects only connected links and doesn’t require any additional computation (ECMP alternatives are by definition loop-free); it is usually implemented in DC environments. Hope this explains it. The IP FRR RFCs are produced by the IETF RTGWG.
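A compact way to picture the two techniques JeffT contrasts, purely as an illustration (hypothetical class names, not any vendor's FIB layout):

```python
# IP FRR / LFA: the control plane pre-computes a loop-free backup next hop and installs
# it alongside the primary; on failure the forwarding plane just flips a pointer.
class LfaProtectedNextHop:
    def __init__(self, primary, precomputed_backup):
        self.primary = primary
        self.backup = precomputed_backup   # chosen by the control plane (may consider SRLGs, load, ...)
        self.active = primary

    def link_down(self):
        self.active = self.backup          # pointer flip, no computation at failure time

# Fast-rehash: the "next hop" is an array of ECMP members; a failed member is simply
# removed and the hash buckets are rebuilt from the survivors.
class FastRehashGroup:
    def __init__(self, members):
        self.members = list(members)

    def link_down(self, member):
        self.members.remove(member)        # survivors are loop-free by definition (equal cost)

    def pick(self, flow_hash):
        return self.members[flow_hash % len(self.members)]

ecmp = FastRehashGroup(["nh1", "nh2", "nh3", "nh4"])
ecmp.link_down("nh3")
print(ecmp.pick(12345))                    # traffic keeps flowing over the survivors

protected = LfaProtectedNextHop(primary="nh-A", precomputed_backup="nh-B")
protected.link_down()
print(protected.active)                    # "nh-B": the pre-installed backup takes over
```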
How about Open EIGRP for its use of Successor/Feasible Successors. LOL
@Ivan: I double checked my notes and here is what I learned during lab testing, also checking with our engineering colleagues.
Disclaimer:
- I work for Cisco's CX organization.
- Testing was done on Broadcom-based XR platforms (NCS 5500). It might actually be different on other XR platforms, since the HW implementation DOES play a role.
Hardware protection / programming for ECMP links is in fact (as written above) only activated when also activating an FRR feature (LFA, rLFA or TI-LFA). Without it, no backup path is programmed in hardware and a link-down event notifies the IGP to start (re)convergence.
With FRR enabled, it is pure hardware protection. The official convergence target is <50 ms, but it might be quicker. In testing, you can also see that not every traffic stream is affected. Most don't see any traffic loss, but a few packets were in the wrong NPU at the wrong time ;)
@Andrea: I only tested with TI-LFA. I know that the LFA / rLFA implementation is quite different from TI-LFA so it could well be that they behave differently.
@JeffT: Thanks for your contribution – much appreciated. Will definitely go and watch “between 0x2 nerds” webinar asap.
My objection is that in a link-protection only scenario the LFA solution is suboptimal if compared to ECMP/fast-rehashing.
I’ll give you an example and will provide some associated mumbling/maths. Give me a shout if it doesn't make any sense, as I am hungry for confirmation of my conjecture.
The way we think IOSXR behaves in the below scenario is the following:
=====================================================================================
Say there are D prefixes X1, ..., Xd with 5 ECMP NHs (i.e. 1, 2, 3, 4, 5) and 15 hash buckets.
In steady state we have ECMP, and the hash bucket allocation is therefore: 1,2,3,4,5,1,2,3,4,5,1,2,3,4,5
If link 3 fails, then the following should happen with IOSXR:
- for pfx X1, link 3 is protected by link 4, so the NH pointer moves from NH_GROUP [1,2,3,4,5] to NH_GROUP [1,2,4,4,5] and the pre-programmed hash bucket re-allocation must therefore be 1,2,4,4,5,1,2,4,4,5,1,2,4,4,5
- for pfx X2, link 3 is protected by link 5, so the NH pointer moves from NH_GROUP [1,2,3,4,5] to NH_GROUP [1,2,5,4,5] and the pre-programmed hash bucket re-allocation must therefore be 1,2,5,4,5,1,2,5,4,5,1,2,5,4,5
- for pfx X3, link 3 is protected by link 1, so the NH pointer moves from NH_GROUP [1,2,3,4,5] to NH_GROUP [1,2,1,4,5] and the pre-programmed hash bucket re-allocation must therefore be 1,2,1,4,5,1,2,1,4,5,1,2,1,4,5
- for pfx X4, link 3 is protected by link 2, so the NH pointer moves from NH_GROUP [1,2,3,4,5] to NH_GROUP [1,2,2,4,5] and the pre-programmed hash bucket re-allocation must therefore be 1,2,2,4,5,1,2,2,4,5,1,2,2,4,5
- for pfx X5, link 3 is protected by link 4, so the NH pointer moves from NH_GROUP [1,2,3,4,5] to NH_GROUP [1,2,4,4,5] and the pre-programmed hash bucket re-allocation must therefore be 1,2,4,4,5,1,2,4,4,5,1,2,4,4,5
- ...
- for pfx Xd, link 3 is protected by link 2, so the NH pointer moves from NH_GROUP [1,2,3,4,5] to NH_GROUP [1,2,2,4,5] and the pre-programmed hash bucket re-allocation must therefore be 1,2,2,4,5,1,2,2,4,5,1,2,2,4,5
================================================================================

Now, if I am not mistaken, calling H the number of ECMP NHs and generalising what was just shown (without delving into the maths, which I can provide but would avoid for now), we can observe that in the LFA scenario a link, say link 4, carries during link-3 protection the flows of one particular prefix group (out of the H-1 prefix groups) with weight 2/H, plus the flows of all of the other H-2 prefix groups with weight 1/H. In the ECMP scenario, that very same link 4 carries during protection the total number of flows across all of the prefix groups with weight 1/(H-1).

Comparing the two values, you can tell that the LFA protection is not robust (risking link saturation and/or QoS threshold crossing) against a disproportion in the number of flows per destination/prefix amongst the H-1 groups of ECMP'ed prefixes (e.g. in IP tunneling environments where all of the VXLAN, GTP, GRE destinations happen to be in the very prefix group that has the weight of 2/H over link 4).

Having said that, one argument I heard is that this could be mitigated if these additional flows had a relatively low per-flow bandwidth compared to the non-tunneled destinations' flows. Should that happen, though, it would mean that the ECMP per-flow load balancing in steady state (with no fault) would already be poor, as some flows (those not tunneled) would have a much higher bandwidth than the tunneled ones, which is a contradiction in terms to me. We're assuming, by the way, that flow-based load balancing within these tunnels (e.g. GTP, GRE, VXLAN, ...) is supported by the chipsets.
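A quick numeric check of that argument, with made-up flow counts (H = 5 next hops, link 3 failing, one heavily loaded prefix group); this is just my illustration of Andrea's formulas, not a measurement:

```python
# Flow-spraying comparison during link-3 protection: per-prefix LFA replacement vs pure ECMP rehash.
H = 5
# Flows per prefix group; group "backup=k" holds the prefixes whose LFA backup for link 3 is link k.
flows = {"backup=4": 1000, "backup=5": 100, "backup=1": 100, "backup=2": 100}
total = sum(flows.values())

# LFA-style protection: every group keeps 1/H of its flows on link 4, and the group whose
# backup is link 4 sends an extra 1/H there (i.e. weight 2/H for that group).
lfa_load_on_4 = sum(f / H for f in flows.values()) + flows["backup=4"] / H

# Pure ECMP fast-rehash: all flows are simply spread over the H-1 surviving links.
ecmp_load_on_4 = total / (H - 1)

print(f"LFA : link 4 carries {lfa_load_on_4:.0f} of {total} flows")   # 460
print(f"ECMP: link 4 carries {ecmp_load_on_4:.0f} of {total} flows")  # 325
# With skewed flow counts, LFA concentrates noticeably more load on link 4,
# which is exactly the saturation risk being described.
```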
Last but not least, we should mention that LFA, as opposed to ECMP, is not resilient to a double fault: if both the link and its backup link failed, LFA would actually provide… fast discard??
Cheers
Andrea
@Ivan - 10ms includes failure detection, not just reprogramming HW, so it depends on how fast the failure is detected/propagated to HW.
In a previous life, implementing IP FRR on the E/// SSR (EZ), we got HW performance tuned to 2K NH changes per 1 ms. Jericho (to my memory) would be using a SuperFEC structure and should be much faster.
@Andrea - I don't work for Cisco and can't comment on the internals of their hashing. It would also be a reasonable assumption that most (COTS) chipsets do 5-tuple hashing (not looking into the GTP TEID or VXLAN VNI).
In most implementations there's no load-based rebalancing within an ECMP group (it's sticky), and equal cost is the only key used to build the group (really basic grouping). LFA, on the contrary, is computed by the control plane and can take a number of additional points into the computation: SRLGs, common line cards, load on the link, etc.
The intent is often: give me a node-protecting LFA, and if there isn't one, a link-protecting LFA will do. Also note that the implementation of any non-directly-connected protection scheme (TI-LFA) is an order of magnitude more complex than basic LFA.
My point is: this is not an apples-to-apples comparison, there's more to it.
@Jan LFA is indeed very different from xLFA. LFA: protect interface A (optionally per prefix) by interface B (given it is loop-free within the constraints); when A goes down, flip a pointer. TI-LFA: if no local protection (e.g. LFA) is available, compute a tunnel that terminates at the PQ node and protect interface A by sending traffic into that tunnel, which is pre-instantiated in HW.
@Andrea - "Last but not least, we should mention that LFA, as opposed to ECMP, is not resilient to a double fault as if the link and its backup link failed then LFA would actually provide… fast-discard ??"
Neither technology is resilient to a double fault; control-plane convergence plays a fundamental role in recovering from the single failure, recomputing ECMP bundles or new LFAs after the failure has happened. Fast-rehash reacts to interface state changes, not routing; however, the control plane eventually converges, updates the RIB and downloads updated routes to the FIB (depending on the implementation, either the bottom part of the RIB (flattening NHs) or the top part of the FIB :-) will regroup). Further optimizations are possible; I'm trying to look at this from a generic perspective (and being a couple of years away from building routers :))
@Ivan, re the 3rd bullet point that your reader asked:
"L3 LAGs and/or amongst L2 links (e.g. LAG’s individual links)."
Cisco had a detailed write-up about the breakdown of the CEF load-balancing structure here:
https://community.cisco.com/t5/service-providers-documents/asr9000-xr-load-balancing-architecture-and-characteristics/ta-p/3124809
So it looks like, with a hierarchical FIB, three hardware lookup tables are needed for a non-LAG lookup: the FIB, which points to an RDLI/NRDLI structure, which in turn points to the adjacency table, which in turn points either to a physical OIF or to yet another indirection table -- the fourth one, the LAG table, which finally points to the OIF.
That means a maximum of four hardware lookups just to resolve the destination of a 'LAG in ECMP' entry. And that's only the destination lookup, not further processing like ACLs, QoS... which means no line-rate forwarding, at least not for smaller packets. Clearly they have to trade off performance against high availability here.
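A toy model of that indirection chain, loosely following the write-up above (all table names and values are hypothetical):

```python
# prefix -> load-distribution structure -> adjacency -> (optional LAG table) -> outgoing interface
fib = {"203.0.113.0/24": "LDI-7"}                    # 1st lookup: prefix -> load info
ldi = {"LDI-7": ["ADJ-1", "ADJ-2"]}                  # 2nd lookup: ECMP list of adjacencies
adjacency = {"ADJ-1": ("LAG", "LAG-3"),              # 3rd lookup: adjacency -> OIF or LAG
             "ADJ-2": ("PHY", "Hu0/0/0/1")}
lag = {"LAG-3": ["Hu0/0/0/2", "Hu0/0/0/3"]}          # 4th lookup: LAG member selection

def resolve(prefix, flow_hash):
    members = ldi[fib[prefix]]
    kind, target = adjacency[members[flow_hash % len(members)]]
    if kind == "LAG":
        links = lag[target]
        return links[flow_hash % len(links)]          # the extra lookup for a LAG-in-ECMP entry
    return target

print(resolve("203.0.113.0/24", 4))                   # lands on the LAG and picks a member link
print(resolve("203.0.113.0/24", 5))                   # lands on the physical interface
```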
I'm not trying to nit-pick on Cisco; I'm pretty sure other vendors can't do much better (if they can even match Cisco) -- Juniper's structure, for example, is just as lengthy. But when vendors start to claim things like line-rate performance, that's when we need to be suspicious.
@Jeff, thx a lot for sharing all this info, esp. the details of your platform!! Wrt Andrea's comment, "LFA, as opposed to ECMP, is not resilient to a double fault as if the link and its backup link failed then LFA would actually provide… fast-discard", I don't think he meant double-fault as in the often-used meaning of the term. Here he likely meant when both the protected link and the LFA link went down, then LFA would provide fast discard.
I agree with Andrea's observation, because by default both IOSXR's and Junos' LFA tiebreaking algorithms reduce the list of eligible LFAs to one. Junos' process: 1) prefer ECMP next hops, 2) prefer a backup NH that provides node protection, 3) prefer a NH that provides link protection, 4) prefer the NH closest to the destination, 5) prefer the NH closest to the PLR, and finally, when all else fails, 6) prefer the NH with the lowest system ID.
IOS-XR by default also prefers a directly connected ECMP next hop, followed by the lowest-total-metric NH, a disjoint linecard, and node protection.
The only place where Cisco's public doccos document the default selection algorithm is here:
https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst9500/software/release/17-3/command_reference/b_173_9500_cr/ip_routing_commands.html#wp2703610168
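Purely as an illustration of how a tiebreaking chain like the ones above boils down to picking a single backup, here's a sketch; the attribute names and candidate values are made up, and this is not either vendor's actual code:

```python
# Tiebreak chain expressed as a preference tuple: earlier fields dominate later ones.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    is_ecmp: bool
    node_protecting: bool
    link_protecting: bool
    metric_to_destination: int
    metric_from_plr: int
    system_id: str

def pick_lfa(candidates):
    # True sorts before False (hence the "not"); lower metrics and lower system ID win.
    return min(candidates, key=lambda c: (not c.is_ecmp,
                                          not c.node_protecting,
                                          not c.link_protecting,
                                          c.metric_to_destination,
                                          c.metric_from_plr,
                                          c.system_id))

candidates = [
    Candidate("via R2", is_ecmp=False, node_protecting=True, link_protecting=True,
              metric_to_destination=20, metric_from_plr=10, system_id="0002"),
    Candidate("via R3", is_ecmp=True, node_protecting=False, link_protecting=True,
              metric_to_destination=30, metric_from_plr=10, system_id="0003"),
]
print(pick_lfa(candidates).name)   # "via R3": an ECMP next hop wins the first tiebreak
```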
This is consistent with Jan's first comment and his test result: ECMP protection first, and when all ECMP paths fail, LFA kicks in, because by default LFA's first choice would be the ECMP paths. I'm wondering what kind of configuration Andrea had, because in his first comment he said LFA kicks in even after one of his ECMP links fails. It would be good if we could see the detailed config or output.
From reading Jan's 2nd comment, I get the impression that he only tested ECMP with xLFA though. It would be good if someone who has access to IOS-XR equipment could run a vanilla IGP with ECMP, disable all xLFA, and run show cef internal to look at the FIB structure, including the hash buckets, then enable LFA and look at the hash buckets again. That way we could tell exactly what's going on, and it's a pretty simple setup.
Granted, what we see with cef internal is the software-FIB, but that software FIB is the input to create the more simplified hardware FIB, so what we see is the real thing happening on the hardware level.
Also Jeff, re xLFA being a control-plane function and ECMP being a data-plane construct -- your 1st comment -- I don't think that's the case. LFA was meant from the beginning as a data-plane short-term quick relief, a firefighting workaround to relieve the pressure until the IGP can converge and provide the permanent solution again. That's why Andrea's 2nd comment re the instability it can cause and its lack of resilience was very much on point, because xLFA was never meant to be a long-term answer.
Re the control-plane vs data-plane thingy: after all, we can argue that ALL features within a router HAVE their roots in the control plane, since it's the control plane that sets up everything at the beginning. But in this case, xLFA is considered a data-plane construct because it was pushed into the FIB and is activated the instant an error condition is detected, with no involvement from the software aka control plane. I think we're basically saying the same thing in different ways here btw :))
Also, while xLFA, esp. rLFA and TI-LFA, is indeed more intelligent than stateless ECMP, it's intelligence that comes with a great deal of complexity, pain, and consequences. X.25 was intelligent, ATM was very intelligent, but both were probably too intelligent for their own good.
Wrt xLFA, when one looks at all the P-nodes, Q-nodes, PQ-nodes, the targeted LDP sessions, the tie-breaking algo, the amount of calculation... it can quickly overwhelm one's mind. And all this complexity will scale quickly with the size of the network. With complexity comes fragility, esp. when one has to troubleshoot it or takes over a large network that has it implemented. All of this, for a temporary backup feature.
I think that's why they want to reduce the complexity and the flow disruption with TI-LFA in Segment Routing, by having the routers calculate the post-convergence backup route(s), but a lot of the complexity is still there. EVPN multi-homing with ESI is complex, seamless MPLS is complex, but xLFA trumps both IMO. I'm not complaining or hating on xLFA; I just mean that if one strives for simplicity, ECMP might be good enough in a lot of cases ;)
Huawei DC switches can be configured to use "consistent hashing", where no re-hashing happens if one of the ECMP links fails. This may lead to a less optimal load distribution on the remaining ECMP links.
However, re-hashing still occurs if a new link is added to the ECMP group or the failed link recovers.
ECMP and FRR can be used at the same time and FRR kicks in only when all of the ECMP links fail.
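For background, here's a generic sketch of the consistent-hashing idea (not Huawei's implementation, just the textbook hash-ring construction): only the flows that were mapped to the failed member get remapped, while everything else stays put.

```python
# Generic consistent-hashing sketch: flows are mapped to the next member clockwise on a
# hash ring, so removing a failed member only remaps the flows that were using it.
import hashlib

def h(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % 2**32

def ring_lookup(members, flow_id):
    points = sorted((h(m), m) for m in members)
    flow_point = h(flow_id)
    for point, member in points:
        if flow_point <= point:
            return member
    return points[0][1]                      # wrap around the ring

members = ["link1", "link2", "link3", "link4"]
flows = [f"flow-{i}" for i in range(10)]
before = {f: ring_lookup(members, f) for f in flows}
after = {f: ring_lookup([m for m in members if m != "link3"], f) for f in flows}
moved = [f for f in flows if before[f] != after[f]]
print("remapped flows:", moved)              # only the flows that were on link3 move
```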
@JeffT re. comment of 09 November 2020 11:04
I agree with you, Jeff, that TI-LFA and LFA are much more than ECMP, as they leverage the sophisticated control-plane LFA computation, but my point is that in a link-protection-only scenario ECMP fast-rehash is way better than LFA or TI-LFA in terms of the degree of flow spraying.
@JeffT re. comment of @ 09 November 2020 11:25
As Minh Ha pointed out, I only wanted to highlight that in a link-protection-only scenario it is paradoxical that, with LFA or TI-LFA, if both the protected link and the LFA backup link went down, LFA would actually provide 'fast discard', while ECMP fast-rehash would still provide fast protection.
@Minh Ha re. comment of 10 November 2020 06:25
The link you provided is actually for IOS XE, so I am not sure it is relevant for IOS XR, but it contains some pretty interesting info: it looks like by default EIGRP prefers ECMP fast-rehash over LFA, while OSPF doesn't, as it always enforces LFA. In fact, only EIGRP has a command to disable the ECMP fast-rehash behaviour in favour of LFA when LFA is configured. Look at this link: https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst9500/software/release/17-3/command_reference/b_173_9500_cr/ip_routing_commands.html#wp1624013282
This is my understanding of that command at least - it would help greatly if someone from Cisco could shed some light on this command.
Regarding the IOSXR config, it's just as simple as what follows:
router ospf BLA
fast-reroute per-prefix
Regarding the IOSXR show command, here's an output
Totally with you when you say: "It would be good if someone who has access to IOS-XR equipment to run vanilla IGP with ECMP, disable all xLFA, and run show cef internal to look at the FIB structure, including the hashbuckets, then enable LFA, and look at the hash buckets again. That way we can tell exactly what's going on, and it's pretty simple setup."
The IOSXR behaviour is actually also in line with this IETF draft (https://tools.ietf.org/html/draft-ietf-rtgwg-bgp-pic-12#section-6.1), also written by Cisco, where it says that if LFA is on and the failure is local (e.g. interface down), then the LFA protection for the failed ECMP link is triggered instead of fast-rehashing the ECMP set. The example in the draft is about PIC-CORE, but at the end of the day the BGP next hop underneath is reachable via a set of ECMP'ed IGP next hops, which, to me, means there is no loss of generality.
Cheers
Andrea