Valley-Free Routing in Data Center Fabrics
You might have noticed that almost every BGP-as-data-center-IGP design uses the same AS number on all spine switches (there are exceptions coming from people who use BGP as RIP, with AS-path length serving as hop count… but let’s not go there).
There are two reasons for that design choice:
- Default BGP AS-path filters immediately stop path hunting after a link failure;
- The same default filters give you valley-free (or non-zigzag) routing – traffic between two leaf switches never traverses another leaf switch, making traffic flows and link utilization way more predictable.
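Both effects come from the same mechanism: an EBGP speaker drops any update whose AS path already contains its own AS number. Here is a minimal sketch of that check, assuming made-up AS numbers (65000 shared by the spines, 65001 and 65002 for two leaves) and hypothetical helper names `advertise` and `accepts`; it models the default behavior only, not any vendor knob that relaxes it.

```python
SPINE_AS = 65000   # assumption: one AS number shared by all spine switches
L1_AS = 65001      # assumption: each leaf has its own AS number
L2_AS = 65002


def advertise(sender_as, as_path):
    """An EBGP sender prepends its own AS number to the AS path."""
    return [sender_as] + as_path


def accepts(local_as, as_path):
    """Default loop prevention: reject a path that already contains the local AS."""
    return local_as not in as_path


# L1 originates a prefix and advertises it to spine C2.
path_at_c2 = advertise(L1_AS, [])           # [65001]
# C2 advertises it to leaf L2.
path_at_l2 = advertise(SPINE_AS, path_at_c2)  # [65000, 65001]
# L2 advertises it to the other spine, C1 (the would-be valley).
path_at_c1 = advertise(L2_AS, path_at_l2)     # [65002, 65000, 65001]

c2_accepts = accepts(SPINE_AS, path_at_c2)    # direct path from the leaf
c1_accepts = accepts(SPINE_AS, path_at_c1)    # path that went through another spine

# With a unique AS number on C1 (say 65100) the valley path would be accepted.
c1_unique_as_accepts = accepts(65100, path_at_c1)
```

C1 never installs (or hunts through) the path that already crossed C2, because it sees its own AS number in the AS path.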
We covered the drawbacks of path hunting in the Layer-3 fabrics section of the Leaf-and-Spine Fabrics webinar… but what about the benefits or drawbacks of valley-free traffic flow?
Imagine a simple leaf-and-spine fabric (two spines, C1 and C2, and four leaves, L1 through L4) that lost the link between C1 and L1.
If the fabric routing protocol design isn’t valley-free, C1 finds numerous alternate paths to L1 through the other leaves (via L2 => C2, L3 => C2, and L4 => C2), resulting in an explosion of the forwarding table on C1 and plenty of routing protocol noise… but there’s no change in end-to-end forwarding, because the path through the valley is longer than the direct path through C2. So far so good.
What about two link failures?
In this case, a path through a valley is the only way to get from L1 to L4, so it seems like valley-free routing is not a good idea.
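To make the two scenarios concrete, here is a rough reachability sketch. It assumes a fabric with two spines (C1, C2) and four leaves (L1 to L4), and it assumes the failed links are C1-L1 in the first scenario plus C2-L4 in the second; the post does not spell those out, so treat them as illustrative. The function names `build_links` and `reachable` are made up for this example.

```python
from collections import deque

SPINES = {"C1", "C2"}
LEAVES = {"L1", "L2", "L3", "L4"}


def build_links(failed):
    """Full leaf-to-spine mesh minus the failed links."""
    links = {frozenset((spine, leaf)) for spine in SPINES for leaf in LEAVES}
    return links - {frozenset(link) for link in failed}


def reachable(src, dst, links, valley_free):
    """Breadth-first search from src to dst.

    With valley_free=True, a leaf other than the source never forwards
    transit traffic, which is what the shared spine AS number enforces.
    """
    queue = deque([src])
    seen = {src}
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        if valley_free and node in LEAVES and node != src:
            continue  # no zigzag through another leaf
        for link in links:
            if node in link:
                (peer,) = link - {node}
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
    return False


one_failure = build_links([("C1", "L1")])
two_failures = build_links([("C1", "L1"), ("C2", "L4")])

ok_after_one = reachable("L1", "L4", one_failure, valley_free=True)
ok_after_two_valley_free = reachable("L1", "L4", two_failures, valley_free=True)
ok_after_two_any_path = reachable("L1", "L4", two_failures, valley_free=False)
```

Under these assumptions L1 still reaches L4 after one failure, loses it after the second failure when valleys are forbidden, and keeps it (through L2 or L3) when they are allowed.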
As always, there’s a tradeoff, and if you haven’t identified it, you haven’t looked hard enough. If you have a fabric with two spines, valley-free routing breaks connectivity after two failures… but then you might be better off using OSPF anyway. As an alternative, you could use a spine-to-spine link to increase failure resilience, but that increases routing complexity.
It’s really hard to implement valley-free routing with OSPF anyway. More about that in another blog post.
In larger fabrics you’d probably want to use four spine switches, and it takes four strategically placed failures to break connectivity with valley-free routing, so it’s much better to focus on routing protocol convergence than on fabric partitioning.
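The "four strategically placed failures" claim is easy to check by brute force. The sketch below looks at two leaves, A and B (hypothetical names), each connected to every spine, and assumes valley-free routing means they can talk only through a spine that still has links to both of them. `min_failures_to_partition` and `count_partitioning_sets` are illustrative helpers, not anything from a real tool.

```python
from itertools import combinations


def connected(num_spines, failed):
    """Valley-free connectivity: some spine still has links to both leaves."""
    return any(
        ("A", spine) not in failed and ("B", spine) not in failed
        for spine in range(num_spines)
    )


def all_links(num_spines):
    """Every leaf-to-spine link for the two leaves we care about."""
    return [(leaf, spine) for leaf in ("A", "B") for spine in range(num_spines)]


def min_failures_to_partition(num_spines):
    """Smallest number of link failures that disconnects A from B."""
    links = all_links(num_spines)
    for count in range(1, len(links) + 1):
        for failed in combinations(links, count):
            if not connected(num_spines, set(failed)):
                return count
    return None


def count_partitioning_sets(num_spines, count):
    """How many failure sets of the given size disconnect A from B."""
    return sum(
        1
        for failed in combinations(all_links(num_spines), count)
        if not connected(num_spines, set(failed))
    )


two_spines = min_failures_to_partition(2)
four_spines = min_failures_to_partition(4)
bad_sets_of_four = count_partitioning_sets(4, 4)
```

With four spines, only 16 of the 70 possible four-link failure combinations on those two leaves break connectivity (one failed link per spine), which is why the failures have to be strategically placed.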
Takeaway recipe: If you have two spine switches, use OSPF or IS-IS (instead of turning BGP into RIP). If you have more than two spine switches and you think you need BGP as the underlay routing protocol, use the same AS number on all spines to get valley-free routing.
ipSpace.net subscribers can find way more details in the Leaf-and-Spine Fabrics webinar; if you want to add interactive discussions and mentoring to your learning process, go for the Building Next-Generation Data Centers online course.