Questions about BGP in the Data Center (with a Whiff of SRv6)
Henk Smit left numerous questions in a comment referring to the Rethinking BGP in the Data Center presentation by Russ White:
In Russ White’s presentation, he listed a few requirements to compare BGP, IS-IS and OSPF. Prefix distribution, filtering, TE, tagging, vendor-support, autoconfig and topology visibility. The one thing I was missing was: scalability.
I noticed the same thing. We kept hearing how BGP scales better than link-state protocols (no doubt about that) and how you couldn’t possibly build a large data center fabric with a link-state protocol… and yet this aspect wasn’t even mentioned.
When I first read about BGP in the data center a few years ago, I remember people claiming that IS-IS couldn’t handle the flooding when you have that many routers in your network, and that the duplicate flooding was unsustainable when you have lots of neighbors (>=64?). But Russ didn’t mention scalability at all. On the other hand, we have four current drafts to improve IS-IS flooding (dynamic flooding, congestion control, proxy area and 8-level hierarchy).
So my question is: do people still think IS-IS doesn’t scale for large DCs? And if so, can anyone give me rough numbers where things go wrong? How many routers? How many neighbors per router? Are we talking 10k routers in an area/domain? 100k? Why are areas not feasible? Has anyone ever done any real performance measurements? (Not easy, I think.) I’d love to hear what people think (less what rumors people heard from others). I understand that these numbers vary largely per implementation, but I’m still interested.
I tried to get answers to very similar questions three years back and did a series of podcasts on the topic including:
- Data Center Routing with RIFT with Dr. Tony Przygienda
- OpenFabric with Russ White
- Is BGP Good Enough with Dinesh Dutt.
What I got were vague statements from Tony and Russ effectively saying “you shouldn’t have a problem until you get to an area with hundreds of switches” and Dinesh saying “very large fabrics running OSPF work just fine if you do your homework”.
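Just to put Henk’s “rough numbers” question next to the “hundreds of switches” hand-waving, here’s a back-of-the-envelope Python sketch of vanilla IS-IS flooding in a two-tier leaf-and-spine fabric. It’s my own simplistic model (no mesh groups, no dynamic flooding, split horizon toward the originator only), not anyone’s measurement:

```python
# Back-of-the-envelope estimate of IS-IS flooding in a two-tier leaf-and-spine
# fabric. Purely illustrative assumptions, not measurements.

def flooding_estimate(leaves: int, spines: int) -> dict:
    # A leaf originates an LSP update and floods it to all of its spines.
    copies_from_leaf = spines
    # Every spine re-floods that LSP to all the other leaves.
    copies_from_spines = spines * (leaves - 1)
    return {
        "routers": leaves + spines,
        "LSP copies sent per leaf update": copies_from_leaf + copies_from_spines,
        # Each remote leaf gets one copy per spine; all but one are duplicates.
        "duplicate copies per receiving leaf": spines - 1,
    }

for leaves, spines in [(32, 4), (256, 8), (512, 32)]:
    print(leaves, spines, flooding_estimate(leaves, spines))
```

The copies-per-update number grows with (leaves × spines), and real implementations add more control traffic on top of that (re-flooding attempts, PSNP acknowledgments, periodic CSNPs), so whether any of this becomes a problem is squarely an implementation question, which is exactly the part nobody wants to quantify.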
I might have a cynical conclusion or two, but I’ll try to stay diplomatic.
I would imagine my personal favorite DC design would be:
- IS-IS for the underlay. Easy configuration.
- EVPN/BGP for the overlay. Scales very well.
- Segment routing in the data plane. You can replace VXLAN, you can do TE if you want, etc.
That’s more-or-less what I’ve been saying for years (excluding segment routing ;)… and of course nobody listens, least of all vendors on a lemming run toward ever-more-convoluted BGP-over-BGP designs.
Or is segment-routing hardware still considered too expensive for large-scale DCs?
SR-MPLS is implementable on any decent merchant silicon ASIC – just take a look at Arista EOS. Most merchant silicon supported MPLS (with some limitations) a decade ago, and as hyperscalers moved WAN forwarding decisions to edge web proxies instead of WAN routers, I’m positive at least some of them use MPLS forwarding at the WAN edge.
SRv6 is a different story. Very few merchant ASICs support it (apart from the “anything can be done with programmable silicon” fairy tales). Jericho2c+ seems to be one of the exceptions… but it started sampling last September, and we have no idea what its SRv6 limitations might be. Has anyone seen it in a shipping product? Are you aware of any other merchant silicon SRv6 implementation and its limitations (segment stack depth comes to mind)? Comments are most welcome!
It's hard to get good information on the practical problems with link-state routing in "large scale data centers" (whatever you may consider "large"…). The best I have found so far (linked to from this blog IIRC) is "What I've learned about scaling OSPF in Datacenters" from Justin Pietsch (https://elegantnetwork.github.io/posts/What-Ive-learned-about-OSPF/).
There is an interesting bit regarding flooding: "Which means you have to be very careful about what is flooded, what are the areas, and how things are summarized. I didn’t understand this until I had a simulation and I could try things out with areas or without, and the effect was dramatic."
In March of this year (I think), Dinesh Dutt did a webinar on ipSpace called "Using OSPF in Leaf-and-Spine Fabrics" where he added some details on how to make areas and summarization work in this setting (there are subtle but important differences between the results of this and the RFC 7938 BGP setup).
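To illustrate why areas and summarization can have the "dramatic" effect Justin describes, here is a rough sketch with made-up numbers, assuming one area per pod that gets summarized into a single prefix (treat it as arithmetic, not a result):

```python
# Rough illustration of the effect of per-pod areas with summarization.
# Made-up numbers; assumes each pod is summarized into a single prefix.

def routes_seen_by_leaf(pods: int, leaves_per_pod: int,
                        prefixes_per_leaf: int, summarized: bool) -> int:
    if summarized:
        # Full detail from the local pod plus one summary per remote pod.
        return leaves_per_pod * prefixes_per_leaf + (pods - 1)
    # Without areas/summarization every leaf carries every prefix in the fabric.
    return pods * leaves_per_pod * prefixes_per_leaf

for summarized in (False, True):
    print("summarized:", summarized,
          routes_seen_by_leaf(pods=16, leaves_per_pod=32,
                              prefixes_per_leaf=4, summarized=summarized))
```

The same reasoning applies to how much of the topology a remote failure has to touch, which is presumably where most of the dramatic difference comes from.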
Another bit of information regarding the alleged flooding problems can be gleaned from Google paper "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network". Section 5.2.2 describes how they prevent(ed?) flooding problems (back then?). One could use IS-IS Mesh Groups as described in the informational RFC 2973 and implemented by many vendors to build something similar.
The expired OpenFabric draft from Russ White attempted to reduce flooding, which makes one wonder why he did not mention this point in his "Rethinking BGP in the Data Center" presentation.
Anyway, most "Enterprise" data centers should be small enough not to show any problems with a basic deployment of OSPF or IS-IS in the underlay. For OSPF scalability data from over two decades ago, I'd like to point at tables 2 and 3 of RFC 2329 from 1998. But please observe that this report does not mention the number of full adjacencies per router, which is the stated problem in Clos-style networks.
I'd say the classic approach of an IGP (e.g., OSPF or IS-IS) in the underlay and MP-BGP in the overlay is a simple and proven approach. I think it is easier to understand and work with than using BGP twice, with one set of sessions as the underlay "IGP" and another for the overlay. Using BGP once for a combination of underlay and overlay (the FRR way as pushed by Cumulus^WNvidia) is fine, too, if supported by your vendor.
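To make "using BGP twice" a bit more concrete, here is a sketch counting what a single leaf has to run under my assumptions (two EVPN route reflectors on the spine layer for the IGP design; per-link eBGP plus one overlay session per uplink for the BGP-only design, unless a single session carries both address families):

```python
# Illustrative control-plane session count on one leaf switch.
# Assumptions: `uplinks` spine links per leaf; two spine route reflectors for
# EVPN in the IGP design; eBGP per link plus one overlay session per uplink in
# the BGP-only design (unless a single session carries both).

def leaf_control_plane(uplinks: int, rr_count: int = 2) -> dict:
    return {
        "IGP underlay + iBGP EVPN": {
            "IGP adjacencies": uplinks,
            "BGP sessions": rr_count,
        },
        "BGP underlay + BGP overlay": {
            "underlay eBGP sessions": uplinks,
            "overlay BGP sessions": uplinks,
        },
    }

print(leaf_control_plane(uplinks=4))
```

The raw numbers are small either way; the difference is mostly in how many BGP knobs you have to get right on every box.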
In another recent webinar ("Multi-Vendor EVPN Deployments" from May) on ipSpace, Dinesh Dutt listed OSPF+BGP as the most commonly supported basis for EVPN on data center switches.
Thanks, Erik
On the question of VXLAN vs SR-MPLS: are there any reasons why VXLAN seems to be the de facto choice? Is it that white-box software, which is popular in the DC, just finds regular IP routing with VXLAN encapsulation a lot easier to work with than SR-MPLS? Do DC engineers just not like working with MPLS in the DC?
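For what it's worth, here's the raw per-packet overhead of the two encapsulations using standard header sizes (my arithmetic, ignoring VLAN tags and options); it at least shows that bandwidth overhead can't be the reason VXLAN won:

```python
# Per-packet encapsulation overhead using standard header sizes
# (illustrative; ignores VLAN tags, options, and framing differences).

def vxlan_overhead_bytes(ipv6_underlay: bool = False) -> int:
    outer_eth = 14                      # outer Ethernet header
    outer_ip = 40 if ipv6_underlay else 20
    outer_udp = 8
    vxlan_header = 8                    # 8-byte VXLAN header with 24-bit VNI
    return outer_eth + outer_ip + outer_udp + vxlan_header

def sr_mpls_overhead_bytes(label_stack_depth: int) -> int:
    return 4 * label_stack_depth        # 4 bytes per MPLS label

print("VXLAN over IPv4 underlay:", vxlan_overhead_bytes(), "bytes")       # 50
print("SR-MPLS with 3-label stack:", sr_mpls_overhead_bytes(3), "bytes")  # 12
```

Overhead obviously isn't the whole story, so the question of why VXLAN became the default is a fair one.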
I think the main reason that SRv6 seems to be stalled is that every vendor is still coming up with a different solution to the extra overhead of the IPv6 header. Juniper is pushing SRm6 while Cisco just released their SRv6 uSID. It seems like we won't know which way the industry will go for another few years, but it seems that regular SRv6 is dead.
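To put some numbers on the overhead everyone is trying to engineer away, here is a sketch comparing classic SRv6 with uSIDs. It assumes the commonly cited layout of a 32-bit uSID block with 16-bit uSIDs (six uSIDs per 128-bit SID) and outer-IPv6 encapsulation, so treat the exact numbers as illustrative:

```python
import math

OUTER_IPV6 = 40   # outer IPv6 header added by the encapsulating node
SRH_FIXED = 8     # fixed part of the Segment Routing Header
SID_SIZE = 16     # every SRv6 SID is a 128-bit IPv6 address

def classic_srv6_overhead(segments: int) -> int:
    # One 128-bit SID per segment; the SRH can be skipped for a single segment.
    srh = 0 if segments <= 1 else SRH_FIXED + SID_SIZE * segments
    return OUTER_IPV6 + srh

def usid_overhead(segments: int, usids_per_sid: int = 6) -> int:
    # Multiple 16-bit uSIDs packed into each 128-bit SID.
    sids = math.ceil(segments / usids_per_sid)
    srh = 0 if sids <= 1 else SRH_FIXED + SID_SIZE * sids
    return OUTER_IPV6 + srh

for segments in (1, 4, 8, 12):
    print(f"{segments} segments: classic SRv6 {classic_srv6_overhead(segments)} B,"
          f" uSID {usid_overhead(segments)} B, SR-MPLS {4 * segments} B")
```

That gap between the classic and compressed encodings is roughly what SRm6 and uSID are both trying to close, each with its own incompatible encoding.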
Neat. Thanks for the info.
Drivenets made a J2C+ box announcement about a month ago: https://www.prnewswire.com/il/news-releases/drivenets-network-cloud-is-first-to-support-broadcom-j2c-and-triple-network-scale-with-largest-networking-solution-in-the-market-898000194.html