Worth Reading: Running BGP in Large-Scale Data Centers

Here’s one of the major differences between Facebook and Google: one of them publishes research papers with helpful and actionable information, while the other uses its publications as a recruitment drive: we’re so awesome, but you have to trust us, because we’re not sharing the crucial details.

Recent data point: Facebook published an interesting paper describing their data center BGP design. Absolutely worth reading.

Just in case you haven’t realized it: Petr Lapukhov of RFC 7938 fame moved from Microsoft to Facebook a few years ago. Coincidence? I think not.

5 comments:

  1. In Russ White's presentation (from your post on May 23rd), he listed a few requirements for comparing BGP, IS-IS, and OSPF: prefix distribution, filtering, TE, tagging, vendor support, autoconfiguration, and topology visibility. The one thing I was missing was scalability.

    When I first read about BGP in the data center a few years ago, I remember people claiming that IS-IS couldn't handle the flooding when you have that many routers in your network, and that the duplicate flooding became unsustainable once you have lots of neighbors (>= 64?). But Russ didn't mention scalability at all. On the other hand, we now have four current drafts to improve IS-IS flooding (dynamic flooding, congestion control, proxy-area and 8-level-hierarchy); see the back-of-the-envelope flooding sketch at the end of this comment.

    So my question is: do people still think IS-IS doesn't scale for large DCs? And if so, can anyone give me rough numbers for where things go wrong? How many routers? How many neighbors per router? Are we talking 10k routers in an area/domain? 100k? Why are areas not feasible? Has anyone ever done any real performance measurements? (Not easy, I think.) I'd love to hear what people think (less so what rumours people heard from others). I understand that these numbers vary widely per implementation, but I'm still interested.

    I imagine my personal favorite DC design would be: 1) IS-IS for the underlay (easy configuration), 2) EVPN/BGP for the overlay (scales very well), and 3) segment routing in the data plane (it can replace VXLAN, you can do TE if you want, etc.).

    Or is segment-routing hardware still considered too expensive for large-scale DCs?
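
    To make the duplicate-flooding concern a bit more concrete, here is a quick back-of-the-envelope sketch in Python. The simplified flooding model and the fabric sizes are my own assumptions, purely for illustration; this is not a protocol simulation or a measurement.

        # Rough estimate of LSP flooding amplification in a two-level leaf-and-spine
        # fabric running a single flat IS-IS level. The flooding model and the
        # fabric sizes are hypothetical; this only illustrates why duplicate
        # flooding worries people, it is not a protocol simulation.

        def lsp_transmissions(leaves: int, spines: int) -> tuple:
            """Return (useful, total) LSP transmissions after one leaf reoriginates its LSP.

            Simplified model of standard flooding:
              - the originating leaf sends the LSP to all of its spine neighbors,
              - every spine re-floods it to every other leaf,
              - every other leaf re-floods it back toward the spines it did not
                hear it from first (those copies are pure duplicates).
            """
            useful = leaves + spines - 1                  # every other router needs exactly one copy
            first_hop = spines                            # originating leaf -> all spines
            spine_refloods = spines * (leaves - 1)        # each spine -> every other leaf
            leaf_refloods = (leaves - 1) * (spines - 1)   # other leaves -> remaining spines (duplicates)
            return useful, first_hop + spine_refloods + leaf_refloods

        for leaves, spines in [(32, 4), (256, 16), (1024, 64)]:
            useful, total = lsp_transmissions(leaves, spines)
            print(f"{leaves} leaves x {spines} spines: {useful} useful copies, "
                  f"~{total:,} transmissions ({total / useful:.0f}x amplification)")

    With 64 spines per leaf in the last example, the fabric generates roughly 120 transmissions for every copy of the LSP that is actually needed, which is exactly the redundancy the flooding-reduction drafts are trying to attack.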

  2. Great questions. I've been asking them for years, and haven't got any reasonable answers apart from "yeah, that's not really a problem until you get really big" from the proponents of the new hotness, and "we made it work with OSPF... and so did AWS" from people who focused on getting the job done.

  3. Unfortunately, AWS people never shared many details on how they did it. Maybe it's their culture of secrecy, maybe it's because everything is not as clean as it should be. It's good to know OSPF can scale, but I wouldn't bet my own network on it, only to find a big problem bringing the whole network to a halt a few weeks later. Even without AWS scale, you really don't know where the limit is. And if you run into bugs, you may not get the appropriate support from the vendor, because everybody else is busy doing it with BGP.

  4. I'm still impressed by the way the Cumulus Linux team uses BGP.

  5. I second Henk's question re IS-IS scalability in large DCs given modern hardware -- I remember raising the same question in another post last year. I reached the same conclusion as Russ White on the control-plane side: 300k routes are doable for SPF calculation on routers running modern multi-core CPUs. The only downside is the flooding. But this is where one cannot draw a blanket conclusion such as: no, IS-IS won't scale with this many prefixes. If the network is static, with very few changes, of course IS-IS can scale, thanks to the reduced flooding. In a network with frequent changes, I doubt even BGP can scale.

    Why? Because there's another part to scalability: FIB update time. TCAM update time is inherently slow, and gets dramatically slower as your prefix count grows into the hundreds of thousands; IPv6 will also be about twice as slow as IPv4. So it doesn't matter how painlessly BGP handles millions of prefixes on the control plane -- the data plane still takes a long, long time to update (see the back-of-the-envelope sketch at the end of this comment). That means the 300M-prefix case Russ brought up in the presentation is only good for a demo and won't work in practice. First of all, I'm not sure there's enough TCAM space to hold anywhere near that many routes. AFAIK, there isn't.

    I was also struck by nostalgia reading FB's BGP use case, maybe because I'm just old-school :). When was the last time we saw confederations brought up, let alone used in production? I also resonate big time with their overall philosophy, reflected in each and every step of their design process: build a web-scale DC the same way the Internet is built, with hierarchy and summarization for scale, and keep it simple, down to the bare minimum. I remember we discussed this via email the other month, Ivan.

    I feel sick whenever I hear things like this in cloud presentations: nothing scales to our needs, so we had to invent our own way. So the Internet is smaller in scale than your network? And its time-tested, scalable design principles, along with proven, mature technologies, are no good for you? What arrogance and audacity. It's good and refreshing to know there are people who still champion the old ways, building networks on solid fundamental principles instead of fanfare and hype.

    Pls keep sharing more stuff like this Ivan. Great work :))!!
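
    P.S. To put some numbers behind the FIB-programming argument above, here is a tiny Python back-of-the-envelope sketch. The per-entry programming times are assumptions I made up for illustration only; real numbers vary wildly across ASICs, SDKs, and software releases.

        # Back-of-the-envelope FIB programming time. The per-entry times below are
        # made-up assumptions for illustration only; real numbers depend heavily on
        # the ASIC, the SDK, and the software release.

        ROUTE_COUNTS = [10_000, 100_000, 300_000]

        US_PER_IPV4_ENTRY = 50    # hypothetical average time to program one IPv4 entry (microseconds)
        US_PER_IPV6_ENTRY = 100   # hypothetical: roughly 2x IPv4 because of the wider lookup key

        for routes in ROUTE_COUNTS:
            v4_seconds = routes * US_PER_IPV4_ENTRY / 1_000_000
            v6_seconds = routes * US_PER_IPV6_ENTRY / 1_000_000
            print(f"{routes:>9,} routes: ~{v4_seconds:.1f} s for IPv4, ~{v6_seconds:.1f} s for IPv6 "
                  f"to reprogram the whole FIB")

    Even with these optimistic per-entry numbers, rewriting a 300k-entry FIB takes tens of seconds, no matter how quickly the control plane (BGP or IS-IS) converged.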
