Scaling L3-Only Data Center Networks

Andrew wondered how one could scale the L3-only data center networking approach I outlined in this blog post and asked:

When dealing with guests on each host, if each host injects a /32 for each guest, by the time the routes are on the spine, you're potentially well past the 128k route limit. Can you elaborate on how this can scale beyond 128k routes?

Short answer: it won’t.

One of the obsessions of our industry is trying to find a one-size-fits-all solution. It’s like trying to design something that could be a mountain bike today and an M1 Abrams tomorrow. Reality doesn’t work that way; even the physicists are still searching for the Grand Unified Theory of Everything, and if they ever find one, it’ll probably be so complex that you won’t have a chance of ever understanding it.

Coming back to the data centers:

  • Most data centers have a few thousand VMs, public cloud providers being the obvious exception. A few thousand routes easily fit into the forwarding tables of modern data center switches.
  • If you have more than a hundred thousand guest VMs, you have to decide what compromises you’re willing to make. There are plenty of solutions addressing various scaling aspects, including overlay virtual networking implemented on hypervisors or ToR switches.
  • If you’re thinking about deploying hundreds of thousands of containers, you should get your addressing plan right and advertise a prefix (not host routes) from every container host, as illustrated in the sketch below this list. Oh, and IPv6 would make your life way easier.
  • Finally, if you really insist on having random IP addresses spread all over your humongous data center, maybe ILA is what you’re looking for.

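Here’s a back-of-the-envelope sketch (Python, standard library only) of the route-count argument: injecting a host route per guest versus advertising a single prefix per host, plus the per-host /64 carving that IPv6 makes trivial. The fabric dimensions and the 2001:db8:1::/48 prefix are made-up illustration values, not recommendations.

```python
import ipaddress

# Hypothetical fabric: 50 leaves, 40 hosts per leaf, 200 guests (VMs/containers) per host
LEAVES, HOSTS_PER_LEAF, GUESTS_PER_HOST = 50, 40, 200
hosts = LEAVES * HOSTS_PER_LEAF

# Option 1: every guest is injected as a host route -- the spines have to carry all of them
host_routes = hosts * GUESTS_PER_HOST
print(f"host routes on the spines: {host_routes:,}")           # 400,000 -- well past 128k

# Option 2: every host advertises one prefix covering all of its guests
print(f"per-host prefixes on the spines: {hosts:,}")           # 2,000 -- a non-issue

# With IPv6, carving a /64 per container host out of a site /48 is trivial
site = ipaddress.ip_network("2001:db8:1::/48")                 # documentation prefix, example only
per_host_prefixes = list(site.subnets(new_prefix=64))
print(f"/64 prefixes available in a /48: {len(per_host_prefixes):,}")   # 65,536
print(f"first container host would get: {per_host_prefixes[0]}")
```
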
Also, if you have hundreds of thousands of guest VMs and try to find answers to design challenges on public blogs, you’ll get the design you deserve. I might be able to help you (or at least give you a few hints) after understanding your needs and requirements, but I can’t give you a generic answer.

Will Overlay Networks Help?

Andrew continued his comment with…

I've read about overlays like VXLAN but I still don't see how that would avoid the problem on the spines, as each spine would have to know about every /32, which could be on any one of the leaves below.

The whole idea of overlay virtual networking is that the transport network doesn’t see the end-user traffic and thus doesn’t need to know endpoint locations. If you deploy overlay virtual networks, regardless of whether you do that in the virtual switches or on the ToR switches, the spines won’t see the guest MAC or IP addresses. In this respect, overlay virtual networking isn’t any different from MPLS or PBB.
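
To make that a bit more concrete, here’s a toy model (plain Python, with made-up guest and VTEP addresses and a made-up VNI) of what MAC-over-IP encapsulation does to the information visible in the fabric core: the spines route on the outer tunnel-endpoint addresses and never look inside the payload.

```python
def encapsulate(guest_frame: dict, src_vtep: str, dst_vtep: str, vni: int) -> dict:
    """Wrap the original guest frame in an outer header between tunnel endpoints (VTEPs)."""
    return {
        "outer_src_ip": src_vtep,   # hypervisor/ToR loopback -- the only thing the fabric routes on
        "outer_dst_ip": dst_vtep,
        "vni": vni,                 # tenant segment identifier
        "payload": guest_frame,     # guest MAC/IP addresses travel here, opaque to the spines
    }

def spine_lookup(packet: dict) -> str:
    """A spine forwards on the outer destination only; it never inspects the payload."""
    return f"forward toward {packet['outer_dst_ip']}"

guest_frame = {"src_mac": "52:54:00:aa:bb:01", "dst_mac": "52:54:00:aa:bb:02",
               "src_ip": "10.1.1.11", "dst_ip": "10.1.1.22"}

packet = encapsulate(guest_frame, src_vtep="192.0.2.11", dst_vtep="192.0.2.22", vni=5001)
print(spine_lookup(packet))   # the spine routing table needs only the VTEP addresses
```

The same logic applies regardless of the encapsulation you pick; only the outer header format changes.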

Finally, I have to mention that you’ll learn a lot about overlay virtual networks by watching this webinar and even more about VXLAN from this webinar, and we covered L3-only networks extensively in the Leaf-and-Spine Fabric Designs webinar.

7 comments:

  1. This comment has been removed by the author.
    Replies
    1. Hi Ivan,
      Can you please elaborate more on "Oh, and IPv6 would make your life way easier."
      Thanks
    2. If you have hundreds of containers per host, it's sometimes hard to get enough IPv4 addresses (particularly if they have to be reached from the outside, assuming you're not doing the default NAT stuff with Docker). Assigning a /64 IPv6 prefix per container host is a no-brainer; there are 65,536 of them in a /48 ;)
  2. In the case of ILA, should there be a global orchestrator to allocate a unique identifier for endpoints?
    Replies
    1. Absolutely. And the mappings of the high-order 64 bits.
  3. Ad-hoc distributed allocations are also possible, up to the imagination of the implementer :)