EVPN Control Plane in Infrastructure Cloud Networking
One of my readers sent me this question (probably after stumbling upon a remark I made in the AWS Networking webinar):
You had mentioned that AWS is probably not using EVPN for their overlay control-plane because it doesn’t work for their scale. Can you elaborate please? I’m going through an EVPN PoC and curious to learn more.
It’s safe to assume AWS uses some sort of overlay virtual networking (like every other sane large-scale cloud provider). We don’t know any details; AWS never felt the need to use conferences as recruitment drives, and what little they told us at re:Invent described the system mostly from the customer perspective.
It’s also safe to assume they have hundreds of thousands of servers (overlay virtual networking endpoints) in a single region. Making BGP run on top of that would be an interesting engineering challenge, and filtering extraneous information not needed by the hypervisor hosts (RT-based ORF) would be great fun. However, that’s not why I’m pretty sure they’re not using EVPN - EVPN is simply the wrong tool for the job.
While it seems like EVPN and whatever AWS or Azure are using solve the same problem (mapping customer IP and MAC addresses into transport next hops), there are tons of fundamental differences between the environments EVPN was designed for and IaaS public cloud infrastructure.
A typical L2VPN environment has dynamic endpoints. MAC and IP addresses are discovered locally with whatever discovery mechanism is available (MAC learning, DHCP/ARP snooping…) and have to be propagated to all other network edge devices. In a properly implemented, scalable cloud infrastructure the orchestration system controls MAC and IP address assignments. There’s absolutely no need for dynamic endpoint discovery or dynamic propagation of endpoint information. Once a VM is started through the orchestration system, the MAC-to-VTEP mapping is propagated to all other hypervisor hosts participating in the same virtual network.
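To make that difference concrete, here’s a minimal sketch (all names are hypothetical, and this has nothing to do with any specific cloud provider’s implementation) of an orchestration system that already owns every MAC and IP assignment pushing a MAC-to-VTEP mapping to the other hypervisors in a virtual network when a VM is started, with no dynamic learning involved:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VtepMapping:
    """One overlay forwarding entry: tenant MAC/IP behind a transport VTEP."""
    vni: int    # virtual network identifier
    mac: str    # VM MAC address, assigned by the orchestrator
    ip: str     # VM IP address, assigned by the orchestrator
    vtep: str   # underlay address of the hosting hypervisor

class Orchestrator:
    def __init__(self):
        # vni -> set of hypervisor VTEP addresses participating in that network
        self.members: dict[int, set[str]] = {}

    def start_vm(self, vni: int, mac: str, ip: str, host_vtep: str, push):
        """Place a VM and push the new mapping to every other member host.

        `push(vtep, mapping)` stands in for whatever API call or message the
        real system would use to program a hypervisor's forwarding table.
        """
        mapping = VtepMapping(vni, mac, ip, host_vtep)
        peers = self.members.setdefault(vni, set())
        for peer in peers:
            push(peer, mapping)        # every existing member learns the new VM
        peers.add(host_vtep)

# The mapping is known and distributed before the VM sends a single packet.
orch = Orchestrator()
orch.start_vm(5001, "02:00:00:00:00:01", "10.0.0.11", "192.0.2.10",
              push=lambda vtep, m: print(f"program {vtep}: {m}"))
orch.start_vm(5001, "02:00:00:00:00:02", "10.0.0.12", "192.0.2.20",
              push=lambda vtep, m: print(f"program {vtep}: {m}"))
```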
Endpoints can move in an L2VPN environment. While moving endpoints tend to be uncommon across the WAN, they do move a lot in enterprise data centers - vMotion in all its variants (including the bizarre ones) is the most popular virtualization thingy out there. Even worse, when using technologies like VMware HA or DRS, the endpoints move without the involvement of the orchestration system. EVPN would be a perfect fit for such an environment.
In most large-scale public clouds the endpoints don’t move: the only way to get a VM off a server is to restart it, and if a server crashes, the VMs running on it have to be restarted through the orchestration system. There’s absolutely no need for an autonomous endpoint propagation protocol.
EVPN is trying to deal with all sorts of crazy scenarios like emulating MLAG or having multiple connections into a bridged network. No such monstrosities have ever been observed in large-scale public clouds… maybe because the people running them have too much useful work to do to stretch VLANs all over the place. It also helps if you’re big enough not to care about redundant server connections because you can afford to lose all 40+ servers connected to a ToR switch without batting an eyelid.
To make matters more interesting: virtual networking in infrastructure clouds needs more than just endpoint reachability. At the very minimum you have to implement packet filters (security groups), and while a True BGP Believer might want to use FlowSpec to get that done, most sane people would give up way before that.
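As an illustration of why reachability alone isn’t enough, here’s a hypothetical sketch (all field names made up) of a security-group rule as just more orchestrator-owned state that has to land on the hypervisor next to the MAC-to-VTEP mappings:

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass(frozen=True)
class SecurityGroupRule:
    """A single allow rule, as the orchestrator might hand it to a hypervisor."""
    direction: str      # "ingress" or "egress"
    protocol: str       # "tcp", "udp", "icmp", or "any"
    port: int | None    # destination port, None for any
    source: str         # CIDR the rule matches against

def permitted(rules: list[SecurityGroupRule], src_ip: str, protocol: str, port: int) -> bool:
    """Default-deny evaluation of ingress rules on the hypervisor's virtual switch."""
    return any(
        r.direction == "ingress"
        and r.protocol in (protocol, "any")
        and r.port in (port, None)
        and ip_address(src_ip) in ip_network(r.source)
        for r in rules
    )

web_sg = [SecurityGroupRule("ingress", "tcp", 443, "0.0.0.0/0"),
          SecurityGroupRule("ingress", "tcp", 22, "10.0.0.0/8")]

print(permitted(web_sg, "203.0.113.7", "tcp", 443))  # True: HTTPS open to the world
print(permitted(web_sg, "203.0.113.7", "tcp", 22))   # False: SSH only from 10/8
```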
Considering all of the above, what could be a useful control plane between a cloud orchestration system and hypervisor endpoints? Two mechanisms immediately come to mind (both are sketched after the list):
- An API on the hypervisor that is called by the orchestration system whenever it needs to configure the hypervisor parameters (start a VM, create a network/subnet/security group, establish MAC-to-VTEP mapping…)
- A message bus between the orchestration system and the hypervisors. Whoever has some new bit of information drops it on the message bus, and it gets magically propagated to all the recipients interested in that information in due time.
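Here is a rough, purely illustrative sketch of both approaches (every name is invented; real systems obviously look nothing like this). The first function models a direct API call into a single hypervisor; the second models dropping the same update on a message bus that interested hypervisors subscribe to:

```python
import json
import queue

# --- Mechanism 1: orchestration system calls an API on a specific hypervisor ---
def configure_hypervisor(host: str, operation: str, payload: dict) -> None:
    """Stand-in for the RPC/REST call the orchestrator would make against one host."""
    # In a real system this would be an authenticated HTTPS/gRPC call.
    print(f"API -> {host}: {operation} {json.dumps(payload)}")

configure_hypervisor("hv-17", "add_vtep_mapping",
                     {"vni": 5001, "mac": "02:00:00:00:00:01", "vtep": "192.0.2.10"})

# --- Mechanism 2: updates dropped on a message bus, delivered to subscribers ---
class MessageBus:
    """Toy pub/sub bus: whoever has new information publishes it on a topic."""
    def __init__(self):
        self.subscribers: dict[str, list[queue.Queue]] = {}

    def subscribe(self, topic: str) -> queue.Queue:
        q: queue.Queue = queue.Queue()
        self.subscribers.setdefault(topic, []).append(q)
        return q

    def publish(self, topic: str, message: dict) -> None:
        for q in self.subscribers.get(topic, []):
            q.put(message)          # eventually consumed by each interested host

bus = MessageBus()
hv17_inbox = bus.subscribe("vni-5001")      # only hosts in VNI 5001 care
bus.publish("vni-5001", {"mac": "02:00:00:00:00:02", "vtep": "192.0.2.20"})
print("hv-17 received:", hv17_inbox.get())
```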
Based on my experience with the speed of AWS and Azure orchestration systems I would suspect that AWS uses the former approach while Azure uses the latter.
EVPN is neither of those. Without information filtering in place, BGP is an eventually-consistent database that pushes the same information to all endpoints. Not exactly what we might need in this particular scenario.
Long story short: EVPN is an interesting bit of technology, but probably the wrong tool to implement the control plane of an infrastructure cloud that has to provide tenant virtual networks. It does get used as the gateway technology between such a cloud and physical devices though. Juniper Contrail was the first product (that I’m aware of) to use it that way, and even VMware gave up their attempts to push everyone else to adopt the odd baby they got with the Nicira acquisition (OVSDB) and switched to EVPN in NSX-T 3.0.
More information
Want to know how networking in public clouds works?
- Start with AWS Networking 101 and Azure Networking 101.
- When you’re ready for more, dive into the AWS Networking and Microsoft Azure Networking webinars.
- When you feel it’s time to truly master cloud networking, go for the Networking in Public Cloud Deployments online course.
Interested in EVPN? We have you covered - there are tons of EVPN-related blog posts, and we explored EVPN from the data center and service provider perspectives in the EVPN Technical Deep Dive webinar.
Finally, I described both NSX-V and NSX-T in the VMware NSX Technical Deep Dive webinar and compared them with Cisco ACI in the VMware NSX, Cisco ACI or Standard-Based EVPN webinar.
While I agree with your explanation of how webscale / megascale environments are using an overlay between their hypervisors, I think that EVPN could have been used as a solution in their environments as well. The main reason their infrastructures don't use it is that EVPN didn't exist, or wasn't mature enough, when they needed a solution.
Network vendors started to support EVPN around 2017. All implementations had the usual bugs and missing features, as with every new technology. Added to that, there were multiple hardware dependencies that were solved over time.
If I look at the two mechanisms you describe, setting up EVPN on hosts with multicast replication and eBGP route servers (for filtering and perhaps FlowSpec), combined with an RFC 4684 implementation, would essentially get you the same type of solution.
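To illustrate the RT-based filtering part (RFC 4684 route-target constraint), here is a purely conceptual sketch, not how any EVPN implementation actually works: a route server reflects only the EVPN routes whose route targets a given host has asked for, so hosts never see other tenants' state:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EvpnRoute:
    """Simplified EVPN MAC/IP route: endpoint plus next hop, tagged with route targets."""
    mac: str
    ip: str
    next_hop: str
    route_targets: frozenset[str]

@dataclass
class RouteServer:
    routes: list[EvpnRoute] = field(default_factory=list)
    # host -> route targets it imports (learned via RFC 4684 RT-constraint routes)
    imports: dict[str, set[str]] = field(default_factory=dict)

    def advertise(self, route: EvpnRoute) -> None:
        self.routes.append(route)

    def routes_for(self, host: str) -> list[EvpnRoute]:
        """Reflect only routes whose RTs the host imports, not the whole table."""
        wanted = self.imports.get(host, set())
        return [r for r in self.routes if r.route_targets & wanted]

rs = RouteServer()
rs.imports["hv-17"] = {"65000:5001"}
rs.advertise(EvpnRoute("02:00:00:00:00:01", "10.0.0.11", "192.0.2.10", frozenset({"65000:5001"})))
rs.advertise(EvpnRoute("02:00:00:00:00:99", "10.9.9.9", "192.0.2.99", frozenset({"65000:9999"})))
print(rs.routes_for("hv-17"))   # only the tenant-5001 route reaches hv-17
```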
Obviously I don't see these environments moving away from their well-tested and proven implementations, but I don't see EVPN as inherently being the wrong tool for the job.
If anybody is interested, this presentation from re:Invent 2017 explains a bit of how they do it: https://youtu.be/8gc2DgBqo9U
Great write-up Ivan, full of technical details! Looks to me like the bigger point of your post is that big cloud providers like AWS use the as-simple-as-possible design approach, minimizing the number of protocols and the amount of state to be kept in the core, which is the right way to scale a network to that size. That aligns perfectly with what Justin Pietsch said in his multicast post: the only things he wants to deal with in his ideal network are IPv4, OSPF, and BGP. A close-to-stateless core, with all complexity pushed to the edge. So naturally they don't need all the nuts and bolts of EVPN, which tries very hard to be a lot of things to a lot of people.
Re their Another Day, Another Billion Flows/Packets presentations: I've watched several of them and IMO they keep spewing the same nonsense year in, year out, all in a very shallow manner. Take the one posted by Damien above. At 15:17, the presenter said AWS looked at MPLS among others, and found that even MPLS couldn't scale to their needs. And then they go on about their gigantic mapping service from around 21:30 onward. What rubbish! MPLS is, in essence, tag switching; EVPN, too, falls under the umbrella of BGP MPLS VPN technologies and so is essentially MPLS in all but name, but that's another story. Anyone who makes use of tag switching these days is effectively using MPLS one way or another. And I'm pretty damn sure tagging is what they use in their 'gigantic mapping service' table. Call it what they like; it doesn't change the nature of the beast.
IMHO, MPLS, using hierarchical methods like CsC or Inter-AS option C, can be as scalable as you want it to be, and at the same time not complicated. Look at cellular networks with more than 150k pieces of DCE equipment running MPLS. MPLS within a routing domain is only limited by the underlying routing protocol. And if AWS can run their networks using a second-best OSPF implementation on old hardware, there's no reason why a network running the more scalable IS-IS, with a good implementation and a good design of course, couldn't scale into the tens of thousands of nodes in one area, and hundreds of thousands in a multi-level domain, on current hardware.
I don't know how they can claim their AWS mapping service is very fast. From their presentation, it kinda reminds me of good old LANE, with some twists. Basically a central controller/routing system composes all of the forwarding entries, then distributes them to the hypervisors' caches for fast lookup. Sounds a lot like OpenFlow too, minus its crappy aspects of course ;). Unless the mapping lookup is done in hardware, it can't be as fast as they claim it to be. Maybe they do use TCAM in their Smart-NIC router :p. If so, they likely use some form of tagging to minimize the state needed to represent each TCAM entry in order to scale, so tag switching again.
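For what it's worth, the caching behavior described above could be as simple as this toy sketch (purely conceptual, with made-up names, and nothing to do with AWS's actual implementation): the hypervisor answers from its local cache and falls back to the central mapping service on a miss:

```python
class MappingService:
    """Central system of record: destination IP -> hosting VTEP."""
    def __init__(self, entries: dict[str, str]):
        self.entries = entries

    def lookup(self, ip: str) -> str | None:
        return self.entries.get(ip)

class HypervisorCache:
    """Local cache on the hypervisor; misses go to the mapping service."""
    def __init__(self, service: MappingService):
        self.service = service
        self.cache: dict[str, str] = {}

    def resolve(self, ip: str) -> str | None:
        if ip not in self.cache:               # slow path: ask the central mapper
            vtep = self.service.lookup(ip)
            if vtep is None:
                return None
            self.cache[ip] = vtep
        return self.cache[ip]                  # fast path: local lookup

svc = MappingService({"10.0.0.11": "192.0.2.10"})
hv = HypervisorCache(svc)
print(hv.resolve("10.0.0.11"))   # first lookup: cache miss, fetched and cached
print(hv.resolve("10.0.0.11"))   # second lookup: served from the local cache
```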
We do have services running on AWS EC2, and the overall speed over long periods is far from exceptional. One of the services, where students use Citrix to RDP into EC2 instances and run engineering packages like Ansys, Matlab, ArcGIS..., is slow. In one recent instance, an ArcGIS operation took 18 hours to complete, to the point of being unworkable for our customer. Running the same operation on a local laptop took 10 minutes. I was there, so it was not made up at all.