Public Cloud Behind-the-Scenes Magic
One of my subscribers sent me this question after watching the networking part of the Introduction to Cloud Computing webinar:
Does anyone know what secret networking magic the Cloud providers are doing deep in their fabrics which are not exposed to consumers of their services?
TL&DR: Of course not… and I’m guessing it would be pretty expensive if I knew and told you.
However, one can always guess based on what can be observed (see also: AWS networking 101, Azure networking 101).
- They must be using overlay virtual networking to implement virtual networks. Nothing else would scale to what they need – scalability numbers achieved by products like Cisco ACI are laughable from a hyperscaler perspective (see the conceptual sketch after the links below).
- Whatever they're doing must be either too complex or too large to implement on ToR switches.
- AWS is the only one of the big three to offer bare-metal servers, and we know their magic runs in their smart NICs (as Pensando so proudly points out, as if that validated their business model). Azure seems to be using FPGAs, and Google relies on a software solution.
For more details see:
- Azure accelerated networking
- Andromeda: performance, isolation, and velocity at scale in cloud network virtualization
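To illustrate what I mean by overlay virtual networking, here's a conceptual Python sketch. I'm obviously guessing at the details; the VNI numbers, addresses, and the mapping table below are purely illustrative and don't reflect any provider's real data model.

```python
# Conceptual sketch of overlay virtual networking: tenant traffic is mapped to
# (physical host, virtual network ID) and wrapped in an outer header.
# All names and values are illustrative, not any cloud provider's data model.
from dataclasses import dataclass

@dataclass
class TenantPacket:
    vni: int          # virtual network identifier (tenant segment)
    src_ip: str       # tenant (overlay) addresses
    dst_ip: str
    payload: bytes

@dataclass
class OuterPacket:
    outer_src: str    # physical (underlay) addresses of the hosts / smart NICs
    outer_dst: str
    vni: int
    inner: TenantPacket

# Hypothetical control-plane mapping: (vni, tenant IP) -> physical host
MAPPING = {
    (5001, "10.1.1.4"): "172.16.9.21",
    (5001, "10.1.1.5"): "172.16.3.7",
}

def encapsulate(pkt: TenantPacket, local_host: str) -> OuterPacket:
    # Look up which physical host currently hosts the destination tenant IP
    remote_host = MAPPING[(pkt.vni, pkt.dst_ip)]
    return OuterPacket(outer_src=local_host, outer_dst=remote_host,
                       vni=pkt.vni, inner=pkt)

pkt = TenantPacket(vni=5001, src_ip="10.1.1.5", dst_ip="10.1.1.4", payload=b"...")
print(encapsulate(pkt, local_host="172.16.3.7"))
```

The interesting part is not the encapsulation itself but the size of the mapping table and the rate at which it changes – that's what doesn't fit into ToR hardware.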
Network load balancing and Internet-facing NAT are truly interesting. Microsoft wrote a paper describing an early implementation of their Network Load Balancer, and it’s reasonably easy to envision how the same approach could be used for NAT. I’m positive AWS is doing something similar (see the consistent-hashing sketch after the references below).
See also:
- Maglev: A Fast and Reliable Software Network Load Balancer
- Stateless datacenter load-balancing with Beamer
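To get a feel for the consistent-hashing idea behind Maglev-style load balancers, here's a minimal Python sketch based on my reading of the Maglev paper. The table size, hash functions, and backend addresses are made up; this is a toy, not anyone's production code.

```python
# Minimal sketch of Maglev-style consistent hashing (my reading of the paper,
# not any cloud provider's actual implementation).
import hashlib

M = 65537  # lookup table size; the paper uses a prime

def _h(data: str, seed: str) -> int:
    # Stand-in hash function; Maglev uses faster non-cryptographic hashes
    return int(hashlib.sha256((seed + data).encode()).hexdigest(), 16)

def build_table(backends):
    # Each backend gets a pseudo-random permutation of table slots;
    # backends take turns claiming their next unclaimed slot.
    offsets = [_h(b, "offset") % M for b in backends]
    skips = [_h(b, "skip") % (M - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [-1] * M
    filled = 0
    while filled < M:
        for i in range(len(backends)):
            # find this backend's next unclaimed slot in its permutation
            while True:
                slot = (offsets[i] + next_idx[i] * skips[i]) % M
                next_idx[i] += 1
                if table[slot] == -1:
                    table[slot] = i
                    filled += 1
                    break
            if filled == M:
                break
    return table

def pick_backend(table, backends, five_tuple: str):
    # Hash the flow's 5-tuple into the table; every load balancer instance
    # computes the same answer without sharing per-flow state.
    return backends[table[_h(five_tuple, "flow") % M]]

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
table = build_table(backends)
print(pick_backend(table, backends, "198.51.100.7:12345->192.0.2.10:443/tcp"))
```

Because every load-balancer instance computes the same lookup table, any of them can handle any packet of a flow without shared per-flow state – which is also why it's easy to envision the same approach being used for NAT.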
While you could solve load balancing with a proper combination of worker nodes and hypervisor tricks, I’m positive other complex networking services like AWS Transit Gateway run on top of the virtual networking (just like virtual machines do), but on multi-tenant bare-metal instances. For an overview of this idea, see Real Virtual Routers used in Oracle Cloud.
It seems like most everything else runs in managed VMs. It’s pretty obvious Azure application load balancing is implemented with virtual machines and a Network Load Balancer sitting in front of them, VPN gateways are supposedly Windows servers (that’s why it took 30 minutes to provision one), and even their recently introduced Route Server is just two managed VMs, probably with somewhat-privileged access to the orchestration system. AWS and Google are probably using similar approaches, or they could be using multi-tenant bare metal servers for efficiency reasons… but do you really care about implementation costs if you charge them to the customer?
Anything else? Would appreciate comments with links to insightful papers.
Ivan, in addition to the above, there are two papers from Google detailing some of their network's design principles and practices:
https://cseweb.ucsd.edu/~vahdat/papers/b4-sigcomm13.pdf
https://people.eecs.berkeley.edu/~sylvia/cs268-2019/papers/ramesh16a.pdf
In the first one, on page 4, they briefly mention their own B4 switch, which has a Clos internal architecture similar to Facebook's Six-Pack. Overall, it looks like Google makes heavy use of BGP, IS-IS, and MPLS to scale their infrastructure.
Also, correct me if I'm wrong, but surely MPLS is a viable technology for building L3 virtual networks if one doesn't want to resort to overlays, no? Overlays are complex and therefore slow; MPLS is simpler and faster. The downside of MPLS is that the more VRFs you have, the more CAM/TCAM resources are required, and that can prove prohibitive given how scarce those resources are even in modern ASICs.
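A quick back-of-the-envelope illustration of the TCAM concern (all numbers are made up):

```python
# Hypothetical per-device numbers; the point is that per-VRF forwarding state
# grows roughly with VRFs x prefixes-per-VRF, which quickly exhausts the
# scarce CAM/TCAM resources of a switching ASIC.
vrfs = 1_000                # tenant VRFs on a single device
prefixes_per_vrf = 200      # routes per tenant VRF
fib_entries = vrfs * prefixes_per_vrf
print(f"~{fib_entries:,} hardware forwarding entries")   # ~200,000 already
```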
Ah, the B4 paper... aka "look, we're so cool, we decided to become a router manufacturer". See https://blog.ipspace.net/2012/05/openflow-google-brilliant-but-not.html
As for using MPLS as the transport instead of an overlay, see https://blog.ipspace.net/2020/05/need-vxlan-transport.html
Kind regards, Ivan
Hi Ivan, after six years of working at AWS I don’t really know how it works either. For the basic principles of VPC under the hood, your subscribers might like this video. It’s a bit old, but still pretty relevant.
https://m.youtube.com/watch?v=Zd5hsL-JNY4
I've also found this paper that describes in detail how Microsoft has implemented their virtual networking platform over the years, and why they've chosen to go with the overlay/directory-service model:
https://www.usenix.org/system/files/conference/nsdi17/nsdi17-firestone.pdf
It looks like the implementations of Azure's virtual networking (detailed in the VFP paper) and GCP's (detailed in the Andromeda paper) overlap a fair bit. One thing is certain: OpenFlow, in its classic form, is unworkable and unscalable. The VFP paper hints that this is the reason NSX scales poorly (to about 1000 hosts). Both Azure and GCP had to make heavy modifications to the OpenFlow model in order to scale their infrastructure.
The overlay implementation obviously trades performance for scalability: section 5 of the VFP paper and sections 3 and 4 of the Andromeda paper give a glimpse into how slow their data planes can get as they describe the detailed architecture of the platforms. That's why Microsoft eventually decided to go with hardware offloading, using an FPGA SmartNIC -- essentially a specialized switch/router attached to a server -- to implement virtual networking with better scalability.
The directory-service model is another concept prevalent across AWS, Azure, and GCP, albeit under different names: in AWS it's called the Mapping Service, in Azure the Directory Service, and in GCP Hoverboard. They all use this service to scale their routing tables to millions of entries on the cheap, again at the cost of a performance hit, because it's done in software and requires communication with dedicated devices. Flow caching is used to improve performance, which is reminiscent of MLS back in the 90s.
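Here's a minimal Python sketch of that directory-plus-flow-cache model; the class and method names are my own invention, not anything taken from the papers.

```python
# Sketch of the directory/mapping-service idea: the data plane keeps a small
# flow cache and asks a (hypothetical) mapping service on a miss, so only the
# first packet of a flow pays the full price. Names are illustrative.
class MappingService:
    """Stand-in for AWS Mapping Service / Azure Directory / GCP Hoverboard."""
    def __init__(self, table):
        self.table = table          # (vni, tenant IP) -> physical host

    def resolve(self, vni, dst_ip):
        # In reality this is an RPC to a distributed service, hence the
        # latency hit on the first packet of every flow.
        return self.table[(vni, dst_ip)]

class HostDataPlane:
    def __init__(self, directory: MappingService):
        self.directory = directory
        self.flow_cache = {}        # (vni, dst_ip) -> physical host

    def forward(self, vni, dst_ip):
        key = (vni, dst_ip)
        if key not in self.flow_cache:           # slow path: first packet
            self.flow_cache[key] = self.directory.resolve(vni, dst_ip)
        return self.flow_cache[key]              # fast path: cached flows

directory = MappingService({(5001, "10.1.1.4"): "172.16.9.21"})
host = HostDataPlane(directory)
print(host.forward(5001, "10.1.1.4"))   # miss -> query the directory service
print(host.forward(5001, "10.1.1.4"))   # hit  -> served from the flow cache
```

Only the first packet of a flow pays for the round trip to the directory service; everything else is served from the cache, which is the MLS analogy in a nutshell.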
Overall, since the philosophy behind their virtual networks is very similar, whoever has the sanest physical design will have the best relative performance. AWS seems to be on top by far, as their physical architecture looks the sanest to me.