A while ago I explained why OpenFlow might be the wrong tool for some jobs, and why a centralized control plane might not make sense, and quickly got misquoted as saying “controllers don’t scale”. Nothing could be further from the truth: properly architected controller-based architectures can reach enormous scale, and Amazon VPC is the best possible example.
Totally unrelated note to bloggers: please don’t use marketing whitepapers disguised as technical documents as counterarguments in technology-focused discussions.
Cloud Orchestration System as Overlay Virtual Networking Controller
The orchestration system in an IP-aware IaaS cloud architecture has all the information we need to set up forwarding entries in an overlay virtual networking implementation:
- Hypervisor-to-VTEP (transport IP address) mapping
- VM-to-hypervisor or container-to-host mapping
- MAC-to-VM or MAC-to-container mapping
- IP-to-VM or IP-to-container, and consequently IP-to-MAC (ARP) mapping
- Subnets and other connectivity needs of individual tenants
- Security requirements of individual VMs and tenants
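Conceptually, turning these mappings into forwarding entries is just a series of table joins. Here is a minimal Python sketch of that idea; all host names, VM names, and addresses are made up for illustration:

```python
# Mappings the orchestration system already has (hypothetical data).

# Hypervisor-to-VTEP (transport IP address) mapping
vtep_ip = {"hv1": "10.0.0.1", "hv2": "10.0.0.2"}

# VM-to-hypervisor mapping
vm_host = {"vm-a": "hv1", "vm-b": "hv2"}

# MAC-to-VM and IP-to-VM mappings
mac_vm = {"02:00:00:00:00:0a": "vm-a", "02:00:00:00:00:0b": "vm-b"}
ip_vm = {"192.168.1.10": "vm-a", "192.168.1.11": "vm-b"}

def forwarding_entry(mac):
    """Derive a MAC-to-VTEP forwarding entry by joining the tables."""
    vm = mac_vm[mac]
    host = vm_host[vm]
    return (mac, vtep_ip[host])

def arp_entry(ip):
    """Derive an IP-to-MAC (ARP) entry the same way."""
    vm = ip_vm[ip]
    mac = next(m for m, v in mac_vm.items() if v == vm)
    return (ip, mac)

print(forwarding_entry("02:00:00:00:00:0b"))  # ('02:00:00:00:00:0b', '10.0.0.2')
print(arp_entry("192.168.1.10"))              # ('192.168.1.10', '02:00:00:00:00:0a')
```

No extra protocol machinery is needed to build these entries; the orchestration system can compute them from data it maintains anyway.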
Dynamic/floating IP addresses and VM mobility might introduce some hiccups into this rosy picture, but let’s ignore them for a moment.
Some cloud orchestration systems push this information straight into the hypervisors (example: Hyper-V System Center Virtual Machine Manager). More scalable architectures replace a single instance of the orchestration system with a scale-out controller cluster relying on a back-end database (probably what Amazon VPC and Azure are using). For an extra boost in scalability, replace the transactional back-end database with an eventually consistent distributed database, which is usually good enough in large-scale UDP clouds (don't tell me I just reinvented MPLS/VPN - I'm well aware of that analogy ;).
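To see why eventual consistency is usually good enough, consider a hypervisor-side agent that caches mapping lookups: a stale entry causes at worst a transient misdelivery that gets corrected on the next refresh, not a connectivity meltdown. A hedged sketch (the lookup API and TTL value are hypothetical, not any vendor's actual interface):

```python
import time

class MappingCache:
    """Hypervisor-side cache in front of an eventually consistent
    mapping service. Entries are refreshed lazily; briefly stale
    answers are tolerated, much like route churn in MPLS/VPN."""

    def __init__(self, lookup_fn, ttl=30.0):
        self.lookup = lookup_fn      # queries the back-end database
        self.ttl = ttl               # how long a cached entry is trusted
        self.cache = {}              # mac -> (vtep_ip, fetched_at)

    def vtep_for(self, mac):
        entry = self.cache.get(mac)
        now = time.monotonic()
        if entry and now - entry[1] < self.ttl:
            return entry[0]          # possibly stale, but fast
        vtep = self.lookup(mac)      # back end may itself be slightly stale
        self.cache[mac] = (vtep, now)
        return vtep

# Usage with a stand-in back end:
backend = {"02:00:00:00:00:0a": "10.0.0.1"}
cache = MappingCache(lambda mac: backend.get(mac))
print(cache.vtep_for("02:00:00:00:00:0a"))  # 10.0.0.1
```

The design choice worth noticing: correctness does not depend on every hypervisor seeing every update instantly, which is precisely what lets the back end scale out.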
Other implementations use more convoluted approaches, from layered controllers (example: NSX controller for OpenStack) to centralized control planes (example: Cisco Nexus 1000V). Layered controllers add complexity, but still perform remarkably well as long as they stay on the management plane. The moment a controller starts dealing with the real-time aspects of control- or data plane, its scalability plummets.
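Back-of-the-envelope arithmetic shows why: management-plane work scales with provisioning events, while real-time control-plane work scales with traffic. Assuming a purely hypothetical workload of 100 hosts, 50 VMs per host, and 10 new flows per VM per second, a controller that must touch every first packet of a flow handles tens of thousands of events per second:

```python
# Hypothetical workload - the numbers are illustrative, not measured.
hosts = 100
vms_per_host = 50
new_flows_per_vm_per_sec = 10

# Management-plane controller: load proportional to provisioning churn
vm_churn_per_sec = 1           # say, one VM created or moved per second
mgmt_events_per_sec = vm_churn_per_sec

# Reactive control-plane controller: every new flow punts to it
ctrl_events_per_sec = hosts * vms_per_host * new_flows_per_vm_per_sec

print(mgmt_events_per_sec)     # 1
print(ctrl_events_per_sec)     # 50000
```

Four to five orders of magnitude separate the two workloads, which is why staying out of the real-time path matters more than raw code quality.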
A Few Data Points
What I wrote above should be common sense to anyone who has spent time researching or implementing large-scale networking architectures. Do we see the same trend in real-life implementations? Here are some data points from well-known commercial products.
Products that stay out of the control- and data plane:
- A cluster of three NSX controllers can manage up to 3000 hosts, a cluster of five controllers up to 5000 hosts (supported numbers given in NSX release notes are lower, but you get the idea);
- A single System Center Virtual Machine Manager instance (using the Hyper-V PowerShell API) can manage up to 400 hosts (this is not a hard number);
- VMware virtual distributed switch (vDS) can span 1000 hosts with vSphere 5.5 (350 in vSphere 5.1);
Products with centralized control plane:
- Cisco Nexus 1000V VSM can control 128 hosts in recent releases (64 hosts in older releases);
- Last I heard ProgrammableFlow controller controls up to 200 switches.
I never got the maximum number of Hyper-V virtual switches the PF6800 controller supports; the online brochure has zero technical details, and the documentation is still not public.
Comparing vDS and Nexus 1000V maximums is particularly entertaining. You could believe that:
- VMware understands networking better than Cisco does;
- VMware programmers write better networking code than Cisco’s programmers;
- VMware cares more about the scalability of virtual networking than Cisco does
… or you could accept the fact that there are some fundamental architectural differences between the two products that affect scalability. Do I need to say more?
Check out my cloud computing webinars – you can buy them individually or in a bundle, or get access to all of them with the yearly subscription. I’m also available for short online consulting sessions.