Project Calico: Is It Any Good?
At least a dozen engineers sent me emails or tweets mentioning Project Calico in the last few weeks – obviously the project is getting some real traction, so it was high time to look at what it’s all about.
TL&DR: Project Calico is yet another virtual networking implementation that’s a perfect fit for a particular use case, but falters when encountering the morass of edge cases.
What is Project Calico? It’s a virtual networking implementation for Linux, targeting high-density virtualization and container (Docker) environments. It comes with a Neutron plug-in (and a few other plug-ins), making it immediately usable in OpenStack deployments.
How does it work? The virtual network is modeled as a single microsegmented flat VLAN with a shared IP address space; iptables (the packet filter built into Linux hosts) is used to implement tenant separation.
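Here’s a minimal Python sketch of what per-workload packet filtering conceptually looks like; the tenant names and addresses are made up, and Calico’s per-host agent obviously builds its rules differently:

```python
# Rough illustration only (all names and addresses are made up): generate
# per-workload iptables commands that permit intra-tenant traffic and drop
# everything else.

tenants = {
    "tenant-A": ["10.0.0.11", "10.0.0.12"],
    "tenant-B": ["10.0.0.21"],
}

def rules_for(workload_ip, tenant):
    """Return iptables commands isolating one workload to its own tenant."""
    peers = [ip for ip in tenants[tenant] if ip != workload_ip]
    rules = [f"iptables -A FORWARD -s {peer} -d {workload_ip} -j ACCEPT"
             for peer in peers]
    rules.append(f"iptables -A FORWARD -d {workload_ip} -j DROP")   # default deny
    return rules

for rule in rules_for("10.0.0.11", "tenant-A"):
    print(rule)
```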
However, the creators of Calico know that a single flat VLAN doesn’t scale, so they added an interesting twist, almost identical to my IPv6 microsegmentation ideas (a toy model of the resulting route distribution follows the list):
- Every Linux host is an IP router running BGP;
- Virtual machines (or containers) are attached to a virtual router, not a bridge (similar to what Juniper Contrail or Microsoft Hyper-V are doing);
- Host routes are used for end-to-end packet forwarding between customer endpoints and distributed between Linux hosts with BGP;
- BGP route reflectors (also running on Linux hosts) are used to build a scalable routing control plane.
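To make the architecture a bit more tangible, here’s a toy Python model of that route distribution; the addresses are made up, and the real thing obviously runs a proper BGP daemon on every host instead of a ten-line route reflector class:

```python
# Toy model of the control plane described above: every host originates /32
# routes for its local workloads, and a route reflector copies them to every
# other host.

class RouteReflector:
    def __init__(self):
        self.clients = []                 # participating hypervisor hosts

    def register(self, host):
        self.clients.append(host)

    def advertise(self, prefix, next_hop):
        for host in self.clients:         # reflect the route to every other client
            if host.loopback != next_hop:
                host.rib[prefix] = next_hop

class Host:
    def __init__(self, loopback, rr):
        self.loopback = loopback          # host address used as BGP next hop
        self.rib = {}                     # prefix -> next hop
        self.rr = rr
        rr.register(self)

    def add_workload(self, ip):
        self.rib[f"{ip}/32"] = "local"    # connected route toward the vRouter interface
        self.rr.advertise(f"{ip}/32", self.loopback)

rr = RouteReflector()
h1, h2 = Host("192.0.2.1", rr), Host("192.0.2.2", rr)
h1.add_workload("10.0.0.11")
print(h2.rib)                             # {'10.0.0.11/32': '192.0.2.1'}
```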
This architecture retains the properties of a flat layer-2 network (it’s easy to move customer IP addresses between hosts), but scales much better – the number of active endpoints on the layer-2 segment is equal to the number of physical hosts, not the number of virtual machines or containers.
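A quick back-of-the-napkin comparison (with made-up numbers) shows how much smaller the transport-VLAN tables get:

```python
# Made-up numbers, just to illustrate the difference in L2 table sizes
hosts, vms_per_host = 40, 50
print("MAC/ARP entries with bridged VMs:     ", hosts * vms_per_host)   # 2000
print("MAC/ARP entries with per-host routing:", hosts)                  # 40
```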
The flooding across the physical layer-2 network is also reduced to a minimum. Unless you want to deploy IP multicast across customer workloads, you won’t see much more than ARP requests between physical hosts in the transport VLAN (it’s highly unlikely that a hypervisor host stays silent long enough to trigger unknown unicast flooding).
The Obvious Problems
Project Calico uses a single microsegmented network, resulting in the obvious drawbacks of microsegmentation:
- Single forwarding domain, which makes service insertion way harder to implement than it would be with multiple forwarding domains. You could (in theory) use iptables on Linux hosts to implement service insertion with PBR, but I’m positive I don’t want to see the results or be involved in the ensuing troubleshooting session;
- Provider-coordinated IP addresses. When multiple tenants share the same IP address space, someone has to coordinate the addresses, which means that you cannot simply migrate your existing workload into such a cloud (unless you’re running new-age applications that rely exclusively on DNS and mechanisms like service registration and discovery);
- No overlapping IP addresses. The creators of Project Calico recognized the problem and proposed a “solution” using 464XLAT (emulating end-to-end IPv4 with a sequence of NAT46 and NAT64 translations). I definitely wouldn’t want to be anywhere near a cloud that uses this approach (see also RFC 1925, section 2.3); a sketch of the underlying address-collision problem follows this list.
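In case you’re wondering why overlapping addresses can’t work in this architecture, here’s a minimal sketch (with hypothetical addresses) of the collision you’d hit in a single shared forwarding table:

```python
# Minimal illustration (hypothetical addresses) of why a shared IP address
# space rules out overlapping tenant addresses: the single forwarding table
# is keyed by destination IP alone, so two tenants using 10.0.0.5 collide.

fib = {}                                   # destination prefix -> egress host

def install(prefix, next_hop):
    if prefix in fib and fib[prefix] != next_hop:
        raise ValueError(f"{prefix} already points to {fib[prefix]}")
    fib[prefix] = next_hop

install("10.0.0.5/32", "192.0.2.1")        # tenant A's workload
try:
    install("10.0.0.5/32", "192.0.2.2")    # tenant B reuses the same address
except ValueError as err:
    print("collision:", err)
```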
If you’re willing to accept these limitations and offer cloud services that force the tenants to use provider-supplied IP addresses (Amazon figured out years ago that they cannot do that), Project Calico might be a perfect fit for your private or public cloud… unless you want to build a scalable environment.
The Big Problems
The router-in-hypervisor approach is definitely the right way to go (it’s used by reasonably-scalable networks like Amazon VPC), but the flat VLAN transport between hypervisor hosts kills the beauty of the concept for environments that need more than one or two switches – you’re forced to deploy fragile kludges like MLAG, proprietary layer-2 fabrics, or concoctions that would make MacGyver proud.
Alternatively, you could use a real IP fabric and dump all the host routes into the physical switches by running BGP between hypervisor hosts and ToR switches… and kill scalability, because you’d become dependent on the L3 table sizes of those switches.
Finally, you might decide to do route summarization on the ToR switches (resulting in an architecture very similar to my microsegmentation ideas), which seems to be the next step the Project Calico engineers are thinking about. However, then you’d lose the ability to migrate VMs… unless you get really smart and advertise out-of-rack host routes while summarizing the in-rack routes (a rough sketch of that idea follows).
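Here’s a rough Python sketch of that last idea (all prefixes and the helper function are made up for illustration): the ToR advertises its rack aggregate plus host routes only for workloads that migrated in from another rack, keeping the spine tables small:

```python
from ipaddress import ip_address, ip_network

def tor_advertisements(rack_aggregate, local_workloads):
    """Prefixes a ToR would advertise toward the spine (illustration only)."""
    aggregate = ip_network(rack_aggregate)
    adverts = [rack_aggregate]                   # covers workloads native to this rack
    for ip in local_workloads:
        if ip_address(ip) not in aggregate:      # VM that migrated in from another rack
            adverts.append(ip + "/32")           # longest-prefix match wins elsewhere
    return adverts

# Rack 1 owns 10.1.0.0/24; 10.2.0.99 migrated in from rack 2
print(tor_advertisements("10.1.0.0/24", ["10.1.0.5", "10.1.0.6", "10.2.0.99"]))
# ['10.1.0.0/24', '10.2.0.99/32']
```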
Can It Be Fixed?
Of course it might be possible to fix the Project Calico architecture to support true multi-tenancy and multiple routing domains:
- Use overlay virtual networks to decouple hypervisor-based packet forwarding from the transport network;
- Replace simple BGP with MPLS/VPN or EVPN (a conceptual sketch of the resulting per-tenant forwarding state follows this list).
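Conceptually (and this is just an illustration with hypothetical identifiers, not anyone’s actual data structures), both fixes boil down to keying the forwarding state by tenant as well as destination, and carrying a per-tenant ID in the encapsulation:

```python
from typing import NamedTuple

class OverlayNextHop(NamedTuple):
    remote_host: str    # transport (underlay) address of the egress hypervisor
    vni: int            # per-tenant virtual network ID carried in the encapsulation

# Forwarding state keyed by (tenant, destination): overlapping addresses no longer clash
vrf_fib = {
    ("tenant-A", "10.0.0.5/32"): OverlayNextHop("192.0.2.1", 5001),
    ("tenant-B", "10.0.0.5/32"): OverlayNextHop("192.0.2.2", 5002),
}

print(vrf_fib[("tenant-A", "10.0.0.5/32")])
print(vrf_fib[("tenant-B", "10.0.0.5/32")])
```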
If this description sounds a lot like an existing product, you’re absolutely right. Fixing Project Calico to address the not-so-insignificant corner cases would turn its architecture into something very close to Juniper Contrail.
Unfortunately, I don’t expect to see anything along these lines anytime soon. It’s relatively easy to take existing open-source components and add some glue; writing a proper multi-tenant control plane (using BGP or something else) from scratch is a totally different ballgame.
For another perspective on these same issues, read the blog post by Christopher Liljenstolpe, the original architect of Project Calico.
Comments
I do think scalability could be limited with all the host routes being advertised into the core switching environment. As you stated, summarization at the ToR could be an answer.
You note that you have a concern with IP forwarding in Linux. Would this be similar to the performance of an in-kernel VXLAN gateway in a hypervisor such as ESX?
All in all, I thought it was a clever solution to avoid an overlay network deployment. For my enterprise, though, it is not going to work in the near future since we have standardized on VMware.
Is Calico basically host routing per tenant with some proxy ARP on the vRouter? It would be great if a network diagram could be added to this article to make it clearer.
Does Amazon VPC do the same thing, along with some type of MPLS/VPN or EVPN?