Relative Speed of Public Cloud Orchestration Systems
When I was complaining about the speed (or lack thereof) of Azure orchestration system, someone replied “I tried to do $somethingComplicated on AWS and it also took forever”
Following the “opinions are great, data is better” mantra (as opposed to “never let facts get in the way of a good story” supposedly practiced by some podcasters), I decided to do a short experiment: create a very similar environment with Azure and AWS.
I took simple Terraform deployment configuration for AWS and Azure. Both included a virtual network, two subnets, a route table, a packet filter, and a VM with public IP address. Here are the observed times:
aws_vpc: 18s/1s aws_route_table: 2s/2s aws_subnet: 2s/1s aws_internet_gateway: 3s/24s aws_route: 2s/0s aws_security_group: 5s/1s aws_subnet: 14s ec2instance: 37s/32s
azurerm_resource_group: 1s/46s azurerm_route_table: 22s/11s azurerm_virtual_network: 35s/11s azurerm_network_security_group: 35s/11s azurerm_subnet: 4-7s/10-11s azurerm_subnet_route_table_association: 4s azurerm_public_ip: 1m0s/1m40s azurerm_network_interface: 1s/21s azurerm_virtual_machine.tf_vm: 21s/1m42s
Azure is markedly slower (apart from VM creation), and the real showstopper seems to be the public IP address. Admittedly I was looking for a ballpark figure and did a single apply/destroy cycle on each public cloud; if you feel like conducting a proper experiment spread across multiple times-of-day in multiple availability zones I’d love to see your results.
What worries me way more than the response time difference is the extreme variability of Azure response times. This is what I’ve observed when setting up and tearing down Azure Virtual Hub components (which I did do multiple times on several days in several regions):
- 15 – 25 minutes to create a virtual hub
- 3 – 15 minutes to create a virtual network connection
- 2.5 – 10 minutes to modify VNet connection route propagation parameters
- 5 – 15 minutes to destroy a virtual network connection
To make matters even worse, the large variance in response times was observed on parallel requests made by Terraform in the same Azure region. One virtual network connection would be created or destroyed in a few minutes, the other one (started at the same time) would take more than 10 minutes. Worst-case scenario I’ve heard of: 17 minutes to change an entry in a route table.
Does the difference matter? It does to me when I’m trying to figure out how things work. It might not matter if you’re making changes to your public cloud infrastructure every third blue moon… or you might care a lot if you plan to use Azure REST API for real-time dynamic modifications of your infrastructure like changing the next hop of user-defined route.
Interestingly, I got mixed messages on that last use case, from someone claiming it works great, to someone else saying he’d rather rely on the complex load balancing architecture than Azure API response times. Any further real-life experience would be highly appreciated.
History of Changes
- Added a sample real-life UDR worst-case scenario
Been there, done both, both suck :)
If I had to choose though (which I had), I'd go with the "load balancer sandwich" approach. The reason is that it fails over in a few seconds whereas the UDR change takes 1-2 minutes. Moreover, I personally don't see any win in complexity terms either as, in the case of UDR change, one still has to implement some sort of failure detection mechanism.
So, once again, I choose to put the complexity in ARM/Ansible/Terraform templates that deploy the sandwich architecture - write once, solve until the next API change :)