How Microsoft Azure Orchestration System Crashed My Demos
One of the first things I realized when I started my Azure journey was that the Azure orchestration system is incredibly slow. For example, it takes almost 40 seconds to display six routes from a per-VNIC routing table. Imagine trying to troubleshoot a problem and having to cope with a 30-second delay on every single SHOW command. Cisco IGS/R was faster than that.
If you’re old enough you might remember working with VT100 terminals (or an equivalent) connected to 300 baud modems… where typing too fast risked getting the output out-of-sync resulting in painful screen repaints (here’s an exercise for the youngsters: how long does it take to redraw an 80x24 character screen over a 300 bps connection?). That’s exactly how I felt using Azure CLI - the slow responses I was getting were severely hampering my productivity.
I thought I'd found a way to cope with it:
- Create a resource group (relatively fast);
- Create the resources I need for a particular demo (OK, we can wait a bit);
- Create the VMs as the very last part of the process (and be prepared to do some GUI Portal sleight-of-hand to keep the workshop attendees amused while the Azure gnomes are scraping together the necessary RAM, CPU and disk space);
- Delete the resource group after the end of the demo (the only sane way to clean up the gazillion objects created every time you try to do something) using an asynchronous API call so we could move on to the next demo while the resource group was being deleted.
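For readers who want to see that sequence spelled out, here is a minimal sketch of it as Azure CLI invocations driven from Python. It is not the script I used in the workshop: the resource names, the location, the image alias and the helper functions are all made up for illustration, and the sketch defaults to a dry run that only prints the commands.

```python
import subprocess


def demo_commands(group, location="westeurope"):
    """Build the az CLI calls for one demo, in the order of the list above.

    All names (demo-vnet, demo-subnet, demo-vm), the location and the
    image alias are illustrative assumptions, not values from the workshop.
    """
    return [
        # 1. Resource group first (relatively fast)
        ["az", "group", "create", "--name", group, "--location", location],
        # 2. Supporting resources next (here: one virtual network with one subnet)
        ["az", "network", "vnet", "create", "--resource-group", group,
         "--name", "demo-vnet", "--subnet-name", "demo-subnet"],
        # 3. Virtual machines as the very last step
        ["az", "vm", "create", "--resource-group", group, "--name", "demo-vm",
         "--image", "Ubuntu2204", "--vnet-name", "demo-vnet",
         "--subnet", "demo-subnet", "--generate-ssh-keys"],
        # 4. Asynchronous cleanup: --no-wait returns before the delete finishes
        ["az", "group", "delete", "--name", group, "--yes", "--no-wait"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    printed = []
    for cmd in commands:
        line = " ".join(cmd)
        print(line)
        printed.append(line)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return printed


if __name__ == "__main__":
    run(demo_commands("demo1-rg"))
```

Keep in mind that `--no-wait` only tells you the delete request was accepted, not that anything was actually deleted.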
Tried that process numerous times when developing the demos, and it always worked flawlessly (after all, why shouldn’t it - I wasn’t doing anything too weird).
Unfortunately, the demo gods weren’t in a good mood when I was running the Microsoft Azure Networking workshop last week.
The first demo worked. Nice, let’s delete the resource group and move on. Oh, wait, here’s an idea: let’s watch how things disappear from the Portal GUI. Oops, it doesn’t work. Wait a bit and refresh. Still nothing. Wait half an hour. Nothing changed. Retry the delete resource group CLI command. Doesn’t work. Retry in portal. Wait some more. No change at all. Lovely, let’s move on and skip the rest of the demos.
It took us the whole day (and numerous failed attempts) to delete the two virtual machines and associated objects needed for the first demo. In the meantime we experienced:
- Resource group (RG) being in deleting status for hours, and then reverting to successfully provisioned status (resulting in Hotel California-type jokes mentioning that they’re still charging for the services);
- Delete request failed messages generated way after I requested RG removal;
- Virtual machines being stuck in deleting state forever;
- A new rule in a network security group (NSG) not being applied to a VM NIC long after the API call adding it to the NSG succeeded.
Lesson learned: for the upcoming Azure Networking webinar I’ll pre-record the demos, publish them before the webinar, and ask the attendees to prepare the questions in case they’d like to know more about the demos.
What now?
We all know **** happens… but one would hope that would be a rare occasion. Looks like that’s not the case - some workshop attendees with hands-on Azure experience found nothing unusual (yeah, that happens every now and then). Even worse, one would expect to see the API slowdown mentioned in Azure Status History. No such luck - looks like it was business-as-usual.
I honestly don’t know any longer what to expect from a public cloud offering from a major vendor. What I’ve seen during that day looked more like an undergraduate science project than a commercial-grade product.
PS: Assuming 10 bits per character (asynchronous transmission), a 300 bps line carries 30 characters per second. An 80x24 screen holds 1,920 characters, so a full redraw takes 64 seconds, or roughly one minute.
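If you'd like to check the arithmetic or try other line speeds, here's a small sketch. The function name and its parameters are mine; the 10-bits-per-character figure is the same assumption as above (a start bit, eight data bits and a stop bit would be one way to get there).

```python
def redraw_seconds(columns=80, rows=24, line_bps=300, bits_per_char=10):
    """Time needed to repaint a full text screen over a serial line."""
    chars_per_second = line_bps / bits_per_char   # 300 / 10 = 30
    total_chars = columns * rows                  # 80 * 24 = 1920
    return total_chars / chars_per_second


print(redraw_seconds())               # 64.0 seconds at 300 bps
print(redraw_seconds(line_bps=9600))  # 2.0 seconds at 9600 bps
```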