Rant: Cloudy Snowflakes
I could spend days writing riffs on some of the more creative (in whatever dimension) comments left on my blog post or LinkedIn¹. Here’s one about the uselessness of network automation in cloud infrastructure (take that, AWS!):
> If the problem is well known you can apply rules to it (automation). The problem with networking is that it results in a huge number of cases that are not known in advance. And I don’t mean only the stuff you add/remove to fix operational problems. A friend in one of the biggest private clouds was saying that more than 50% of transport services are customized (a static route here, a PBR there etc) or require customization during their lifecycle (e.g. add/remove a knob). Telcos are “worse” and for good reasons.
Yeah, I’ve seen such environments. I had discussions with a plethora of people building private and public (telco) clouds, and summarized the few things I learned (not many of them good) in the Address the Business Challenges First part of the Business Aspects of Networking Technologies webinar.
I usually received interesting feedback from that presentation, ranging from “I wish my CIO was in the audience” to “I wish we heard all this before we burned a few millions”. Anyway, one of the slides in that presentation (also repeated in at least two other cloud-related webinars) says:
> Clouds are all about on-demand services and orchestration
I would be glad if anyone could explain to me how you can provision on-demand services through an orchestration system if your infrastructure needs constant ad-hoc manual tweaking, which hopefully triggers a change review process and implementation in a maintenance window.
No takers? OK, I’ll spoil the fun.
I was told a huge multinational software vendor² once launched a public cloud that used physical firewalls³. They didn’t dare automate the firewall configuration – after all, firewalls might crash while being configured every other blue moon, and that would bring down the whole “cloud” – and deployed a very efficient process to modify security rules:
- You could modify security rules through the orchestration system
- The system would log your request and open a ticket
- They would schedule a maintenance window
- During the maintenance window, someone would copy-paste the firewall configuration generated by the orchestration system (I hope) into the firewall
- You’d get an email saying “your changes have been implemented”
The process definitely meets the “on-demand” and “orchestration system” requirements, although in a slightly more creative way than one might expect. For whatever weird reason, that public cloud never took off⁴.
Anyway, back to the original comment. What the automation denier used as a counterexample isn’t a private cloud. It’s a badly managed, large-scale server virtualization infrastructure hell. You can’t call your concoction a cloud if you can’t even define the services you’re about to offer.
But wait, it gets better:
> “more automation”, “fix your pipelines”, “fix your culture”, “FAANGs do it” is the usual reaction but i doubt it takes people 10 years (or more since this conversation started) to learn to write yaml templates and playbooks :)
The “write YAML templates and playbooks” derision is what you get when you have a bunch of hipsters evangelizing an otherwise sane concept – all we’ve heard from the most vocal proponents of network automation has been “ANSIBLE!!!” or “everyone will become a Python programmer”⁵.
That works great in a science-project proof of concept; real automation projects require much more, starting with an understanding of the problem you’re trying to solve and an optimal architecture for the whole system. We tried to address that in the Building Network Automation Solutions online course, and hundreds of attendees created successful automation projects based on the knowledge gained in that course – at the very least, we have real-life proof that it’s not impossible to automate significant parts of large enterprise networks.
Finally, you know how much I adore people claiming you should use the same approach as some FAANG (a great take on why that’s stupid), but they did solve problems you’ll never experience, and while some of those solutions are clearly over-engineered, it’s always worth figuring out what made their networks work⁶. In the case of public clouds:
- They have a stable transport infrastructure that provides an optimal end-to-end transport service.
- They implemented overlay networking on top of that, and a server- or VM-based interface between the overlay networks and the physical world.
- All customer-related tweaks are configured within the tenant networks.
- Tenant networks are implemented in hypervisor virtual switches and don’t affect the infrastructure at all.
- Tenants can configure their networks with a point-and-click GUI, API, automation tools, CLI, infrastructure-as-code tools, or cats jumping on a keyboard (see the sketch after this list). The infrastructure team doesn’t care.
- The infrastructure team is responsible for the proper operation of the infrastructure; the tenants are responsible for the results of their actions. AWS published a wonderful shared responsibility model document describing that.
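To make the self-service bullet concrete, here’s what “the infrastructure team doesn’t care” looks like in practice: a tenant provisioning their own virtual network and firewall rule through the cloud API. This is a minimal sketch using AWS and boto3 purely as an example (the region, CIDRs, and the rule are made up); every serious cloud platform offers an equivalent API.

```python
# A tenant provisioning their own networking through the cloud API.
# Nothing here touches the physical infrastructure. Illustrative sketch
# only; region, CIDRs, and ports are made-up example values.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

# Create an isolated tenant network (a VPC) and a subnet inside it
vpc = ec2.create_vpc(CidrBlock="10.42.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]
ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.42.1.0/24")

# The "firewall rule" equivalent: a security group with one inbound rule.
# No ticket, no maintenance window, nobody copy-pasting into a firewall.
sg = ec2.create_security_group(
    GroupName="web", Description="allow HTTPS", VpcId=vpc_id
)
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)
```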
You can use the exact same approach in a private cloud, but you’d better start with VMware NSX licenses. A well-supported OpenStack distribution might work as well⁷, although I haven’t heard much about OpenStack in recent years. Is it still a thing?
Too expensive? It’s cheaper to let the infrastructure engineers suffer through VLAN nightmares and static route quagmires? PBR in the transport network? Seriously? It might be just me, but if I were one of those engineers, I would already have a highly polished resume ;)
In other words, don’t blame a concept if you have to work in a toxic environment where it’s impossible to apply it.
1. These days, you can meet Technical Alchemists or Systems Thinkers on LinkedIn, and some comments reflect their job titles. ↩︎
2. Not Oracle. I promise. It’s easy to make fun of their cloud due to the reputation they have in other parts of their business, but the cloud seems solid. Even Corey Quinn doesn’t hate it too much, which says a lot. ↩︎
3. Who could ever trust a virtual firewall? Someone could break into one of them, deploy a zero-day hypervisor exploit, modify the in-memory rules of other firewall instances running on the same hypervisor, and break into other customers’ networks… in Mission Impossible Part 6. ↩︎
4. Neither did another public cloud where the vendor expected you to sign a contract with them and raise a purchase order before you could provision the first service, but I’m digressing… ↩︎
5. Of course they’ll say their efforts “moved the discussion forward.” SDN evangelists used the same pathetic excuse when that bubble burst. ↩︎
6. Keeping in mind that most of the more popular talks from Google are nothing but thinly veiled recruitment drives (here’s another example). ↩︎
7. Or roll out your own version of OpenStack. I know people who needed a few years (and a few wasted millions) to realize that’s hard. ↩︎
OpenStack sure is still a thing. Check the numbers yourself and you’ll see continuous growth, and an active development lifecycle. How hard it is (or can be) to get a nice, scalable, rock-solid deployment is another topic, but one thing is clear: it’s not for everybody...
Oh man, this is pure gold even by your standards, Ivan.
At my last job, the mandate was to build a proper private cloud. For the physical network we designed a textbook L3 Clos that provided one IP address per server. No VLANs, etc. Fully automated through Ansible ;-) even though the config was completely static. Everything else would be handled by OpenStack.
Then we started talking to the OpenStack vendors, and every one of them required VLANs. And not for any good reason either; it looks like they just cargo-culted ideas from previous VMware best practices. So now the underlay has to have EVPN, and the config is 10x bigger and more fragile. And of course OpenStack has its own overlay, so there’s double encapsulation (I carefully configured MTUs everywhere so users could remain unaware).
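In case you’re wondering what those MTU tweaks boil down to: each VXLAN layer adds roughly 50 bytes of overhead, so with two layers of encapsulation the underlay MTU has to grow accordingly. A back-of-the-envelope sketch, assuming an IPv4 underlay and untagged inner frames (add 4 bytes per VLAN tag, and 20 more if the underlay runs over IPv6):

```python
# Rough MTU arithmetic for double VXLAN encapsulation. Standard header
# sizes only, not measured on any particular platform.
INNER_ETHERNET = 14   # Ethernet header of the frame being encapsulated
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4 = 20

PER_LAYER_OVERHEAD = INNER_ETHERNET + VXLAN_HEADER + UDP_HEADER + OUTER_IPV4  # 50 bytes

tenant_mtu = 1500   # what the VM (or the user) expects to be able to send
layers = 2          # fabric VXLAN/EVPN plus OpenStack's own overlay

required_underlay_mtu = tenant_mtu + layers * PER_LAYER_OVERHEAD
print(required_underlay_mtu)   # 1600, or just run jumbo frames and stop worrying
```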
> Then we started talking to the OpenStack vendors, and every one of them required VLANs. And not for any good reason either; it looks like they just cargo-culted ideas from previous VMware best practices.
No surprise there :(( but that makes no sense whatsoever -- I know an organization running OpenStack with VXLAN between loopback interfaces of Linux servers advertised over BGP with FRR. Works like a charm.
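If you want to picture that design: every server advertises its loopback /32 into the fabric with FRR, and the VXLAN tunnels terminate on those loopbacks. Here’s a minimal sketch of generating the relevant FRR snippet from a template; the ASN, addresses, and interface names are made up, and the details obviously depend on your fabric design.

```python
# Minimal sketch: render the FRR BGP configuration that advertises a
# server's loopback /32 into the fabric (BGP unnumbered towards the
# leaf switches). All values below are illustrative, not a recommendation.
from jinja2 import Template

FRR_TEMPLATE = Template("""\
router bgp {{ asn }}
 bgp router-id {{ loopback }}
 neighbor fabric peer-group
 neighbor fabric remote-as external
{% for intf in uplinks %} neighbor {{ intf }} interface peer-group fabric
{% endfor %} address-family ipv4 unicast
  network {{ loopback }}/32
 exit-address-family
""")

print(FRR_TEMPLATE.render(
    asn=65101,                 # private ASN assigned to this server
    loopback="10.0.0.11",      # VXLAN tunnels terminate on this address
    uplinks=["eth1", "eth2"],  # unnumbered BGP sessions towards the leaves
))
```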