DPU Hype Considered Harmful
The hype generated by the “VMware supports DPU offload” announcement already resulted in fascinating misunderstandings. Here’s what I got from a System Architect:
We are dealing with an interesting scenario where a customer had limited data center space, but applications demand more resources. We are evaluating whether we could offload ESXi processing to DPUs (Pensando) to use existing servers as bare-metal servers. Would it be a use case for DPU?
First of all, congratulations to whichever vendor marketer managed to put that guy in that state of mind. Well done, sir, well done. Now for a dose of reality.
Are DPUs Any Good?
After VMware launched DPU-based acceleration for VMware NSX, marketing-focused websites frantically started discussing the benefits of DPUs. Although I’ve been writing about SmartNICs and DPUs for years, it’s time for another closer look at the emperor’s clothes.
What Is a DPU
DPU (Data Processing Unit) is a fancier name for a network adapter formerly known as SmartNIC – a server repackaged into an interface card form factor. We had them for decades (anyone remembers iSCSI offload adapters?)
AWS Automatic EC2 Instance Recovery
On March 30th 2022, AWS announced automatic recovery of EC2 instances. Does that mean that AWS got feature-parity with VMware High Availability, or that VMware got it right from the very start? No and No.
Automatic Instance Recover Is Not High Availability
Reading the AWS documentation (as opposed to the feature announcement) quickly reveals a caveat or two. The automatic recovery is performed if an instance becomes impaired because of an underlying hardware failure or a problem that requires AWS involvement to repair.
Worth Reading: VMware Operations Guide
Iwan Rahabok’s open-source VMware Operations Guide is now also available in Markdown-on-GitHub format. Networking engineers support vSphere/NSX infrastructure might be particularly interested in the Network Metrics chapter.
Running BGP between Virtual Machines and Data Center Fabric
Got this question from one of my readers:
When adopting the BGP on the VM model (say, a Kubernetes worker node on top of vSphere or KVM or Openstack), how do you deal with VM migration to another host (same data center, of course) for maintenance purposes? Do you keep peering with the old ToR even after the migration, or do you use some BGP trickery to allow the VM to peer with whatever ToR it’s closest to?
Short answer: you don’t.
Kubernetes was designed in a way that made worker nodes expendable. The Kubernetes cluster (and all properly designed applications) should recover automatically after a worker node restart. From the purely academic perspective, there’s no reason to migrate VMs running Kubernetes.
MTU Settings in Virtual Network Devices
When I finally1 managed to get SR Linux running with netlab, I wanted to test how it interacts with Cumulus VX and FRR in an OSPF+BGP lab… and failed. Jeroen Van Bemmel quickly identified the culprit: MTU. Yeah, it’s always the MTU (or DNS, or BGP).
I never experienced a similar problem, so of course I had to identify the root cause:
Worth Reading: Xen on AWS Nitro NICs
If you find smart NICs interesting, you’ll like the latest blog post by James Hamilton explaining how AWS emulated Xen environment on Nitro hardware to keep old VM instances running on new hardware.
Circular Dependencies Considered Harmful
A while ago my friend Nicola Modena sent me another intriguing curveball:
Imagine a CTO who has invested millions in a super-secure data center and wants to consolidate all compute workloads. If you were asked to run a BGP Route Reflector as a VM in that environment, and would like to bring OSPF or ISIS to that box to enable BGP ORR, would you use a GRE tunnel to avoid a dedicated VLAN or boring other hosts with routing protocol hello messages?
While there might be good reasons for doing that, my first knee-jerk reaction was:
Implementing Layer-2 Networks in a Public Cloud
A few weeks ago I got an excited tweet from someone working at Oracle Cloud Infrastructure: they launched full-blown layer-2 virtual networks in their public cloud to support customers migrating existing enterprise spaghetti mess into the cloud.
Let’s skip the usual does everyone using the applications now have to pay for Oracle licenses and I wonder what the lock in might be when I migrate my workloads into an Oracle cloud jokes and focus on the technical aspects of what they claim they implemented. Here’s my immediate reaction (limited to the usual 280 characters, because that’s the absolute upper limit of consumable content these days):
Repost: VMware Fault Tolerance Woes
I always claimed that VMware Fault Tolerance makes no sense. After all, the only thing it does is protect a VM against a server hardware failure… in the world where software crashes are way more common, and fat fingers cause most of the outages.
But wait, it gets worse, the whole thing is incredibly complex – you might like this description Minh Ha left as a comment to my Fifty Shades of High Availability blog post.
Making LLDP Work with Linux Bridge
Last week I described how I configured PVLAN on a Linux bridge. After checking the desired partial connectivity with ios_ping I wanted to verify it with LLDP neighbors. Ansible ios_facts module collects LLDP neighbor information, and it should be really easy using those facts to check whether port isolation works as expected.
--- - name: Display LLDP neighbors on selected interface hosts: all gather_facts: true vars: target_interface: GigabitEthernet0/1 tasks: - name: Display neighbors gathered with ios_facts debug: var: ansible_net_neighbors[target_interface]
Alas, none of the routers saw any neighbors on the target interface.
Implement Private VLAN Functionality with Linux Bridge and Libvirt
I wanted to test routing protocol behavior (IS-IS in particular) on partially meshed multi-access layer-2 networks like private VLANs or Carrier Ethernet E-Tree service. I recently spent plenty of time creating a Vagrant/libvirt lab environment on my Intel NUC running Ubuntu 20.04, and I wanted to use that environment in my tests.
Challenge-of-the-day: How do you implement private VLAN functionality with Vagrant using libvirt plugin?
There might be interesting KVM/libvirt options I’ve missed, but so far I figured two ways of connecting Vagrant-controlled virtual machines in libvirt environment:
Vendor Marketectures in Real Life
Remember my rants about VMware and firewall vendors promoting crazy solutions that work best in PowerPoint and cause more headaches than anything else (excluding increased vendor margins and sales team bonuses, of course)?
Here’s another we-don’t-need-all-that-complexity real-life story coming from one of my long-term subscribers:
Are Business Needs Just Excuses for Vendor Shenanigans?
Every now and then I call someone’s baby ugly (or maybe it was their third cousin’s baby and they nonetheless feel offended). In such cases a common resort is to cite business or market needs to prove how ignorant and clueless I am. Here’s a sample LinkedIn comment talking about my ignorance about the need for smart NICs:
The rise of custom silicon by Presando [sic], Mellanox, Amazon, Intel and others confirms there is a real market need.
Now let’s get something straight: while there are good reasons to use tons of different things that might look inappropriate, irrelevant or plain stupid to an outsider, I don’t believe in real market need argument being used to justify anything without supporting technical facts (tell me why you need that stuff and prove to me that using it is the best way of solving a problem).
Disaster Recovery: a Vendor Marketing Tale
Several engineers formerly working for a large virtualization vendor were pretty upset with me when I claimed that the virtualization consultants promote “disaster recovery using stretched VLANs” designs instead of alternatives that would implement proper separation of failure domains.
Guess what… it’s even worse than I thought.
Here’s a sequence of comments I received after reposting one of my “disaster recovery doesn’t need stretched VLANs” blog posts on LinkedIn sometime in late 2019: