One of the usual complaints I hear whenever I mention overlay virtual networks is “with overlay networks we lose all application visibility and QoS functionality” ... that worked so phenomenally in the physical networks, right?
The wonderful QoS the physical hardware gives you
To put my ramblings into perspective, let’s start with what we do have today. Most hardware vendors give you basic DiffServ functionality: classification based on L2-4 information, DSCP or 802.1p (CoS) marking, policing and queuing. Shaping is rare. Traffic engineering is almost nonexistent (while some platforms support MPLS TE, I haven’t seen many people brave enough to deploy it in their data center network).
Usually a single vendor delivers an inconsistent set of QoS features that vary from platform to platform (based on the ASIC or merchant silicon used) or even from linecard to linecard (don’t even mention Catalyst 6500). Sometimes you need different commands or command syntax to configure QoS on different platforms from the same hardware vendor.
I don’t blame the vendors. Doing QoS at gigabit speeds in a terabit fabric is hard. Really hard. Having thousands of hardware output queues per port or hardware-based shaping is expensive (why do you think we had to pay an arm and a leg for ATM adapters?).
Do we need QoS?
Maybe not. Maybe it’s cheaper to build a leaf-and-spine fabric with more bandwidth than your servers can consume. Learn from the global Internet - everyone talks about QoS, but the emperor is still naked.
How should QoS work?
The only realistic QoS technology that works at terabit speeds is DiffServ – packet classification is encoded in DSCP or CoS (802.1p bits). In an ideal world the applications (or host OS) set the DSCP bits based on their needs, and the network accepts (or rewrites) the DSCP settings and provides the differentiated queuing, shaping and dropping.
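To illustrate the “applications set the DSCP bits” part: on Linux, an application can mark its own traffic with a standard socket option. A minimal sketch (the choice of UDP and of DSCP EF is mine, just for illustration):

```python
import socket

# DSCP lives in the top 6 bits of the IP TOS byte; the bottom 2 bits are ECN.
# EF (Expedited Forwarding, DSCP 46) is the usual marking for
# latency-sensitive traffic such as voice.
DSCP_EF = 46
tos = DSCP_EF << 2                      # shift past the ECN bits -> 0xB8

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
# Every datagram sent on this socket now carries DSCP 46, and the network
# can queue it accordingly -- or rewrite it at the edge if it doesn't
# trust the host.
```

Windows and some other stacks restrict or ignore `IP_TOS`, which is one more reason the marking often ends up being done by the network edge instead.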
In reality, the classification is usually done on the ingress network device, because we prefer playing MacGyver instead of telling our customers (= applications) “what you mark is what you get”.
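Conceptually, what the ingress device does is simple: evaluate an ordered list of rules against L3/L4 packet fields and stamp the matching DSCP value. A toy first-match classifier (the rules and DSCP values below are made-up examples, not a real policy):

```python
# Ordered classification rules, evaluated first-match -- the same model
# as class-maps on an ingress switch: match on L4 fields, set DSCP.
RULES = [
    # (protocol, destination port, DSCP to set)
    ("udp", 5060, 46),   # hypothetical: SIP/voice traffic -> EF
    ("tcp", 443,  26),   # hypothetical: HTTPS -> AF31
]
DEFAULT_DSCP = 0         # everything else stays best effort

def classify(protocol: str, dst_port: int) -> int:
    """Return the DSCP value an ingress classifier would stamp on a packet."""
    for proto, port, dscp in RULES:
        if protocol == proto and dst_port == port:
            return dscp
    return DEFAULT_DSCP
```

For example, `classify("udp", 5060)` returns 46 and an unmatched flow like `classify("tcp", 8080)` falls through to 0. Real hardware does this in TCAM at line rate; the logic is the same.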
Finally, there are the poor souls that do QoS classification and marking in the network core because someone bought them edge switches that are too stupid to do it.
How much QoS do we get in the virtual switches?
Now let’s focus on the QoS functionality of the new network edge: the virtual switches. As in the physical world, there’s a full range of offerings, from minimalistic to pretty comprehensive:
- vDS in vSphere 5.1 has minimal QoS support: per-pool 802.1p marking and queuing;
- Nexus 1000V has a full suite of classification, marking, policing and queuing tools. It also copies inner DSCP and CoS values into the VXLAN (MAC-over-UDP) envelope;
- VMware NSX (the currently shipping NVP 3.1 release) uses a typical service provider model: you can define minimal (affecting queuing) and maximal (triggering policing) bandwidth per VM, accept or overwrite DSCP settings, and copy DSCP bits from virtual traffic into the transport envelopes;
- vDS in vSphere 5.5 has a full 5-tuple classifier and CoS/DSCP marking (here's how it works);
- We’ll see what NSX for vSphere delivers when it ships ;)
In my opinion, the features of Nexus 1000V, VMware NSX or (hopefully) vSphere 5.5 vDS are pretty much par for the course, and both VMware NSX and Nexus 1000V give you DSCP-based classification of overlay traffic.
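The “copy DSCP into the transport envelope” behavior mentioned above boils down to one step at encapsulation time: take the DSCP bits from the inner (tenant) IP header and stamp them on the outer (underlay) IP header, so physical switches can classify the overlay traffic without looking inside the tunnel. A bare-bones sketch (a hypothetical minimal header type, not a real VXLAN implementation):

```python
from dataclasses import dataclass

@dataclass
class IPHeader:
    # hypothetical minimal IP header: only the TOS byte matters here
    tos: int = 0         # DSCP in the top 6 bits, ECN in the bottom 2

def encapsulate(inner: IPHeader) -> IPHeader:
    """Build the outer (underlay) IP header for the tunnel packet,
    copying the inner DSCP bits so the physical network can apply
    DSCP-based queuing to the encapsulated traffic."""
    inner_dscp = inner.tos >> 2
    return IPHeader(tos=inner_dscp << 2)   # preserve DSCP, reset ECN
```

An inner packet marked EF (`IPHeader(tos=0xB8)`, DSCP 46) thus produces an outer header with DSCP 46 as well, and a ToR switch needs nothing more than its usual DSCP classifier to treat the tunneled voice traffic properly.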
It is true that you won’t be able to do per-TCP-port classification and marking of overlay virtual traffic in your ToR switch any time soon (but I’m positive there are at least a few vendors working on it).
It’s also true that someone will have to configure classification and marking on the new network edge (in virtual switches) using a different toolset, but if that’s an insurmountable problem, you might want to start looking for a new job anyway.