Whenever I get asked about QoS in the data center, my stock reply is “bandwidth is cheaper than QoS-induced complexity.” That is true in most cases, and elephant-flow problems should ideally be solved higher up in the application stack, not with network-layer kludges. Still, are there situations where you actually need data center QoS?
Congestion detection and TCP ECN marking might be a good use case, and can be done with minimal interface configuration: all it takes is a few configuration lines on Arista EOS or Cisco NX-OS. Data Center TCP (DCTCP) uses ECN markings to detect congestion and reduce its transmission rate before packets get dropped. Packet drops can cause significant performance degradation because they kick the NICs out of TCP offload mode.
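To make the DCTCP behavior a bit more concrete, here is a minimal Python sketch of the sender-side reaction described in RFC 8257: the sender keeps a running estimate (`alpha`) of the fraction of bytes that were ECN-marked, and shrinks its congestion window in proportion to that estimate. The function names and the `min_cwnd` floor are my own illustrative choices, and a real implementation lives in the host TCP stack and handles many more details.

```python
# Simplified sketch of the DCTCP sender reaction (after RFC 8257).
# Function names and the min_cwnd floor are hypothetical, for illustration only.

def update_alpha(alpha: float, marked_bytes: int, acked_bytes: int,
                 g: float = 1 / 16) -> float:
    """Update the congestion estimate after one observation window.

    alpha is an exponentially weighted moving average of the fraction
    of acknowledged bytes that carried an ECN congestion mark.
    """
    if acked_bytes <= 0:
        return alpha  # nothing acknowledged, keep the old estimate
    marked_fraction = marked_bytes / acked_bytes
    return (1 - g) * alpha + g * marked_fraction


def reduce_cwnd(cwnd: float, alpha: float, min_cwnd: float = 2.0) -> float:
    """Shrink the congestion window in proportion to the congestion estimate."""
    return max(cwnd * (1 - alpha / 2), min_cwnd)


if __name__ == "__main__":
    alpha = 0.0
    cwnd = 100.0  # in segments
    # Assume 10% of the bytes were marked in each of five consecutive windows.
    for _ in range(5):
        alpha = update_alpha(alpha, marked_bytes=100, acked_bytes=1000)
        cwnd = reduce_cwnd(cwnd, alpha)
        print(f"alpha={alpha:.4f} cwnd={cwnd:.1f}")
```

With light marking, `alpha` stays small and the window shrinks only slightly. When every byte is marked, `alpha` approaches 1 and the reduction approaches the halving you get from a regular TCP loss response.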
There might be cases where you need QoS to reduce latency, but I don’t think VoIP qualifies. At 10 Gbps, a link drains 1.25 MB per millisecond, so you need more than 1 MB of packets sitting in the output queue to generate an additional millisecond of latency.
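The queue-depth arithmetic is easy to check with a few lines of Python. This is just a back-of-the-envelope sketch: the function names are mine, and it ignores framing overhead (preamble, inter-frame gap) and assumes the queue drains at full line rate.

```python
# Back-of-the-envelope queuing-delay arithmetic.
# Assumes the output queue drains at full line rate and ignores framing overhead.

def queued_bytes_for_delay(link_bps: float, delay_s: float) -> float:
    """Bytes that must sit in the output queue to add delay_s of latency."""
    return link_bps * delay_s / 8


def queuing_delay_s(link_bps: float, queued_bytes: float) -> float:
    """Extra latency (in seconds) caused by queued_bytes waiting ahead of a packet."""
    return queued_bytes * 8 / link_bps


if __name__ == "__main__":
    ten_gig = 10e9  # 10 Gbps

    needed = queued_bytes_for_delay(ten_gig, 1e-3)
    print(f"Queue depth for 1 ms at 10 Gbps: {needed / 1e6:.2f} MB")

    one_frame = queuing_delay_s(ten_gig, 1500)
    print(f"Delay behind one 1500-byte frame: {one_frame * 1e6:.2f} microseconds")
```

A single 1500-byte frame ahead of a voice packet adds about 1.2 microseconds at 10 Gbps, which is why it takes a queue on the order of a megabyte before the delay becomes noticeable.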
Finally, if you’re forced to implement queuing (for example, to reduce the impact of elephant flows), insist on behaving like a service provider and keep the network configuration as clean as possible: police ingress traffic if needed, and queue packets based on DSCP or 802.1p markings. Application-aware processing (hopefully resulting in DSCP marking) belongs in the hypervisors or the end hosts, not in the ToR switch.
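To illustrate how little the switch needs to know when it queues purely on markings, here is a small Python sketch that splits the IP ToS / Traffic Class byte into its DSCP and ECN fields and picks one of eight queues from the top three DSCP bits (the class selector). The eight-queue mapping is an assumption made for illustration, not the default of any particular platform, and the function names are hypothetical.

```python
# Sketch: classify on markings only, with no application awareness in the switch.
# The "top three DSCP bits select one of eight queues" mapping is an
# illustrative assumption, not a specific vendor default.

def split_tos(tos: int) -> tuple[int, int]:
    """Split the 8-bit ToS / Traffic Class byte into (DSCP, ECN)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS byte must be in the range 0..255")
    return tos >> 2, tos & 0b11


def queue_for_dscp(dscp: int) -> int:
    """Map a 6-bit DSCP value to one of eight queues using its class-selector bits."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp >> 3


if __name__ == "__main__":
    for tos in (0x00, 0x28, 0xB8, 0xBB):
        dscp, ecn = split_tos(tos)
        print(f"ToS=0x{tos:02X} DSCP={dscp} ECN={ecn:02b} queue={queue_for_dscp(dscp)}")
```

In this sketch, EF-marked traffic (DSCP 46) lands in queue 5 and best-effort traffic (DSCP 0) in queue 0. The ECN bits ride along in the same byte, so the marking done by the host or hypervisor is all the switch has to look at.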
Anything else? Share your thoughts in the comments.