… updated on Thursday, November 19, 2020 12:17 UTC
iSCSI with PFC?
Nicolas Vermandé sent me a really interesting question: “I've been looking for answers to a simple question that even different people at Cisco don't seem to agree on: Is it a good idea to class IP traffic (iSCSI or NFS over TCP) in pause no-drop class? What is the impact of having both pauses and TCP sliding windows at the same time?”
Let’s rephrase the question using the terminology Fred Baker used in his Bufferbloat Masterclass: does it make sense to use lossless transport for elephant flows or is it better to drop packets and let TCP cope with packet loss?
It’s definitely not bad to randomly drop an occasional TCP packet of a mouse session – if you have thousands of TCP sessions on the same link and drop a single packet of one or two sessions to slow them down, the overall throughput won’t be affected too much ... and if you randomly hit different sessions at different times, you’re pretty close to effective management of a mice aggregate.
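The mice arithmetic can be sketched with a back-of-the-envelope model (my simplification, not anything from the sources above): assume idealized AIMD flows with equal fair shares, where a single drop briefly halves one flow's rate.

```python
# Toy estimate of how little a single drop hurts an aggregate of TCP mice.
# Assumes idealized AIMD flows with equal shares -- a gross simplification,
# but enough to show why random drops barely dent a large aggregate.

def aggregate_dip(num_flows: int, flows_hit: int = 1) -> float:
    """Worst-case fractional throughput dip right after the drops."""
    # Each hit flow briefly runs at half its fair share (multiplicative
    # decrease); the remaining flows are untouched.
    return (0.5 * flows_hit) / num_flows

# With 1000 sessions on the link, dropping one packet of two sessions
# costs roughly 0.1% of aggregate throughput at the worst moment.
print(f"{aggregate_dip(1000, flows_hit=2):.4f}")  # → 0.0010
```

The same formula explains the elephant problem: with only a handful of flows (say, one iSCSI session per vSphere host), `num_flows` is tiny and a single drop takes a visible bite out of the aggregate.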
Elephants are different because they are rare and important (see also Storage Networking is Different and Does Dedicated iSCSI Infrastructure Make Sense?) – dropping a single packet of an elephant iSCSI session could affect thousands of end-user sessions (because the overall disk throughput would go down), more so if you’re using iSCSI to access VMware VMFS volumes (where a single iSCSI session carries the data of all VMs running on the vSphere host).
Classifying iSCSI as a lossless traffic class thus seems to make a lot of sense (but see below), and comparing the results of QoS policing (= dropping) versus shaping (= delaying) on a small number of TCP sessions supports the same conclusion.
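The policing-versus-shaping difference is easy to see in a toy token-bucket model (entirely made-up numbers, not any vendor's implementation): both enforce the same rate, but a policer turns excess traffic into loss while a shaper turns it into extra delay.

```python
# Toy token bucket: police() drops the excess (TCP sees packet loss),
# shape() queues it (TCP sees a longer RTT). Input is a sorted list of
# (timestamp, size) pairs; rate is in size-units per time-unit.

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst, self.tokens, self.t = rate, burst, burst, 0.0

    def refill(self, now: float):
        self.tokens = min(self.burst, self.tokens + (now - self.t) * self.rate)
        self.t = now

def police(packets, rate, burst):
    """Drop packets that exceed the contract (what TCP sees as loss)."""
    tb, passed = TokenBucket(rate, burst), []
    for now, size in packets:
        tb.refill(now)
        if tb.tokens >= size:
            tb.tokens -= size
            passed.append((now, size))
    return passed                     # excess traffic is simply gone

def shape(packets, rate, burst):
    """Delay packets until tokens are available (what TCP sees as RTT)."""
    tb, out = TokenBucket(rate, burst), []
    for now, size in packets:
        now = max(now, tb.t)          # can't send before the previous packet
        tb.refill(now)
        if tb.tokens < size:          # wait for enough tokens, send late
            now += (size - tb.tokens) / tb.rate
            tb.refill(now)
        tb.tokens -= size
        out.append((now, size))
    return out                        # nothing dropped, some packets delayed
```

Feed the same back-to-back burst through both and the policer passes only what fits the bucket, while the shaper delivers everything with the surplus pushed later in time.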
After Almost a Decade
I never found an authoritative answer to this question, and if you ask a dozen experts you’ll get two dozen answers, in particular if you add the does that mean we need large buffers in data center switches question to the mix. In any case, you might find these resources helpful:
- J Metz published fantastic napkin dialogues diving deep into iSCSI details;
- I discussed the “should we delay or drop” question with Thomas Graf and Juho Snellman, and the conclusions were that drops are not bad assuming you have a decent TCP stack implementation;
- Google introduced a totally new TCP congestion control (BBR) in 2016;
- JR Rivers addressed the question of data center buffering in an excellent (and free) Networks, Buffers and Drops webinar.
More information
Storage networks, iSCSI, FCoE and Data Center Bridging (including PFC) are described in the Data Center 3.0 for Networking Engineers webinar.
1) if your network is dedicated to iSCSI, then by all means use delay-based congestion control mechanisms and lossless congestion notification like ECN.
2) if it isn't, you are in a world of hurt competing with other flavors of TCP.
3) I'd be very interested in a similar plot of an iSCSI network with fq_codel enabled, as well as a measurement of latency on your disk storage nodes under these loads with both the lossless and the fq_codel-based approach. While throughput is important, high latencies cause other problems.
#2 - PFC + ETS giving iSCSI traffic a guaranteed share of the bandwidth (for example, 3 Gbps on a 10 Gbps link) is pretty much equivalent to #1.
#3 - Me too ;)
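The ETS arithmetic behind that reply can be sketched in a few lines (class names and weights are illustrative, not any real configuration): ETS is work-conserving, so a class's weight is a guarantee, not a cap — idle classes' shares get redistributed among the active ones in proportion to their weights.

```python
# Sketch of ETS bandwidth-share arithmetic (illustrative weights only).
# Each active class gets link bandwidth proportional to its weight;
# weights of idle classes are redistributed among the active ones.

def ets_shares(link_gbps, weights, active):
    """Per-class bandwidth (Gbps) for the currently active classes."""
    total = sum(w for cls, w in weights.items() if cls in active)
    return {cls: link_gbps * w / total
            for cls, w in weights.items() if cls in active}

weights = {"iscsi": 30, "vmotion": 20, "other": 50}   # percent-style weights

# All classes busy: iSCSI gets its guaranteed 3 Gbps of the 10 Gbps link.
print(ets_shares(10, weights, {"iscsi", "vmotion", "other"}))
# Only iSCSI busy: it can burst to the full 10 Gbps (guarantee, not cap).
print(ets_shares(10, weights, {"iscsi"}))
```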
I have had some discussions at Cisco Live London around this topic after attending the "Mastering Data Center QoS" session (https://www.ciscolive365.com/connect/sessionDetail.ww?SESSION_ID=6028).
Another factor that came into play was whether the switched network was just a single hop or multi-hop between the servers and the storage.
Enabling PFC on a multi-hop network could introduce head-of-line blocking on the inter-switch links. To stick with your analogy: If a single elephant is slowed down, then other elephants might be prevented from crossing the bridge between the switches.
In the end the recommendation that I got was not to enable PFC on multi-hop iSCSI networks, because the harm done by head-of-line blocking could outweigh the benefit of using PFC. One of the participants in the discussion actually claimed he had seen a significant performance improvement after disabling flow control on their iSCSI network. Unfortunately I haven't seen hard evidence of this theory.
I am interested to hear your opinion on this. Could the answer to this question be different for single-hop vs multi-hop DCB-enabled networks?
So far, all I've heard are theories. Hard facts would be nice, but I haven't found them yet.
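The head-of-line-blocking concern raised in the comment above can be captured in a deliberately oversimplified toy model (no real switch behaves this simply): a PFC pause stops the entire no-drop class on the inter-switch link, so an elephant headed for an idle port stalls along with the one headed for the congested port.

```python
# Toy model of PFC head-of-line blocking on an inter-switch link.
# A pause frame halts a whole priority class, not individual flows,
# so flows bound for uncongested ports become innocent victims.

def deliverable(flows, paused_classes):
    """Return the flows that can cross the ISL this instant."""
    return [f for f in flows if f["cls"] not in paused_classes]

flows = [
    {"name": "iscsi-to-busy-port", "cls": "no-drop"},
    {"name": "iscsi-to-idle-port", "cls": "no-drop"},     # innocent victim
    {"name": "web-traffic",        "cls": "best-effort"},
]

# One congested egress port triggers a pause for the whole no-drop class:
still_flowing = deliverable(flows, paused_classes={"no-drop"})
print([f["name"] for f in still_flowing])  # → ['web-traffic']
```

In other words: the very mechanism that protects the elephant destined for the congested port also freezes unrelated elephants, which is why the single-hop vs. multi-hop distinction matters.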
Tweaks in recovery algorithms like Proportional Rate Reduction for TCP (RFC 6937) could have a big impact.
Also, the test with four equal flows might not be as realistic as you'd want it to be.