Introduction to 802.1Qbb (Priority-based Flow Control — PFC)

Yesterday I wrote that you don’t need DCB technologies to implement FCoE in your network. The FC-BB-5 standard is quite explicit (it also says that 802.1Qbb is the other option):

"Lossless Ethernet may be implemented through the use of some Ethernet extensions. A possible Ethernet extension to implement Lossless Ethernet is the PAUSE mechanism defined in IEEE 802.3-2008."

The PAUSE mechanism (802.3x) gives you lossless behavior, but results in undesired side effects when you run LAN and SAN traffic across a converged Ethernet infrastructure.

Traffic blocking with the PAUSE mechanism

The PAUSE mechanism is part of the Ethernet (802.3) standard and allows a receiver on a point-to-point Ethernet link to stop the adjacent sender, thereby preventing buffer overflow and packet loss.
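To make the mechanism a bit more tangible, here is a minimal Python sketch that builds an 802.3x PAUSE frame. The destination address, EtherType and opcode are the well-known MAC Control values; the function name and the decision to return the frame without the FCS are my assumptions for illustration, not anything a real NIC driver exposes.

```python
import struct

PAUSE_DST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
MIN_FRAME_NO_FCS = 60  # minimum Ethernet frame size without the 4-byte FCS


def build_pause_frame(src_mac: bytes, pause_quanta: int) -> bytes:
    """Return an 802.3x PAUSE frame (without FCS).

    pause_quanta is the requested pause time in units of 512 bit times;
    a value of 0 tells the sender it may resume immediately.
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= pause_quanta <= 0xFFFF:
        raise ValueError("pause_quanta must fit in 16 bits")
    header = PAUSE_DST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH", PAUSE_OPCODE, pause_quanta)
    frame = header + payload
    # Pad with zeros up to the minimum frame size.
    return frame + bytes(MIN_FRAME_NO_FCS - len(frame))
```

Note that the frame carries a single timer and no notion of traffic class: whoever receives it stops sending everything on that link.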

Imagine a simple FCoE network with a server, a storage array and a switch, with the server sending large amounts of data to the storage array.

When the server overloads the storage array with the data it’s sending, the storage array sends a PAUSE frame back to the switch.

The switch stops sending data to the storage array after receiving the PAUSE frame, and data sent by the server starts to accumulate in the switch’s internal buffers until the switch has to tell the server to pause.

At that moment, the server’s Ethernet interface is effectively blocked, which is not a problem if you have a dedicated FCoE infrastructure. The same result is unacceptable in a converged infrastructure, where FCoE and LAN traffic share the same links.
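The hop-by-hop nature of this backpressure can be sketched with a toy model. Everything in it is an assumption made for illustration: the server sends one frame per tick, the switch forwards at the same rate until the array pauses it, and the switch pauses the server once a (hypothetical) buffer threshold is reached.

```python
def simulate_backpressure(ticks, array_pauses_at, switch_buffer_frames):
    """Return the tick at which the switch has to pause the server, or None.

    Toy model: one frame arrives from the server every tick; the switch
    forwards it immediately unless the storage array has paused the switch.
    """
    buffered = 0
    for tick in range(ticks):
        array_paused = tick >= array_pauses_at
        buffered += 1          # frame received from the server
        if not array_paused:
            buffered -= 1      # frame forwarded to the storage array
        if buffered >= switch_buffer_frames:
            return tick        # buffer full: switch sends PAUSE to the server
    return None


# The array pauses the switch at tick 5; a 10-frame buffer fills up at tick 14.
print(simulate_backpressure(ticks=100, array_pauses_at=5, switch_buffer_frames=10))
```

The pause reaches the server only after the switch buffer has filled, and from that point on the server cannot send anything on that link, storage or not.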

Traffic blocking with Priority Flow Control (802.1Qbb)

802.1Qbb is a simple extension of the 802.3x mechanism: the PAUSE frame contains an 8-bit bit mask of 802.1p priorities (specifying which traffic classes should be paused) and a timer for each priority specifying how long the traffic in that priority class should be paused. The per-priority PAUSE mechanism allows the storage array to tell the switch it should stop sending just the FCoE traffic (assuming FCoE traffic is marked with priority value=3).

Likewise, the switch can tell the server to stop sending FCoE traffic and the LAN traffic is not impacted.
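Here is a minimal Python sketch of a per-priority PAUSE frame as I understand the format: the same MAC Control destination address and EtherType as 802.3x, opcode 0x0101, a 16-bit class-enable field whose low 8 bits form the priority bit mask, and eight 16-bit timers. The function name and its dictionary argument are made up for this example; check the IEEE text before relying on the layout.

```python
import struct

MAC_CONTROL_DST = bytes.fromhex("0180c2000001")
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101
MIN_FRAME_NO_FCS = 60  # minimum Ethernet frame size without the 4-byte FCS


def build_pfc_frame(src_mac: bytes, quanta_by_priority: dict) -> bytes:
    """Return a PFC frame (without FCS) pausing the given priorities.

    quanta_by_priority maps an 802.1p priority (0-7) to a pause time
    expressed in quanta of 512 bit times.
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    enable_vector = 0
    timers = [0] * 8
    for priority, quanta in quanta_by_priority.items():
        if not 0 <= priority <= 7:
            raise ValueError("priority must be in 0..7")
        if not 0 <= quanta <= 0xFFFF:
            raise ValueError("pause time must fit in 16 bits")
        enable_vector |= 1 << priority   # one bit per priority
        timers[priority] = quanta        # one timer per priority
    header = MAC_CONTROL_DST + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *timers)
    frame = header + payload
    return frame + bytes(MIN_FRAME_NO_FCS - len(frame))


# Pause only priority 3 (FCoE in this article) for the maximum time.
frame = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})
print(format(frame[17], "08b"))  # low byte of the enable vector: 00001000
```

Priority 3 sets bit 3 of the mask, so all other priorities (including the LAN traffic) keep flowing.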

It’s also possible (at least in theory) to combine 802.3x and 802.1Qbb. For example, the storage array could use the 802.3x PAUSE mechanism to slow down the switch, whereas the switch (after noticing its priority-3 queues are filling up) could use an 802.1Qbb PAUSE frame to tell the server to stop sending FCoE traffic.

More details

The PFC mechanism can quickly result in head-of-line blocking and extended congestion and is thus applicable only to small bridged domains. It should be combined with congestion notification/avoidance mechanisms (for example, 802.1Qau) in larger domains.
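A toy illustration of why head-of-line blocking appears: PFC pauses a whole priority class, not individual flows. In the sketch below (the frame fields and destination names are hypothetical), array-A is the congested device, yet the frame for array-B is held as well because it shares priority 3.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    priority: int
    destination: str


def split_by_pause(frames, paused_priorities):
    """Return (sent, held) for a sender honoring per-priority pause."""
    sent = [f for f in frames if f.priority not in paused_priorities]
    held = [f for f in frames if f.priority in paused_priorities]
    return sent, held


frames = [
    Frame(3, "array-A"),   # destination that caused the congestion
    Frame(3, "array-B"),   # innocent bystander in the same priority class
    Frame(0, "lan-host"),  # LAN traffic in a different priority class
]
sent, held = split_by_pause(frames, paused_priorities={3})
print(sent)  # only the LAN frame gets through
print(held)  # both storage frames wait, including the one for array-B
```

The more hops and flows share a paused class, the more bystanders get caught, which is where end-to-end congestion notification comes in.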

PFC was designed for use on point-to-point links (it does not work in PON environments) and cannot be used together with 802.3x on the same link (two competing PAUSE mechanisms on the same link make no sense). It needs the DCBX standard to negotiate the parameters between adjacent nodes, including the number of traffic classes that can support PFC and the priorities for which PFC should be enabled. A standard-compliant implementation of 802.1Qbb thus requires support for DCBX as well.

The timings are quite strict (the sender should stop sending in ~600 nanoseconds), making a hardware implementation the only viable option.
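Some back-of-the-envelope arithmetic puts these numbers in perspective. The sketch below assumes a 10 Gbps link (the text above doesn't state the speed) and uses the pause quantum of 512 bit times; the helper names are mine.

```python
PAUSE_QUANTUM_BITS = 512   # one pause quantum = 512 bit times
MAX_QUANTA = 0xFFFF        # largest value of a 16-bit pause timer


def quantum_ns(link_bps: float) -> float:
    """Duration of one pause quantum in nanoseconds."""
    return PAUSE_QUANTUM_BITS * 1e9 / link_bps


def max_pause_ms(link_bps: float) -> float:
    """Longest pause a single frame can request, in milliseconds."""
    return MAX_QUANTA * quantum_ns(link_bps) / 1e6


def bytes_in_flight(link_bps: float, reaction_ns: float) -> float:
    """Bytes a sender can still transmit while it reacts to a pause."""
    return link_bps * reaction_ns / 1e9 / 8


TEN_GBPS = 10e9  # assumed link speed
print(quantum_ns(TEN_GBPS))
print(max_pause_ms(TEN_GBPS))
print(bytes_in_flight(TEN_GBPS, 600))
```

At 10 Gbps a quantum works out to 51.2 ns, the longest single pause to roughly 3.36 ms, and a 600 ns reaction time to 750 bytes still leaving the sender. The receiver has to keep buffer headroom for that data plus whatever is already on the wire.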

Pre-standard implementations (speculative)

As the 802.1Qbb addendum hasn’t been ratified yet, all current PFC implementations are by definition pre-standard. However, the format of the PAUSE message hasn’t changed from the very early drafts, indicating that the existing hardware implementations will probably need just a software upgrade to support potential late changes to the DCBX protocol.

Need more?

You’ll get an overview of DCB, FCoE and numerous other Data Center technologies in my Data Center 3.0 for Networking Engineers webinar (buy a recording or yearly subscription).

The 802.1Qbb page has links to numerous presentations.

The Priority Flow Control: Build Reliable Layer 2 Infrastructure white paper from Cisco has a great in-depth description of 802.1Qbb and planning recommendations for Nexus 5000 and Cisco’s CNA.

20 comments:

  1. As I understand it, this function is not implemented on L3 switches (like the c3560)?
  2. PFC has nothing to do with L2 or L3 switching. It's a per-hop function and can be easily (and probably usefully) used on L3 links.

    However, it does require hardware support and will thus probably never be available on older switches (or will require new linecards on modular boxes).

    At the moment, PFC is considered to be a Data Center functionality (although, as said above, could be useful in other scenarios as well), so don't expect to see it elsewhere any time soon.
  3. Sad (:
    It would be great to see something like PFC for VLANs or L2/L3 addresses.
  4. Probably worth noting that PAUSE is how FC controls senders today so would work on Ethernet, but not for converged systems where you might want to back off the data so that storage can enjoy a higher level of service.
  5. Actually you have to back off storage so it doesn't get lost. More about that when I cover 802.1Qau/az.
  6. In a converged iSCSI environment where my server adapter supports 802.1Qbb and is connected to a ToR switch that supports and is configured for PFC, is there any need for specific 802.1Qbb support on the array side? Or is it enough to receive a classic pause frame from the array?

    One more Q: if the same server is using storage that is not directly connected, does the frame properly traverse multiple switches back to the sender? I see lots of documentation mentioning the medium type/length limitations, but not much about the case of multiple switches.

    Lastly, thanks for the article! (I know I'm reading this a few years after the fact, but glad I found it).
    Replies
    1. Classic pause frames on the storage side SHOULD be enough, but do check with the switch vendor(s). There might be a slight gap between theory and practice ;)

      PFC is a hop-by-hop mechanism and thus works equally well for directly-connected nodes as for multi-switch environments (note: some people say you might get into unpleasant HoL blocking scenarios without QCN in very large environments).
    2. Thanks Ivan!

      I believe I'm at a satisfactory place with PFC. I've been trying to track down everything out there on the topic between the time I posted the comment and now.

      Found a blog reference stating that multiple storage vendors recommend using 802.3x mode desired on the switch side (Flow control RX only). Really couldn't find much even from the storage vendors themselves on the topic. But it sounds reasonable enough that I should only need to be able to process inbound PAUSE frames.

      I'm investigating QCN now as my logical next step. Not sure if a leaf spine layout where the switch 'hops' is an architecture that would benefit from QCN, but it almost seems impossible that it would even be implemented where there are more switch hops. I'm in a place where even my vendor documentation doesn't seem reliable regarding QCN (one white paper states support for QCN, another white paper for the same switch only mentions ECN). Hehe - it seems for every answer I find I come up with a few more questions. Especially around converged iSCSI implementations. Lots on FCoE but not many are willing to give authoritative recommendations on converged iSCSI.

      I guess I really have to bite the bullet and start reading RFC3720 to my kids at bed time.

      Sorry for the rant - thanks for the response.

      -Gabe
    3. Don't trust white papers. Ever.

      If a feature is not described in the product configuration guide, it's not there ... and if the vendor doesn't publish product documentation online, run away as fast as you can.

      BTW, RFC 3720 won't help you. It deals with the stuff above TCP and assumes the network gnomes do their magic.
    4. Hahaha : -)

      Advice taken. Just to report back in: my sort-of in-depth exploration of DCB and iSCSI pretty much ended with a senior colleague (who happened to be deeply involved with DCB in the lab during the FCoE push 2 years ago) explaining to me that while PFC sounds hands down better than 802.3x FC, the big benefit comes from deploying it alongside ETS... and ETS along with PFC need to be negotiated end to end [Host <-> Switch <-> Storage] and well... PFC might as well be non-existent from the storage perspective. In any case your articles have been beyond educational.

      Thanks for all the great feedback. You can be sure I'm closely following your new content via RSS.

      -Gabe.
  7. I am a bit confused when you say :

    "802.1Qbb is a simple extension of the 802.3x mechanism: the PAUSE frame contains a 8-bit bit mask of 802.1p priorities (specifying which traffic classes should be paused) and a timer for each priority specifying how long the traffic in that priority class should be paused. The per-priority PAUSE mechanism allows the storage array to tell the switch it should stop sending just the FCoE traffic (assuming FCoE traffic is marked with priority value=3)."

    According to the Cisco and IEEE documents, there are only 8 possible CoS used within PFC so that would be 3 bits (802.1p)

    Can you clarify, Ivan? :)

    Thanks

    Nicolas
    Replies
    1. 3-bit 802.1p field = 8 values = 8-bit mask (one bit for each value indicating traffic in that class needs to be paused). Makes sense?
    2. Yeah but you said 8 bits :)
    3. Nicolas, a 3 bit field can specify only one of 8 classes at a time. The pause frame must be able to specify 1 of 8 classes, or 2 of 8 classes, or ... 7 of 8 classes at the same time. Therefore to specify any and all classes a particular pause frame applies to, you need one bit for each possible class, or 8 bits.

      Think of it this way: all combinations of 3 bits specify the names (numbers) of the 8 classes. The 8 bits specify, in order, which of the 8 classes the pause frame applies to, and that includes specifying any or all of them.
  8. Is it possible to use 802.1Qbb in an L2 overlay (VXLAN) environment?
    Can't find any info about that.
    Replies
    1. You can use it in the underlay (I wouldn't). It makes no sense to use it in the overlay (and you cannot) because the underlay doesn't provide lossless transport.
  9. Hi... can the pause frame go over L2TP? switch1-------tunnel-----destination. In this case, when switch1 generates a pause frame for the destination, can it pass over the tunnel?
    Replies
    1. Pause frames are useful primarily when you're trying to implement lossless transport, so it makes no sense to use them over lossy media (= tunnels).
  10. Hi Ivan,

    I have got the overall concept, but when I tried to implement it I got a few doubts:
    1) What should the PFC structure contain? Do we need 8 parameters (Uint), one for each class?
    struct {
    Uint priority;
    Uint class1;
    Uint class2;
    .
    .
    Uint class8;
    };
    2) In that frame, the class priority field is 2 octets; in these 2 octets, the first 2 bits are for enabling the class (on/off) and the remaining ones are for the classes (0-7). If I want to set a priority for class 5, what will the binary form be? Can you show me the full 2 octets in binary form?
    Replies
    1. PFC is an IEEE standard, so if you want to implement it, you might want to read the IEEE specs... at least that's how things were supposed to be working when I was still writing code.

      I'm positive you'll find all the relevant data structures there.