Packet Reordering and Service Providers

My “Was it bufferbloat?” blog post generated an unexpected number of responses, most of them focusing on a side note saying “it looks like there really are service providers out there that are clueless enough to reorder packets within a TCP session”. Let’s walk through them.

Why do you say that out-of-order packets are the SP's issue? The SP only provides IP connectivity and does not care about TCP sessions.

The out-of-order packets are definitely not SP’s issue. After all, they don’t care if I get reduced TCP throughput (see “Per-Packet Load Balancing Interferes with TCP Offload” and related comments for more details). They are, however, usually the ones causing the reordering.

Packet reordering is obviously a bad thing within a data center, but it's unclear how much (if any) damage is done by packet reordering on low-speed long-delay links. Links to reasonably authoritative material are highly appreciated.

I also thought that TCP segment reordering is a job of the Transport layer, which lives on the client and the server, and which the devices in the middle are not aware of.

TCP segment reassembly into a seamless stream is the job of the transport layer, but do keep in mind that the transport layer (TCP code in receiving host, to be more precise) can do its job better and faster if the packets are not reordered. As soon as the packets start arriving out of sequence, the TCP receiver doesn’t know whether they were reordered or lost in transit, and has to start considering recovery actions.
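To illustrate why the receiver can't tell the difference, here's a minimal sketch of cumulative ACK generation. It's a toy model, not a TCP implementation: segments carry small integer numbers instead of byte sequence numbers, there is no SACK, and the function names are made up for this example. The one real-world fact it relies on is that a standard TCP sender treats three duplicate ACKs as a loss signal and starts fast retransmit.

```python
def receiver_acks(arrival_order):
    """Return the cumulative ACK sent after each arriving segment.

    Segments are numbered 0, 1, 2, ...; an ACK value n means
    "I have everything below n, send me n next".
    """
    received = set()
    next_expected = 0
    acks = []
    for seq in arrival_order:
        received.add(seq)
        # Advance past every segment that is now contiguous.
        while next_expected in received:
            next_expected += 1
        acks.append(next_expected)
    return acks


def max_duplicate_acks(acks):
    """Longest run of repeated ACK values, not counting the first copy."""
    longest = run = 0
    for prev, cur in zip(acks, acks[1:]):
        run = run + 1 if cur == prev else 0
        longest = max(longest, run)
    return longest


in_order = receiver_acks([0, 1, 2, 3, 4, 5])
reordered = receiver_acks([0, 2, 3, 4, 1, 5])  # segment 1 arrives late
lost = receiver_acks([0, 2, 3, 4, 5])          # segment 1 never arrives

print(in_order)   # [1, 2, 3, 4, 5, 6]
print(reordered)  # [1, 1, 1, 1, 5, 6]
print(lost)       # [1, 1, 1, 1, 1]
```

The first four ACKs are identical in the reordered and the lost case, and the reordered stream already contains three duplicate ACKs, so the sender would retransmit a segment that was merely late.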

Different packets in the same session are not guaranteed to take the same path through the Internet.

Agreed, but that’s not nice and is best avoided if at all possible.

And if Apache sent all the data, the client should get all the data, before they process the closing of the TCP stream.

Precisely at this point the Reality decided to raise its unpredictable head and disagree.

Why Are TCP Packets Reordered?

One of the readers wrote:

Due to various path and link load-balancing methods, different packets can take a different path, and some might be queued differently along the path, or the path might have a different latency -- resulting in unpredictable ordering.

While he’s correct, his argument is slightly academic. Most networking equipment implements 5-tuple load balancing (or per-port load balancing) for a reason: it’s not safe to send the packets of the same session over multiple paths (particularly if they are UDP packets). Brocade is the only vendor that can do per-packet load balancing without reordering packets, and even they’re limited to parallel links that have to be terminated on the same ASIC on two adjacent boxes.

Per-packet load balancing (where you’d send packets of the same session over multiple paths with different latencies) is rarely implemented in practice; it would generate too many unnecessary support calls.
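The difference between the two approaches is easy to show with a toy model. The sketch below is an illustration under stated assumptions, not any vendor's algorithm: the two path latencies, the CRC32-based hash, and the helper names are all made up, and real devices use their own hash functions in hardware.

```python
import zlib

PATH_LATENCY_MS = [10, 25]  # hypothetical latencies of two parallel paths


def flow_hash_path(five_tuple, n_paths):
    """Pick a path from a hash of the 5-tuple: same flow, same path."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % n_paths


def arrival_order(packets, choose_path, gap_ms=1):
    """Send one packet every gap_ms; return sequence numbers in arrival order."""
    arrivals = []
    for i, (five_tuple, seq) in enumerate(packets):
        path = choose_path(i, five_tuple)
        arrivals.append((i * gap_ms + PATH_LATENCY_MS[path], seq))
    return [seq for _, seq in sorted(arrivals)]


# src IP, dst IP, protocol, src port, dst port (documentation addresses)
flow = ("192.0.2.1", "198.51.100.7", 6, 49152, 80)
packets = [(flow, seq) for seq in range(6)]

per_flow = arrival_order(packets, lambda i, ft: flow_hash_path(ft, 2))
per_packet = arrival_order(packets, lambda i, ft: i % 2)  # round-robin

print(per_flow)    # [0, 1, 2, 3, 4, 5]
print(per_packet)  # [0, 2, 4, 1, 3, 5]
```

With 5-tuple hashing every packet of the session sees the same latency, so the order is preserved regardless of which path the hash selects. Round-robin spraying over paths with different latencies interleaves the two halves of the stream.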

Some people quote routing protocol convergence as the cause of reordered packets. Yet again, they are correct from the academic perspective, but I don’t expect a service provider to change network topology (which would result in potential reordering) every few seconds.

The same reader did identify the probable cause for most packet reordering we see: packets from the same stream are queued differently, either due to different forwarding paths they take within a device or due to landing in different queues.

Why would two packets from the same session land in different queues? The only reason I can see is rate limiting with out-of-contract packet remarking (set different DSCP value on out-of-contract packets) followed by a queuing configuration that puts out-of-contract packets in a different queue.
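Here's a rough sketch of how that combination could reorder a single session. Everything in it is an assumption made for illustration: the token-bucket numbers, the strict-priority scheduler, and the idea that the whole burst is already queued behind a busy link when the scheduler gets to it.

```python
def police(n_packets, tokens_per_packet_time=0.5, bucket_depth=1.0):
    """Single-rate token-bucket policer that remarks instead of dropping.

    Packets arrive one per time unit and the bucket refills by
    tokens_per_packet_time between arrivals (hypothetical numbers).
    """
    tokens = bucket_depth
    marked = []
    for seq in range(n_packets):
        if tokens >= 1.0:
            tokens -= 1.0
            marked.append((seq, "in"))   # in-contract: keeps its DSCP value
        else:
            marked.append((seq, "out"))  # out-of-contract: remarked
        tokens = min(bucket_depth, tokens + tokens_per_packet_time)
    return marked


def drain(marked, separate_queues):
    """Return the departure order of a burst queued behind a busy link."""
    if not separate_queues:
        return [seq for seq, _ in marked]  # a single FIFO keeps the order
    in_queue = [seq for seq, mark in marked if mark == "in"]
    out_queue = [seq for seq, mark in marked if mark == "out"]
    # Assumption: the in-contract queue is served first (strict priority).
    return in_queue + out_queue


marked = police(6)
print(marked)
# [(0, 'in'), (1, 'out'), (2, 'in'), (3, 'out'), (4, 'in'), (5, 'out')]
print(drain(marked, separate_queues=True))   # [0, 2, 4, 1, 3, 5]
print(drain(marked, separate_queues=False))  # [0, 1, 2, 3, 4, 5]
```

A session sending at twice the contracted rate gets every other packet remarked, and the two queues then deliver the in-contract half ahead of the out-of-contract half.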

How can I fix that? Don’t use different queues; use WRED with different weights ensuring the out-of-contract packets get dropped way before you start dropping in-contract packets.
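The WRED idea can be sketched with the classic RED drop curve: no drops below a minimum threshold, a linear ramp up to a maximum drop probability, and tail drop above the maximum threshold. The thresholds and probabilities below are invented for the example, and giving each class its own threshold pair is just one way to express the "different weights"; check your platform's documentation for the knobs it actually offers.

```python
def wred_drop_probability(avg_queue, min_th, max_th, max_p):
    """Drop probability for one WRED profile at a given average queue depth."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# Hypothetical profiles (queue depth in packets). Both classes share ONE
# queue, so packets of a session stay in order; only the drop curve differs.
PROFILES = {
    "in": dict(min_th=30, max_th=40, max_p=0.1),
    "out": dict(min_th=10, max_th=20, max_p=0.1),
}

for depth in (5, 15, 25, 35):
    p_in = wred_drop_probability(depth, **PROFILES["in"])
    p_out = wred_drop_probability(depth, **PROFILES["out"])
    print(f"avg depth {depth:2d}: in-contract {p_in:.2f}, out-of-contract {p_out:.2f}")
```

With these numbers the out-of-contract packets are dropped with certainty at an average depth of 20, well before the first in-contract packet is dropped at 30.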


  1. If you used Carrier Ethernet instead of IP links you wouldn't be writing this!
    1. The next time you find a working Carrier Ethernet solution between my web site and all its visitors let me know ;)

      On a more serious note, if you do packet marking on the edge of a Carrier Ethernet network (based on in-contract/out-of-contract decision) and combine that with any form of differentiated queuing within the network you might get the same results.
  2. I've run into some ISPs in APAC doing per-packet load-balancing on their outbound traffic. After I explained to them the various reasons this is a Bad Idea, they switched over to saner traffic engineering methodologies.
  3. Some content providers have moved (some are moving) toward TCP anycast (just like CDNs), and as far as I can tell per-packet load balancing as well as BGP multipath with AS-path relax are the enemies of both TCP and chatty UDP anycast (a.k.a. DNSSEC) over the Internet. Please don't do that!
  4. Flowlets allow you to change the switching path, e.g. when you want to reduce the load on one path and move traffic to another one:
    It's an interesting idea: you can change the path of a flow once the gap between its bursts is longer than the trip time to the merging point (assuming there is a splitting point and a merging point between the two paths). This avoids reordering because the previous burst of the same flow has already reached the merging point.
  5. Meh. Still sounded like bufferbloat to me. Running tcptrace on the RTTs in your capture is generally better than the Wireshark equivalents. Try it. Post the plots.

    Classic tcptrace plots here, and worth looking at in context with your problem.

    On your packet re-ordering idea, many wifi aps tend to reorder when doing retries - which is a GOOD thing - as it lowers latency on the AP.

    TCP stream reassembly is the job of the OS, and Linux and OS X handle it well. Modern Linux versions, especially.

    Windows - even modern versions of windows - does it really badly. Redmond is asleep at the switch.
  6. And oh, yeah, if you have WRED, use it, if you can configure it.

    For all other things, there's fq_codel.
  7. And while I'm at it, many people implementing "SFQ" DO permute the hash every 10 seconds, as that is what wondershaper did, and zillions of people just blindly copied that. It forces reordering every 10 seconds, which kept latency low, and acted as a poor man's AQM, in its own weird way.

    It's easy to see a misconfigured SFQ doing that if you have captures. 10 second period for the scrambling, under load.

    This dumb behavior in Linux's SFQ implementation was fixed around version 3.6, I believe, so periodic permutation no longer does that harm (but no longer acts as a poor man's AQM), but it will take a decade for that to sort itself out in the field.
  8. In the scenario where you use MPLS pseudo-wires without the MPLS Control Word set on the packets, you can very easily end up reordering packets towards MAC addresses starting with a 4 or a 6. I did a presentation about that some time ago:

