TCP Congestion Avoidance on Satellite Links

While some people spread misinformation, others work hard to figure out how to make TCP work on exotic links with low bandwidth and a one-second RTT.

Ulrich Speidel published a highly interesting article on the APNIC blog describing the challenges of satellite Internet access and the approach (network-coded TCP) they took to avoid them.

What did they do?

They implemented an IP-layer coding mechanism on the layer-3 path that traversed the satellite link, effectively spreading every TCP packet across a number of transport packets (to minimize the effect of packet bursts on other TCP sessions) while also adding forward error correction to recover from a reasonable packet loss rate without triggering TCP retransmissions.
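
To make the idea concrete, here's a toy sketch (mine, not the scheme from the article, which uses proper network coding): split each packet's payload into k fragments and add an XOR parity fragment, so the receiver can rebuild the original packet after losing any single fragment on the link.

```python
from functools import reduce

# Toy erasure code: k data fragments plus one XOR parity fragment.
# The coded TCP described in the article uses more general network
# coding; this only illustrates the "spread + FEC" principle.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(payload: bytes, k: int = 4) -> list:
    frag_len = -(-len(payload) // k)                    # ceiling division
    padded = payload.ljust(k * frag_len, b"\x00")       # pad to k fragments
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    frags.append(reduce(xor, frags))                    # parity fragment
    return frags

def decode(frags: list) -> bytes:
    # Rebuild at most one missing fragment (None) by XORing the others.
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("toy code recovers only a single lost fragment")
    if missing:
        frags[missing[0]] = reduce(xor, [f for f in frags if f is not None])
    return b"".join(frags[:-1])                         # drop parity (padding stays)

coded = encode(b"one TCP segment", k=4)
coded[1] = None                 # simulate losing one transport packet
assert decode(coded).rstrip(b"\x00") == b"one TCP segment"
```

A single parity fragment per packet trades a fixed bandwidth overhead (1/k) for immunity to isolated losses; a real scheme can tune that overhead to the observed loss rate.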

VeloCloud demonstrated a similar solution at Networking Field Day 9 (video).

Couldn’t they just use QoS?

No. As in xDSL deployments, they didn't control the congestion point (the satellite uplink), which probably provided just a simple FIFO queue with tail drop (see also the blog post comments).
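
For illustration, here's a minimal sketch of what I assume the uplink does: a fixed-size FIFO with tail drop. Once the queue is full, every further packet is dropped no matter which session it belongs to or how it's marked, so QoS markings set by the end hosts buy nothing at this bottleneck.

```python
from collections import deque

# Assumed uplink behavior (not confirmed by the article): a fixed-size
# FIFO with tail drop -- no fairness, no priorities.

QUEUE_LIMIT = 8
queue: deque = deque()
dropped: list = []

def enqueue(pkt: str) -> None:
    if len(queue) < QUEUE_LIMIT:
        queue.append(pkt)       # accepted into the FIFO
    else:
        dropped.append(pkt)     # tail drop: excess packets simply vanish

# Two sessions bursting at the same moment share the same fate:
for i in range(6):
    enqueue(f"A{i}")
    enqueue(f"B{i}")

print("queued: ", list(queue))   # A0 B0 ... A3 B3
print("dropped:", dropped)       # A4 B4 A5 B5 -- both sessions hit at once
```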

Couldn’t they use intelligent shaping?

No idea. Teclo Networks uses intelligent per-session shaping to improve TCP goodput on mobile networks (more details in Software Gone Wild Episode 25), but I don't know enough to judge whether their approach would work in an environment with a very large TCP RTT. I'll have to ask Juho.
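
I have no insight into Teclo's actual algorithm; the generic token-bucket shaper sketched below (all names and numbers are mine) just shows the basic idea of pacing each session below the bottleneck rate so the queue builds up in a device you control instead of at the satellite uplink.

```python
import time

# Generic token-bucket shaper -- NOT Teclo's algorithm, just a sketch.
# All parameter values are made up for illustration.

class TokenBucket:
    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate = rate_bps / 8            # refill rate, bytes/second
        self.tokens = burst_bytes           # current balance (may go negative)
        self.capacity = burst_bytes
        self.last = time.monotonic()

    def send_delay(self, pkt_bytes: int) -> float:
        """Charge one packet and return how long to hold it back."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= pkt_bytes            # spend tokens; debt allowed
        if self.tokens >= 0:
            return 0.0                      # within the burst: send now
        return -self.tokens / self.rate     # wait until the debt refills

# Shape one session to 1 Mbps with a 15 kB burst allowance:
bucket = TokenBucket(rate_bps=1_000_000, burst_bytes=15_000)
for _ in range(5):
    time.sleep(bucket.send_delay(1500))     # pace 1500-byte packets
```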

Can’t we just fix TCP?

Maybe. While academics claim their machine-generated congestion control algorithms almost doubled TCP throughput, I'm not aware of any production experience, particularly in harsh environments (mobile or satellite)… and there might be a slight difference between theory and practice.
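
A back-of-the-envelope calculation with the well-known Mathis approximation for steady-state TCP Reno throughput (rate ≈ MSS/RTT × 1.22/√p) shows why long-RTT links suffer so badly from even moderate loss. The numbers below are illustrative, not measurements from the article.

```python
from math import sqrt

# Mathis et al. approximation for steady-state TCP Reno throughput:
#   rate ≈ (MSS / RTT) * (C / sqrt(p)),  C ≈ 1.22
# Illustrative numbers only, not data from the article.

def mathis_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    return (mss_bytes * 8 / rtt_s) * 1.22 / sqrt(loss)

for rtt in (0.02, 0.6, 1.0):                # LAN-ish, GEO satellite, worst case
    rate = mathis_bps(1460, rtt, 0.01)      # 1460-byte MSS, 1% loss
    print(f"RTT {rtt:4.2f}s -> {rate / 1e6:5.2f} Mbps")
```

At 1% loss, the same loss rate that barely dents a LAN transfer (around 7 Mbps) caps a 1-second-RTT session at roughly 140 kbps, which is why removing loss with FEC (or changing TCP's response to it) pays off so handsomely.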

Obviously, if I’m missing something, please write a comment. Thank you!

Interested in more information?

No, this is not my usual “here be webinars” blurb. Ulrich wrote a great follow-up article that’s another must-read.

11 comments:

  1. Pacing should help, but sadly still isn't a standard feature. (See the pacing sketch after the comments.)
  2. I do look forward to someone publishing some data on the fq_codel algorithm tuned for sats: an interval of 1000 ms and a 50-100 ms target latency.
  3. With link-level traffic shaping you could move the congestion point to your own router, provided the bandwidth is reserved like on a leased line. If it changes dynamically, you need a sophisticated controller (e.g. SDN). You could also add a Performance Enhancement Proxy (PEP); most routers already contain one. You could also use various header and content compression schemes. And of course, you could use multiple parallel threads.
  4. You could also use cross-layer QoS and admission control with adaptive coding to accommodate changing weather conditions. And you may also add FEC to reduce packet drops. This all could be combined with what you have already suggested...
  5. Get some wider support for TFO? https://www.youtube.com/watch?v=Qo9rFpiLMWI
    Replies
    1. That solves just the question of the first RTT, which is important (and annoying if you're sitting at the browser) but there are many more things to solve.
  6. J Hand sent me this comment after Blogger failed a few times:
    =============
    Couldn’t the results be just as well or better explained by a combination of periodic weather fades and the possibility that the SATCOM link provider chose a modulation and coding scheme that is insufficiently robust for these fades? The additional coding added at the network layer shores up the insufficient link-layer coding and reduces the residual packet loss rate during the fades to a level at which TCP throughput for the link can be maintained.

    If the problem were fades, queue oscillations wouldn’t likely be a significant part of the problem. The queues could be chronically underutilized because the TCP senders would be misinterpreting corruption loss as a sign of congestion, and slowing down unnecessarily, regardless of the queue levels.

    And as noted in the comments on the original article, if the cause of the packet loss *were* congestion/queue oscillations, then the additional coding would be masking congestion signals. The article seems to assume that there is too much packet loss (and therefore too many congestion signals), which is preventing the TCP sessions from ramping up. But too little packet loss and too few congestion signals would still result in queue oscillation: the TCP sessions would keep ramping up until eventually there was enough packet loss that the network coding couldn’t mask it. The article doesn’t seem to explain why the network coding scheme they used got the level of packet loss right.

    Again, sorry, but please set me right on what I’m missing.
  7. Forward error correction on a satellite link is implemented at layer 1; why would anyone need to do it at layer 3 too? Not sure it's possible to make it more efficient.
  8. And what about TCP acceleration... forgotten? But it is there :)
  9. We have 650+ ms RTT on our links across the Yamal 402 satellite, and besides the IP telephony, which runs smoothly, the modem's TCP acceleration works pretty well for multiple concurrent TCP sessions, even when there is significant oversubscription. When the traffic is encapsulated in even basic GRE, the modem can do nothing with it, and the key role falls to an in-path device like Riverbed.
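
On the pacing point from comment #1: Linux has supported per-socket pacing for a while via the fq qdisc and the SO_MAX_PACING_RATE socket option. Here's a minimal sketch (Linux-only; Python's socket module doesn't export this constant by name, so I use the raw value, and the rate cap is an arbitrary example):

```python
import socket

# Per-socket pacing on Linux (kernel 3.13+), enforced by the fq qdisc
# (or by TCP internally on newer kernels). SO_MAX_PACING_RATE takes a
# rate in bytes per second; the constant's value is from
# <asm-generic/socket.h> and is not portable beyond Linux.

SO_MAX_PACING_RATE = 47          # Linux-only socket option number

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Cap this connection at ~125 kB/s (1 Mbps) so it can't emit the
# line-rate bursts that overflow a shallow FIFO on a satellite uplink:
sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, 125_000)
```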