TCP Congestion Avoidance on Satellite Links
While some people spread misinformation, others work hard to figure out how to make TCP work on exotic links with low bandwidth and a one-second RTT.
Ulrich Speidel published a highly interesting article on the APNIC blog describing the challenges of satellite Internet access and the approach (network coded TCP) they took to address them.
What did they do?
They implemented an IP-layer coding mechanism on the layer-3 path traversing the satellite link, effectively distributing every TCP packet across a number of transport packets (to minimize the effects of packet bursts on other TCP sessions) while also adding forward error correction to recover from a reasonable packet loss rate without triggering TCP retransmissions.
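If you want a feel for what "spread the data across coded packets and add forward error correction" means, here's a deliberately simplified Python sketch using one XOR parity packet per block. Their solution uses proper network coding, so treat this as an illustration of the idea (all names and the block size are made up), not their implementation:

```python
# Minimal illustration of block-level forward error correction:
# for every block of k data packets we send one XOR parity packet,
# so a single packet lost within the block can be reconstructed at
# the far end without triggering a TCP retransmission.
# (The real solution uses network coding, not simple XOR parity.)

from functools import reduce

BLOCK_SIZE = 4  # hypothetical: data packets per parity packet

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode_block(packets: list[bytes]) -> list[bytes]:
    """Return the original packets plus one XOR parity packet."""
    parity = reduce(xor_bytes, packets)
    return packets + [parity]

def recover_block(received: list) -> list[bytes]:
    """Rebuild at most one missing packet using the parity packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    if not missing:
        return received[:-1]              # nothing lost, drop parity
    if len(missing) > 1:
        raise ValueError("more than one loss per block -- cannot recover")
    present = [p for p in received if p is not None]
    received[missing[0]] = reduce(xor_bytes, present)
    return received[:-1]

# Example: packet 2 of the block is lost on the satellite hop
block = [bytes([i] * 8) for i in range(BLOCK_SIZE)]
coded = encode_block(block)
coded[2] = None                           # simulate a corruption loss
assert recover_block(coded) == block
```

With one parity packet per block of four, a single loss inside the block is repaired at the receiving end, so the TCP sender never sees the loss and never halves its congestion window.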
VeloCloud demonstrated a similar solution at Networking Field Day 9 (video).
Couldn’t they just use QoS?
No. Similar to xDSL deployments, they didn’t control the congestion point (the satellite uplink), which probably provided just a simple FIFO queue with tail drop (see also the blog post comments).
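For completeness, this is roughly all you can assume about a bottleneck you don't control: a fixed-size FIFO that silently discards whatever arrives once it's full. A minimal sketch (names and sizes are made up for illustration):

```python
# Minimal model of a FIFO queue with tail drop -- the only queuing
# behaviour one can reasonably assume on a bottleneck (the satellite
# uplink) outside of one's control. Once the queue is full, every
# further arrival is dropped, regardless of which session it belongs to.

from collections import deque

class TailDropFifo:
    def __init__(self, max_packets: int):
        self.queue = deque()
        self.max_packets = max_packets
        self.dropped = 0

    def enqueue(self, packet) -> bool:
        if len(self.queue) >= self.max_packets:
            self.dropped += 1      # tail drop: arriving packet discarded
            return False
        self.queue.append(packet)
        return True

    def dequeue(self):
        return self.queue.popleft() if self.queue else None
```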
Couldn’t they use intelligent shaping?
No idea. Teclo Networks is using intelligent per-session shaping to improve TCP goodput on mobile networks (more details in Software Gone Wild Episode 25), but I don’t know enough to judge whether their approach would work in an environment with a very large TCP RTT. Have to ask Juho…
Can’t we just fix TCP?
Maybe. While academics claim their machine-generated congestion control algorithms increased TCP throughput by almost a factor of 2, I’m not aware of any production experience, particularly in harsh environments (mobile or satellite)… and there might be a slight difference between theory and practice.
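A quick back-of-envelope calculation (mine, not from the article) using the well-known Mathis approximation for TCP Reno throughput (roughly MSS / (RTT · √p), ignoring the constant) shows why a one-second RTT is so punishing even at modest loss rates:

```python
# Back-of-envelope Mathis approximation for TCP Reno throughput:
#   throughput ~= MSS / (RTT * sqrt(p))
# showing how badly a one-second RTT caps a single TCP session even at
# modest loss rates (numbers are illustrative, not from the article).

from math import sqrt

MSS_BITS = 1460 * 8                  # maximum segment size in bits

def mathis_throughput_bps(rtt_s: float, loss_rate: float) -> float:
    return MSS_BITS / (rtt_s * sqrt(loss_rate))

for rtt_s in (0.02, 0.6, 1.0):       # LAN-ish, GEO satellite, the 1 s case
    for p in (0.0001, 0.01):
        print(f"RTT {rtt_s:4.2f} s, loss {p:6.2%}: "
              f"{mathis_throughput_bps(rtt_s, p) / 1e6:8.3f} Mbps")
```

At 1% loss and a one-second RTT a single Reno-like session tops out at roughly 100 kbps, which is also why masking residual loss with forward error correction pays off so handsomely.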
Obviously, if I’m missing something, please write a comment. Thank you!
Interested in more information?
No, this is not my usual “here be webinars” blurb. Ulrich wrote a great follow-up article that’s another must-read.
=============
Couldn’t the results be just as well or better explained by a combination of periodic weather fades and the possibility that the SATCOM link provider chose a modulation and coding scheme that is insufficiently robust for these fades? The additional coding added at the network layer shores up the insufficient link-layer coding and reduces the residual packet loss rate during the fades to a level at which TCP throughput for the link can be maintained.
If fades were the cause, queue oscillations likely wouldn’t be a significant part of the problem. The queues could be chronically underutilized because the TCP senders would be misinterpreting corruption loss as a sign of congestion and slowing down unnecessarily, regardless of the queue levels.
And as noted in the comments in the original article, if the cause of the packet loss *were* congestion/queue oscillations, then the additional coding would be masking congestion signals. The article seems to assume that there is too much packet loss (and therefore too many congestion signals), and that this is preventing the TCP sessions from ramping up. But too little packet loss (and thus too few congestion signals) would still result in queue oscillation: the TCP sessions would keep ramping up until eventually there was enough packet loss that the network coding couldn’t mask it. The article doesn’t seem to explain why the network coding scheme they used got the level of packet loss right.
Again, sorry, but please set me right on what I’m missing.