Every now and then I get an email from a subscriber having video download problems. Most of the time the problem auto-magically disappears (and there’s no indication of packet loss or ridiculous latency in the traceroute printout), but a few days ago Henry Moats managed to consistently reproduce the problem and sent me exactly what I needed: a pcap file.
TL;DR summary: you have to know a lot about application-level protocols, application servers and operating systems to troubleshoot networking problems.
Henry immediately noticed something extremely weird: all of a sudden (in the middle of the transfer), my server sent a destination unreachable (administratively prohibited) ICMP reply and stopped responding to TCP packets.
I was totally stumped: the only module on my web server that could generate an administratively prohibited ICMP reply seemed to be iptables, so it looked like the web server dropped the TCP session without sending a TCP RST or FIN (weird) and the iptables module subsequently rejected all incoming TCP packets of the same session.
The pcap file showed plenty of retransmissions and out-of-order packets (it looks like there really are service providers out there that are clueless enough to reorder packets within a TCP session), but there was no obvious reason for the abrupt session drop, and the web server log files provided no clue: all requests sent by Henry’s web browser executed correctly.
The only weird clue the pcap file provided was the timing: the session dropped approximately 17 seconds after the transfer started, which was unpleasantly close to a 15-second timeout I vaguely remembered from one of the web server configuration files. A quick search found the only parameter that seemed to be relevant:
$ ack 15 conf*/*
The KeepAliveTimeout specifies how long a web server keeps an idle HTTP session open, so it might be relevant… but why would it kick in during the data transfer?
I thought the answer could be bufferbloat: excessive buffering performed in various parts of the TCP stack and within the network. It looked like my web server managed to dump the whole video file into some buffers and considered the transfer completed in seconds. When the browser failed to send another command within 15 seconds (because it was still busy receiving the data), the web server decided it was time to close the idle HTTP session.
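The timing argument is easier to follow with numbers. Here is a back-of-the-envelope model of the hypothesis in Python. It is only a sketch: the function name and all input values are made up to reproduce the observed timing (a drop around 17 seconds with a 15-second timeout), they are not measurements from the pcap file.

```python
def keepalive_race(file_bytes, buffered_bytes, drain_rate, keepalive_timeout):
    """Model the hypothesis: the server dumps the file into buffers, then idles.

    All arguments are hypothetical model inputs, not measured values:
    file_bytes        -- size of the downloaded file
    buffered_bytes    -- data the buffers between server and client can hold
    drain_rate        -- bytes per second the client actually receives
    keepalive_timeout -- idle timeout of the web server, in seconds
    """
    # The server considers the transfer finished once everything that does
    # not fit into the buffers has been drained by the client.
    server_done = max(0, file_bytes - buffered_bytes) / drain_rate
    # The client is busy receiving until the whole file has arrived.
    client_done = file_bytes / drain_rate
    # The idle timer starts when the server thinks it is done.
    timeout_at = server_done + keepalive_timeout
    # The session is dropped mid-transfer if the timer fires first.
    return server_done, client_done, timeout_at, timeout_at < client_done


if __name__ == "__main__":
    # Assumed values: 20 MB file, 19 MB of buffering, 500 kB/s towards the client.
    print(keepalive_race(20_000_000, 19_000_000, 500_000, 15))
    print(keepalive_race(20_000_000, 19_000_000, 500_000, 60))
```

With these assumed values the server thinks it is done after 2 seconds, the 15-second timer fires at 17 seconds, and the client would have needed 40 seconds to receive everything. In this model the drop happens whenever draining the buffered data takes longer than the idle timeout.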
Based on that assumption, it was easy to implement a workaround: increase the KeepAliveTimeout to 60 seconds. That seems to have solved the problem (I also added “send Connection: close header on long downloads” to my bug list).
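The bug-list item could look something like the Python sketch below. This is not the code running on my server; the function name and the size threshold are hypothetical, and the only point is to show where the Connection: close header would be added.

```python
def download_headers(content_length, long_download_bytes=10_000_000):
    """Build response headers for a file download.

    long_download_bytes is a made-up threshold; pick whatever
    "long download" means for your server.
    """
    headers = {"Content-Length": str(content_length)}
    if content_length >= long_download_bytes:
        # Signal that the connection is closed after this response, so the
        # server does not keep the session open waiting for another request.
        headers["Connection"] = "close"
    return headers
```

With Connection: close the session ends when the response is done, so the idle timeout should not matter for that download. The price is a new TCP session for the next request, which is probably acceptable after a large file transfer.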
It’s probably not that simple
I’m still trying to understand what exactly Henry experienced. After all, there are plenty of people all around the world accessing my web site over low-speed lines (thus downloading individual files for minutes) and none of them experience the same symptoms. Henry might have accessed my web site through a transparent web proxy that buffered too much data, or it might have been something completely different.
Have you experienced something similar? Write a comment!