Sometimes It’s Not the Network

Marek Majkowski published an awesome real-life story on the CloudFlare blog: users experienced occasional short-term sluggish performance, and while everything pointed to a network problem, it turned out to be a garbage collection problem in the Linux kernel.

Takeaway: It might not be the network's fault.

Also: How many people would be able to troubleshoot that problem and fix it? Technology is becoming way too complex, and I don’t think software-defined-whatever is the answer.

10 comments:

  1. Nowadays the network is tightly coupled with applications, and sometimes it is difficult to separate cause from effect: is the network the root cause, or did the application 'decide' to change its behavior due to a software bug, with the network responding accordingly (misleading the troubleshooter)? So it's good to know the network from the traffic-pattern perspective: know what normal behavior looks like.
  2. Oftentimes, I feel like "SDN" is a piece of performance art designed to demonstrate as many sections of RFC1925 as possible.
    Replies
    1. Awesome. A must-add for my "What is SDN" slide deck. Thank you!
    2. Here's some relevant visual art for that slide deck.

      http://blamethenetwork.com/the-moment-you-prove-its-not-the-network/
  3. I suspect that, as is typically the case, everyone pointed to a network problem prior to the analysis. Note that it was proven fairly early on that the issue was with the server (tip: iPerf is very handy; see the quick sketch after the comments).
    Replies
    1. Well, it's one thing to prove your innocence; it's another to explain to others how you proved your innocence. Most people, myself included, would have given up after pinging the router and the server, because going deeper would have generated blank stares. What amazes me the most is that Cloudflare has people with the patience to do such a thorough investigation.
  4. It's also worth noting that sometimes the problem could be ISP equipment. I recently had to troubleshoot a 10M fiber TLS WAN issue where I was getting the allotted bandwidth, but as soon as I put any decent load on the WAN, our packet round-trip time jumped from 1-2ms at idle to 300-500ms (normally we would get 15-25ms under a good load). After days of troubleshooting and removing as much equipment as possible between the sending and receiving ends, I concluded that it must be a problem with the ISP equipment. Sure enough, they insisted their equipment was functioning properly... After it was all said and done, it turned out to be a Layer 1 issue: the ISP's fiber-to-Ethernet media converter. Ugh! ...and yes, iPerf came in handy!
  5. 32Mbit buffers!!!???? Where is Jim Gettys when you need him?!
  6. On a standard CentOS 6 Linux box with default settings:
    sysctl net.ipv4.tcp_rmem
    net.ipv4.tcp_rmem = 4096 87380 4194304

    So it seems that the root cause was someone changing the default settings in an attempt to "optimize" them. (The three values are the minimum, default, and maximum TCP receive buffer sizes in bytes; see the note after the comments.)
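
Since a couple of commenters mention iPerf, here's a minimal sketch of the kind of test they describe, assuming iperf3 is installed on both ends of the link (the host name is hypothetical):

    # on the server under test
    iperf3 -s

    # on the client: a 30-second TCP throughput test toward the server
    iperf3 -c server.example.com -t 30

    # in a second client window: watch round-trip time while the link is loaded;
    # RTT jumping from a few ms to hundreds of ms (as in comment #4) points at
    # buffering or equipment problems rather than lack of capacity
    ping server.example.com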
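For context on comment #6: the three tcp_rmem values are the minimum, default, and maximum TCP receive buffer sizes in bytes. A sketch of how such settings get changed, assuming the "optimization" was a larger maximum buffer (the 32M figure is extrapolated from comment #5 and is purely illustrative):

    # the CentOS 6 defaults quoted in comment #6 (min default max, in bytes)
    sysctl -w net.ipv4.tcp_rmem='4096 87380 4194304'

    # the kind of "optimized" setting the commenters suspect: a 32 MB maximum
    # (hypothetical value, not taken from the original post)
    sysctl -w net.ipv4.tcp_rmem='4096 87380 33554432'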