Sometimes It’s Not the Network

Thursday, December 3, 2015 12:03 +0100

Marek Majkowski published an awesome real-life story on the CloudFlare blog: users experienced occasional short-term sluggish performance, and while everything pointed to a network problem, it turned out to be a garbage collection problem in the Linux kernel.

Takeaway: It might not be the network's fault.

Also: How many people would be able to troubleshoot that problem and fix it? Technology is becoming way too complex, and I don’t think software-defined-whatever is the answer.

performance

10 comments:

Bogdan Golab 03 December 2015 12:37

Nowadays the network is tightly coupled with applications, and sometimes it is difficult to tell cause from effect: is the network the root cause, or did the application 'decide' to change its behaviour due to a sw bug, with the network responding accordingly (and misleading the troubleshooter)? So it's good to know the network from the traffic pattern perspective, that is, to know what normal behaviour looks like.

Frank Sweetser 03 December 2015 12:42

Often times, I feel like "SDN" is a piece of performance art designed to demonstrate as many sections of RFC1925 as possible.

Replies

Ivan Pepelnjak 03 December 2015 15:39

Awesome. A must-add for my "What is SDN" slide deck. Thank you!

Fred Chagnon 04 December 2015 05:47

Here's some relevant visual art for that slide deck.

http://blamethenetwork.com/the-moment-you-prove-its-not-the-network/

Unknown 04 December 2015 01:58

I suspect that, as is typically the case, everyone pointed to a network problem prior to the analysis. Note that it was proven fairly early on that the issue was with the server (tip: iPerf is very handy).

Replies

Unknown 04 December 2015 05:45

Well, it's one thing to prove your innocence; it's another to explain to others how you proved your innocence. Most people, myself included, would have given up after pinging the router and server, because going deeper would have generated blank stares. What amazes me the most is that Cloudflare has people with the patience to do such a thorough investigation.

Unknown 04 December 2015 02:43

It's also worth noting that sometimes the problem could be ISP equipment. I've recently had to troubleshoot a 10M fiber TLS WAN issue where I was getting the allotted bandwidth, but as soon as I put any decent load over the WAN, our packet round trip time jumped from idling at 1-2ms to 300-500ms (normally we would get between 15-25ms at a good load). After days of troubleshooting and removing as much equipment as possible between the sending and receiving ends, I concluded that it must be a problem with the ISP equipment. Sure enough, they insisted their equipment was functioning properly. After it was all said and done, it ended up being a Layer 1 issue: the ISP's Fiber to Ethernet media converter. Ugh! And yes, iPerf came in handy!

Christoph Wegener 04 December 2015 10:06

32Mbit buffers!!!???? Where is Jim Gettys when you need him?!

Replies

Christoph Wegener 04 December 2015 10:08

*32Mbyte, even.

Jeroen van Bemmel 07 December 2015 21:42

On a standard CentOS 6 Linux box with default settings:
sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 87380 4194304

So it seems that the root cause was someone changing the default settings in an attempt to "optimize".
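For anyone who wants to run the same comparison on their own box, here is a minimal Python sketch. It parses the three tcp_rmem values (minimum, default, maximum receive buffer size in bytes) and reports how far the maximum is from the CentOS 6 default quoted above. The names TcpRmem, parse_tcp_rmem and max_ratio are made up for this illustration, and the script assumes a Linux host that exposes /proc/sys/net/ipv4/tcp_rmem.

```python
from typing import NamedTuple


class TcpRmem(NamedTuple):
    """The three tcp_rmem fields, all in bytes (hypothetical helper type)."""
    min_bytes: int
    default_bytes: int
    max_bytes: int


def parse_tcp_rmem(text: str) -> TcpRmem:
    """Parse either `sysctl` output ("net.ipv4.tcp_rmem = a b c")
    or the raw contents of /proc/sys/net/ipv4/tcp_rmem ("a\tb\tc")."""
    values = text.split("=", 1)[-1].split()
    if len(values) != 3:
        raise ValueError(f"expected three values, got: {text!r}")
    return TcpRmem(*(int(v) for v in values))


# Baseline taken from the CentOS 6 output quoted in the comment above.
CENTOS6_DEFAULT = parse_tcp_rmem("net.ipv4.tcp_rmem = 4096 87380 4194304")


def max_ratio(current: TcpRmem, baseline: TcpRmem = CENTOS6_DEFAULT) -> float:
    """How many times larger the current maximum buffer is than the baseline."""
    return current.max_bytes / baseline.max_bytes


if __name__ == "__main__":
    try:
        with open("/proc/sys/net/ipv4/tcp_rmem") as f:
            current = parse_tcp_rmem(f.read())
    except OSError:
        print("tcp_rmem not readable (not Linux?)")
    else:
        print(f"current: {current}")
        print(f"max is {max_ratio(current):.1f}x the CentOS 6 default")
```

A 32 MiB maximum like the one mentioned earlier in the thread would show up here as 8 times the 4 MiB default. A large ratio alone does not prove anything is wrong; it only flags that someone moved away from the defaults.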
