Category: what went wrong
Still basically the same old debate from 25 years ago that experienced Network Architects and Engineers understood during technology changes; "Do you architect your network around an application(s) or do you architect your application(s) around your network"
I would change that to “the same meaningless debate”. Networking is infrastructure; it’s time we grow up and get used to it.
When Apple launched the new release of iOS last autumn, networking gurus realized the new iOS uses MP-TCP, a recent development that allows a single TCP socket (as presented to the higher layers of the application stack) to use multiple parallel TCP sessions. Does that mean we’re getting closer to fixing the TCP/IP stack?
TL&DR summary: Unfortunately not.
If you’re in the Service Provider business, this is (hopefully) old news: on Friday, RIPE decided to experiment with the Internet causing routers running IOS-XR to hiccup. They stopped the experiment in less than half an hour and only 2% of the Internet was affected according to Renesys analysis (a nice side effect: Tassos had great fun decoding the offending BGP attribute from hex dumps).
My first gut reaction was “something’s doesn’t feel right”. A BGP bug in IOS-XR affects only 2% of the Internet? Here are some possible conclusions:
Almost 30 years ago, I was lucky enough to work on one of the best systems of those days, VAX/VMS (BTW, it was able to run 30 interactive users in 2 MB of main memory), which had everything we’d wished for – it was truly interactive with hierarchical file system and file versioning (not to mention remote file access and distributed clusters). I couldn’t possible understand the woes of IBM mainframe programmers who had to deal with virtualized 132-column printers and 80-column card readers (ironically running in virtual machines that the rest of the world got some 20 years later). When I wanted to compile my program, I started the compiler; when they wanted to do the same, they had to edit a batch job, submit the batch job (assuming the disk libraries were already created), poll the queues to see when it completed and then open the editor to view the 132-column printout of compiler errors.
After a long discussion, I started to understand the problem: the whole system was burdened with so many legacy decisions that still had to be supported that there was nothing one could do to radically change it (yeah, it’s hard to explain that to a 20-year old kid full of himself).
One would hope that the IPv6 myths are slowly fading away as more people get exposed to IPv6 ... but if you like them, don’t worry; they are constantly being recycled. The IPv6: Why Bother? article published by InformIT is a perfect example:
With IPv6, there are enough addresses now that every country or major network can be assigned a large range. It can then assign subranges within that to networks that it connects to, and so on. This hierarchical assignment (in theory, at least) simplifies routing decisions.
Anyone dealing with FTP and firewalls has to ask himself “what were those guys
smoking?” As we all know, FTP is seriously broken :
- Command and data streams use separate sessions.
- Layer-3 addresses and layer-4 port numbers are carried in layer-7 messages.
- FTP server opens a reverse session to a dynamic port assigned by the FTP client.
Once upon a time, there was a very good reason for this weird behavior. As Marcus Ranum explained in his Internet nails talk @ TEDx (the title is based on the For Want of a Nail rhyme), the original FTP program had to use two sessions because the sessions in the original (pre-TCP) Arpanet network were unidirectional. When TCP was introduced and two sessions were no longer needed (or, at least, they could be opened in the same direction), the programmer responsible for the FTP code was simply too lazy to fix it.
One of the decades-long grudges most people have with BGP is that it’s so easy to insert bogus routing information into the Internet if your upstream ISP happens to be a careless idiot (as Google discovered when Pakistan decided to use blackhole routing for Youtube and leaked the routes). There are two potential solutions that use X.509 certificates to authenticate BGP information: Secure BGP (which uses optional transitive attributes) authenticates the originator as well as the whole AS-path (using AS-by-AS certificates), while the significantly simpler Secure Origin BGP (which uses new BGP messages) authenticates only the originator of the routing information.
When the friendly sales guy from your favorite vendor honors you with an “independent test lab” report on the newest wonderful gadget he’s trying to sell you, there’s one thing you can be sure of: the box behaves as described in the report. The “independent” labs are earning too much money verifying the test results to participate in outright lies. Whether the results correlate with your needs is a different story, but we’ll skip this discussion.
However, when you’re faced with a competitive report from an independent test lab “sponsored” (read: paid) by one of the vendors, rest assured it’s as twisted as it can be (you should also suspect the sponsoring vendor has some significant issues he’s trying to cover). The report will dutifully list the test configurations and the test results ... without mentioning that the configurations and the tests were cherry-picked by the sponsoring vendor. You don’t believe me? Put on your most cynical glasses and read the About us statement from the premier independent test lab.
You still want me to prove my point: look at the latest HP-versus-Cisco blade server test results (paid by HP). They took an oversubscribed UCS chassis (it had 4 10GB uplinks and up to 8 servers) and compared it to an HP chassis with 8 servers and 8 10GB uplinks. Furthermore, Kevin Tolly himself admitted in the comments that they’ve really tested the bandwidth between the servers within the chassis (absolute kudos for being so frank), which you might suspect could be somewhat irrelevant in a typical deployment scenario.
As expected, my “the socket API is broken” post generated numerous comments, many of them missing the point (for example, someone scolded me for quoting Wikipedia and not the official Linux documentation). I did not want to discuss the intricate technical details of the various incarnations of the API but the generic stupidity of having to deal with low-level networking details in the application.
Fabio was kind enough to provide the recommended method of using the Socket API from man getaddrinfo, effectively proving my point: why should every application use a convoluted function when all we want to do (in most cases) is connect to the server.
Patryk went even further and claimed that the socket API provides “basic functionality” and that libc is not the right place for anything more. Well, that mentality caused most of the IPv4-to-IPv6 application-related issues: obviously the applications developed before IPv6 was a serious consideration had to be rewritten because all the low-level code was embedded in the applications, not isolated in the library. A similar problem has effectively stalled SCTP deployment.
However, these are not the only problems we’re facing today. Even if the application properly implements the “try connecting to multiple addresses returned by DNS” function, the response time becomes unacceptable due to the default TCP timeout values coded in various operating systems’ TCP stacks.
Someone really wants to hear my opinion on SCTP (RFC 4960); he’s added a “what about SCTP” comment to several Internet-related posts I wrote in the last weeks. So, here are my totally unqualified (I have no hands-on experience) thoughts about SCTP.
Let me reiterate: I’m taking a 30,000-foot perspective here and whatever I’m writing could be completely wrong. If that’s the case, please point out my mistakes in your comments.
From the distance, the protocol looks promising. It provides datagram (unreliable messages), reliable message (record) and stream transport. Even more important, each connection can run across multiple IP addresses on each endpoint, providing native support for scalable IP multihoming (where each multihomed host resides in multiple PA address blocks from various Service Providers).
ATM has the potential to displace all existing internetworking technologies
One single network handles all traffic types: Bursty data and Time-sensitive continuous traffic (voice/video).
All these claims are still true if you just replace »ATM« with »IP«. So what went wrong with ATM (and why did the underdog IP win)? I can see the following major issues:
- ATM is a layer-2 technology that wanted to replace all other layer-2 technologies. Sometimes it made sense (ADSL), sometimes not so much (LAN … not to mention LANE). IP is a layer-3 technology that embraced all layer-2 technologies and unified them into a single network.
- ATM is an end-to-end circuit-oriented technology, which made perfect sense in a world where a single session (voice call, terminal session to mainframes) lasted for minutes or hours and therefore the cost of session setup became negligible. In a Web 2.0 world where each host opens tens of sessions per minute to servers all across the globe, the session setup costs would be prohibitive.
- Because of its circuit-oriented nature, ATM causes per-session overhead in each node in the network. Core IP routers don’t have to keep the session state as they forward individual IP datagrams independently. IP is thus inherently more scalable than ATM.
The shift that really made ATM obsolete was the changing data networking landscape: voice and long-lived low-bandwidth data sessions which dominated the world at the time when ATM was designed were dwarfed by the short-lived bursty high-bandwidth web requests. ATM was (in the end) a perfect solution to the wrong problem.
We all know that the global BGP table is exploding (see the Active BGP entries graph) and that it will eventually reach a point where the router manufacturers will not be able to cope with it via constant memory/ASIC upgrades (Note: a layer-3 switch is just a fancy marketing name for a router). The engineering community is struggling with new protocol ideas (for example, LISP) that would reduce the burden on the core Internet routers, but did you know that we could reduce the overall BGP/FIB memory consumption by over 35% (rolling back the clock by two and a half years) if only the Internet Service Providers would get their act together.
Take a look at the weekly CIDR report (archived by WebCite on June 22nd), more specifically into its Aggregation summary section. The BGP table size could be reduced by over 35% if the ISPs would stop announcing superfluous more specific prefixes (as the report heading says, the algorithm checks for an exact match in AS path, so people using deaggregation for traffic engineering purposes are not even included in this table). You can also take a look at the worst offenders and form your own opinions. These organizations increase the cost of doing business for everyone on the Internet.
Why is this behavior tolerated? It’s very simple: advertising a prefix with BGP (and affecting everyone else on the globe) costs you nothing. There is no direct business benefit gained by reducing the number of your BGP entries (and who cares about other people’s costs anyway) and you don’t need an Internet driver’s license (there’s also no BGP police, although it would be badly needed).
Fortunately, there are some people who got their act together. The leader in the week of June 15th was JamboNet (AS report archived by Webcite on June 22nd) that went from 42 prefixes to 7 prefixes.
What can you do to help? Advertise the prefixes assigned to you by Internet Registry, not more specific ones. Check your BGP table and clean it. Don’t use more specific prefixes solely for primary/backup uplink selection.