Why Does the Internet Keep Breaking?

James Miles sent me a long list of really good questions along the lines of “why do we see so many Internet-related outages lately, and is it due to BGP and DNS creaking from old age?”. He started with:

Over the last few years there have been more “high-profile” incidents relating to Internet connectivity. I raise the question: why?

The most obvious reason: the Internet became mission-critical infrastructure, and well-publicized incidents attract eyeballs.

Ignoring the clickbait, the underlying root cause is, in many cases, the race to the bottom. Large service providers brought that onto themselves when they thought they could undersell the early ISPs and compensate for the losses with voice calls (only to discover that voice-over-Internet works too well).

The initial version of this blog post incorrectly claimed that FRR does not support multi-threaded routing daemons. I removed the offending part of the blog post; more details to follow.

The only way out of that morass is either simplified services (example: Deutsche Telekom TeraStream), increased automation (which brings its own perils; see the Facebook October 2021 outage), or ever-more-appalling quality of support and service.

For example, we have known for ages what needs to be done to stop fat-finger incidents, and yet many large ISPs, Verizon among them (not picking on them; it's just that their SNAFU went public), did absolutely nothing to implement the most rudimentary safeguards, like limiting the number of BGP prefixes a customer can advertise.
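To make that safeguard concrete: all it takes is counting the prefixes received over a customer session and refusing to accept more than an agreed-upon ceiling (on real routers this is the per-neighbor maximum-prefix knob). Here's a minimal Python sketch of the idea, with invented customer names and limits:

```python
# Sketch of a maximum-prefix safeguard on a customer BGP session.
# Customer names and limits are invented; real routers implement this
# as a per-neighbor maximum-prefix configuration knob.

MAX_PREFIXES = {
    "AS64500": 50,   # customer agreed to advertise at most 50 prefixes
    "AS64501": 10,
}

def accept_updates(customer: str, advertised: list[str]) -> list[str]:
    """Return the prefixes we accept; refuse the whole batch (here: raise)
    when the customer advertises more than the agreed-upon ceiling."""
    limit = MAX_PREFIXES.get(customer, 0)  # unknown customers get nothing
    if len(advertised) > limit:
        raise RuntimeError(
            f"{customer} advertised {len(advertised)} prefixes "
            f"(limit {limit}); shutting down the session"
        )
    return advertised

# Simulated fat-finger event: a customer leaks 200 prefixes instead of 10
try:
    accept_updates("AS64501", [f"10.{i}.0.0/16" for i in range(200)])
except RuntimeError as error:
    print(error)
```

That's the whole trick: a single counter and a configured limit per customer session would have contained most of the well-publicized fat-finger leaks.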

You don’t have to take my word for it. There are public services tracking BGP leaks¹ and hijacks², and small-scale incidents happen every week. We’re also facing a few global leaks and hijacks every quarter³.
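Conceptually, detecting a hijack boils down to comparing the origin AS of an observed announcement against the origin registered for that prefix (in RPKI ROAs or IRR route objects) and flagging mismatches. A toy Python version of that check, with a made-up registry:

```python
# Toy origin check: flag announcements whose origin AS doesn't match the
# registered origin for the prefix. The "registry" below is invented;
# real services use RPKI ROAs and IRR route objects as the source of truth.

EXPECTED_ORIGIN = {
    "192.0.2.0/24": 64500,
    "198.51.100.0/24": 64501,
}

def classify(prefix: str, origin_as: int) -> str:
    expected = EXPECTED_ORIGIN.get(prefix)
    if expected is None:
        return "unknown (no registered origin)"
    return "ok" if origin_as == expected else "possible hijack"

print(classify("192.0.2.0/24", 64500))   # ok
print(classify("192.0.2.0/24", 64999))   # possible hijack
```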

A long while ago, a group of engineers focused on Internet stability defined best practices under the MANRS umbrella. Many ISPs started following them, but there are still too many sloppy incompetents out there. Compare the list of MANRS participants with the list of Tier-1 providers and reach your own conclusions.

BGP and DNS are some of the oldest protocols in regular use; are they creaking under modern approaches?

I wouldn’t say so. Considering the limitations of hop-by-hop destination-only packet-by-packet forwarding, BGP works just fine (and is good enough for many use cases). It has too many knobs because vendors always tried to solve the next feature request with one more intent-based knob instead of a plugin architecture, but that’s a different story.
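To unpack “hop-by-hop destination-only forwarding”: every router in the path makes an independent decision based solely on the destination address, picking the most specific matching prefix from its own table. A small illustration using Python's ipaddress module, with an invented forwarding table:

```python
# Each hop makes an independent forwarding decision based solely on the
# destination address: longest-prefix match against its own table.
import ipaddress

FORWARDING_TABLE = {                 # prefix -> next hop (invented values)
    "0.0.0.0/0": "upstream-A",
    "192.0.2.0/24": "customer-1",
    "192.0.2.128/25": "customer-2",
}

def next_hop(destination: str) -> str:
    dst = ipaddress.ip_address(destination)
    matches = [
        (net, nh) for net, nh in FORWARDING_TABLE.items()
        if dst in ipaddress.ip_network(net)
    ]
    # the longest (most specific) matching prefix wins
    best = max(matches, key=lambda item: ipaddress.ip_network(item[0]).prefixlen)
    return best[1]

print(next_hop("192.0.2.200"))   # customer-2 (matches the /25)
print(next_hop("203.0.113.7"))   # upstream-A (only the default route matches)
```

Nothing in that lookup knows where the packet came from or what it's trying to do, which is exactly the limitation BGP has to live with.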

The real problem with BGP seems to be the implementations. Too many devices still run decades-old code that was written to work well on single-core 16- or 32-bit processors with 4 MB of RAM.

DNS seems to be in a bit more of a tight spot due to DNSSEC – the replies don’t fit into a single UDP packet anymore. I know just enough about DNS to be able to form wrong opinions, but from where I’m sitting it looks like we would have to change the way we do things and the default settings, but maybe not the whole protocol.
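If you want to see the size problem for yourself, one quick experiment (assuming the third-party dnspython library and a resolver willing to answer) is to send a DNSKEY query with the DNSSEC-OK bit set over UDP and look at the response size and truncation flag:

```python
# Rough experiment: send a DNSKEY query with the DNSSEC-OK (DO) bit set
# over UDP and look at the response size and the truncation (TC) flag.
# Assumes the third-party dnspython package (pip install dnspython);
# the resolver address and domain are arbitrary examples.
import dns.flags
import dns.message
import dns.query
import dns.rdatatype

query = dns.message.make_query(
    "example.com", dns.rdatatype.DNSKEY,
    want_dnssec=True, use_edns=0, payload=1232,
)
response = dns.query.udp(query, "9.9.9.9", timeout=5, ignore_trunc=True)

print("response size:", len(response.to_wire()), "bytes")
print("truncated (TC bit set):", bool(response.flags & dns.flags.TC))
```

Depending on the zone and the key sizes, the signed answer can easily exceed what a single unfragmented UDP packet carries, forcing truncation and a retry over TCP.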

As these are some of the oldest protocols in regular use, are engineers losing the skill to deploy them effectively?

Both protocols (like everything else) are getting more and more complex as people pile new features on top of what was once a stable infrastructure. For example, even though I was running DNS and email servers in the 1990s (and even ported sendmail to MS-DOS to implement my own email service), I wouldn’t dream of running them these days; there are too many details I’m simply not familiar with.

BGP seems to be doing a bit better from the edge AS perspective — connecting your AS to two somewhat competent ISPs is as easy as it ever was. Beyond that, there’s a steep learning curve.

However, successfully implementing a science project doesn’t give you the experience to run a large-scale system, and as the number of large-scale systems is limited, it’s hard to migrate from one to the other. The situation is similar in any sufficiently commoditized infrastructure discipline, from power transmission to gas pipelines or water supply. Being able to pull cables through walls doesn’t make you an expert in high-voltage power lines.

Unfortunately, what we’re lacking in networking (and most of IT in general) is a solid foundation on which to build things, the rigorous training that traditional engineering disciplines have, an equivalent of Professional Engineer exams, and professional liability. We’re often throwing spaghetti at the wall and getting ecstatic when some of it sticks.

Have we put too many extras and/or sticky plasters on the protocols, and are they now destabilizing?

We put too many extras and sticky plasters everywhere. Too much of the infrastructure out there is a smoking pile of unstable kludges that eventually collapse.

We’re also increasingly relying on cheap (or free) third-party services, and are totally stupefied when they disappear for a few hours. The number of hidden dependencies in the stuff that runs everyday life is horrifying.

Want a longer version of this rant? You’ll find it in the Upcoming Internet Challenges webinar.


  1. Advertising unexpected transit routes through customer networks. ↩︎

  2. Advertising IP prefixes belonging to third parties as originating within your autonomous system. ↩︎

  3. I got a link to this report from a tweet by Andree Toonk. ↩︎

3 comments:

  1. It would probably be beneficial to mention the RIPE Database, the recent RDAP protocol, and RPSL for expressing routing policies. I remember back in the 90s using RIPE routing information to generate proper BGP configurations. Every ISP could use those... The solution is well known; a lot of people just ignore it, since it requires some resources and has some costs. It is easier to sell a really best-effort service. The final decision is in the customer's hands: are you happy with a best-effort, unreliable service, or do you pay for better quality? A single Internet cannot fulfill all the diverse requirements...

  2. It is as you said, Ivan. Complexity has gotten so far out of hand these days that breakage can happen more often, or more catastrophically, or both. BGP, for example, has become a Swiss Army knife used to solve all problems under the sun, resulting in a very large code base, and a large code base means more instability and more chance for bugs. So is DNS, so is everything really. Putting things in the Cloud doesn't simplify it either, as anyone who has experience with Office 365 can attest; random problems keep coming up all the time.

    A big cause of the problem, just as Randy Bush, one of the true wise men of networking, pointed out -- thanks for sharing that one, I hadn't read that short masterpiece of his before -- is that the IETF, the main body of Internet standards, is a vested-interest group made up of vendor representatives whose goal is to sell ever more equipment, not to build a highly scalable structure. Having all the deciding power in the hands of one group is super dangerous, as history has shown, and having it in the hands of a group whose agenda is misaligned with that of users/operators, even more so.

    Look at some of what they have given us over the years: they didn't include a session protocol, gave us an addressing scheme missing half the structure, then claimed an address-exhaustion problem that was the direct result of this, and tried to impose a pathetic solution, IPv6, which itself brings a lot of problems because it was a workaround for a structural defect. Had they focused on fixing the structural issue, it would have resulted in a lot less complexity, and address exhaustion would be a non-issue. But hey, doing that wouldn't result in more box sales, would it? IPv6 is one of the biggest lies, and a result of dictatorship: one special-interest group with too much power trying to impose decisions on people against their will. It solves nothing and worsens a lot of things. A very pressing issue is the Internet routing table explosion, and with things like multihoming and the use of Anycast -- which is not a type of address but, due to its topology-independent characteristic, a kind of name and a special case of Multicast -- this keeps getting worse. IPv6 solves none of this, yet it's being forced down our throats because TINA.

    LISP is also a big joke, a clear case of misunderstanding the fundamentals, and an atrocious example of RFC 1925 rule 6. That's why the whole thing is unscalable and has been going nowhere for 15 years. If this crap is forced on us the same way IPv6 is, what we can expect is a complexity explosion and degraded performance. Have a quick look at what early adopters had to say about LISP:

    https://www.researchgate.net/publication/224178013_Routing_Scalability_An_Operator's_View

    If they had done things right with DNS and addressing, there would be no need for many of the kludges, including EVPN, VXLAN, and IP Anycast, at all. DNS can take care of what Anycast is doing, and in fact Akamai does not use Anycast but hierarchical DNS. In a word, Anycast is solved at the application layer, not the network layer. Ethernet could also be thrown away, as its functionality is already taken care of by IP, so we could remove one redundant layer. The network would be much simplified; there would be less circular dependency, fewer failures, more stability. But hey, we need to keep the standards groups going, so more complexity please.

    That's why when this kind of structurally-unsound architecture which is the Internet, keeps getting bigger and bigger, the effect of the faults will be amplified aka sensive depdendence on initial conditions, and we see more large-scale breakdowns.

    DNSSEC, from Geoff Huston's description, looks to be in a pretty bad spot thanks to security fear-mongering leading to the use of excessively long crypto keys. Look how much complexity it adds on top of a protocol that is otherwise structurally sound, because some security idiots worry about hypothetical scenarios that happen once in a blue moon. But the last part of his article is what made me laugh. What, fear of quantum computing breaking crypto? Sorry to burst the bubble, but until quantum error correction is feasible -- big hint: it gets exponentially harder with the number of qubits, and they can't do it for any small number of qubits now because QM is not an understood theory -- there'll be no QC except for the toys put out now and then as a grant-winning trick for physics schools.

    Worry about fixing the real thing first, before worrying about an invasion of science-fiction gadgets. This lack of basic knowledge drives the fear-mongering, misleads the uninitiated (in this case the typical IT professionals, who generally lack engineering and scientific knowledge) into misplacing their trust in authorities, and generates a lot of unneeded complexity, benefiting almost no one but the vendors, who always have more boxes to sell or more services to offer.

  3. I think it's worth mentioning the continued centralization of services on the internet. Outages of smaller, distributed and non-inter-dependent services tend to go unnoticed. When so many websites have all of their dependencies located in AWS/Facebook/Google/other, everyone everywhere notices when the lights go out for a few hours. As you said in this very article, why would you run your own e-mail service these days (unless you are an e-mail SME)? The answer is that we trade complexity for convenience. Yes, you can get your service to market faster than ever, but you are at the mercy of your dependencies when things go awry.
