BGP- and Car Safety

The Facts and Fiction: BGP Is a Hot Mess blog post generated tons of responses, including a thoughtful tweet from Laura Alonso:

Is your argument that the technology works as designed and any issues with it are a people problem?

A polite question like that deserves more than a 280-character reply, but I tried to do my best:

BGP definitely works even better than designed. Is that good enough? Probably, and we could politely argue about that… but the root cause of most of the problems we see today (and people love to yammer about) is not the protocol or how it was designed but how sloppily it’s used.

Laura somewhat disagreed with my way of handling the issue:

I def agree with that take - how it’s used is the problem. I disagree with the reaction that “hot mess” comment has gotten tho. If a product is causing outages, the customer often doesn’t care a developer pushed bad code, but that it caused them an issue.

… and that’s where I had to disagree. If we accept blatant claims that “X is a hot mess” combined with vague clickbait-style spraying of guilt, we’ll never get anywhere. We have to do better and figure out whether the problem we’re experiencing is caused by (A) the technology, (B) a particular implementation of said technology, (C) how the technology is being used, or (D) how incompetent the users are allowed to be.

I wrote about these aspects in my Some Internet Service Providers Should Really Know Better rant (and I’m still amazed at how some fellow networking engineers tried to defend blatant errors of a Tier-1 ISP), and addressed the “let’s blame some random technology” behavior in Stretched VLANs and Failing Firewall Clusters… but being in a Twitter conversation the best I could do was…

It’s time we stop blaming technology for user stupidity. It’s like blaming cars or roads because an incompetent idiot without a driver’s license crashed into your house.

… to which Laura replied:

Good one. Sticking with the car analogy, are we trying to solve accidents by training the drivers more or are we trying to automate cars and roads? The tech might not be at fault but finding ways to minimize user error will probably be the most efficient solution.

… and that’s the point that had me thinking for at least a week. A quick Google search turned up an infographic claiming road fatalities in Europe decreased by 57% in 16 years. While car technology did improve drastically in that period, we had major safety features like seatbelts and airbags way before 2001… but some of the safety features were not mandatory or were not enforced as rigorously as they are today. Also, public opinion made car safety a high-priority item to consider when buying a new car.

How about BGP? We had tons of safety features in BGP for ages (AS-path filters, maximum prefix limits…) but even though it took us years to document them in a BCP, they are still not used. The Verizon SNAFU could have been stopped by rigorous application of security measures that were built into BGP when I was still teaching BGP courses in Cisco TAC in Brussels (hint: in the late 1990s).
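To make those two safety features a bit more tangible, here is a minimal Python sketch of what an AS-path filter and a maximum-prefix limit do on a customer-facing EBGP session. It is a toy model, not router code: the `CustomerSession` class, its fields, and the decision to drop all routes when the limit trips are assumptions made for illustration, and the AS numbers and prefixes come from the documentation ranges.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerSession:
    """Toy model of an EBGP session toward a customer (hypothetical names)."""
    customer_asns: frozenset          # ASNs allowed to appear in the customer's AS paths
    max_prefixes: int                 # how many prefixes we are willing to accept
    accepted: list = field(default_factory=list)
    is_up: bool = True

    def receive(self, prefix, as_path):
        """Return True if the route is accepted, False if it is filtered
        or the session is (or just went) down."""
        if not self.is_up:
            return False
        # AS-path filter: every AS in the path must belong to the customer.
        # A leaked route carries somebody else's AS number and is dropped.
        if not as_path or any(asn not in self.customer_asns for asn in as_path):
            return False
        # Maximum-prefix limit: shut the session down instead of
        # accepting more routes than the customer should ever send.
        if len(self.accepted) >= self.max_prefixes:
            self.is_up = False
            self.accepted.clear()
            return False
        self.accepted.append((prefix, tuple(as_path)))
        return True


session = CustomerSession(customer_asns=frozenset({64500}), max_prefixes=2)
print(session.receive("192.0.2.0/24", [64500]))            # customer's own prefix
print(session.receive("198.51.100.0/24", [64500, 64501]))  # leaked route via another AS
```

Either check on its own would probably catch most fat-finger route leaks; a real deployment would combine them with prefix filters and whatever else the relevant BCP recommends.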

Then there’s the totally incomprehensible lack of common sense. The Default EBGP Route Propagation Behavior RFC (RFC 8212) was published in 2017, 23 years after the first BGP-4 RFC and at least 20 years after I kept repeating “as a customer you have to take precautions not to become a transit AS” in my BGP course. We have no idea how many fat-finger SNAFUs could have been stopped if only we had this simple idea implemented decades ago.
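The idea behind that RFC (RFC 8212) really is simple: an EBGP session without an explicitly configured policy should neither accept nor advertise any routes. A rough Python sketch of the export side, assuming a hypothetical `advertise_to_ebgp_peer` function and routes represented as plain prefix strings, could look like this:

```python
from typing import Callable, Iterable, List, Optional

Route = str                       # a prefix such as "192.0.2.0/24" (illustrative)
Policy = Callable[[Route], bool]  # returns True if the route may be advertised


def advertise_to_ebgp_peer(routes: Iterable[Route],
                           export_policy: Optional[Policy]) -> List[Route]:
    """Return the routes that would be sent to an EBGP peer.

    With no export policy configured, nothing is advertised, which is
    the secure default the RFC asks for.
    """
    if export_policy is None:
        return []
    return [route for route in routes if export_policy(route)]


# A customer that does not want to become a transit AS advertises
# only its own prefixes, never routes learned from another upstream.
bgp_table = ["192.0.2.0/24", "198.51.100.0/24"]
own_prefixes = {"192.0.2.0/24"}

print(advertise_to_ebgp_peer(bgp_table, None))
print(advertise_to_ebgp_peer(bgp_table, lambda route: route in own_prefixes))
```

With the historical default (advertise everything unless told otherwise), forgetting the policy turns a multihomed customer into a transit AS; with the secure default, forgetting it merely results in no routes being advertised.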

Oh, and there’s one last minor detail: road traffic is somewhat regulated and the rules are occasionally enforced. Also, in most countries you are not expected to drive without a driver’s license, and professional driver’s licenses (= major ISPs) have more stringent requirements. Renesys wrote about reckless driving on the Internet in 2009 (almost exactly a decade ago) but of course nothing changed.

Which brings me to the end of my chat with Laura. She concluded with…

Just to make sure I don’t go off on a tangent my thought is that BGP has a lot of bolts that can help avoid many of the outages we see today and many orgs just don’t use them, so I can’t act outraged when someone complains about it with a simplified view on a podcast.

… and while I can relate to that, I still think (continuing the car analogy) we should stop saying that “cars kill pedestrians”. They don’t; it’s the drivers… and if someone wants to engage in a public blame-and-shame rant, they should get the basic facts right (like CloudFlare did a while ago).

2 comments:

  1. Not sure why I had this kind of memory:
    I played with ATM LANE & MPOA a lot sometime around 2000. It was great to be good at complex & expensive stuff. Sadly, this technology disappeared (?) and cheaper
    technology (which requires less skilled = cheaper staff) won...

    I do not want to say that BGP is bad - it is great because a skilled person can do a lot with this 'Swiss army knife'.

    Cars do not kill people BUT nobody is skilled enough to prevent accidents. Coordinated actions - I believe that autonomous cars won't be autonomous in the sense of road traffic coordination in the future...
  2. Well, I do think Laura has a good point with the car analogy. The huge improvement in car fatality stats is also caused by big changes in the way we plan roads. We build safe highways in more places (and this is important, because only around 8% of fatalities happen on highways, as per EU stats). We design city roads very differently, physically forcing drivers to slow down, introducing more car-free zones etc. A lot of things change in the way we organize road infra. The cars also get more "anti stupid driver" features, but infra changes are more obvious. So that pillar is definitely important.
    The other pillar is regulation - as you mentioned. What is my loss if I misconfigure BGP vs if I drive on red?
    And I would add one more pillar, which I think might be more relevant today than ever. When we talk about road safety, we accept the fact that people will behave like idiots anyway. Not just drivers (although them too), but also, for example, kids who jump onto the road. We admit that this behavior can happen, and so we try hard to protect road users in this case as well. With BGP, I think, the convention has been that we would deal with (at least somewhat) professional people. It is supposed to be part of the profession not to shoot yourself in the leg badly. We basically say: "If the guy is SO bad, then he shouldn't be allowed to touch BGP". Whether this level of expectation was justified or not, I think that it will have to be questioned more than ever NOW, when the leading marketing concept in the industry is: "Everyone can operate the network, because it is self-driven, software-defined, self-healing and super-automated". Vendors try hard to push the idea that networks will babysit themselves, so a cheap inexperienced network operator will be enough to tick the right checkboxes. And with this attitude, should we really expect the number of BGP incidents to go down?