Why Do We Need IBGP Full Mesh?
Here’s another question from the excellent list posted by Daniel Dib on Twitter:
BGP Split Horizon rule says “Don’t advertise IBGP-learned routes to another IBGP peer.” The purpose is to avoid loops because it’s assumed that all of IBGP peers will be on full mesh connectivity. What is the reason the BGP protocol designers made this assumption?
Time for another history lesson. BGP was designed in the late 1980s (RFC 1105 was published in 1989) as a replacement for the original Exterior Gateway Protocol (EGP). In those days, the original hub-and-spoke Internet topology with the NSFNET core was gradually being replaced with a mesh of interconnections, and EGP couldn’t cope with that.
At that time, Yakov Rekhter and Kirk Lougheed designed the famous three-napkins protocol that became BGP version 1 (RFC 1105). Its goal was to replace EGP, and all it needed to get that job done was to advertise networks with AS paths. There were no other attributes in BGP version 1 (and thus absolutely no way to detect intra-AS routing loops).
BGP was designed in the days when networking engineers still focused on solving point challenges instead of trying to boil the ocean, and it was clearly understood that while routing within an autonomous system was needed, it was not what BGP was trying to do. From RFC 1105:
If a particular AS has more than one BGP gateway, then all these gateways should have a consistent view of routing. A consistent view of the interior routes of the autonomous system is provided by the intra-AS routing protocol. A consistent view of the routes exterior to the AS may be provided in a variety of ways.
One of the ways to provide a consistent view of exterior routes was to redistribute exterior routes into an IGP. Early BGP implementations required BGP prefixes to be present in an IGP, or they wouldn’t advertise them to other autonomous systems – the famous bgp synchronization nerd knob that persisted in Cisco IOS (and CCIE lab exams) for decades.
Even more, when redistributing a BGP route into OSPF, the redistributing router copies the first AS in the AS path into the OSPF tag together with a few extra bits that help the egress router recreate a short AS path from the OSPF tag when redistributing OSPF back into BGP [1]. In those days, if you were a regional ISP, you didn’t need IBGP at all.
The designers of BGP version 1 understood that some networks might prefer to exchange BGP information directly between AS-edge routers while still redistributing BGP into IGP to get external routes into forwarding tables of intermediate devices. They could have made the protocol more complex (and harder to implement) but decided to go down the long-forgotten path of “we’ll keep protocols simple and assume that the engineers using them know what they’re doing.” Quoting RFC 1105 again:
One way is to use the BGP protocol to exchange routing information between the BGP gateways within a single AS. In this case, in order to maintain consistent routing information, these gateways MUST have direct BGP sessions with each other (the BGP sessions should form a complete graph).
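To see why the sessions have to form a complete graph, here is a minimal Python sketch of the "don't advertise IBGP-learned routes to another IBGP peer" rule. It is a toy model, not how any BGP implementation is written: the function names `ibgp_propagate` and `full_mesh`, the router names, and the assumption that exactly one router learns the route over EBGP are all made up for illustration.

```python
from itertools import combinations


def full_mesh(routers):
    """Return one IBGP session per pair of routers (a complete graph)."""
    return list(combinations(routers, 2))


def ibgp_propagate(sessions, origin):
    """Return the set of routers that learn a route `origin` received over EBGP.

    `sessions` is a list of (a, b) IBGP session pairs. The only rule modelled:
    a router never passes an IBGP-learned route to another IBGP peer.
    """
    peers = {}
    for a, b in sessions:
        peers.setdefault(a, set()).add(b)
        peers.setdefault(b, set()).add(a)

    # The origin learned the route from outside the AS, so it may
    # advertise it to every one of its IBGP peers.
    learned = {origin} | peers.get(origin, set())

    # Those peers learned the route over IBGP and must not re-advertise
    # it to other IBGP peers, so propagation stops after one hop.
    return learned


routers = ["A", "B", "C", "D"]
chain = [("A", "B"), ("B", "C"), ("C", "D")]

print(sorted(ibgp_propagate(full_mesh(routers), "A")))  # ['A', 'B', 'C', 'D']
print(sorted(ibgp_propagate(chain, "A")))               # ['A', 'B']
```

With a full mesh, every router hears the route directly from the router that learned it over EBGP. With a chain of sessions, C and D never hear about it, which is the inconsistent view of exterior routes the RFC text warns about.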
BGP didn’t remain a simple protocol solving a simple problem for long:
- Version 2 (RFC 1163, June 1990) added (well-known and optional) path attributes and defined ORIGIN, AS_PATH, NEXT_HOP, and INTER-AS (now known as MED) attributes.
- Version 3 (RFC 1267, October 1991) added hold timers, notifications, and a formal finite-state machine.
- Version 4 (RFC 1771, March 1995) added local preference, CIDR support, and route aggregation.
At approximately the same time, IBGP full mesh became a scalability bottleneck in large service provider networks, resulting in route reflectors (RFC 1966, June 1996) and confederations (RFC 1965, June 1996). I still remember getting a new Cisco software release, looking at BGP route reflectors, saying “now, that’s a cool new thing,” and rushing to burn the software into EPROMs to test it [2].
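A back-of-the-envelope Python sketch shows why the full mesh stops scaling: the session count grows as n(n-1)/2. The comparison column assumes the simplest possible route-reflector design (a single reflector with every other router as its client, no redundancy), which is an illustrative simplification rather than a recommended deployment. The function names are hypothetical.

```python
def full_mesh_sessions(n):
    """IBGP sessions needed when every pair of n routers peers directly."""
    return n * (n - 1) // 2


def single_reflector_sessions(n):
    """Sessions with one route reflector and n - 1 clients (assumed design)."""
    return n - 1


for n in (5, 10, 50, 100):
    print(n, full_mesh_sessions(n), single_reflector_sessions(n))
# 5 10 4
# 10 45 9
# 50 1225 49
# 100 4950 99
```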
[1] See RFC 1403 and OSPF-to-BGP redistribution for details.
[2] That last bit is a “we were living in a shoebox” fairy tale. BGP route reflectors were implemented in software release 11.1, and most Cisco routers had programmable Flash EPROMs at that time. You could also boot new software images via TFTP if you had enough memory in your router – IIRC, the early Cisco 2500-series routers shipped with 2 MB of RAM.
Great post, thank you! But do you know that because you lived through it, or did you look it up somewhere?
I lived through the BGPv3 ➜ BGPv4 migration, fortunately never had to touch EGP, and figured out (again?) why the '?' at the end of an AS-path is called 'incomplete' when writing this blog post.
It was easy to figure out the thinking behind BGPv1 though. In those days RFCs explained why things were done a certain way.
> In those days RFCs explained why things were done a certain way.
An old wizard once told me: "We are not in the business of educating our competition".
> An old wizard once told me: "We are not in the business of educating our competition".
IETF is supposed to be a community and RFCs are supposed to be community documents, but of course that never stopped anyone...
It's not that simple.
Example. It seems you praise the first BGP RFC. As you wrote, that RFC was written by Yakov. I remember reading the BGP-4 RFC when I started working at a router vendor. Also written by Yakov. The BGP-4 RFC was mostly unreadable, imho. Years later, I worked with Yakov. I asked him "why is the BGP-4 RFC written in that style? I found it almost unreadable. If you know how BGP-4 works, the text might make sense. But if you are new, it doesn't teach (or explain) you anything".
That is when Yakov said: "We are not in the business of educating our competition".
That answer made me look at the art of writing RFCs in a completely different way. Earlier on, I believed that RFCs should be texts that explain how and why protocols do what they do. See RFC 5302. It is an RFC about a single bit. But I tried to include a lot of useful info in it, because I wanted to clarify the preference of different types of routes.
But later I changed my mind. I think an RFC should just describe packet-formats. And maybe a little about behaviour, state-machines, etc. But only the minimum to allow people to write interoperable implementations. That should be enough. It also might allow people to improve on a protocol later, without breaking backwards compatibility.
Some companies put a lot of effort into moving technology forward. You might think they do that for evil purposes. But there are other companies that do not wish to invest any time or money into improving standards. They just take the work (RFCs, protocols, ideas) from others, implement it cheaply, and try to sell a product that is cheaper than the competition's. Personally, I do not wish to make their work easier. The next time I get the opportunity to write a draft or an RFC, the text will be as concise as possible.
@Henk Smit: and that's why we can't have any good stuff :((
Every time something good emerges, the bottom feeders destroy it by their greed.
Speaking of true bottom feeders, seems like reposting other people's blog posts to get ad revenue doesn't pay off as much as it did in the past. I'm glad I persisted through those waves of hyena attacks.
;-)
Great explanation. I wonder why it was necessary to get external routes into forwarding tables of intermediate devices back then?
Keep in mind that IP networks perform hop-by-hop destination-only forwarding. How would you do that on intermediate devices without having external routes in their forwarding tables?