IBGP Migrations Can Generate Forwarding Loops

A group of researchers presented an “interesting” result at IETF 87: migrating from an IBGP full mesh to IBGP route reflectors can introduce temporary forwarding loops. OMG, really?

Don’t panic, the world is not about to become a Vogon hyperspace bypass. Let’s put their results in perspective.

Disclaimer: IBGP loops weren’t the main focus of the IETF 87 paper (do go through the whole slide deck, it’s interesting), but I hate the big fuss some people make out of corner cases.

Can it really happen? Sure it can. You can always find a pathological case where following best practices (assuming they deserve the name) can lead you into a quagmire. Route reflectors are no exception.

Is the migration from full mesh to route reflectors a relevant use case? You tell me – I always tell my clients to use BGP route reflectors whenever they have more than four BGP routers in an AS ... but I’m also positive there are still some neglected networks out there running IBGP full mesh (more probably partial mesh because they forgot to configure a few sessions) on tens of boxes.
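To put some numbers behind the “more than four BGP routers” rule of thumb, here is a quick back-of-the-envelope sketch (the router counts and the two-reflector design are just illustrative assumptions, not a recommendation):

```python
# Illustrative session-count math: a full iBGP mesh needs a session
# between every pair of routers, while with route reflectors each client
# only peers with the reflectors (plus reflector-to-reflector sessions
# for redundancy).

def full_mesh_sessions(n: int) -> int:
    """Total iBGP sessions in a full mesh of n routers: n*(n-1)/2."""
    return n * (n - 1) // 2

def rr_sessions(n_clients: int, n_reflectors: int) -> int:
    """Sessions with route reflectors: every client peers with every
    reflector, and the reflectors form a full mesh among themselves."""
    return n_clients * n_reflectors + full_mesh_sessions(n_reflectors)

if __name__ == "__main__":
    for n in (4, 10, 50):
        print(f"{n} routers: full mesh = {full_mesh_sessions(n)}, "
              f"2 RRs = {rr_sessions(n - 2, 2)}")
```

At 4 routers the difference is negligible (6 sessions vs. 5); at 50 routers the full mesh needs 1225 sessions while two reflectors get by with fewer than a hundred, which is where session management (not router CPU) becomes the argument.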

Are best practices broken? No. They are just that – a procedure that will cause the least harm (as compared to random ramblings and cut-and-paste of Google-delivered configuration snippets) when executed by people who don’t know what they’re doing.

Or, as John Sonmez put it more politely in his Principles are Timeless, Best Practices are Fads blog post:

If you were to blindly follow any best practice and not apply that best practice in a way that brings out the underlying principle, you would be very unlikely to actually receive any benefit.

Does that make BGP a bad protocol? Contrary to some vocal beliefs, it doesn’t. Every tool (including BGP) can be misused, and a properly focused researcher can generate an NP-hard problem out of every real-life situation. Is a screwdriver a bad tool because I have to spend so much energy when hammering nails with it? Maybe not.

Is there a way around the problem? Sure. Deploy MPLS-based forwarding in your network (aka: MPLS is the answer … what was the question?)
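In case you’re wondering why MPLS sidesteps the problem: the ingress PE does a single BGP lookup and pushes a label toward the BGP next hop; the core routers then switch on labels installed by the IGP/LDP and never consult BGP, so inconsistent BGP state in the core can’t loop the packet. A minimal sketch (the topology, router names, and label values are all invented for illustration):

```python
# Toy label-switched forwarding: core routers (P1, P2) forward purely on
# labels from their LFIB; only the ingress PE ever looks at BGP. Even if
# P1 and P2 held inconsistent BGP state mid-migration, it would not
# matter, because they never do an IP lookup for transit traffic.

# Per-router LFIB: in-label -> (out-label, next router).
# Label 17 means "toward egress PE2"; P2 pops the label (PHP).
LFIB = {
    "P1": {17: (17, "P2")},
    "P2": {17: (None, "PE2")},  # penultimate-hop pop toward PE2
}

def mpls_forward(ingress_label: int, first_hop: str) -> list:
    """Follow the label-switched path from the ingress PE's first hop."""
    path, label, hop = [first_hop], ingress_label, first_hop
    while label is not None:
        label, hop = LFIB[hop][label]
        path.append(hop)
    return path

if __name__ == "__main__":
    # PE1 resolves the BGP route once, pushes label 17, hands off to P1.
    print(mpls_forward(17, "P1"))  # -> ['P1', 'P2', 'PE2']
```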

Lacking a better idea, use a network simulation tool like Cariden to see what will happen to your network before reconfiguring it. More about better ideas in a follow-up blog post ... and if you have one, share it in the comments.

9 comments:

  1. Full paper is available at http://inl.info.ucl.ac.be/publications/improving-network-agility-seamless-bgp-reconfigurations
  2. I've never heard from anyone in the SP industry that this is supposed to be a big problem. Most configuration changes should be done outside business hours anyway. Would be interesting if someone could comment on this.
  3. You're probably right in small networks. However,

    - In critical networks (e.g., financial institutions, SWIFT), **any** traffic disruption is bad.

    - In large networks spanning multiple countries and composed of hundreds (if not more) of BGP routers, a non-trivial BGP reconfiguration can take a long time. For comparison, it took AOL one week to migrate from OSPF to IS-IS: http://meetings.ripe.net/ripe-47/presentations/ripe47-eof-ospf.pdf It is therefore crucial to know that the network will stay consistent in all intermediate states (you don't want a loop in your network lasting for days).

    Also, some *non-neglected* networks still run an iBGP full mesh, just to get better path diversity in order to load-balance BGP traffic, or to feed fast-convergence mechanisms like BGP PIC with backup paths without using fancy BGP extensions like ADD-PATH.
  4. A few other comments related to Ivan's article.

    It is not only about pathological cases. In the presentation, I give results on an *actual* Tier-1 topology (not on crazy academic ones), using best current practices, and we found that numerous forwarding loops can be created. I assume pervasive BGP though, not MPLS.

    You're right that MPLS does guarantee forwarding correctness. It does not, however, guarantee signaling correctness. Your BGP network can still oscillate during the reconfiguration. This is annoying, as you might send your eBGP traffic to different eBGP next hops, potentially connected to different ASes. Your customers will wonder why they see perpetual changes in their paths' performance and where these strange traceroute outputs come from...

    To finish on a funny note, yes, most researchers confronted with a non-trivial BGP problem can probably show that it is NP-hard (-complete). Actually, we went one step further and proved that some BGP problems are Turing-complete by building AND/OR/NOT logic gates, as well as memory and clock circuits using *only* BGP configurations. Check out http://vanbever.eu/pdfs/vanbever_turing_icnp_2013.pdf if you want more details ;-)
  5. I don't think the presentation really presents anything most SP engineers haven't known for a long time. Transient loops will occur when making changes to a hierarchy of BGP peers, causing micro-loops or sub-optimal routing. When people start making liberal use of the "next-hop self" command, it causes even more problems.

    There are large providers today who simply use a full iBGP mesh, even with 50+ routers. The reality is that even though RP CPUs aren't the beefiest in the computing world, handling 50 BGP sessions and the related updates is nothing these days. In a lot of instances route reflection just complicates matters.

    There are solutions as well like using dual-plane topologies where you can effectively shut off a plane for maintenance, make your changes, and then turn it back up.
    Replies
    1. You're right that network engineers definitely know that things can go wrong when they modify anything in their configuration. The presentation was not about shedding light on a new problem, but rather about solving an old one ;-)

      "Micro-loops" can turn into "mega-loops" when they persist over multiple intermediate reconfiguration steps. For instance, I reconfigure (manually, sic) router A and, in doing so, create a loop for a destination D, because A starts sending traffic to B, which is not reconfigured yet. That loop will stay until I reconfigure (at least) router B. But if you don't know that in advance, you may reconfigure router B at the very end, causing a loop that lasts for a large part of the reconfiguration process. Of course, as Ivan correctly pointed out, MPLS removes that problem.

      The BGP reconfiguration framework we describe at the end of the presentation leverages a similar idea of running two BGP control planes, although we do have scalability in mind (i.e., avoiding duplicating all BGP routes and the associated churn).
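The reconfiguration-order problem described in the reply above can be sketched in a few lines (router names, tables, and the destination are invented for illustration): reconfiguring A before B sends traffic A→B→A until B is also migrated.

```python
# Toy forwarding-table walk for one destination D: follow per-router next
# hops and flag a loop when a router repeats. Three snapshots model the
# migration: before, after reconfiguring only A, and after both A and B.

EGRESS = "E"  # egress router toward destination D

def forward(next_hop: dict, start: str) -> list:
    """Follow next hops from `start`; stop at the egress, or when a
    router repeats (i.e., a forwarding loop)."""
    path, seen, hop = [start], {start}, start
    while hop != EGRESS:
        hop = next_hop[hop]
        path.append(hop)
        if hop in seen:
            return path  # last hop already visited: loop
        seen.add(hop)
    return path

def has_loop(next_hop: dict, start: str = "B") -> bool:
    return forward(next_hop, start)[-1] != EGRESS

old = {"A": "E", "B": "A"}  # pre-migration: B -> A -> E, loop-free
mid = {"A": "B", "B": "A"}  # only A reconfigured: A -> B -> A, loop!
new = {"A": "B", "B": "E"}  # both reconfigured: loop-free again

if __name__ == "__main__":
    for name, tbl in (("old", old), ("mid", mid), ("new", new)):
        print(name, forward(tbl, "B"), "loop:", has_loop(tbl))
```

The `mid` state is exactly the transient loop: it persists for however long the operator takes to get around to router B, which is why knowing a safe reconfiguration order in advance matters.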
  6. I have never been a fan of BGP route reflectors. I understand that they are probably beneficial in very large networks, and that is where I see a potential use case for them: for example, in a POP that has a few backbone routers while supplying lots of customer edge routers with full BGP feeds. Each POP then needs a redundant route-reflector setup, which has to be carefully designed because it adds a degree of complexity and potential sources of failure.
    A backbone network consisting of less than maybe a few dozen iBGP peers however, I would normally configure using an iBGP full mesh.
    I believe that the manageability problem is overstated, because most networks do not change the topology of their BGP backbone each day.
    I have wondered how far the iBGP full mesh would scale. I guess nobody knows, because nobody ever really tried. People follow conventional wisdom, which is that you should use route reflectors.
    It does not appear to me that an iBGP session places a lot of burden on a router, and I believe it would scale to hundreds or thousands of peers.
    So if the manageability problem for iBGP sessions could be mitigated or solved, then I believe that even fairly large networks could do without route reflectors.
    Replies
    1. I mean "provider edge routers" in the POP instead of "customer edge", of course...
    2. I have simulated networks using hardware test appliances along with other standard routing gear, and I've been able to run hundreds of sessions in a full mesh. The scale limit on most modern RPs from the main vendors is 2-4K sessions. There are some topologies, such as Seamless MPLS or even BGP in the data center, where you could potentially run quite a few BGP sessions on the edge aggregation nodes; not in a full mesh, but a session is a session, and the RR scenario is more taxing.
