
Running BGP between Virtual Machine and ToR Switch

One of my readers left this question on the blog post resurfacing the idea of running BGP between servers and ToR switches:

When using BGP on a VM for mobility, what is the best way to establish a peer relationship with a new TOR switch after a live migration? The VM won't inherently know the peer address or the ASN.

As always, the correct answer is it depends.

Supporting live VM mobility

If you want to support live (hot) VM mobility across ToR switches, don’t run BGP with the ToR switch. Regardless of how well you fine-tune the setup, it will take at least a few seconds before the BGP session with the new ToR switch is established, making your service inaccessible in the meantime.

As I explained in another blog post (yes, it’s almost exactly three years old), you SHOULD run a BGP session with a route server somewhere in your network, preferably using IBGP to make things simpler.

To add redundancy to the design, peer the VM with two route servers.
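A minimal sketch of what the VM-side configuration could look like in Quagga/FRR syntax, assuming IBGP sessions to two route servers; all addresses and the AS number are made up for illustration:

```
! bgpd on the VM -- hypothetical addresses and AS number
router bgp 65000
 bgp router-id 192.0.2.10
 ! IBGP sessions with both route servers for redundancy
 neighbor 198.51.100.1 remote-as 65000
 neighbor 198.51.100.2 remote-as 65000
 ! advertise the VM's service address (configured on a loopback)
 network 192.0.2.10/32
```

Because both sessions are IBGP within the same AS, the VM carries the identical configuration no matter which ToR switch it happens to sit behind after a live migration.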

Supporting physical servers

If your servers don’t move, but you still don’t want to deal with neighbor IP addresses or AS numbers, use one or more of these tricks:

  • Configure the same loopback address on all ToR switches (I wouldn’t advertise it into the network, and you definitely don’t want it to become the ToR switch router ID);
  • Establish a BGP session between the physical servers and that loopback address, using either IBGP (so everyone is in the same AS) or local-as on the ToR switch to present the same AS number to all servers.
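A sketch of the ToR-side configuration for the second trick in Quagga/FRR syntax (the addresses and AS numbers are hypothetical, and the local-as options shown here require a reasonably recent Quagga or FRR release):

```
! zebra: identical loopback address on every ToR switch
! (don't advertise it, don't let it become the router ID)
interface lo
 ip address 10.255.255.1/32

! bgpd: the switch's real AS is 65101, but every server
! sees the same shared AS 65000
router bgp 65101
 bgp router-id 10.0.1.1
 neighbor 192.168.1.10 remote-as 64999
 neighbor 192.168.1.10 local-as 65000 no-prepend replace-as
```

With the shared loopback as the session endpoint and local-as hiding the switch's real AS, every server in every rack can use one identical BGP neighbor statement.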

Deploying Cumulus-enhanced Quagga on the servers is obviously a better option. For more details, watch the Leaf-and-Spine Fabric Designs webinar or the video in which Dinesh Dutt explains the enhancements they made to Quagga.

Supporting disaster recovery

Running BGP between the virtual machines and the network simplifies disaster recovery scenarios (and alleviates the need for crazy kludges like stretched VLANs). If this is your use case:

  • Run a set of route servers in each data center to support live VM mobility within each data center;
  • Use the same IP addresses and AS numbers across route servers in all data centers to enable the VM to connect to the route server in the local data center;
  • Don’t advertise the shared IP addresses between data centers (you don’t want the VMs to connect to a route server in another data center due to a crazy routing glitch).
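The last point can be enforced with an outbound filter on the data center edge routers, sketched here in Quagga/FRR syntax (all prefixes, AS numbers and the route-map name are hypothetical):

```
! match the shared route-server loopbacks
ip prefix-list RS-SHARED seq 10 permit 198.51.100.0/24 le 32
!
! drop them on sessions toward the other data center,
! pass everything else
route-map DCI-OUT deny 10
 match ip address prefix-list RS-SHARED
route-map DCI-OUT permit 20
!
router bgp 65000
 neighbor 203.0.113.1 remote-as 65534
 neighbor 203.0.113.1 route-map DCI-OUT out
```

Filtering at the edge guarantees a VM can only ever reach the route-server addresses advertised within its own data center, even if routing goes sideways elsewhere.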

Need even more details?

I’m sure we’ll discuss them in the Building Next-Generation Data Center course. The September 2016 session is sold out, but you’ll get the recordings even if you register for the April 2017 one.

You can also use ExpertExpress to discuss the details of your design with me.

One ExpertExpress session is bundled with the Professional subscription, so you don’t have to ask for budget approval twice.

7 comments:

  1. I was thinking about something similar just recently, and a weird idea came to mind - why not run RIP between the VM and the TOR?

    I know, I know, and before you laugh and spurt coffee out your nose at the thought of such an old crusty IGP in a DC maybe think about it for a minute.

    RIPv2 supports easy summarization and route filtering for the TOR. With reduced timers (and maybe BFD) convergence could come down to a second or three. ECMP works too. Oh, and it has very few nerd-knobs for the server guys to play with.

    It just seems like a much simpler solution than BGP into the hypervisors & VMs.

    Replies
    1. My coffee cup survived ;)

      I could see two drawbacks of using RIPv2:

      * It has to be configured on every ToR switch (you can't use a route server);
      * Per-neighbor route filtering (if you want to do that) could get interestingly complex.

      If all you want to do is to collect whatever VMs are telling you, then obviously RIPv2 is the tool for the job.

  2. To me, peering with the ToR switch is confusing... The route servers (reflectors) make good sense, but I'd have thought peering with a daemon on the hypervisor and that reflecting it up (in the case of iBGP) would be a better solution? You can have the same IP as a loopback on every hypervisor and not worry.

    Replies
    1. Or even better, don't run BGP on the VM and just redistribute connected interfaces from the hypervisor.

    2. That would be Project Calico (not exactly, they don't run BGP with the hypervisors, but it's an open-source project, so hey...). However, that idea doesn't work with VM mobility.

    3. "redistribute connected interfaces" - doesn't work for clustering services where a VM takes over an IP address from another failed VM.

    4. you can inject the same VIP with different AS_PATH lengths and/or other attributes to facilitate active/standby model, though.

      Robust primary node election, on the other hand, is a trickier problem, unless there is a global locking service available :)


You don't have to log in to post a comment, but please do provide your real name/URL. Anonymous comments might get deleted.