Running BGP between Virtual Machine and ToR Switch

One of my readers left this question on the blog post resurfacing the idea of running BGP between servers and ToR switches:

When using BGP on a VM for mobility, what is the best way to establish a peer relationship with a new TOR switch after a live migration? The VM won't inherently know the peer address or the ASN.

As always, the correct answer is it depends.

Supporting Live VM Mobility

If you want to support live (hot) VM mobility across ToR switches, don’t run BGP with the ToR switch. Regardless of how well you fine-tune the setup, it will take at least a few seconds before the BGP session with the new ToR switch is established, making your service inaccessible in the meantime.

As I explained in another blog post (yes, it’s almost exactly three years old), you SHOULD run a BGP session with a route server somewhere in your network, preferably using IBGP to make things simpler.

To add redundancy to the design, peer the VM with two route servers.
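
Here's a minimal FRR sketch of that setup on the VM. Everything in it (AS 65000, route servers at 10.0.254.1 and 10.0.254.2, service address 192.0.2.11) is made up; adjust it to your environment:

  ! frr.conf fragment on the VM: IBGP sessions with two route servers
  router bgp 65000
   ! Two route-server peers for redundancy; the sessions aren't tied to any
   ! particular ToR switch, so they survive a live migration.
   neighbor 10.0.254.1 remote-as 65000
   neighbor 10.0.254.2 remote-as 65000
   !
   address-family ipv4 unicast
    ! Advertise the VM service address (configured on a loopback interface)
    network 192.0.2.11/32
   exit-address-family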

Supporting Physical Servers

If your servers don’t move, but you still don’t want to deal with neighbor IP addresses or AS numbers, use one or more of these tricks:

  • Configure the same loopback address on all ToR switches (I wouldn’t advertise it into the network, and you definitely don’t want it to become the ToR switch router ID);
  • Establish BGP sessions between the physical servers and that loopback address, using either IBGP (so everyone is in the same AS) or local-as on the ToR switch to present the same AS number to all servers (sketched below).
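
Here's a hedged FRR-style sketch of the ToR switch side (every address and AS number in it is made up, and your switch OS syntax will differ):

  ! Same loopback address configured on every ToR switch
  interface lo
   ip address 10.255.255.1/32
  !
  router bgp 65001
   ! Keep a unique router ID; don't let the shared loopback become it
   bgp router-id 10.1.0.1
   ! One peer group for all servers; present AS 65000 to them on every switch
   neighbor SERVERS peer-group
   neighbor SERVERS remote-as 65100
   neighbor SERVERS local-as 65000 no-prepend replace-as
   ! The shared loopback is one routed hop away from the servers
   neighbor SERVERS ebgp-multihop 2
   ! Accept incoming sessions from any server in the rack subnet
   bgp listen range 10.1.1.0/24 peer-group SERVERS

The servers then point their BGP sessions at 10.255.255.1 in AS 65000 regardless of the rack they sit in (they need the matching ebgp-multihop 2 or disable-connected-check because the peer address is not on their subnet).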

Deploying FRR on the servers is obviously a better option than these tricks. For more details, watch the Leaf-and-Spine Fabric Designs webinar.

Supporting Disaster Recovery

Running BGP between the virtual machines and the network simplifies disaster recovery scenarios (and removes the need for crazy kludges like stretched VLANs). If this is your use case:

  • Run a set of route servers in each data center to support live VM mobility within each data center;
  • Use the same IP addresses and AS numbers across route servers in all data centers to enable VMs to connect to the route server in the local data center;
  • Don’t advertise the shared IP addresses between data centers (you don’t want the VMs to connect to a route server in another data center due to a crazy routing glitch); see the filtering sketch below.
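
Here's a hedged FRR-style sketch of that filter as it could look on a data center WAN edge router (the prefix, AS numbers, and peer address are made up):

  ! Shared route-server addresses (10.0.254.0/24 in this example) stay local;
  ! filter them from the updates sent toward the other data center.
  ip prefix-list RS-LOOPBACKS seq 10 permit 10.0.254.0/24 le 32
  !
  route-map TO-OTHER-DC deny 10
   match ip address prefix-list RS-LOOPBACKS
  !
  route-map TO-OTHER-DC permit 20
  !
  router bgp 65000
   neighbor 203.0.113.2 remote-as 65002
   address-family ipv4 unicast
    neighbor 203.0.113.2 route-map TO-OTHER-DC out
   exit-address-family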

Need even more details?

We discussed them in the Leaf-and-Spine Fabric Designs webinar and in the Building Next-Generation Data Center online course.


7 comments:

  1. I was thinking about something similar just recently, and a weird idea came to mind - why not run RIP between the VM and the TOR?

    I know, I know, and before you laugh and spurt coffee out your nose at the thought of such an old crusty IGP in a DC, maybe think about it for a minute.

    RIPv2 supports easy summarization and route filtering for the TOR. With reduced timers (and maybe BFD) convergence could come down to a second or three. ECMP works too. Oh, and it has very few nerd-knobs for the server guys to play with.

    It just seems like a much simpler solution than BGP into the hypervisors & VMs.
    Replies
    1. My coffee cup survived ;)

      I could see two drawbacks of using RIPv2:

      * It has to be configured on every ToR switch (you can't use a route server);
      * Per-neighbor route filtering (if you want to do that) could get interestingly complex.

      If all you want to do is to collect whatever VMs are telling you, then obviously RIPv2 is the tool for the job.
  2. To me, peering with the ToR switch is confusing... The route servers (reflectors) make good sense, but I'd have thought peering with a daemon on the hypervisor and that reflecting it up (in the case of iBGP) would be a better solution? You can have the same IP as a loopback on every hypervisor and not worry.
    Replies
    1. Or even better, don't run BGP on the VM and just redistribute connected interfaces from the hypervisor.
    2. That would be Project Calico (not exactly, they don't run BGP with the hypervisors, but it's an open-source project, so hey...). However, that idea doesn't work with VM mobility.
    3. "redistribute connected interfaces" - doesn't work for clustering services where a VM takes over an IP address from another failed VM.
    4. you can inject the same VIP with different AS_PATH lengths and/or other attributes to facilitate an active/standby model, though.

      Robust primary node election, on the other hand, is a trickier problem, unless there is a global locking service available :)
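
Here's a minimal FRR sketch of the idea from the comment above, as it could look on the standby VM (the VIP 192.0.2.100, AS 65000, and the route-server address are made up). With the IBGP design described in the post, lowering local preference is the safer attribute to tweak; with EBGP you'd typically use set as-path prepend instead:

  ! The standby VM announces the shared VIP with a lower local preference,
  ! so the active VM's announcement wins the best-path selection.
  route-map STANDBY-VIP permit 10
   set local-preference 50
  !
  router bgp 65000
   neighbor 10.0.254.1 remote-as 65000
   address-family ipv4 unicast
    network 192.0.2.100/32
    neighbor 10.0.254.1 route-map STANDBY-VIP out
   exit-address-family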