Dynamic Routing with Virtual Appliances

Meeting Brad Hedlund in person was definitely one of the highlights of my Interop 2013 week. We had an awesome conversation and quickly realized how closely aligned our views of VLANs, overlay networks and virtual appliances are.

Not surprisingly, Brad quickly improved my ideas with a radical proposal: running BGP between the virtual and the physical world.

Let’s revisit the application stack I used in the disaster recovery with virtual appliances post. One of the points connecting the virtual application stack with the physical world was the outside IP address of the firewall (or of the load balancer if you’re using a bump-in-the-wire firewall).

Virtual appliance interaction points

Now imagine inserting a router between the firewall and the outside world, allocating a prefix to the application stack (it could be a single /32 IPv4 prefix, a single /64 IPv6 prefix, or something larger), and advertising that prefix from the virtual router to the physical world via BGP.

Virtual appliance running BGP with the adjacent switch

Before you start writing a comment complaining how three virtual appliances in sequence reduce performance and introduce unnecessary network traversals: as most virtual appliances these days use Linux, it isn’t that hard to add a few more daemons to the same VM – the approach used by VMware NSX-T. The three boxes in the picture could be a single VM if you prefer performance optimization over flexibility.

You could easily preconfigure the ToR switches (or core switches, depending on your data center design) with BGP peer templates that allow them to accept BGP connections from a range of directly connected IP addresses. You could then assign outside IP addresses to the virtual routers via DHCP (potentially running on the same ToR switch) and use MD5 authentication to provide some baseline security.
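The "accept BGP connections from a range of directly connected IP addresses" rule can be sketched in a few lines of Python. This is only an illustration of the check a switch would perform, not any vendor's implementation; the function name and the listen ranges (taken from documentation prefixes) are assumptions.

```python
import ipaddress

# Hypothetical listen ranges: subnets from which the ToR switch would accept
# incoming BGP sessions. These are documentation prefixes, not values from
# the post.
LISTEN_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8:0:1::/64"),
]


def accepts_peer(peer_address: str, listen_ranges=LISTEN_RANGES) -> bool:
    """Return True if the peer address falls within one of the listen ranges."""
    peer = ipaddress.ip_address(peer_address)
    # Compare only against ranges of the same address family.
    return any(
        peer.version == net.version and peer in net
        for net in listen_ranges
    )
```

A virtual router that received its outside address from the DHCP pool inside a listen range would be accepted without any per-neighbor configuration; MD5 authentication would still be needed on top of this check.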

An even better solution would be a central BGP route server where you could do some serious authentication and route filtering. Also, you could anycast the same IP address in multiple data centers, making it easier for the edge virtual router to find its BGP neighbor even after the whole application stack has been migrated to a different location.
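As a rough idea of what route filtering on such a route server might look like, the following Python sketch accepts an advertised prefix only if it lies inside the aggregate allocated to the application stacks and is not longer than a configured maximum. The function name, the aggregate and the length limit are all assumptions made for illustration.

```python
import ipaddress


def permits_route(advertised: str, allowed_aggregate: str, max_length: int) -> bool:
    """Accept a prefix only if it sits inside the allowed aggregate
    and its prefix length does not exceed max_length."""
    prefix = ipaddress.ip_network(advertised)
    aggregate = ipaddress.ip_network(allowed_aggregate)
    # subnet_of() raises TypeError on mixed address families, so check first.
    if prefix.version != aggregate.version:
        return False
    return prefix.subnet_of(aggregate) and prefix.prefixlen <= max_length
```

For example, `permits_route("203.0.113.10/32", "203.0.113.0/24", 32)` would let a single-address application prefix through, while a prefix outside the aggregate would be rejected.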

This twist on the original idea makes the virtual application stack totally portable between compatible infrastructures. It doesn’t matter what VLAN the target data center is using or what IP subnet is configured on that VLAN. When you move the application stack, the client-facing router gets an outside address, establishes a BGP session with someone, and starts advertising the public-facing address range of the application.
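One way to picture the portability argument is to separate what travels with the application stack from what the target data center supplies. The Python sketch below is a hypothetical model (all class, field and function names are assumptions, and the addresses are documentation prefixes): the advertised prefix and AS number belong to the stack, while the outside address comes from the local DHCP server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppStack:
    """Parameters that move with the application stack."""
    prefix: str      # public-facing prefix advertised via BGP
    local_as: int    # AS number used by the virtual router


@dataclass(frozen=True)
class SiteAttachment:
    """Parameters supplied by the data center the stack lands in."""
    outside_address: str  # assigned via DHCP on the local outside VLAN
    bgp_neighbor: str     # ToR switch or (anycast) route server address


def session_plan(stack: AppStack, site: SiteAttachment) -> dict:
    """Combine stack-owned and site-owned parameters into one BGP session plan."""
    return {
        "local_address": site.outside_address,
        "neighbor": site.bgp_neighbor,
        "local_as": stack.local_as,
        "advertise": [stack.prefix],
    }
```

Calling `session_plan` with the same `AppStack` and two different `SiteAttachment` values yields different local addresses but an identical advertised prefix, which is the property that makes the stack portable.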

More information

I described the basics of overlay networks in the Cloud Computing Networking and Overlay Virtual Networking webinars.

For vendor-specific information, please watch the VMware NSX Technical Deep Dive and Cisco ACI Deep Dive webinars.

Revision History

  • Removed a few obsolete mentions
  • Added links to webinars created after the original publication date
  • Added an NSX-T reference


  1. Looks a lot like what Microsoft will probably use in their BGP-only DC when their own NVGRE GW w/ built-in BGP on the outside comes out later this year.


  2. Maybe I'm dense, but are you saying that it should use BGP because the data centers that the virtual network might roam through might be in different ASes? Is there any reason that a service provider couldn't use an IGP if the DCs were in the same AS?
    1. Large networks prefer BGP over IGP because it's easier to control/filter.
    2. Thanks! You're awesome.
  3. I think you can use OSPF and stub areas for this kind of setup -
    it's less work to do and we can eliminate the extra router and run
    OSPF right on the virtual firewall - not all vendors support BGP.
    Unless extensive route filtering is required (like in a corporate data center)
    this should work.

    Going to the extreme (or simple), even RIPv2 with tweaked timers and distribution
    lists will work fine - this virtual silo needs a default route to the core
    and the core needs only a route from the virtual firewall about the network
    behind it. So block all RIPv2 updates on the core with distribution lists, allocate
    a /24 for the firewalled silo and the core will receive one /24 block per silo. That's it.
    All vendors support RIP, and not much CPU power is needed in this case.
  4. Forgot to say, you can redistribute all this into BGP or whatever if needed to propagate to the WAN...
  5. I wouldn't have expected Brad to suggest such an approach -- a pleasant surprise. But now that you have BGP into the hypervisor, why stop at basic IP connectivity using BGP when you can get network virtualization as well with IP-VPN and E-VPN address families?

    The idea of running BGP to a vrouter/vswitch on the hypervisor has been what many in the decentralized control-plane camp have been pushing for quite some time and some vendors are actively building in one form or other. Proposals to BGP peer with the ToR dynamically on DHCP offer and other ways have been floated in NVO3 mailing list from early on.
    1. This is not exactly what NVO3 is talking about - this is user-mode BGP run between VM at the edge of the app stack (on the border of physical/VLAN and virtual/overlay worlds) and the physical network.

      The other BGP proposed in NVO3 is hypervisor-mode MP-BGP transporting VPNv4 and EVPN prefixes. I thought MPLS in the hypervisors (or something equivalent) was a good idea a long time ago, got persuaded in less than 5 minutes that it's not by people who actually run large-scale cloud data centers, and haven't changed my mind since.
    2. OK -- I thought that's what you meant until I saw the part with vrouter peering with ToR. At that point I was thinking PaaS DC vs IaaS DC. I probably shouldn't have rushed my reading.

      I agree that using MPLS for the transport tunnel is not ideal for the DC. Implementations of EVPN and VPNv4 in the works for the DC will let the NVE advertise the set of encapsulations that it supports. This allows gradual migration from one encap to another as encaps themselves might evolve. This also makes it much more possible for seamless network virtualization across different NVO3 domains so long as both ends support the address family and have a supported encap in common.

      I'm a proponent of MPLS over GRE where an NVE advertises via MP-BGP locally significant MPLS labels per context or even per NLRI. These labels can be used to identify local tenant context or apply whatever special action the egress NVE wishes to apply against an advertised label. With locally significant labels the context ID is not condemned forever to the one role of creating a flat virtual network. Locally significant context ID also enables very flexible topologies, and even takes the number of virtual networks from 24M to something that is a function of the size of BGP route targets and the number of NVEs. If route target size increases, so does the number of virtual networks, without any change to the data plane. You could choose to use globally significant labels or even split the label space into locally and globally significant parts.

      What did the folk at the large-scale cloud DC tell you that convinced you? I'd like to believe as well, but haven't seen anything that works for everyone versus just for monolithic large cloud DC.

      On a different note, if BGP is not ideal for service location and mobility in the service provider infrastructure, why would we want customers to use it for that purpose in tenant space?
  6. So, I just need an OC12 line card for my virtual server farm...
    1. No, we don't do OC12 any more. It's 10GE or 40GE ;)
    2. To be clear we are on the same page, I was alluding to terminating WAN connectivity in the virtual hypervisor space.

      If you have 10G/40GE WAN links, I am jealous...
    3. We usually have Ethernet handoff to ISPs these days (at least where I live). Interface is usually FE/GE (or sometimes 10GE) with actual delivered speed based on how much you're willing to pay.
  7. Ivan

    This sounds very interesting, but could you expand on the bigger picture? I don't follow the comment about a /32. Was the intent to adv the virtual subnet (subnet behind fw) out to the ISP? How would this work with a /32?
    1. If your application advertises just a single /32, then you can move the whole stack across data centers without changing the outside IP address of the application, but you can't advertise the new location of that IP address to the Internet at large (only within your WAN network).

      If the application has a whole IPv4 /24 (or IPv6 /48) assigned to it, then you can advertise its new location to the Internet at large.

      Finally, you can always use LISP and deploy xTR on the virtual appliance ;)
  8. Ivan

    Could you take a look at what iLand is doing for disaster recovery? They work closely with VMware and it seemed very promising to us. I would be interested in knowing where some of your models differ from theirs. Perhaps you could cover this in your upcoming webinar. Thank you.
    1. iLand seems to be a "VMware provider" (public cloud running on VMware) using SRM for disaster recovery. You can use them instead of deploying a second data center. Nothing new there (although I sure wish I'd have something similar close to some of my customers).
  9. I believe what Ivan is describing is also very well explained in this draft ( http://tools.ietf.org/html/draft-fang-l3vpn-virtual-pe-02 ). Such design associated with Segment Routing (http://tools.ietf.org/html/draft-previdi-filsfils-isis-segment-routing-02) to simplify the core would be a very nice solution.

    But what are the existing virtual PEs? The Cisco CSR 1000V, Quagga, Vyatta... Junos Olive?
    1. I believe vPE is an overkill, but of course the "MPLS is the answer, what was the question?" crowd won't listen.

      We need to make the networks SIMPLER, not more complex. That also means dropping some borderline use cases and focusing on 95% of the problems.

      There are plenty of virtual appliances you can use today. Keep in mind the use case: separating an app stack from the outside world, which implies FW or LB functionality, so don't look at traditional router vendors.
  10. Hi Ivan,

    I don't seem to get the nugget here. If I understand right, the client-facing IP address could either be a /32 or DHCP-generated in that DC. This address is advertised to the external world through a virtual router running BGP. This IP address is unique to an application stack, so there is a one-to-one mapping between the application stack and this IP address. With such a model, is it easy to move the application stack from one DC to another DC? If a /32 is used, it is just a BGP update. If DHCP is used, is it going to be a DNS update? Is this a fair summary of this post?

    BTW, why do I need an external VLAN as depicted in the picture but not explained in the post? Is it for the overlay address space to be masked from the underlay network (BGP session between the virtual router and the BGP router)?

    1. You have to advertise an "internal" prefix (think loopback interface on a router) or the mobility won't work ... and you need external VLAN to link the virtual world with the physical (where you might find most of the clients).