Going All Virtual with Virtual WAN Edge Routers

If you’re building a greenfield private cloud, you SHOULD consider using virtual network services appliances (firewalls, load balancers, IPS/IDS systems), removing the need for additional hard-to-scale hardware devices. But can we go a step further? Can we replace all networking hardware with x86 servers and virtual appliances?

Of course we can’t. Server-based L2/L3 switching is still way too expensive; pizza-box-sized ToR switches are the way to go in small and medium private clouds (I don’t think you’ll find many private clouds that need more than the 2 Tbps of bandwidth two 10GE ToR switches from almost any vendor give you) … but what about WAN edge routers?

If your data center uses 1Gbps uplinks and you’re a Cisco shop, I can’t see a good reason not to consider the Cloud Services Router (CSR 1000V). You can buy a 1Gbps license with the latest software version, and I’m positive you’ll get 1Gbps out of it unless you have heavy encryption needs.

Is that not enough? You might have to wait for the upcoming Vyatta 5600 vRouter that uses Intel DPDK and supposedly squeezes 10Gbps out of a single Xeon core.

Connecting to the Outside World

Most servers have a spare 1Gb port or two. Plug the Internet uplinks into those ports and connect the uplink NIC straight to the router VM using hypervisor bypass (PCI passthrough).
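
If you want a concrete picture of what hypervisor bypass looks like, here’s a minimal sketch using the libvirt Python bindings on a KVM host (my assumption; the same idea applies to other hypervisors). The domain name edge-router and the PCI address 0000:03:00.0 are placeholders for your router VM and the spare uplink NIC, and the host needs IOMMU (Intel VT-d or AMD-Vi) enabled for PCI passthrough to work.

  # Minimal sketch: hand a physical uplink NIC to the router VM via PCI
  # passthrough, using the libvirt Python bindings on a KVM host.
  # "edge-router" and the PCI address 0000:03:00.0 are placeholders.
  import libvirt

  HOSTDEV_XML = """
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </source>
  </hostdev>
  """

  conn = libvirt.open("qemu:///system")   # connect to the local hypervisor
  dom = conn.lookupByName("edge-router")  # the CSR/Vyatta router VM
  # Attach the NIC to the running VM and to its persistent definition,
  # so the passthrough survives a guest reboot.
  dom.attachDeviceFlags(
      HOSTDEV_XML,
      libvirt.VIR_DOMAIN_AFFECT_LIVE | libvirt.VIR_DOMAIN_AFFECT_CONFIG,
  )
  conn.close()

Once the NIC is attached this way, the Internet-facing traffic never touches the hypervisor’s virtual switch; the router VM talks to the physical port directly.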

I know it’s a psychologically scary idea, but is there a technical reason why this approach wouldn’t be as secure as a dedicated hardware router?

Why Would You Do It?

There are a few reasons to go down the all-virtual path:

  • Reduced sparing/maintenance requirements – you only need spares and hardware maintenance for ToR switches and servers, not for dedicated hardware appliances or routers;
  • Increased flexibility – you can deploy the virtual network appliances or routers on any server. It’s also easier to replace a failed server (you probably have a spare server already running, don’t you?) than it is to replace a failed router … and there’s almost no racking-and-stacking if a blade server fails;
  • If you believe in distributed storage solutions (Nutanix or VMware VSAN), you need only two hardware components in your data center: servers with local storage and ToR switches. How cool is that?

I’m positive you’ll find a few other reasons. Share them in the comments.

Need More Information?

Check out my cloud infrastructure resources and register for the Designing Private Cloud Infrastructure webinar.

I can also help you design a similar solution through one or more virtual meetings or an on-site workshop.

16 comments:

  1. For WAN edge, and in fact for any routers short of those needing a full table (or, to be fair, MPLS, but that's a software problem), why not just use those same L3 switches? In many cases these days the large routers are almost the same silicon, just with external lookup RAM and sometimes buffer RAM.
    Zero watts is a lot cheaper than any Intel server.
    Replies
    1. I'd imagine there's an assumption that you'd have the servers there already to run/serve applications.
    2. If those same L3 switches support all the WAN edge functionality you need, you're absolutely right.
  2. This would mostly fall under the psychological category, but I prefer to keep my network eggs out of the server administrator's basket. It's not that this can't work technically, but I've seen storage and configuration issues take down virtual clusters more than once.
    So far, it has been better for me to rely on my network team to maintain network hardware than to transfer responsibility to another team.
    Replies
    1. I did mention in one of the previous posts (follow the links) that you SHOULD use a dedicated network services cluster.
    2. I'm sure they could say the same about the network. I'd jump at the chance to learn some of their skills and pass on some of mine. Think like that for too much longer and you'll be limiting your career.

      Steven Iveson
    3. Ivan, why use a dedicated cluster? As long as the guest has sufficient resources, it shouldn't really matter which cluster it runs on.
    4. You _might_ need a dedicated cluster for layer 8-10 reasons (see above).

      You _should_ consider a dedicated cluster (if the size of the cloud warrants it) because there's a significant difference between network services workloads and traditional server workloads, potentially requiring a different virtualization ratio or server configuration.
    5. DevOps has broken down sysadmin silos in at least some organizations and I think the same will happen on the NetOps side. Service agility and automation is going to override the traditional Network/Server/etc. roles especially if the network functions become virtualized and commoditized. It's all about coming up with proper templates and methods defined by the networking guys.

      Now whether it can be done on x86 hardware or not really just comes down to bandwidth needs.

      There are some interesting devices out there now, like the Pluribus server/switch, which kind of blurs the lines by integrating a wire-speed backplane/switch with actual server hardware and making it open and extensible. It could be a great NFV platform.
  3. I know the answer, but anyway: why isn't this what everyone is doing? The benefits are manifold, the drawbacks almost non-existent. Throw in a virtualised firewall, load balancer, etc., and the savings and simplification are enormous.

    Steven Iveson
    Replies
    1. "My stuff!" Unfortunately, too many still hold that their niche or technology is what is important. Rather, the delivery of services in the most streamlined, secure, and available method is the key.

      If we could have an honest discussion (without the turf wars), our data centers would shrink dramatically, along with the operational overhead.
  4. Putting all eggs into one basket would be unwise in this case.
    The chance of a large-scale impact from a common-mode failure is too big a risk for many organisations that rely on the network as a critical service.

    Imagine the virtualised edge router or edge firewall encountering a hypervisor bug that would not just bring down the virtualised servers but bring the network to its knees.
    I can think of a few such bugs: VMware's e1000 high-load crash bug, VMXNet3's inability to initiate PPTP tunnels, and the Win2k12R2 purple screen of death (which doesn't mean it won't happen to other network OSes), to name just a few.
  5. "If everything is getting virtualized, why do I have to put in more physical hours each day to keep up with the complexity created by it." - Old jungle saying.
    Replies
    1. Because you're doing it wrong.
  6. Most of what I'm seeing in the field when I propose this idea is outright resistance; most of the reasons are related to exposing the hypervisor directly to the Internet and receiving a malformed packet. The scenario mentioned in the article, where one uses the PCI-passthrough feature to give a VM direct access to a NIC, should provide more security. The downside is that something like the CSR may not have drivers for the physical NIC.

    One thing I haven't considered is the security consequence of the PCI-passthrough method. Does the hypervisor still have some kind of wedge in there? We should definitely talk about it!
    Replies
    1. I have heard that L3 switches and software/VM-based routers do not have queueing on their interfaces as robust as a dedicated hardware router's. FWIW, this information came from a vendor of those dedicated hardware routers, so there may be some sales hype in there. But they said that if you have an elaborate QoS policy, some policy-based routing, some intricate ACLs, and other features, you will not be happy with the VM or the software solution. We haven't had an opportunity to test this statement.