What Did You Do to Get Rid of Manual VLAN Provisioning?

I love(d) listening to the Packet Pushers podcast and came to expect the following rant in every SDN-focused episode: “I’m sick and tired of using CLI to manually provision VLANs”. Sure, we’re all in the same boat, but did you ever do something to get rid of that problem?

After all, you don’t need more than a few tens of VLANs in a typical enterprise data center or private cloud (clouds with thousands of tenants are obviously a totally different story) and most vendors have some sort of VMware-focused automatic edge port VLAN provisioning, from on-switch solutions like VM Tracer (Arista) or Automatic Migration of Port Profiles (Brocade) to network management applications (like Junos Space). Are you using them? If not, why not? What’s stopping you?

But let’s assume you’re unfortunate and use switches that have no hypervisor integration tools. Would it be THAT hard to write an application that would read the LLDP or CDP tables on ToR switches (populated by LLDP or CDP updates from the vSphere hosts), build a connectivity table, and allow server/hypervisor administrators to provision their own VLANs (within limits) on server-facing switch ports? I know that an intern could do it in a week (given reasonably complete functional specs), but we never did it, because doing automatic VLAN provisioning simply wasn’t worth the effort.
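
Here's roughly what I had in mind: a minimal sketch using the Python netmiko library, where the switch inventory, credentials, device type, the simplistic parsing of the LLDP neighbor printout and the delegated VLAN range are all illustrative assumptions, not a tested recipe:

    #!/usr/bin/env python
    # Minimal sketch: build a (switch, port) -> hypervisor connectivity table
    # from the LLDP neighbor tables of ToR switches, and let server admins
    # add VLANs (within limits) to the matching server-facing ports.
    # Inventory, credentials, device type and parsing are illustrative only.
    from netmiko import ConnectHandler

    TOR_SWITCHES = ["tor-1.example.com", "tor-2.example.com"]   # hypothetical
    CREDENTIALS = {"username": "automation", "password": "secret"}
    ALLOWED_VLANS = range(2000, 3000)       # the "within limits" part

    def lldp_neighbors(switch):
        """Return (local_port, remote_system) pairs reported by one switch."""
        conn = ConnectHandler(device_type="cisco_nxos", host=switch, **CREDENTIALS)
        try:
            raw = conn.send_command("show lldp neighbors")
        finally:
            conn.disconnect()
        pairs = []
        for line in raw.splitlines():
            fields = line.split()
            # Naive parsing - the real output format differs per vendor/NOS.
            if len(fields) >= 2 and fields[1].startswith("Eth"):
                pairs.append((fields[1], fields[0]))   # (local port, neighbor)
        return pairs

    def build_connectivity_table():
        """Map (switch, port) -> attached hypervisor, as learned via LLDP."""
        return {(sw, port): neighbor
                for sw in TOR_SWITCHES
                for port, neighbor in lldp_neighbors(sw)}

    def provision_vlan(switch, port, vlan_id):
        """Allow one extra VLAN on a server-facing trunk port."""
        if vlan_id not in ALLOWED_VLANS:
            raise ValueError("VLAN %d is outside the delegated range" % vlan_id)
        conn = ConnectHandler(device_type="cisco_nxos", host=switch, **CREDENTIALS)
        try:
            conn.send_config_set([
                "interface %s" % port,
                "switchport trunk allowed vlan add %d" % vlan_id,
            ])
        finally:
            conn.disconnect()

    if __name__ == "__main__":
        for (switch, port), neighbor in sorted(build_connectivity_table().items()):
            print("%-20s %-12s -> %s" % (switch, port, neighbor))

Wrap a simple web form around build_connectivity_table and provision_vlan and you have the self-service tool described above; the real work is obviously in input validation, logging and rollback.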

Assuming we’re truly sick-and-tired of manual VLAN provisioning in enterprise data centers, there must be other reasons we’re not deploying the vendor-offered features or rolling out our own secret sauce. It might have to do with the impact a single piece of networking gear can have.

Let’s assume you manage to mess up a server configuration with Puppet – you lose a server, and hopefully you’re using a cluster or a scale-out application, so the impact is negligible.

If a vSphere host crashes, you lose all the VMs running on it. That could be 50-100 VMs if you’re using a recent high-end server, but if you care about their availability, you have an HA cluster and they get restarted automatically.

Now imagine the vendor-supplied or home-brewed pixie dust badly misconfigures or crashes a ToR switch. In the worst case (the switch hangs but the links to the servers stay up, so nothing fails over), you lose connectivity to tens of physical servers, which could mean a few thousand VMs; in the best case, those same VMs lose half their bandwidth.

Faced with this reality, it’s understandable we’re scared of software automatically configuring our networking infrastructure. Now please help me understand how that’s going to change with third-party SDN applications.

More Details

I’m describing various VM-aware networking solutions in numerous webinars, including Introduction to Virtual Networking, VMware Networking Technical Deep Dive, Cloud Computing Networking and Data Center Fabric Architectures.

23 comments:

  1. Well written, Ivan.

    If we leave too much to automation, who is going to verify that the configuration actually works as expected?

    I know some people who provision all VLANs at once. It's easy to script. The downside is the number of STP instances if you run RPVST+, but with MST it's not an issue.
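
    For what it's worth, a minimal sketch of that "provision everything at once" approach in Python (the VLAN range, naming convention and Cisco-style CLI output are assumptions):

        # Generate a Cisco-style config snippet that pre-provisions an entire
        # VLAN range in one go; paste or push it with whatever tool you trust.
        # VLAN range and naming convention are made up for illustration.
        def vlan_bulk_config(first=100, last=299, name_prefix="TENANT"):
            lines = []
            for vlan_id in range(first, last + 1):
                lines.append("vlan %d" % vlan_id)
                lines.append(" name %s-%d" % (name_prefix, vlan_id))
            return "\n".join(lines)

        if __name__ == "__main__":
            print(vlan_bulk_config())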
  2. The marketing around SDN and the automated data center has cast networks and network engineers in a very bad light. How many IT guys can honestly say they fully understand what SDN is all about?

    The image everybody sees is that with SDN (for example), or any other kind of automatic handling of network resources, we get rid of those old dinosaurs called network engineers.

    Next, everybody wants automatic deployment of everything, but when people run into a problem (in the network, for example) they rush back to the dinosaurs for help, pointing out yet again that the network is the issue. The fact that everybody wants control over network resources without understanding the technical background, well, apparently that's not an issue.

    I see SDN as an innovative technology, but I don't see it as the magic pill that will replace knowledge and experience.

    I don't want to offend anybody, but I've met people working with VMware products who had no idea how the product actually works.
    It's not VMware's fault, don't get me wrong.

    One of them was explaining a VM along the lines of "you click here and then click there"... OK, but what's going on in the background? How does the vSwitch communicate with the physical network, for example? Silence.

    Is this the direction we want to go in? Click here and click there? I understand that we can now do more with less brain usage than 20 years ago, but that's only because there are "dinosaurs" who still bother with reading, learning and using their brains for more than day-to-day activities.

    Don't worry: if things keep going down this path, with nobody understanding what's going on and everybody using terms like SDN to hide real problems, in another 20 years we'll be able to click here and click there to eliminate the last IT "dinosaurs".
  3. Oh, boy. One more reason for me to write the damned blog post. :) Anyway, short summary: the physical network should continue to be configured/managed by the networking team, but instead of provisioning VLANs that provide connectivity to VMs, they should provide "transport" connectivity to vSwitches (running in hosts and ToRs) for their virtualised networking overlays.

    The server/virtualization team would then configure/reconfigure vSwitches via SDN/whatever, while the transport network stays stable and secure. Everybody wins: the networking team doesn't have to deal with high volumes of moves/adds/changes, there are no weird-ass protocols needed to track VMs, and the server guys can do whatever they want without endangering the whole shebang.
    Replies
    1. Very much agree with this. The solution is Network Virtualization.
    2. Not necessarily network virtualization. What you mention is possible today with Q-in-Q. The network team handles the transport layer while the server guys can transport any number of VLANs over it.

      This doesn't (of course) prevent a server guy from borking a vSwitch and complaining to the network guys. THAT'S where SDN comes into play. SDN should allow the network to be provisioned dynamically and automagically at that lower level of the Ethernet transport infrastructure. Ideally, SDN would allow a client to send a tagged frame (with some form of handshake, I would presume) and the SDN faeries would provision the access ports and ensure that any trunk ports connecting to a switch with the same VLAN in use are configured to allow it.

      Of course, both of those still rely on some form of STP, which is a waste. If we're redefining the DC infrastructure, surely we can "flatten" it out a bit.
  4. For JUNOS, you can use Junoscript/NETCONF. There's even a Java toolkit for this: http://www.juniper.net/techpubs/en_US/junos12.3/information-products/pathway-pages/netconf-java-toolkit/netconf-java-toolkit.html and it's available in Perl *somewhere* as well. You can of course write your own implementation in any programming language of your choice.

    Speaking of Puppet: Juniper recently launched Puppet for JUNOS: http://www.juniper.net/techpubs/en_US/junos-puppet0.8/topics/concept/automation-junos-puppet-overview.html . But it requires you to install a UNIX-like daemon on the box, which comes "as-is and without any warranty", so basically nobody sensible will install it (hello, memory leaks!)...
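
    If anyone wants to skip the Java toolkit, here's a minimal NETCONF sketch using the Python ncclient library instead; the hostname, credentials and the EX-style VLAN payload are placeholders, not a tested recipe:

        # Minimal NETCONF sketch (Python ncclient) against a Junos box.
        # Hostname, credentials and the <vlans> payload are placeholders.
        from ncclient import manager

        VLAN_TEMPLATE = """
        <config>
          <configuration>
            <vlans>
              <vlan>
                <name>tenant-{vid}</name>
                <vlan-id>{vid}</vlan-id>
              </vlan>
            </vlans>
          </configuration>
        </config>
        """

        def add_vlan(host, vlan_id, username, password):
            """Add one VLAN to the candidate configuration and commit it."""
            with manager.connect(host=host, port=830, username=username,
                                 password=password, hostkey_verify=False) as conn:
                conn.edit_config(target="candidate",
                                 config=VLAN_TEMPLATE.format(vid=vlan_id))
                conn.commit()

        add_vlan("ex4200.example.com", 2001, "automation", "secret")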
    Replies
    1. Re: Puppet for Junos. Point of clarification - it does not run as a daemon process by default, so each Puppet run is independent in terms of memory usage. Plus, version 1.0 was just released, which lets you tune the memory usage if needed.
  5. User LANs: Use MAB or a full 802.1x solution (ISE works quite well)

    Data Center: on Cisco gear, provisioning and unpruning new VLANs on pruned VMware-facing trunks is quite easy with port-profiles and your configuration management tool of choice. Never touch a port again: just update the VLAN database, the MST region, and the port-profile. Not a hassle at all.
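
    Something along those lines could be as simple as this minimal sketch (NX-OS-flavored, pushed with the Python netmiko library; the switch inventory, credentials, VLAN number and port-profile name are invented, and the MST region update is left out):

        # Sketch: create a tenant VLAN and unprune it on the VMware-facing
        # port-profile on every ToR switch. Inventory, credentials, VLAN and
        # port-profile name are invented; MST region changes not shown.
        from netmiko import ConnectHandler

        SWITCHES = ["dc-tor-01", "dc-tor-02"]          # hypothetical inventory

        def add_tenant_vlan(vlan_id, vlan_name, port_profile="VMWARE-TRUNK"):
            commands = [
                "vlan %d" % vlan_id,
                "name %s" % vlan_name,
                "port-profile type ethernet %s" % port_profile,
                "switchport trunk allowed vlan add %d" % vlan_id,
            ]
            for switch in SWITCHES:
                conn = ConnectHandler(device_type="cisco_nxos", host=switch,
                                      username="automation", password="secret")
                try:
                    conn.send_config_set(commands)
                finally:
                    conn.disconnect()

        add_tenant_vlan(2042, "TENANT-2042")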
  6. As you say, we only have tens of VLANs - maybe up to 200 in a large data center. PuttyCM with stored credentials makes it easy and a non-issue. When a new VLAN is provisioned, I just populate a container with the switches that need to be updated for that VM cluster. It takes 2 to 5 minutes to add the VLANs to ~30 switches. It would probably take longer using an automated process.
  7. In the future data center, and maybe even the campus, VLANs will likely go away. VXLAN is usually talked about just in terms of increased multi-tenancy, but the bottom line is that VXLANs are easier to provision because they are already integrated into cloud management platforms. So if your virtual switch is supported by the management platform (two very popular ones already are), you're good to go. Can we say the same about physical switches? Plus, there's no need to worry about all the intermediary devices, or about fat-fingering anything and bringing down the DC.

    For physical devices in large data centers, Puppet seems to make sense for VLAN configs if lots of devices are constantly being added or removed. For the enterprise I'm not sure yet, but CLI and single-device management isn't the way forward.
  8. I believe one of the reasons configuration operations, and VLAN configuration in particular, are not automated is that there are no checks and balances to verify the configuration operation does the right thing for the hosts that need to talk.

    It's a hard problem for the network alone to solve, as it typically doesn't know which nodes are meant to communicate; we don't know whether changing a VLAN is right or not, so a person does it.

    We need a way to express, at the application level, which nodes are meant to talk, and then an automated way to deterministically verify that the communication is valid before changes are made to device configurations.
  9. If everybody thought like this, we'd still be using SDLC/SNA - hey! That was stable! It worked! Everyone else was doing it! Nobody got fired for buying .......

    The problem, of course, was cost. As soon as the first major bank moved to TCP/IP and IPX routers (and it worked), all the other banks were at a competitive disadvantage. So in the 90s everybody moved from a stable, high-cost network to a less stable, lower-cost network - because it worked most of the time and cost a hell of a lot less. (Let's not forget that it was mostly networking that broke the back of what was an expensive, proprietary and arrogant vendor.)

    Today the network is a bigger problem than it was back then.

    It's highly UNSTABLE (just ask anyone who has to do an IOS or NX-OS bug scrub)
    It has a high capital cost

    And to your point, Ivan, it has a very high OPEX. Whilst network engineers are the greatest guys in IT (by a mile), you guys are much slower to respond than a computer running a program. You guys sleep, you take lunch, you drive cars.

    You can call it SDN or whatever, but what we are talking about is automation. That is the revolution that is coming. And network guys can either understand that and embrace it.

    Or, as Calin intimated, in a few years the Human Resources department will be "mouse-clicking" you out of the building.
  10. Ivan, I'm working on a multi-part blog post response to this...
    Replies
    1. Hopefully it'll be as funny as this http://workflowsherpas.com/2013/04/01/asshole-hipsters :)
  11. For large-scale multi-tenant cloud infrastructures, we simply provision an allocated block of VLANs for customer networks. Your management networks, vMotion networks, etc. take up a handful of VLANs. Then we allocate a large number of VLANs for customer networks and allow those on the trunks to the hosts where those customers could reside (for example, we might allocate VLANs 2000 through 2999 as tenant networks). At that point you still have to provision your VLANs on your vNICs, but you can write an easy script to take care of that.
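
    The "easy script" could be as trivial as this minimal sketch - a few lines of Python that emit PowerCLI one-liners (the VLAN range, vSwitch name and port-group naming convention are assumptions, not anybody's production setup):

        # Sketch: print one PowerCLI command per tenant VLAN that creates a
        # matching port group on every host's vSwitch0. Range, vSwitch name
        # and naming convention are assumptions - adjust before using.
        FIRST_VLAN, LAST_VLAN = 2000, 2999

        for vlan_id in range(FIRST_VLAN, LAST_VLAN + 1):
            print('Get-VMHost | Get-VirtualSwitch -Name "vSwitch0" | '
                  'New-VirtualPortGroup -Name "tenant-%d" -VLanId %d'
                  % (vlan_id, vlan_id))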

    You do end up having to manage VLANs, and when you consider that you might one day need VLANs to span multiple data centers, you have to reserve a set of VLANs for that purpose as well. I know how adamant you are against that, and frankly, I agree that there are very few real needs for it.

    One other thing you have to watch out for in these environments is your VLAN port count (or STP logical interfaces, in Cisco speak). A cloud provider can run up against that number long before hitting the supported VLAN limit. Every time you add a VLAN to a vNIC, it creates a new STP logical interface, and that's a limited resource on the N5Ks, etc...

    Layer 2 sucks.
  12. @John Mc

    You prove the point again. You network engineers are extremely smart. But that is most of the problem: to do your job, you HAVE to be really smart! The incumbent vendor insists that you work all this out for yourself. Your incumbent vendor insists that you write your own scripts. You say it's fairly simple. Maybe it is. But all your CIO sees is OPEX, OPEX, OPEX!

    * OPEX in that it's manual

    * OPEX in that you have to spend lots of time working this out, and then selling it to other members of the team.

    * OPEX in that you are really smart - meaning I have to pay you twice as much as a server admin who can point and click in vCenter, because he DOESN'T HAVE TO UNDERSTAND what is really going on underneath the covers (any more than a developer needs to understand the x86 instruction set)

    Your CIO can't understand why he can have abstraction in everything else in IT - but not in the holy network.

    When some customers start to replace the "you never get fired for buying..." types with the same kind of pioneering engineers who threw out their FEPs, 3270 terminals and Token Ring for a better/faster/cheaper alternative - guess what? Your CIO will start to as well.
  13. The big challenge with VLAN provisioning for many folks, IMO, has been the risk of managing VLANs in the core of an STP-based network. Configuring access ports and access trunks is comparatively very low risk.

    In the last few years we've seen new [non-STP] bridging technologies that don't require VLAN provisioning on core-facing links -- VXLAN, QFabric, FabricPath, SPB, etc. These are all overlay-based bridging technologies. With these solutions, the major reason for slow VLAN deployment goes away. What remains is the lack of a standards-based solution for the network to autonomically attach access ports to VLANs. VDP as a solution seems to have gone nowhere, possibly because of bloat. However, we can expect to see an "MVRP UNI" arrive soon enough, coupled with overlay-based (e.g. VXLAN) core bridging networks.

    In the MVRP UNI approach, a hypervisor sends a VLAN declaration to a ToR when a VM requiring that VLAN shows up. The ToR attaches the port/channel to the required VLAN, and the rest is handled by the overlay protocol. The ToR never propagates or declares a VLAN to its neighbors (including hypervisors). This is an unconventional use of MVRP, but it works fine for the purpose of autonomic VLAN configuration and satisfies the needs of the average enterprise. Linux will have MVRP support (http://comments.gmane.org/gmane.linux.network/244153). Now we just need OpenStack support and the rest will follow.
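
    On the hypervisor side, the declaration could be as simple as bringing up an MVRP-enabled VLAN sub-interface; a minimal sketch (the uplink name and VLAN ID are illustrative, and it assumes a kernel and iproute2 build with MVRP support):

        # Sketch: declare a VLAN towards the ToR by creating an MVRP-enabled
        # VLAN sub-interface on the uplink. Uplink name and VLAN ID are
        # illustrative; requires kernel/iproute2 MVRP support.
        import subprocess

        def declare_vlan(uplink="eth0", vlan_id=200):
            subif = "%s.%d" % (uplink, vlan_id)
            subprocess.check_call([
                "ip", "link", "add", "link", uplink, "name", subif,
                "type", "vlan", "id", str(vlan_id), "mvrp", "on",
            ])
            subprocess.check_call(["ip", "link", "set", subif, "up"])

        declare_vlan()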

    This isn't the glorious approach, but for most companies good enough will do for now, and hopefully some measure of sanity gets restored. There are a number of other benefits for the average enterprise which I'll leave for another day.
    Replies
    1. The "correct" protocol to use between hypervisor and ToR switch is EVB (802.1Qbg), which does exactly what you're describing ... only it's getting nowhere. A few details in these blog posts: http://blog.ioshints.info/search?q=evb&by-date=true
  14. Why stop at VLANs? What about VRFs, ACLs, NAT, firewall and load-balancer contexts, etc.? There's more to a virtual network than just Layer 2.
    Replies
    1. Yes -- but right now all those things are muddying the waters around what most folks need and what could have been achieved by now. For most businesses, revolutionizing their network isn't a goal.
  15. Ivan is right, it is fear. Too often on the networking side I've had tools, even vendor-supplied ones, that only delivered a percentage of success when automating changes. It comes down to a level of trust, and at times "coding" that trust into the tool to guarantee 100% success would take so long that you might as well do the change manually and get it done quicker. Which do you trust more: the chance of the tool messing up, or your own fat fingers?
  16. An automated configuration tool is also a "weapon of mass configuration" pointed at your network. Such a tool can amplify a single typing mistake into a major network outage. Just to make things worse, large network vendors ship configuration tools that push changes without so much as a single confirmation box or "Are you sure?" warning - almost as though the developers have never heard of the concept of change management. These "tools" practically guarantee a bad outcome. Fear is the appropriate emotion when faced with these options.
    Replies
    1. "To make error is human. To propagate error to all server in automatic way is #devops."

      Source: https://twitter.com/DEVOPS_BORAT/status/41587168870797312