Stupidities of Switch Programming (written in June 2013)
In June 2013 I wrote a rant that got stuck in my Evernote Blog Posts notebook for almost two years. Sadly, not much has changed since I wrote it, so I decided to publish it as-is.
In the meantime, the only vendor that’s working on making generic network deployments simpler seems to be Cumulus Networks (most other vendors went down the path of building proprietary fabrics, be it ACI, DFA, IRF, QFabric, Virtual Chassis or proprietary OpenFlow extensions).
Arista used to be in the same camp (I loved all the nifty little features they were rolling out to make ops simpler), but it seems they lost their mojo after the IPO.
If you have a well-designed network, and manage to push all the complexities onto the network edge (VoIP, iSCSI, virtual overlay networks, virtual appliances ...), all you need from the physical switches is simple IP connectivity, in data center environments usually implemented with a Clos fabric.
It would be ideal if you could just plug the new switches in and they would auto-configure themselves and just work - and Brocade was pretty close to meeting that goal when their VCS fabric was a simple L2 solution.
The problem is that the existing switch configuration mechanisms are not well suited to that goal - and not because of a lack of protocols or technologies, but because networking vendors can't be bothered to be even minimally creative and use the technologies and protocols they have already implemented to their full advantage.
For example, when you plug in a new ToR switch, would it really be that hard to put some ports in uplink mode, listen for LACP PDUs on those ports, and auto-configure port channels when it turns out the other end wants to run a port channel? Also, would it be THAT hard to support unnumbered P2P links over Ethernet, so we could run OSPF without configuring IP addresses and subnets on every uplink interface (BTW, this works automagically with IPv6)?
Junos has supported unnumbered Ethernet interfaces (including OSPF over them) since release 8.2 - thanks to Doug Hanks for pointing that out in the comments!
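A minimal sketch of the LACP part of that idea, assuming a Linux-based switch OS where a provisioning script can open raw sockets on the uplink ports; apply_port_channel() is a hypothetical stand-in for whatever provisioning hook the switch OS actually offers:

import socket

ETH_P_SLOW = 0x8809        # IEEE "slow protocols" EtherType (carries LACP)
LACP_SUBTYPE = 0x01        # first payload byte of an LACPDU

def port_runs_lacp(ifname, timeout=35.0):
    """Return True if an LACPDU arrives on ifname within the timeout.
    LACP partners send PDUs at least every 30 seconds, so ~35 s suffices."""
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_SLOW))
    s.bind((ifname, 0))
    s.settimeout(timeout)
    try:
        frame, _ = s.recvfrom(2048)
    except socket.timeout:
        return False
    finally:
        s.close()
    return len(frame) > 14 and frame[14] == LACP_SUBTYPE   # byte right after the Ethernet header

def apply_port_channel(name, members):
    # Hypothetical provisioning hook -- here it only prints the intent
    print("would bundle %s into %s (LACP active mode)" % (members, name))

def autoprovision_uplinks(uplink_ports):
    """Bundle uplinks whose far end speaks LACP into a port channel.
    Real code would group ports by the LACP partner's system ID instead of
    throwing every LACP-speaking uplink into a single bundle."""
    lacp_ports = [p for p in uplink_ports if port_runs_lacp(p)]
    if lacp_ports:
        apply_port_channel("Port-Channel1", lacp_ports)

if __name__ == "__main__":
    autoprovision_uplinks(["swp49", "swp50", "swp51", "swp52"])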
The list could go on and on - for example, why wouldn't you use LLDP to figure out whether there's another switch from the same vendor at the other end of the link? This might not be a ubiquitous solution, but at least I hope people aren't stupid enough to build multi-vendor Clos fabrics in a single pod or availability zone.
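A rough sketch of that LLDP check, under the same Linux-based-switch assumption: read one LLDP frame (EtherType 0x88CC) off an uplink and look for a vendor string in the System Description TLV ("examplevendor" below is obviously a placeholder):

import socket
import struct

ETH_P_LLDP = 0x88CC
TLV_SYSTEM_DESCR = 6      # standard LLDP TLV type for System Description

def read_lldp_frame(ifname):
    """Block until one LLDP frame shows up on the interface (Linux raw socket)."""
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_LLDP))
    s.bind((ifname, 0))
    frame, _ = s.recvfrom(2048)
    s.close()
    return frame

def parse_lldp_tlvs(frame):
    """Return {tlv_type: value} from an LLDP frame (14-byte Ethernet header skipped)."""
    payload, tlvs, offset = frame[14:], {}, 0
    while offset + 2 <= len(payload):
        header = struct.unpack("!H", payload[offset:offset + 2])[0]
        tlv_type, tlv_len = header >> 9, header & 0x01FF   # 7-bit type, 9-bit length
        if tlv_type == 0:                                  # End-of-LLDPDU
            break
        tlvs[tlv_type] = payload[offset + 2:offset + 2 + tlv_len]
        offset += 2 + tlv_len
    return tlvs

def same_vendor(ifname, vendor=b"examplevendor"):
    """Crude check: does the neighbor's System Description mention our vendor name?"""
    tlvs = parse_lldp_tlvs(read_lldp_frame(ifname))
    return vendor in tlvs.get(TLV_SYSTEM_DESCR, b"").lower()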
It could be really easy to add new ToR switches to an existing network, or to rewire a Clos fabric when needed without changing all the IP addresses and OSPF settings - but alas, switch vendors aren't doing any of that, because it's sexier to promote all sorts of crazy stuff like SDN, APIs and Puppet/Chef on the switches than to build boxes that just work using existing features and protocols.
1) Many switches, even from the late 90s, the 2000s, and the last 5 years, have a ton of features built in (beyond routing/switching) that are never used. Remember CMS on Cisco switches? If you were a small shop, you had a built-in secure NMS right there - a GUI to manage your network and even push policies/configs. You had interface macros/profiles etc. - all these "programmable" extras just sitting there for box-by-box use, or for centralized use if someone spent the time with an NMS to utilize them.
2) Then there was the issue, at some point in the last 20 years, that most shops didn't want that "plug and play" switch. Why? Our old friend STP - or any routing protocol if everything is L3. Customers got burned early on by the PnP switch approach causing some convergence-related outage, so now, out of policy (and possibly security), anything getting connected to the network must still be scrubbed/provisioned rather than "discovered" before being added. Maybe it's just a mindset now, but we should be past that these days, as you pointed out.
Big Switch Networks does some of that, and I know there will be different uses of LLDP - or of any protocol with TLVs that can add a little extra to the sauce - for discovery and auto-configuration purposes.
We - network engineers - are notorious for favoring deliberate complexity, foolishly believing that it gives us job security.
Juniper has supported unnumbered P2P Ethernet links with OSPF for about 10 years now.
Have you seen Junos Fusion? It's based on IEEE 802.1BR and has all of the automation and plug-and-play functionality you want.
L3 is not the universal answer for everything, and properly engineered L2 could perform much better in many situations.
Let's take the example of TRILL - internally, it does IP-like routing of packets via shortest paths, so it performs all the required L3 functionality behind the scenes. At the same time, it can autoconfigure nicknames (2-byte "IP addresses"), so adding a new switch is no problem. And it doesn't need to care about LACP, since it can natively load-balance over multiple equal-cost paths.
As you said - if you push all the complexities onto the network edge, it's totally irrelevant what protocol the core Clos fabric runs, as long as it responsively moves packets over the shortest paths. And the only difference between TRILL and IP is that TRILL uses 2-byte nicknames, while IP uses 4-byte IPv4 or 16-byte IPv6 addresses; the routing protocol (IS-IS) is the same.
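Just to illustrate the nickname part: a toy sketch of how a new RBridge could pick a free 16-bit nickname after reading what the rest of the fabric already advertises in IS-IS (the collision tie-breaking via priorities and System IDs from RFC 6325 is left out):

import random

# 16-bit RBridge nicknames; 0x0000 and the top of the range are reserved
# (roughly 0xFFC0 and above), leaving about 0x0001-0xFFBF usable.
USABLE_NICKNAMES = range(0x0001, 0xFFC0)

def pick_nickname(in_use):
    """Pick a random free nickname, as a new RBridge would after learning
    the nicknames already advertised in the IS-IS link-state database."""
    free = [n for n in USABLE_NICKNAMES if n not in in_use]
    return random.choice(free) if free else None

# Example: a new switch joins a fabric that already uses three nicknames
print(hex(pick_nickname({0x0001, 0x0002, 0x00A5})))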
And for the LLDP example that you mentioned, you can easily mimic that behavior yourself with ZTP in EOS (EOS supports loading a Python script instead of a configuration file during the ZTP process).
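A ZTP bootstrap script along those lines might look roughly like this - the FastCli call, the 'show lldp neighbors' output parsing, and the /mnt/flash/startup-config path are all assumptions about the EOS environment rather than verified specifics, and the generated configuration is deliberately minimal:

#!/usr/bin/env python3
# Sketch of a ZTP bootstrap script: derive per-port config from LLDP neighbors
# instead of a static template. FastCli, the neighbor-table format, and the
# startup-config path are assumptions -- verify against your EOS release.
import subprocess

def lldp_neighbors():
    """Return the raw 'show lldp neighbors' output (assumed CLI wrapper)."""
    out = subprocess.check_output(
        ["FastCli", "-p", "15", "-c", "show lldp neighbors"])
    return out.decode()

def build_config(neighbor_text):
    """Toy logic: bring up every port that has an LLDP neighbor and tag it."""
    lines = ["ip routing", "router ospf 1"]
    for line in neighbor_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("Et"):   # e.g. "Et1 spine1 Ethernet3 120"
            port, neighbor = fields[0], fields[1]
            lines += ["interface " + port,
                      "   description uplink to " + neighbor,
                      "   no switchport"]
            # per-link OSPF/addressing would be generated here as well
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    with open("/mnt/flash/startup-config", "w") as f:   # assumed target path
        f.write(build_config(lldp_neighbors()))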
However, why should I do IP address management on intra-fabric links if I don't have to? Also, keep in mind that you're years ahead of everyone else. Not many people consider their network devices Lego blocks that they'll use in their DIY solution; many enterprise customers balk at the "some assembly required" idea.
Vendors do what customers demand; they have a long list of things to do, and they prioritize based on rational business reasons. It's not a conspiracy.
Programmability and automation additions are all ways to get this done. Take a chill pill.
Ranting is easy.