Why Is Cisco Pushing LISP in Enterprise Campus?
I got several questions along the lines of “why is Cisco pushing LISP instead of using EVPN in VXLAN-based Enterprise campus solutions?”
Honestly, I’m wondering that myself (and maybe I’ll get the answer in a few days @ NFD16). However, let’s start at the very beginning…
What Do You Really Need?
It looks like Cisco (and a few other vendors, each one in its own way) still believes in the dire need for large layer-2 domains. I keep wondering why it seems everyone’s so obsessed with large VLANs stretching all across campus. If you have a good use case, please let me know.
Do keep in mind that traffic separation is not a VLAN use case. It only seems easier to solve with VLANs than with VRFs because you don’t appreciate how brittle or convoluted the behind-the-curtain stuff is. In other words, you’re trading explicit complexity (VRFs + associated routing protocols) for hidden complexity (MLAG or TRILL or SPB or VXLAN with EVPN or LISP or…).
I also stopped believing that IP address mobility is the driving force behind large VLANs. I know people using Mobile IP (and it’s even easier with IPv6) on mobile phones, and most phones today can use mobile data and Wi-Fi at the same time anyway. On top of that, wireless access points tend to handle roaming pretty well, and in many cases use their own flavor of IP tunneling.
Long story short: ask yourself whether you really need large VLANs or whether you need a simpler IP network and smart apps (and as I said, do report your findings in the comments).
Back to VXLAN
It looks like the networking industry is in another lemming rush. Everyone is rolling out VXLAN to solve large VLAN challenges, or even replacing MPLS with VXLAN for L3VPN deployments. Every single vendor is rolling out EVPN as the control plane for VXLAN. The current list includes at least Arista, Brocade (aka Extreme), Cisco, Cumulus, and Juniper.
Yet Cisco decided to use a completely different control plane (LISP) in campus networks. I can’t possibly grasp why they’d do that apart from having a solution that has been searching for a problem to solve for years. If you know a really good technical reason why LISP is better in a campus network than EVPN (potentially with conversational learning in case Cisco yet again has hardware challenges) please share it with me.
I’m not the only baffled engineer out there. Here’s what one of my readers wrote:
Cisco DNA is not fulfilling my needs; it is more complex and looks like a marketing solution. Why would I use LISP to do the same thing, given that we are already doing it with EVPN [in the data center]?
He asked around whether he could use Nexus switches in campus to get the functionality he needs, and (not surprisingly) got an answer along the lines of “it might work, but we haven’t tested it”. Or as I told him:
Don't fight the vendor. If your use case is not on their radar, don't try to push it through and make it work (though it makes perfect sense technically) - you'll hit all sorts of bugs because you'll be using untested combinations of features... or you might discover that they don't have features you need in your particular environment.
Fortunately, there's more than one networking vendor out there, and some of them are small enough that they might work with you to get an interesting use case off the ground (I’m looking at you, Cumulus Networks). Just saying ;)
And for wireless there is a CAPWAP tunnel between the AP and the WLC.
The main issues (at least from what I see at my customers) are: first, they are used to working in an L2 environment and are afraid of L3 and routing protocols.
The other issue is firewall clusters; if you have a FW cluster, most of them demand L2.
One example would be a user who moves from "A" to "B" within the campus. "B" is attached to a different distribution layer than "A" -> the user gets a new IP address.
Assuming this user or device has unique firewall rules pointing to the old address, a few applications won't work after the move.
The reactive action is to create a new DHCP reservation for "B" and alter the firewall object. Depending on the IT organization's agility this could take hours or days :)
Proactive solutions could be:
- Change your firewall and/or rule/object design
- Use an identity-based firewall feature, i.e., use names instead of IPs (see the sketch after this list)
- Keep the same IP address throughout the campus.
- Enhance the organizational move process to somehow alter the firewall object before the actual move
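To make the identity-based option concrete, here is a minimal Python sketch of the idea: rules reference user names, and the current user-to-IP binding is resolved at lookup time, so a move to a new address only changes the binding, never the rule base. All names, bindings, and rules are hypothetical.

```python
# Sketch of identity-based firewall matching: rules reference user names,
# and the user-to-IP binding is resolved at lookup time, so a move (new IP
# address) does not require touching any firewall object. Hypothetical data.

identity_bindings = {"alice": "10.20.30.7"}   # updated by 802.1X/DHCP events

rules = [
    {"user": "alice", "dst": "payroll-app", "action": "permit"},
]

def lookup_user(src_ip: str):
    """Reverse-resolve the source IP to a user identity."""
    for user, ip in identity_bindings.items():
        if ip == src_ip:
            return user
    return None

def evaluate(src_ip: str, dst: str) -> str:
    user = lookup_user(src_ip)
    for rule in rules:
        if rule["user"] == user and rule["dst"] == dst:
            return rule["action"]
    return "deny"

# User moves from building A to B and gets a new address; only the binding
# changes, the rule base stays untouched.
identity_bindings["alice"] = "10.40.50.9"
print(evaluate("10.40.50.9", "payroll-app"))  # permit
```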
Those are just my thoughts about a potential, not completely useless, use case.
Why they use LISP instead of EVPN for the control plane ... no clue...
Let's assume a Catalyst 3850 switch, which is a potential Campus Fabric edge device. This device supports 24,000 IPv4 routes according to the data sheet. The "SD-Access" scale is 8k IPv4 routes and 16k IPv4 host entries. So maybe this is the reason to use LISP: to maintain a conversational routing table at the border nodes.
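To illustrate the conversational idea, here is a rough Python sketch of a demand-filled forwarding cache: the edge switch resolves destinations through the mapping system only when traffic needs them, and evicts idle entries once the data-sheet-derived 16k host-entry limit is hit. This is just the general principle, not Cisco's actual implementation.

```python
# Rough sketch of why a pull-based control plane fits a small hardware FIB:
# instead of installing all host routes, the edge switch resolves
# destinations on demand and keeps only active conversations, evicting idle
# entries when the limit is reached. Illustrative only, not Cisco's code.
from collections import OrderedDict

FIB_LIMIT = 16_000  # host-entry limit taken from the data-sheet numbers above

class ConversationalFib:
    def __init__(self, resolver):
        self.cache = OrderedDict()   # host prefix -> locator, in LRU order
        self.resolver = resolver     # e.g. a query to the LISP mapping system

    def forward(self, dst_host: str) -> str:
        if dst_host in self.cache:
            self.cache.move_to_end(dst_host)       # refresh LRU position
        else:
            if len(self.cache) >= FIB_LIMIT:
                self.cache.popitem(last=False)     # evict least recently used
            self.cache[dst_host] = self.resolver(dst_host)  # pull on demand
        return self.cache[dst_host]

fib = ConversationalFib(resolver=lambda host: f"rloc-for-{host}")
print(fib.forward("10.1.1.10"))  # miss: pulled from the mapping system
print(fib.forward("10.1.1.10"))  # hit: served from the local cache
```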
We have long passed the point of diminishing returns in the campus LAN. Speeds and feeds stopped being a reason to upgrade campus LAN switches when we reached gigabit to the desktop and N x 1G (or N x 10G) to the aggregation/core. This is terrible for Cisco's (and other networking vendors') business, since the lifespan of campus gear went from 3 years in the early 2000s to 10-15 years now. Even wireless has hit a point of 'good enough'; if we never got a faster wireless standard than 802.11ac, we could make things work indefinitely by creating smaller, lower-powered cells. Sure, we can find corner use cases that require more speed, but for 95+% of users, we have enough.
So how do networking vendors rectify this? They invent 'compelling' reasons to upgrade that are not based on speeds and feeds. Many of these are dubious, and several are downright harmful to business success. I leave it to the reader to name their favorite unneeded campus technology (mine is/was NAC). These solutions are needed by a limited audience, but that will not stop networking vendors from attempting to force them on all customers. It is our job as network architects/managers/engineers to keep asking the question "Why does my business need this technology?".
Jeremy
The reliance on DNS-like resolution is not bad. It is a proven approach in the telecommunications industry; just look at ENUM/DNS used all over the place in SIP-based voice routing. With the advent of VoLTE, it will be THE solution. LISP just applies something similar to packet routing.
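As a toy illustration of the analogy (with made-up data): ENUM resolves a phone number to a SIP URI, while LISP resolves an endpoint identifier (EID) to a routing locator (RLOC). Both are on-demand directory lookups rather than entries pushed into every node's table.

```python
# Toy illustration of the analogy: ENUM maps a phone number to a SIP URI,
# LISP maps an EID to an RLOC; both are directory lookups. Data is made up.

enum_directory = {"+12125551234": "sip:alice@example.com"}
lisp_mapping_system = {"10.1.1.10/32": "192.0.2.1"}   # EID -> RLOC

def resolve(directory: dict, key: str) -> str:
    return directory.get(key, "unknown")

print(resolve(enum_directory, "+12125551234"))       # sip:alice@example.com
print(resolve(lisp_mapping_system, "10.1.1.10/32"))  # 192.0.2.1
```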
However, not everyone likes the dependence on a centralized resolution service.
We will see if it will succeed or not.
Please be aware that it is not a pure Cisco technology. HPE fully supports it, and Huawei also invests a lot in LISP; for example, Huawei is developing the missing LISP pieces for ONOS. So they would probably like to use it in telecommunications networks, not just in the enterprise... :-)
https://datatracker.ietf.org/wg/lisp/about/
NFV is nothing else than the rediscovery of the SS7 IN or the SIP/IMS/iFC mechanism, just for packets instead of SIP sessions. LISP will be used together with NFV in the future, further extending this long-established idea... There is nothing really new under the sun, just new clothes for the same old thing... :-)
However, Cisco for some reason rejected both TRILL and the IEEE's IS-IS-routed Ethernet (SPB).
Ethernet addresses do not have a locator component, but the IS-IS layer could add one.
The real problem with Ethernet is cut-through switching, since it results in unchecked propagation of frame errors.
There are lots of technologies, and you cannot decide which one will stay with us based on technical merits alone; the business case will decide. There is nothing wrong with SDH; actually, it is much better for a lot of applications, and it has had SDN-style central control for a long time already. But if there are very low production volumes, it becomes so expensive that it dies out...
Remember VHS winning against the others? But now we have H.26x video and real-time streaming... :-)
I think it comes down to the engineering team that takes on the product development. The "go with what you know" attitude is a big problem in all engineering companies. Architects and engineering leaders don't put themselves at risk by looking at what is best; they use what they know will work, regardless of whether it's fit for purpose or passes any sort of commercial, strategic, or architectural governance.
A better question to ask is why Cisco's CTO office does not enforce standards or governance in their R&D projects where there's no competitive advantage to be gained.
They continue to pay lip service to open-source engagement; why is that?
The impact of mobility events in a LISP network (as you know from past reviews published on your blog) is limited to signaling amongst the network elements involved in active connections between the devices. The impact of mobility events in a BGP network, however, is unbounded: even if you have conditional FIB programming, all changes are pushed to all participants. You can try to mitigate this with summarization, but that will have little effect in the case of access networks. This is just one of the many lessons the industry has learnt after years of building overlays with traditional push mechanisms.
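A back-of-the-envelope sketch of that fan-out difference, under deliberately simplified and hypothetical numbers: a push-based control plane advertises a moved host route to every participant, while a pull-based one only has to fix up the nodes holding an active map-cache entry for that host.

```python
# Back-of-the-envelope sketch of the fan-out difference described above.
# A push control plane (BGP-style) sends the moved /32 to every node; a
# pull control plane (LISP-style) only notifies nodes with an active
# map-cache entry for that host. Numbers are illustrative only.

total_edge_nodes = 500        # all fabric edge switches
nodes_talking_to_host = 6     # nodes with active conversations to the host

push_updates = total_edge_nodes        # everyone receives the update
pull_updates = nodes_talking_to_host   # only interested caches are fixed up

print(f"push: {push_updates} updates, pull: {pull_updates} updates")
```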
I happened to be in the process of posting a document (give me a few hours as it propagates through the system so I can give you a URL) that describes a wealth of other functionality that is possible by the simple principle of the demand based control plane and a discussion on why this is best realized with a demand protocol. One thing to remember is that the overlay problem is one of maintaining a directory of locations. This is not necessarily a routing problem. The use of a directory of locations (and other interesting information) allows us to evolve the services that are provided in these networks. This goes well beyond traditional routing services to include policy driven services. That said, even traditional routing services such as multicast and route leaking are improved. For instance, if you have ever set up (and you probably have) multicast across multiple VRFs in an extranet route-leaking arrangement, you would appreciate a solution that can simplify the machinery involved. There is also the fact that LISP can do this without creating any additional state (vs. copying all routes across all VRFs in the traditional solutions).
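As a hedged illustration of the "no additional state" point (with entirely hypothetical data structures): in a traditional extranet, shared routes get copied into every consumer VRF's table, whereas a directory-based approach can keep a single mapping entry and apply an extranet policy at lookup time.

```python
# Sketch of extranet route leaking without state copying: one mapping entry
# is kept in the directory, and a policy decides at lookup time which VRFs
# may resolve it, instead of copying the route into every VRF's table.
# Hypothetical data structures, just to show where the state lives.

mapping_system = {("shared-svc", "10.9.9.9/32"): "192.0.2.77"}  # one copy

extranet_policy = {
    "vrf-eng":   {"shared-svc"},   # VRFs allowed to resolve shared-svc EIDs
    "vrf-sales": {"shared-svc"},
}

def resolve(vrf: str, eid: str):
    for (provider_vrf, prefix), rloc in mapping_system.items():
        if prefix == eid and provider_vrf in extranet_policy.get(vrf, set()):
            return rloc
    return None

print(resolve("vrf-eng", "10.9.9.9/32"))  # 192.0.2.77, no per-VRF route copy
```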
Hopefully my comments and the pointer provided are good evidence that this is the product of much thought and indeed an evolution in networking. LISP has come a very long way in the last few years as a product of lessons learnt on numerous successful deployments. I am confident that, upon further review, you will find that the SD-Access implementation of LISP is a much richer solution than what you may have explored before and you’d appreciate how LISP is enabling much needed innovation in this space.
I would also be interested in this document if it's possible for you to share. Do you have any document that discusses reactive control plane vs proactive control plane?
Interested to hear your thoughts on complexity of LISP, cache size, and failure modes etc.
Regarding complexity: it is a different way of looking at things, very similar to DNS from a flow perspective. The configuration and operations are significantly simpler than BGP's, but they are perceived as complex because they don't follow the same principles as a routing protocol; it's a matter of coming to terms with the fact that this isn't a routing problem/task. As for cache size, memory, and CPU requirements, we have done benchmarking that shows a footprint that is about 10% of what BGP requires. The control plane is capacity-planned following guidelines similar to those for capacity planning a DNS server. As for failure modes, this is a broad topic, but there is a recursive reliance on the underlay control plane, which does use traditional routing protocols with all their functionality; at the borders of the fabric there are mechanisms to maintain visibility into remote network health and circumvent indirect failures (something that most/all overlay mechanisms have failed to address to date).
My slightly wider (and shallower) take is that, as Victor mentions above, the inclusion of LISP is a mobility play, not an L2-stretching thing (though I notice Cisco have put that into some of their early education material). The most important part of the SD-Access solution for me is actually not the LISP control plane but the SGT policy stuff. This is the point where you create a separation between host ID (i.e. IP address) and the security, QoS etc. that gets applied to traffic. Based on who you are (or the type of device you are using) you can get access to different stuff, right? Now this isn't new, but have you seen how easy it is to deploy in the DNA Center GUI? That's the big play, because DNA Center will create the VRFs, the VNIs and all the config under the hood to stitch this stuff together and make it feasible. (Yes, a CCIE will still be needed to troubleshoot it when it all goes wrong, but hey, we all need jobs, right?)
The mobility piece is a nod to legacy as much as anything I think - with SGTs you shouldn't need to care what your IP address is in the access network as you are granted access based on who/what you are. As we know though, sometimes keeping the same IP address is important (especially in legacy apps or legacy networks that don't talk SGT) and so being able to move an address around a network without having to reauthenticate becomes important. And LISP gives you a (nearly) standardised and (relatively) well-worn approach to that without resorting to trying to maintain a distributed database of /32s across an arbitrary topology of switches. We all know a campus network is not like a DC. Traffic flows are very different, volumes are different, connectivity requirements too, and so topologies, control planes and policy enforcement are totally different. I like my DCs being separate from my campus, so am I worried about a different control plane? Not so long as I can translate between them or orchestrate. Centralised identity has been progressively more important in an enterprise, so why not centralised forwarding control plane? Back in the day we were always taught they could be a good thing (off-router route/path calculation etc) and if it's good enough for the DC, why not the campus?
OK, yes, I've drunk the Kool-Aid and I'm sold on the Cisco solution - but I work for a Cisco partner so I would be. I have been waiting for the first viable campus access network overlay solution and it looks like it's arrived from an unexpected source! There will doubtless be others. But Cisco's usual approach of taking a few previously unrelated features and bundling them together to create an "architecture" looks like it might just work this time. Now we can stitch overlays (campus, SD-WAN and DC) together to give a proper end-to-end solution and we can concentrate on giving customers a slicker, easy-to-consume, automated network experience while we continue to do battle with CLI under the hood!!
Hi Jeremy - understood, but SGTs are not just about security at the edge. While we can make it very cool (who wouldn't want to quarantine unpatched machines or remove a PC's access if it starts misbehaving?), SGTs can also just be used to deploy traditional firewall policy without resorting to IP addressing. They are also about other policy, such as QoS and PBR, if you want them to be (indeed, they are now called Scalable Group Tags for that reason). And as SGTs can be named, you're now expressing policy by its intent (bingo!)
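For readers who haven't met group-based policy before, a minimal sketch (group names and classifications invented for illustration): endpoints are classified into named groups, and policy becomes a small group-to-group matrix instead of a pile of IP-based rules, so addresses can change without touching policy.

```python
# Minimal sketch of group-based policy: endpoints are classified into named
# groups, and policy is a group-to-group matrix instead of IP-based rules;
# IP addresses can change freely without touching the policy. Hypothetical
# group names and classifications.

endpoint_groups = {"10.1.1.15": "Contractors", "10.2.2.20": "Payroll-Servers"}

policy_matrix = {
    ("Employees",   "Payroll-Servers"): "permit",
    ("Contractors", "Payroll-Servers"): "deny",
}

def enforce(src_ip: str, dst_ip: str) -> str:
    src_grp = endpoint_groups.get(src_ip, "Unknown")
    dst_grp = endpoint_groups.get(dst_ip, "Unknown")
    return policy_matrix.get((src_grp, dst_grp), "deny")  # default deny

print(enforce("10.1.1.15", "10.2.2.20"))  # deny: contractor -> payroll
```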
My experience is that many organizations "require" large stretched VLANs simply because they have not adopted DHCP/DDNS in place of statically assigned addresses for printers and other devices. This design oversight introduces a significant amount of operational overhead, and the volume of adds/moves/changes becomes unsustainable without stretched VLANs.
There are good reasons to have that type of tech in the campus...
Thank you for this. I am a bit of a novice engineer, with only my CCNA, but as I began my CCNP studies, as soon as I noticed LISP I thought... too complex. The adage of "work smarter, not harder" popped into my head. IMO that should only apply to network devices, whether they be virtual or physical; humans should work smarter and harder. Putting our networks through such monotony would make sense if it bettered the endgame with greater speed or decreased latency, but it just seems to be a project Cisco worked on, and now they have decided to force it on us.