A long while ago I decided to write an article explaining how you could run VMware NSX on ESXi servers with redundant connections to two top-of-rack switches on top of a layer-3-only fabric (a fabric with IP subnets and VLANs limited to a single top-of-rack switch). Turns out that’s Mission Impossible, so I put the article on the back burner and slowly forgot about it.
Well, not exactly. Every now and then my subconsciousness would kick it up and I’d figure out yet-another reason why it’s REALLY hard to do it right. After a while, I decided to try again, and completely rewrote the article. The first part is already online, more details coming (hopefully) soon.
I got this question about the use of AS numbers on data center leaf switches participating in an MLAG cluster:
In the Leaf-and-Spine Fabric Architectures you made the recommendation to have the same AS number on all members of an MLAG cluster and run iBGP between them. In the Autonomous Systems and AS Numbers article you discuss the option of having different AS number per leaf. Which one should I use… and do I still need the EBGP peering between the leaf pair?
As always, there’s a bit of a gap between theory and practice ;), but let’s start with a leaf-and-spine fabric diagram illustrating both concepts:
Whenever I was comparing VMware NSX and Cisco ACI a few years ago (in late 2010s in case you’re reading this in a far-away future), someone would inevitably ask “and how would you connect a bare metal server to a VMware NSX environment?”
While NSX-T has that capability since release 2.5 (more about that in a later blog post), let’s start with the big question: why would you need to?
Got mentioned in this tweet a while ago:
Watching @ApstraInc youtube stream regarding BGP in the DC with @doyleassoc and @jtantsura.Maybe BGP is getting bigger and bigger traction from big enterprise data centers but I still see an IGP being used frequently. I am eager to have @ioshints opinion on that hot subject.
Maybe I’ve missed some breaking news, but assuming I haven’t my opinion on that subject hasn’t changed.
One of the attendees of our Building Next-Generation Data Center online course submitted a picture-perfect solution to scalable layer-2 fabric design challenge:
- VXLAN/EVPN based data center fabric;
- IGP within the fabric;
- EBGP with the WAN edge routers because they’re run by a totally different team and they want to have a policy enforcement point between the two;
- EVPN over IBGP within the fabric;
- EVPN over EBGP between the fabric and WAN edge routers.
The only seemingly weird decision he made: he decided to run the EVPN EBGP session between loopback interfaces of core switches (used as BGP route reflectors) and WAN edge routers.
Stumbled upon an article by Tom Limoncelli. He starts with a programming question (skip that) but then goes into an interesting discussion of what’s really important.
Being focused primarily on networking this is the bit I liked most (another case of Latency Matters):
I once observed a situation where a developer was complaining that an operation was very slow. His solution was to demand a faster machine. The sysadmin who investigated the issue found that the code was downloading millions of data points from a database on another continent. The network between the two hosts was very slow. A faster computer would not improve performance.
The solution, however, was not to build a faster network, either. Instead, we moved the calculation to be closer to the data.
Another interesting question I got from an ipSpace.net subscriber:
Assuming we can simplify the physical network when using overlay virtual network solutions like VMware NSX, do we really need datacenter switches (example: Cisco Nexus instead of Catalyst product line) to implement the underlay?
Let’s recap what we really need to run VMware NSX:
TL&DR: It’s 2020, and VXLAN with EVPN is all the rage. Thank you, you can stop reading.
On a more serious note, I got this questions from an Johannes Spanier after he read my do we need complex data center switches for NSX underlay blog post:
Would you agree that for smaller NSX designs (~100 hypervisors) a much simpler Layer2 based access-distribution design with MLAGs is feasible? One would have two distribution switches and redundant access switches MLAGed together.
I would still prefer VXLAN for a number of reasons:
Every now and then someone tries to justify the “wisdom” of migrating VMs from on-premises data center into a public cloud (without renumbering them) with the idea of “scaling out into the public cloud” aka “cloud bursting”. My usual response: this is another vendor marketing myth that works only in PowerPoint.
To be honest, that statement is too harsh. You can easily scale your application into a public cloud assuming that:
While running the Using VXLAN And EVPN To Build Active-Active Data Centers workshop in early December 2019 I got the usual set of questions about using BGP as the underlay routing protocol in EVPN fabrics, and the various convoluted designs like IBGP-over-EBGP or EBGP-between-loopbacks over directly-connected-EBGP that some vendors love so much.
I got a question along the same lines from one of the readers of my latest EPVN rant who described how convoluted it is to implement the design he’d like to use with the gear he has (I won’t name any vendor because hazardous chemical substances get mentioned when I do).
Dinesh Dutt, a pragmatic IP routing guru, the mastermind behind great concepts like simplified BGP configuration, and one of the best ipSpace.net authors, finally decided to start blogging. His first article: describing the impact of having 256 100GE ports in a single ASIC (Tomahawk 4). Hope you’ll enjoy his musings as much as I did ;)
Got this question from one of ipSpace.net subscribers:
Do we really need those intelligent datacenter switches for underlay now that we have NSX in our datacenter? Now that we have taken a lot of the intelligence out of our underlying network, what must the underlying network really provide?
Reading the marketing white papers the answer would be IP connectivity… but keep in mind that building your infrastructure based on information from vendor white papers usually gives you the results your gullibility deserves.
During a recent workshop I made a comment along the lines “be careful with feature X from vendor Y because it took vendor Z two years to fix all the bugs in a very similar feature”, and someone immediately asked “are you saying it doesn’t work?”
My answer: “I never said that, I just drew inferences from other people’s struggles.”
A Step Back
Networking operating systems are probably some of the most complex pieces of software out there. Distributed systems are hard. Real-time distributed systems are even harder. Real-time distributed systems running on top of eventually-consistent distributed databases are extra fun.
Aldrin wrote a well-thought-out comment to my EVPN Dilemma blog post explaining why he thinks it makes sense to use Juniper’s IBGP (EVPN) over EBGP (underlay) design. The only problem I have is that I forcefully disagree with many of his assumptions.
He started with an in-depth explanation of why EBGP over directly-connected interfaces makes little sense:
Enterprise environments usually implement “mission-critical” applications by pushing high-availability requirements down the stack until they hit networking… and then blame the networking team when the whole house of cards collapses.
Most public cloud providers are not willing to play the same stupid blame-shifting game - they live or die by their reputation, and maintaining a stable service is their highest priority. They will do their best to implement a robust and resilient infrastructure, but will not do anything that could impact its stability or scalability… including the snake oil the virtualization and networking vendors love to sell to their gullible customers. When you deploy your application workloads into a public cloud, you become responsible for the resiliency of your own application, and there’s no magic button that could allow you to push the problems down the stack.