(v)Cloud Architects, ever heard of MPLS?
Duncan Epping, the author of the fantastic Yellow Bricks virtualization blog, tweeted a very valid question following my vCDNI scalability (or lack thereof) post: “What you feel would be suitable alternative to vCD-NI as clearly VLANs will not scale well?”
Let’s start with the very basics: modern data center switches support anywhere between 1K and 4K (the theoretical limit based on 802.1q framing) VLANs. If you need more than 1K VLANs, you’ve either totally messed up your network design or you’re a service provider offering multi-tenant services (recently relabeled by your marketing department as IaaS cloud). Service Providers had to cope with multi-tenant networks for decades ... only they haven’t realized those were multi-tenant networks and called them VPNs. Maybe, just maybe, there’s a technology out there that’s been field-proven, known to scale, and works over switched Ethernet.
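Just to put rough numbers behind that “theoretical limit” (a back-of-the-envelope Python snippet, nothing vendor-specific):

```python
# Where the "4K VLANs" ceiling comes from, and why MPLS-based services
# don't inherit it. Pure arithmetic, no vendor specifics.

VLAN_ID_BITS = 12                        # 802.1Q tag carries a 12-bit VLAN ID
usable_vlans = 2**VLAN_ID_BITS - 2       # IDs 0 and 4095 are reserved -> 4094

MPLS_LABEL_BITS = 20                     # the MPLS label field is 20 bits wide
usable_labels = 2**MPLS_LABEL_BITS - 16  # labels 0-15 are reserved, and the
                                         # label space is per-LSR anyway

print(f"802.1Q VLANs per switched domain: {usable_vlans}")
print(f"MPLS labels per LSR: {usable_labels}")
```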
Guess what: hundreds of Service Providers worldwide use something called MPLS to scale their VPN offerings, be it at layer 2 (VPLS) or layer 3 (MPLS/VPN). Huge high-speed networks serving tens or hundreds of thousands of customers have been built on MPLS, and they work well. While I can understand that enterprise network designers feel uncomfortable considering MPLS (although MPLS/VPN can solve numerous enterprise problems, as described in my Enterprise MPLS/VPN deployment webinar), there is no reason why service providers wouldn’t reuse in their data centers the same technology they’ve used in their WAN networks for the last decade.
Now imagine VMware had actually done the right thing and implemented a VPLS-capable PE-router in their vSwitch instead of taking a product targeted at building labs, porting it from userland into kernel space, and slapping a GUI on top of it.
The VPLS approach would not solve the security issues (MPLS backbone has to be trusted), but would solve the scalability ones (no need to bridge in the data center core), breeze by the VLAN numbering barrier, and enable instant inter-DC connectivity (or private-to-public cloud connections) without external bridging kludges when using mGRE or L2TPv3-based pseudowires. The full effect of the scalability improvements would only become visible after deploying P2MP LSPs in the network core, and I’m positive Juniper would be very happy to tell everyone they already have that built into their gear.
Moving further into dreamland, imagine VMware doing the really right thing and implementing an MPLS/VPN PE-router in their vSwitch. Zero bridging in the backbone, seamless L3 mobility, no need for bridging kludges, and layer-3 inter-DC connectivity with MPLS/VPN-over-mGRE. A few moronic solutions relying on layer-2 inter-server connectivity would break, but remember that we’re talking about public infrastructure clouds here, not your typical enterprise data center.
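To make the dreamland scenario a bit more concrete, here’s a toy Python model (emphatically not a real vSwitch or any vendor’s API) of what a per-hypervisor PE would keep track of: tenant instances identified by route distinguishers instead of 12-bit VLAN IDs, with the physical core seeing nothing but labeled or tunneled traffic.

```python
# Toy model only -- names, RD values and methods are made up for illustration.
from dataclasses import dataclass, field

@dataclass
class TenantInstance:
    name: str
    rd: str                       # route distinguisher, e.g. "64500:1001"
    kind: str                     # "vpls" (L2 service) or "l3vpn" (routed service)
    attached_vnics: list = field(default_factory=list)

class HypervisorPE:
    """One PE per hypervisor; the DC core only needs MPLS (or IP, for mGRE)."""
    def __init__(self, loopback_ip):
        self.loopback_ip = loopback_ip
        self.instances = {}       # rd -> TenantInstance

    def add_tenant(self, name, rd, kind):
        self.instances[rd] = TenantInstance(name, rd, kind)

    def attach_vnic(self, rd, vnic):
        self.instances[rd].attached_vnics.append(vnic)

pe = HypervisorPE("192.0.2.11")
pe.add_tenant("tenant-blue", "64500:1001", "vpls")     # the VPLS option above
pe.add_tenant("tenant-green", "64500:2001", "l3vpn")   # the MPLS/VPN option above
pe.attach_vnic("64500:1001", "vm17-eth0")
print(len(pe.instances), "tenant instances, zero VLANs consumed in the core")
```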
Obviously we won’t see either of these options in the near (or far) future. Implementing something that actually works and solves scalability problems with an old technology is never as sexy as inventing a new, rounder wheel that just might never work in practice because it’s too brittle, not to mention that developing PE-router functionality would require a serious amount of work.
More information
Enterprise MPLS/VPN Deployment webinar describes the typical MPLS use cases in enterprise networks, the underlying technology and high-level design and implementation guidelines.
In the Data Center Interconnects webinar I’m describing how you can use MPLS/VPN to provide end-to-end path isolation across multiple data centers without deploying stretched VLANs or inter-DC bridging.
802.1ah
> would solve the scalability ones
Check
> breeze by the VLAN numbering barrier
Check
> enable instant inter-DC connectivity
Check, can do over L2TPv3 or VPLS or EoMPLS
> The full effect of the scalability improvements would only become visible after deploying P2MP LSPs in the network core
Don't need that, as multicast is natively supported
> we won’t see either of these options
ALU does that - that's their flavour of "DC fabric". An S-VLAN per flexibly configurable combination of selectors (L2/L3/L4, if I'm not mistaken), automagically mapped to a hypervisor that needs it, when it needs it, across a single DC or a number of DCs.
Am I missing something? Also, if ALU does all of that (and we know nothing about it), their marketing department deserves a "kind nudge". Could you point me to relevant documents?
Thank you!
Ivan
What is puzzling is the need by vendors to reinvent the wheel for DC.
The vendors I talked to were saying things like "it's too complex for the user" and "the edge devices will not handle it" (I suppose thousands of VLANs are OK :)).
The more I read about those, the less I understand the reasoning behind their introduction.
I thought that even PBB needed to be brought to life (multicast?).
...which is then handled as if it was normal Ethernet multicast, right?
> their marketing department deserves a "kind nudge"
In my experience, they rely on the existing relationships and trade shows, when you can see the product in action and speak with somebody who actually knows what they are talking about.
Here's one link I found which has some marketing fluff on the subject: http://searchnetworking.techtarget.com/news/2240034476/Alcatel-Lucent-debuts-monster-data-center-switch-fabric-architecture
> ...which is then handled as if it was normal Ethernet multicast, right?
Exactly my point. Normal multicast doesn't scale as it's flooded throughout the broadcast domain ... unless, of course, you use something like MRRP.
On the other hand, VPLS (more so with P2MP LSP) scales better, as the multicasts get flooded only to those PE-routers that actually need them (with P2MP LSP on a source-rooted tree).
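A quick toy comparison of that flooding-scope difference (made-up topology and membership, no real protocol machinery):

```python
# Where a single unknown-unicast/multicast frame ends up in a flat L2 domain
# versus a VPLS instance that spans only the PEs with attached sites.
# The topology below is invented purely for illustration.

all_edge_devices = {f"sw{i}" for i in range(1, 21)}   # 20 edge devices in the DC

vpls_membership = {                                   # PEs with sites per instance
    "tenant-blue": {"sw3", "sw7", "sw12"},
    "tenant-green": {"sw1", "sw12"},
}

def flood_scope_flat_l2(src):
    return all_edge_devices - {src}                   # everyone else gets the frame

def flood_scope_vpls(instance, src_pe):
    return vpls_membership[instance] - {src_pe}       # only member PEs get it

print(len(flood_scope_flat_l2("sw3")), "devices flooded in a flat L2 domain")
print(len(flood_scope_vpls("tenant-blue", "sw3")), "PEs flooded in the VPLS instance")
```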
I am all ears,
Ok, I should have called things by their proper names - I should have said "SPBM" instead of "802.1ah". My bad! :(
I think what I wanted to say is that SPBM takes care of pruning/replication by building multicast topology based on SPF trees (http://www.nanog.org/meetings/nanog50/presentations/Sunday/IEEE_8021aqShortest_Path.pdf)
HOWEVER - there's OpenFlow :-P We could still be pleasantly surprised.
SPBM uses MAC-in-MAC encapsulation with I-SIDs to provide a different forwarding mechanism in the SPBM core. However, the design paradigm is still a single broadcast domain (modulo VLAN pruning). Doesn't scale.
Could you please elaborate where do you see the scaling issue? Multicast traffic inside "huge" I-SIDs?
You won't see that in a typical SP/VPLS/802.1ah network because sensible people use routers to connect to the Carrier Ethernet service.
Wouldn't flood scope be contained by the I-SID to which this VM is mapped? I thought that I-SID is a broadcast container.
Good question for a Packet Pushers Podcast we're doing next week. I don't think it is in SPBM.
Also, in 802.1ah, multi/broadcasts are sent to default BA, which is multicast (unless you configure P2P service), so C-multicast becomes B-multicast.
It'd be interesting to see mpls-linux running with, say, VPLS and KVM. I guess I need to kill a couple of weekends to see whether it's possible :-[
The whole premise of Ivan's post is to get rid of physical L2 and do L3 all the way down (because L3 just scales way better than L2, which is probably why the Internet is L3 =-X ). Stopping at the physical network and leaving the virtual network as-is is not really a solution. Or maybe I'm missing something.
Building an RFC 2547 PE isn't the answer for Cloud networks because the goal is not a transport network, but a broadcast domain. Really, what's needed in the Cloud is an emulation of what customers get when they go into a datacenter like Savvis or Equinix - you get a set of racks in a cage, you uplink your firewall/router to the provider's redundant set of links that goes to their access layer switching, and below your router/firewall you create a number of routed networks backed by VLANs, etc. - simple stuff, right? Well, in the Cloud, folks want the same thing. Perimeter firewalling and access routing between multiple broadcast domains and some level of network hierarchy. One thing you will see this year is the trend of mapping broadcast domains to multicast. This is no longer only a VMware thing - you will be pleasantly surprised this year by other vendors.
So I don't think the whole DC will get flooded - only PEs and links which serve an I-SID with a misbehaving VM in it.
And yes, the behaviour inside I-SID appears to be flooding, but with efficient replication at fork points.
I mean 802.1aq, of course.
SPBM can support 2^24 "VLANs", and has no problem with supporting 4K regular VLAN-based services per port (read: per ESX server). So with SPBM we are moving from "4K VLANs per DC" to "4K VLANs per ESX".
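A rough sketch of that numbering arithmetic (the mapping policy is assumed for illustration, not any particular vendor's implementation):

```python
# SPBM's 24-bit I-SID lets every ESX uplink reuse the full 12-bit C-VLAN
# range, because the backbone forwards on I-SIDs, not C-VLAN IDs.

I_SID_SPACE = 2**24                  # ~16.7M service instances
C_VLANS_PER_UPLINK = 4094            # usable 802.1Q IDs per ESX uplink

# (ESX host, C-VLAN) -> I-SID: two hosts reusing C-VLAN 10 for different
# tenants simply get different I-SIDs in the backbone. Values are made up.
service_map = {
    ("esx-01", 10): 0x010001,
    ("esx-02", 10): 0x010002,
}

hosts_with_full_vlan_range = I_SID_SPACE // C_VLANS_PER_UPLINK
print(hosts_with_full_vlan_range, "ESX hosts could each get an independent 4K-VLAN space")
```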
Hope this makes sense.
So I can't quite see how you can have ESXes have independent VLAN ID spaces as long as there are L2 adjacency checks that rely on matching VLAN IDs between ESXes.
There isn't "an L2 domain with bunch of VLANs in it". May I recommend to check out the NANOG presentation I provided a link to above for how SPB solves these problems? Here's the link again: http://www.nanog.org/meetings/nanog50/presentations/Sunday/IEEE_8021aqShortest_Path.pdf - check slides 12, 16 and 17.
Correct. But you also can use this same C-VLAN ID space on a different ESX cluster within the same DC.
How many VMs would you have running on a single ESX cluster?
At about 64-128 (today the max is 32) 4-socket systems, IMHO we are going to hit the VLAN ID space constraint. Even then, I see nothing wrong with having plain vanilla physical L2 islands for each cluster at these sizes, separated by L3.
SPBM etc. really come into play when clusters get massive, and at that stage the VLAN ID space on hypervisors is going to be the constraint.
IMHO for a cloud SP, the network edge must be pushed down to the hypervisor; whether it's MPLS or SPBM is immaterial. The major theme of Ivan's post, to me, was that using the dumbest possible network feature set in vSwitches is not going to get a cloud SP far enough. For an SP like that, vSwitches must have more than the rudimentary features they have today.
If ALU has a vSwitch appliance (doesn't matter whether it's for Xen/KVM/whatever) that can do SPBM I'd be thrilled.
If I were a forward-looking provider, I would try to limit myself to L3 connectivity. Easier to design, build & deploy ... plus it can serve all greenfield applications and most existing stuff. I may be totally wrong, but looking from the outside it seems this is what Amazon is doing.
If I want to capture 100% of the market, I obviously have to provide virtual L2 broadcast domains. The question I would have to ask myself is: is the incremental revenue I'll get by providing L2 broadcast domain big enough to justify the extra complexity? Based on what I see happening in the industry, I don't believe too many people have seriously considered this question.
Obviously the networking industry is more than happy to support the "virtual L2 domain" approach, as it allows them to sell new technology and more boxes, which begs another question: "who exactly is generating the needs?"
#2 - we're already hitting the VLAN ID space constraint in VDI cloud applications like our FlipIT http://flipit.nil.com/ (large # of VDI VMs per ESX server, small # of VMs in a VLAN)
#3 - You totally got my point (and rephrased it better than I could) - complexity belongs in vSwitch, optimal end-to-end transport into the physical boxes.
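Back to #2 - rough arithmetic with made-up (but plausible) numbers shows how quickly dense VDI burns through VLAN IDs:

```python
# Illustration only: the numbers below are assumptions, not FlipIT specifics.
hosts_per_cluster = 64
vdi_vms_per_host = 100            # assumption: dense VDI consolidation
vms_per_vlan = 5                  # assumption: small per-customer VLANs

total_vms = hosts_per_cluster * vdi_vms_per_host      # 6400
vlans_needed = -(-total_vms // vms_per_vlan)          # ceiling division -> 1280
print(vlans_needed, "VLANs for a single cluster; a few such clusters exhaust 4094 IDs")
```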
And question 2: what about VEPA? Doesn't it "extend" vNICs past the vSwitch to an actual external switch (which could be running SPBM)?
#2 - I don't think VEPA would help you here and it definitely does NOT extend vNICs (that would be 802.1Qbh). More about it in a week or so.
I haven't played much with vDS, but I expect vDS to make stronger assumptions when the same vNICs are on the same VLAN ID. Not sure if it checks VLANs for disjointedness.
2) VEPA/VN-tag/whatever (as usual there are competing tracks) that tries to move virtual networking out of the hypervisor to a physical network box is certainly a valid approach, but it requires the hypervisor element manager (aka vCenter for ESX) to manage the physical network device to set up/tear down VM-related network plumbing when a VM moves onto/off the host. I guess this could be another application for OpenFlow =-X but either way the hypervisor element manager will acquire some advanced network control-plane features to manage either the virtual or the physical switch box.
#2: After reading up on VEPA I see that yes, it won't help, but it won't need to, provided #1 is true.
On the vSwitch 1 we create a port group we call "Network A", and associate it with VLAN 10.
On the vSwitch 2 we create a port group we call "Network B", and also associate it with VLAN 10.
Now, vCenter knows port groups by their network names, *not* by their VLAN IDs, correct? If so, it has no right to make an assumption that VLAN ID 10 on vSwitch 1's uplink is the same broadcast domain as VLAN ID 10 on vSwitch 2's uplink, correct? If so, we're all set! :)
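Here's a toy restatement of that scenario (definitely not the real vCenter data model, just the logic I'm assuming):

```python
# Port groups keyed by network label; the VLAN ID is just a per-vSwitch
# attribute, so identical VLAN IDs on different vSwitches don't have to
# mean the same broadcast domain.

port_groups = {
    ("vSwitch1", "Network A"): {"vlan_id": 10},
    ("vSwitch2", "Network B"): {"vlan_id": 10},
}

# Same VLAN ID, different networks -- nothing in this model ties them together.
same_vlan = (port_groups[("vSwitch1", "Network A")]["vlan_id"]
             == port_groups[("vSwitch2", "Network B")]["vlan_id"])
same_network = "Network A" == "Network B"
print(same_vlan, same_network)   # True False
```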
Never rely on overly complex designs; they will break at the most inappropriate moment.
Well, I guess that's a clever way to prolong the misery a bit by adding more and more pNICs =-X but in the end, isn't it easier to do the right thing =-X
I thought Amazon started to do just that (i.e. offering dedicated layer 2 domains):
http://aws.typepad.com/aws/2011/03/new-approach-amazon-ec2-networking.html
> Obviously the networking industry is more than happy to support the "virtual L2 domain" approach, as it allows them to sell new technology and more boxes, which begs another question: "who exactly is generating the needs?"
We don't want to sell any box at VMware ... yet we want the world to be better ;)
Keep up with your posts... it helps us to stay on track.
Massimo.
Trying to find someone to answer a few basic questions like:
* Would IP multicast work?
* Would subnet-level broadcast work?
* Can I run Microsoft NLB or Microsoft cluster over it?
* Can I run non-IP protocols (like IPv6 or CLNP) between my EC2 instances?
Will keep you updated if I ever get the answers ;)
As for the industry classification - I thought you were the leader in the virtualization industry 8-)
Massimo.
> Can I run Microsoft NLB or Microsoft cluster over it?
if you want to run NLB or MSCS in the cloud.... I think you are not ready for the cloud. Me think.
Massimo.
Let them crash then. Let them learn the hard way.
Massimo.
One of the benefits of keeping the notions that exist today in physical networks, like zoning based on a subnet plus the underlying bcast domain, is that it allows for easier migration of apps into the Cloud space. The vision is something like this - P-to-V your workloads, then P-to-V your network/security config (or something close to it), retain the same address space or renumber (if need be), and either via VPN or private connectivity to your provider, upload your apps and netsec config, which stands up the services in the Cloud... It sounds a bit futuristic, but I have some REST scripts that I share which stand up multiple bcast domains, set up access routing between them, set up a perimeter firewall w/ NAT towards the Internet and some static routes pointing to the provider PE router for the MPLS VPN without NAT, and then push the VMs into these containers and number them...
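For flavour, here's a stripped-down Python sketch of such a workflow; the base URL, endpoint paths and payloads are hypothetical placeholders (every cloud API names these differently) - the point is the sequence:

```python
# Hypothetical REST provisioning sequence: broadcast domains, inter-domain
# routing, perimeter firewall/NAT, a static route towards the provider PE,
# then the VMs themselves. None of the endpoints are a real product's API.

import requests

API = "https://cloud.example.net/api/v1"      # placeholder base URL
auth = ("tenant-admin", "secret")             # placeholder credentials

def post(path, payload):
    r = requests.post(f"{API}{path}", json=payload, auth=auth, timeout=30)
    r.raise_for_status()
    return r.json()

web_net = post("/networks", {"name": "web", "cidr": "10.1.1.0/24"})
app_net = post("/networks", {"name": "app", "cidr": "10.1.2.0/24"})
post("/routers", {"name": "access", "attach": [web_net["id"], app_net["id"]]})
post("/firewalls", {"name": "perimeter", "nat": True, "uplink": "internet"})
post("/routes", {"prefix": "0.0.0.0/0", "next_hop": "provider-pe", "nat": False})
post("/vms", {"name": "web-01", "network": web_net["id"], "ip": "10.1.1.10"})
```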
Can I say that REST is becoming the new CLI for the Cloud? I used to prefer Cisco IOS/CatOS until I played with my first Juniper in '99 (an Olive x86 box running JunOS - before Juniper had the hardware ready), fell in love with that gated-style CLI, and now it's REST... Maybe I'm a lone voice in the wilderness, or maybe we'll become a new breed of network engineers?
http://aws.amazon.com/vpc/faqs/#R4
https://forums.aws.amazon.com/thread.jspa?threadID=37972
http://tools.ietf.org/html/draft-ietf-l3vpn-end-system-00
https://datatracker.ietf.org/doc/draft-rfernando-virt-topo-bgp-vpn/?include_text=1
http://tools.ietf.org/html/draft-ietf-l2vpn-evpn-02