(v)Cloud Architects, ever heard of MPLS?

Duncan Epping, the author of the fantastic Yellow Bricks virtualization blog, tweeted a very valid question following my vCDNI scalability (or lack thereof) post: “What you feel would be suitable alternative to vCD-NI as clearly VLANs will not scale well?”

Let’s start with the very basics: modern data center switches support anywhere between 1K and 4K VLANs (4K being the theoretical limit imposed by 802.1q framing). If you need more than 1K VLANs, you’ve either totally messed up your network design or you’re a service provider offering multi-tenant services (recently relabeled by your marketing department as IaaS cloud). Service Providers have had to cope with multi-tenant networks for decades ... only they never realized those were multi-tenant networks; they called them VPNs. Maybe, just maybe, there’s a technology out there that’s been field-proven, known to scale, and works over switched Ethernet.
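
To make the 4K limit concrete, here's a minimal sketch (Python, purely illustrative, not part of the original post) that builds and parses an 802.1q tag; the VLAN ID is a 12-bit field, which is where the ceiling comes from:

```python
# Minimal sketch: build and parse an 802.1Q tag to show why the VLAN ID
# space tops out around 4K -- the VID field is only 12 bits wide.
import struct

TPID = 0x8100  # 802.1Q Tag Protocol Identifier

def build_dot1q_tag(pcp: int, dei: int, vid: int) -> bytes:
    """4-byte 802.1Q tag: TPID + TCI (PCP 3 bits, DEI 1 bit, VID 12 bits)."""
    if not 0 <= vid <= 0x0FFF:
        raise ValueError("VID must fit in 12 bits (4096 values)")
    tci = (pcp & 0x7) << 13 | (dei & 0x1) << 12 | (vid & 0x0FFF)
    return struct.pack("!HH", TPID, tci)

def parse_vid(tag: bytes) -> int:
    """Extract the 12-bit VLAN ID from an 802.1Q tag."""
    _tpid, tci = struct.unpack("!HH", tag)
    return tci & 0x0FFF

tag = build_dot1q_tag(pcp=0, dei=0, vid=100)
print(parse_vid(tag))  # 100
# VID 0 (priority-tagged frames) and 4095 are reserved, leaving 4094 usable
# VLANs -- the "4K VLANs per switched domain" ceiling mentioned above.
```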

Guess what: hundreds of Service Providers worldwide use something called MPLS to scale their VPN offerings, be it at layer 2 (VPLS) or layer 3 (MPLS/VPN). Huge high-speed networks serving tens or hundreds of thousands of customers have been built on MPLS and they work well. While I can understand that enterprise network designers feel uncomfortable considering MPLS (although MPLS/VPN can solve numerous enterprise problems, as described in my Enterprise MPLS/VPN Deployment webinar), there is no reason why service providers wouldn’t reuse in their data centers the same technology they’ve used in their WAN networks for the last decade.

Now imagine VMware had actually done the right thing and implemented a VPLS-capable PE-router in their vSwitch instead of taking a product targeted at building labs, porting it from userland into kernel space, and slapping a GUI on top of it.

The VPLS approach would not solve the security issues (MPLS backbone has to be trusted), but would solve the scalability ones (no need to bridge in the data center core), breeze by the VLAN numbering barrier, and enable instant inter-DC connectivity (or private-to-public cloud connections) without external bridging kludges when using mGRE or L2TPv3-based pseudowires. The full effect of the scalability improvements would only become visible after deploying P2MP LSPs in the network core, and I’m positive Juniper would be very happy to tell everyone they already have that built into their gear.
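
For readers who have never touched MPLS, here's a rough sketch (hypothetical, illustration only, not a real PE implementation; the label values are arbitrary) of the encapsulation idea behind a VPLS pseudowire: the tenant's Ethernet frame gets pushed behind a pseudowire label and a transport label (or a GRE/IP header when running MPLS over GRE), so core devices forward on labels and never have to learn tenant MAC addresses:

```python
# Rough sketch: encode an MPLS label stack and wrap a tenant Ethernet frame
# the way a VPLS pseudowire would, so the core never sees tenant MACs.
import struct

def label_entry(label: int, tc: int = 0, bottom: bool = False, ttl: int = 64) -> bytes:
    """One 32-bit MPLS label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    if not 0 <= label < 2 ** 20:
        raise ValueError("MPLS labels are 20 bits wide")
    word = (label << 12) | (tc & 0x7) << 9 | (int(bottom) << 8) | (ttl & 0xFF)
    return struct.pack("!I", word)

def vpls_encapsulate(tenant_frame: bytes, pw_label: int, tunnel_label: int) -> bytes:
    """Push the pseudowire (VC) label, then the transport label, onto the frame.
    With MPLS-over-GRE the transport label is typically replaced by the GRE/IP
    header, which is what enables the inter-DC connectivity mentioned above."""
    stack = label_entry(tunnel_label) + label_entry(pw_label, bottom=True)
    return stack + tenant_frame

frame = bytes(64)  # placeholder tenant Ethernet frame
packet = vpls_encapsulate(frame, pw_label=100, tunnel_label=20000)
print(len(packet) - len(frame))  # 8 bytes of label-stack overhead
```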

Moving further into dreamland, imagine VMware had done the truly right thing and implemented an MPLS/VPN PE-router in their vSwitch. Zero bridging in the backbone, seamless L3 mobility, no need for bridging kludges, and layer-3 inter-DC connectivity with MPLS/VPN-over-mGRE. A few moronic solutions relying on layer-2 inter-server connectivity would break, but remember that we’re talking about public infrastructure clouds here, not your typical enterprise data center.
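
To illustrate what an MPLS/VPN-style PE-router in the vSwitch would buy us, here's a minimal sketch (hypothetical; the VRF names, next hops and labels are made up) of the per-tenant forwarding state such a PE would keep: one routing table (VRF) per tenant, so overlapping tenant address space is a non-issue and no layer-2 state ever has to be stretched across the backbone:

```python
# Minimal sketch of per-tenant (VRF) forwarding state in a hypothetical
# PE-router embedded in the vSwitch. Every tenant gets its own table, so
# tenants can reuse overlapping IP subnets with no VLAN/MAC state in the core.
import ipaddress

class Vrf:
    def __init__(self, name: str):
        self.name = name
        self.routes = {}  # prefix -> (egress PE, VPN label)

    def add_route(self, prefix: str, next_hop: str, vpn_label: int) -> None:
        self.routes[ipaddress.ip_network(prefix)] = (next_hop, vpn_label)

    def lookup(self, dst: str):
        """Longest-prefix match, scoped to this tenant's table only."""
        addr = ipaddress.ip_address(dst)
        matches = [p for p in self.routes if addr in p]
        return self.routes[max(matches, key=lambda p: p.prefixlen)] if matches else None

# Two tenants happily reusing the same 10.1.1.0/24 subnet:
tenant_a, tenant_b = Vrf("tenant-A"), Vrf("tenant-B")
tenant_a.add_route("10.1.1.0/24", next_hop="pe2.dc1.example", vpn_label=3001)
tenant_b.add_route("10.1.1.0/24", next_hop="pe7.dc2.example", vpn_label=3002)

print(tenant_a.lookup("10.1.1.25"))  # ('pe2.dc1.example', 3001)
print(tenant_b.lookup("10.1.1.25"))  # ('pe7.dc2.example', 3002)
```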

Obviously we won’t see either of these options in the near (or far) future. Implementing something that actually works and solves scalability problems with an old technology is never as sexy as inventing a new, rounder wheel that might never work in practice because it’s too brittle, not to mention that developing PE-router functionality would require a serious amount of work.

More information

The Enterprise MPLS/VPN Deployment webinar describes the typical MPLS use cases in enterprise networks, the underlying technology, and high-level design and implementation guidelines.

In the Data Center Interconnects webinar I describe how you can use MPLS/VPN to provide end-to-end path isolation across multiple data centers without deploying stretched VLANs or inter-DC bridging.

Comments

  1. > Maybe, just maybe, there’s a technology out there that’s been field-proven, known to scale, and works over switched Ethernet.

    802.1ah

    > would solve the scalability ones

    Check

    > breeze by the VLAN numbering barrier

    Check

    > enable instant inter-DC connectivity

    Check, can do over L2TPv3 or VPLS or EoMPLS

    > The full effect of the scalability improvements would only become visible after deploying P2MP LSPs in the network core

    Don't need that, as multicast is natively supported

    > we won’t see either of these options

    ALU does that - that's their flavour of "DC fabric": an S-VLAN per flexibly configurable combination of selectors (L2/L3/L4, if I'm not mistaken), automagically mapped to a hypervisor that needs it, when it needs it, across a single DC or a number of DCs.
  2. Grr, forgot to "follow" the stream again.
  3. After reading 802.1ah I got the impression that it maps broadcast+multicast in the C-component into multicast in the B-component. The standard does refer to the potential use of MMRP (or static filters) but it's not mandatory.

    Am I missing something? Also, if ALU does all of that (and we know nothing about it), their marketing department deserves a "kind nudge". Could you point me to relevant documents?

    Thank you!
    Ivan
  4. PBB and MPLS are both carrier-originated technologies.
    What is puzzling is the need by vendors to reinvent the wheel for the DC.

    The vendors I talked to were saying things like "it's too complex for the user" and "the edge devices will not handle it" (I suppose thousands of VLANs are OK :)).
  5. And IS-IS-based TRILL/FabricPath/VCS Fabric/OTV is not complex? The only difference is the complexity is hidden (which will only bite you when you have to do real in-depth troubleshooting).
  6. My point exactly :)
    The more I read about those the less I understand the reasoning behind their introduction.
    I thought even PBB needed a reason to be brought to life (multicast?).
  8. > into multicast in the B-component

    ...which is then handled as if it was normal Ethernet multicast, right?

    > their marketing department deserves a "kind nudge"

    In my experience, they rely on existing relationships and trade shows, where you can see the product in action and speak with somebody who actually knows what they are talking about.

    Here's one link I found which has some marketing fluff on the subject: http://searchnetworking.techtarget.com/news/2240034476/Alcatel-Lucent-debuts-monster-data-center-switch-fabric-architecture
  9. you are my hero
  10. >> into multicast in the B-component
    > ...which is then handled as if it was normal Ethernet multicast, right?

    Exactly my point. Normal multicast doesn't scale as it's flooded throughout the broadcast domain ... unless, of course, you use something like MMRP.

    On the other hand, VPLS (more so with P2MP LSP) scales better, as the multicasts get flooded only to those PE-routers that actually need them (with P2MP LSP on a source-rooted tree).
  11. You didn't really answer my question though. Yes, it could possibly function as an alternative when someone invests the time/money to develop it. Or do you have a way of fully automating this in case a tenant wants to isolate his vApp on a separate network?

    I am all ears,
  12. > Normal multicast doesn't scale as it's flooded throughout the broadcast domain ... unless, of course, you use something like MMRP.

    Ok, I should have called things by their proper names - I should have said "SPBM" instead of "802.1ah". My bad! :(

    I think what I wanted to say is that SPBM takes care of pruning/replication by building multicast topology based on SPF trees (http://www.nanog.org/meetings/nanog50/presentations/Sunday/IEEE_8021aqShortest_Path.pdf)
  13. Well, if it is not VMware but, say, the KVM hypervisor, I'd guess one could use Open vSwitch, which actually implements Ethernet over IP, though admittedly it's not VPLS.
  15. The way I understood the documentation, openvswitch implements P2P links with EtherIP, which obviously doesn't scale. Also, you would need plenty of orchestration to ensure links are established on-demand as VMs move around (which is what vCDNI does automatically).

    HOWEVER - there's OpenFlow :-P We could still be pleasantly surprised.
  16. Ah, now I'm in familiar territory 8-)

    SPBM uses MAC-in-MAC encapsulation with I-SIDs to provide a different forwarding mechanism in the SPBM core. However, the design paradigm is still a single broadcast domain (modulo VLAN pruning). It doesn't scale.
    Replies
    1. There seems to be some confusion here. SPBM scopes customer broadcasts to the I-SID end points. It is not a single broadcast domain. I have trouble figuring out how n**2 replication of VPLS tunnels to the hypervisor could possibly scale better. Defies the laws of physics.
  17. > However, the design paradigm is still a single broadcast domain (modulo VLAN pruning). Doesn't scale.

    Could you please elaborate on where you see the scaling issue? Multicast traffic inside "huge" I-SIDs?
  18. As I wrote in the vCDNI post, if a single VM goes bonkers, the whole DC is flooded. Same with SPBM.

    You won't see that in a typical SP/VPLS/802.1ah network because sensible people use routers to connect to the Carrier Ethernet service.
  19. > the whole DC is flooded

    Wouldn't flood scope be contained by the I-SID to which this VM is mapped? I thought that I-SID is a broadcast container.
  20. > I thought that I-SID is a broadcast container.

    Good question for a Packet Pushers Podcast we're doing next week. I don't think it is in SPBM.

    Also, in 802.1ah, multicasts/broadcasts are sent to the default backbone destination address, which is a multicast address (unless you configure a P2P service), so C-multicast becomes B-multicast.
  21. Will look forward to the podcast :)
  22. A little surprised there is no mention of Q-in-Q in the article or comments...
  23. Q-in-Q doesn't help you. It's even worse than vCDNI - single MAC address space, single broadcast domain (unless you use VLAN pruning).
  24. Well, there is the mpls-linux project, so not all is lost.

    It'd be interesting to see mpls-linux running with, say, VPLS and KVM. I guess I need to kill a couple of weekends to see whether it's possible :-[
  25. I don't understand how supporting SPB or whatever on a physical device (which is what ALU does here) is going to fix the lack of a PE edge or other scalable Ethernet-over-IP transport in the hypervisor. Why even bring it into the conversation unless ALU offers a software implementation that can be instantiated as a vSwitch on hypervisors?
  26. And? How is having or not having ALU going to fix the lack of a scalable Ethernet-over-IP transport on the hypervisor?

    The whole premise of Ivan's post is to get rid of physical L2 and do L3 all the way down (because L3 just scales way better than L2, which is probably why the Internet is L3 =-X ). Stopping at the physical network and leaving the virtual network as-is is not really a solution. Or maybe I'm missing something.
  27. Doesn't MPLS require manual configuration to bridge networks? How will that work?
  28. mpls-linux seems to be dead (last change ~2 years ago) ... and it only had basic LDP and PE-router labeling functionality.
  29. Well, everything needs configuration (including vCDNI portgroups). The question is how much of the complexity you decide to hide behind a CLI/GUI.
  30. Ivan,

    Building an RFC 2547 PE isn't the answer for Cloud networks because the goal is not a transport network, but a broadcast domain. Really, what's needed in the Cloud is an emulation of what customers get when they go into a datacenter like Savvis or Equinix - you get a set of racks in a cage, you uplink your firewall/router to the provider's redundant set of links that goes to their access-layer switching, and below your router/firewall you create a number of routed networks backed by VLANs, etc. - simple stuff, right? Well, in the Cloud, folks want the same thing: perimeter firewalling and access routing between multiple broadcast domains and some level of network hierarchy. One thing you will see this year is the trend of mapping broadcast domains to multicast. This is no longer only a VMware thing - you will be pleasantly surprised this year by other vendors.
  31. I would appreciate it if you could explain the need for a scalable transport on the hypervisor. Ivan pointed out that a VM "gone bonkers" can produce a broadcast or multicast flood. Ok, understood. But this can be dealt with - with BUM rate limiters and contained broadcast domains in the DC network above the hypervisor. What else is there that needs additional intelligence on a VM host?
  32. I think Ivan's post goes over that in the 2nd paragraph. Assuming the vSwitch is still L2, its hand-off to the physical network is still an L2 802.1q trunk and you are still constrained by the 2^12 VLAN ID space. Because if the vSwitch is L2, a distributed vSwitch naturally expects the same VLAN ID across hypervisors for the same broadcast domain/VM L2 adjacency. How is the mapping on the ALU going to extend the VLAN ID space on the trunk between the vSwitch and the physical switch? I am all ears.
  33. Ok, went to the source (802.1ak draft 3.6) - the way it reads to me, "client-side" BUM *is* constrained to its I-SID, which includes member SPBM PEs (which have ports with this I-SID) and selected shortest path internodal links for this given I-SID.

    So I don't think the whole DC will get flooded - only PEs and links which serve an I-SID with a misbehaving VM in it.

    And yes, the behaviour inside I-SID appears to be flooding, but with efficient replication at fork points.
  34. > 802.1ak

    I mean 802.1aq, of course.
  35. Second paragraph reads: "If you need more than 1K VLANs, you’ve either totally messed up your network design or you’re a service provider offering multi-tenant services...".

    SPBM can support 2^24 "VLANs", and has no problem with supporting 4K regular VLAN-based services per port (read: per ESX server). So with SPBM we are moving from "4K VLANs per DC" to "4K VLANs per ESX".

    Hope this makes sense.
  36. The more accurate statement would be 2^12 VLANs per virtual L2 domain, which can span more than one ESX host. Unless you can show me that with SPBM the L2 domain can be shrunk to a single ESX/vSwitch and still provide L2 adjacency/vMotion/etc. between different ESX hosts with non-matching VLAN IDs.

    So I can't quite see how ESX hosts can have independent VLAN ID spaces as long as there are L2 adjacency checks that rely on matching VLAN IDs between them.
  37. > 2^12 VLAN per virtual L2 domain that can have more than one ESX

    There isn't "an L2 domain with a bunch of VLANs in it". May I recommend checking out the NANOG presentation I linked to above to see how SPB solves these problems? Here's the link again: http://www.nanog.org/meetings/nanog50/presentations/Sunday/IEEE_8021aqShortest_Path.pdf - check slides 12, 16 and 17.
  38. Thanks for the link, but I don't see how it addresses my point. It shows that the C-VLAN is the same on both hypervisors; if that's not a single L2 ID space, then what is it? And given that the C-VLAN ID space must be the same on all ESX hosts in the cluster, my statement stands.
  39. > C-VLAN ID space must be same on all ESXes in the cluster

    Correct. But you also can use this same C-VLAN ID space on a different ESX cluster within the same DC.

    How many VMs would you have running on a single ESX cluster?
  40. Well, that's what my statement was about; I guess I worded it a bit vaguely. A virtual L2 domain in my not-so-clear definition (one that shares a VLAN ID space) is the size of an ESX cluster with the current vSwitch. The obvious trend is toward bigger ESX clusters.

    At about 64-128 (today's max is 32) 4-socket systems, IMHO we are going to hit the VLAN ID space constraint. Even then I see nothing wrong with having plain-vanilla physical L2 islands for each cluster at these sizes, separated by L3.

    SPBM and friends really come into play when clusters get massive, and at that stage the VLAN ID space on the hypervisors is going to be the constraint.

    IMHO for a cloud SP the network edge must be pushed down to the hypervisor; whether it's MPLS or SPBM is immaterial. The major theme of Ivan's post to me was that using the dumbest possible network feature set in vSwitches is not going to get a cloud SP far enough. For an SP like that, vSwitches must have more than the rudimentary features they have today.

    If ALU has a vSwitch appliance (doesn't matter whether it's for Xen/KVM/whatever) that can do SPBM, I'd be thrilled.
  41. Thanks for the feedback! What you mention is a crucial point: the cloud providers have to decide whether they want to offer L3 connectivity or L2 broadcast domains to their customers.

    If I were a forward-looking provider, I would try to limit myself to L3 connectivity. Easier to design, build & deploy ... plus it can service all greenfield applications and most existing stuff. I may be totally wrong, but looking from the outside it seems this is what Amazon is doing.

    If I want to capture 100% of the market, I obviously have to provide virtual L2 broadcast domains. The question I would have to ask myself is: is the incremental revenue I'll get by providing L2 broadcast domains big enough to justify the extra complexity? Based on what I see happening in the industry, I don't believe too many people have seriously considered this question.

    Obviously the networking industry is more than happy to support the "virtual L2 domain" approach, as it allows them to sell new technology and more boxes, which begs another question: "who exactly is generating the needs?"
  42. #1 - vDS is per vCenter, not per ESX cluster, so it (and its VLANs) can span 1,000 hosts and 20,000 virtual ports (http://www.vmware.com/pdf/vsphere4/r41/vsp_41_config_max.pdf)

    #2 - we're already hitting the VLAN ID space constraint in VDI cloud applications like our FlipIT http://flipit.nil.com/ (large # of VDI VMs per ESX server, small # of VMs in a VLAN)

    #3 - You totally got my point (and rephrased it better than I could) - complexity belongs in the vSwitch, optimal end-to-end transport in the physical boxes.
  43. Understand now, thanks! I'm not too good with VMware, so please excuse me if the next question is stupid: does ESX have a single VLAN ID space across all its vSwitch instances? Say an ESX host has two vSwitch instances, each linked to a separate pNIC. If you configure a vNIC on vSwitch 1 with VLAN ID 10, and then another vNIC on vSwitch 2 with the same VLAN ID, what is ESX / vSphere going to make of it? Will it expect that VLAN 10 is bridged outside the ESX host and treat both vNICs as members of the same port group?

    And question 2: what about VEPA? Doesn't it "extend" vNICs past the vSwitch to an actual external switch (which could be running SPBM)?
  44. #1 - in my understanding, the VLAN ID space is tied to the physical NICs. Outbound, a portgroup does VLAN tagging (assuming it's configured that way); inbound, packets are only delivered to a portgroup if they arrive through the pNIC to which the portgroup is attached.

    #2 - I don't think VEPA would help you here and it definitely does NOT extend vNICs (that would be 802.1Qbh). More about it in a week or so.
  46. 1) ESX will not care whether VLAN 10 is disjoint unless you put ESX control-plane traffic on it (Service Console, VMkernel, FT). But even for a user VM VLAN, vCenter will show in that cute diagram (the one showing connectivity to datastores & port-groups) that these two ESX hosts in the same cluster are on the same VLAN, so a human operator is likely to assume that this is not a disjoint VLAN. So implicitly ESX does make an assumption.

    I haven't played much with vDS, but I expect vDS to have stronger assumptions if the same vNICs are on the same VLAN ID. Not sure if it checks VLANs for disjointedness.

    2) VEPA/VN-Tag/whatever (as usual there are competing tracks), which tries to move virtual networking out of the hypervisor to a physical network box, is certainly a valid approach, but it requires the hypervisor element manager (aka vCenter for ESX) to manage the physical network device to set up/tear down VM-related network plumbing when a VM moves onto/off the host. I guess this could be another application for OpenFlow =-X but either way the hypervisor element manager will acquire some advanced network control-plane features to manage either the virtual or the physical switch box.
  47. #1: If this is so, aren't we sorted out then? Just create multiple vDS, each with its own 4K VLANs, and tie each vSwitch to a corresponding pNIC. Should be especially easy with UCS equipped with VICs. Then plug each pNIC into an SPBM switch, which can map either a whole port or a port/VLAN combination to an individual I-SID. Voilà: multiple 4K-VLAN domains per ESX, no?

    #2: After reading up on VEPA I see that yes, it won't help, but it won't need to, provided #1 is true.
  48. Right, a bit more thinking now. Take my scenario with two vSwitch instances on the same ESX host, each vSwitch attached to its own pNIC uplink.

    On the vSwitch 1 we create a port group we call "Network A", and associate it with VLAN 10.
    On the vSwitch 2 we create a port group we call "Network B", and also associate it with VLAN 10.

    Now, vCenter knows port groups by their network names, *not* by their VLAN IDs, correct? If so, it has no right to make an assumption that VLAN ID 10 on vSwitch 1's uplink is the same broadcast domain as VLAN ID 10 on vSwitch 2's uplink, correct? If so, we're all set! :)
  49. ... until a device, link or pNIC breaks in the middle of the night and the late shift operator assigns the port group to different pNICs trying to restore connectivity :-E

    Never rely on overly complex designs, they will break at the most inappropriate moment.
  50. BTW each vSwitch is likely to use 2 x pNICs for redundancy, so we will be burning switch ports just because we need more VLAN IDs per host, though that's probably good for network device vendors =-X

    Well, I guess that's a clever way to prolong the misery a bit by adding more and more pNICs =-X but in the end isn't it easier to do the right thing =-X
  51. Wouldn't there be multiple (teamed) pNICs, just as they are today? And yes, agree with the complexity and breaking...
  52. > If I were a forward-looking provider, I would try to limit myself to L3 connectivity. Easier to design, build & deploy ... plus it can service all greenfield applications and most existing stuff. I may be totally wrong, but looking from the outside it seems this is what Amazon is doing.

    I thought Amazon started to do just that (i.e. offering dedicated layer 2 domains):

    http://aws.typepad.com/aws/2011/03/new-approach-amazon-ec2-networking.html

    > Obviously the networking industry is more than happy to support the "virtual L2 domain" approach, as it allows them to sell new technology and more boxes, which begs another question: "who exactly is generating the needs?"

    We don't want to sell any box at VMware ... yet we want the world to be better ;)

    Keep up with your posts... it helps us to stay on track.

    Massimo.
  53. Someone smart (like Amazon) would be able to do everything they publicly promise with smart access lists (like multi-tenant vShield App or Cisco's VSG).

    Trying to find someone to answer a few basic questions like:

    * Would IP multicast work?
    * Would subnet-level broadcast work?
    * Can I run Microsoft NLB or Microsoft cluster over it?
    * Can I run non-IP protocols (like IPv6 or CLNP) between my EC2 instances?

    Will keep you updated if I ever get the answers ;)

    As for the industry classification - I thought you were the leader in the virtualization industry 8-)
  54. I think that yes, you could do that (technically). I believe, however, that it's easier to sell to two "orgs" that they are going to be hosted on two different broadcast domains (separated by "traditional" firewall architectures - although virtual and not physical) than to sell to two "orgs" that they are hosted on a single broadcast domain but are kept separated by a "firewall filter driver running on the hosts". Don't get me wrong, I am not saying it's less secure; I am saying that it's more difficult to sell (in 2011). People out there tend to be conservative (and have political agendas). They will take one step at a time. If you ask me, vShield App (or similar technologies) is even more compelling than vCDNI or any other layer 2 virtualization technology we may come out with... I found it interesting that you jumped from MPLS straight to App/VSG.

    Massimo.
  55. BTW:

    > Can I run Microsoft NLB or Microsoft cluster over it?

    if you want to run NLB or MSCS in the cloud.... I think you are not ready for the cloud. Me think.

    Massimo.
  56. Me agree. Wholeheartedly. Probably not everyone does :-E
  57. > Me agree. Wholeheartedly. Probably not everyone does

    Let them crash then. Let them learn the hard way.

    Massimo.
  58. That's actually a great way to look at it - if you're a service provider interested in getting parity with Amazon capabilities (EC2 specifically, as we know that VPC does support bcast), then the approach needs to be equivalent in capabilities: VMs only communicate via L3, today accomplished via filtering (ebtables/iptables in the case of Xen clouds). If you're looking to build an enterprise-like Cloud where clustering apps rely on broadcast/L2 multicast, or you simply want all your legacy apps to work (like the many Windows services which rely on broadcast), then you'd want something built either via VLANs or via the next-gen encapsulation alternatives. With the encapsulation approaches, it's key to provide solid debugging and tracing capabilities - from Wireshark plugins to decode the new protocol to tools that can help troubleshoot the topology.

    One of the benefits of keeping the notions that exist today in physical networks, like zoning that's based on a subnet plus the underlying bcast domain, is that it allows for easier migration of apps into the Cloud space. The vision is something like this: p-to-V your workloads, then p-to-V your network/security config (or something close to it), retain the same address space or renumber (if need be), and - either via VPN or private connectivity to your provider - upload your apps and netsec config, which stands up the services in the Cloud... It sounds a bit futuristic, but I have some REST scripts that I share which stand up multiple bcast domains, set up access routing between them, set up a perimeter firewall w/ NAT towards the Internet and some static routes pointing to the provider PE router for the MPLS VPN without NAT, and then push the VMs into these containers and number them...

    Can I say that REST is becoming the new CLI for Cloud? I used to prefer Cisco IOS/CatOS until I played with my first Juniper in '99 (an Olive x86 box running JunOS - before Juniper had the hardware ready), fell in love with that Gated-style CLI, and now it's REST... Maybe I'm a single voice in the wilderness or maybe we'll become a new breed of network engineers?
  59. Amazon VPC does NOT support multicast or broadcast:

    http://aws.amazon.com/vpc/faqs/#R4
  60. Willing to tell me more about the REST functionality you're talking about?
  61. Thanks for the correction - here is a thread asking Amazon folks to implement broadcast/multicast functionality, a good read:

    https://forums.aws.amazon.com/thread.jspa?threadID=37972
  62. Yep - let me know which areas to expand upon or you can contact me privately here: sergey at vmware dot com.
  63. Agree 100% with this article.
  64. Ivan, Can you please offer your comments on these drafts?

    http://tools.ietf.org/html/draft-ietf-l3vpn-end-system-00
    https://datatracker.ietf.org/doc/draft-rfernando-virt-topo-bgp-vpn/?include_text=1
    http://tools.ietf.org/html/draft-ietf-l2vpn-evpn-02
    Replies
    1. Thank you. I am eagerly waiting for your comments. While you are at it, can you please look at: http://tools.ietf.org/html/draft-smith-lisp-layer2-01 too?