Scaling IaaS network infrastructure

I got totally fed up with the currently popular “flat-earth with long-distance bridging” architecture paradigm while developing the Data Center Interconnects webinar. It all started with the layer-2 hypervisor switches and lack of decent L3 network-side solutions; promoting non-scalable cloudy solutions doesn’t help either.

The network infrastructure would scale better if the hypervisors worked as MPLS/VPN PE-routers, but even MPLS would hit scalability limits when the number of servers grows into tens of thousands. The only truly scalable solution is IP-over-IP or MAC-over-IP implemented in the hypervisor switches.
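
A back-of-the-envelope sketch of that scaling argument: with MPLS/VPN as implemented today, the loopback of every PE-router has to appear as a host route in the IGP, whereas with IP-over-IP or MAC-over-IP the core only has to reach the hypervisors' transport addresses, which can be summarized per subnet. The function names, the subnet size, and the "one summarized prefix per hypervisor subnet" addressing plan are assumptions made for illustration, not figures from the article.

```python
def igp_host_routes_mpls_pe(hypervisors: int) -> int:
    """Hypervisors acting as MPLS/VPN PE-routers: one loopback /32 each in the IGP."""
    return hypervisors


def core_routes_ip_encapsulation(hypervisors: int, hypervisors_per_subnet: int) -> int:
    """IP-over-IP or MAC-over-IP: one summarized prefix per hypervisor subnet.

    Assumes (hypothetically) that hypervisor transport addresses are allocated
    from per-rack or per-pod subnets of a fixed size.
    """
    # Integer ceiling division: a partially filled subnet still needs a route.
    return (hypervisors + hypervisors_per_subnet - 1) // hypervisors_per_subnet


# Hypothetical subnet size, chosen only to make the comparison concrete.
HYPERVISORS_PER_SUBNET = 40

if __name__ == "__main__":
    for count in (1_000, 10_000, 50_000):
        host_routes = igp_host_routes_mpls_pe(count)
        subnet_routes = core_routes_ip_encapsulation(count, HYPERVISORS_PER_SUBNET)
        print(f"{count:>6} hypervisors: {host_routes:>6} IGP host routes "
              f"vs {subnet_routes:>5} summarized prefixes")
```

Under these assumptions, 50,000 hypervisors translate into 50,000 IGP host routes in the first case and 1,250 summarized prefixes in the second. The absolute numbers depend entirely on the addressing plan; the point is that one grows with the number of hypervisors and the other with the number of subnets.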

I tried to organize all these thoughts in the “How to build a scalable IaaS cloud network infrastructure” article that was recently published by SearchTelecom ... and just a few days after the article was published, Brad Hedlund pointed me to the Infrastructure as a Service Builder’s Guide document, which says almost the same thing (and comes to flawed conclusions because its authors had to promote OpenFlow and NEC).

7 comments:

  1. www.convergence.cx, 17 May, 2011 11:24

    "but even MPLS would hit scalability limits when the number of servers grows into tens of thousands": I don't see this as an issue, even with current-day RIB/FIB limitations, especially as most implementations are doing single-label-per-VRF on the PE. Can you explain?

  2. Ivan Pepelnjak, 17 May, 2011 12:46

    The way MPLS/VPN is implemented today, you have to run MP-IBGP between loopback interfaces (to ensure end-to-end LSP is set up correctly) _and_ all those loopback interfaces have to appear as host routes in IGP and RIB/FIB. You would end up with tens of thousands of host routes in a single IGP. Not the best design.

    As for label-per-VRF, all implementations I've seen use label-per-CE-prefix (apart from 6500/7600 where they obviously underdimensioned the LFIB), but it doesn't matter as the labels are locally significant to the PE-router. On the other hand, you do need a single LDP label to get to the PE-router, so label space is not the issue.

  3. Well, there sure are multiple solutions to allow for route summarization with MPLS LSPs, so scaling the IGP is not a problem even in MPLS networks. The main challenge is the type of topology used in large data centers. More and more often, it tends to be a fat tree for networks that feature full-mesh traffic patterns. In topologies like that, the number of links (and paths) in the topology no longer grows as O(N), where N is the number of nodes. Per what we know today, commonly used routing protocols do not scale in topologies with non-linear density growth, and alternate routing solutions should be used, such as algorithmic or compact routing. As long as the network topology has link density proportional to the number of nodes (in other words, it is more like a k-ary tree), theoretical scaling should not be a problem with any routing/switching technology.

  4. www.convergence.cx, 18 May, 2011 09:21

    And if you are really bothered about using the IGP, why not use BGP to carry the FEC label? (i.e., run the labelled unicast address family along with vpnv4 from your hypervisor+PE to your RR)

  5. Ivan Pepelnjak, 18 May, 2011 09:25

    Doesn't work if you have L3 switches in the path (and you need them to scale)

  6. This is purely a technical limitation that does not affect the theoretical scalability of a given technology. As long as there is no theoretical obstacle to scaling a design, it is always possible to come up with a technical solution if needed. That could be a connection-oriented or a connectionless technology, it does not really matter; this is just the old argument of connections vs packets. The major factor that limits theoretical scalability is the type of complexity growth in the network topology.

  7. www.convergence.cx, 19 May, 2011 00:53

    I think that is quite a general statement ("you need them to scale"). L3 aggregation switching certainly has its place, but if that place requires you to drop useful technology to save cost/space, then perhaps it isn't such a great approach. I'm personally not using L3 top-of-rack switching for this reason (instead we have L3 distribution, which has full feature parity), so this approach would work for me.



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.