Building a Greenfield Data Center

The following design challenge landed in my Inbox not too long ago:

My organization is in the process of building a completely new data center from the ground up (new hardware, software, protocols ...). We will currently start with one site but may move to two for DR purposes. What DC technologies should we be looking at implementing to build a stable infrastructure that will scale and support technologies you feel will play a big role in the future?

In an ideal world, my answer would begin with “Start with the applications.”

Application and server recommendations

Whatever you do, make sure you use scale-out application architecture as much as possible. Use products and tools that allow you to scale out every application tier (web servers, application servers and database servers). Web servers are usually easy to scale out unless you insist on weird session management techniques. Database servers are the toughest nut to crack, but even Microsoft’s SQL server has a somewhat redundant architecture.

If you want to make scale-out architecture transparent to the clients, you have to deploy load balancing. Use local load balancing within a data center and DNS-based load balancing between data centers (you might also try out anycast). Select products that have tight integration between local and DNS-based load balancing. Prefer vendors that integrate tightly with your server virtualization platform (example: new VMs should be added to load balancing pools automatically).

Use as much server virtualization as possible. Unless you have huge workloads where a single application needs several high-end physical servers for every tier, virtualization will significantly lower your costs and help you deploy new servers and applications faster.

Use high-end physical servers with as much memory and as many CPU cores as your budget can survive. Bob Plankers wrote a nice blog post explaining why scale-up makes sense for hypervisor hosts.

Networking infrastructure

Support IPv4 and IPv6. Ideally you’d deploy only IPv6 on the inside networks assuming your applications can work over IPv6 (dual-stack deployment increases complexity and support costs) and do 4-to-6 and 6-to-6 load balancing. Some vendors might lack IPv6 support in their data center gear. It’s their problem, don’t make it yours.

Simplify your external routing and try to get a different public prefix for each site. Getting two public IPv4 prefixes might be a tough call; supposedly there’s plenty of IPv6 address space left judging by how we’re throwing it away.
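
To make the per-site prefix idea a bit more concrete, here is a minimal Python sketch that hands each site its own prefix and then carves small /64 segments out of it. The 2001:db8::/32 block is the IPv6 documentation range, and the site names, prefix lengths and helper functions are my assumptions for illustration, not an addressing recommendation.

```python
import ipaddress

# Documentation prefix; a real deployment would use its own allocation.
ALLOCATION = ipaddress.ip_network("2001:db8::/32")

def site_prefixes(allocation, sites, site_prefixlen=48):
    """Assign one distinct prefix per site from the allocation."""
    subnets = allocation.subnets(new_prefix=site_prefixlen)
    return {site: next(subnets) for site in sites}

def segment_subnets(site_prefix, count, prefixlen=64):
    """Return the first `count` subnets of a site prefix, one per L3 segment."""
    subnets = site_prefix.subnets(new_prefix=prefixlen)
    return [next(subnets) for _ in range(count)]

# Hypothetical site names.
sites = site_prefixes(ALLOCATION, ["dc1", "dc2"])
for name, prefix in sites.items():
    print(name, prefix, [str(s) for s in segment_subnets(prefix, 3)])
```

Because every site owns a distinct prefix, external routing only has to advertise one aggregate per site.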

Keep your layer-2 domains small and use layer-3 switching (known inside the ivory towers as routing) as much as possible. Data centers should be islands of layer-2 connectivity in an ocean of layer-3 (thank you, @networkingnerd). Even with emerging earth-flattening technologies like FabricPath, TRILL or SPB, bridging still doesn’t scale (spanning tree protocol is not the only problem bridging has).

If you’ve followed the previous advice, you don’t need large-scale bridging anyway – scale-out application architectures with load balancers work happily across multiple IP subnets (so do some recent clustering solutions) and don’t need large VLANs.

Furthermore, with proper application architecture and decent load balancing products, there’s no need to move virtual machines around. You can easily shut them down in one location and start them in another where they would get a different IP address; the load balancing tools (integrated with your virtualization platform) should automatically adapt to the changes you’ve made.
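
What "automatically adapt" could look like is sketched below: compare the load-balancing pool with the addresses of the VMs that are currently running and compute the difference. This is only a sketch under my own assumptions; the function name, the data shapes and the documentation-range addresses are hypothetical, and a real integration would use whatever APIs your load balancer and virtualization platform expose.

```python
def reconcile_pool(pool, running_vms):
    """Return (to_add, to_remove) so the pool matches the running VMs.

    pool        -- set of IP addresses currently in the load-balancing pool
    running_vms -- dict mapping VM name to its current IP address
    """
    desired = set(running_vms.values())
    return sorted(desired - pool), sorted(pool - desired)

# web2 was shut down in site A (192.0.2.12) and restarted in site B,
# where it received a new address (198.51.100.12).
pool = {"192.0.2.11", "192.0.2.12"}
running = {"web1": "192.0.2.11", "web2": "198.51.100.12"}

to_add, to_remove = reconcile_pool(pool, running)
print("add:", to_add, "remove:", to_remove)
```

The VM keeps its name but not its address, so nothing in this scheme requires a VLAN stretched between the two locations.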

Some vendors might not have L3 switching available in products that would fit your needs. Remember that it’s their problem, not yours. Look around, there are alternatives.

Make your layer-2 domains stable. Use multi-chassis link aggregation to increase bandwidth utilization and reduce the impact of link failures. Use spanning tree protection features offered by your gear (BPDU guard, root guard, bridge assurance ...).

Use 10-gigabit Ethernet. It’s easier to maintain than tons of 1GbE links and might actually get decent utilization when used on the high-end servers mentioned earlier. I would try to use a system that supports virtual Ethernet NICs (Cisco UCS comes to mind). VMware is still unhappy if you don’t have plenty of NICs in your server; the easiest way to keep it happy is to use plenty of virtual NICs spread over two physical uplinks.

Data centers used to have a hierarchy of bandwidths – 100 Mbps to the servers, 1Gb in access layer, 10Gb in the core. With 10GbE server attachments and 40/100GbE products still not widely available (or being too expensive), you’re forced to use high oversubscription ratios. Port channels help, but they’re not perfect. Select the gear that supports DCB standards to cope with high-volume servers overloading the core links. PFC and ETS are mandatory; QCN is not needed if you don’t have large L2 domains.
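
The oversubscription arithmetic is easy to check for your own design. The sketch below compares a switch from the old bandwidth hierarchy with a 10GbE access switch that has only 10GbE uplinks; the port counts are hypothetical examples, not figures for any particular product.

```python
from fractions import Fraction

def oversubscription(server_ports, server_gbps, uplinks, uplink_gbps):
    """Ratio of server-facing bandwidth to uplink bandwidth on one switch."""
    return Fraction(server_ports * server_gbps, uplinks * uplink_gbps)

# Old hierarchy: 48 x 1GbE server ports, 2 x 10GbE uplinks.
old = oversubscription(48, 1, 2, 10)
# 10GbE server ports with only 10GbE uplinks: 48 x 10GbE, 4 x 10GbE.
new = oversubscription(48, 10, 4, 10)

print(f"old {float(old):.1f}:1, new {float(new):.1f}:1")
```

With these assumed port counts the ratio grows from 2.4:1 to 12:1, which is why the uplinks need help from DCB.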

Multi-site recommendations

Don’t even think about L2 data center interconnect. Yet again, if you did follow my advice and implemented load balancing and scale-out architecture, you don’t need L2 DCI. IP has been proven to work just fine between sites and there’s no reason you should try to reinvent a wheel that was demonstrated to be broken 20 years ago.

Build your Data Center Interconnects (DCI) with MPLS. Deploying MPLS does require a new set of skills, so it might go against the keep-it-simple recommendation, but it gives you flexibility. You can deploy IP routing, layer-3 VPNs (to keep security zones separated across the DCI link) or layer-2 VPNs (either VPLS or the upcoming MAC VPN from Juniper) across the MPLS infrastructure as needed.

Security, Logging, Monitoring

Consider monitoring and security in the design and build phases. If feasible, use separate cabling for out-of-band management and monitoring, including console access (use terminal servers for remote console access). High-end devices have dedicated management ports; use them! At the very minimum, dedicate a VLAN and an L3 subnet exclusively for network management purposes.

Don't forget to consider physical security and system security. Deploying IP cameras and recording equipment is worthwhile.

Log everything (and use NTP to synchronize clocks). Make sure you consider a Logging & Compliance Management system that collects logs from everything – including Windows, Unix, Storage, Firewalls and Mainframe – and then analyzes them. Learn how to use a Security Event Manager to collate these logs.
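
A small illustration of why synchronized clocks matter when you collate logs: the sketch below merges per-device logs into a single timeline purely by timestamp, which only produces a meaningful order if every device agrees on the time. The log line format, device names and messages are made up for the example.

```python
import heapq
from datetime import datetime

def parse(line, source):
    """Split 'ISO-8601-timestamp message' into a sortable record."""
    stamp, message = line.split(" ", 1)
    return datetime.fromisoformat(stamp), source, message

def merge_logs(sources):
    """Merge per-device logs (each already in time order) into one timeline."""
    streams = [[parse(line, name) for line in lines]
               for name, lines in sources.items()]
    return list(heapq.merge(*streams))

# Hypothetical devices and log lines.
sources = {
    "firewall": ["2011-08-01T13:50:01+00:00 deny tcp/445",
                 "2011-08-01T13:50:05+00:00 permit tcp/80"],
    "web1": ["2011-08-01T13:50:03+00:00 GET /index.html 200"],
}

for stamp, source, message in merge_logs(sources):
    print(stamp.isoformat(), source, message)
```

If web1's clock were a few seconds off, its entry would land on the wrong side of the firewall events and the timeline would tell a different story.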

Use as much stateless firewalling as possible. Having a stateful firewall that does nothing else but permit TCP sessions to port 80 (HTTP) is a waste.

Your firewalling needs also depend on your applications. If you use applications with clean HTTP-based architecture, you’ll do just fine with packet filters. If the application uses RPC calls with dynamic server port numbers, you’re in trouble.
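
To show the difference between the two cases, here is a toy stateless packet filter in Python. Every packet is judged on its own header fields, with no session table; the rule matching on the TCP ACK flag is similar in spirit to the "established" keyword found in classic router ACLs. The class names, fields and rule set are assumptions made for this sketch, not any vendor's configuration model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Packet:
    protocol: str            # "tcp" or "udp"
    dst_port: int
    tcp_ack: bool = False    # ACK flag set, so not the first packet of a session

@dataclass(frozen=True)
class Rule:
    action: str                      # "permit" or "deny"
    protocol: str
    dst_port: Optional[int] = None   # None matches any port
    require_ack: bool = False        # match only packets with ACK set

    def matches(self, pkt):
        return (pkt.protocol == self.protocol
                and (self.dst_port is None or pkt.dst_port == self.dst_port)
                and (not self.require_ack or pkt.tcp_ack))

def evaluate(rules, pkt):
    """First matching rule wins; anything unmatched is denied."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.action
    return "deny"

# Hypothetical inbound policy for a web tier.
INBOUND = [
    Rule("permit", "tcp", 80),                 # HTTP to the web servers
    Rule("permit", "tcp", require_ack=True),   # replies to sessions opened from inside
]

print(evaluate(INBOUND, Packet("tcp", 80)))      # new HTTP session
print(evaluate(INBOUND, Packet("tcp", 49152)))   # new session to a dynamic port
```

The HTTP session is permitted by a single static rule, while the new session to a dynamically assigned port is denied because no static rule can anticipate the port number.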

If you must use firewalls, use products that support multiple logical (virtual) firewalls in a single chassis. As your data center changes, you will be able to create new logical firewall instances instead of buying new hardware devices.

Use Web Application Firewalls. Traditional firewalls and IDS/IPS devices cannot protect you against the majority of today’s threats – application-level intrusions like SQL injection. If you want to protect your web applications, you need a device that can reassemble HTTP requests, do a deep inspection and reject anything that looks suspicious.

Storage

Use IP everywhere. Unless you have legacy Fibre Channel gear or huge storage requirements where the FC management tools would make sense, go with iSCSI or NFS. Choose storage devices that support both. Use a separate VLAN for storage traffic; you might want to build a dedicated network to handle it. Use a dedicated 802.1p priority class for iSCSI/NFS traffic and PFC for lossless transport (lossless transport significantly improves performance of high-volume TCP traffic like iSCSI or NFS).

Use Small Storage Arrays (by Greg Ferro). The storage industry has a lot of new technology coming and storage technology is finally changing more quickly. I would advise buying the smallest storage arrays you feel will work and then planning for regular hardware updates or new systems. The shift from 3.5" FCAL drives to SAS 2.5" drives at 7200 RPM means less power and better performance, the impact of SSD drives is only just being delivered in new storage products, and software developments for deduplication and block recovery are moving from the high end into standard products.

Physical infrastructure (by Greg Ferro)

Install only the cabling you need. The transition from 1GbE to 10GbE, 40GbE and 100GbE means big changes for cabling. 10GbE over multimode needs one OM3 fiber pair, 40GbE needs eight cores, 100GbE needs twenty cores. The MPO connector can only be assembled in the factory. Therefore, running twenty- or hundred-core cabling is a waste of time.

Discussion continues over the use of OM4 or even OM5 multimode for future Ethernet standards, but most likely single mode will be more common. All this uncertainty means you should install only the minimum cabling you really need and plan to use modular cabling systems into the future.
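
A quick way to see how fast the core counts add up is to total them for a planned set of links. The per-link figures below are the multimode numbers quoted above; the link mixes are hypothetical.

```python
# Multimode fibers per link: one pair for 10GbE, 8 cores for 40GbE, 20 for 100GbE.
FIBERS_PER_LINK = {"10GbE": 2, "40GbE": 8, "100GbE": 20}

def fibers_needed(links):
    """Total fiber count for a dict of {speed: number_of_links}."""
    return sum(FIBERS_PER_LINK[speed] * count for speed, count in links.items())

# Hypothetical uplink plans for one row of racks.
today = {"10GbE": 16}
later = {"10GbE": 8, "40GbE": 8}

print(fibers_needed(today), fibers_needed(later))
```

In this made-up plan, moving half the uplinks to 40GbE takes the requirement from 32 to 80 fibers, on different connectors, which is hard to pre-cable for sensibly.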

Go Bare Floor. The days of the raised flooring are over. The weight of a server rack with four chassis or two large switches installed usually means reinforced flooring which wastes time and space. Investigate cooling designs (Yahoo, Facebook/OpenCompute) that allow for direct floor use such as hot/cold aisles and air flow containment and overhead cabling trays for power and cabling.

The final advice

Last, but definitely not least, whatever you do – keep it simple. Choose the technologies that your team can support (or make sure they get properly trained). You will want to take a vacation at least once in the next decade and the guy that gets the support call at 1AM has to be able to solve the problem on his own without waking you up.

Acknowledgements

This document has been reviewed and greatly improved by (in alphabetical order) Dan Hughes (@rovingengineer), Greg Ferro (@etherealmind), Jeremy Filliben, Kurt Bales (@networkjanitor), Matthew Norwood (@matthewnorwood) and Tom Hollingsworth (@networkingnerd):

  • Kurt Bales suggested using IPv6 on the inside networks (a strategy also favored by Tore Anderson), MPLS on the DCI infrastructure and WAFs;
  • Matthew Norwood suggested out-of-band monitoring & management;
  • Greg Ferro made extensive remarks on cabling issues (I’ve added a few links to his excellent blog posts) and flooring, recommended virtual firewalls and small storage arrays, and mentioned logging requirements;
  • Jeremy Filliben mentioned 10GbE oversubscription problems (causing me to include DCB as one of the recommendations);
  • Tom Hollingsworth made the fantastic islands-in-the-ocean analogy;
  • I had a great discussion with Dan Hughes regarding external routing. This topic deserves a blog post of its own.

Thank you all for very quick and thoughtful responses!

Even more information

You’ll find big-picture perspective as well as in-depth discussions of various data center technologies in my webinars: Data Center 3.0 for Networking Engineers (recording), Data Center Interconnects (recording) and VMware Networking Deep Dive (recording or live session). All three webinars are also available as part of the yearly subscription.

12 comments:

  1. Anonymous Coward 01 August, 2011 13:50

    Hi Ivan,

    I just can't seem to understand your trepidation around layer 2 interconnects. Yes, I agree layer 3 has its benefits; however, with STP mitigation through VPLS, MPLS techniques or TRILL/FabricPath I cannot accept your argument that appropriately designed L2 DCI has no place in a well designed data centre. It is very rare that you walk into an enterprise that does not have a multitude of L2 requirements, and saying "I like layer 3" does not form a valid business case to migrate away from these legacy requirements.
    I'm not sure what bridging issues you refer to after STP. Yes, we need to be cognizant of MAC table sizes, but not all switches need to learn MACs in a well designed layer 2 network. Broadcast domains are something to keep in mind, but I would say that with modern NICs, application behavior and bandwidths we would be separating hosts for security isolation purposes before we hit any broadcast issues these days. As for fault isolation, after the STP issue we basically fall back to broadcast storms from dodgy hosts, but most modern switches offer mechanisms to deal with this.
    Keep in mind that I'm approaching this from an enterprise environment perspective, and that I agree that layer 3 should be the default position. But layer 2 DCI often forms part of the puzzle and with very good results.
    So why so down on layer 2?
    And yes, I have seen L2 meltdowns in my time, but every one was the result of poor design, which cannot be solved with any interconnect solution. :-P

  2. Working in the telco space, thankfully with an MPLS network. I've seen first hand what extended layer 2 networks can do.

    1) That L2 broadcast storm just took down all your sites rather than just one.
    2) Someone digs up your L2 interconnect and you now have devices at each site that expect to see other servers in their LAN that have just gone away.

    I could go on...

  3. Anonymous Coward 01 August, 2011 17:03

    Enterprise space mainly... But many SPs I'm aware of successfully implement large layer 2's
    1)Broadcast storm control - although I'm not suggesting you completely ignore broadcast domain sizing, we are simply talking about extending them.
    2) path diversity and careful review of split brain failure modes.

    Please go on... That's what I'm interested in.

  4. In terms of iSCSI/NFS, some are still under the impression that FC is a faster protocol than iSCSI. It's not. Even without jumbo frames, iSCSI is competitive with FC.

    The storage array itself makes much more of a difference in what your performance will be (number of spindles, caching, SSD, etc.)

  7. NotSoAnonymous 02 August, 2011 00:23

    "Overhead cable trays", "Raised Flooring"....Are the days for patch panels over!!!??? :'(

    I thought patch panels were the easiest and cleanest cabling solution! Maybe expensive, but it's only a one-time cost.

  8. I'll tackle this from the other end.

    Of course you can do this, but the point is you end up with another L3. When you have a system that was designed to do this, has been field-proven for 20 years and is familiar to everyone, why do you want to reinvent the wheel again?

    Do you live in the dark ages and still run non-IP protocols? Or a gem like MS NLB? Tell me, what problem do you solve on L2 that you can't on L3?

    I mean, why would anyone sane want to "migrate away from these legacy requirements", given the option of building DC from scratch? I guess you answered your own question.

  9. Anonymous Coward 02 August, 2011 14:02

    Still no answer there, I'm afraid. Why would I want a layer 2 DCI? For the same reasons that layer 2 DCI is so common, and so much work has gone into making it scalable:
    - Stretched clusters.
    - VM mobility. Yes, there are lots of complex ways to make this work, and yes, we are talking about geographically close DCs here. But we're talking DCI, not DR, and enterprise. Until LISP picks up pace, L2 is going to be the go-to tool here for some time.
    - Transporting FCoE. Yes, people have invested lots of money in FC, and no, iSCSI and NFS do not offer feature parity and simple integration the way FCoE does. And before we start flaming FCoE as well, keep in mind that all the major vendors are getting into this game for a reason: it is making realistic headway as an FC replacement, unlike all its predecessors.

    With regards to the "green fields DC", I have been involved in many a new DC build, and not one of them started with throwing out all the applications and operating practices that have been with an organization since its inception. Not every app is a web app.

    I'm not interested in starting a religious debate here, I implement both L2 and L3 designs, which one depends on the specific requirements at hand - I also own a mac and pc, and use a windows phone yet am typing this on my iPad :)

    Prior to TRILL, I was bridging the divides with MPLS/VPLS; now that it's here, that's a complexity I can avoid in my DCI.

    So the question stands. Why not layer 2?

    Oh, and I know of lots of people that have been sending letters for 20 years and are familiar with it, but e-mail is just sooo much easier. :-P

  10. Ivan Pepelnjak 02 August, 2011 14:16

    Stretched cluster is a patently bad idea: http://blog.ioshints.info/2011/06/stretched-clusters-almost-as-good-as.html. You're betting two data centers on the availability of DCI link.

    Inter-DC VM mobility is a bogus requirement. If you have a scale-out application, you don't need it; if you have a non-scalable application, you can't achieve high availability anyway.

    FCoE cannot be transported across anything else but dark fiber (due to lack of PFC) and even there it's highly limited. http://blog.ioshints.info/2010/11/fcoe-between-data-centers-forget-it.html

    In my personal opinion, the L2 DCI requirements almost always come from someone who tries to offload his problems and/or bad decisions to the network.

  11. Anonymous Coward 02 August, 2011 14:44

    Ok.. So make your DCI available: use diverse physical paths. And if the DCI does split, make sure one side wins. That fixes your stretched cluster concerns. Remember why we stretch these clusters: it's easier.

    VM mobility is a bogus requirement????? Are we living in the same world? I'm not suggesting that vMotion will solve my availability requirements, but if my storage is replicated, particularly synchronously, then getting that VM back online quick smart at my second DC is a very positive attribute (and in my view tools like SRM are a pain in the butt in reality). And aside from that, people like it, it's flexible, it's easy, it gives you options. Oh, and I'm glad to hear that you have the luxury of re-architecting every app to match your DCI strategy.

    And as far as FCoE is concerned, if I was replacing my FC SAN which is replicating natively (not via a storage routing solution) then guess what I have: dark fibre, but now I have a single set of switches to manage. Yes, there is still a ways to go, and no, it won't fix every problem, but it works.

    And to the last point, that's what networks have been doing for years: providing solutions to problems to enable the deployment of tools that enable the enterprise. We build networks to support apps, not the other way round, and from experience, that's the way hierarchy works in most organizations.

    Still. Why no layer 2? Beyond all the other points that have been raised, I have a genuine curiosity about the problems that others perceive.

    My view of L2 vs. L3 is the same as for many other features and functions we have in modern kit (FCoE, iSCSI, GRE, DMVPN, OSPF, BGP, HSRP, NAT, etc.): use the tool that makes sense for the problem at hand. That includes consideration of technical, organizational and financial constraints. I think that saying never layer 2 is just as short-sighted as saying always L2.

  12. Any thoughts on the pros and cons of using dedicated backup networks and in-band management (e.g. RDP, SSH) networks that connect into each server? I've seen it done in several data centres, but the major downside is that static routes then need to be created on the server. It feels like a leftover from the days of 100Mbit/s networks where bandwidth was an issue, but nevertheless it still seems like a popular design choice. Interested in how backup and in-band management should be designed into a greenfield data centre.

Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.