Category: design
Building Carrier-Grade Cloud Infrastructure
During one of my SDN workshops, an attendee asked me: “How do you build carrier-grade (5 nines) cloud infrastructure with VMware NSX?”
Short answer: You don’t… and it’s the wrong question anyway.
Designing Active-Active and Disaster Recovery Data Centers
A year ago I was a firm believer in the unlimited powers of Software-Defined Data Centers and their ability to simplify workload migrations. After all, if you can use an API to create any data center object, what’s stopping you from moving a workload running in one data center to another location?
As always, there’s a huge difference between theory and reality.
How Complex Is Your Data Center?
Sometimes it seems like the networking vendors try to (A) create solutions in search of problems, (B) boil the ocean, (C) solve the scalability problems of Google or Amazon instead of focusing on real-life scenarios or (D) all of the above.
Bryan Stiekes from HP decided to take a step in the right direction: let’s ask the customers how complex their data centers really are. He created a data center complexity survey and promised to share the results with me (and you), so please do spend a few minutes of your time filling it in. Thank you!
Private and Public Clouds, and the Mistakes You Can Make
A few days ago I had a nice chat with Christoph Jaggi about private and public clouds, and the mistakes you can make when building a private cloud – the topics we’ll be discussing in the Designing Infrastructure for Private Clouds workshop @ Data Center Day in Berne in mid-September.
The German version of our talk has been published on Inside-IT; those of you not fluent in German will find the English version below.
Cumulus Linux Data Center Architectures
After introducing the concepts of Cumulus Linux in the Data Center Fabrics update session, Dinesh Dutt described the typical data center architectures implemented with Cumulus Linux and the lessons everyone should learn from large-scale web properties.
Can You Avoid Networking Software Bugs?
One of my readers sent me an interesting reliability design question. It all started with a catastrophic WAN failure:
Once a particular volume of encrypted traffic was reached, the data center WAN edge router crashed, and then the backup router took over, which also crashed. The traffic then failed over to the second DC, and you can guess what happened then...
Obviously they’re now trying to redesign the network to avoid such failures.
Save the Date: Designing Infrastructure for Private Clouds Workshop in Switzerland
Gabi Gerber (the wonderful mastermind behind the Data Center Day event) is helping me bring my Designing Infrastructure for Private Clouds workshop (one of the best Interop 2015 workshops) to Switzerland.
This is the only cloud design workshop I’m running in Europe in 2015. If you’d like to attend it, this is your only chance – register NOW.
So You Need ISSU on Your ToR Switch? Really?
During the Cumulus Linux presentation Dinesh Dutt gave in the Data Center Fabrics webinar, someone asked an unexpected question: “Do you have In-Service Software Upgrade (ISSU) on Cumulus Linux?” and we both went like “What? Why?”
Dinesh is an honest engineer and answered: “No, we don’t do it” with absolutely no hesitation, but we both kept wondering, “Why exactly would you want to do that?”
Case Study: Scale-Out Cloud Infrastructure
I helped several customers design scale-out private or public cloud infrastructure. In every case, I tried to start with a reasonably small pod (sized based on what they’d consider an acceptable loss unit – another great term I inherited from Chris Young), connect the pods to a shared L3 backbone (either within a data center or across multiple data centers), and then address the inevitable desire for stretched layer-2 connectivity.
You’ll find a summary of these designs in my next ExpertExpress case study: Scale-Out Private Cloud Infrastructure, and if you need more details, I’m usually available for online consulting.
How Do I Start My IPv6 Addressing Plan?
One of my readers was reading the Preparing an IPv6 Addressing Plan document on the RIPE web site and found that it proposes two approaches to IPv6 addressing: encode the location in the high-order bits and the subnet type in the low-order bits (the traditional approach), or encode the subnet type in the high-order bits and the location in the low-order bits (totally counterintuitive to most networking engineers). His obvious question was: “Is anyone using type-first addressing in a production network?”
The Terastream project seems to be using the type-first (service-first) format; if you’re doing something similar, please leave a comment!
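To make the two layouts tangible, here’s a minimal Python sketch. The /48 prefix, the 4-bit field sizes, and the location/service values are all made-up assumptions for illustration, not taken from the RIPE document or Terastream.

```python
import ipaddress

# Hypothetical plan: a /48 with two 4-bit fields in bits 48-55,
# yielding one /56 per (location, type) combination. All values
# below are illustrative assumptions.
PREFIX = ipaddress.IPv6Network("2001:db8:1234::/48")

def subnet(first: int, second: int) -> ipaddress.IPv6Network:
    """Place 'first' in bits 48-51 and 'second' in bits 52-55."""
    base = int(PREFIX.network_address)
    value = base | (first << (128 - 52)) | (second << (128 - 56))
    return ipaddress.IPv6Network((value, 56))

LOCATION, SERVICE = 0x3, 0x5

# Traditional: location first -> per-site route aggregates
print(subnet(LOCATION, SERVICE))  # 2001:db8:1234:3500::/56

# Type-first: service first -> per-service aggregates
print(subnet(SERVICE, LOCATION))  # 2001:db8:1234:5300::/56
```

The type-first layout looks odd to routing-focused engineers (you lose per-site route aggregation), but it lets you match every subnet of one service type with a single prefix in a security policy, which is the usual argument for it.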
Design Challenge: Multiple Data Centers Connected with Slow Links
One of my readers sent me this question:
What is best practice to get a copy of the VM image from DC1 to DC2 for DR when you have subrate (155 Mbps in my case) Metro Ethernet services between DC1 and DC2?
The slow link between the data centers effectively rules out any ideas of live VM migration; to figure out what you should be doing, you have to focus on business needs.
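A quick back-of-the-envelope calculation illustrates the problem; the image size and usable-throughput figures are assumptions for the sake of the example, not numbers from the reader’s environment:

```python
# How long does a single VM image transfer take over a 155 Mbps link?
# Image size and efficiency are illustrative assumptions.
link_bps = 155e6        # subrate Metro Ethernet service
efficiency = 0.8        # assume ~80% of line rate is usable (TCP, overhead)
image_bytes = 100e9     # a hypothetical 100 GB VM image

hours = image_bytes * 8 / (link_bps * efficiency) / 3600
print(f"{hours:.1f} hours per image")  # ~1.8 hours
```

At almost two hours per full image, you’d obviously replicate incremental changes (snapshots or changed blocks) rather than shipping whole images around – and the achievable recovery point objective becomes a business discussion, not a technology one.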
Last Chapter of Data Center Design Case Studies Is Published
A few days ago I completed the last chapter in the Data Center Design Case Studies book: building disaster recovery and active-active data centers. It focuses on application behavior and business needs, not on the underlying technologies; the networking technology part tends to be way easier to solve than the oft-ignored application-level challenges.
Case Study: Combine Physical and Virtual Appliances in a Private Cloud
Cloud builders often use my ExpertExpress service to validate their designs. Tenant onboarding into a multi-tenant (private or public) cloud infrastructure is a common problem, and tenants frequently want to retain their existing network services appliances (firewalls and load balancers).
The Combine Physical and Virtual Appliances in a Private Cloud case study describes a typical solution that combines per-tenant virtual appliances with frontend physical appliances.
Latency: the Killer of Spread-Out Application Stack Ideas
A few months ago I described how bandwidth limitations shatter the dreams of spread-out application stacks with elements residing (or being dynamically migrated) between data centers. Today let’s focus on bandwidth’s ugly cousin: latency.
TL&DR Summary: Spreading the server components of an application across multiple locations (multiple data centers or hybrid cloud deployments) can easily result in dismal performance even when there’s plenty of bandwidth available.
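Here’s the arithmetic that kills these designs; the RTT figures and the query count are illustrative assumptions:

```python
# Chatty application tiers: the same transaction, local vs. stretched.
# RTT values and the query count are illustrative assumptions.
rtt_local_ms = 0.2    # typical round-trip time within one data center
rtt_dci_ms = 10.0     # round-trip time between two metro data centers

queries = 200         # sequential DB queries needed to render one page

print(f"local : {queries * rtt_local_ms:7.0f} ms")  #   40 ms
print(f"remote: {queries * rtt_dci_ms:7.0f} ms")    # 2000 ms
```

Adding bandwidth changes nothing in this equation; the only fixes are fewer round trips (less chatty application protocols) or keeping the chatty components in the same location.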
Facebook Next-Generation Fabric
Facebook published their next-generation data center architecture a few weeks ago, resulting in the expected “revolutionary approach to data center fabrics” echoes from the industry press and blogosphere.
In reality, they did a great engineering job with an interesting twist on a pretty traditional multi-stage leaf-and-spine (folded Clos) architecture.
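If you want a feel for the underlying math, here’s a generic two-stage folded Clos sizing sketch; the port counts are hypothetical and have nothing to do with Facebook’s actual hardware:

```python
# Generic leaf-and-spine (folded Clos) sizing -- hypothetical port
# counts, not Facebook's actual design parameters.
leaf_server_ports = 48   # server-facing ports per leaf switch
leaf_uplinks = 6         # uplinks per leaf = number of spine switches
spine_ports = 32         # ports per spine switch

spines = leaf_uplinks                   # one uplink to every spine
max_leaves = spine_ports                # one spine port per leaf
server_ports = max_leaves * leaf_server_ports
oversubscription = leaf_server_ports / leaf_uplinks

print(f"{spines} spines, {max_leaves} leaves, {server_ports} server ports, "
      f"{oversubscription:.0f}:1 oversubscribed")
```

Scaling beyond the port count of a single spine switch means adding another stage of switches, which is essentially what Facebook’s multi-stage fabric (like every other very large Clos design) does.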