Category: high availability

Worth Reading: When Stretching Layer Two, Separate Your Fate

Ethan Banks wrote the best one-line description of the crazy stuff we have to deal with in his When Stretching Layer Two, Separate Your Fate blog post:

No application should be tightly coupled to an IP address. This common issue should really be solved by application architects rebuilding the app properly instead of continuing like it’s 1999 while screaming YOLO.

Not that his (or my) take on indisputable facts would change anything… At least we can still enjoy a good rant ;)

Worth Reading: Understand Your Single Points of Failure

I’ve been saying the same thing for years, but never as succinctly as Alastair Cooke did in his Understand Your Single Points of Failure (SPOF) blog post:

The problem is that each time we eliminated a SPOF, we at least doubled our cost and complexity. The additional cost and complexity are precisely why we may choose to leave a SPOF; eliminating the SPOF may be more expensive than an outage cost due to the SPOF.

Obviously that assumes that you’re able to follow business objectives and not some artificial measure like uptime. Speaking of artificial measures, you might like the discussion about taxonomy of indecision.
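
The tradeoff in the quote above can be sketched as a back-of-the-envelope calculation. This is only an illustration: the function names and every number below are hypothetical, and a real analysis would also have to price the extra complexity that redundancy brings.

```python
def expected_annual_outage_cost(outages_per_year: float,
                                hours_per_outage: float,
                                cost_per_hour: float) -> float:
    """Expected yearly loss caused by a single point of failure."""
    return outages_per_year * hours_per_outage * cost_per_hour


def worth_eliminating(spof_outage_cost: float,
                      redundancy_cost_per_year: float) -> bool:
    """Removing the SPOF pays off only if redundancy costs less than the outages."""
    return redundancy_cost_per_year < spof_outage_cost


# Hypothetical numbers: one 4-hour outage every two years at 10,000 per hour.
outage_cost = expected_annual_outage_cost(0.5, 4, 10_000)
print(outage_cost)                             # 20000.0
print(worth_eliminating(outage_cost, 50_000))  # False: keep the SPOF
```

With these made-up inputs, spending 50,000 a year on redundancy costs more than the outages it prevents, which is exactly the case where leaving the SPOF in place is the rational choice.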

Impact of Azure Subnets on High Availability Designs

Now that you know all about regions and availability zones (AZ) and the ways AWS and Azure implement subnets, let’s get to the crux of the original question Daniel Dib sent me:

As I understand it, subnets in Azure span availability zones. Do you see any drawback to this? You mentioned that it’s difficult to create application swimlanes that way. But does subnet matter if your VMs are in different AZs?

It’s time I explain the concepts of application swimlanes and how they apply to availability zones in public clouds.

MUST READ: Designing a Simple Disaster Recovery Solution

A few weeks ago Adrian Giacometti described a no-stretched-VLANs disaster recovery design he used for one of his customers.

The blog post and related LinkedIn posts generated tons of comments (and objections from the usual suspects), prompting Adrian to write a sequel describing the design requirements he was facing, the tradeoffs he made, and the interactions between server and networking teams needed to make it happen.

State Consistency in Distributed SDN Controller Clusters

Why Can't We Have Good Things Like Partition-Resilient SDN Controllers

Every now and then I get a question along the lines of “why can’t we have a distributed SDN controller (because resiliency) that would survive network partitioning?” This time, it’s not the incompetency of solution architects or programmers, but the fundamental limitations of what can be done when you want to have consistent state across a distributed system.

TL&DR: If your first thought was CAP Theorem you’re absolutely right. You can probably stop reading right now. If you have no idea what I’m talking about, maybe it’s time you get fluent in distributed systems concepts after you’re finished with this blog post and all the reference material linked in it. Don’t know where to start? I put together a list of resources I found useful.
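
To make the limitation concrete, here is a minimal sketch of one common way a consistency-first cluster behaves during a partition: only the side that can still reach a strict majority of nodes accepts state changes. The function names are hypothetical, and this is not how any particular SDN controller is implemented.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True when this side of the partition sees a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


def handle_write(reachable_nodes: int, cluster_size: int) -> str:
    """Consistency-first behavior: refuse writes without a majority."""
    if has_quorum(reachable_nodes, cluster_size):
        return "accepted"
    return "rejected"


# A three-node cluster split 2/1: only the two-node side keeps working.
print(handle_write(2, 3))  # accepted
print(handle_write(1, 3))  # rejected

# A four-node cluster split 2/2: neither side has a majority.
print(handle_write(2, 4))  # rejected
```

The minority side gives up availability to stay consistent. Letting both sides accept writes would keep them available, but their state could diverge, which is the tradeoff the CAP theorem describes.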

MUST READ: Fast and Simple Disaster Recovery Solution

More than a year ago I was enjoying a cool beer with my friend Nicola Modena who started explaining how he solved the “you don’t need IP address renumbering for disaster recovery” conundrum with production and standby VRFs. All it takes to flip the two is a few changes in import/export route targets.

I asked Nicola to write about his design, but he’s too busy doing useful stuff. Fortunately he’s not the only one using a common-sense approach to disaster recovery designs (as opposed to flat earth vendor marketectures). Adrian Giacometti used a very similar design with one of his customers and documented it in a blog post.
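
The idea of flipping production and standby VRFs with route targets can be modeled in a few lines. This is a toy sketch, not Nicola's or Adrian's actual design: the `Vrf` class, the route target values, the VRF names, and the prefix are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vrf:
    name: str
    prefixes: list[str]
    export_rts: set[str] = field(default_factory=set)


def imported_routes(import_rts: set[str], vrfs: list[Vrf]) -> dict[str, str]:
    """Map each visible prefix to the VRF that exported it with a matching RT."""
    routes: dict[str, str] = {}
    for vrf in vrfs:
        if vrf.export_rts & import_rts:
            for prefix in vrf.prefixes:
                routes[prefix] = vrf.name
    return routes


def flip(production: Vrf, standby: Vrf) -> None:
    """Swap export route targets, turning the standby VRF into production."""
    production.export_rts, standby.export_rts = standby.export_rts, production.export_rts


# Both sites use the same prefix; only the exported route target differs.
primary = Vrf("primary-dc", ["10.1.0.0/16"], {"65000:100"})  # hypothetical production RT
dr_site = Vrf("dr-site", ["10.1.0.0/16"], {"65000:200"})     # hypothetical standby RT
user_imports = {"65000:100"}  # the rest of the network imports only the production RT

print(imported_routes(user_imports, [primary, dr_site]))  # {'10.1.0.0/16': 'primary-dc'}
flip(primary, dr_site)
print(imported_routes(user_imports, [primary, dr_site]))  # {'10.1.0.0/16': 'dr-site'}
```

The servers keep their IP addresses throughout. The only thing that changes is which site's copy of the prefix the rest of the network imports.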

Repost: VMware Fault Tolerance Woes

I always claimed that VMware Fault Tolerance makes no sense. After all, the only thing it does is protect a VM against a server hardware failure… in a world where software crashes are way more common, and fat fingers cause most of the outages.

But wait, it gets worse: the whole thing is incredibly complex. You might like the description Minh Ha left as a comment to my Fifty Shades of High Availability blog post.

Interesting: Differential Availability

Someone pointed me to a high-level overview of Google’s Spanner database which included this gem:

A second refinement is that there are many other sources of outages, some of which take out the users in addition to Spanner (“fate sharing”). We actually care about the differential availability, in which the user is up (and making a request) to notice that Spanner is down. This number is strictly higher (more available) than Spanner’s actual availability — that is, you have to hear the tree fall to count it as a problem.

In other words, it doesn’t matter if your distributed database fails if its users are also gone. Keep this concept in mind every time you’re designing a high availability solution – some corner cases are simply not worth solving.
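
Here is a small sketch of the difference between raw and differential availability, assuming per-minute samples of whether the user and the service were up. The sample data and function names are hypothetical and only illustrate the definition in the quote.

```python
def raw_availability(samples: list[tuple[bool, bool]]) -> float:
    """Fraction of all sampled minutes in which the service was up."""
    return sum(1 for _, service_up in samples if service_up) / len(samples)


def differential_availability(samples: list[tuple[bool, bool]]) -> float:
    """Fraction of minutes with the service up, counting only minutes the user was up."""
    seen_by_user = [service_up for user_up, service_up in samples if user_up]
    return sum(seen_by_user) / len(seen_by_user)


# Ten one-minute samples. Minutes 3 and 4 are a shared outage that takes out
# both the user and the service; minute 7 is a service-only failure.
service_down = {3, 4, 7}
user_down = {3, 4}
samples = [(m not in user_down, m not in service_down) for m in range(10)]

print(raw_availability(samples))           # 0.7
print(differential_availability(samples))  # 0.875
```

The shared outage counts against raw availability but not against differential availability, because nobody was around to hear that particular tree fall.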

Fifty Shades of High Availability

A while ago we had an interesting exchange of ideas around inserting a high-availability network appliance into a public cloud environment (TL&DR: it was really hard until AWS introduced Gateway Load Balancing), and someone quickly pointed out we’re solving the wrong challenge because…

Azure Firewall […] is a fully stateful firewall-as-a-service with built-in high-availability.

Somehow he wasn’t too happy when I pointed out that there’s more to high availability than vendor marketing ;)

Are Business Needs Just Excuses for Vendor Shenanigans?

Every now and then I call someone’s baby ugly (or maybe it was their third cousin’s baby and they nonetheless feel offended). In such cases a common resort is to cite business or market needs to prove how ignorant and clueless I am. Here’s a sample LinkedIn comment talking about my ignorance about the need for smart NICs:

The rise of custom silicon by Presando [sic], Mellanox, Amazon, Intel and others confirms there is a real market need.

Now let’s get something straight: while there are good reasons to use tons of different things that might look inappropriate, irrelevant or plain stupid to an outsider, I don’t believe in the “real market need” argument being used to justify anything without supporting technical facts (tell me why you need that stuff and prove to me that using it is the best way of solving a problem).

Disaster Recovery: a Vendor Marketing Tale

Several engineers formerly working for a large virtualization vendor were pretty upset with me when I claimed that the virtualization consultants promote “disaster recovery using stretched VLANs” designs instead of alternatives that would implement proper separation of failure domains.

Guess what… it’s even worse than I thought.

Here’s a sequence of comments I received after reposting one of my “disaster recovery doesn’t need stretched VLANs” blog posts on LinkedIn sometime in late 2019:

read more

Bridging Loops in Disaster Recovery Designs

One of the readers commenting on the ideas in my Disaster Recovery and Failure Domains blog post effectively said “In an active/passive DR scenario, having L3 DCI separation doesn’t protect you from STP loop/flood in your active DC, so why do you care?”

He’s absolutely right - if you have a cold disaster recovery site, it doesn’t matter if it’s bombarded by a gazillion flooded packets per second… but how often do you have a cold recovery site?

MUST READ: Meaningful Availability

Defining service availability using the famous X nines (and all the hacks like “planned downtime doesn’t count”) is pretty useless in a highly distributed system where the only thing that really matters is the user experience, not ping response times. One should ask what precisely we should be measuring, and how we can make sure we can act on the measurements.

More details in a concise analysis of the Meaningful Availability paper by the one-and-only The Morning Paper.
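
For reference, this is what the “X nines” metric translates to when treated purely as a time-based budget. The helper name is made up, and the calculation assumes a 365-day year. It says nothing about what users actually experienced during that time, which is the point of the paper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability target of N nines (3 -> 99.9%)."""
    return 10 ** -nines * MINUTES_PER_YEAR


for nines in (2, 3, 4, 5):
    print(f"{nines} nines: {allowed_downtime_minutes(nines):.1f} minutes per year")
```

Three nines allow roughly 526 minutes (almost nine hours) of downtime per year, and five nines a little over five minutes, regardless of whether those minutes fall at 3 AM on a Sunday or in the middle of peak load.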

The Myth of Scaling From On-Premises Data Center into a Public Cloud

Every now and then someone tries to justify the “wisdom” of migrating VMs from an on-premises data center into a public cloud (without renumbering them) with the idea of “scaling out into the public cloud” aka “cloud bursting”. My usual response: this is another vendor marketing myth that works only in PowerPoint.

To be honest, that statement is too harsh. You can easily scale your application into a public cloud assuming that:

read more