design « ipSpace.net blog

Monday, February 10, 2020 08:39 +0100

Be Careful When Using New Features

During a recent workshop I made a comment along the lines “be careful with feature X from vendor Y because it took vendor Z two years to fix all the bugs in a very similar feature”, and someone immediately asked “are you saying it doesn’t work?”

My answer: “I never said that, I just drew inferences from other people’s struggles.”

A Step Back

Networking operating systems are probably some of the most complex pieces of software out there. Distributed systems are hard. Real-time distributed systems are even harder. Real-time distributed systems running on top of eventually-consistent distributed databases are extra fun.

read more add comment

design

Wednesday, February 5, 2020 08:37 +0100

The EVPN/EBGP Saga Continues

Aldrin wrote a well-thought-out comment to my EVPN Dilemma blog post explaining why he thinks it makes sense to use Juniper’s IBGP (EVPN) over EBGP (underlay) design. The only problem I have is that I forcefully disagree with many of his assumptions.

He started with an in-depth explanation of why EBGP over directly-connected interfaces makes little sense:

You're Responsible for Resiliency of Your Public Cloud Deployment

Enterprise environments usually implement “mission-critical” applications by pushing high-availability requirements down the stack until they hit networking… and then blame the networking team when the whole house of cards collapses.

Most public cloud providers are not willing to play the same stupid blame-shifting game - they live or die by their reputation, and maintaining a stable service is their highest priority. They will do their best to implement a robust and resilient infrastructure, but will not do anything that could impact its stability or scalability… including the snake oil the virtualization and networking vendors love to sell to their gullible customers. When you deploy your application workloads into a public cloud, you become responsible for the resiliency of your own application, and there’s no magic button that could allow you to push the problems down the stack.

Public Cloud Cannot Change the Laws of Physics

Listening to public cloud evangelists and marketing departments of vendors selling over-the-cloud networking solutions or multi-cloud orchestration systems, you could start to believe that migrating your workload to a public cloud would solve all your problems… and if you’re gullible enough to listen to them, you’ll get the results you deserve.

Unfortunately, nothing can change the fundamental laws of physics, networking, or application architectures:

read more add comment

Wednesday, January 8, 2020 08:00 +0100

Do We Need Math in Networking?

I should have known better, but I couldn’t resist being pulled into a Twitter spat around the question “whether networking engineers need to know something about math” a long while ago.

Before going into the details, let’s start with Wikipedia definition: “Engineering is the use of scientific principles to design and build machines, structures, and other things” including “specific emphasis on particular areas of applied mathematics, applied science, and types of application”.

So feel free to believe that you don’t need any math or other science (because there’s very little science behind what we do in networking) in your job, in which case you might want to stop reading… but then at least please think twice about your job title.

Figure Out What Problem You’re Trying to Solve

A long while ago I got into an hilarious Tweetfest (note to self: don’t… not that I would ever listen) starting with:

Which feature and which Cisco router for layer2 extension over internet 100Mbps with 1500 Bytes MTU

The knee-jerk reaction was obvious: OMG, not again. The ugly ghost of BRouters (or is it RBridges or WAN Extenders?) has awoken. The best reply in this category was definitely:

I cannot fathom the conversation where this was a legitimate design option. May the odds forever be in your favor.

A dozen “this is a dumpster fire” tweets later the problem was rephrased as:

read more add comment

Thursday, December 12, 2019 08:54 +0100

You Don't Need IP Renumbering for Disaster Recovery

This is a common objection I get when trying to persuade network architects they don’t need stretched VLANs (and IP subnets) to implement data center disaster recovery:

Changing IP addresses when activating DR is hard. You’d have to weigh the manageability of stretching L2 and protecting it, with the added complexity of breaking the two sites into separate domains [and subnets]. We all have apps with hardcoded IP’s, outdated IPAM’s, Firewall rules that need updating, etc.

Let’s get one thing straight: when you’re doing disaster recovery there are no live subnets, IP addresses or anything else along those lines. The disaster has struck, and your data center infrastructure is gone.

Disaster Recovery and Failure Domains

One of the responses to my Disaster Recovery Faking blog post focused on failure domains:

What is the difference between supporting L2 stretched between two pods in your DC (which everyone does for seamless vMotion), and having a 30ms link between these two pods because they happen to be in different buildings?

I hope you agree that a single broadcast domain is a single failure domain. If not, let agree to disagree and move on - my life is too short to argue about obvious stuff.

read more add comment

Wednesday, December 4, 2019 08:40 +0100

Tuning BGP Convergence in High-Availability Firewall Cluster Design

Two weeks ago Nicola Modena explained how to design BGP routing to implement resilient high-availability network services architecture. The next step to tackle was obvious: how do you fine-tune convergence times, and how does BGP convergence compare to the more traditional FHRP-based design.

see 2 comments

Saturday, November 30, 2019 08:30 +0100

Worth Reading: Resilience Engineering

Starting with my faking disaster recovery tests blog post Terry Slattery wrote a great article delving into the intricacies of DR testing, types of expected disasters, and resilience engineering. As always, a fantastic read from Terry.

see 2 comments

Tuesday, November 19, 2019 10:51 +0100

Using BGP for Firewall High Availability: Design and Software Upgrades

Remember the “BGP as High Availability Protocol” article Nicola Modena wrote a few months ago? He finally found time to extend it with BGP design considerations and a description of a seamless-and-safe firewall software upgrade procedure.

add comment

Tuesday, November 12, 2019 08:14 +0100

Stretched VLANs and Failing Firewall Clusters

After publishing the Disaster Recovery Faking, Take Two blog post (you might want to read that one before proceeding) I was severely reprimanded by several people with ties to virtualization vendors for blaming virtualization consultants when it was obvious the firewall clusters stretched across two data centers caused the total data center meltdown.

Let’s chase that elephant out of the room first. When you drive too fast on an icy road and crash into a tree who do you blame?

The person who told you it’s perfectly OK to do so;
The tire manufacturer who advertised how safe their tires were?
The tires for failing to ignore the laws of physics;
Yourself for listening to bad advice

For whatever reason some people love to blame the tires ;)

Worth Reading: How Complex Systems Fail

I don’t remember who pointed me to the excellent How Complex Systems Fail document. It’s almost like RFC1925 – I could quote it all day long, and anyone dealing with large mission-critical distributed systems (hint: networks) should read it once a day ;))

Enjoy!

see 2 comments

design

Thursday, October 17, 2019 08:22 +0200

Disaster Recovery Faking, Take Two

An anonymous (for reasons that will be obvious pretty soon) commenter left a gem on my Disaster Recovery Test Faking blog post that is way too valuable to be left hidden and unannotated.

Here’s what he did:

Once I was tasked to do a DR test before handing over the solution to the customer. To simulate the loss of a data center I suggested to physically shutdown all core switches in the active data center.

That’s the right first step: simulate the simplest scenario (total DC failure) in a way that is easy to recover (power on the switches). It worked…

Must read: Shades of Lock-in

Gregor Hohpe published an excellent series on Martin Fowler’s web site focusing on various aspects of lock-in. If nothing else, you SHOULD read the shades of lock-in part, and combine it with my thoughts on lock-in in data center networking.

add comment

design
cloud

Category: design

A Step Back