Blog Posts in January 2020

You're Responsible for Resiliency of Your Public Cloud Deployment

Enterprise environments usually implement “mission-critical” applications by pushing high-availability requirements down the stack until they hit networking… and then blame the networking team when the whole house of cards collapses.

Most public cloud providers are not willing to play the same stupid blame-shifting game - they live or die by their reputation, and maintaining a stable service is their highest priority. They will do their best to implement a robust and resilient infrastructure, but will not do anything that could impact its stability or scalability… including the snake oil the virtualization and networking vendors love to sell to their gullible customers. When you deploy your application workloads into a public cloud, you become responsible for the resiliency of your own application, and there’s no magic button that could allow you to push the problems down the stack.

read more see 1 comments

Master Infrastructure-as-Code and Immutable Infrastructure Principles

Doing the same thing and hoping for a different result is supposedly a definition of insanity… and managing public cloud deployments with an unrepeatable sequence of GUI clicks comes pretty close to it.

Engineers who mastered the art of public cloud deployments realized decades ago that the only way forward is to treat infrastructure in the same way as any other source code:

read more add comment

Fast Failover in SD-WAN Networks

It’s amazing how quickly you get “must have feature Y or it should not be called X” comments coming from vendor engineers the moment you mention something vaguely-defined like SD-WAN.

Here are just two of the claims I got as a response to “BGP with IP-SLA is SD-WAN” trolling I started on LinkedIn based on this blog post:

Key missing features [of your solution]:

  • real time circuit failover (100ms is not real-time)
  • traffic steering (again, 100ms is not real-time)

Let’s get the facts straight: it seems Cisco IOS evaluates route-map statements using track objects in periodic BGP table scan process, so the failover time is on order of 30 seconds plus however long it takes IP SLA to detect the decreased link quality.

read more see 3 comments

Automation Solution: Testing Data Models

If your automation solution relies on a back-end database with strict database schema you can stop reading… but if you (like most others) still live in the land of text files encoded in your favorite presentation format (because it’s hip to hate YAML), you might appreciate the solution Donald Johnson uses to check his data models before committing them into Git repository.

Explore our Network Automation Solutions Showcase to find other solutions created by the attendees of our Building Network Automation Solutions online course.
see 2 comments

Getting More Bang for Your VXLAN Bucks

A little while ago I explained why you can’t use more than 4K VXLAN segments on a ToR switch (at least with most ASICs out there). Does that mean that you’re limited to a total of 4K virtual ethernet segments?

Of course not.

You could implement overlay virtual networks in software (on hypervisors or container hosts), although even there the enterprise products rarely give you more than a few thousand logical switches (to use NSX terminology)… but that’s a product, not technology limitation. Large public cloud providers use the same (or similar) technology to run gazillions of tenant segments.

Want to know more? Watch our NSX, AWS and Azure networking webinars.
read more add comment

Public Cloud Cannot Change the Laws of Physics

Listening to public cloud evangelists and marketing departments of vendors selling over-the-cloud networking solutions or multi-cloud orchestration systems, you could start to believe that migrating your workload to a public cloud would solve all your problems… and if you’re gullible enough to listen to them, you’ll get the results you deserve.

Unfortunately, nothing can change the fundamental laws of physics, networking, or application architectures:

read more add comment

Early Stages of Product Decline

One of the worst things that can happen to anyone selecting equipment for a new network infrastructure is to receive the End-of-Life notice a week after the gear has been deployed in a production network… or maybe it’s even worse to be stuck with a neglected piece of technology full of bugs that the vendor never fixes because they’re chasing other shinier squirrels.

If you’re careful and watch what the vendors are doing, you might be able to save the day and identify the early phases of product decline. Here they are (as seen from the outside) in approximate order:

End of promotion opportunities. In most corporations aggressive hunters fare better than meticulous farmers, and product development is no different. As a friend of mine working for a large corporation once said “The culture here rewards launches instead of steady improvements. Like in academia, publishing a paper is valued more than running ISS”.

read more see 2 comments

MacOS Catalina = Windows Vista

Remember the Windows version that was so security-focused that it broke everything, and needed a gazillion changes/updates/upgrades to get back to where you had a working computer? I think it was Vista, but maybe my memory is failing me. Anyway, Apple got its Vista moment with macOS Catalina.

I was stupid enough to upgrade just before New Year, and I’m still struggling with aftereffects and skeletons falling out of every cupboard I look at. I appreciate Apple trying to make their operating system ever more secure, but breaking stuff every time I upgrade it is borderline ridiculous.

read more see 4 comments

Automation Solution: Data Center Fabric with Tenant Connectivity

I always tell networking engineers attending our Building Network Automation Solutions online course to create minimalistic data models with (preferably) no redundant information. Not surprisingly, that’s a really hard task (see this article for an example) - using a simple automation tool like Ansible you end with either a messy and redundant data model or Jinja2 templates (or Ansible playbooks) full of hard-to-understand and impossible-to-maintain business logic.

Stephen Harding solved this problem the right way: his data center fabric deployment solution uses a dynamic inventory script that translates operator-friendly fabric description (data model) into template-friendly set of device variables.

read more add comment

EVPN Auto-Rd and Duplicate MAC Addresses

Another EVPN reader question, this time focusing on auto-RD functionality and how it works with duplicate MAC addresses:

If set to Auto, the RD generated for the same VNI across the EVPN switches will be different. If the same route (MAC/IP) is present under different leaves of the same L2VNI, there is no best path selection (since the RD is different), and both will be considered. This is a misconfiguration and shouldn’t be allowed. How will the BGP deal with this?

If the above sentence sounded like Latin, go through the short EVPN terminology first (and I would suggest watching the EVPN Technical Deep Dive webinar).
read more see 2 comments

Public Cloud Networking Security is Different

If you’re running a typical (somewhat outdated) enterprise data center, you’re using tons of VLANs and firewalls, use VLANs as security zones, and push inter-VLAN traffic through firewalls for inspection. Security vendors love that approach - when inspecting traffic they can add no value to (like database- or backup sessions), the firewalls quickly become choke points that have to be upgraded.

read more see 4 comments

AWS Rarely Kills a Service. What About Your Vendor?

Here’s an interesting tidbit from “Last Week in AWS” blog:

From a philosophical point of view, AWS fundamentally considers an API to be a promise. Services that aren’t promoted anymore are still available […] Think about that for a second - a service launched 13 years ago is still actively supported to the point where you can use it today.

Compare that to Killed By Google graveyard, and you might understand why I’m a bit reluctant to cover GCP in my webinars.

read more add comment

Must Read: Ironies of Automation

Stumbled upon a 35-year-old article describing the ironies of automation (HT: The Morning Paper). Here’s a teaser…

Unfortunately automatic control can ‘camouflage’ system failure by controlling against the variable changes, so that trends do not become apparent until they are beyond control.

In simpler words: when things fail, they fail really badly because the intermittent failures were kept hidden. Keep that in mind the next time someone tells you how wonderful software-defined AI-assisted networking is going to be.

see 1 comments

Another perspective on "engineering" in IT

Found a nice article about Margaret Hamilton, the lady who coined the term "software engineering".

Engineering—back in 1969 as well as here in 2020—carries a whole set of associated values with it, and one of the most important is the necessity of proofing for disaster before human usage. You don’t “fail fast” when building a bridge: You ensure the bridge works first.

Now be a good "networking engineer" and go and stretch another VLAN around the globe... ;)

see 3 comments

Worth Reading: Apple II Had the Lowest Input Latency Ever

It's amazing how heaping layers of complexity (see also: SDN or intent-based whatever) manages to destroy performance faster than Moore's law delivers it. The computer with lowest keyboard-to-screen latency was (supposedly) Apple II built in 1983, with modern Linux having keyboard-to-screen RTT matching the transatlantic links.

No surprise there: Linux has been getting slower with every kernel release and it takes an enormous effort to fine-tune it (assuming you know what to tune). Keep that in mind the next time someone with a hefty PPT slide deck will tell you to build a "provider cloud" with packet forwarding done on Linux VMs. You can make that work, and smart people made that work, but you might not have the resources to replicate their feat.

see 3 comments

Do We Need Math in Networking?

I should have known better, but I couldn’t resist being pulled into a Twitter spat around the question “whether networking engineers need to know something about math” a long while ago.

Before going into the details, let’s start with Wikipedia definition: “Engineering is the use of scientific principles to design and build machines, structures, and other things” including “specific emphasis on particular areas of applied mathematics, applied science, and types of application”.

So feel free to believe that you don’t need any math or other science (because there’s very little science behind what we do in networking) in your job, in which case you might want to stop reading… but then at least please think twice about your job title.

read more see 5 comments

There Is no Layer-2 in Public Cloud

The amount of layer-2 tricks we use to make enterprise networking work never ceases to amaze me - from shared IP addresses used by various clustering solutions (because it’s too hard to read the manuals and configure DNS) to shared MAC addresses used by first-hop router redundancy protocols (because it would be really hard to send a Gratuitous ARP message on failover) and all sorts of shenanigans we’re forced to engage in to enable running servers to be moved willy-nilly around the Earth.

read more see 3 comments

Review: Webinars in 2019

A year ago I wrote about plans for 2019. While we created over 50 hours of webinar content and over 20 hours of course content in 2019, let’s see how well we delivered on my promises.

Following AWS Networking webinar, we’ll do a similar one on Azure (in early autumn 2019) and GCP (late 2019 or early 2020)

Azure: done. GCP: postponed. Let’s see how serious Google is about that particular project.

read more see 2 comments

Worth Reading: Understanding Scale Computing HC3 Edge Fabric

A long while ago someone told me about a "great" idea of using multi-port server NICs to build ad-hoc (or hypercube or whatever) server-only networks. It's pretty easy to prove that the approach doesn't make sense if you try to build generic any-to-any-connectivity infrastructure... but it makes perfect sense in a small environment.

One can only hope Scale Computing keeps their marketing closer to reality than some major vendors (that I will not name because I'm sick-and-tired of emails from their employees telling me how I'm unjustly picking on the stupidities their marketing is evangelizing) and will not start selling this approach as save-the-world panacea... but we can be pretty sure there will be people out there using it in way-too-large environments, and then blame everything else but their own ignorance or stubbornness when the whole thing explodes into their faces.

see 1 comments