Blog Posts in September 2021
Boris Lazarov sent me an excellent question:
Does it make sense and are there any inherent problems from design perspective to use the underlay not only for transport of overlay packets, but also for some services. For example: VMWare cluster, vMotion, VXLAN traffic, and some basic infrastructure services that are prerequisite for the rest (DNS).
Before answering it, let’s define some terminology which will inevitably lead us to the it’s tunnels all the way down endstate.
Bill Dagy sent me an annoying ISR gotcha. In his own words:
Since you have a large audience I thought I would throw this out here. Maybe it will help someone avoid spending 80 man hours troubleshooting network slowdowns.
Here’s the root cause of that behavior:
Cisco is now shipping routers that have some specified maximum throughput, but you have to buy a “boost license” to run them unthrottled. Maybe everyone already knew this but it sure took us by surprise.
Don’t believe it? Here’s a snapshot from Cisco 4000 Family Integrated Services Router Data Sheet:
In the Non-Stop Forwarding (NSF) article, I mentioned that the routers adjacent to the device using NSF have to play along to make the idea work. That capability is called Graceful Restart. Today we’ll explore its intricate details, be diplomatic, and leave the shortcomings and tradeoffs for the next blog post.
Imagine an access (provider edge) router providing connectivity services to its clients and running a routing protocol with one or more upstream devices.
I think we have a global problem with code quality. Both from a security perspective, and from a less problematic but still annoying bugs-everywhere perspective. I’m not sure if the issue is largely ignored, or we’ve given up on it (see also: Cloud Complexity Lies or Cisco ACI Complexity).
Here’s another masterpiece by Charity Majors: Why I hate the phrase “breaking down silos”. A teaser in case you can’t decide whether to click the link:
When someone says they are “breaking down silos”, whether in an interview, a panel, or casual conversation, it tells me jack shit about what they actually did.
One of my subscribers has to build a small data center fabric that’s just a tad too big for two switch design.
For my datacenter I would need two 48 ports 10GBASE-T switches and two 48 port 10/25G fibber switches. So I was watching the Small Fabrics and Lower-Speed Interfaces part of Physical Fabric Design to make up my mind. There you talk about the possibility to do a leaf and spine with 4 switches and connect servers to the spine.
A picture is worth a thousand words, so here’s the diagram of what I had in mind:
Last week I published an unrolled version of Peter Paluch’s explanation of flooding differences between OSPF and IS-IS. Here’s the second part of the saga: IS-IS flooding details (yet again, reposted in a more traditional format with Peter’s permission).
In IS-IS, DIS1 is best described as a “baseline benchmark” – a reference point that other routers compare themselves to, but it does not sit in the middle of the flow of updates (Link State PDUs, LSPs).
A quick and simplified refresher on packet types in IS-IS: A LSP carries topological information about its originating router – its System ID, its links to other routers and its attached prefixes. It is similar to an OSPF LSU containing one or more LSAs of different types.
Christoph Jaggi sent me a link to an interesting article describing security vulnerabilities pentesters found in Cisco SD-WAN admin/management code.
I’m positive the bugs have been fixed in the meantime, but what riled me most was the root cause: Little Bobby Tables (aka SQL injection) dropped by. Come on, it’s 2021, SD-WAN is supposed to be about building secure replacements for MPLS/VPN networks, and they couldn’t get someone who could write SQL-injection-safe code (the top web application security risk)?
Someone using my netsim-tools sent me an intriguing question: “Would it be possible to get network topology graphs out of the tool?”
I did something similar a long while ago for a simple network automation project (and numerous networking engineers built really interesting stuff while attending the Building Network Automation Solutions course), so it seemed like a no-brainer. As always, things aren’t as easy as they look.
I loved the Time Dilation blog post by Seth Godin. It explains so much, including why I won’t accept a “quick conf call to touch base and hash out ideas” from someone coming out of the blue sky – why should I be interested if they can’t invest the time to organize their thoughts and pour them into an email.
The concept of “creation-to-consumption” ratio is also interesting. Now I understand why I hate unedited opinionated chinwagging (many podcasts sadly fall into this category) or videos where someone blabbers into a camera while visibly trying to organize their thoughts.
Just FYI, these are some of the typical ratios I had to deal in the past:
Peter Paluch loves blogging in microchunks on Twitter ;) This time, he described the differences between OSPF and IS-IS, and gracefully allowed me to repost the explanation in a more traditional format.
My friends, I happen to have a different opinion. It will take a while to explain it and I will have to seemingly go off on a tangent. Please have patience. As a teaser, though: The 2Way state between DRothers does not improve flooding efficiency – in fact, it worsens it.
In early September, I started yet another project that’s been on the back burner for over a year: ipSpace.net Design Clinic (aka Ask Me Anything Reasonable in a more structured format). Instead of collecting questions and answering them in a podcast (example: Deep Questions podcast), I decided to make it more interactive with a live audience and real-time discussions. I also wanted to keep it valuable to anyone interested in watching the recordings, so we won’t discuss obscure failures of broken designs or dirty tricks that should have remained in CCIE lab exams.
Stateful Switchover (SSO) is another seemingly awesome technology that can help you implement high availability when facing a
broken non-redundant network design. Here’s how it’s supposed to work:
- A network device runs two copies of the control plane (primary and backup);
- Primary control plane continuously synchronizes its state with the backup control plane;
- When the primary control plane crashes, the backup control plane already has all the required state and is ready to take over in moments.
Delighted? You might be disappointed once you start digging into the details.
Initial implementation of Noël Boulene’s automated provisioning of NSX-T distributed firewall rules changed NSX-T firewall configuration based on Terraform configuration files. To make the deployment fully automated he went a step further and added a full-blown CI/CD pipeline using GitHub Actions and Terraform Cloud.
Not everyone is as lucky as Noël – developers in his organization already use GitHub and Terraform Cloud, making his choices totally frictionless.
Charity Majors published another must-read article: why every software engineering interview should include ops questions. Just a quick teaser:
The only way to unwind this is to reset expectations, and make it clear that:
- You are still responsible for your code after it’s been deployed to production, and
- Operational excellence is everyone’s job.
Adhering to these simple principles would remove an enormous amount of complexity from typical enterprise IT infrastructure… but I’m afraid it’s not going to happen anytime soon.
Whenever someone is promising a miracle solution, it’s probably due to them working in marketing or having no clue what they’re talking about (or both)… or it might be another case of adding another layer of abstraction and pretending the problems disappeared because you can’t see them anymore.
In December 2020, I got sick-and-tired of handcrafting Vagrantfiles and decided to write a tool that would, given a target networking lab topology in a text file, produce the corresponding Vagrantfile for my favorite environment (libvirt on Ubuntu). Nine months later, that idea turned into a pretty comprehensive tool targeting networking engineers who like to work with CLI and text-based configuration files. If you happen to be of the GUI/mouse persuasion, please stop reading; this tool is not for you.
During those nine months, I slowly addressed most of the challenges I always had creating networking labs. Here’s how I would typically approach testing a novel technology or software feature:
When I started collecting topics for the September 2021 ipSpace.net Design Clinic one of the subscribers sent me an interesting challenge: are there any open-source alternatives to Cisco’s DMVPN?
I had no idea and posted the question on Twitter, resulting in numerous responses pointing to a half-dozen alternatives. Thanks a million to @MarcelWiget, @FlorianHeigl1, @PacketGeekNet, @DubbelDelta, @Tomm3h, @Joy, @RoganDawes, @Yassers_za, @MeNotYouSharp, @Arko95, @DavidThurm, Brian Faulkner, and several others who chimed in with additional information.
Here’s what I learned:
Non-Stop Forwarding (NSF) is one of those ideas that look great in a slide deck and marketing collaterals, but might turn into a giant can of worms once you try to implement them properly (see also: stackable switches or VMware Fault Tolerance).
One of my subscribers is trying to decide whether to buy an -EX or an -FX version of a Cisco Nexus data center switch:
I was comparing Cisco Nexus 93180YC-FX and Nexus 93180YC-EX. They have the same port distribution (48x 10/25G + 6x40/100G), 3.6 Tbps switching capacity, but the -FX version has just 1200 Mpps forwarding rate while EX version goes up to 2600 Mpps. What could be the reason for the difference in forwarding performance?
Both switches are single-ASIC switches. They have the same total switching bandwidth, thus it must take longer for the FX switch to forward a packet, resulting in reduced packet-per-seconds figure. It looks like the ASIC in the -FX switch is configured in more complex way: more functionality results in more complexity which results in either reduced performance or higher cost.
The name of a resource indicates what we seek, an address indicates where it is, and a route tells us how to get there.
You might wonder when that document was written… it’s from January 1978. They got it absolutely right 42 years ago, and we completely messed it up in the meantime with the crazy ideas of making IP addresses resource identifiers.
Noël Boulene decided to automate provisioning of NSX-T distributed firewall rules as part of his Building Network Automation Solutions hands-on work.
What makes his solution even more interesting is the choice of automation tool: instead of using the universal automation hammer (aka Ansible) he used Terraform, a much better choice if you want to automate service provisioning, and you happen to be using vendors that invested time into writing Terraform provisioners.
One of the major challenges of using netsim-tools was the installation process – pull the code from GitHub, install the prerequisites, set up search paths… I knew how to fix it (turn the whole thing into a Python package) but I was always too busy to open that enormous can of worms.
That omission got fixed in summer 2021; netsim-tools is now available on PyPI and installed with pip3 install netsim-tools.