Here’s an interesting blog post (particularly as it’s coming from a well-known cloud evangelist): at the infrastructure level stability matters more than agility or speed-of-deployment. Welcome to real world ;)
After the last US-based ipSpace.net workshop a lot of people asked me about the next one. It took a long time, but here it is: I’m running an on-site automation workshop together with several friends with outstanding hands-on experience in Colorado in late May.
During Cisco Live Europe 2017 (where I got thanks to the Tech Field Day crew kindly inviting me) I had a nice chat with Peter Jones, principal engineer @ Cisco Systems. We started with a totally tangential discussion on why startups fail, and quickly got back to flexible hardware and why one would want to have it in a switch.
I recently finished editing the videos from the Leaf-and-Spine Designs update to the Leaf-and-Spine Fabrics webinar, so it wasn’t hard to select the featured webinar for April 2017. The featured videos now include BGP in the Data Center by Dinesh Dutt, SPB Deep Dive by Roger Lapuh, and VXLAN with EVPN control plane by Lukas Krattiger.
One of my readers considered joining the Building Network Automation Solutions course but wasn’t sure whether it would help him solve the challenges he’s facing in his network.
Fortunately, his challenges aren’t that hard to solve.
Evgeny made an interesting observation while testing the NETCONF client on IOS XE 16.x (see also this comment on my blog):
The most interesting part: for unknown reason IOS-XE gives different answers about capabilities on ports 830 and 22.
In case you’re wondering why we’re stuck with old stuff like TCP, IPv4, OSPF, and a few other bits and pieces that were invented decades ago when we could be using the glitzy controller-based software-defined whatever, read the blog post by Martin Sustrik. He talks about software, but we’re facing the same challenges in networking.
Lukas Krattiger (Cisco Systems) was the guest speaker in Layer-2+3 fabrics part of the Leaf-and-Spine Fabric Design webinar, and he started his presentation with an overview of how we use overlays in data center fabrics.
Most network automation presentations you can find on the Internet focus on configuration management, either to provision new boxes, or to provision new services, so it’s easy to assume that network automation is really a fancy new term for consistent device configuration management.
However, as I explained in the Network Automation 101 webinar, there’s so much more you can do and today I’d like to share a real-life example from Jaakko Rautanen, an alumni of my Building Network Automation Solutions online course.
Johannes Weber built a CCNP practice lab, configured 22 different protocols in it, and took packet captures of all of them happily chatting. To make things more interesting he created 45 challenges that you can solve with Wireshark using the pcap file he published.
One of my readers sent me a link to CCO documentation containing this gem:
Beginning with Cisco NX-OS Release 7.0(3)I2(1), Cisco Nexus 9000 Series switches handle the CLI configuration actions in a different way than before the introduction of NX-API and DME. The NX-API and DME architecture introduces a delay in the communication between Cisco Nexus 9000 Series switches and the end host terminal sessions, for example SSH terminal sessions.
So far so good. We can probably tolerate some delay. However, the next sentence is a killer…
2017-04-05: The wonderful information disappeared from Cisco's documentation within 24 hours with no explanation whatsoever. However, I expected that and took a snapshot of that page before publishing the blog post ;)
My Why Do We Need Session Stickiness in Load Balancing blog post generated numerous interesting comments and questions, so I decided to repost them and provide slightly longer answers to some of the questions.
Warning: long wall of text ahead.
During Cisco Live Europe (huge thanks to Tech Field Day crew for bringing me there) I had a chat with Jeff McLaughlin about NETCONF support on Cisco IOS XE, in particular on the campus switches.
We started with the obvious question “why would someone want to have NETCONF on a campus switch”, continued with “why would you use NETCONF and not REST API”, and diverted into “who loves regular expressions”. Teasing aside, we discussed:
I’m running a hyperconverged infrastructure event with Mitja Robas on April 6th, and so my friend Christoph Jaggi sent me a list of interesting questions, starting with:
What are hyperconverged infrastructures?
The German version of the interview is published on inside-it.ch.
Imagine a Flatworld in which railways are the main means of transportation. They were using horses and pigeons in the past, and experimenting with underwater airplanes, but railways won because they were cheaper than anything else (for whatever reason, price always wins over quality or convenience in that world).
As always, there were multiple railroad tracks and trains manufacturers, and everyone tried to use all sorts of interesting tricks to force the customers to buy tracks and trains from the same vendor. Different track gauges and heptagonal wheels that worked best with grooved rails were the usual tricks.
Niki Vonderwell kindly invited me to Troopers 2017 and I decided to talk about security and reliability aspects of network automation.
The presentation is available on my web site, and I’ll post the link to the video when they upload it. An extended version of the presentation will eventually become part of Network Automation Use Cases webinar.
The last presentation during the Tech Field Day Extra @ Cisco Live Europe event was a Cisco-Apple Partnership presentation, and we expected an hour of corporate marketese.
Can’t tell you how pleasantly surprised we were when Jerome Henry started his very technical presentation explaining the wireless goodies you get when using iOS with IOS.
…we’ve found that VMware’s native virtual switch implementation has become the de facto standard for greater than 99% of vSphere customers today. … Moving forward, VMware will have a single virtual switch strategy that focuses on two sets of native virtual switch offerings – VMware vSphere® Standard Switch and vSphere Distributed Switch™ for VMware vSphere, and the Open virtual switch (OVS).
Ansible network modules (at least in the way they’re implemented in Ansible releases 2.1 and 2.2) were one of the more confusing aspects of my Building Network Automation Solutions online course (and based on what I’m seeing on various chat sites we weren’t the only ones).
I wrote an in-depth explanation of how you’re supposed to be using them a while ago and now updated it with user authentication information.
One of the engineers watching my Data Center 3.0 webinar asked me why we need session stickiness in load balancing, what its impact is on load balancer performance, and whether we could get rid of it. Here’s the whole story from the networking perspective.
Remember the All You Need Are Two Switches saga? Several readers told me they’d like to have in text (article) format, so I found a transcription service, and started editing what they produced and publishing it. The first two installments are already online.
On a related topic: we’ll discuss the viability of this approach in April DIGS event in Zurich, Switzerland.
One of my readers watched my Leaf-and-Spine Fabric Architectures webinar and had a follow-up question:
You mentioned 3-tier architecture was dictated primarily by port count and throughput limits. I can understand that port density was a problem, but can you elaborate why the throughput is also a limitation? Do you mean that core switch like 6500 also not suitable to build a 2-tier network in term of throughput?
As always, the short answer is it depends, in this case on your access port count and bandwidth requirements.
In autumn 2016 I embarked on a quest to figure out how TCP really works and whether big buffers in data center switches make sense. One of the obvious stops on this journey was a chat with Thomas Graf, Linux Core Team member and a founding member of the Cilium project.
When Cisco ACI was launched it promised to do everything you need (plus much more, and in multi-hypervisor environment). It was quickly obvious that you can’t do all that on ToR switches, and need control of the virtual switch (the real network edge) to get the job done.
Yannis sent me an interesting challenge after reading my short “this is how I wasted my time” update:
We are very much committed in automation and use Ansible to create configuration and provision our SP and data center network. One of our principles is that we do rely solely on data available in external resources (databases and REST endpoints), and avoid fetching information/views from the network because that would create a loop.
You can almost feel a however coming in just a few seconds, right?
The featured webinar in March 2017 is the SDN Use Cases webinar describing over a dozen different real-life SDN use cases. The featured videos cover four of them: a data center fabric by Plexxi, microsegmentation (including VMware NSX), SDN-based Internet edge router built by David Barroso, and Fibbing - an OSPF-based traffic engineering developed at University of Louvain.
To view the videos, log into my.ipspace.net, select the webinar from the first page, and watch the videos marked with star.
It’s uncommon to find an organization that succeeds in building a private OpenStack-based cloud. It’s extremely rare to find one that documented and published the whole process like Paddy Power Betfair did with their OpenStack Reference Architecture whitepaper.
One of the challenges of designing a controller-based solution is the transport network used to exchange information between controller and controlled devices. Can you do that in-band or is it better to have an out-of-band network (built with traditional components)? Terry Slattery explained some of the pros and cons in the Monitoring SDN Networks webinar.
During the Tech Field Day Extra event at Cisco Live Europe 2017 Fabrizio Maccioni, Technical Marketing Engineer at Cisco, described enhanced programmability available in Cisco IOS XE release 16.x. What really got my attention was the claim that they made NETCONF on Cisco IOS transactional (and Fabrizio mentioned the candidate config and commit).
Here's my initial reaction:
I often get questions from engineers wondering whether my webinars or courses would be too tough for them. Here’s a question I got from an engineer who wanted to attend my Building Next-Generation Data Center course: “What specific prior experience do you expect for this workshop?”
Eyvonne Sharp wrote a great blog post describing Cisco’s love of complexity and how SD-WAN vendors proved things don’t have to be that complex.
Tags: Data Center
The parts I like most (and they apply equally well to networking):
Last year Cisco launched a new series of Nexus 9000 switches with table sizes that didn’t match any of the known merchant silicon ASICs. It was obvious they had to be using their own silicon – the CloudScale ASIC. Lukas Krattiger was kind enough to describe some of the details last November, resulting in Episode 73 of Software Gone Wild.
For even more details, watch the Cisco Nexus 9000 Architecture Cisco Live presentation.
TL&DR: Don’t use OSPF areas if you can avoid them. Don’t use NSSA areas.
I managed to get another awesome lineup of guest speakers for the Spring 2017 Building Next-Generation Data Center course (starting in less than a month):
Scott Lowe will open the course with a presentation on the impact of open source software in data center environments.
It might be easier to munge the data structure into a more appropriate format first and then use the munged data in subsequent tasks. Wondering how to do it?
One of my readers wondered what the difference between fabric extenders and leaf-and-spine fabrics is:
We are building a new data center for DR and we management is wanting me to put in recommendations to either stick with our current Cisco 7k to 2k ToR FEX solution, or prepare for what seems to be the future of DC in that spine leaf architecture.
Let’s start with “what is leaf-and-spine architecture?”
When Facebook announced 6-pack (their first chassis switch) my reaction was “meh” (as well as “I would love to hear what Brad Hedlund has to say about it”). When Facebook announced Backpack I mostly ignored the announcement. After all, when one of the cloud-scale unicorns starts talking about their infrastructure, what they tell you is usually low on detail and used primarily as talent attracting tool.
Imagine you were asked to migrate some of the workloads running in your data center into a public (or managed) cloud. These workloads still have to access the data residing in your data center – a typical hybrid cloud deployment.
Next thing you know you have to deal with your (C)ISO and his/her usual concerns as well as the variety of articles on tech sites stating that "security is the biggest challenge of cloud adoption".
Tags: Data Center
I got this tweet after publishing the “use Ansible to execute a single command on all routers” blog post (and a few similar comments on the blog post itself)
Or use Python, Netmiko and a simple For loop
I never cease to be amazed by the urge to do undifferentiated heavy lifting in the IT industry.
One of my readers sent me a list of questions after watching some of my videos, starting with a generic one:
While working self within large corporations for a long time, I am asking myself how it will be possible to move from messy infrastructure we grew over the years to a modern architecture.
Usually by building a parallel infrastructure and eventually retiring the old one, otherwise you’ll end up with layers of kludges. Obviously, the old infrastructure will lurk around for years (I know people who use this approach and currently run three generations of infrastructure).
In 2013, large-scale cloud providers and ISPs decided they had enough of the glacial IETF process of generating YANG models used to describe device configuration and started OpenConfig – a customer-only initiative that quickly created data models covering typical use cases of the founding members (aka “What Does Google Need”).
Angelos Vassiliou sent me an interesting lengthy email after I published my OSPF Forwarding Address series (part 1, part 2, part 3, part 4). I asked him whether it’s OK to publish his email together with my responses as a blog post and he gracefully agreed, so here it is.
Cumulus Linux 3.2 shipped with a rudimentary EVPN implementation and everyone got really excited, including smaller ASIC manufacturers that finally got a control plane for their hardware VTEP functionality.
However, while it’s nice to have EVPN support in Cumulus Linux, the claims of its benefits are sometimes greatly exaggerated.
I was using Ansible playbooks to configure Cisco IOS routers running in VIRL and wanted to extract the router configurations before stopping the simulation.
The featured webinar in February 2017 is the Network Automation 101 webinar, and the featured video describes the reasons you should be interested in network automation, its basics, and the difference between automation and orchestration.
Running BGP instead of an IGP in your leaf-and-spine fabric sounds like an interesting idea (particularly if your fabric is large). Configuring a zillion BGP knobs on every box doesn’t.
However, BGP doesn’t have to be complex. In the Simplify BGP Configurations video (part of leaf-and-spine fabric designs webinar) Dinesh Dutt explains how you can make BGP configurations simple and easy-to-understand.
Remember the kludges needed to make OSPF NSSA areas work correctly? We concluded that saga by showing how the rules of RFC 3101 force a poor ASBR to choose an IP address on one of its OSPF-enabled interfaces as a forwarding address to be used in Type-7 LSA.
What could possibly go wrong with such a “simple” concept?
In the next session of Network Automation Use Cases webinar (on Thursday, February 16th) I’ll describe how you could implement automatic deployment of network services, and what you could do to minimize the impact of unintended consequences.
If you attended one of the previous sessions of this webinar, you’re already registered for this one, if not, visit this page and register.
Most networking operating systems include a mechanism to roll back device configuration and/or create configuration snapshots. These mechanisms usually work only for the device configuration, but do not include operating system images or other components (example: crypto keys).
Now imagine using RFC 1925 rule 6a and changing the “configuration rollback” problem into “file system snapshot” problem. That’s exactly what Cumulus Linux does in its newest release. Does it make sense? It depends.
Some of the engineers building Ansible-with-VIRL lab in my Building Network Automation Solutions online course experienced interesting challenges, so I made the how-to instructions more explicit and added a troubleshooting section to the Using Ansible Playbooks with Cisco VIRL document. Hope you’ll find them useful.
When I recorded the first podcast with Thomas Graf we both found it so much fun that we decided to do it again. Thomas had attended the NetDev 1.2 conference so when we met in November 2016 we warmed up with What’s NetDev and then started discussing the hot new networking stuff being added to Linux kernel:
In the previous blog posts I described how OSPF tries to solve some broken designs with Forwarding Address field in Type-5 LSA – a kludge that unnecessarily increases the already too-high complexity of OSPF.
NSSA areas make the whole thing worse: OSPF needs Forwarding Address in Type-5 LSAs generated from Type-7 LSAs to ensure optimal packet forwarding. Here’s why:
In the last few weeks I’ve seen numerous questions along the lines of “how do I manage VLANs on my switch with Ansible”. You can look at this question from two perspectives: the low-level details (which modules do I use, how do I push commands to the box…) or the high-level challenges (how do I make sure actual device state matches desired device state). Obviously I’m interested in the latter.
I’m positive I’ve answered this question a dozen times in various blog posts and webinars, but it keeps coming back:
You always mention that high speed links are always better than parallel low speed links, for example 2 x 40GE is better than 8 x 10GE. What is the rationale behind this?
Here’s the N+1-th answer (hoping I’m being consistent):
In Episode 69 of Software Gone Wild we discussed ways of increasing visibility into VXLAN transport fabric. Another thing we badly need is visibility into the virtual edge behavior, and to help you get there Iwan Rahabok created a set of vRealize dashboards that include the virtual edge networking components. Hope you’ll find them useful.
A while ago I decided it's time to figure out whether it's better to drop or to delay TCP packets, and quickly figured out you get 12 opinions (usually with no real arguments supporting them) if you ask 10 people. Fortunately, I know someone who deals with TCP performance for living, and Juho Snellman was kind enough to agree to record another podcast.
Update 2017-03-31: Added More information section
In my initial OSPF Forwarding Address blog post I described a common Forwarding Address (FA) use case (at least as preached on the Internet): two ASBRs connected to a single external subnet with route redistributing configured only on one of them.
That design is clearly broken from the reliability perspective, but are there other designs where OSPF FA might make sense?
One of the engineers attending my Building Network Automation Solutions online course got the lab up and running, wanted to execute a simple IOS command from an Ansible playbook and failed.
Ansible (or Python+Paramiko/Netmiko) seems to be the tool used in most do-it-yourself network automation presentations and videos. Did you know there’s a scripting/automation alternative that’s hugely popular in parts of sysadmin and virtualization universe that almost nobody talks about in networking (because everyone is focused on huge data center fabrics and unicorns) – PowerShell (now also available on OSX and Linux).
One of the quotes I found in the Mythical Man-Month came from the pre-GPS days: “never go to sea with two chronometers, take one or three”, and it’s amazing the networking industry (and a few others) never got the message.
One would think that we're the only ones struggling with Linux CLI (read: bash). Seems like cyber security professionals might be in the same boat according to the nice summary of dozens of Linux/bash commands collected by Robert Graham.
Running Linux containers on a single host is relatively easy. Building private multi-tenant networks across multiple hosts immediately creates the usual networking mess.
One of my readers sent me an interesting NSSA question (more in a future blog post) that sent me chasing for the reasons behind the OSPF Forwarding Address (FA) field in type-5 and type-7 LSAs.
This is the typical scenario for OSPF FA I was able to find on the Internet:
The next session of the Network Automation Use Cases series will take place on January 24th. Dinesh Dutt will explain describe how you can use Ansible and Jinja2 to automate data center fabric deployments, and I’ll have a few things to say about automating network security.
If you think that what Dinesh will talk about applies only to startups you’re totally wrong. UBS is using the exact same approach to roll out their new data centers; Thomas Wacker will share the details in his guest presentation in the next Building Next-Generation Data Centers online course.
A blog post by Russ White pointed me to an article describing how IPv6 services tend to be less protected than IPv4 services. No surprise there, people like Eric Vyncke and I were telling anyone who was willing to listen that operating two-protocol networks isn’t the same thing as operating a single-protocol one (see also RFC 1925 rule 4).
I was discussing a totally unrelated topic with Terry Slattery when he mentioned a quote from the Mythical Man-Month. It got me curious, I started exploring and found out I can get the book as part of my Safari subscription.
From the moment Cisco and VMware announced VXLAN some networking engineers complained that they'd lose visibility into the end-to-end path. It took a long while, but finally the troubleshooting tools started appearing in VXLAN environment: NVO3 working group defined Fault Managemnet framework for overlay networks and Cisco implemented at least parts of it in recent Nexus OS releases.
Ansible is great at capturing and using JSON-formatted data returned by REST API (or any other script or method it can invoke), but unfortunately some of us still have to deal with network devices that cannot even spell structured data or REST.
The featured webinar in January 2017 is the Introduction to Docker webinar, and in the featured video Matt Oswalt explains the basic Docker tasks. Other videos in this webinar cover Docker images, volumes, networking, and Docker Compose and Swarm.
To view the featured video, log into my.ipspace.net, select the webinar from the first page, and watch the video marked with star.
Elisa mentions that for a given piece of data, there should be “one source of truth”. It gets a bit muddled when you have an IPAM tool and Git source control simultaneously. It is not hard to imagine scenarios where these get out of sync especially if you consider multi-operator scenarios.
Confused? He provided a simple scenario:
With January 6th the Christmas/New Year holidays are over even for most European countries, so it’s time to restart my blog and set some goals for 2017.
2015 was year of SDN, 2016 was year of network automation, and 2017 is shaping up to be the year of the cloud.