Blog Posts in January 2015
When we started planning a VMware NSX-focused podcast episode with Dmitri Kalintsev, I asked my readers what topics they’d like to see covered. Two comments that we really liked were “how do I get started with VMware NSX?” and “how do I troubleshoot this stuff?”
Cloud builders are often using my ExpertExpress service to validate their designs. Tenant onboarding into a multi-tenant (private or public) cloud infrastructure is a common problem, and tenants frequently want to retain the existing network services appliances (firewalls and load balancers).
The Combine Physical and Virtual Appliances in a Private Cloud case study describes a typical solution that combines per-tenant virtual appliances with frontend physical appliances.
Listening to some SDN pundits one gets an impression that SDN brings peace to Earth, solves all networking problems and makes networking engineers obsolete.
Cynical jokes aside, and ignoring inevitable bugs, is controller-based networking really more reliable than what we do today?
Last spring I ran an IPv6 High Availability webinar which started (not surprisingly) with a simple question: “which network components affect availability in IPv6 world, and how is a dual-stack or an IPv6-only environment different from what we had in the IPv4 world?”
In one of the discussions on v6ops mailing list Matthew Petach wrote:
The probability of us figuring out how to scale the routing table to handle 40 billion prefixes is orders of magnitude more likely than solving the headaches associated with dynamic host renumbering. That ship has done gone and sailed, hit the proverbial iceberg, and is gathering barnacles at the bottom of the ocean.
Is it really that bad? Is simple renumbering in IPv6 world just another myth? It depends.
After discussing the load sharing intricacies we briefly dabbled with the concept of entropy labels.
For whatever reason (subliminal messages from vendor marketing departments?), I’m constantly brooding about the vendor lock-in, its inevitability, and the way supposedly disruptive companies try to use the fear of lock-in to persuade na├»ve customers to buy their products.
Brocade VCS fabric has one of the most flexible multichassis link aggregation group (LAG) implementation – you can terminate member links of an individual LAG on any four switches in the VCS fabric. Using that flexibility is not always a good idea.
2015-01-23: Added a few caveats on load distribution
Every time I write about unequal traffic distribution across a link aggregation group (LAG, aka Etherchannel or Port Channel) or ECMP fabric, someone asks a simple question “is there no way to reshuffle the traffic to make it more balanced?”
TL&DR summary: there are ways to do it, and some vendors already implemented them.
The algorithm that spreads the traffic across a group of outbound links (LAG or set of ECMP next hops) has to satisfy a few requirements:
- It has to work reasonably well in typical environments;
- It should not reorder packets of the same flow (here’s why);
- It has to be simple enough to be implementable in reasonably cheap ASICs;
The second and third requirement result in what the chipset manufacturers (and subsequently the hardware vendors) are offering today: hash-based distribution of packets. In case you need a step-by-step overview of this process, here’s how it works:
- Create an array of buckets and assign each outgoing link to one or more buckets. The bucket size is the number you see in marketing papers as “we support N-way ECMP” or “we have N-way LAG”.
- Take N fields from the outgoing packet header. The fields could be MAC addresses (source and/or destination), IP addresses (source and/or destination), IP port numbers, or even some other fixed-position fields in the packet header. Some vendors – for example Arista – allow you to configure which fields you want to use (assuming the platform chipset supports this functionality).
- Hash the fields from the packet header to get an integer between 0 and bucket size – 1. Example: for bucket sizes that are power of two take the low-order N bits of the hash.
- Enqueue the packet into the output queue of the interface that is associated with the bucket selected by the packet hash.
Have you noticed that the algorithm never checks the size of the output queue? If the hashing algorithm decides to send the packet through Interface#1, the switch will send the packet through Interface#1 even though that interface might be dropping packets like crazy due to continuous congestion, and all the other interfaces sit idle.
The reason the load-balancing algorithm never checks the load on the outbound interface is simple: the typical environment mentioned above is usually assumed to be a healthy mix of numerous independent mice flows. Throw a few elephants in the mix and the assumptions start breaking down.
The only vendor that was always able to cope with the elephants in the mix is Brocade due to the fact that their traditional typical environment (storage networks) consists mainly of elephants.
Can We Solve the Problem?
Here’s an intriguingly simple question: Why can’t we change the mix of outgoing interfaces in the N-way ECMP table to reflect the actual interface load? Wouldn’t that allow us to push the mice flows away from elephants crowding some of the interfaces?
In principle, the answer is “Sure, we could do that”, but we have to solve three challenges:
- Coarse-grained reshuffling could make matters worse. If your hardware supports 8-way ECMP and you have four uplinks, you might shift a large proportion of the traffic when you reassign the buckets to less-loaded interfaces, resulting in a nasty oscillation. Modern chipsets support at least 256-way ECMP, so that shouldn’t be a problem.
- The hardware you use has to support per-bucket counters. All hardware supports per-interface counters, but while they help you identify the congested interfaces, the won’t help you reshuffle the traffic – if the control-plane software cannot see how much traffic goes through each bucket, it makes no sense to randomly reshuffle the buckets hoping for the best.
- We shall not reorder the packets (at least within the data center), which means that we cannot reshuffle active buckets, but it’s relatively safe to change the outgoing interface of a currently inactive bucket. You could still reorder packets within a TCP session under an unlikely set of circumstances (figuring out what those circumstances are is left as an exercise for the reader), but we just might have to accept that slight risk of temporary performance degradation if we want to get better link utilization.
Would the reshuffle inactive buckets idea work in practice? Are there inactive buckets in a typical high-volume data center environment? Welcome to the weird world of flowlets.
What Are Flowlets?
It seems the idea of flowlets first appeared in the Harnessing TCP’s Burstiness with Flowlet Switching paper (see also corresponding PPT) – due to the bursty nature of TCP, you might be able to do pretty reliable bucket reshuffling with 256 or more buckets, as some buckets always tend to be empty.
Microsoft started using flowlets in Windows Server 2012 R2, and recently Cisco implemented flowlet-based dynamic load balancing in the ACI leaf-and-spine fabrics. Juniper is doing something similar (adaptive load balancing) on MX routers in Junos 14.1, and did Adaptive Flowlet Splicing within a Virtual Chassis Fabric (a nice rehash of the topic).
Need more information?
Imagine you need a data center WAN edge router with multiple 10GE uplinks. You’d probably go for an ASR or a MX-series router, right? How about using a 2 Tbps ToR switch and an SDN solution to make it work with full Internet routing table?
If you happen to have iTunes on your computer, please spend 10 seconds rating the podcast before you start listening to it. Thank you!
I got a lengthy email from one of my readers a while ago, essentially asking a simple question: assuming I want to go return to my studies and move further than CCIE I currently hold, should I go for CCDE or the new VMware’s VCIX-NV?
Well, it’s almost like “do you believe in scale-up or scale-out?” ;) Both approaches have their merits.
Every major hypervisor and networking vendor has an overlay virtual networking solution. Obviously they’re not identical, and some of them work better than others in large-scale environments – an interesting challenge we tried to address in the Scaling Overlay Virtual Networks webinar. As always, we started by identifying the potential problems.
Olivier Hault sent me an interesting challenge:
I cannot find any simple network-layer solution that would allow me to use total available bandwidth between a Hypervisor with multiple uplinks and a Network Attached Storage (NAS) box.
TL&DR summary: you cannot find it because there’s none.
Dmitri Kalintsev, one of the networking guys from VMware NSX team, has kindly agreed to do an NSX technical deep dive Software Gone Wild episode… and you have the opportunity to tell him what you’d like to hear. It’s as easy as writing a comment, and we’ll pick one of the most popular topics.
Do keep in mind that we plan to do a technical deep dive, and it has to fit within an hour or so or nobody will ever listen to it, so please keep your suggestions focused. “Troubleshooting NSX”, “NSX Design”, or “NSX versus ACI ” is not what we’re looking for ;)
One of the interesting challenges in the Software-Defined Data Center world is the integration of network and security services with the compute infrastructure and network virtualization. Palo Alto claims to have tightly integrated their firewalls with VMware NSX and numerous cloud orchestration platforms - it was time to figure out how that’s done, so we decided to go on a field trip into the scary world of security.
A few months ago I described how bandwidth limitations shatter the dreams of spread-out application stacks with elements residing (or being dynamically migrated) between data centers. Today let’s focus on bandwidth’s ugly cousin: latency.
TL&DR Summary: Spreading the server components of an application across multiple locations (multiple data centers or hybrid cloud deployments) can easily result in dismal performance even when there’s plenty of bandwidth available.
MPLS Traffic Engineer is sometimes promoted as a QoS solution (it seems bandwidth calendaring is a permanent obsession of some networking engineers, and OpenFlow is no more a solution than MPLS-TE was ;), but in reality it’s pretty hard to make the two work together seamlessly (just ask anyone who had to implement auto-bandwidth MPLS-TE in a large network).
One of the main complaints I was continuously getting about my free content is that there’s simply too much of it, and that it’s impossible to find what one is looking for.
New Year holidays gave me enough time to implement a project that has been on my to-do list for almost a year: total redesign of the free content web site. Feedback highly appreciated!
Whenever there’s a weird request to do something totally illogical with BGP, there’s a knob in Cisco IOS to get it done (and increase the heartburn of CCIE candidates). Conditional Route Injection (the ability to insert more specific prefixes into BGP without having them in the IP routing table) is one of them.