Don't Base Your Design on Vendor Marketing

Remember how Arista promoted VXLAN coupled with deep-buffer switches as the perfect DCI solution a few years ago? Someone took Arista's marketing too literally, ran with the idea, and combined VXLAN-based DCI with a traditional MLAG+STP data center fabric.

While I love that they wrote a blog post documenting their experience (if only more people would do that), it doesn't change the fact that the design combines the worst of both worlds.

Here are just a few things that went wrong:

I've seen tons of STP- or MLAG-induced data center meltdowns. The first thing I would do in a new data center design is get rid of MLAG as much as possible. Most hypervisors work just fine without MLAG, and bare-metal Linux or Windows servers need MLAG only if you want to fully utilize all server uplinks. WAN edge routers should use routing with the fabric, and in some cases you can use the same trick with network services appliances.

End result: you MIGHT need MLAG to connect network services boxes that use static routing. Connect all of them to a single pair of ToR switches and get rid of MLAG everywhere else.

Even worse, an MLAG-based design limits scalability. Most data center switching vendors support at most two switches in an MLAG cluster, limiting an MLAG+STP fabric to two spine switches.

Regardless of how you implement them, large layer-2 fabrics are a disaster waiting to happen. With a VXLAN-over-IP fabric you at least have a stable L3-only transport core and keep the crazy bits at the network edge, the way the Internet has worked for ages.
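To see why the transport fabric can stay L3-only, remember that a VXLAN packet is nothing more than the original Ethernet frame behind a small header carried over UDP/IP: the spines forward plain IP packets, and all the MAC-related craziness stays on the edge VTEPs. Here's a minimal Python sketch of the encapsulation (the VNI and the dummy inner frame are made-up values, purely for illustration):

    import struct

    VXLAN_UDP_PORT = 4789   # IANA-assigned destination port of the outer UDP header (not built here)

    def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
        """Prepend the 8-byte VXLAN header (RFC 7348) to an Ethernet frame.

        The result becomes the payload of an ordinary UDP/IP packet, which is
        why the underlay fabric never has to know about tenant MAC addresses.
        """
        if not 0 <= vni < 2 ** 24:
            raise ValueError("VNI is a 24-bit value")
        flags_and_reserved = 0x08 << 24   # I-bit set: the VNI field is valid
        vni_and_reserved = vni << 8       # VNI occupies the upper 24 bits
        return struct.pack("!II", flags_and_reserved, vni_and_reserved) + inner_frame

    # A made-up 64-byte inner frame carried in segment 10000
    print(len(vxlan_encapsulate(b"\x00" * 64, vni=10000)))   # 8 + 64 = 72 bytes

Everything in front of that header is a regular UDP/IP envelope, so the only thing the underlay has to do well is IP routing (and decent ECMP).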

Interestingly, most networking vendors have seen the light, dropped their proprietary or standard L2 fabrics and replaced them with VXLAN+EVPN. Maybe it’s time to start considering it.

When interconnecting fabrics, you should connect leaf switches, not spines. I described the challenge in detail in the Multi-Pod and Multi-Site Fabrics part of the Leaf-and-Spine Fabric Architectures webinar and might write a blog post on the topic; in the meantime, the proof is left as an exercise for the reader.

Raw VXLAN is not the best DCI technology. I explained that in 2012 and again in October 2014… obviously with little impact.

Yet again, you can find more details in Lukas Krattiger's presentation in the Leaf-and-Spine Fabric Architectures webinar (this part of the webinar is available with a free ipSpace.net subscription).

Deep buffers are not a panacea. When Arista started promoting deep-buffer switches (because they were the first vendor shipping the Jericho chipset; now you can buy Jericho-based boxes from Cisco as well), I asked a number of people familiar with real-life data center designs, ASIC internals, and TCP behavior whether you really need deep-buffer switches in data centers.

While the absolutely correct answer is always "it depends", in this particular case we got to "mostly NO". You need deep buffers when going from a low-latency/high-bandwidth environment to a high-latency/low-bandwidth one (the data center WAN edge); in the core of a data center fabric they do more harm than good. Another reason to connect DCI links to the fabric edge.
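A back-of-the-envelope calculation shows why. The buffer a switch needs to ride out a TCP-induced speed mismatch is roughly the bandwidth-delay product of the path, which is tiny inside the fabric and huge on a long-distance DCI link. The numbers below are generic assumptions (100GE fabric links, a 10GE DCI link with 20 ms RTT), not figures from the design in question:

    def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
        """Classic rule of thumb: buffer ~ bandwidth * round-trip time."""
        return bandwidth_bps * rtt_seconds / 8

    # Inside the fabric: 100GE links, RTT in the ~100 microsecond range
    print(bdp_bytes(100e9, 100e-6) / 1e6)   # ~1.25 MB

    # WAN edge / DCI: 10GE link with ~20 ms RTT between data centers
    print(bdp_bytes(10e9, 20e-3) / 1e6)     # ~25 MB

A shallow-buffer ToR switch handles the first number without breaking a sweat; only the boxes facing the long-RTT DCI link have to worry about the second one.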

What Should They Have Done?

The blog post I quoted at the beginning of this article is a few years old, and it’s possible that Arista didn’t have VXLAN-capable low-cost ToR switches at that time, but here’s what I would do today:

  • Build two layer-3 leaf-and-spine fabrics;
  • Deploy VXLAN with EVPN or static ingress replication on top of them (there's a short sketch of the static option after this list);
  • Connect the DCI links to two deep-buffer leaf switches.
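If you go for static ingress replication, the only extra state you have to manage is a per-VNI flood list on every VTEP. Here's a minimal Python sketch of that bookkeeping (the VTEP names, loopback addresses and VNI numbers are made-up examples, not a recommended addressing plan):

    # Made-up VTEP loopbacks and per-VNI membership across two fabrics
    vteps = {
        "leaf1-dc1": "10.0.1.1", "leaf2-dc1": "10.0.1.2",
        "leaf1-dc2": "10.0.2.1", "leaf2-dc2": "10.0.2.2",
    }
    vni_membership = {
        10000: ["leaf1-dc1", "leaf2-dc1", "leaf1-dc2"],
        10001: ["leaf1-dc1", "leaf1-dc2", "leaf2-dc2"],
    }

    def flood_list(local_vtep, vni):
        """Every other VTEP in the VNI gets a unicast copy of BUM traffic."""
        return [vteps[name] for name in vni_membership[vni] if name != local_vtep]

    for vni in vni_membership:
        print(f"leaf1-dc1, VNI {vni}: flood to {flood_list('leaf1-dc1', vni)}")

EVPN builds the same flood lists automatically with type-3 routes; static configuration works well enough as long as the number of VTEPs (and the churn) stays low.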

Need more details?

All ipSpace.net webinars are included with the standard ipSpace.net subscription. For even more details, check out the Building Next Generation Data Centers online course, available with the Expert ipSpace.net subscription.

10 comments:

  1. Hello Ivan,
    In this very respect, what I still do not understand is why vendors keep pushing, and (thus) customers keep purchasing, very cheap shallow-buffer leaves (some have a total of 16 MB for the whole box ...) as part of a Telco Cloud IP fabric. To me, the interfaces' clock differences on the leaves, the much higher RTTs than those typical of DCs (as the north-south component dominates the east-west one in a Telco Cloud), and the inability to control the TCP congestion control algorithm flavour in a Telco Cloud environment (as opposed to a DC environment) dramatically increase the chance of TCP-induced traffic bursts on such shallow-buffer leaves, and thus the chance of poor performance.
    Hope you Ivan/guys share the same design concern for today's Telco Cloud IP fabrics.
    Cheers,
    Andrea
    Replies
    1. Hi Andrea,

      As I wrote in the blog post, most people who know how TCP and buffers work agree that deep buffers don't make sense in environments like data center fabrics... with a few exceptions.

      Also, never focus on the $vendor-generated buzzwords like "Telco Cloud".

      If the so-called "Telco Cloud" runs TCP-based applications like most other data centers in this world, then it just might work with the same equipment that most other people use.

      OTOH, if all you run in your environment is CBR voice traffic using 64-byte UDP packets then you might need a different solution.

      Long story short: always understand the problem you're trying to solve first, and then try to figure out how other people with more experience solved similar problems.

      Hope this helps,
      Ivan
  2. @Andrea: The problem that you have is called the bandwidth-delay product. Here's a not-so-bad explanation of buffers and TCP (it's from Cisco, but bear with it): https://youtu.be/ETpIp6fSw_4
    Replies
    1. Hi Anonymous,
      Thanks - I know, I know! I once did R&D on router QoS and TCP performance in high bandwidth-delay-product environments. The video you posted, though, holds for canonical DCs only, not for Telco Cloud IP fabric environments, since 100 microseconds is considered a large RTT by Tom Edsall in the video!

      Coming back to us, I wouldn't be so sure about the current shallow-buffer trend (probably fueled by some variation of the 'Stanford model') on fabric leaves in a Telco Cloud environment after (re)reading the following papers by Dovrolis:
      https://www.caida.org/~amogh/papers/buffers-CCR06.pdf
      https://www.cc.gatech.edu/~dovrolis/Papers/ravi-final-conext07.pdf

      Ciao
      Andrea
    2. Those 100 microseconds of RTT were maybe just an example to keep the computation easy. Don't forget that 100 microseconds is 0.1 ms, which is pretty fast. According to the bandwidth-delay product, you'll need even smaller buffers when your RTT is low. But in a VoIP environment you're probably more interested in latency, and latency != RTT. UDP behaves differently than TCP (one has to keep the two separate). I can't imagine that big buffers would be beneficial in a 'Telco cloud'.
    3. No, those 100 microseconds are a typical RTT of a canonical DC environment, not of a Telco Cloud (e.g. a virtualized SP edge), which is a much more complex and challenging environment for TCP. That guy is selling Nexus switches to the DC market segment after all - it makes sense to me.
  3. Hello Ivan,
    Of course it helps - it's yours !! :)
    I am actually referring to the 'few exceptions' that you mentioned.
    The thing is that, as far as I can see, the very same IP fabric is used in traditional DC environments as well as in Telco Cloud environments, with the latter still being a DC, but with the data plane of the SP's edge boxes (i.e. Business/Mobile PEs and Residential NAS, among others) virtualised as VNFs. I reckon these two environments can be very different in terms of RTTs (especially for Business and Residential traffic, as part of the mobile traffic can be and is TCP-proxied) and traffic direction (the dominant north-south component of a Telco Cloud exacerbates the impact of the leaves' interface-speed mismatches on the traffic).
    To me, the "other people with more experience" seem to be the usual suspects (Google/Facebook/Amazon/...), and the way they "solved similar problems" seems to be using well-engineered TCP congestion control algorithms instead of CUBIC, for instance, when dealing with shallow-buffer boxes. This is something an SP cannot do.
    On a very similar note, the Google Global Cache deployment I am witnessing within my SP, for instance, uses a very-shallow-buffer Cisco box (16 MB in total) as its network front end, and my suspicion is that they are definitely not using CUBIC as the TCP congestion control algorithm, in order to avoid bursting and thus dropping. We will see...

    Hope it makes sense
    Ciao
    Andrea

    p.s. Having said that, in the end, should there be any performance issues within the Telco Cloud environment due to the leaves being shallow-buffered, then I guess the $vendor could just swap the much cheaper leaves with more expensive (deep-buffered, with gigabytes of ASIC buffer) boxes (now acting as spines only) and still win!!! :)
  4. Have a look at my answer above; it answers most of your questions/concerns. Additionally, in a Telco environment you may suffer from UDP dominance and TCP starvation. You can solve that problem by not mixing UDP and TCP traffic: use QoS (different queues) or separate them physically.
  5. Limit MLAG to within a rack, e.g. servers -> 2*ToR and L3-only Leaf/Spine from there. Unless of course you're running at real scale and can get away with a single ToR :)