VXLAN and OTV: The Saga Continues

Randall Greer left a comment on my Revisited: Layer-2 DCI over VXLAN post saying:

Could you please elaborate on how VXLAN is a better option than OTV? As far as I can see, OTV doesn't suffer from the traffic tromboning you get from VXLAN. Sure you have to stretch your VLANs, but you're protected from bridging failures going over your DCI. OTV is also able to have multiple edge devices per site, so there's no single failure domain. It's even integrated with LISP to mitigate any sub-optimal traffic flows.

Before going through the individual points, let’s focus on the big picture: the failure domains.

The Big Picture

Fact: a layer-2 network (VLAN) is a single failure domain. You might not believe it… until you experience your first STP-induced loop or broadcast storm (but then, there are people who believe in the bandwidth fairy and a flat earth).

Now let’s compare the bridging domains in three scenarios:

  • VXLAN VTEP on the hypervisor
  • VXLAN VTEP on the first ToR switch
  • VXLAN VTEP or OTV at the WAN edge router.

When you place the tunnel encapsulation endpoint on the WAN edge router, you turn at least one data center into a single failure domain. With VXLAN VTEPs in the hypervisors, the failure domain is limited to the hypervisors participating in the same segment, and if the hypervisors use head-end packet replication to emulate BUM flooding, a VM-triggered broadcast storm cannot do too much damage. A VTEP on the first ToR switch sits somewhere in between: the failure domain extends to every server attached to a ToR switch carrying that VLAN. The proof is left as an exercise for the reader.
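
To put rough numbers on those failure domains, here's a toy model; the topology and all counts are made up purely to illustrate how the blast radius scales with VTEP placement:

    # Toy model of broadcast-storm blast radius for the three VTEP placements above.
    # All numbers and the topology are invented for illustration only.

    from dataclasses import dataclass

    @dataclass
    class DataCenter:
        hypervisors: int           # hypervisors in the data center
        hypervisors_in_vni: int    # hypervisors hosting VMs in the stormy segment
        tors: int                  # ToR switches in the data center
        tors_with_vlan: int        # ToR switches carrying the stormy VLAN

    def blast_radius(dc, vtep_placement):
        """How many hypervisors a broadcast storm in one segment can reach."""
        if vtep_placement == "hypervisor":
            # Head-end replication: only hypervisors participating in the VNI get the flood
            return dc.hypervisors_in_vni
        if vtep_placement == "tor":
            # The VLAN exists on every server port behind a ToR switch that carries it
            return dc.tors_with_vlan * (dc.hypervisors // dc.tors)
        if vtep_placement == "wan_edge":
            # The VLAN must be bridged across the whole fabric to reach the WAN edge
            return dc.hypervisors
        raise ValueError(vtep_placement)

    dc = DataCenter(hypervisors=1000, hypervisors_in_vni=40, tors=50, tors_with_vlan=10)
    for placement in ("hypervisor", "tor", "wan_edge"):
        print(placement, "->", blast_radius(dc, placement), "hypervisors exposed")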

I will not go into the horror stories of multi-DC meltdowns; you might not believe them anyway. I know people who have experienced them… and then started listening to my advice.

The Details

Now let’s go through Randall’s points:

OTV doesn’t suffer from traffic tromboning. You might solve the northbound (DC-to-WAN) traffic flow in an OTV scenario with first-hop gateway localization (a kludge that shouldn’t have existed). You won’t even start solving the southbound traffic flow problems until you introduce LISP, and the solution will remain suboptimal until you deploy LISP at the remote sites.
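
A minimal sketch of why the southbound (inbound) flows keep tromboning without LISP: remote routers only know the stretched subnet, which is still advertised from the original data center, so they keep sending traffic there even after the VM has moved. The prefixes and site names below are invented for illustration, and the dictionary merely stands in for a LISP mapping system:

    import ipaddress

    # Without LISP, a remote site sees only the stretched subnet, still advertised
    # from DC-A, so traffic for a VM that moved to DC-B lands in DC-A first and
    # trombones across the DCI.
    routing_table = {ipaddress.ip_network("10.1.1.0/24"): "DC-A"}

    # What a LISP mapping system adds: host-granularity (EID-to-RLOC) mappings.
    lisp_mappings = {ipaddress.ip_address("10.1.1.25"): "DC-B"}   # this VM moved to DC-B

    def ingress_site(dst, lisp_enabled):
        addr = ipaddress.ip_address(dst)
        if lisp_enabled and addr in lisp_mappings:
            return lisp_mappings[addr]
        for prefix, site in routing_table.items():   # single entry here, so order is moot
            if addr in prefix:
                return site
        return "default route"

    print(ingress_site("10.1.1.25", lisp_enabled=False))   # DC-A -> trombone over the DCI
    print(ingress_site("10.1.1.25", lisp_enabled=True))    # DC-B -> optimal, but only where LISP runs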

You cannot solve east-west tromboning (traffic between VMs in different data centers), and all your glorious ideas bite the dust the moment someone inserts a stateful appliance (firewall or load balancer) in the forwarding path… unless you plan to deploy a stretched firewall cluster, in which case I’ll wish you luck and walk away in disgust.

TL&DR: Stop being a MacGyver and get a life. You might also want to watch this video.

You’re protected from bridging failures going over your DCI. This assumes that I’m OK with bridging failures bringing down one of my data centers. I don’t have a dozen data centers, so I don’t want to risk even one of them.

Also, if a VM goes broadcast-crazy and pumps 10Gbps of BUM traffic into your data center core, the OTV solution will either (A) saturate the WAN link or (B) saturate the server links in the other data center.

Before telling me that I don’t know what storm control is, please do read how it actually works. A single BUM-crazy VM will kill almost all BUM traffic on your OTV link, effectively breaking ARP (and a few other things) across stretched VLANs.
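
A back-of-the-envelope model of that behavior; all packet rates below are invented purely for illustration:

    # Rough model of an aggregate BUM policer (how storm control behaves): it caps
    # the total broadcast/unknown-unicast rate on a link and drops everything above
    # the threshold, with no idea who the offender is.

    def surviving_legit_bum(legit_pps, offender_pps, limit_pps):
        """Legitimate BUM traffic (e.g., ARP) that still gets through the blind policer."""
        total = legit_pps + offender_pps
        if total <= limit_pps:
            return float(legit_pps)
        # The policer forwards at most limit_pps and cannot tell flows apart, so
        # legitimate traffic survives only in proportion to its share of the mix.
        return limit_pps * legit_pps / total

    print(surviving_legit_bum(200, 0, 1000))        # 200.0 pps: all legitimate ARP gets through
    print(surviving_legit_bum(200, 500000, 1000))   # ~0.4 pps: ARP is effectively dead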

OTV is also able to have multiple edge devices per site, so there's no single failure domain. A single point of failure is not the same thing as a single failure domain.

OTV is even integrated with LISP… and you get optimal southbound traffic flow only when you deploy LISP at all remote sites, which remains a BHAG (big hairy audacious goal) for the global Internet.

15 comments:

  1. What I will never understand is the insistence on doing L2 DCI to begin with. No matter what the solution is, it will always be kludgy, as it's outside of Ethernet's design scope. Problems need to be solved at the proper layer, and application redundancy is not a network problem in my opinion. Am I off base?
    1. Neither will I ... or a few other people:

      http://blog.ipspace.net/2012/03/stretched-layer-2-subnets-server.html

      It is, however, more lucrative from the vendors' or system integrators' perspective to sell more boxes (or more complex boxes) to the customer than to tell them they should fix their apps.
    2. While I agree that stretched layer 2 does not make sense from a networking perspective, your assertion that "they should fix their apps" lives in a world of unicorns and fairies. That will never happen. There is not a chance that an Oracle ERP with thousands of moving parts will ever work the way we wish it would (OK, maybe it is possible, but how many millions would it take…). Just my perspective from higher ed.
    3. Yep, and mission-critical apps will never move from mainframes to minicomputers (let alone PCs). Been there, heard that.

      It will just take a while longer in some environments than others.
  2. Nice responses to the readers' questions. I agree with them, and additionally, state synchronization between the firewalls in both data centers (and between other stateful devices) could be necessary if the RTO requirement is zero or near zero in case one of the data centers fails.

    But of course, just because you can do it does not mean you should do it. DCI link failures might cause more issues than the setup would solve.

    Also, regarding the OTV edge devices, what I would add to your excellent and meaningful response :) is that although they are not a single point of failure, they cannot load-share within a single VLAN because of the way OTV avoids forwarding loops. So the convergence time needs to be considered (and maybe tuned), and the odd/even VLAN separation needs to be arranged carefully for multihoming (management complexity!); see the sketch below.
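    A rough sketch of that odd/even split: with two OTV edge devices, each extended VLAN is forwarded over the overlay by exactly one authoritative edge device (AED), so the devices share load across VLANs but never both forward the same VLAN. The exact election rules are platform-specific; the parity split below is only meant to illustrate the idea:

      # Illustrative only: each VLAN is owned by exactly one authoritative edge
      # device (AED), so two edge devices split the VLANs between them instead of
      # load-sharing within a single VLAN.
      EDGE_DEVICES = ["otv-ed-1", "otv-ed-2"]

      def authoritative_edge_device(vlan_id):
          return EDGE_DEVICES[vlan_id % len(EDGE_DEVICES)]

      for vlan in (100, 101, 200, 201):
          print("VLAN", vlan, "is forwarded by", authoritative_edge_device(vlan))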
  3. "but you're protected from bridging failures going over your DCI"

    You are only protected while nothing has failed. It doesn't matter what protocol/technology/features you are using, or what protection mechanisms you have in place. A faulty supervisor module can quite easily show you the extent of your single failure domain; I speak from experience ;)
  4. What about the rest? What do you think about H3C's EVI or Huawei's EVN?

    http://www.h3c.com/portal/download.do?id=1798081
    http://enterprise.huawei.com/ilink/cnenterprise/download/HW_341904
    1. Same set of problems. Some things (like squaring the circle) cannot be solved using rational means.
  5. VXLAN is an overlay network that aims to extend Layer 2 over an IP network; that doesn't mean it is a DCI solution. The same statement could be made about L2TPv3 or EoMPLS, to list just a few: they aim to extend L2 segments, but are not natively DCI solutions per se. And that's the reason OTV or EVPN exist, to natively offer a solid DCI solution. This is not just a feature-to-feature comparison; above all, it requires an architectural comparison. As described in your drawing, VXLAN is an end-to-end encapsulation, meaning that from a transport point of view it is a flat architecture. OTV is a DC WAN edge encapsulation, which means it is a hierarchical architecture, and according to this principle it better isolates the failure domain inside the DC and offers better scalability and convergence.

    With OTV and FHRP localization, you can definitely reduce the tromboning effect for E-W traffic (inter-tier application workflows) without the need to deploy LISP IP mobility or other southbound optimization services. The Layer 3 gateway of interest is active on each site, so the traffic between application tiers is contained inside each DC without hairpinning workflows across sites. This is currently challenging with VXLAN alone.

    With OTV, unknown unicast frames are not flooded, and storm-control configuration is simplified because of the failure isolation functionality natively available with OTV (for broadcast and unknown unicast). Thus a VM with 10GE of throughput that becomes "broadcast-crazy" should not impact the secondary site if the right tools are implemented accordingly. This is unfortunately not completely true with VXLAN, and it might be very challenging to rate-limit an IP multicast group. As a result, with the current implementation of VXLAN with no control plane, the failure domain is extended throughout the whole VXLAN domain, with a high risk of saturating the DCI links and impacting the remote sites.
    Note that while VXLAN's requirement for an IP multicast network intra-DC may be acceptable to the enterprise, it would also require IP multicast to be enabled on the inter-DC network, which might be more challenging for the enterprise and/or the SP managing the inter-site IP network.
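    An illustrative sketch of the difference described above (the VTEP names, multicast group, and MAC addresses are all made up): flood-and-learn VXLAN delivers BUM frames to every VTEP joined to the VNI's multicast group, in both data centers, while an OTV edge device forwards across the overlay only toward MAC addresses learned through its control plane and drops unknown unicast:

      # Flood-and-learn VXLAN: any BUM frame is replicated by the underlay to every
      # VTEP joined to the VNI's multicast group, in both data centers.
      vxlan_group_members = {
          "239.1.1.1": ["vtep-a1", "vtep-a2", "vtep-b1", "vtep-b2"],   # VNI spans DC-A and DC-B
      }

      # OTV: MAC reachability comes from the control plane; unknown unicast is not
      # flooded across the overlay, it is simply dropped.
      otv_mac_table = {"0000.5e00.01aa": "site-B"}

      def vxlan_flood(group):
          return vxlan_group_members[group]

      def otv_forward(dst_mac):
          return otv_mac_table.get(dst_mac)           # None = dropped, not flooded

      print(vxlan_flood("239.1.1.1"))                 # reaches both data centers
      print(otv_forward("0000.5e00.01aa"))            # known MAC -> site-B
      print(otv_forward("dead.beef.0001"))            # unknown unicast -> None (dropped)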

    OTV in conjunction with LISP IP mobility and FHRP localization offers a real optimization for long distances between DCs. When LISP IP mobility is deployed with extended subnets, it is complementary to OTV but not mandatory. It is a choice the network manager will make based on how the high latency induced by long distances may impact the enterprise business. Anyhow, for most enterprises deploying DCI, the WAN access network is an intranet where LISP is easy to deploy, and rarely a global Internet core per se.

    What is said today may be a bit different in the future. Solutions for distributed virtual DCs are evolving to improve robustness, efficiency, and optimization, including FW clustering stretched across multiple locations. It is important for enterprises to pay attention to this fast evolution.

    Finally, long story short, it is certainly worth reading the detailed posts I wrote discussing the current VXLAN release and DCI, including LISP mobility and FW clustering:

    26 – Is VxLAN a DCI solution for LAN extension ? 
    http://yves-louis.com/DCI/?p=648
    23 – LISP Mobility in a virtualized environment 
    http://yves-louis.com/DCI/?p=428
    27 – Stateful Firewall devices and DCI challenges 
    http://yves-louis.com/DCI/?p=785
    1. Hi Yves!

      Thanks for an extensive reply. We agree on quite a few topics, and will have to agree to disagree on a few others.

      First a nit: most overlay virtual networking implementations no longer require multicast, the only exception being vShield suite (which is a zombie anyway).

      More importantly, you haven't addressed my major concern: OTV requires a DC-wide bridging domain because it sits at the edge of a data center (and I will defer the LISP discussion for some other time ;).

      Finally, while I understand your excitement regarding the long-distance ASA cluster, do tell me what happens when the DCI link fails?

      Kind regards,
      Ivan
    2. This comment has been removed by the author.
    3. (Let me split the answers into three separate replies, otherwise the whole response gets too long.)
      Thank you Ivan for the fast reply. Let me give you my thoughts here. Please note that I'm only replying to the comments made previously; several other requirements for building a solid DCI are described in the posts I mentioned above.
       
      Ivan>> First a nit: most overlay virtual networking implementations no longer require multicast, the only exception being vShield suite (which is a zombie anyway).
       
      yves >> yes and no :) - But this point is crucial.
       
      From a software-based VXLAN point of view, there are indeed several vendors who already offer unicast-only mode.
      From a ToR-based VXLAN point of view, as far as I'm aware, most vendors are still enabling VXLAN in multicast mode (TBC).
      We will certainly come back to this last statement in the near future, but as I said, things are moving quickly; here I'm talking about what is available today.
       
      Back to software-based VXLAN. Leaving aside discussions of scalability and performance, no one wants to (or can) deploy a fully virtual environment; it's always a hybrid network that has to communicate with traditional platforms, hence we need an L2 VTEP gateway for the VLAN ID <=> VNI mapping. Unfortunately, only one active L2 gateway per VXLAN domain is currently supported with software-based VTEPs. That is the reason I left the software-based solution aside, even though it offers unicast-only mode. If that assumption is correct, then software-based VXLAN cannot be a DCI solution, as I explained in the posts below.
       
      In short, if we want to extend a set of VLANs across two or more locations using VXLAN as the DCI transport, with only one active L2 gateway for the whole VXLAN domain, how can a VLAN to be extended at a remote site ever reach that active L2 gateway? ToR-based VXLAN can support distributed active L2 gateways today, but does not currently offer a control plane for the distribution and learning process.
      26 – Is VxLAN a DCI solution for LAN extension?
      http://yves-louis.com/DCI/?p=648 and
      26-bis – VxLAN VTEP GW: Software versus Hardware-based
      http://yves-louis.com/DCI/?p=742
    4. Ivan>> More importantly, you haven’t addressed my major concern: OTV requires a DC-wide bridging domain because it sits at the edge of a data center (and I will defer the LISP discussion for some other time ;).
       
      yves >> OTV is very flexible; it can actually sit at the aggregation PoDs if we wish, where the L2/L3 boundary exists (keeping the DC core a pure L3 network). If we look at the evolution of the network fabric, VXLAN intra-DC is one solution. We can contain the failure domain inside the ToR, extend the VLAN intra-DC using a VXLAN overlay from the ToR leaf, terminate the VXLAN tunnel at the DCI layer, and finally map the original VLAN to the OTV overlay, for example; the bridging domain thus stays limited to the access PoD instead of being spread across the whole DC infrastructure.
      VLAN ⇐ToR_VTEP⇒ VXLAN ⇐Core_VTEP⇒ VLAN ⇐DCI_OTV⇒
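      A tiny sketch of that mapping chain; all VLAN/VNI numbers and the overlay name are invented purely for illustration:

        VLAN_TO_VNI = {100: 5100}                # ToR VTEP: server VLAN -> VXLAN VNI (intra-DC)
        VNI_TO_VLAN = {5100: 100}                # Core VTEP: VNI -> VLAN handed to the DCI layer
        VLAN_TO_OTV_OVERLAY = {100: "overlay1"}  # OTV edge: extended VLAN -> OTV overlay

        def dci_chain(server_vlan):
            vni = VLAN_TO_VNI[server_vlan]            # bridging domain stays inside the access PoD
            dci_vlan = VNI_TO_VLAN[vni]               # VXLAN tunnel terminates at the DCI layer
            overlay = VLAN_TO_OTV_OVERLAY[dci_vlan]   # OTV extends the VLAN between sites
            return vni, dci_vlan, overlay

        print(dci_chain(100))                         # (5100, 100, 'overlay1')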
    5. Ivan>> Finally, while I understand your excitement regarding the long-distance ASA cluster, do tell me what happens when the DCI link fails?
       
      yves >> This is a fair comment. There are actually two levels of LAN extension when deploying active/active FW clustering stretched across multiple locations: one for the ASA clustering (extending the cluster control link) and one for the application and user data (traditional DCI). Obviously, all the redundant network connections needed for a resilient solution must exist, but let's assume the DCI dies.
       
      The key points are that, first of all, all FW appliances are active, and secondly, there is a way to dynamically steer inbound and outbound traffic to the DC hosting the applications of interest.
       
      Hence the best option is a combination of ASA clustering stretched across two sites (using a single cluster control plane) with a LAN extension (such as OTV, allowing stateful "hot" live migration), in conjunction with LISP IP mobility (ingress redirection) and FHRP localization (egress redirection).
       
      If the DCI breaks, we will get a split brain, as with any other cluster. The main difference is that with ASA clustering all FW appliances (physical or virtual) are active, and thanks to LISP IP mobility and FHRP localization the traffic will follow the location of the apps per site and will stay pinned to the site where the apps reside, since hot live migration will no longer be possible. However, disaster recovery with cold migration should still be supported. Both data centers become two active, independent DCs, and the same security rules remain intact on both sites, but now with two FW control planes.
       
      More to read in this article
      27 – Bis – Path Optimisation with ASA cluster stretched across long distances – Part 2
      http://yves-louis.com/DCI/?p=856
       
  6. Sadly, Randall has once again set aside time to embarrass himself publicly. Sorry everyone.