TTL in Overlay Virtual Networks

After we get rid of the QoS FUD, the next question I usually get when discussing overlay networks is “how should these networks treat IP TTL?”

As (almost) always, the answer is “It depends.”

Layer-2 Virtual Networks

Overlay virtual networking solutions like VXLAN that implement layer-2 segments (effectively Ethernet-over-something) should not modify the VM-generated traffic. These solutions are emulating a transparent bridge and should NOT interact with the user traffic; all they can do is forward, flood or drop.

There’s the “minor” annoyance of CoS or DSCP packet marking, but let’s ignore that detail.

Obviously the transport TTL (TTL generated by hypervisor when encapsulating the VM traffic) shouldn’t reflect the VM-generated TTL. VM-generated TTL could be anything (VM could also generate non-IP traffic), while the transport TTL needs to be high enough to allow the packet to traverse the data center core.

Conclusions:

  • Don’t touch the overlay (VM) TTL;
  • Use whatever TTL makes sense in the transport network.
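
To make the two conclusions concrete, here is a minimal sketch of what a hypervisor does when it encapsulates a VM frame. It is not the code of any real VXLAN implementation: `OuterPacket`, `encapsulate`, `decapsulate` and the `TRANSPORT_TTL` value of 64 are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass

# Assumed value: anything large enough to cross the data center core works.
TRANSPORT_TTL = 64


@dataclass(frozen=True)
class OuterPacket:
    """Hypothetical model of an encapsulated packet in the transport network."""
    src_vtep: str
    dst_vtep: str
    ttl: int          # transport TTL, chosen by the hypervisor
    payload: bytes    # VM-generated Ethernet frame, carried byte for byte


def encapsulate(frame: bytes, src_vtep: str, dst_vtep: str,
                transport_ttl: int = TRANSPORT_TTL) -> OuterPacket:
    # The frame is never parsed: it might not even contain an IP packet,
    # so there is no VM TTL to read, copy or decrement.
    return OuterPacket(src_vtep, dst_vtep, transport_ttl, frame)


def decapsulate(packet: OuterPacket) -> bytes:
    # The transport TTL is discarded; the VM frame comes out unchanged.
    return packet.payload
```

Whatever the VM put into its frame (an IP packet with TTL = 1, or non-IP traffic) comes out of `decapsulate` exactly as it went into `encapsulate`.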

Layer-3 Virtual Networks

Solutions that implement layer-3 forwarding are usually emulating Ethernet segments (layer-2 segments) connected with routers. In some cases the whole virtual network acts as a single virtual router (VMware NSX Distributed Router, Hyper-V, NEC ProgrammableFlow …), in others the inter-subnet traffic flows through a gateway appliance or a VM (VMware NSX Services Router, default OpenStack networking …).

These solutions SHOULD decrement TTL like any other router (or layer-3 switch) would do. If they wish to stay as close to the emulated Ethernet behavior as possible, they SHOULD decrement TTL if and only if the packet crosses subnet boundaries (or you might get crazy problems with application software that sends packets with TTL = 1).

For example, Hyper-V Network Virtualization SHOULD NOT decrement TTL if the source and destination VM belong to the same subnet (even though the HNV module actually performs L3 lookup to figure out where to send the packet) but SHOULD decrement TTL if the destination VM belongs to another IPv4 or IPv6 subnet.
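
The "if and only if the packet crosses subnet boundaries" rule can be sketched as follows. This is an illustration of the rule only, not the HNV forwarding code; `forward_ttl` is a hypothetical helper, and it assumes the forwarding element knows the prefix length of the source VM's subnet.

```python
import ipaddress
from typing import Optional


def forward_ttl(src_ip: str, dst_ip: str, prefix_len: int,
                ttl: int) -> Optional[int]:
    """Return the TTL (or IPv6 hop limit) to forward with, or None to drop."""
    src_subnet = ipaddress.ip_interface(f"{src_ip}/{prefix_len}").network
    if ipaddress.ip_address(dst_ip) in src_subnet:
        # Same subnet: behave like a bridge and leave the TTL alone,
        # even if an L3 lookup was used to find the destination.
        return ttl
    if ttl <= 1:
        # Different subnet and the TTL would reach zero: a router drops
        # the packet (and would normally send ICMP Time Exceeded).
        return None
    # Different subnet: decrement like any other router.
    return ttl - 1
```

A packet sent with TTL = 1 still reaches a VM in the same subnet, but is dropped when the destination is in another subnet, which is what the same application would experience on physical Ethernet segments connected by a router.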

Like in the layer-2 case, the transport TTL has nothing in common with the VM-generated TTL – hypervisors should use whatever TTL they need to get the encapsulated traffic across the data center fabric.

Conclusions:

  • Decrement TTL like a router would do;
  • Don’t copy overlay TTL into transport TTL or vice versa;
  • Use whatever TTL makes sense in the transport network.

Virtual routers used to implement virtual networks in large public clouds usually don’t decrement TTL when a packet crosses a virtual subnet boundary. It’s impossible to do a traceroute within an AWS or Azure environment.

But this is not how MPLS works

Really? Well, this is EXACTLY how L2VPNs (EoMPLS, VPLS, EVPN) work.

MPLS-based L3VPN (the “original” MPLS/VPN) is a totally different story: it’s not supposed to emulate a single virtual router, but a whole WAN. Copying customer TTL into provider TTL (and vice versa) is the most natural thing to do under those circumstances (unless the MPLS provider disables TTL propagation because they want to hide the internal network details).
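
The difference TTL propagation makes to the customer can be sketched with a deliberately simplified model. `customer_ttl_after_core` is a hypothetical function; it ignores penultimate-hop popping and the exact point at which real platforms decrement TTL at the egress edge, and the value 255 for the non-propagated label TTL is an assumption.

```python
def customer_ttl_after_core(ip_ttl: int, core_hops: int,
                            propagate: bool) -> int:
    """Customer IP TTL after crossing an MPLS core (simplified model)."""
    ip_ttl -= 1  # the ingress PE router is a regular routed hop

    # With propagation the customer TTL is copied into the label;
    # without it the label starts from a fixed value (255 assumed here).
    label_ttl = ip_ttl if propagate else 255

    # Core routers switch on labels and decrement only the label TTL.
    label_ttl -= core_hops

    if propagate:
        # When the label is removed, the provider TTL is copied back.
        ip_ttl = min(ip_ttl, label_ttl)
    return ip_ttl
```

With three core hops and an initial TTL of 64, the model returns 60 when TTL is propagated and 63 when it is not: in the second case the whole provider core looks like a single hop to the customer.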

More information

Watch the Cloud Computing Networking webinar if you need an overview of various virtual network technologies, Overlay Virtual Networking for an overview of what major vendors have to offer, VXLAN deep dive if you’re interested in VXLAN implementation details, and VMware NSX Technical Deep Dive Architecture if you want to know how NSX works.

9 comments:

  1. Agree that TTL should not be copied. What happens if there is a sync issue between the controller and the hypervisor, leading to a loop in the network?
    Replies
    1. As always:
      * If you have an L3 forwarding loop, overlay TTL will eventually expire.
      * If you have an L2 forwarding loop, you'll get the same fancy effects as in physical L2 networks (the only difference being that the looped packets will hose a few servers, not the whole network).
    2. Consider a case where VM-1 on hypervisor H1 is talking to VM-2 on H2. Assume a programming error where, for VM-2, H1 points to H3 and H3 points to H1. If TTL is not copied from one tunnel to another, won't there be a loop?
  2. AWS however does not decrement TTL
    http://cloudierthanthou.wordpress.com/2013/04/30/the-sdn-behemoth-hiding-in-plain-sight/
    Replies
    1. Thanks for the information & the link - and it's so refreshing to see someone whose view of SDN and pixie dust is so aligned with mine.

      Would you do one more AWS VPC test? Add a third VM in one of the subnets, ping between all three and dump ARP tables on all three VMs.

      Thank you!
      Ivan
  3. If my notes are accurate, then the ARP request from one VM to another never reaches the other VM. Clearly some kind of ARP proxy answers. The arp table shows the entries for the other VM in the same subnet and that of the implied router.
    Also, while you can ping your gateway (10.0.0.33->10.0.0.1), you cannot ping the gateway on the other subnet (10.0.0.33->10.0.1.1).
    Replies
    1. Thank you! I was more interested in the MAC addresses in the ARP table - they should all be the same, regardless of the IP subnet of the destination.
  4. No, each VM has a distinct MAC.
    [ec2-user@ip-10-0-0-33 ~]$ sudo arp -n
    Address    HWtype  HWaddress          Flags Mask  Iface
    10.0.0.6   ether   02:c5:98:d1:b4:69  C           eth0
    10.0.0.16  ether   02:c5:98:d7:5c:43  C           eth0
    10.0.0.1   ether   02:c5:98:c0:00:02  C           eth0
  5. I know this is an old post, but for the benefit of anybody who runs into this blog, this video should answer some of the questions you had:
    https://www.youtube.com/watch?v=3qln2u1Vr2E