Asymmetric MPLS MTU problem

Russell Heilling made a highly interesting observation in a comment on my MPLS MTU challenges post: you could get asymmetric MTUs in MPLS networks due to penultimate hop popping.

Imagine our network has the following topology (drawn with the fantastic tools used by the RFC authors):

S---CE---R1===R2---FW---C

The only link using MPLS is between R1 and R2. FW is a misconfigured firewall blocking all ICMP packets. Furthermore, FW uses NAT, making the client C appear to be directly connected to R2. Layer-2 payload size (known as MTU on Cisco IOS) on all links is 1500 bytes. Unlabeled IP packets can thus be up to 1500 bytes long; labeled IP packets cannot exceed 1496 bytes with a single-label stack (every additional label in the MPLS label stack shaves off another four bytes).
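The arithmetic is simple enough to sketch in a few lines of Python (a back-of-the-envelope illustration; the four bytes per label come from the MPLS shim header):

```python
MPLS_LABEL_SIZE = 4  # bytes per MPLS label-stack entry (shim header)

def max_ip_packet(link_mtu: int, label_stack_depth: int) -> int:
    """Largest IP packet that still fits on a link after the label stack is pushed."""
    return link_mtu - label_stack_depth * MPLS_LABEL_SIZE

print(max_ip_packet(1500, 0))  # unlabeled IP packet: 1500 bytes
print(max_ip_packet(1500, 1))  # one label (plain MPLS): 1496 bytes
print(max_ip_packet(1500, 2))  # two labels (e.g. MPLS/VPN): 1492 bytes
```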

R1 and R2 advertise labels for all known prefixes to each other using LDP. R1 advertises a “real” label for S (because it’s reachable through a next-hop router); R2 advertises implicit null label for FW/C to R1 to enable penultimate hop popping.

Routers advertise implicit null labels for directly connected prefixes and summary routes pointing to null0. You can change that behavior with the mpls ldp explicit-null global configuration command that also allows you to limit the use of explicit null to specific IP prefixes or LDP peers.
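On Cisco IOS the relevant configuration might look like the sketch below (verify the exact syntax on your platform; PREFIX_ACL and PEER_ACL are placeholder access-list names):

```
! Advertise explicit null instead of implicit null for all prefixes
mpls ldp explicit-null
!
! ... or limit explicit null to specific prefixes and/or LDP peers
mpls ldp explicit-null for PREFIX_ACL to PEER_ACL
```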

When the server S sends a packet to client C, R1 should send a labeled packet to R2, but because R2 advertised the implicit null label, the MPLS label stack is empty; the IP MTU from S to C is 1500 bytes.

When C sends a packet to S, R2 inserts a single MPLS label in front of the IP payload (remember: R1 advertised a non-null label for S); the IP MTU from C to S is 1496 bytes.

In properly configured networks, asymmetric MTUs wouldn't matter; combined with misconfigured firewalls, they can be fatal and extremely hard to troubleshoot. In our scenario, the client would be able to download any content from the server (unless the HTTP request header or its equivalent grew beyond 1456 bytes), but would fail to upload anything larger than ~1400 bytes.
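To see why uploads break, follow the TCP MSS arithmetic (a sketch assuming the usual 20-byte IP and 20-byte TCP headers and a single MPLS label):

```python
IP_TCP_HEADERS = 40  # 20-byte IP header + 20-byte TCP header
MPLS_LABEL = 4       # one label-stack entry pushed by R2 on the C-to-S path

def mss_from_mtu(ip_mtu: int) -> int:
    """MSS a host derives from its local IP MTU."""
    return ip_mtu - IP_TCP_HEADERS

# Both S and C sit behind 1500-byte links, so both advertise MSS 1460.
mss = mss_from_mtu(1500)

# A full-sized upload segment becomes a 1500-byte IP packet...
packet = mss + IP_TCP_HEADERS

# ...which no longer fits on the R1-R2 link once R2 pushes a label.
print(packet + MPLS_LABEL > 1500)  # True: the packet is dropped, and the
                                   # "Fragmentation needed" ICMP message
                                   # dies at the misconfigured firewall
```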

More information

To learn more about path MTU discovery and related problems, read the Never-Ending Story of IP Fragmentation.

You’ll find an in-depth description of the MPLS/VPN technology and enterprise network deployment hints in our Enterprise MPLS/VPN Deployment webinar. For more VPN webinars, check our VPN webinar roadmap. You get access to all those webinars when you buy the yearly subscription.

12 comments:

  1. I totally agree this shouldn't be a problem if well managed, but I'd like to share an "issue" I had to troubleshoot a couple of years ago.

    Consider the network above.

    The system administrator logs on to C, and tries to ping S. With the DF bit set, 1496 byte pings are successful. 1497+ pings result in a "Fragmentation needed" ICMP response. Groovy.

    Now he logs on to S and tries the same thing towards C. He finds that 1496 bytes is fine as before. 1500+ generates the "Fragmentation needed" as before. However there is a 4 byte "black hole" where there are no responses seen.

The reason for this is that the echo request packet actually makes it all the way from S to C, but the echo reply packet is too big to get back. An ICMP error message gets sent to C, but it is never shown to the originator of the ping at S, hence the apparent black hole (this issue is specific to ping, though - real application traffic gets through just fine).

    This caused a fair amount of head scratching before we got to the bottom of it.
  2. I had to read it twice to understand it, so let me add a few more details:

    #1 - You don't need the FW (or NAT or anything else) to generate the behavior Russell is describing. Simple IP/MPLS network with asymmetric MTUs (which you always get) is enough.

    #2 - The apparent black hole occurs because the "Fragmentation needed" message is sent to the host sending ICMP reply (not request). That host cannot do anything; there's no "I have a problem" ICMP message it could send to the pinging host.

    #3 - TCP traffic works just fine because the "Fragmentation needed" message is always sent to the host sending the oversized TCP segment (which then splits the data into smaller segments).
  3. Yeah, it is difficult to describe this stuff without a diagram. I did a presentation with animations for internal consumption at $employer. I really should get around to a more generic explanation I could shove on my blog...

    ICMP within MPLS VPNs adds a whole new layer of interesting here too. When the router generating the error doesn't have a route to the source, it sends the error forward to the destination...
  4. Why aren't SPs enabling larger MTUs on their backbones?
  5. Well-designed MPLS networks have larger MTUs, but sometimes MPLS is turned on later (and the MTU issue is missed).
  6. Considering the LSP ends on R1, wouldn't it be better to advertise the route to S with an implicit null label (to avoid a double lookup)?

    If you did this, the only drawback I see is that, if you enabled LDP on CE (or had problems with LDP), the change would propagate beyond R1 to its neighbours. Am I missing something else?

    If this is a trivial question answered elsewhere, I'd appreciate a pointer in the right direction.
  7. Most service provider networks I have worked on do enable larger MTUs on the core. Sometimes edge technologies will limit the available MTU at the edge. In the specific example I saw in the past, the "C" site was on the end of an ADSL line with an MTU of 1480 *before* the MPLS was added.

    Yes, we run MPLS over DSL here. It even works as long as you are aware of the caveats ;)
  8. The example in the article is perhaps a little oversimplified, and you probably *would* see PHP in both directions in this example. This behaviour does exist in real-world examples though.

    S---CE1---P1===P2===P3---CE2---C

    If P1===P2 has an MTU of 2000, but P2===P3 has an MTU of 1500, with PHP/Implicit Null you will be able to get 1500 bytes from S to C, but only 1496 from C to S.

    Asymmetric MTU is not a problem to be solved. It is an issue to be understood. If it is considered in the design there is no reason things will not behave in a predictable manner.
  9. RFC3988 does give you the freedom to signal the imp-null MTU how you want (i.e. you can signal the real link MTU or you can signal as if exp-null were present); wonder why Cisco doesn't have a knob to do this?
  10. Well, that RFC is (A) experimental and (B) written by Juniper. That might explain lack of support in Cisco IOS.

    Also, the end-to-end LSP MTU signaling is largely irrelevant in most cases; interim LSRs send ICMP error messages with the original label stack, so they eventually arrive at the intended recipient.

    Knowing the end-to-end MTU would be quite important, though, if the interim LSRs had no data-plane IP capability (plus it would allow you to push error processing to the network edges, which is always a good idea).
  11. >RFC is (A) experimental (B) written by Juniper

    Well, there is a Huawei authored standards track draft at the moment for end-to-end RFC3107 signalled LSPs - http://tools.ietf.org/html/draft-zeng-idr-bgp-mtu-extension-00
  12. That clarifies it. Thanks.