Asymmetric MPLS MTU problem
Russell Heilling made a highly interesting observation in a comment to my MPLS MTU challenges post: you could get asymmetric MTUs in MPLS networks due to penultimate hop popping.
Imagine our network has the following topology (drawn with the fantastic tools used by the RFC authors):
S---CE---R1===R2---FW---C
The only link using MPLS is between R1 and R2. FW is a misconfigured firewall blocking all ICMP packets. Furthermore, FW uses NAT, making the client C appear to be directly connected to R2. Layer-2 payload size (known as MTU on Cisco IOS) on all links is 1500 bytes. Unlabeled IP packets can be up to 1500 bytes long; labeled IP packets cannot exceed 1496 bytes (less, if the MPLS label stack is deeper than one label).
R1 and R2 advertise labels for all known prefixes to each other using LDP. R1 advertises a “real” label for S (because it’s reachable through a next-hop router); R2 advertises implicit null label for FW/C to R1 to enable penultimate hop popping.
When the server S sends a packet to client C, R1 should send a labeled packet to R2, but due to implicit null advertised by R2, the MPLS label stack is empty; IP MTU from S to C is 1500 bytes.
When C sends a packet to S, R2 inserts a single MPLS label in front of the IP payload (remember: R1 advertised a non-null label for S to R2); IP MTU from C to S is 1496 bytes.
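The arithmetic behind the two directions can be sketched in a few lines (a toy model, not router code; the 4-byte figure is the size of a single MPLS label stack entry):

```python
LABEL_SIZE = 4      # one MPLS label stack entry, in bytes
LINK_MTU = 1500     # layer-2 payload size on every link in the example

def max_ip_packet(link_mtu: int, label_stack_depth: int) -> int:
    """Largest IP packet that fits once the label stack is pushed."""
    return link_mtu - LABEL_SIZE * label_stack_depth

# S -> C: R2 advertised implicit null, so R1 forwards an unlabeled packet
print(max_ip_packet(LINK_MTU, 0))   # 1500
# C -> S: R2 pushes one label toward R1
print(max_ip_packet(LINK_MTU, 1))   # 1496
```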
In properly configured networks, asymmetric MTUs wouldn't matter; combined with misconfigured firewalls they can be fatal and extremely hard to troubleshoot. In our scenario, the client would be able to download any content from the server (unless the HTTP request header or its equivalent grows beyond 1456 bytes), but would fail to upload anything longer than ~1400 bytes.
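The 1456-byte figure comes from straightforward header arithmetic (assuming a 20-byte IPv4 header and a 20-byte TCP header with no options):

```python
IP_HEADER = 20            # IPv4 header without options
TCP_HEADER = 20           # TCP header without options
labeled_path_mtu = 1496   # C -> S direction carries one MPLS label

# Largest TCP payload (e.g. an HTTP request) that still fits in a
# single packet on the labeled return path
max_payload = labeled_path_mtu - IP_HEADER - TCP_HEADER
print(max_payload)        # 1456
```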
More information
To learn more about MTU path discovery and related problems, read the Never-Ending Story of IP Fragmentation.
You’ll find in-depth description of MPLS/VPN technology and enterprise network deployment hints in our Enterprise MPLS/VPN Deployment. For more VPN webinars, check our VPN webinar roadmap. You get access to all those webinars when you buy the yearly subscription.
Consider the network above.
The system administrator logs on to C and tries to ping S. With the DF bit set, 1496-byte pings are successful, and 1497-byte and larger pings result in a "Fragmentation needed" ICMP response. Groovy.
Now he logs on to S and tries the same thing towards C. He finds that 1496 bytes is fine as before, and 1501 bytes and above generate "Fragmentation needed" as expected. However, there is a 4-byte "black hole" (1497-1500 bytes) where no responses are seen at all.
The reason is that the echo request actually makes it all the way from S to C, but the echo reply is too big for the return path. A "Fragmentation needed" message is therefore sent to C, but that information never reaches the originator of the ping on S, hence the apparent black hole (the issue is specific to ping, though; real application traffic gets through just fine).
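The black hole Russell describes can be modeled with a toy function (a sketch only; `ping_from_s` and the return strings are made up for illustration, using the MTUs from the example above):

```python
MTU_S_TO_C = 1500   # PHP: the packet travels unlabeled toward C
MTU_C_TO_S = 1496   # one MPLS label is pushed toward S

def ping_from_s(size: int) -> str:
    """What the administrator on S observes for a DF-set ping of `size` bytes."""
    if size > MTU_S_TO_C:
        # The echo request is too big; "Fragmentation needed" comes back to S
        return "frag-needed"
    if size > MTU_C_TO_S:
        # The echo request reaches C, but the echo reply is dropped on the
        # return path; "Fragmentation needed" goes to C, and S sees silence
        return "timeout"
    return "reply"

for size in (1496, 1498, 1500, 1501):
    print(size, ping_from_s(size))
```

Sizes 1497 through 1500 land in the "timeout" branch, which is exactly the 4-byte black hole.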
This caused a fair amount of head scratching before we got to the bottom of it.
#1 - You don't need the FW (or NAT or anything else) to generate the behavior Russell is describing. A simple IP/MPLS network with asymmetric MTUs (which you always get with PHP) is enough.
#2 - The apparent black hole occurs because the "Fragmentation needed" message is sent to the host sending the ICMP echo reply (not the echo request). That host cannot do anything; there's no "I have a problem" ICMP message it could send to the pinging host.
#3 - TCP traffic works just fine because the "Fragmentation needed" message is always sent to the host sending the oversized TCP segment (which then resends the data in smaller segments).
ICMP within MPLS VPNs adds a whole new layer of interesting here too: the router generating the error often doesn't have a route to the source, so it sends the error forward along the LSP to the destination...
If you did this, the only drawback I see is that if you enable LDP on CE (or have problems with LDP), the change would propagate beyond R1 to its neighbours. Am I missing something else?
If this is a trivial question answered elsewhere, I'd appreciate a pointer in the right direction.
Yes, we run MPLS over DSL here. It even works as long as you are aware of the caveats ;)
S---CE1---P1===P2===P3---CE2---C
If P1===P2 has an MTU of 2000, but P2===P3 has an MTU of 1500, with PHP/Implicit Null you will be able to get 1500 bytes from S to C, but only 1496 from C to S.
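Under the assumptions of this comment (4-byte label, PHP so the last core hop carries an unlabeled packet), the per-direction path MTU can be sketched as (`lsp_path_mtu` is a made-up helper, not a real command):

```python
LABEL = 4   # one MPLS label stack entry, in bytes

def lsp_path_mtu(core_links: list[int], php: bool = True) -> int:
    """Max IP packet across the LSP: the packet is labeled on every core
    link except (with PHP) the last one before the egress LSR."""
    limits = []
    for i, mtu in enumerate(core_links):
        last_hop = (i == len(core_links) - 1)
        limits.append(mtu if (php and last_hop) else mtu - LABEL)
    return min(limits)

# S -> C traverses P1===P2 (2000), then P2===P3 (1500, unlabeled after PHP)
print(lsp_path_mtu([2000, 1500]))   # 1500
# C -> S traverses P3===P2 (1500, labeled), then P2===P1 (2000)
print(lsp_path_mtu([1500, 2000]))   # 1496
```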
Asymmetric MTU is not a problem to be solved; it is an issue to be understood. If it is considered in the design, there is no reason things won't behave in a predictable manner.
Also, the end-to-end LSP MTU signaling is largely irrelevant in most cases; interim LSRs send ICMP error messages with the original label stack, so they eventually arrive at the intended recipient.
Knowing the end-to-end MTU would be quite important, though, if the interim LSRs had no data-plane IP capability (plus it would allow you to push error processing to the network edges, which is always a good idea).
Well, there is a Huawei-authored standards-track draft at the moment for end-to-end MTU signaling on RFC 3107 signalled LSPs - http://tools.ietf.org/html/draft-zeng-idr-bgp-mtu-extension-00