As I mentioned in a previous post, I have been studying the materials for the Cisco CCDE. One topic that has come up a time or two is MTU. MTU, or maximum transmission unit, is the maximum size a chunk of data can be for a given interface. In this article, we are speaking specifically of IP MTU, an important distinction that I will clarify later. Network design should incorporate a clear understanding of MTU challenges, and operators need to know what to look for when a network is not properly built and configured.
A simplistic example of a problematic design is when there is a link with a smaller MTU somewhere between two endpoints capable of creating larger packets (see the image below). While this environment may work fine, understanding the interaction required between the hosts and the network devices is very important to network design.
A few years ago I wrote an article that outlined some of the behavior that can be witnessed when there are MTU discovery issues. Let’s quickly recount what path MTU discovery (PMTU-D) is, how it works, how it fails and some logic around appropriate design.
General Facts Around IP, MTU and MTU Discovery (PMTU-D)
- IPv4 packets have a flag in the header known as (DF)—don’t fragment
- When the DF flag is set (1), routers will not fragment the packet even if it is too large to send out the egress interface
- When a packet is too large to send and has the DF set (1), the current router should notify the originating host via ICMP Type 3 Code 4
- The process of an intermediary node (router) notifying an end host via ICMP Type 3 Code 4 is known as Path MTU Discovery (PMTU-D)
- If the DF flag is not set (0), an oversized packet may be fragmented when the egress interface cannot carry the current packet size
- Greater bandwidth interfaces do not necessarily have larger IP MTUs. Likewise, lower bandwidth interfaces may not have smaller MTUs
- Generally speaking, TCP traffic generates IP packets with DF set (1), while UDP traffic typically has DF unset (0); mileage may vary with other IP payload types
- Tunnels and overlays either reduce the overlay MTU (user/host generated packets), increase the overall packet size, or induce additional fragmentation; the latter is typically seen when packets are originated by a host at full MTU
- Tunnels may copy the DF field from the inner header to the transit IP header for continuity of expected behavior
- TCP can influence the overall packet size originated at the hosts by leveraging maximum segment size (MSS) in the initial handshake
- IPv6 routers never fragment packets in transit; only the originating host may fragment (using the Fragment extension header), so PMTU-D is effectively mandatory for IPv6
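The facts above boil down to a simple forwarding decision at each hop. This is a minimal sketch (not actual router code) of that decision; the function name and return strings are illustrative only:

```python
# Sketch of the decision a router makes when a packet must leave an
# interface whose MTU may be smaller than the packet itself.
def egress_decision(packet_len: int, df_flag: bool, egress_mtu: int) -> str:
    """Return what the router does with the packet."""
    if packet_len <= egress_mtu:
        return "forward"
    if df_flag:
        # Packet too large and DF set: drop it and signal the source
        # with ICMP Type 3 Code 4 so PMTU-D can do its job.
        return f"drop + send ICMP Type 3 Code 4 (next-hop MTU {egress_mtu})"
    # DF not set: the router may fragment and keep forwarding.
    return "fragment and forward"

print(egress_decision(1400, True, 1492))   # forward
print(egress_decision(1500, False, 1492))  # fragment and forward
print(egress_decision(1500, True, 1492))   # drop + ICMP back to the source
```

PMTU-D depends entirely on that last branch: if the ICMP message is dropped anywhere on the return path, the sender never learns to shrink its packets.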
How can this process fail when a packet larger than the MTU is sent with DF set (1)?
- If a router fails to send the ICMP message to the originating host, the originating host will continue to forward packets that are too large to be sent across the smaller MTU link. In Cisco IOS this could be because of “no ip unreachables” on the interface facing toward the host
- A firewall between the router and the host could block the ICMP message
- The originating host could ignore or block the received ICMP message instead of presenting it to the IP stack
- NAT or PAT could drop the message because it doesn’t understand the message from its context of state
- A network overlay implementation may not originate the signalling from a dropped packet in the underlay
- Any of these failures can be in either direction of flow
Common Symptoms of PMTU-D Failure
- TCP sessions connect properly (the 3-way handshake completes) and hosts may still respond to ping
- Applications are slow, exhibit generally poor performance, or become unusable; some may even crash
- Some applications and/or hosts may work normally
- TCP retransmits may be seen in a packet sniffer (on one or both sides of the issue)
Examples of Failure Symptoms
Let me share an example of what often happens. At one point I installed a lot of routers on PPPoE connections. PPPoE reduces the IP MTU by 8 Bytes (typically to 1492). One of my first experiences with this exhibited a behavior that was a little strange. After switching their connection to PPPoE, the customer complained that they could not get to some websites. These websites would only partially load but would consistently be “pingable”.
In these cases, it was very likely that TCP established its three-way handshake with a host capable of sending 1500 Byte IP packets. The three-way handshake consisted of tiny packets and succeeded normally. Incoming data was being dropped before it left the service provider toward my PPPoE customer. The provider may or may not have been properly responding with ICMP Type 3 Code 4 messages. It is very possible that the firewall protecting the remote website was denying the ICMP in a policy. It is also possible that the messages were dropped in a NAT process near the web server. Either way, both hosts were left with a stuck TCP session and would simply retransmit.
TCP MSS as a solution
On a large scale, a PMTU-D issue can have a material effect on the performance of one or both hosts. As design engineers, we should attempt to signal this maximum packet size as early as possible. A good solution to the above scenario is to signal in the TCP handshake that the maximum packet size is 1492. This is done by adjusting the maximum segment size (MSS) down to 1452 with the “ip tcp adjust-mss 1452” interface command in IOS.
The reason this needs to be 1452 is that this value is from the perspective of TCP data. TCP maximum segment size does NOT include TCP or IP header information. Those are 20 bytes each. Therefore 1492 - 20 (IP) - 20 (TCP) = 1452.
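The arithmetic above generalizes to any IP MTU. A quick sketch (function name is mine, not from any library):

```python
def tcp_mss_for_mtu(ip_mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """TCP MSS counts only TCP payload, so subtract both headers from the IP MTU."""
    return ip_mtu - ip_header - tcp_header

# PPPoE leaves 1492 bytes of IP MTU, so:
print(tcp_mss_for_mtu(1492))  # 1452 -> "ip tcp adjust-mss 1452"
print(tcp_mss_for_mtu(1500))  # 1460, the familiar Ethernet default
```

Note the defaults assume base IPv4 and TCP headers with no options; TCP options such as timestamps consume additional segment space.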
IP MTU vs MTU
Perspective and clarity are also important when discussing MTU. I think most engineers realize that IP MTU is typically 1500 Bytes. In some places, we will see similar values represented as 1514 (typically including ethernet header) or even 1518 (ethernet header and FCS).
When overlays and tunnels are added, we need to design not to exceed the capabilities of the physical network or underlay. It is also prudent to consider how the path of a flow might change based on a network failure. One other area where perspective is important is the PING command itself. I have found that the specified size parameter may be overall packet size or it may be payload size (depending on the utility itself). It is easy to confirm the originating packet size by leveraging Wireshark on the host.
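The ping perspective issue can be shown with simple arithmetic. This sketch assumes a utility whose size flag sets only the ICMP payload (as `ping -s` does on many Unix-like systems); the function name is illustrative:

```python
ICMP_HEADER = 8   # ICMP echo header
IPV4_HEADER = 20  # base IPv4 header, no options

def total_ip_length_from_ping_payload(payload: int) -> int:
    """Overall IPv4 packet size when the ping size flag sets payload only."""
    return payload + ICMP_HEADER + IPV4_HEADER

# A payload of 1472 bytes fills a 1500-byte Ethernet IP MTU exactly:
print(total_ip_length_from_ping_payload(1472))  # 1500
```

So a ping "size" of 1472 on one utility and 1500 on another can describe the same packet on the wire, which is why confirming with Wireshark is worthwhile.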
A couple of days ago, I had a customer ask me what I would recommend for their MTU settings. They felt like their IP MTU was too conservative on some of their tunnel interfaces. After digging in, I found that they were using 1200 Bytes on the tunnel interfaces that were part of the public Internet-facing DMVPN. I also found that they were using 1400 Bytes on tunnel interfaces that traversed their service provider MPLS network. Their DMVPN uses GRE with a tunnel key, and they were leveraging esp-aes 256 and esp-sha-hmac in IPsec transport mode.
Not fully trusting my interpretation of the calculator, I built this configuration in a lab. What I found was that packets originating at 1414 Bytes would create an overall IP packet of 1480 Bytes on the physical network. Knowing that some of the customer links were PPPoE and that the overall maximum MTU was 1492 in some areas, I set a target of <=1492. At this point, one would think the user packet size could be increased to 1426 while maintaining the objective. What I proved was that increasing the user packet from just 1414 to 1415 Bytes increased the overall packet size from 1480 to 1496. ESP actually pads its payload out to the cipher's block size. So this is something to keep in the back of our minds when we design networks.
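The jump from 1480 to 1496 can be reconstructed arithmetically. This sketch assumes transport-mode ESP with AES-CBC (16-byte IV, 16-byte cipher block), a truncated HMAC-SHA1 ICV (12 bytes), and GRE with a tunnel key (4-byte GRE header plus 4-byte key), matching the transform set and GRE configuration described above; the constant and function names are mine:

```python
ESP_SPI_SEQ = 8    # ESP header: SPI + sequence number
AES_IV = 16        # CBC initialization vector
AES_BLOCK = 16     # plaintext is padded up to a multiple of this
ESP_TRAILER = 2    # pad-length + next-header bytes (included in padding calc)
SHA1_ICV = 12      # truncated HMAC-SHA1 integrity check value
GRE_WITH_KEY = 8   # 4-byte GRE header + 4-byte tunnel key
OUTER_IP = 20      # transit IPv4 header

def underlay_ip_length(inner_packet: int) -> int:
    """Total underlay IP length for a GRE-over-ESP (transport mode) packet."""
    plaintext = GRE_WITH_KEY + inner_packet + ESP_TRAILER
    padded = -(-plaintext // AES_BLOCK) * AES_BLOCK  # round up to block size
    return OUTER_IP + ESP_SPI_SEQ + AES_IV + padded + SHA1_ICV

print(underlay_ip_length(1414))  # 1480 - pads out to an even block boundary
print(underlay_ip_length(1415))  # 1496 - one extra byte forces a whole new block
```

This matches the lab observation: 1414-byte inner packets land exactly on an AES block boundary, so a single additional byte costs a full 16-byte block.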
My response is shown below. Notice that I didn't recommend the most technically optimal solution; given the state the customer was currently in, this was the path they would be most likely to execute on.
So here is what I found. If I use IP MTU of 1414 on the Tunnel interface, it keeps the underlay IP Length at 1480. When I increased that to 1415, it increases the underlay IP total length to 1496. After doing some research, I found that ESP overhead depends on the padding that is applied and that is different depending on the original packet size.
I tested your configuration with the following IP/GRE/ESP/IP stack:

crypto ipsec transform-set MYTRANSFORM esp-aes 256 esp-sha-hmac
!
crypto ipsec profile DMVPN-PROF
 set transform-set MYTRANSFORM
 set isakmp-profile ISAKMP
!
interface tunnel 0
 tunnel protection ipsec profile DMVPN-PROF
I think the optimal configuration is to do the following:
- Do not set an IP MTU on the physical interfaces
- Set the IP MTU at or below 1414 on the tunnel interfaces (ip mtu 1414)
- Set the MSS to at least 40 bytes less than the IP MTU (ip tcp adjust-mss 1374)
These should match on all of the tunnel interfaces because traffic paths can move based on routing changes.
My recommendation would be to simply use the settings (1400 Bytes) found currently on the MPLS facing Tunnel interface throughout the network. Increasing the Internet-facing Tunnel interface IP MTU to that should introduce very little risk. In testing, my goal was to keep the overall total IP length at or below 1492 and those settings will accomplish that.
After building this up in the lab, I went to the calculator and confirmed my findings (now with a clearer understanding).
I hope these examples have helped you better understand these technologies. If you have questions or feedback, please comment below.
Other Articles about MTU
- Using Ping Sweep to Find MTU Ceiling
- Recognizing IP MTU Issues
- Understanding Tunnel Path MTU Discovery
- IP Fragmentation and MTU
Update — In the original article I incorrectly stated the typical TCP MSS used for PPPoE. My mistake, thanks for those who reached out to me to correct it.
Disclaimer: This article includes the independent thoughts, opinions, commentary or technical detail of Paul Stewart. This may or may not reflect the position of past, present or future employers.