Nothing seems to be a better teacher than working through a painful issue. I recently had to work through performance problems on a network that had seen many recent changes: new servers, a new cloud-based anti-spam solution, and a known issue with one of the ISP links. Throughout this process I used my favorite troubleshooting tool, Wireshark. I immediately saw bandwidth utilization issues that were likely accounting for some of the problems, but that wasn't all.
To give some specifics on this particular scenario: inbound and outbound email seemed to work fine without attachments, and only very small attachments could be sent or received. One of the Internet connections was down, which was certainly causing overutilization of the remaining connection. Because emails were not being sent successfully, both the server and the users were resending messages, further adding to the bandwidth problem. I could also see inbound messages flow for ten minutes until the max session timer in Exchange 2010 ended the connection.
This is my favorite type of troubleshooting task: a reproducible performance issue with a common, well-understood protocol, SMTP. My first step was to bring up Wireshark and take a quick peek at the traffic. As suspected, there was more traffic than the customer's connection could handle. I quickly resolved the issue with the failed Internet connection. The now-repaired connection was the primary path for web-based "user" traffic; the other was the backup for that traffic and the primary path for email. I had naively hoped this would be the only issue, but email still was not flowing as I expected.
So now there was much more bandwidth available to SMTP, but mail was still flowing very slowly. Looking at inbound streams with large attachments, I would see only a few Kb/s of transfer on a link with 4+ Mb/s available. What could the problem be? Eventually, I started noticing some very large deltas between packets in the inbound stream. Very strange.
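To make the "delta" idea concrete, here is a toy reconstruction of what Wireshark's time-delta column shows: the gap between consecutive packets in one TCP stream. The timestamps below are hypothetical, loosely modeled on the kind of pattern I was seeing.

```python
# Hypothetical arrival times (seconds) for packets in a single TCP stream.
arrival_times = [0.000, 0.002, 0.003, 1.204, 1.205, 2.406]

# Delta = time gap between each packet and the one before it,
# i.e. what Wireshark's delta column displays.
deltas = [b - a for a, b in zip(arrival_times, arrival_times[1:])]

# On a low-latency link, gaps well over half a second look like a sender
# stalling on a retransmission timer rather than normal pacing.
suspicious = [d for d in deltas if d > 0.5]
print(suspicious)
```

Two roughly 1.2-second gaps in a stream that should be pacing itself in milliseconds is exactly the sort of pattern that made me suspicious.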
This is a good point to think about how TCP works, since our SMTP traffic rides on top of TCP. With TCP, each octet of data is assigned a sequence number. Sequence numbers increase incrementally, so the protocol can determine when data is missing. An IP packet carries TCP as its payload, and each sequence number represents one octet of TCP data (SMTP in our case). This is the basis of the reliability TCP provides. If the receiver sees sequence numbers that have jumped past data it has not yet received, it can safely assume a segment was lost. In that case, it sends duplicate ACKs carrying the sequence number of the first missing octet, and once the sender sees three duplicate ACKs it resends the missing segment immediately. This is known as a fast retransmit. I was NOT seeing any traffic like this in my capture.
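A minimal sketch (not real TCP, just the bookkeeping) of how a receiver spots a sequence-number gap and generates those duplicate ACKs. The segment sizes and sequence numbers are made up for illustration.

```python
def receive(segments):
    """segments: list of (seq, length) tuples in arrival order.
    Returns the ACK number sent after each segment: always the next
    octet the receiver is still waiting for."""
    expected = 0
    acks = []
    for seq, length in segments:
        if seq == expected:
            expected = seq + length  # in order: advance the expected octet
        # In order or not, ACK the first octet we have not yet received.
        acks.append(expected)
    return acks

# The segment starting at octet 1460 is lost in transit; the three
# later segments arrive and each triggers a duplicate ACK for 1460.
acks = receive([(0, 1460), (2920, 1460), (4380, 1460), (5840, 1460)])
print(acks)  # [1460, 1460, 1460, 1460]
```

The three duplicate ACKs after the first one are the signal that lets the sender fast-retransmit the missing segment without waiting on a timer. Seeing none of these in the capture told me loss was being handled the slow way.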
The other thing about TCP is that the sender expects an acknowledgement before the maximum window of unacknowledged data is reached. If the sender does not receive that acknowledgement before the TCP retransmission timeout expires, the data is resent. In that case, performance is slow because we are dependent on timers for retransmission, and those timers are an eternity compared to the typical turnaround times on modern networks. Could this be the problem? Could something along the way be dropping the inbound data, or my outbound acknowledgements? Certainly this could be the cause, and it would explain my poor performance.
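Some back-of-the-envelope arithmetic (assumed numbers, not measurements from this network) shows just how badly timer-driven retransmission hurts. Suppose the sender stalls for a full retransmission timeout on every window of data instead of being paced by fast ACK round trips:

```python
window_bytes = 16 * 1024  # a modest 16 KB of in-flight data (assumption)
rtt = 0.020               # normal ACK turnaround, seconds (assumption)
rto = 1.0                 # a typical conservative minimum RTO, seconds

healthy_throughput = window_bytes / rtt          # ACK-paced transfer
stalled_throughput = window_bytes / (rtt + rto)  # one RTO stall per window

print(f"healthy: {healthy_throughput / 1024:.0f} KB/s")  # 800 KB/s
print(f"stalled: {stalled_throughput / 1024:.0f} KB/s")  # 16 KB/s
```

A fifty-fold collapse, purely from waiting on timers, lines up with the pattern of a few Kb/s trickling over a link with megabits to spare.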
At this point, I decided to jump out to the IOS Firewall that sat in front of these servers. I typically prefer an ASA, but restrictions kept this particular network on a router, and cost concerns ruled out both a router and a firewall appliance. I jumped into the CLI, added the command "ip inspect log drop-pkt", and immediately started seeing drops. Remembering that IOS Firewall didn't play well with out-of-order packets, I fired up my browser and quickly found that 12.4(11)T was the first revision to support out-of-order TCP; prior to that, out-of-order segments were simply dropped. After upgrading to a recent 12.4T IOS and rebooting, the traffic looked far different. Utilization and performance were where I expected them to be. Additionally, test messages that had previously flowed for ten minutes before hitting the Exchange 2010 max session timer were now received in 6 to 10 seconds.
So in conclusion, I think there is great value in understanding TCP traffic and what looks normal (and abnormal) on the wire. I also think it is important to make incremental changes in a network; too many changes at once become distractions when troubleshooting. I had to determine whether all of the resending of data by users was actually causing the problems or was merely a result of them. Was the new cloud-based anti-spam solution part of the problem? When we make too many changes too quickly, we find ourselves chasing red herrings instead of root causes. In the end, I think we uncovered a problem that had existed for a long time; the shift in traffic just made it more relevant. In any case, performance was much better after fully troubleshooting the problem, and I feel far more confident in a solution when I have troubleshot it to this degree.