Failure Analysis: An Interesting Way to Break CAPWAP

I recently stumbled into what I think is a very interesting failure scenario with a Cisco Wireless solution. This was a traditional controller-based solution that used CAPWAP for both the data and control planes. The symptoms were fairly consistent and strange.

Symptoms:

  • When issues are occurring, all uploads reduce to about 1.5Mb/s
  • Installing a new AP seems to solve the issue
  • Issue re-occurs in a few minutes
  • Issues only occur for one specific site
  • Wireless is configured consistently across 5 sites
  • RF is not an issue

Topology:

When I got involved with this, a few people had reviewed the configuration and TAC had been involved for some time. While on-site, I took a look at RF and channel utilization (expecting it to be ugly, since I knew the site was heavily dependent on 2.4GHz). My first order of business was to spin up a test AP in its own group and advertise a test SSID on a 5GHz channel. Upon doing so, both iPerf and Speedtest showed more than 50Mb/s. My initial thought was that the density needed to be increased and the radios tweaked to get more clients on 5GHz. However, a few minutes into my testing, my upload also dropped to the same degraded speed (<1.5Mb/s).

My next step was to configure FlexConnect on the test AP and to drop the traffic into a local VLAN. This should remove anything to do with CAPWAP as a possible culprit. After doing so, testing showed that there were no issues. Even after an extended period of time, we saw no performance degradation. This reaffirmed that there were no issues with RF and that we were likely looking at something impacting CAPWAP throughput.

Having a very busy schedule, I asked the customer to engage the Metro-E service provider and see what they could tell us about the CAPWAP traffic (UDP/5246-5247) for that location. Since there was no additional overlay protocol (DMVPN, MACsec, etc.), I thought it would be interesting to see if the provider was seeing anything abnormal.

About a day later, I received a very interesting email. The service provider had analyzed the CAPWAP traffic from the AP to the WLC and from the WLC to the AP. The traffic from the WLC to the AP seemed normal. However, the traffic from the AP to the WLC was being sent to a MAC address that was NOT known in the service provider network. I also found that the traffic was being rate-limited to 2Mb/s by a BUM (Broadcast, Unknown unicast, Multicast) policy.

It was at this point that I knew we could solve the problem. With this information in mind, I proceeded to do a packet walk from the AP to the controller. Here is what I found.


For discussion, I will attach pseudo MAC addresses to this topology.

So when we apply packet and frame forwarding logic, we have the following:

  • AP uses its local L3 switch as a GW, which routes the packet for 3.3.3.3
  • 3.3.3.3 (WLC) sits in VLAN200, which is directly connected at L3, so the switch connected to the AP routes the packet onto VLAN200.
  • The switch ARPs for 3.3.3.3 on VLAN200. The response is received by the switch on the left. The ARP response populates the MAC address tables from right to left.
  • AP forwards CAPWAP to WLC using the following L3 path (AP->LeftSwitch->WLC)
  • WLC forwards CAPWAP to AP using the following L3 path (WLC->RightSwitch->AP)
  • The only time the Metro-E service sees MAC BA as a source is when the WLC responds to an ARP request
  • From a Metro-E perspective, AP to WLC communication uses the following MAC addresses: SRC:AC, DST:BA
  • From a Metro-E perspective, WLC to AP communication uses the following MAC addresses: SRC:BC, DST:AD (due to the IGP adjacency)
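
To make the packet walk concrete, here is a minimal Python sketch of what the provider's bridge can learn in steady state. It assumes the Metro-E service behaves like a plain learning switch (source MACs are learned, unknown destinations are flooded) and uses the pseudo MAC addresses from the list above. The third frame, with LeftSwitch sourcing traffic from AD, is my own assumption to represent IGP or other routed traffic; it is not taken from a capture.

```python
# Frames the Metro-E service carries in steady state, as (src, dst) pseudo MACs.
STEADY_STATE_FRAMES = [
    ("AC", "BA"),  # AP -> WLC CAPWAP, routed by LeftSwitch onto VLAN200
    ("BC", "AD"),  # WLC -> AP CAPWAP, routed by RightSwitch toward LeftSwitch
    ("AD", "BC"),  # assumption: other routed traffic sourced by LeftSwitch
]


def learned_sources(frames):
    """A learning switch only learns a MAC when it sees it as a source."""
    return {src for src, _dst in frames}


def unknown_destinations(frames):
    """Destinations that were never seen as a source, so they get flooded."""
    known = learned_sources(frames)
    return sorted({dst for _src, dst in frames if dst not in known})


if __name__ == "__main__":
    print("learned:", sorted(learned_sources(STEADY_STATE_FRAMES)))
    print("flooded as unknown unicast:", unknown_destinations(STEADY_STATE_FRAMES))
```

In this model BA is the only destination that is never seen as a source, which matches the provider's report: the WLC's MAC only shows up when it answers an ARP request.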

The Workaround

We did a temporary workaround by creating a static route on the LeftSwitch for 3.3.3.3/32. Setting the next hop to 4.4.4.1 forced outbound frames toward a MAC address (BC) that wasn't being flushed out of the Metro-E provider's tables. Ultimately, the goal is to prune VLAN200 out of the remote location and remove the static route.
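
The reason the /32 works is longest-prefix match: a more specific static route wins over the connected VLAN200 subnet. Here is a small Python sketch of that lookup using the standard `ipaddress` module. The /24 mask on VLAN200 and the route labels are assumptions for illustration only; the actual subnet mask is not something I have listed in this post.

```python
import ipaddress


def best_route(routes, destination):
    """Return (prefix, next_hop) for the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    network, next_hop = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), next_hop


# Assumption: VLAN200 is a /24. Only 3.3.3.3 and 4.4.4.1 come from the post.
ROUTES_BEFORE = [("3.3.3.0/24", "connected (VLAN200)")]
ROUTES_AFTER = ROUTES_BEFORE + [("3.3.3.3/32", "4.4.4.1")]

if __name__ == "__main__":
    print(best_route(ROUTES_BEFORE, "3.3.3.3"))
    print(best_route(ROUTES_AFTER, "3.3.3.3"))
```

Before the workaround the lookup resolves to the connected VLAN200 interface, so the frame is addressed to BA. After it, the lookup resolves to 4.4.4.1, so the frame is addressed to BC instead.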

Analysis and Conclusion

ARP entries often default to four hours. MAC table entries often age out after 5 minutes. Booting up a new AP forced an ARP entry between the LeftSwitch's VLAN200 and the WLC. This entry would have remained for four hours. The ARP response that created this entry would have populated the Metro-E service's MAC tables, but that entry timed out after a given period of time. Based on testing, I would guess this was about 5 minutes (which is a common default). Once the entry timed out, the AP to WLC traffic would become unknown unicast and be rate-limited by the provider's BUM policy.
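
The timer mismatch can be put into rough numbers. The sketch below is a simplified model, not a measurement: it assumes the four-hour ARP timeout and five-minute MAC aging mentioned above, and it assumes the WLC's ARP reply is the only frame that ever refreshes BA in the provider's table.

```python
ARP_TIMEOUT_S = 4 * 60 * 60  # assumed ARP cache timeout on LeftSwitch
MAC_AGING_S = 5 * 60         # assumed MAC aging time in the Metro-E service


def wlc_mac_known(seconds_since_arp_reply, mac_aging_s=MAC_AGING_S):
    """True while the provider still holds the MAC learned from the ARP reply."""
    return seconds_since_arp_reply < mac_aging_s


def flooded_fraction(arp_timeout_s, mac_aging_s):
    """Share of each ARP lifetime during which AP -> WLC frames are unknown unicast."""
    flooded_s = max(arp_timeout_s - mac_aging_s, 0)
    return flooded_s / arp_timeout_s


if __name__ == "__main__":
    for t in (60, 299, 301, 3600):
        state = "known" if wlc_mac_known(t) else "unknown (BUM policed)"
        print(f"t={t:>5}s after ARP reply: WLC MAC is {state}")
    pct = flooded_fraction(ARP_TIMEOUT_S, MAC_AGING_S) * 100
    print(f"policed for roughly {pct:.1f}% of each ARP lifetime")
```

Under those assumptions the upload path is only healthy for the first five minutes after each ARP exchange, which lines up with the "works for a few minutes after a new AP is installed" symptom.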

If you have any feedback or comments, please share below.

Disclaimer: This article includes the independent thoughts, opinions, commentary or technical detail of Paul Stewart. This may or may not reflect the position of past, present or future employers.

About Paul Stewart, CCIE 26009 (Security)

Paul is a Network and Security Engineer, Trainer and Blogger who enjoys understanding how things really work. With over 15 years of experience in the technology industry, Paul has helped many organizations build, maintain and secure their networks and systems.