dynamic infrastructure | Chris Miller's Blog

I read this post from Lori MacVittie a long time ago, and since I understood and agreed with everything she said, I bookmarked the page so I could always reference it. Unfortunately, I sort of forgot why it was important.

I’m currently the on-call network engineer for our environment so when there’s an issue, I find out fairly quickly. We currently outsource some of our monitoring to a vendor. Essentially, they package Nagios into an appliance on our private network and connect to us through our firewall via some port forwarding magic. This arrangement allows us to leverage a great monitoring system without having to spend tons of time configuring and managing checks. Since this monitoring is so critical, the vendor is always quick to let me know when they’re unable to manage our device as this can indicate a critical network issue.

Last night, they called at about 9:20 PM to let me know they were unable to reach our device. I attempted to VPN into our network, but the logon page wasn’t loading. Unfortunately, the VPN device has been having sporadic issues like that, so I couldn’t tie that directly to an outage. While repeatedly attempting to load the page, I began pinging our four ISP-facing routers. Three of the routers replied and one didn’t. The one that didn’t was the one through which our vendor managed the probe. That link is also one of the two used by our VPN device. I quickly assumed this ISP or router was having a problem, but then I realized other devices on that subnet did respond, so that likely wasn’t the case.

After finally being able to log in to our VPN, I jumped onto our F5 Link Controller (an ISP link load balancer) to begin troubleshooting. The motivation for using a link load balancer is that it allows you to multi-home your sites across multiple ISPs. When it detects a problem with a single ISP, it quits replying to queries with addresses from that ISP’s subnet. Since it was still giving out these addresses, I knew the link wasn’t the problem. After logging onto the Link Controller, I began reviewing recent log messages. The first one that jumped out at me was that the device had gone from active to standby and back to active.
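
To make the "quits replying with addresses from that subnet" behavior concrete, here is a minimal Python sketch. It is not how the Link Controller is implemented; the `IspLink` class, the link names, and the documentation-range addresses are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class IspLink:
    name: str      # hypothetical label for the ISP link
    address: str   # the site's public address on that ISP's subnet (made up)
    healthy: bool  # result of the most recent link health check


def answers_for(links):
    """Return only the addresses that sit on links currently marked healthy."""
    return [link.address for link in links if link.healthy]


links = [
    IspLink("isp-a", "192.0.2.10", True),
    IspLink("isp-b", "198.51.100.10", False),  # this ISP is considered down
    IspLink("isp-c", "203.0.113.10", True),
]

# Addresses on the failed link are simply left out of the reply.
print(answers_for(links))
```

This is why a link that is still being handed out in replies is a decent hint that the load balancer itself considers that ISP healthy.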

Awesome! A failover had occurred. That explained the downtime. Unfortunately, since we were still getting alerts of certain sites not responding, there was still a problem. I quickly confirmed that the Link Controller was indeed active. I then connected to the standby unit and confirmed that it was standby. When reviewing its logs though, I noticed tons of messages similar to the following: “Packet rejected remote IP x.x.x.x port x local IP x.x.x.x port x proto x: Port closed.” After checking the timestamps on the logs, I quickly noticed they were still happening.
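
If you want to see which local addresses the standby unit is rejecting traffic for, a small script can tally them. This is only a sketch: the regular expression is built from the message quoted above, the exact field formats are an assumption, and the sample lines are invented for illustration rather than copied from a real log.

```python
import re
from collections import Counter

# Pattern derived from the quoted message; the field formats are assumed.
REJECT = re.compile(
    r"Packet rejected remote IP (?P<remote>\S+) port (?P<rport>\d+) "
    r"local IP (?P<local>\S+) port (?P<lport>\d+) proto (?P<proto>\w+): Port closed"
)


def rejected_by_local_ip(lines):
    """Count 'Port closed' rejections per local (destination) IP."""
    counts = Counter()
    for line in lines:
        match = REJECT.search(line)
        if match:
            counts[match.group("local")] += 1
    return counts


# Invented sample lines that follow the quoted template.
sample = [
    "Packet rejected remote IP 198.51.100.7 port 51515 local IP 192.0.2.10 port 443 proto TCP: Port closed.",
    "Packet rejected remote IP 198.51.100.8 port 40000 local IP 192.0.2.10 port 80 proto TCP: Port closed.",
    "Packet rejected remote IP 203.0.113.5 port 53000 local IP 192.0.2.11 port 443 proto TCP: Port closed.",
    "an unrelated line that should be ignored",
]

for local_ip, count in rejected_by_local_ip(sample).most_common():
    print(local_ip, count)
```

A steady stream of rejections for addresses that the active unit should own is the clue that something upstream is still sending to the wrong box.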

Oh crap! After staring blankly at my laptop screen for a while trying to figure out why my standby load balancer was receiving traffic, I decided to go check the persistence table where session status (mapping of user to pool and ISP) is stored. I quickly noticed that the normally empty table had plenty of data. Since this wasn’t the active unit anymore, I cleared the persistence table and made sure it didn’t become populated again. It didn’t! Unfortunately, the “Port closed” log entries continued to occur. My standby load balancer, which shouldn’t be receiving any traffic at all, was still getting plenty of it.

That’s when it finally hit me! ARP had gotten me again. When a user tries to connect to an IP like 1.1.1.2, the packet must go through a router like 1.1.1.1. Once the router (1.1.1.1) receives the packet, it needs the MAC address of 1.1.1.2 so it knows where to send the data. It finds this by sending an ARP broadcast. If 1.1.1.2 happens to be behind a load balancer, the load balancer will let everyone know that it’s the owner of that IP. If you have an active and a standby load balancer though, only one should reply saying it’s the owner. When an active load balancer fails, the standby one should begin responding to ARP requests. When the original active device comes back up and takes over again, it resumes responding.
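
For anyone who has never looked inside one, here is roughly what that ARP exchange carries. The sketch below packs the 28-byte ARP payload for IPv4 over Ethernet with Python’s `struct` module. The MAC addresses are hypothetical, and nothing is actually sent on the wire.

```python
import socket
import struct

ARP_REQUEST = 1  # "who has this IP?"
ARP_REPLY = 2    # "I do, and here is my MAC"


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp(oper, sender_mac, sender_ip, target_mac, target_ip):
    """Pack an ARP payload for IPv4 over Ethernet (28 bytes)."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6,        # hardware address length
        4,        # protocol address length
        oper,     # 1 = request, 2 = reply
        mac_bytes(sender_mac), socket.inet_aton(sender_ip),
        mac_bytes(target_mac), socket.inet_aton(target_ip),
    )


ROUTER_MAC = "02:00:00:00:00:01"     # hypothetical
ACTIVE_LB_MAC = "02:00:00:00:00:a1"  # hypothetical

# The router asks who owns 1.1.1.2; the target MAC is unknown, so it is zeroed.
request = build_arp(ARP_REQUEST, ROUTER_MAC, "1.1.1.1", "00:00:00:00:00:00", "1.1.1.2")

# Only the active load balancer answers, advertising its own MAC for 1.1.1.2.
reply = build_arp(ARP_REPLY, ACTIVE_LB_MAC, "1.1.1.2", ROUTER_MAC, "1.1.1.1")

print(len(request), request.hex())
print(len(reply), reply.hex())
```

The important field is the sender MAC in the reply: that is the value the router remembers for 1.1.1.2.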

Unfortunately, much to the dismay of a dynamic infrastructure, we have a very, very evil piece of technology, THE ARP CACHE. Once a network device uses ARP to locate a device, it caches that location so it can save time and not have to ask again. From what Google tells me, a Cisco router caches an ARP entry for 5 minutes (the IOS default is documented as 4 hours, so the value configured on your own routers is worth checking). Let’s consider why that’s not good for me!

1. A client makes a request for 1.1.1.2.

2. The Cisco router, 1.1.1.1, sends an ARP request for 1.1.1.2.

3. Since my load balancer owns 1.1.1.2, it responds.

4. The Cisco router saves the location of 1.1.1.2 in its ARP table for 300 seconds.

5. My active load balancer fails and the standby load balancer becomes active.

6. The Cisco router keeps trying to send data to 1.1.1.2. Since it still has the old MAC address in its ARP table, it sends directly to the formerly active load balancer.

7. Since that load balancer isn’t active anymore, it can’t process the transaction.
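
The seven steps above can be replayed with a toy model. This is a sketch under some loud assumptions: the MAC addresses are made up, the 300-second timeout is the figure used in the steps, and the model ignores anything that would refresh the cache early (such as a gratuitous ARP from the new active unit).

```python
ACTIVE_MAC = "02:00:00:00:00:a1"   # hypothetical MAC of the original active unit
STANDBY_MAC = "02:00:00:00:00:b2"  # hypothetical MAC of the original standby unit
VIP = "1.1.1.2"


class Router:
    """Toy router that caches ARP answers for a fixed number of seconds."""

    def __init__(self, arp_timeout):
        self.arp_timeout = arp_timeout
        self.cache = {}  # ip -> (mac, time the entry was learned)

    def resolve(self, ip, now, current_owner_mac):
        entry = self.cache.get(ip)
        if entry is not None and now - entry[1] < self.arp_timeout:
            return entry[0]  # cache hit: no ARP request is sent
        # Cache miss or expired entry: ask again and store the answer.
        self.cache[ip] = (current_owner_mac, now)
        return current_owner_mac


def send(router, now, owner_mac):
    """Return 'delivered' if the frame reaches the unit that currently owns the VIP."""
    dest = router.resolve(VIP, now, owner_mac)
    return "delivered" if dest == owner_mac else "port closed"


router = Router(arp_timeout=300)
timeline = [
    (0, ACTIVE_MAC),     # steps 1-4: the router learns the active unit's MAC
    (60, STANDBY_MAC),   # steps 5-7: failover has happened, the cache is stale
    (299, STANDBY_MAC),  # still stale one second before the entry expires
    (300, STANDBY_MAC),  # entry expired, so the router asks again
]
results = [(t, send(router, t, owner)) for t, owner in timeline]
for t, outcome in results:
    print(f"t={t:>3}s {outcome}")
```

In this model, every request between the failover and the cache expiry lands on the wrong unit, which matches the “Port closed” entries piling up on the standby.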

As you can see, having an ARP timeout of 300 seconds severely limits my options. To fix the immediate problem, I simply jumped onto our routers and issued a “clear arp” command, and the problem went away. To keep it from happening again, I could either disable ARP caching, lower the ARP cache timeout value, or best yet, use MAC masquerading, where the active and standby units share one MAC address for the traffic they own, so a cached ARP entry stays valid after a failover. See, F5 does a great job of supporting dynamic infrastructures and recognized how critical ARP is! Unfortunately, it’s tough to see the value in performing extra configuration tasks such as enabling MAC masquerading if you haven’t experienced the evil that is ARP caching!
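
Here is a rough way to compare those options. It is a deliberate simplification: it only models the router’s ARP cache, it assumes the worst case where the entry was learned just before the failover, and the 30-second value is simply an example I picked.

```python
def worst_case_outage(arp_timeout, mac_masquerade=False):
    """Seconds the router may keep sending to the old unit after a failover.

    Assumes the cache entry was refreshed just before the failover and that
    nothing else (for example a gratuitous ARP) updates the router early.
    """
    if mac_masquerade:
        # Both units answer with the same shared MAC, so the cached entry
        # still points at whichever unit is active.
        return 0
    return arp_timeout


options = {
    "timeout of 300 s": worst_case_outage(300),
    "lowered timeout (30 s, example value)": worst_case_outage(30),
    "ARP caching disabled": worst_case_outage(0),
    "MAC masquerading": worst_case_outage(300, mac_masquerade=True),
}

for name, seconds in options.items():
    print(f"{name}: up to {seconds} s of misdirected traffic")
```

Disabling caching and MAC masquerading both bring the number to zero in this model, but only masquerading does it without making the router ARP for every packet.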

As the IT industry continues to move toward dynamic infrastructures, it’s critical that we consider technologies such as ARP and how they affect us. Just because you have an active and a standby unit doesn’t mean you’re redundant!


Want a Dynamic Infrastructure? Don’t Forget About ARP!
