One of F5’s best resources is its DevCentral community. On DevCentral, users can find tutorials, code samples, podcasts, forums, and many additional resources to help them leverage their investment in F5’s technologies. As an active contributor and reader of DevCentral, I was very pleased to see a tutorial on combining F5’s new built-in Geolocation database with Google’s charting API to make heatmaps to illustrate traffic patterns.

One of F5’s DevCentral employees, Colin Walker, first wrote a tutorial for using iRules to show domestic traffic patterns and then added the ability to illustrate world-wide patterns. By using these iRules, users are able to see a visual representation of how often their site is accessed from different areas of the country.

First, there’s a US view:

Then, there’s a world-view.


In both cases, the logic is relatively straightforward. A user hits your site, which triggers an iRule that increments a counter in a table on your F5 LTM. Based on the client’s source address, the F5 can determine the state from which the request originated and, using the Google Charts API, overlay that data onto a map, with different colors representing different hit counts.
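To make the counting-and-coloring idea concrete, here is a minimal Python sketch of the same logic. It is not Colin's iRule: the color ramp, the function names, and the scaling rule are assumptions chosen for illustration, and the geolocation lookup is assumed to have already produced a state code.

```python
from collections import Counter

# Hypothetical color ramp, lightest to darkest (not taken from the original iRule).
COLORS = ["f0f0f0", "ffcc99", "ff6600", "cc0000"]

def record_hit(table: Counter, state: str) -> None:
    """Increment the hit counter for the state a client resolved to."""
    table[state] += 1

def color_for(count: int, max_count: int) -> str:
    """Map a hit count onto the color ramp, scaled against the busiest state."""
    if count <= 0 or max_count <= 0:
        return COLORS[0]
    steps = len(COLORS) - 1
    # Integer ceiling so that any nonzero count gets at least the first real color.
    index = (count * steps + max_count - 1) // max_count
    return COLORS[min(index, steps)]

def heatmap_colors(table: Counter) -> dict:
    """Return a state -> color mapping for every state seen so far."""
    busiest = max(table.values(), default=0)
    return {state: color_for(count, busiest) for state, count in table.items()}
```

Feeding the resulting state-to-color mapping into a charting service is left out, since that part depends on the chart API being used.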

While this is great data, we still have to find a tangible use for it. Here are some thoughts I’ve had so far:

1. For companies using Akamai, the client IP this iRule uses to determine the source of traffic will actually be Akamai’s server. If you want the true source, you need to change [IP::client_addr] to [HTTP::header "True-Client-IP"]. What’s even cooler is building one heatmap with client_addr and another with True-Client-IP. The maps should look roughly the same since Akamai has such a distributed computing model: far more often than not, a user will hit an Akamai resource in their own state. If the maps differ significantly, you have a problem.

2. Rather than simply using colors to illustrate access, keep a table of HTTP requests per state, look at the amount every 60 seconds, and divide by 60 to get HTTP Reqs/Sec for each state.

3. For E-Commerce sites that use promotions to target certain areas of the country, look at the heatmap before and after a promotion to see whether or not access from that area increased and if so, by how much.

4. If you don’t have any legitimate international customers, the world view map can help you determine how frequently your site is being accessed from outside the US. If it happens often enough, it might be worthwhile to use the built-in Geolocation services to block access for users outside the US.

5. Rather than looking at every single HTTP request, have the rule only look at certain ones – for instance a checkout page so you can compare conversion rate between states.

6. Same concept as number 5, but if you release a new product page, have your rule look at that page so you can determine where it’s most popular.

7. Watch the heatmap throughout the day to see during which hours different locations most frequently hit your site. In an elastic computing situation, this might allow you to borrow resources from systems that might not get hit until later in the day.

8. If you release a new mobile site, look at mobile browser user-agents as well as client ip address to see if mobile users in certain areas of the country are hitting your site more often than others. If you have bandwidth intensive applications, this might help determine where you’d derive the most benefit with another DC, or using a CDN.
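Idea 2 above is simple arithmetic, and a small Python sketch shows it. The 60-second window comes from the text; the function name and the idea of passing in a finished window of counts are assumptions for illustration, not part of any F5 API.

```python
from collections import Counter

WINDOW_SECONDS = 60  # sampling interval suggested in the list above

def requests_per_second(window_counts: Counter,
                        window_seconds: int = WINDOW_SECONDS) -> dict:
    """Convert per-state request counts for one window into requests per second."""
    return {state: count / window_seconds
            for state, count in window_counts.items()}
```

The same function works for ideas 5 and 6 if the counter is only incremented for the pages you care about, such as a checkout or new product page.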

These are just a few thoughts. I’m sure there are many more opportunities to leverage these great technologies. It’s nice to see that F5 recognizes the value of including a Geolocation database with its product, but it’s even more impressive that they’re giving tangible examples of how to use this information to make a site better.

Another challenge is demonstrating these capabilities to the folks who make decisions based on them. In the past, IT has been criticized for finding solutions to problems that didn’t exist yet. New capabilities are being added so frequently that architects really need to look at every solution, determine whether there’s an opportunity, and then send such opportunities to decision-makers.

Some of the most common health checks I see with load balancers are TCP handshakes and TCP half-opens. In a TCP 3-way handshake health check, the load balancer sends a SYN, gets a SYN, ACK from the server, and then sends an ACK back. At this point, it considers the resource up. In a TCP half-open health check, the load balancer sends a SYN, gets a SYN, ACK from the server, and then considers it up. It also sends a RST back to the server so the connection doesn’t stay open, but that’s neither here nor there.
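As a rough illustration of the handshake-style check, here is a minimal Python sketch. The function name and timeout are assumptions. Note that an ordinary connect() always completes the full 3-way handshake; a true half-open check needs raw packet access and is not shown here.

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP 3-way handshake to host:port completes in time."""
    try:
        # create_connection performs the full handshake; leaving the
        # with-block closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name resolution failed.
        return False
```

All this tells you is that something answered on that port, which is exactly the limitation discussed in this post.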

We all know that a much better healthcheck would be something that validates content on the end-systems, like an HTTP GET for a specific page, looking for an HTTP 200 response so we know that the content exists, but that isn’t always necessary. Sometimes, a tcp-half-open or a tcp-handshake might be the best way to go.

If you go with either TCP health check method, you’re simply checking whether something is answering at the specified port on your system. If you’re load balancing HTTP traffic to a server that runs Apache on port 80, a TCP health check to port 80 will usually tell you whether Apache is running, but won’t necessarily tell you that your content is valid. Of course, that’s OK if you trust your ability to validate that on your own. An interesting problem with a TCP check is that you need to know whose health you’re actually checking!

Let’s assume for a moment that the servers to which you’re load balancing traffic are behind a firewall instead of being local to your load balancer. If the firewall is acting as a full-proxy (like an F5 load balancer does) and you simply send a tcp-half-open or tcp-handshake, all you’re doing is checking the health of the firewall. A full proxy will complete a 3 way handshake with the client (in this case the load balancer) before completing a 3-way handshake with the server. By doing this, the box can, to a certain point, keep the client from starting a SYN-Flood. The only way the server sees the traffic is if the 3-way handshake actually completed.

Here’s the traffic flow for sending a tcp 3-way handshake from the load balancer to a system behind a firewall:

1. The load balancer sends a SYN packet to the server.

2. Since the Firewall is a full-proxy, it actually gets the SYN, and sends a SYN, ACK to the load balancer.

3. The load balancer sends an ACK to what it assumes is the system it’s load balancing, but is actually the firewall.

4. Now that the handshake is complete, the firewall completes a 3-way handshake with the server.

5. Now, if the load balancer were to send an HTTP GET for /index.html, it would send it to the firewall and the firewall would send it to the server.

If we use our above flow for a TCP-Half-Open check, here’s what we get.

1. The load balancer sends a SYN to the destination server.

2. The firewall responds with a SYN, ACK.

3. The load balancer has no idea that the firewall, rather than the server, sent the SYN, ACK and therefore considers the connection up and sends a RST to kill the connection.

Another problem is that the firewall will complete a 3-way handshake with the load balancer even if the server isn’t online. While some devices, F5 load balancers for example, allow you to configure them so they don’t even complete a handshake if the systems behind them are down, this is far from the norm.  So, by doing a tcp-check, we aren’t actually checking the destination server’s health at all.

In short, it’s important to understand what systems are between your load balancer and the systems to which you want to send traffic. If you encounter a proxy on the way, you’ll likely want to use a more intelligent health check than simply seeing whether a service is listening on a certain port. Using HTTP traffic as an example, send an HTTP GET request for a certain page and look for a specific response code. Doing so will ensure your destination server, and not a firewall, is responding to your health checks. As cloud computing continues to ramp up, load balancers will increasingly send traffic to systems in the cloud, and will therefore encounter firewalls and full proxies on the way more often.
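A content-aware check along those lines might look like the following Python sketch. The default path /index.html is borrowed from the earlier example; the function name, the timeout, and the choice to treat anything other than the expected status as "down" are assumptions for illustration.

```python
import http.client

def http_check(host: str, port: int, path: str = "/index.html",
               expected_status: int = 200, timeout: float = 2.0) -> bool:
    """Return True only if a GET for `path` comes back with the expected status."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == expected_status
    except (OSError, http.client.HTTPException):
        # Connection failures and malformed responses both count as "down".
        return False
    finally:
        conn.close()
```

Because the request has to be answered at the HTTP layer, a device that only completes TCP handshakes on the server's behalf cannot satisfy it.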

One of my great “ah, ha!” moments in Application Delivery came when I was reading a post about compression by F5’s Lori MacVittie. At the time, I was with my previous employer and was considering starting a project to implement compression. When I began discussing it with others, I was told that certain versions of IE had issues with compressed data, even though they sent headers saying they accepted gzip. Since our customers were long term care facilities and could feasibly have older technologies, it wasn’t crazy to think they’d be browsing using pre-IE6 and might have problems. Since our principal rule in IT was to “First Do No Harm,” I didn’t want to cause a negative experience for some users simply to speed things up for others. My mistake at that time was that I made a blanket assumption about all users. I decided that because I shouldn’t compress content for older browsers, I couldn’t compress at all. In Lori’s article, she talks about how compression isn’t always advantageous – especially over a LAN. Prior to reading the article, I really hadn’t considered all the information users were giving me and even better, that I could make decisions based on that information.

When a user visits a website, their browser sends a number of HTTP headers. A great example of an HTTP header is “User-Agent.” In this header, the web browser informs the site it’s visiting what type of browser it is.

Google Chrome, for example, sends “Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.1 (KHTML, like Gecko) Chrome/6.0.428.0 Safari/534.1”

With this information, a website owner can decide to treat certain browsers differently than others.

An iPhone, for instance, might send something like “HTTP_USER_AGENT=Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1C25 Safari/419.3”

Some sites do an excellent job of leveraging this information to provide a better user experience. If you go to certain sites with an iPhone, you might notice yourself being redirected to m.example.com, or mobile.example.com or another “mobile-specific” site specially designed for a mobile device. Users obviously appreciate this since it keeps them from constantly having to zoom in and out and scroll just to see the pages. While many companies create iPhone apps for situations like these, that doesn’t help people who have other mobile devices, hence the requirement for a mobile-specific site. One thing you’ll likely notice when visiting mobile-specific sites is that it’s not simply the same content with a different resolution – it’s typically different menus, buttons, and fewer images. Since most iPhone browsing is being done via a cellular network, it’s good to consider latency as an experience inhibitor.

Using Lori’s example of providing compression only when it improves the user experience, we can apply the same logic to users with the iPhone user-agent header. On an F5 device, for instance, we’d create an iRule to be executed on HTTP_REQUEST events that would look at the User-Agent header and, if it contained “iPhone,” either send a redirect so the user would go to our mobile site, or compress data more aggressively, or even both. Using my own example of trying to compress data without causing issues for older browsers, I wouldn’t want to compress simply because a browser sent the “Accept-Encoding: gzip” header – I’d really want to make sure I’m only compressing for user-agents I know can handle compressed data, so the decision would be based on a combination of the “User-Agent” and “Accept-Encoding” headers. I often run into sites that, while being smart enough to detect my user-agent and make decisions based on it, provide a negative experience. For example, here’s the text I see when I navigate to a certain site using Google Chrome:

Please Upgrade Your Browser
To provide you with the most efficient experience, (removed)
utilizes advanced browser features of Internet Explorer 5.0 and greater.

Your Internet browser does not meet the minimum criteria.
Please download the latest version of Internet Explorer.

I’m obviously using one of the most capable browsers on the market, and this particular site not only says it won’t support me, but it also says I’d be better off with IE5. The only “saving grace” is that they provide a link through which I can download a browser they do support. Unfortunately, I’m stuck at this page and am not seeing any of their site’s content. Better behavior would be that I make it to their home page, am informed of the features the site doesn’t think I support, and can make a decision on whether I’d like to move forward. This site is obviously looking at the user-agent header, but is unfortunately making a blanket decision that because mine doesn’t contain IE, I’m not compatible. When this logic was written, Chrome didn’t exist. This behavior requires the logic to adapt constantly to new browsers. In this case, the site might be better off looking for headers that determine whether the browser supports the specific features required by IE.
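The header-based decisions discussed in this post (redirect iPhones to a mobile site, compress only for browsers you trust) can be sketched in a few lines of Python. This is not an iRule; the allow-list of user-agent tokens, the "m." host prefix, and the function names are all assumptions made up for the example.

```python
# Hypothetical allow-list of user-agent tokens believed to handle gzip correctly.
GZIP_SAFE_TOKENS = ("Chrome/", "Firefox/", "Safari/")

def is_mobile(user_agent: str) -> bool:
    """Very rough mobile detection: only looks for the iPhone token."""
    return "iPhone" in user_agent

def should_compress(headers: dict) -> bool:
    """Compress only if the client asks for gzip AND is a browser we trust."""
    accepts_gzip = "gzip" in headers.get("Accept-Encoding", "").lower()
    user_agent = headers.get("User-Agent", "")
    return accepts_gzip and any(token in user_agent for token in GZIP_SAFE_TOKENS)

def handle_request(headers: dict, host: str = "example.com") -> tuple:
    """Return (action, target): redirect mobile users, else pick compress/plain."""
    if is_mobile(headers.get("User-Agent", "")) and not host.startswith("m."):
        return ("redirect", f"http://m.{host}/")
    return ("compress" if should_compress(headers) else "plain", None)
```

Note that an unknown browser falls through to "plain" and still gets the content, rather than being blocked outright.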

Another interesting thought that popped into my head on the drive home yesterday was the type of inferences you can make about the person behind specific user-agents. If, for instance, I’m using Chrome to visit your site, I’m likely an advanced user who cares about new technology – do you really want to tell me that your site doesn’t support me and that I’d be better off with IE5? How about if I’m visiting your site with an iPhone – what does that say about me?

I’d love to see some of the analytics data comparing something like “conversion rate” for retail sites among different browsers. I imagine very few people purchase from their phone but I expect that quite a few of them are comparing prices – if that’s the case, it might make sense to have pricing readily available on your “mobile-specific” site.

While attempting to test a new load balanced environment, I noticed a very odd behavior. Traffic would hit my F5’s Virtual Server (VIP) but the F5 wouldn’t send any traffic to its pool members. I spent quite a while troubleshooting the issue and thinking it was a VLAN issue but nothing I tried worked. While doing packet captures from the box, I would see traffic hit the F5’s external VLAN but nothing hit its internal VLAN. I therefore assumed the public side was just fine and that there was a communication issue between the load balancer and the devices it was load balancing.

Hoping to get assistance with the issue, I opened a ticket with one of our vendors who, after a while, noticed the issue. Traffic was indeed hitting the external VLAN but the TCP 3-way handshake wasn’t completing. The client would send the SYN packet. The F5 would send the SYN, ACK but the final client ACK was never seen. It later turned out that the SYN, ACK wasn’t making it to the client but that’s neither here nor there. The real “learning opportunity” for me was to remember the basics. Just because I saw traffic on the external VLAN, I automatically assumed everything was working fine even though it wasn’t. Since a load balancer is a reverse proxy, it completes a 3-way handshake with the client and then does the same with the server it’s load balancing. Since the client 3-way handshake was never completing, it wasn’t trying to do its handshake with the pool member. I knew there wasn’t a communication issue between the F5 and server because our health checks were passing but I still assumed the problem to be related to our F5 having to pass data that came in on one VLAN to another VLAN. Had I actually looked closely at my packet capture, I would have noticed the problem right away. Ah well, just another lesson learned.

Sorry for the short post – I’ve been too busy to come up with something good.

I had an interesting discussion with a coworker about which systems certain application logic should live on. In this case, our dialog revolved around whether an HTTP redirect should live on a web server or on an F5 Application Delivery Controller. Naturally, being the ADC guy, I would want it on my system. Even with that said, I think it’s pretty obvious that logic like this should live on the F5 device.

1. The F5 is closer to the user than the web server. If the F5 handles the redirect, the Web server doesn’t have to see the initial request, just the post-redirect one.

2. Instead of the redirect existing on multiple servers, it only has to exist on 1 (or 2) F5 devices.

Today, when most people discuss serving content on the “edge” or “closer to the customer,” they’re likely saying so because of the performance implications. The initial motivation for companies to utilize CDNs like Akamai was to reduce dependence on their own infrastructure. By offloading static content to a CDN, companies could reduce their bandwidth costs and potentially even their server footprint. As demand for content-rich applications has increased, the main motivation for utilizing a CDN has changed. The price of bandwidth has dropped dramatically while server consolidation technologies like Virtualization and Blades have made server resources cheaper than ever. Now, when a company chooses to utilize a CDN, it’s likely so its content can be even closer to its clients/users. Using technologies like Geographic Delivery, a user requesting a page from California can get sent to a CDN’s resource in California. This helps to deliver the rapid response time users have come to require out of new web applications.

There’s really no disputing that compression, caching, security, and redirects should be done as close to the user as possible. The only potentially valid argument I see for not utilizing such services is a financial one. In retail, customers demand fast response times. In some environments, that isn’t the case. If users are apathetic to load time, then the optimal cost-effective solution would likely be one that doesn’t require a CDN at all…it’s all about finding out which solution fits best for your environment.

During my first couple weeks at my former job, I happened upon a “Systems Status Dashboard” on one of our intranet sites. It was my first exposure to “IT Monitoring” so I quickly clicked on the link and was shown a few different systems all marked OK with the overall message that all systems were functioning normally. I really liked that anyone in the company could immediately access the health of our systems. That way, if people were experiencing issues, they could simply click the link and be updated. Unfortunately, I quickly came to realize that both the service and systems were pretty much always “ok”, even when they weren’t.

After I gained more experience in IT, especially in Application Delivery, I became somewhat competent at understanding all the parts that go into powering a service. To step back, let’s consider a “service” to be a solution – a website for instance, and a “system” to be a piece of that solution – a server, router, or switch for instance. As I began spending more time working with and troubleshooting Application Delivery Controllers (Load Balancers), I had to learn a lot about server and application issues. Being a traditional network guy, I found myself constantly troubleshooting non layer2/3 issues. As any “smart” IT guy would, I became frustrated that our customers noticed issues before we did and looked into monitoring options.

When I first started as a Network Engineer, I really only considered monitoring to be a method by which I could be alerted when issues happened. I’d configure alerts for bandwidth utilization, device fail-over, etc. It was only after stepping back that I was able to consider the impact of those failures on an overall “service.” Unfortunately, I wasn’t the only person thinking about monitoring from a “system-perspective.” We had different products monitoring network, server, and application pieces and there was very little conversation between the products. This made it very difficult for our Operations teams to gauge the impact of system failures. Taken out of context, IIS failing on 2 servers at the same time doesn’t tell an incident manager much. Of course, if said Incident Manager knows that these 2 servers were the only ones servicing a Web Farm, obviously it’s a serious issue demanding immediate attention.

That’s where the idea of “service monitoring” comes into play. I’ve developed quite a passion for the concept simply because of how much easier it can make life for everyone. Let’s consider the following “Systems” that make up a “Service” which we’ll call WebsiteA.

1. DNS Resolution (2 Servers)

2. ISP Routers (2 routers, each with 4 x T1s)

3. Switching (3 switches to which everything connects)

4. Load Balancers (2 of em, active-standby)

5. Web Servers (20 of em) – we’ll let our Apps run here too

6. Database Servers (5 of them)

7. SAN

Since the architecture of the above systems greatly affects the service’s availability, let’s assume it was done somewhat competently where no single system failure can cause the service to be down for longer than 1 second. In a traditional, legacy environment, different teams manage the 7 systems above (some overlap obviously) and would rely on their own products to monitor them. Imagine though, a dynamic “service map” on which all the systems are shown.

Let’s assume there are 3 possible statuses for our service – green (no impact), yellow (degraded), and red (significant impact). First, someone with a great deal of customer and application understanding, defines the requirements and impacts of all the different failures. A good example would be whether a single server failure would move us to yellow status or if it would take more than just that. Anyways, we still have all those different products monitoring systems but now, they feed their information to our service monitoring dashboard. MOM just detected that IIS quit working on 3 web servers? Our Dashboard should change the service’s (Website’s) status to yellow to reflect the change in capacity/redundancy. A catastrophic issue just took down both of our load balancers? Our dashboard knows that our service depends on at least one of those boxes being up so it knows that the website is down and immediately communicates that information!
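To show how a dashboard could roll system states up into one service status, here is a minimal Python sketch. The thresholds are exactly the kind of thing the person who understands the customers and application would have to define; the numbers, class name, and function names below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class SystemState:
    name: str
    up: int              # members currently healthy
    degraded_below: int  # fewer healthy members than this means yellow
    down_below: int      # fewer healthy members than this means red

SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def system_status(system: SystemState) -> str:
    """Classify one system against its own thresholds."""
    if system.up < system.down_below:
        return "red"
    if system.up < system.degraded_below:
        return "yellow"
    return "green"

def service_status(systems: list) -> str:
    """The service is only as healthy as its worst system."""
    statuses = [system_status(s) for s in systems]
    return max(statuses, key=SEVERITY.__getitem__, default="green")
```

With 17 of 20 web servers up and a yellow threshold of 19, the service goes yellow; with zero load balancers up and a red threshold of 1, it goes red no matter how healthy everything else is.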

So…why aren’t more people doing service monitoring already? Sure, plenty of people use external, transactional monitoring, but how many are leveraging that as more than just a “the website is up” check? If one of those transactional monitors fails, are you properly interpreting the results or do you need to spend an hour figuring out which system is broken? Some of the good products tell you at which hop you failed (DNS resolution, 404/500 error, etc.) but not always. Another problem seems to be finding the people who actually understand all of the dependencies for a service. When everyone has their heads down working on their own systems, it’s sometimes hard to remember why the systems are there in the first place. That’s why architects are so handy! Unfortunately, their work often ends when the design is done, rather than when the monitoring and documentation are complete. There’s also that tough part, you know – changing the service monitoring every single time the underlying systems change.

If you read my last post, you noticed me utilizing a very simple F5 iRule to do an HTTP to HTTPS redirect. During testing today, we noticed a very frustrating application behavior, outlined below.

1. Client makes request for http://site

2. F5 sends client redirect for https://site

3. Client requests https://site

4. Client logs in and application runs.

5. Application sends redirect to client saying to hit http://site

6. Client gets a popup message warning them that they’re about to make an unsecure connection.

The problem here isn’t necessarily that the application wants to redirect the client to an http page…rather, it’s that it wants to do so while the client is currently on an https page. Browsers don’t like that behavior because it’s often a trap of some sort. So, in order to keep the traffic flow and behavior relatively the same, the application would need to tell the user to hit https://site after logging in instead of http://site.

Since such a situation would be a bit difficult and require a code change across multiple systems, it’s F5 and iRules to the rescue! Since the application sends the redirect to the F5 which then proxies it to the client, the F5 is able to intercept the application response, rewrite it, and then send it on to the client.
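The rewrite itself is conceptually small. Here is a Python sketch of the idea, not the iRule we used: it takes the response headers as a plain dict, and the function name and the `site_host` parameter are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def rewrite_location(headers: dict, site_host: str) -> dict:
    """Rewrite an http:// redirect to our own site into an https:// one."""
    location = headers.get("Location")
    if not location:
        return headers
    parts = urlsplit(location)
    # Only touch absolute http URLs that point at our own host; relative
    # redirects and redirects to other sites are passed through untouched.
    if parts.scheme == "http" and parts.hostname == site_host.lower():
        rewritten = dict(headers)
        rewritten["Location"] = urlunsplit(parts._replace(scheme="https"))
        return rewritten
    return headers
```

Comparing the parsed hostname, rather than doing a plain string prefix match, avoids accidentally rewriting a redirect to a different host whose name merely starts with the same characters.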

Sure, this is a valuable ability for oddly-behaving applications. Think bigger though! How about if your server is sending 404 or 500 errors? How about instead of sending those to the client, the F5 detects them and redirects the client to your home page or even another server. It’s so nice having visibility at layer 7. The key is having people know your applications well enough that they can recognize and predict these behaviors before they happen!

One of my coworkers is doing a relatively simple infrastructure redesign for one of our sites. Essentially, the site can work over HTTP or HTTPS. As part of a new project, the application team requires that all data be sent over HTTPS so it can be encrypted. Since the current user behavior is to enter “http://site”, the user typically sends information unencrypted.

So, since the user must now send encrypted data, they must use “https://site” instead. That leaves us with the following options.

1. Instruct the user to always type “https://site”

2. Remove the HTTP Virtual Server so any traffic destined for HTTP will be dropped. This will hopefully force the user to enter “https://site”

3. Create an iRule that automatically redirects “http://site” to “https://site”

We obviously chose option 3 as it doesn’t require a change in behavior for the user and also supports them accidentally trying to hit the site over HTTP. Unfortunately, a lot of people don’t understand their infrastructure well enough to realize that such an ability exists within an Application Delivery Controller. In some environments, the project would have taken much longer and been much more troublesome for the end users.
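For readers who haven't seen what option 3 amounts to, here is the logic as a small Python sketch rather than the actual iRule. The function name is made up, the 302 status is a choice made for this sketch, and the port handling assumes the Host header is not an IPv6 literal.

```python
def https_redirect(host_header: str, uri: str) -> tuple:
    """Build a redirect that sends an HTTP request to the same URI over HTTPS."""
    # Drop any explicit port (e.g. "site:80") so the client uses the HTTPS default.
    hostname = host_header.split(":", 1)[0]
    return 302, {"Location": f"https://{hostname}{uri}"}
```

For example, a request for /login?next=/cart with Host "site:80" produces a redirect to https://site/login?next=/cart, so the user lands on the page they originally asked for.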

Since you’re reading this blog entry, you more than likely know a bit about me and in turn, know that I’ve recently accepted a new position with Kohl’s Department Stores as a Network Analyst focused on their E-Commerce infrastructure. I plan to write an entry about why I decided to leave Direct Supply, but this won’t be that entry.

It’s now been 2 weeks since I began working at Kohl’s and it’s gone OK. I really miss the people at Direct Supply and I also miss being the “go-to guy” for so many different systems. It took me a while to become the subject matter expert (SME) for Direct Supply’s network infrastructure and while my goal is to fill that SME role for Kohl’s E-Comm infrastructure, it’s going to take some time. Starting over is never fun!

I really like my new leader so far. He has a great handle on the technologies Kohl’s requires to be successful. That really helps him get us (his team) the resources we need as far as training, mentoring, etc. During my first couple days, he was kind enough to introduce me to at least 50 people. Each time he introduced me, he said, “this is Chris” and then some sort of derivative of “he’s the F5 guy.” The thing that surprised me most is that everyone to whom he said that immediately understood why I was there, what I was doing, and how critical it was that I do it. I was being introduced to VPs who already knew how Kohl’s depended on the ability of the devices I supported. What an amazing concept!

When I left Direct Supply, I spent the last couple weeks training my coworker and another person on how easy it was to use F5’s systems to accomplish tasks. I already knew F5s were easy to use…gotta love that GUI. Seriously though, I caught myself thinking that it was my familiarity with F5 equipment that led Kohl’s to take an interest in me. Since I was able to teach someone 95% of the tasks they’ll need to know in just 2 weeks, obviously familiarity with the systems wasn’t my value proposition.

Through being constantly referred to as “the F5 guy,” I started to realize what “F5” meant to other people. While to most of us, it simply means an Application Delivery vendor, to others, it means Application Delivery on its own. It’s almost like when I ask someone for a “kleenex” instead of a tissue. Wow, imagine your name being interchangeable with the product you sell. I’m not “valuable” because I know how to get around in an F5 device. Sure, I can create pools, virtual servers, set BIG-IPs up in a high-availability pair, but all of those things can be done by just about anyone. It’s understanding why you’re doing it that’s so powerful! My backfill position suggests it would be advantageous for an applicant to be “knowledgeable” in load balancing technologies and that “F5 would be a plus.” While I have my strong feelings about that whole sentence, it will be addressed in the next entry I write. Back to the point – it’s all about understanding why you’re using a load balancer, and then why a load balancer isn’t enough, and then why going with F5 is key.

My current career goal is to get into more infrastructure architecture and design. Being at Kohl’s has been great because we’re essentially doing a re-design of our E-Commerce infrastructure. It’s always an amazing challenge when you’re given an anticipated amount of Page Views Per Second, told that demand can fluctuate 100-fold depending on the time, and that downtime is absolutely not an option. Oh, and response time is critical! Someone who is “knowledgeable in load balancing” will likely understand that using a load balancer in this scenario would help spread load across servers and even ensure that healthy servers are the only ones receiving traffic. Someone who understands Application Delivery, though, will understand that your F5 device won’t just load balance, but it will also compress data, cache it, and get it to your users faster. That person also understands how to use contextual-awareness to accomplish those goals. So, you want to compress data but only when it’s beneficial? Well, someone who is “knowledgeable in load balancing” is going to have a tough time with that. On the other hand, someone with a good handle on how applications, servers, and networks interact should be able to say, “hmmm, let’s figure out which browsers have problems with compression and also, since compression can actually add latency when content is accessed over a LAN, let’s only compress if the RTT is longer than 1 second.” So now, only someone with Internet Explorer 6+ or Firefox 2.5+ that’s accessing our pages over a “slower” link is going to get compressed data. That, to me, is what it’s all about. If being introduced as the “F5 guy” can convey that I’m an application delivery guy, then by all means!
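That compression rule is concrete enough to write down. The Python sketch below uses the thresholds from the paragraph above (IE 6+, Firefox 2.5+, RTT over 1 second); the function names, the regular expression, and the way the RTT is handed in are assumptions for illustration, and the float-based version comparison is deliberately crude.

```python
import re

# Minimum browser versions from the paragraph above, keyed by user-agent token.
MIN_VERSIONS = {"MSIE ": 6.0, "Firefox/": 2.5}
RTT_THRESHOLD_SECONDS = 1.0

def browser_ok(user_agent: str) -> bool:
    """True if the user-agent names a listed browser at or above its minimum."""
    for token, minimum in MIN_VERSIONS.items():
        # Capture "major" or "major.minor" right after the browser token.
        match = re.search(re.escape(token) + r"(\d+(?:\.\d+)?)", user_agent)
        if match and float(match.group(1)) >= minimum:
            return True
    return False

def compress_for(user_agent: str, rtt_seconds: float) -> bool:
    """Compress only for known-good browsers on links slower than the threshold."""
    return browser_ok(user_agent) and rtt_seconds > RTT_THRESHOLD_SECONDS
```

How the round-trip time is measured is out of scope here; the sketch simply takes it as a number supplied by whatever sits in front of the servers.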

I’ve only recently begun understanding how critical contextual-awareness is to the future of technology. Now that we access sites from phones, laptops, desktops, iPads, video game consoles, and TVs, the industry needs people who can understand not just how those devices are different, but what that means to delivering content. Today, there are people who understand things well enough to say, “if you’re a mobile user, you should enter mobile.***.com so you can access our specially designed mobile site.” That’s not going to be good enough guys! Your systems had damn well better be smart enough to view that user’s agent and automatically direct them to your mobile site. Oh, and if they’re not using 3G, you’d better send them to your “light-weight” mobile site.

The above paragraph is a perfect example of why I love application delivery. Did Direct Supply “let” me tackle problems like this? No…because we didn’t necessarily need them solved. If we did, it wasn’t a “Network Engineer’s” job to solve them. Kohl’s definitely understands and needs its network folks to be creative and understand their customers, but whether or not I’m the guy who’s able to accomplish things like this, we’ll have to wait and see. The first part is to find the problem, the second part is to make sure I’m the person to whom they come when they realize the problem needs to be fixed! A big reason I’m pursuing my MBA soon is so I’m the guy who can find the problem, understand why it needs to be fixed, and then understand how to fix it. Whether all those functions exist in my role is another story, but at least I’ll have the tools.

I read this post from Lori MacVittie a long time ago and since I understood and agreed with everything she said, I bookmarked the page so I could always reference it. Unfortunately, I sort of forgot why it was important.

I’m currently the on-call network engineer for our environment so when there’s an issue, I find out fairly quickly. We currently outsource some of our monitoring to a vendor. Essentially, they package Nagios into an appliance on our private network and connect to us through our firewall via some port forwarding magic. This arrangement allows us to leverage a great monitoring system without having to spend tons of time configuring and managing checks. Since this monitoring is so critical, the vendor is always quick to let me know when they’re unable to manage our device as this can indicate a critical network issue.

Last night, they called at about 9:20 PM to let me know they were unable to reach our device. I attempted to VPN into our network but the logon page wasn’t loading. Unfortunately, the VPN device has been having sporadic issues like that so I couldn’t tie that directly to an issue. While constantly attempting to load the page, I began pinging our 4 ISP-facing routers. Three of the routers replied and one didn’t. The one that didn’t was the one through which our vendor managed the probe. That link is also one of the 2 used by our VPN device. While I quickly assumed this ISP/router was having a problem, I realized other devices on that subnet did respond so that likely wasn’t the case.

After finally being able to log in to our VPN, I jumped onto our F5 Link Controller (ISP Link Load Balancer) to begin troubleshooting. The motivation for using a Link Load Balancer is that it allows you to multi-home your sites across multiple ISPs. When it detects a problem with a single ISP, it quits replying to queries with addresses from that subnet. Since it was still giving out these addresses, I knew the link wasn’t a problem. After logging onto the Link Controller, I began reviewing recent log messages. The first one that jumped out at me was that the device had gone from active to standby and back to active.
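As an aside, the link-selection behavior described above (stop handing out addresses from an ISP that looks down) can be sketched in a few lines of Python. This is a toy model, not the Link Controller’s actual logic; the function name, the ISP labels, and the documentation-range addresses are made up for illustration.

```python
def addresses_to_hand_out(site_addresses, link_up):
    """Return the site's addresses on ISPs whose link is currently healthy.

    site_addresses maps an ISP label to the site's address on that ISP's subnet.
    link_up maps an ISP label to True/False; a missing entry counts as down.
    """
    return [addr for isp, addr in site_addresses.items() if link_up.get(isp, False)]


# Hypothetical site that is multi-homed across two ISPs.
site = {"isp_a": "192.0.2.10", "isp_b": "198.51.100.10"}
healthy = addresses_to_hand_out(site, {"isp_a": True, "isp_b": True})
degraded = addresses_to_hand_out(site, {"isp_a": False, "isp_b": True})
```

That is why still seeing addresses from the suspect subnet in the answers was a useful clue: the device considered that link healthy.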

Awesome! A fail-over had occurred. That explained the downtime. Unfortunately, since we were still getting alerts of certain sites not responding, there was still a problem. I quickly identified that the Link Controller was indeed active. I then connected to the standby unit and confirmed that it was standby. When reviewing its logs though, I noticed tons of messages similar to the following – “Packet rejected remote IP x.x.x.x port x local IP x.x.x.x port x proto x: Port closed.” After checking the time stamp on the logs, I quickly noticed they were still happening.

Oh crap! After staring blankly at my laptop screen for a while trying to figure out why my standby load balancer was receiving traffic, I decided to go check the persistence table where session status (mapping of user to pool and ISP) is stored. I quickly noticed that the normally empty table had plenty of data. Since this wasn’t the active unit anymore, I cleared the persistence table and made sure it didn’t become populated again. It didn’t! Unfortunately, the “Port closed” log entries continued to occur. My standby load balancer, which shouldn’t be receiving any traffic at all, was still getting plenty of it.

That’s when it finally hit me! ARP had gotten me again. When a user tries to connect to an IP like 1.1.1.2, it must go through a router like 1.1.1.1. Once the router (1.1.1.1) receives the packet, it tries to locate 1.1.1.2 so it knows where to send the data. It does this by sending an ARP broadcast. If 1.1.1.2 happens to be behind a load balancer, the load balancer will let everyone know that it’s the owner of that IP. If you have an active and standby load balancer though, only the active one should reply saying it’s the owner. When the active load balancer fails, the standby one should begin responding to ARP requests in its place. If the original unit later takes over again, it resumes responding.

Unfortunately, much to the dismay of a dynamic infrastructure, we have a very, very evil piece of technology: THE ARP CACHE. Once a network device uses ARP to locate a device, it caches that location so it can save time and not have to ask again. From what Google tells me, a Cisco router caches an ARP entry for 5 minutes. Let’s consider why that’s not good for me!

1. A client makes a request for 1.1.1.2.

2. The Cisco router, 1.1.1.1, sends an ARP request for 1.1.1.2.

3. Since my load balancer owns 1.1.1.2, it responds.

4. The Cisco router saves the location of 1.1.1.2 in its ARP table for 300 seconds.

5. My active load balancer fails and the standby load balancer becomes active.

6. The Cisco router keeps trying to send data to 1.1.1.2. Since it has the old MAC address in its ARP table, it attempts to send directly to the formerly active load balancer.

7. Since that load balancer isn’t active anymore, it can’t process the transaction.
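
The steps above can be replayed with a toy model in Python. This is only a sketch: the `Router` class, the MAC addresses, and the timestamps are invented, and it deliberately assumes the router learns nothing between lookups (no gratuitous ARP or other cache update on failover).

```python
class Router:
    """Toy router that caches IP-to-MAC answers for a fixed timeout."""

    def __init__(self, arp_timeout=300):
        self.arp_timeout = arp_timeout
        self.cache = {}  # ip -> (mac, time the entry was learned)

    def resolve(self, ip, now, arp_reply):
        """Return the MAC the router would send to at time `now` (seconds).

        `arp_reply` is a callable returning the MAC of whichever unit
        currently answers ARP for `ip`; it is only asked on a cache miss.
        """
        entry = self.cache.get(ip)
        if entry is not None and now - entry[1] < self.arp_timeout:
            return entry[0]  # cached answer, possibly stale
        mac = arp_reply(ip)
        self.cache[ip] = (mac, now)
        return mac

    def clear_arp(self):
        """Forget every cached entry, like clearing the ARP table by hand."""
        self.cache.clear()


UNIT_A = "aa:aa:aa:aa:aa:01"  # hypothetical MAC of the first load balancer
UNIT_B = "aa:aa:aa:aa:aa:02"  # hypothetical MAC of the second load balancer
active = {"mac": UNIT_A}


def who_answers(ip):
    return active["mac"]


router = Router(arp_timeout=300)
before = router.resolve("1.1.1.2", 0, who_answers)        # steps 1-4
active["mac"] = UNIT_B                                     # step 5: failover
during = router.resolve("1.1.1.2", 60, who_answers)       # step 6: stale entry
after_expiry = router.resolve("1.1.1.2", 300, who_answers)
```

In this model `during` is still the first unit’s MAC even though the second unit is now active, and only `after_expiry` points at the right box. Calling `clear_arp()` right after the failover would have forced a fresh lookup immediately.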

As you can see, having an ARP timeout of 300 seconds severely limits my options. In order to fix this problem, I simply jumped onto our routers, issued a “clear arp” command and the problem went away. To keep the problem from happening, I could either disable ARP caching, lower the ARP cache timeout value, or best yet, use MAC Masquerading. See, F5 does a great job of supporting dynamic infrastructures and recognized how critical ARP is! Unfortunately, it’s tough to see the value in performing extra configuration tasks such as enabling MAC Masquerading if you haven’t experienced the evil that is ARP caching!
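Why MAC Masquerading helps can be shown with a tiny Python comparison. The sketch assumes the general idea behind the feature, namely that a shared MAC address follows whichever unit is active, so the entry a router cached before a failover is still correct afterwards. The function names and MAC values are hypothetical.

```python
def router_view_after_failover(unit_a_mac, unit_b_mac, masquerade_mac=None):
    """Return (cached_mac, live_mac) after unit A fails over to unit B.

    cached_mac is what the router learned while A was active.
    live_mac is the address the newly active unit B answers on.
    With a masquerade MAC configured, both units use that shared address.
    """
    mac_while_a_active = masquerade_mac or unit_a_mac
    mac_while_b_active = masquerade_mac or unit_b_mac
    return mac_while_a_active, mac_while_b_active


def cache_is_stale(unit_a_mac, unit_b_mac, masquerade_mac=None):
    """True if the router's cached entry points at the wrong box after failover."""
    cached_mac, live_mac = router_view_after_failover(
        unit_a_mac, unit_b_mac, masquerade_mac
    )
    return cached_mac != live_mac
```

Without a shared MAC, `cache_is_stale("aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02")` is true until the entry times out or is cleared. With `masquerade_mac="02:00:00:00:00:aa"` it is false, so the ARP timeout stops mattering for failover.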

As the IT industry continues to move toward dynamic infrastructures, it’s critical that we consider technologies such as ARP and how they affect us. Just because you have an active and standby unit doesn’t mean you’re redundant!