
As I discussed in my post about “Strategic Points of Control,” F5 LTMs are in a great position to capture and report on information. I’ve recently encountered several issues where I needed to log which systems were sending HTTP 404/500 responses and the URLs that triggered them. While this information can be obtained from a packet capture, I find it much easier to simply leverage iRules to log it.


If you don’t know too much about iRules, I’d encourage you to head over to DevCentral and do some reading. One of the first things you’ll learn is that there are several “events” in which an iRule can inspect and react to traffic. Each event has its own set of commands: some commands can be used in multiple events, while others cannot.


As an example, HTTP::host and HTTP::uri can be used in the HTTP_REQUEST event, but not in the HTTP_RESPONSE event. Since an HTTP error response sent by a server is handled in the HTTP_RESPONSE event (between the server and the LTM), we can’t simply log the value of HTTP::host or HTTP::uri there, as those commands aren’t usable in the HTTP_RESPONSE context. Fortunately, variables can be set in one event and referenced in another, which still lets us access the proper information.


Here’s an overview of what we’re trying to accomplish:


1. A client makes a request to a Virtual Server on the LTM.

2. The LTM sends this request to a pool member.

3. If the pool member (server) responds with an HTTP Status code of 500, we want to log the Pool Member’s IP, the requested HTTP Host and URI, and the Client’s IP address.


We’ll be using the “HTTP::status” command to check for 500s. Since this command has to run in the HTTP_RESPONSE event, which doesn’t have access to HTTP::host or HTTP::uri, we’ll need variables.

In the HTTP_REQUEST event, we’ll set variables to capture the values of HTTP::host, HTTP::uri, and IP::client_addr.

The HTTP_REQUEST event in our iRule will look something like this:

when HTTP_REQUEST {
    # Capture request details for later use in HTTP_RESPONSE
    set hostvar [HTTP::host]
    set urivar [HTTP::uri]
    set ipvar [IP::client_addr]
}

Now, we’ll check the HTTP status code from within the HTTP_RESPONSE event, and if it’s a 500, we’ll log the values of the variables above.

when HTTP_RESPONSE {
    # Log the details we captured if the pool member returned a 500
    if { [HTTP::status] == 500 } {
        log local0. "$ipvar requested $hostvar $urivar and received a 500 from [IP::server_addr]"
    }
}


Now, whenever a 500 is sent, you can simply check your LTM logs and you’ll see the client who received it, the server that sent it, and the URL that caused it. This is a fairly vanilla implementation. I’ve had several situations in which I also needed to report on the value of a JSESSIONID cookie so our app folks could check their logs. In a situation like that, you’d simply set and log another variable.

From HTTP_REQUEST:

set appvar [HTTP::cookie JSESSIONID]

From HTTP_RESPONSE:

log local0. "session id was $appvar"
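For reference, here’s what the whole thing might look like once the session cookie is included. Consider it a minimal sketch: it assumes the cookie really is named JSESSIONID, and you’d tailor the log message to whatever your app folks need.

when HTTP_REQUEST {
    # Capture request details while they're still available
    set hostvar [HTTP::host]
    set urivar [HTTP::uri]
    set ipvar [IP::client_addr]
    set appvar [HTTP::cookie JSESSIONID]
}

when HTTP_RESPONSE {
    # Log everything we captured if the pool member returned a 500
    if { [HTTP::status] == 500 } {
        log local0. "$ipvar requested $hostvar $urivar (JSESSIONID: $appvar) and received a 500 from [IP::server_addr]"
    }
}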


This was a good example of how easily iRules can be leveraged to report on issues. Unfortunately, though, this isn’t always a scalable option, which is why I thought I’d talk about a product I’ve really enjoyed using.

The folks behind Extrahop describe their product as “Application Delivery Assurance.” Since both co-founders came from F5, they have a great handle on application delivery and the challenges involved. Since I’m typically only concerned with HTTP traffic nowadays, I use Extrahop to track response times, alert on error responses, and baseline our environment. As an F5 user, I’m very pleased to see the product’s help section make recommendations on which BIG-IP settings to tune when certain issues appear.

I’d definitely encourage you to go check out some product literature. Since it’s not always fun to arrange a demo and talk to sales folks, they offer free analysis via www.networktimeout.com. Simply upload a packet capture; it’ll be run through an Extrahop unit, and you can see the technology in action.



As I discussed in this post, sending an HTTP GET for a page on a server to which you load balance traffic is one of the better health checks available. If you use the right page, it can be an extremely lightweight yet highly reliable check.

In order to properly utilize these health checks, you need to know enough about the application you’re supporting to understand how it behaves when it fails.

In my case, I send traffic to a pool of Apache servers running mod_weblogic. From there, the traffic is sent to the application instances.


Using an F5 BIG-IP LTM as an example, a health check definition has several configurable parameters.


1. Interval (How often the check is sent)

2. Timeout (How long the resource has to respond)

3. Send String (The request you’re sending to the resource)

4. Receive String (The response that causes the health check to pass)

5. Receive Disable String (The response that causes the health check to fail)


There are several more, but let’s concentrate on the typical ones.

The default interval is 5 seconds and the default timeout is 16, which follows the common F5 guideline of setting the timeout to (3 × interval) + 1. I’ve always been ok with that.

For our send string, let’s do “GET /login.jsp HTTP/1.1\r\nHost: \r\nConnection: Close\r\n\r\n”

So, we’re sending an HTTP GET for /login.jsp using HTTP/1.1 and an empty host header. We’re also closing out the connection so it doesn’t have to sit idle on the server.

For our receive string, let’s do “HTTP/1\.(0|1) (2)”

So, we’re considering any HTTP/1.0 or 1.1 response whose status code starts with a 2 to be a success. Typically, a server responds with a 200 when all is well, so this is pretty standard.
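If it helps to see those settings in one place, here’s a rough sketch of what such a monitor could look like in a BIG-IP configuration (tmsh syntax). The monitor name login_check is made up, and the quoting/escaping of the send and receive strings can vary by software version, so treat this as illustrative rather than definitive.

ltm monitor http login_check {
    defaults-from http
    interval 5
    timeout 16
    send "GET /login.jsp HTTP/1.1\r\nHost: \r\nConnection: Close\r\n\r\n"
    recv "HTTP/1\.(0|1) (2)"
}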


Unfortunately for me, our resource actually sends an HTTP 301 (Moved Permanently) when a user tries loading the login page. This happens fairly often, especially if you’re sending a health check for “/” and the resource redirects you to a different directory. Since we consider this permanent redirect to be normal behavior, we’ll modify our receive string to “HTTP/1\.(0|1) (2|3)”. Now, we’re including all 3xx responses as well. Since a failed resource will usually time out or send a 404/500, this should work well.


As I mentioned before, my LTM sends traffic to Apache, which then sends it to our app instances via mod_weblogic. So, what happens when the app instances are down? I’d expect a 404 or 500 from Apache, right? Sure, as long as your application folks haven’t configured it to send an HTTP 302 (a temporary redirect) so users go to a custom error page when the app instances are down.


So, here’s what we’ve seen:


1. During normal conditions, the resource returns a 301 for its health check.

2. If application instances are down, the resource returns a 302 for its health check.


Naturally, we need to modify our receive string from


HTTP/1\.(0|1) (2|3)

to

HTTP/1\.(0|1) (2|301)


We’re still allowing any 2xx response, but among redirects we now allow only 301s; a 302 will no longer pass the check.
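Since the BIG-IP receive string is evaluated as a regular expression, it’s easy to sanity-check it with plain Tcl before deploying it (iRules are Tcl-based, so this is a natural fit). Here’s a minimal tclsh sketch; the sample status lines are just illustrative:

# Quick test of the receive-string regex against sample status lines
set recv_regex {HTTP/1\.(0|1) (2|301)}

foreach status_line [list \
    "HTTP/1.1 200 OK" \
    "HTTP/1.1 301 Moved Permanently" \
    "HTTP/1.1 302 Found" \
    "HTTP/1.1 500 Internal Server Error"] {
    if { [regexp $recv_regex $status_line] } {
        puts "$status_line => PASS (pool member stays up)"
    } else {
        puts "$status_line => FAIL (pool member marked down)"
    }
}

Note how the 302 fails while the 301 and the 2xx response pass, which is exactly the behavior we’re after.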


We’ve done what we set out to do: configure a health check that accurately determines the system’s health. As you’ve noticed, though, it required trial and error and a lot of testing. When determining a health check strategy, it’s critical that either you or an application owner understand how the application behaves while it’s working and, even more importantly, when it’s not. Also, it’s not always wise to “set and forget” these checks. If, for instance, our application folks changed the “/login.jsp” redirect from a 301 to a 302, the check would fail, and we’d have to come up with a new strategy.


Someone recently asked me what “application delivery” meant. Those who have read my blog will notice that many of my topics touch on application delivery but never really offer a simple definition. From my perspective, it’s the effort of getting content from the web server to the client.

It’s such a simple concept, and yet there’s so much involved. An infrastructure must be designed that spreads requests across multiple servers, monitors the availability of those servers to ensure they can service requests, offloads tasks from the servers where it can do so efficiently, monitors the performance of transactions, and possibly even optimizes delivery speed using WAN acceleration, compression, or caching.

The more requests a site handles, the more magnified each of those components becomes. While our infrastructure is small enough that we don’t notice much of an impact when altering our session dispatch method from “round-robin” to “least amount of traffic” or “fewest connections,” someone like Amazon obviously does. For Amazon, optimizing the dispatch method can mean requiring hundreds fewer servers, and in turn realizing hundreds of thousands of dollars in cost savings. Add the ability to offload SSL processing and Layer 7 manipulation from the servers to an Application Delivery Controller, and the savings are even greater, not to mention the increased revenue brought about by clients completing their transactions more quickly.

One thing I have definitely touched on before is how  most companies should have a single person manage their “application delivery design.” The manager of these technologies, which we’ve called the Application Delivery Architect, is a position that should pay for itself in reduced infrastructure costs and increased customer spending. Unfortunately, as companies have become complacent and stuck in their designs, an App Delivery Architect is rarely required because a re-design is often impossible. I’ve noticed most places simply re-design one component of their infrastructure at a time. It’s a very “year-to-year” form of thinking. For instance, if customer demand has caused stability issues, a company might simply add more servers to a farm, rather than looking at how scalable the front-end application is or whether offloading tasks to an ADC might be a more effective solution. When there isn’t anyone tasked and empowered with creating an actual vision for application delivery, a company will likely struggle to reach a truly efficient solution.