You are currently browsing the tag archive for the ‘health checks’ tag.

As I discussed in this post, sending an HTTP GET for a page on a server to which you load balance traffic is one of the better health checks available. If you use the right page, it can be an extremely light-weight, yet highly reliable check.

In order to properly utilize these health checks, you need to know enough about the application you’re supporting to understand how it behaves when it fails.

In my case, I send traffic to a pool of Apache Servers running mod_weblogic. From there, the traffic is sent to application instances.

 

Using an F5 BIG-IP LTM as an example, there are several configurable parameters when defining a health check.

 

1. Interval (How often the check is sent)

2. Timeout (How long does the resource have to respond)

3. Send String (The request you’re sending the resource)

4. Receive String (What response causes the health check to pass?)

5. Receive Disable String (What response causes the health check to fail)

 

There’s several more, but let’s concentrate on the typical ones.

The default interval is 5 seconds while the timeout is 16. I’ve always been ok with that.

For our send string, let’s do “GET /login.jsp HTTP/1.1\r\nHost: \r\nConnection: Close\r\n\r\n”

So, we’re sending an HTTP GET for /login.jsp using HTTP/1.1 and an empty host header. We’re also closing out the connection so it doesn’t have to sit idle on the server.

For our receive string, let’s do “HTTP/1\.(0|1) (2)”

So, we’re considering a response starting with 2 using HTTP 1.0 or 1.1 as a success. Typically, a server will respond with a 200 when all is well so this is pretty typical.

 

Unfortunately for me, our resource actually sends an HTTP 301 (Permanent Redirect) when a user tries loading the login page. This happens fairly often, especially if you’re sending a health check for “/” and the resource redirects you to a different directory. Since we consider this permanent redirect to be normal behavior, we’ll modify our receive string to “HTTP/1\.(0|1) (2|3)” Now, we’re including all 3** responses as well. Since a failed resource will usually timeout or send a 404/500 when it fails, this should work well.

 

As I mentioned before, my LTM sends traffic to Apache which then sends it to our App instances via mod_weblogic. So, what happens when the app instances are down? I’d expect a 404 or 500 from Apache, right? Sure, as long as your application folks haven’t configured it to send an HTTP 302 (Temporary Redirect) so users go to a custom error page when the App Instances are down.

 

So, here’s what we’ve seen:

 

1. During normal conditions, the resource returns a 301 for its health check.

2. If application instances are down, the resource returns a 302 for its health check.

 

Naturally, we need to modify our Receive String

 

HTTP/1\.(0|1) (2|3)

to

HTTP/1\.(0|1) (2|301)

 

We’re still allowing any 2xx response but are now only allowing 301s.

 

We’ve done what we wanted to. We’ve configured a health check that accurately determines the system’s health. As you’ve noticed though, it required trial and error, and a lot of testing. When determining a health check strategy, it’s critical that either you or an application owner understands their application’s behavior while it’s working, and even more importantly, when it’s not. Also, it’s not always wise to “set and forget” these checks. If, for instance, our application folks changed the “/login.jsp” redirect from a 301 to a 302, the check would fail, and we’d have to come up with a new strategy.