
If you chose to work in IT as a profession, you probably didn't do so in order to be recognized. If you work on the Infrastructure side of IT, you likely encounter plenty of people who don't know what you do. My team, which encompasses Data Center, Network, and Telephony, actually has the mission statement "We Work so You Can." It's through our laying a technical foundation that our customers are able to use their applications to generate revenue. While it's unlikely that a customer realizes how my team fits into their experience, they will certainly learn about and notice us if we have an outage. The way I've always explained it to people is that my goal isn't to make money, it's to prevent our organization from losing it. When I choose to replace an End-of-Life switch, it's not going to make us any more money. Certainly the newer hardware might offer enhancements that speed things up, but tying that to increased revenue is hard. Instead, I replace the switch because the old one has a higher risk of failure, and a failure could take our procurement systems offline.

Take QoS, for instance. Since our campus consists of fiber-connected buildings and all of our applications run over our LAN, bandwidth has never really been a concern. Of course, before we could do a full-campus VoIP conversion, we had to do a network readiness assessment. Since I couldn't find a good wiki link and don't want to force folks to click on a sponsored link, especially one from which I don't make money, I'll explain a bit about the assessment. Since VoIP runs over UDP and not TCP, lost or delayed packets are never retransmitted, so calls are easily affected by latency, jitter, and packet loss. This often manifests as dropped words, echoes, static, etc. In order to ensure a network is properly capable of carrying voice calls, a vendor will formulate a test plan and measure the network's readiness. Since we have about 850 phone users on campus, we had to come up with a maximum number of concurrent calls to simulate. While the number we chose is of no consequence, it's the exercise that I found very interesting.

We knew going into the assessment that we might peak at around 100 concurrent calls. So, if we'd used 100 calls, we could have certified our readiness for typical load conditions. Of course, we also knew that we were growing at 20% a year and adding roughly 50 employees during the same time period. If we assume our network and phone system need to last 5 years, we can extrapolate that number to about 200 concurrent calls. Since our system had to be designed around and tested with a certain maximum number of calls, someone had to pick that number. Since whoever picked it would probably be held accountable if our real-world use ever exceeded it and a failure happened, I imagine the number would be pretty high. Having QoS enabled was also part of the check. Since our LAN was designed to never be the bottleneck, we didn't really need it on. That's the thing about technologies like QoS. In most cases, QoS doesn't fix a problem. It prevents a problem from happening in the first place! If I have bandwidth issues, I should implement QoS to prioritize traffic, but I should also investigate whether I need additional bandwidth. QoS is really just there to minimize the impact of bandwidth constraints on our critical systems. Interestingly enough, we had some spirited debates because we didn't feel that our not having QoS enabled on our LAN should be a success criterion for the assessment. When our vendor told us that a single user could fully saturate our LAN by FTPing something, I knew he couldn't prove it. He'd have to find an FTP site that could read and write at 100Mbps just to saturate a single user's link...good luck finding a client machine that could even handle a transfer like that!
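If you're curious about the back-of-the-envelope math behind that ~200 number, here's a quick sketch in Python. The inputs are the figures from above (100 peak calls, 20% annual growth, a 5-year lifespan); whether you treat the growth as linear or compound is a judgment call on my part, and the two models give you different answers.

```python
# Rough capacity-planning sketch using the numbers from this post.
# Assumption (mine): growth in headcount translates directly into
# growth in concurrent call volume.

peak_calls_today = 100   # estimated peak concurrent calls today
annual_growth = 0.20     # 20% growth per year
years = 5                # expected life of the network and phone system

linear = peak_calls_today * (1 + annual_growth * years)       # 200
compound = peak_calls_today * (1 + annual_growth) ** years    # ~249

print(f"Linear growth estimate:   {linear:.0f} concurrent calls")
print(f"Compound growth estimate: {compound:.0f} concurrent calls")
```

Either way, the point stands: someone has to commit to a number, and whoever does will probably round up.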

That's when I believe "What if You're Wrong?" thinking comes into play. Many of us refer to it as being "What If'd" to death. Having a solution challenged is extremely valuable because it forces the designer to consider possible risks. There should be a limit to the validity of those challenges, though. Since there's always a "What If" scenario around which you can't design (nuclear war comes to mind), the risk discussion can't go on forever. The problem system designers often face is knowing that most of the built-in redundancies might never be used. Imagine only having your work noticed when something goes wrong. I suppose that's often IT in a nutshell. It's also during failures that blame begins to be assigned. If your system has an adequate level of redundancy, the outage will be transparent to most users and perhaps others in IT will commend you for your good planning. On the other hand, if you didn't build in enough redundancy or didn't account for a certain risk, you'll be noticed by everyone, and not in a good way. While we might not believe that QoS is required or that it's a good use of our time, we'd still be prone to implementing it because there's so little upside to skipping it and so much downside. If having QoS enabled prevents a problem 1 time in 100,000, I'd almost be tempted to implement it anyway, because I gain nothing by skipping it the other 99,999 times when nothing goes wrong.

That brings us to two options.

1. Design a solution with every risk in mind, building in redundancy that might never be used and that, even if it is, might not get you any commendation beyond "way to do your job."

2. Design the most cost-effective solution, one that doesn't necessarily plan for every risk.

Naturally, most of us are prone to option 1. Even though people constantly spend time estimating the probability of certain risks and assigning a dollar value to them (lost revenue x probability of the risk occurring), we still play it safe. One reason might be that system designers don't experience any of the benefits of a cost-effective solution. They certainly experience the negatives of not having an extremely resilient one, though!
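To make that parenthetical concrete, here's a tiny sketch of the expected-value math. Every figure in it is a made-up placeholder, not a number from our environment.

```python
# Hedged sketch of the "lost revenue x probability of risk" calculation.
# All numbers below are hypothetical placeholders.

outage_revenue_loss = 250_000   # revenue lost if the failure happens
annual_probability = 0.02       # chance of that failure in a given year
mitigation_cost = 15_000        # e.g., replacing the End-of-Life switch

expected_annual_loss = outage_revenue_loss * annual_probability  # 5,000/year

print(f"Expected annual loss without mitigation: ${expected_annual_loss:,.0f}")
print(f"One-time mitigation cost:                ${mitigation_cost:,.0f}")
# On a pure expected-value basis, the mitigation pays for itself in
# mitigation_cost / expected_annual_loss years -- here, 3 years.
print(f"Break-even horizon: {mitigation_cost / expected_annual_loss:.0f} years")
```

The math is easy; the hard part is that the person doing it rarely shares in the savings of the cheaper option, but always shares in the blame when the risk hits.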

Outside the technical world, we all face the "what if you're wrong" question all the time. Consider the number of deathbed religious conversions that happen and why. From my perspective, there's no downside for someone about to die to become religious. From their point of view, if there is a heaven and they're about to experience death, they should obviously increase their chances by whatever means they can. The reason most of us get insurance isn't because we expect to come out ahead on it; it's because our lives aren't long enough for variance to even out. If 10 of us pool money together to buy insurance and only 1 of us ever uses it, 9 of us have essentially thrown away money, but since none of us could have afforded the injury on our own, it was still the right thing to do.

Since I've jumped around quite a bit, I should probably re-state my theme. Asking "What if" or "What if you're wrong" is an extremely valid question in IT. It should, however, be exercised in moderation. A systems designer should have clear requirements about uptime and, from there, exercise their own discretion given a fixed budget. If you have $10,000 and want someone to create a hosting solution, you should be prepared to have a dialogue about your requirements. You can probably get 99.99% for that, but if you want that extra 9, you obviously have to change your budget as well. Given a clear budget and a clear uptime requirement, an architect or engineer should analyze risk and build in the proper redundancy. It's not fair to leave an IT person to assume what level of uptime someone wants, because we'll always assume 100% uptime and design accordingly. We're not usually recognized for uptime...we're noticed for downtime. That naturally leads us to "over-build" solutions.
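If you've never done the math on what those nines actually buy you, here's a quick sketch; it's nothing more than arithmetic on the availability targets.

```python
# What each "nine" of uptime allows in downtime per year.

MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (0.999, 0.9999, 0.99999):
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} uptime -> {downtime:,.1f} minutes of downtime/year")
```

Going from 99.99% to 99.999% means shaving your allowed downtime from roughly 53 minutes a year to about 5, which is exactly why that extra 9 costs so much more than the budget conversation usually assumes.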