Tuesday, December 23, 2008

SLAs, availability, and expectations vs. guarantees

I recently wrote a Service Level Agreement-style document to help set expectations about the kind of service one can expect from the Eclipse Webmasters and from the Eclipse servers. You can read it here.  It wasn't really a pleasant experience, and the document is not really groundbreaking.  It's actually quite boring.

I was then introduced to this blog posting, which, as I understand it, essentially states that SLAs are useless when they rely on a blanket metric we call 'availability'.  The author makes an interesting point. I prepared a pretty lengthy reply, but the blog software decided my comment was too long, so I figured I'd post it here.

Interesting article.  I'd appreciate seeing an example of an effective SLA that you have authored, and that you are being held accountable for.

I mean, let's face it -- I'd love to guarantee that every time you load a web page, it will come in under 20 seconds, and that all your email will be in your Inbox within 15 minutes.  But there are those things that, as you say, are not easy to measure.  When will I get hit by the next DoS attack? When will the most important server decide to crash?  When will an IT guy make a critical human error and take some key system down?  If your systems are connected to the Internet, then you're open to a whole world of unknowns.

So, how can I guarantee, with absolute certainty, that email will not go down for a full day on the last day of a quarter (which, I agree, could be extremely damaging to a business)?  Perhaps we spend millions of dollars on redundant high-end hardware, redundant points-of-presence, more process for staff, and more staff (to help maintain all this hardware). Easy enough.  But now I must increase my prices to afford this wonderful SLA, driving customers to cheaper competitors, which damages my business just as much.
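To put rough numbers on that scenario, here is a minimal sketch of the availability arithmetic. The 91-day quarter and the 98.5% and 99% targets are illustrative figures of my own, not values from any real SLA.

```python
HOURS_PER_DAY = 24


def availability(period_days: float, downtime_hours: float) -> float:
    """Fraction of the period during which the service was up."""
    total_hours = period_days * HOURS_PER_DAY
    return (total_hours - downtime_hours) / total_hours


def downtime_budget_hours(period_days: float, target: float) -> float:
    """Hours of downtime allowed in the period at a given availability target."""
    total_hours = period_days * HOURS_PER_DAY
    return total_hours * (1 - target)


# Assumed figures: a 91-day quarter with one 24-hour email outage.
quarter_days = 91
print(f"Availability after a full-day outage: {availability(quarter_days, 24):.2%}")

for target in (0.985, 0.99):
    budget = downtime_budget_hours(quarter_days, target)
    print(f"Target {target:.1%}: {budget:.1f} hours of downtime allowed")
```

With these assumed figures, a full-day outage still leaves availability at roughly 98.9% for the quarter. It would break a 99% target, yet a 98.5% target would technically be met, no matter which day the outage landed on.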

Then again, sometimes spending massive amounts of money on infrastructure isn't even enough.

So the alternative is to write an 'effective' SLA whose service expectation metrics are so forgiving that they make little more sense than the availability metric, and that certainly don't match users' expectations any more accurately.

Of course, the odds of a catastrophic failure in a properly executed IT infrastructure are very low, making it easy to set reasonable expectations. You *should* expect your email within 15 minutes. But in between expectation and guarantee is this thing those out-of-alignment IT people call 'budget'.

