Service level agreements are a calculated risk, so make sure you calculate carefully

26 December 2015 by Steve Blum

When a power failure at the Hurricane Electric data center took this website down for five hours earlier this year, I looked up the Linode service level agreement (SLA) I had accepted. I found that I probably made a good cost/benefit choice, but even so, it’s worth revisiting at some point.
Upgrading to a platform with significantly higher reliability could be costly, though. It means I have to find a hosting company that uses data centers and other infrastructure with a better SLA than Linode apparently has with Hurricane.
I don’t know the details of that SLA, but Linode promises me three nines – 99.9% – uptime, which means its deal with Hurricane should have been at least four nines.
A data center is only one element in Linode’s reliability chain, which includes its own server hardware and Internet bandwidth, among other things. In order to get three nines of overall reliability, you have to, for example, string together a system composed of ten or fewer elements, each with four nines: when elements are lined up in series, their chances of failure add up, at least approximately when each chance is small. So if there are ten elements in a row, each with a 0.01% chance of failure, the overall chance of failure is about 0.1%. A data center promising four nines (which is a pretty pathetic SLA) is, in effect, promising to be down for less than an hour a year: 0.01% of the 8,760 hours in a year is about 53 minutes.
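The series arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: failures are treated as independent, a year is taken as 8,760 hours, and the function names are my own inventions rather than anything from Linode or Hurricane Electric.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (an assumption)


def series_availability(availabilities):
    """Availability of elements in series: every element must be up.

    Assumes the elements fail independently of each other.
    """
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Ten hypothetical elements in a row, each promising four nines.
chain = [0.9999] * 10
overall = series_availability(chain)

print(f"overall availability of the chain: {overall:.5%}")
print(f"four nines allows {downtime_hours_per_year(0.9999) * 60:.1f} minutes down per year")
print(f"three nines allows {downtime_hours_per_year(0.999):.2f} hours down per year")
```

The exact product comes out a hair better than three nines, because simply adding the failure chances slightly overstates the risk. At these small numbers, the difference is negligible.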
For the sake of simplicity, I’m ignoring the effect of running elements in parallel – that’s a way to build a high reliability system out of lower reliability elements. Either way you do it, it costs a lot of money to buy yourself an extra nine. At this point, it’s a lot more money than a few extra hours of uptime would be worth to me.
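For the curious, here is a rough sketch of the parallel case I set aside. It assumes the redundant elements fail independently and that failover between them is perfect, which real systems rarely manage, so treat the result as a best case. The function name is hypothetical.

```python
def parallel_availability(availabilities):
    """Availability of redundant elements in parallel: at least one must be up.

    Assumes independent failures and perfect failover (both optimistic).
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a  # chance that this element is down too
    return 1.0 - all_down


# Two hypothetical two-nines elements running side by side.
pair = parallel_availability([0.99, 0.99])
print(f"two 99% elements in parallel: {pair:.4%}")
```

On paper, two elements with only two nines each combine into four nines, but you pay for every element twice, plus whatever it takes to switch between them.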
But that’s just for now and just for me. Any business that relies on telecoms and IT infrastructure – in-house or outsourced – needs to run the same cost/benefit analysis, make its choice and make sure it can live with the consequences.