Service level agreements are a calculated risk, so make sure you calculate carefully

When a power failure at the Hurricane Electric data center took this website down for five hours earlier this year, I looked up the Linode service level agreement (SLA) I had accepted and found that I probably made a good cost/benefit choice. Even so, it's worth revisiting at some point.

Upgrading to a platform with significantly higher reliability could be costly, though. It would mean finding a hosting company that uses data centers and other infrastructure with a better SLA than Linode apparently has with Hurricane.

I don’t know the details of that SLA, but Linode promises me three nines – 99.9% – uptime, which means its deal with Hurricane should have been at least four nines.

A data center is only one element in Linode’s reliability chain, which also includes its own server hardware and Internet bandwidth, among other things. To get three nines of overall reliability, you have to, for example, string together a system made up of ten or fewer elements that each deliver four nines: when elements are lined up in series, their chances of failure add up, at least to a close approximation when each is small. So if there are ten elements in a row, each with a .01% chance of failure, the overall chance of failure is roughly .1%. A data center promising four nines (which is a pretty pathetic SLA) is, in effect, promising to be down for less than an hour a year.
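
To make the arithmetic concrete, here’s a rough back-of-the-envelope sketch in Python. The availability figures are illustrative, not pulled from Linode’s or Hurricane Electric’s actual SLAs, and it assumes the elements fail independently.

```python
# Back-of-the-envelope reliability math for elements chained in series.
# Availability figures below are illustrative, not from any actual SLA.

HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_hours(availability):
    """Annual downtime implied by a given availability (e.g. 0.999)."""
    return (1 - availability) * HOURS_PER_YEAR

def series_availability(availabilities):
    """Overall availability of elements in series: everything must be up."""
    overall = 1.0
    for a in availabilities:
        overall *= a
    return overall

print(downtime_hours(0.999))   # three nines: roughly 8.8 hours a year
print(downtime_hours(0.9999))  # four nines: under an hour a year

# Ten four-nines elements in a row land just under three nines overall.
chain = series_availability([0.9999] * 10)
print(chain, downtime_hours(chain))  # ~0.999 overall, ~8.8 hours a year
```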

For the sake of simplicity, I’m ignoring the effect of running elements in parallel – that’s a way to build a high reliability system out of lower reliability elements. Either way you do it, it costs a lot of money to buy yourself an extra nine. At this point, it’s a lot more money than a few extra hours of uptime would be worth to me.
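
For the curious, the parallel case I’m glossing over works like this, again with made-up numbers and assuming the redundant elements really do fail independently:

```python
def parallel_availability(availability, n):
    """Availability of n redundant elements: service is down only if all fail."""
    return 1 - (1 - availability) ** n

# Two independent 99% elements in parallel come out to four nines...
print(parallel_availability(0.99, 2))  # 0.9999
# ...which is how redundancy builds a high-reliability system out of
# lower-reliability parts, provided the failures really are independent.
```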

But that’s just for now and just for me. Any business that relies on telecoms and IT infrastructure – in house or outsourced – needs to run the same cost/benefit analysis, make its choice and make sure it can live with the consequences.

How not to run a telecoms company

Hey, maybe no one will notice that it broke?

It was no black swan event that brought down a big chunk of Hurricane Electric’s data center #2 in Fremont last weekend. Instead, it was an easily foreseeable malfunction that should have been taken into account when the center was designed. According to a postmortem report posted by Linode – the company primarily affected by the outage – when PG&E’s line went out at the facility, a bad battery kept backup power from kicking in…

Seven of the facility’s eight generators started correctly and provided uninterrupted power. Unfortunately, one generator experienced an electromechanical failure and failed to start. This caused an outage which affected our entire deployment in Fremont….

The maintenance vendor for the generator dispatched a technician to the datacenter and it was determined that a battery used for starting the generator failed under load. The batteries were subsequently replaced by the technician. The generators are tested monthly, and the failed generator passed all of its checks two weeks prior to the outage. It was also tested under load earlier in the month.

For the next four and a half hours, the best anyone could do was wait for PG&E to restore service.

Batteries go bad, and often at the worst possible time. And that’s not the only reason a generator might fail. Not designing the system with sufficient excess capacity and load balancing/sharing capability is inexcusable. Sitting around helpless when the inevitable happens is pathetic.

Hurricane Electric hasn’t made any public statements about the incident – at least not that I can find. I’d love to hear their side of it. If a telecoms company wants to rank in the top tier, it needs to take ownership of the problem when a major outage hits. That means having the technical resources available to fix it and communicating effectively and openly about it with the public and not just its immediate customers.

You pays yer money, you takes yer chances

Hurricane Electric should have been prepared for this.

I’ve been reading, negotiating and occasionally writing service level agreements (SLAs) for many years. It’s an abstract exercise – so many nines for so much money. That is, until something happens and you have to figure out whether (1) you stupidly hand-waved the whole thing hoping nothing would happen, (2) the calculated risk you took was worth it, or (3) you got screwed by your service provider.

This morning, none of this seems abstract. Yesterday evening, from about 6:30 p.m. to about 11:30 p.m., this website was offline because the Linode server that hosts it was down. Four companies share responsibility: PG&E for the lengthy power outage in Fremont, Hurricane Electric for the backup generator that didn’t work, Linode for picking a data center without adequate power redundancy and Tellus Venture Associates – aka me – for trusting in Linode to worry about the details.

It’s easy – and fun! – to blame PG&E for all the world’s problems, but power outages are a fact of life and this one doesn’t seem to be particularly egregious. Five hours is a long time but common enough, which is why critical facilities are supposed to have adequate backup capability. Hurricane clearly didn’t. It apparently relied on a single generator for the section of the data center where Linode lives, without proper maintenance and/or a Plan B if it failed.

Even with last night’s outage, Linode is still within its annual downtime quota – about nine hours – based on the SLA I accepted, although it failed on a monthly basis, far exceeding the allowable 43 minutes or so. Which means it owes me a prorated refund for the downtime: as a percentage of the $25 monthly fee I pay, five hours comes out to about 17 cents. Note to Linode: don’t bother.
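
For what it’s worth, here’s how that 17 cents pencils out, assuming a 30-day month for simplicity (the exact credit depends on how the provider counts a month):

```python
MONTHLY_FEE = 25.00          # $20 hosting plus $5 backup service
HOURS_PER_MONTH = 30 * 24    # 720, using a 30-day month for simplicity

outage_hours = 5
allowed_hours = (1 - 0.999) * HOURS_PER_MONTH  # 0.72 hours -- about 43 minutes

prorated_credit = MONTHLY_FEE * (outage_hours / HOURS_PER_MONTH)
print(round(prorated_credit, 2))  # 0.17 -- note to Linode: don't bother
```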

PG&E performed poorly but as expected. Hurricane performed horribly. Even if the generator failure was a one-in-a-bazillion shot – which I seriously doubt – there’s no excuse for just kicking back and waiting for PG&E to fix things.

Linode performed well, from a cost/benefit perspective. It charges me $20 a month (plus another $5 for backup service) for more bandwidth, processing power and disk space than I need, with the clearly stated caveat that it might go down every so often.

So, it comes down to me. Did I make a good choice? I think so. Having this website go dark for a few hours on a Friday night is annoying and might have a lingering effect on traffic, but it’s not going to kill or even dent my business. Typically, though, I evaluate my IT infrastructure once a year, so when I do, I’ll do some comparison shopping to see if an SLA upgrade is worth the money and the time involved.