Cloud Society: Downtime problems start with people

There's an inherent weakness in cloud SLAs: people. And until we get full automation, there always will be

The holiday period has been beset by cloud failures. On the day before Christmas, one part of Amazon Web Services went down, taking film-on-demand provider Netflix, one of its customers, down with it.

The outage lasted from Christmas Eve through to Christmas Day, and no doubt the disappointment of thousands of film watchers was nothing compared to the plight of the poor engineers who had to forgo their slap-up dinners and go to work.

And then, on 28 December, Microsoft lost a section of its Azure Storage service for 66 hours due to "a networking issue", which also affected the management portal and, inevitably, the ability to create new virtual disks. At least in that case the technicians could get home in time for the party to start.

So should we all be alarmed? Perhaps not, but the outages offer food for thought. Amazon's SLA states that "AWS will use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage of at least 99.95 percent during the Service Year." A day's outage equates to about 0.3 percent of the year, which is considerably bigger (note the accuracy) than the 0.05 percent downtime target. Azure offers a 99.9 percent SLA on its storage, or a 0.1 percent downtime allowance, and a 66-hour outage overshoots that by an (ahem) even bigger margin.
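For anyone who wants to check my non-mathematician's sums, a short Python sketch makes the gap plain. The only inputs are the outage lengths quoted above; everything else is simple arithmetic.

    # Downtime budgets implied by the SLAs versus the actual outages.
    HOURS_PER_YEAR = 365 * 24  # 8,760

    def downtime_budget_hours(uptime_pct):
        """Hours of downtime per year that a given uptime percentage allows."""
        return HOURS_PER_YEAR * (100 - uptime_pct) / 100

    def outage_as_pct_of_year(outage_hours):
        """An outage expressed as a percentage of the year."""
        return 100 * outage_hours / HOURS_PER_YEAR

    print(downtime_budget_hours(99.95))  # ~4.4 hours allowed under Amazon's EC2 SLA
    print(outage_as_pct_of_year(24))     # ~0.27% of the year: the day-long outage
    print(downtime_budget_hours(99.9))   # ~8.8 hours allowed under Azure's storage SLA
    print(outage_as_pct_of_year(66))     # ~0.75% of the year: the 66-hour incident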

Clearly I'm not a mathematician, but as an ex-screwdriver-wielding techie, I can see a flaw in the way these things are being calculated. The only way that downtime can be limited to 0.05 percent for any particular element of the infrastructure - that is, such that the time from fault to resolution is measured in hours rather than days - is through automation: when an error can be resolved with no manual intervention whatsoever, for example by re-allocating resources to clear a bottleneck.
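To make that concrete, here is a minimal sketch of the kind of hands-off remediation loop I mean. It is entirely illustrative - the threshold and function names are my own invention, not any provider's actual API - but it shows the essential property: detection and resolution happen in seconds, with nobody paged and no sandwich eaten.

    import random
    import time

    CPU_THRESHOLD = 80   # percent; assumed trigger for a capacity bottleneck
    CHECK_INTERVAL = 5   # seconds between health checks (short, for the demo)

    def read_cpu_utilisation():
        # Stand-in for a real monitoring call; here we simply simulate a reading.
        return random.uniform(50, 100)

    def add_capacity():
        # Stand-in for a real provisioning call (spin up another instance, say).
        print("Bottleneck detected - extra capacity allocated, no human involved")

    def remediation_loop(checks=5):
        for _ in range(checks):
            if read_cpu_utilisation() > CPU_THRESHOLD:
                add_capacity()  # the whole fault-to-fix cycle takes seconds
            time.sleep(CHECK_INTERVAL)

    remediation_loop()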

As soon as the issue requires people, the time ticks away rapidly. Time to diagnose, time to get the right people involved, time to decide what to do, time to get from one side of the building to the other, time to eat a sandwich and go for a pee. It all adds up and can very quickly exceed the downtime allowance implied by the SLA. The massively complex environments now involved (however simple they may look from the outside) exacerbate the issues and extend the time still further should a problem occur.

Being realistic, we have to face the fact that today's super-slick data centres are not yet the bastions of perfection we'd like them to be, any more than the engineers we employ are all-seeing and all-knowing. In other words, while Amazon, Microsoft, Google, Oracle and their ilk may profess super-high levels of uptime as targets, extensive cloud users should set their expectations an order of magnitude lower. 

Does this mean ruling out the cloud in forward planning? Of course it doesn't. However, it does mean planning for failure, in the same way as has always been necessary with in-house systems. Should a service cease to be available for one, two, three days, what does that mean to your business or your customer relationships? Are there periods when it would matter less, and periods when you will need some kind of contingency plan? What if you lost a service for a week or two - could you cope then?

As so often with IT, the answer is not to panic but to move forward with eyes open, in the knowledge that things can go wrong. Moving to the cloud doesn't diminish the responsibility for assuring service delivery, whoever the provider may be.