Windows Azure outage leaves customers reeling

News Jane McCallion Feb 26, 2013

Services based on the Microsoft cloud inaccessible for 12 hours

Microsoft’s cloud computing platform Windows Azure was hit by an outage at the weekend that left customers in the dark for 12 hours.

The failure originated in Windows Azure Storage and was caused by Microsoft forgetting to renew a security certificate. This led to a cascading series of failures in other critical parts of Azure, which eventually brought down Xbox Live as well.

Microsoft says service was restored to 99 per cent of users by 1.00pm PST (9.00pm GMT) on 23 February, with full restoration achieved at 8.00pm (4.00am GMT, 24 February).

The company has said it will provide credits to affected customers in accordance with its SLA, which will be applied to a future invoice, although it did not say when.

Steve Martin, general manager of Windows Azure Business & Operations, said in a blog post: “Our teams are also working hard on a full root cause analysis (RCA), including steps to help prevent any future reoccurrence.”

“We sincerely apologise for the interruption and any issues it has caused,” he added.

However, this is the second time the Redmond giant’s cloud service has fallen foul of a certificate issue.

Almost exactly a year ago, Windows Azure was toppled by a ‘leap day bug’, which caused the service to try to create a transfer certificate with a ‘valid-to’ date of 29 February 2013, a date that does not exist. The invalid date caused the certificate creation to fail, which in turn caused the system to crash. The outage on that occasion lasted eight hours.
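As an illustration only, the kind of date arithmetic behind a leap day bug can be sketched in a few lines of Python. This is not Microsoft's code: the function names naive_valid_to and safe_valid_to are hypothetical, and the sketch simply assumes a certificate that is meant to be valid for one year from its issue date.

```python
from datetime import date


def naive_valid_to(issued: date) -> date:
    """Hypothetical buggy version: keep day and month, add one to the year.

    Raises ValueError when issued is 29 February, because the following
    year has no such date.
    """
    return issued.replace(year=issued.year + 1)


def safe_valid_to(issued: date) -> date:
    """Hypothetical fix: fall back to 28 February when 29 February does not exist."""
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        # Only 29 February can fail here, so clamp to the 28th.
        return issued.replace(year=issued.year + 1, day=28)
```

Calling naive_valid_to(date(2012, 2, 29)) raises a ValueError, while safe_valid_to returns 28 February 2013 for the same input. Clamping to the 28th is one possible design choice; rolling forward to 1 March would be another.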

Industry insiders said the most recent outage highlights the need to correctly manage certificates.

Paul Ayers, VP EMEA at Vormetric, said: “Although the issue has now been fixed [it shows] that managing certificates remains a significant problem for most enterprises today.

“In the same way that Microsoft will be evaluating what happened to Azure and how something slipped through the cracks, enterprises need to re-evaluate and establish a certificate management strategy.”
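One small building block of such a strategy is a routine check for certificates that are close to expiry. The Python sketch below is an assumption-laden illustration rather than a description of any vendor's tooling: the function expiring_soon, the 30-day warning window and the idea of an inventory mapping certificate names to expiry times are all hypothetical.

```python
from datetime import datetime, timedelta


def expiring_soon(certs, now, warn_days=30):
    """Return (name, days_left) pairs for certificates expiring within warn_days.

    certs is a hypothetical inventory: a dict mapping a certificate name to
    its expiry datetime. Certificates that have already expired are included
    with a negative days_left. Results are sorted with the soonest expiry first.
    """
    threshold = now + timedelta(days=warn_days)
    flagged = [
        (name, (not_after - now).days)
        for name, not_after in certs.items()
        if not_after <= threshold
    ]
    return sorted(flagged, key=lambda item: item[1])
```

Run on a schedule against a complete inventory, a check like this would flag a certificate weeks before it lapses, although it only helps if every certificate is actually recorded in that inventory.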

Microsoft has promised to publish full details of the most recent outage, including what it will do to prevent a recurrence, once it has finished its RCA.

 

  • Article originally published on 25 February 2013 and updated on 26 February 2013.