Leap year bug brought down Windows Azure: Microsoft

Microsoft has claimed that yesterday’s Azure outage, which lasted for hours and took down websites across the world, was caused by a software bug related to the leap year.

Writing on the company’s blog, Bill Laing, Corporate VP Server and Cloud, said: "Yesterday, February 28th, 2012 at 5:45 PM PST Windows Azure operations became aware of an issue impacting the compute service in a number of regions. The issue was quickly triaged and it was determined to be caused by a software bug."

"While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year. Once we discovered the issue we immediately took steps to protect customer services that were already up and running, and began creating a fix for the issue," he added.

The hugely embarrassing outage began during the early hours (UK time) of February 29 and knocked out a number of websites, including the UK government’s recently-launched CloudStore.
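Microsoft had not published the exact faulty calculation at the time of writing, so the following is only an illustrative sketch of how a leap-day date bug can arise, not a reconstruction of Azure's code. It assumes a routine that computes "one year from today" by incrementing the year field, which fails when today is February 29. The function names are hypothetical.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Assumes the same month and day exist in the following year.
    # For February 29 this raises ValueError, because the next year
    # is not a leap year and has no February 29.
    return date(d.year + 1, d.month, d.day)


def safe_one_year_later(d: date) -> date:
    # Same calculation, but falls back to February 28 when the
    # target year has no February 29.
    try:
        return date(d.year + 1, d.month, d.day)
    except ValueError:
        return date(d.year + 1, 2, 28)
```

Called with `date(2012, 2, 28)`, both functions return `date(2013, 2, 28)`. Called with `date(2012, 2, 29)`, the naive version raises `ValueError`, while the safe version returns `date(2013, 2, 28)`. Code of this kind can run correctly for years and fail only on a leap day.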

Many of the affected services, including the main Windows Azure website and the Service Status Dashboard, were up and down for long periods yesterday. Microsoft now says it is in control of the situation but that some customers may still be experiencing problems.

"The fix was successfully deployed to most of the Windows Azure sub-regions and we restored Windows Azure service availability to the majority of our customers and services by 2:57AM PST, Feb 29th," the blog said.

"However, some sub-regions and customers are still experiencing issues and as a result of these issues they may be experiencing a loss of application functionality," Laing went on. "We are actively working to address these remaining issues."

More details about the issue should be revealed when a full root cause analysis is completed, Microsoft said. Laing once again apologised to affected customers.

Gartner analyst Kyle Hilgendorf criticised Microsoft for its communication during the outage: "Looking back to 2011 and the AWS and Microsoft outages it became very clear that frequent status updates are paramount during an outage. AWS led the way with 30-45 min outage updates through their painful EBS outage and Ireland issues."

"While updates don’t solve the problem, they do demonstrate customer advocacy and concern. Some customers told me this morning they feel completely in the dark. There is no reason why a cloud provider should not have a dedicated communication team providing at least 30 min updates throughout the entire outage," he added. "Microsoft seems to be in a good cadence late this morning on more frequent updates, but there were large gaps in updates when the outage first occurred."

Hilgendorf said that service status dashboards should always be hosted separately from the cloud service itself, so customers can always see what is happening with the service.

He also suggested that cloud providers should measure performance and response metrics as well as availability when it comes to monitoring outages and issues.
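As a rough illustration of the difference between the two kinds of measurement, the sketch below scores the same set of monitoring probes twice: once on availability alone, and once on whether each request also met a response-time target. The `Probe` type, the function names, the sample values and the 1,000 ms threshold are all invented for this example and do not reflect how any provider actually measures its service.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    ok: bool           # did the request succeed?
    latency_ms: float  # how long the request took


def availability(probes: list[Probe]) -> float:
    # Fraction of probes that succeeded, regardless of speed.
    return sum(p.ok for p in probes) / len(probes)


def within_response_target(probes: list[Probe], max_latency_ms: float) -> float:
    # Fraction of probes that succeeded AND met the latency target.
    good = sum(p.ok and p.latency_ms <= max_latency_ms for p in probes)
    return good / len(probes)


# Invented sample: three successful requests (two of them very slow)
# and one failure.
sample = [
    Probe(True, 120.0),
    Probe(True, 4500.0),
    Probe(True, 5200.0),
    Probe(False, 0.0),
]
```

On this sample, `availability(sample)` is 0.75, while `within_response_target(sample, 1000.0)` is 0.25. A report based on uptime alone would count the two slow requests as healthy.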

"Azure’s health dashboard and communication originally communicated that only 3.8% of customers were affected with this outage. There was no context around where the 3.8% came from or how it was measured but I spoke to several customers this morning that suspect they were not included in the 3.8%."

"Because most provider SLAs are based upon uptime and availability, and not performance or response, these outages may not be reported as being affected," he added. "Providers MUST start including performance and response SLAs into their standard service. A degraded service is often as impactful as a down service."
