September 29, 2020 (updated 05 Oct 2020 8:34am)

Microsoft Wobbles Again: Do Azure Staging Procedures Need a Rethink?

"Has anyone started having discussions with their CIO/CEO about moving back to an in-house mail server? I advocate for it"

By CBR Staff Writer

Given the scale of its user base and with a contract worth up to $10 billion in the bag to run the back-end of a superpower’s military, Microsoft might want to start thinking about how it can establish a staging procedure for its Azure cloud that allows it to deploy changes and reliably roll back those changes when things break.

(We know, it is easy to say so from a safe distance…)

Redmond was at it again late Monday, knocking an (apparently substantial) “subset of customers in the Azure Public and Azure Government clouds” offline for three hours. Swathes of users globally encountered errors when performing authentication operations, and multiple services were affected, including Microsoft 365.

The company blamed the issue on a “recent configuration change [that] impacted a backend storage layer, which caused latency to authentication requests.” (Read: users couldn’t log in to Teams, Azure and more for hours because of the snafu.)

Users felt the impact from 22:25 BST on September 28, 2020 to 01:23 BST the following morning.

UPDATED: Azure said in a root cause analysis: “A service update targeting an internal validation test ring was deployed, causing a crash upon startup in the Azure AD backend services. A latent code defect in the Azure AD backend service Safe Deployment Process (SDP) system caused this to deploy directly into our production environment, bypassing our normal validation process.

“Azure AD is designed to be a geo-distributed service deployed in an active-active configuration with multiple partitions across multiple data centers around the world, built with isolation boundaries. Normally, changes initially target a validation ring that contains no customer data, followed by an inner ring that contains Microsoft only users, and lastly our production environment. These changes are deployed in phases across five rings over several days.”

Microsoft added: “In this case, the SDP system failed to correctly target the validation test ring due to a latent defect that impacted the system’s ability to interpret deployment metadata. Consequently, all rings were targeted concurrently. The incorrect deployment caused service availability to degrade. Within minutes of impact, we took steps to revert the change using automated rollback systems which would normally have limited the duration and severity of impact. However, the latent defect in our SDP system had corrupted the deployment metadata, and we had to resort to manual rollback processes. This significantly extended the time to mitigate the issue.”
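For readers unfamiliar with ring-based rollouts, the idea Microsoft describes can be sketched roughly as follows. This is an illustrative sketch only, not Microsoft's SDP: the ring names, the `target_ring` metadata key, and the `Rollout` class are all hypothetical. It shows one way a deployment tool might fail closed when it cannot interpret its metadata, rather than targeting every ring at once.

```python
from dataclasses import dataclass, field

# Hypothetical ring names, ordered from least to most customer exposure.
# The article only says there are five rings; these labels are assumptions.
RINGS = ["validation", "inner", "prod-1", "prod-2", "prod-3"]


class MetadataError(ValueError):
    """Raised when deployment metadata cannot be interpreted safely."""


def resolve_target_ring(metadata: dict) -> str:
    """Return the single ring a deployment targets, failing closed."""
    ring = metadata.get("target_ring")  # hypothetical metadata key
    if ring not in RINGS:
        # An unreadable target must never widen the blast radius.
        raise MetadataError(f"unknown or missing target ring: {ring!r}")
    return ring


@dataclass
class Rollout:
    version: str
    deployed: list = field(default_factory=list)

    def advance(self, metadata: dict, is_healthy) -> bool:
        """Deploy to the next ring only; roll back if the health check fails."""
        target = resolve_target_ring(metadata)
        if len(self.deployed) >= len(RINGS):
            raise MetadataError("rollout already complete")
        expected = RINGS[len(self.deployed)]
        if target != expected:
            # Refuse to skip ahead, e.g. straight to production.
            raise MetadataError(f"expected ring {expected!r}, got {target!r}")
        self.deployed.append(target)
        if not is_healthy(target):
            self.rollback()
            return False
        return True

    def rollback(self) -> list:
        """Revert rings in reverse order and return the order used."""
        reverted = list(reversed(self.deployed))
        self.deployed.clear()
        return reverted
```

In this sketch a rollout can only move one ring at a time, and metadata that cannot be read stops the deployment instead of broadening it. The rollback path here relies only on the rollout's own in-memory record. A real system would need its rollback state stored somewhere the deployment itself cannot corrupt, which is the part Microsoft says failed in this incident.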

The issue comes a fortnight after a protracted outage in Microsoft’s UK South region triggered by a cooling system failure in a data centre. With temperatures rising, automated systems shut down all network, compute, and storage resources “to protect data durability” as engineers rushed to take manual control.

Earlier this month, meanwhile, Gartner said it “continues to have concerns related to the overall architecture and implementation of Azure, despite resilience-focused engineering efforts and improved service availability metrics during the past year”.

Microsoft Azure CTO Mark Russinovich in July 2019 said that Azure had formed a new Quality Engineering team within his CTO office, working alongside Microsoft’s Site Reliability Engineering (SRE) team to “pioneer new approaches to deliver an even more reliable platform” following customer concern at a string of outages.

He wrote at the time: “Outages and other service incidents are a challenge for all public cloud providers, and we continue to improve our understanding of the complex ways in which factors such as operational processes, architectural designs, hardware issues, software flaws, and human factors can align to cause service incidents.”

“Has anyone started having discussions with their CIO/CEO about moving back to an in-house mail server? I advocate for it,” one frustrated user noted on a global Outages mailing list. If cloud is your compressed audio stream that you’re not sure you own, it may not be long before in-house mail servers become the vintage-quality vinyl of the IT world: old, but very much back in demand.

Stranger things have happened.
