Deja Vu All Over Again: Microsoft in Fresh MFA Meltdown

Fresh from publishing a post-mortem of last week’s multi-factor authentication (MFA) failure, which plagued Azure and Office 365 users globally with login issues for 17 hours, Microsoft today admitted the bugs had re-emerged. And yes, once again, rebooting its servers provided a temporary reprieve.

Microsoft attributed last week’s issue to three root causes, the first two introduced by a code update whose roll-out began in some data centers on Tuesday, 13 November 2018 and was completed on Friday, 16 November 2018.

Read this: Redis Overload to Blame for 17-Hour Azure MFA Login Issue

Azure was also affected again today, and customers were predictably less than happy, many having only recently rolled out MFA to their own users.

Microsoft blamed last week’s outage on a buggy code roll-out, with the issues activated once a certain traffic threshold was exceeded.

The change had been intended to better manage connections to its caching services.

“Unfortunately, this change introduced more latency and a race-condition in the new connection management code, under heavy load. This caused the MFA service to slow down processing of requests, initially impacting the West EU data centres (which service APAC and EMEA traffic).”
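
Microsoft has not published the affected code, so the following is only a generic sketch of the class of bug described: connection bookkeeping that must stay correct when many threads hit it at once. The names `ConnectionLimiter`, `hammer` and `run_demo` are hypothetical and are not taken from Microsoft’s service. Without the lock, the "check the count, then increment it" step could interleave between threads under heavy load, which is the classic shape of a race condition.

```python
import threading


class ConnectionLimiter:
    """Hypothetical guard that caps concurrently open cache connections."""

    def __init__(self, max_connections):
        self._max = max_connections
        self._open = 0
        self._lock = threading.Lock()
        self.peak = 0  # highest number of simultaneously open connections seen

    def try_acquire(self):
        # The check and the increment happen under one lock, so two threads
        # cannot both pass the check and push the count over the cap.
        with self._lock:
            if self._open >= self._max:
                return False
            self._open += 1
            self.peak = max(self.peak, self._open)
            return True

    def release(self):
        with self._lock:
            if self._open > 0:
                self._open -= 1

    def open_count(self):
        with self._lock:
            return self._open


def hammer(limiter, attempts):
    # Simulate a busy worker repeatedly opening and closing a connection.
    for _ in range(attempts):
        if limiter.try_acquire():
            limiter.release()


def run_demo(threads=8, attempts=2000, max_connections=3):
    limiter = ConnectionLimiter(max_connections)
    workers = [
        threading.Thread(target=hammer, args=(limiter, attempts))
        for _ in range(threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return limiter
```

Calling `run_demo()` exercises the limiter from eight threads; the recorded peak should never exceed the configured cap, and no connections should remain open afterwards.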

One of the three root causes it identified “causes accumulation of processes on the MFA backend leading to resource exhaustion on the backend at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring.”
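
To illustrate how a backend can be exhausted "while otherwise appearing healthy", here is a minimal sketch contrasting a liveness-only check with one that also looks at backlog. Everything in it is an assumption made for illustration: the `BackendStats` fields, the function names and the `max_pending` threshold are hypothetical and do not describe Microsoft’s actual monitoring.

```python
from dataclasses import dataclass


@dataclass
class BackendStats:
    """Hypothetical snapshot of one backend server."""
    process_alive: bool          # what a plain liveness probe sees
    pending_requests: int        # accepted but not yet completed
    completed_last_minute: int   # requests finished in the last minute


def liveness_only(stats):
    # Reports healthy as long as the process is up, even if it is stuck.
    return stats.process_alive


def liveness_with_backlog(stats, max_pending=1000):
    # Also treats a large backlog, or a backlog with no progress, as unhealthy.
    if not stats.process_alive:
        return False
    if stats.pending_requests > max_pending:
        return False
    if stats.pending_requests > 0 and stats.completed_last_minute == 0:
        return False
    return True


# A backend that is up but has stopped completing work.
stuck = BackendStats(process_alive=True, pending_requests=5000,
                     completed_last_minute=0)
# A backend under normal load.
busy = BackendStats(process_alive=True, pending_requests=40,
                    completed_last_minute=600)
```

With these sample values, `liveness_only(stuck)` still returns `True`, while `liveness_with_backlog(stuck)` returns `False`; both checks pass for `busy`.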

The company has pledged to “review our update deployment procedures to better identify similar issues during our development and testing cycles and review the monitoring services to identify ways to reduce detection time and quickly restore service” (both by December 2018).

Microsoft also promised to “review our containment process to avoid propagating an issue to other data centers (completion by Jan 2019)”.

Meanwhile, rebooting its servers seems to work…
This article is from the CBROnline archive: some formatting and images may not be present.

CBR Staff Writer

CBR Online legacy content.