Deja Vu All Over Again: Microsoft in Fresh MFA Meltdown

Fresh from publishing a post-mortem of last week’s multi-factor authentication (MFA) login issues on Azure and Office 365, which plagued users globally for 17 hours, Microsoft today admitted the bugs had re-emerged, and yes, once again rebooting its servers provided a temporary reprieve.

We've performed restarts across the infrastructure responsible for processing MFA requests and have confirmed service restoration. For more details see service incident ID MO165847 on the Service Health Dashboard.

— Microsoft 365 Status (@MSFT365Status) November 27, 2018

The issue last week was attributed by Microsoft to three root causes, the first two introduced in a roll-out of a code update that began in some data centers on Tuesday, 13 November 2018 and completed on Friday, 16 November 2018.


The issues were found to be triggered once a certain traffic threshold was exceeded. Azure was also affected again today, and customers were predictably less than happy, many having only recently rolled out MFA to their own users.

It really makes us look bad as partners and service providers when we recommend a service that is now crapping out on us :/

— Nedrick_NA (@Nedrick_NA) November 27, 2018

Microsoft blamed last week’s outage on a buggy code roll-out.

The change had been intended to better manage connections to its caching services.

“Unfortunately, this change introduced more latency and a race-condition in the new connection management code, under heavy load. This caused the MFA service to slow down processing of requests, initially impacting the West EU data centres (which service APAC and EMEA traffic).”

One of the three root causes it identified “causes accumulation of processes on the MFA backend leading to resource exhaustion on the backend at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring.”
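The failure mode described there, a backend that exhausts its resources while still looking healthy to monitoring, can be illustrated with a toy model. The Python sketch below is not Microsoft's code: the Backend class, its capacity figure, the arrival and completion rates, and both health checks are hypothetical, chosen only to show why a check that merely confirms a process is running can miss this kind of overload.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Toy backend (hypothetical model, not Microsoft's implementation)."""
    capacity: int        # maximum number of requests it can hold at once
    in_flight: int = 0   # requests accepted but not yet completed

    def accept(self) -> bool:
        """Accept a request if there is room; otherwise reject it."""
        if self.in_flight >= self.capacity:
            return False
        self.in_flight += 1
        return True

    def complete(self, n: int = 1) -> None:
        """Mark up to n in-flight requests as finished."""
        self.in_flight = max(0, self.in_flight - n)

    def liveness_ok(self) -> bool:
        """Naive check: the process is up, so it always reports healthy."""
        return True

    def saturation_ok(self, threshold: float = 0.8) -> bool:
        """Stricter check: unhealthy once utilisation reaches the threshold."""
        return self.in_flight / self.capacity < threshold


def simulate(arrivals_per_tick: int, completions_per_tick: int,
             ticks: int, capacity: int = 100):
    """Feed requests in faster or slower than they finish; count rejections."""
    backend = Backend(capacity=capacity)
    rejected = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if not backend.accept():
                rejected += 1
        backend.complete(completions_per_tick)
    return backend, rejected


if __name__ == "__main__":
    # Arrivals outpace completions, so work accumulates until capacity is hit.
    backend, rejected = simulate(arrivals_per_tick=10,
                                 completions_per_tick=8, ticks=100)
    print("rejected requests:", rejected)
    print("liveness check healthy:", backend.liveness_ok())
    print("saturation check healthy:", backend.saturation_ok())
```

In this model, once requests arrive faster than they complete, the backlog grows until the backend starts rejecting work, yet the liveness check keeps reporting healthy. Only the saturation-aware check flags the problem, which is the kind of gap a monitoring review would aim to close.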

The company has pledged to “review our update deployment procedures to better identify similar issues during our development and testing cycles and review the monitoring services to identify ways to reduce detection time and quickly restore service” (both by December 2018).

Microsoft also promised to “review our containment process to avoid propagating an issue to other data centers (completion by Jan 2019)”.

Meanwhile, rebooting its servers seems to work…