CrowdStrike claims 97% restoration of Windows sensors following IT outage

The 19 July global IT outage was due to a routine software update to the CrowdStrike Falcon sensor. (Credit: reivax/Wikimedia Commons, Creative Commons)

CrowdStrike, the American cybersecurity firm behind the 19 July global IT outage that resulted in the notorious “blue screen of death” error on millions of Microsoft Windows devices, claims to have restored over 97% of its Windows sensors.

The disruption, which affected approximately 8.5 million devices, was triggered by a routine software update to the Falcon platform sensor, a critical component of CrowdStrike’s cybersecurity infrastructure.

The global IT outage occurred when the update, intended to gather telemetry data on new threat techniques, caused devices running Microsoft Windows to display the blue screen of death. This malfunction affected a wide range of sectors, including airlines, broadcasting, healthcare, and banking, leading to grounded flights, interrupted broadcasts, and disruptions in essential services.

CrowdStrike CEO George Kurtz wrote on LinkedIn: “To our customers still affected, please know we will not rest until we achieve full recovery. At CrowdStrike, our mission is to earn your trust by safeguarding your operations.

“I am deeply sorry for the disruption this outage has caused and personally apologize to everyone impacted. While I can’t promise perfection, I can promise a response that is focused, effective, and with a sense of urgency.”

Post-incident review and findings by CrowdStrike

CrowdStrike’s post-incident review, released on 24 July, detailed the root cause of the outage. The review identified a content configuration update delivered at 04:09 UTC on 19 July as part of the company’s regular operations. This update, categorised under Rapid Response Content, inadvertently caused an out-of-bounds memory read, leading to the blue screen of death on systems running sensor version 7.11.

Update mechanisms and their roles

CrowdStrike deploys updates through two primary mechanisms: Sensor Content and Rapid Response Content. Sensor Content updates are thoroughly tested and part of structured sensor releases, ensuring stability and reliability.

Rapid Response Content, designed for immediate responses to emerging threats, is updated dynamically and may not undergo the same level of pre-release testing as Sensor Content.

The 19 July global IT outage was specifically linked to a Rapid Response Content Template Instance, which included a new IPC Template Type introduced in version 7.11, released on 28 February 2024. Despite rigorous testing and validation, this template type led to unexpected system crashes.

Microsoft’s recovery tool

In response to the incident, Microsoft released a new recovery tool on 20 July to address the issues caused by the CrowdStrike Falcon agent. The tool offers IT administrators two repair options: Recover from WinPE and Recover from Safe Mode.

The Recover from WinPE option creates boot media for quick and direct system recovery without requiring local admin privileges. However, if BitLocker is enabled on the device, users may need to manually enter the BitLocker recovery key. For systems using third-party disk encryption solutions, users are advised to follow their vendor’s guidance to recover the drive and run the remediation script from WinPE.

CrowdStrike implements new measures to prevent future IT outages

CrowdStrike claims to be taking several steps to prevent future outages. The company is enhancing software testing by implementing more comprehensive methods such as stress testing, fuzzing, fault injection, and additional validation checks in the Content Validator.

To improve resilience and recoverability, it is strengthening error handling in the Falcon sensor to manage problematic content more effectively.
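As a rough illustration of what a validation check and more defensive error handling can look like in general, consider the Python sketch below. It is hypothetical and does not reflect CrowdStrike's actual code: the function names `validate_instance` and `read_field`, and the sample data, are assumptions made purely for illustration.

```python
from typing import Optional, Sequence


def validate_instance(field_indexes: Sequence[int], supplied_inputs: int) -> bool:
    """Hypothetical pre-release check: accept content only if every field
    index it references falls within the inputs that will be supplied."""
    return all(0 <= i < supplied_inputs for i in field_indexes)


def read_field(inputs: Sequence[str], index: int) -> Optional[str]:
    """Hypothetical runtime guard: return None rather than reading past
    the end of the supplied inputs."""
    if 0 <= index < len(inputs):
        return inputs[index]
    return None


inputs = ["a", "b", "c"]
print(validate_instance([0, 2], len(inputs)))  # True
print(validate_instance([0, 3], len(inputs)))  # False
print(read_field(inputs, 3))                   # None
```

The first function rejects problematic content before it ships, while the second keeps the consumer from failing if such content slips through anyway.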

The cybersecurity firm is also adopting a staggered deployment strategy, starting with a canary deployment to a small subset of systems before a broader rollout. Enhanced monitoring during this process will help identify and address issues quickly.
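A staggered rollout of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CrowdStrike's deployment system: the function `staged_rollout`, the stage fractions, the failure threshold, and the `deploy` callback are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.1, 1.0),  # assumed stage sizes
    max_failure_rate: float = 0.0,                        # assumed threshold
) -> List[str]:
    """Deploy to growing subsets of hosts, starting with a small canary
    group, and halt as soon as a stage exceeds the failure threshold.
    Returns the hosts that were updated successfully."""
    updated: List[str] = []
    done = 0
    for fraction in stage_fractions:
        target = max(1, round(len(hosts) * fraction))
        batch = hosts[done:target]
        if not batch:
            continue
        failures = 0
        for host in batch:
            if deploy(host):
                updated.append(host)
            else:
                failures += 1
        done = target
        if failures / len(batch) > max_failure_rate:
            break  # stop before the broader rollout
    return updated


hosts = [f"host-{i}" for i in range(100)]
# Simulated deployment in which one host in the second stage fails.
result = staged_rollout(hosts, lambda h: h != "host-5")
print(len(result))  # 9
```

In this simulated run the canary stage (one host) succeeds, the second stage hits a failure, and the remaining 90 hosts never receive the update.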

Customers will have greater control over Rapid Response Content updates, with options for when and where these updates are deployed, along with notifications about content updates and timing.

In addition, the company said it is conducting independent third-party security code reviews and reviewing its end-to-end quality processes from development through deployment.