CrowdStrike has released its first post-incident review (PIR) into a software update that bricked Windows devices in airlines, broadcast news, hospitals and supermarkets around the world. The bug, said the cybersecurity firm, could be traced to a content configuration update for its Windows sensor delivered at 04:09 UTC on 19 July as part of its regular operations. Intended to collect telemetry on emerging threat techniques, the update instead caused Windows systems running version 7.11 of the sensor to crash.
CrowdStrike said that the issue was identified and addressed by 05:27 UTC on the same day. Systems that came online after this time or those that did not receive the update were unaffected, the firm added. Mac and Linux hosts were also not impacted.
The American cybersecurity technology company has announced that a comprehensive root cause analysis of the bug will soon be available. Today’s post-incident review uses general terms for readability, but more specific terminology is used in other documentation.
CrowdStrike update caused global IT meltdown
The Texas-based cybersecurity technology company delivers security content updates to its sensors in two ways: Sensor Content and Rapid Response Content. Sensor Content ships as part of a sensor release and is said to undergo extensive quality assurance processes. Rapid Response Content, meanwhile, is used for real-time responses to emerging threats and is updated dynamically.
According to CrowdStrike, the problem on 19 July involved a Rapid Response Content update that contained an undetected error. This update is not the same as Sensor Content, which is static and part of a sensor release.
The error was traced to a Rapid Response Content Template Instance, which led to a Windows operating system crash (BSOD) due to an out-of-bounds memory read, revealed the company.
CrowdStrike’s sensor version 7.11, which included a new inter-process communication (IPC) Template Type, was released on 28 February 2024. This template type is said to have been stress-tested and validated before its deployment.
The firm said that subsequent IPC template instances were deployed successfully until 19 July, when a validation error allowed problematic content to be deployed.
Cybersecurity firm says future faults will be avoided
CrowdStrike said that it is implementing several measures to prevent future incidents. These include enhanced testing protocols such as local developer testing, rollback testing, stress testing, and improved validation checks for Rapid Response Content.
The company also said it is improving error handling in the Content Interpreter. This component of the Falcon sensor system processes and applies configuration updates from the cloud.
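To illustrate the kind of safeguard described here, the following is a minimal sketch of an interpreter that checks a template instance against its inputs before reading them, so that an out-of-range reference is rejected rather than read. It is not CrowdStrike's code: the names `TemplateInstance`, `validate`, `interpret` and `safe_interpret`, and the idea that an instance is a list of input field indexes, are assumptions made purely for illustration.

```python
from dataclasses import dataclass


class ContentValidationError(Exception):
    """Raised when a template instance does not fit the inputs it would read."""


@dataclass(frozen=True)
class TemplateInstance:
    # Hypothetical structure: each entry is the index of an input field to inspect.
    name: str
    field_indexes: tuple[int, ...]


def validate(instance: TemplateInstance, input_count: int) -> None:
    """Reject an instance that references a field outside the supplied inputs."""
    for index in instance.field_indexes:
        if not 0 <= index < input_count:
            raise ContentValidationError(
                f"{instance.name}: field {index} is outside 0..{input_count - 1}"
            )


def interpret(instance: TemplateInstance, inputs: list[str]) -> list[str]:
    """Validate first, then read only fields that are known to exist."""
    validate(instance, len(inputs))
    return [inputs[index] for index in instance.field_indexes]


def safe_interpret(instance: TemplateInstance, inputs: list[str]):
    """Skip a bad instance (return None) instead of letting the error propagate."""
    try:
        return interpret(instance, inputs)
    except ContentValidationError:
        return None
```

In this sketch, an instance that asks for a fourth field when only three inputs are supplied is skipped by `safe_interpret`, while well-formed instances are processed as normal.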
“Everyone is focused on if the source of the defect made it through their testing and QA processes,” Futurum Group’s chief technology advisor Mitch Ashley told Tech Monitor. “The reality is errors happen and can make it into production.”
“The biggest issue here is that CrowdStrike didn’t stage and verify the update and pushed the update out without knowing there was a major outage-causing issue. Staging releases of widely-used software where errant updates can distribute quickly are particularly susceptible to this.”
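The staging Ashley refers to can be pictured as a ring-based rollout, in which an update reaches a small group of machines first and only moves on if they stay healthy. The sketch below is a generic illustration under that assumption, not a description of CrowdStrike's deployment system; `staged_rollout`, `deploy` and `is_healthy` are hypothetical names.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy ring by ring and stop at the first ring with an unhealthy host.

    Returns the hosts that received the update.
    """
    updated: list[str] = []
    for ring in rings:
        for host in ring:
            deploy(host)
            updated.append(host)
        # Verify the whole ring before the update reaches later, larger rings.
        if not all(is_healthy(host) for host in ring):
            break
    return updated
```

With rings ordered from smallest to largest, a faulty update in this sketch stops after the first ring instead of reaching every host at once.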
Earlier this week, in the wake of the 19 July incident, Microsoft introduced a new recovery tool designed to address problems caused by the CrowdStrike Falcon agent on Windows clients and servers. Though most impacted services have now recovered from the outage, some companies are still being affected, including US airline Delta.