Cloudflare has disclosed a significant issue with its logging-as-a-service platform, Cloudflare Logs, which resulted in customer data loss following a problematic software update. The US-based connectivity cloud company acknowledged that approximately 55% of log data generated during a 3.5-hour window on 14 November 2024 was permanently lost. The loss was the result of a series of technical misconfigurations and cascading system failures.
Cloudflare Logs collects event metadata from Cloudflare’s global network for customer use in areas such as debugging, compliance, and analytics. To streamline log delivery and avoid overwhelming recipients, the company employs a system called Logpush, which aggregates and transmits logs in manageable batches. The error stemmed from an update to Logpush that triggered a chain of system failures, ultimately disrupting the service and causing the data loss.
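Cloudflare has not published Logpush’s internals, but the batching pattern it describes is a familiar one: accumulate events and flush them when either a size or a time threshold is reached. The Go sketch below illustrates the general idea only; the names, thresholds and structure are illustrative and are not Cloudflare’s code.

```go
package main

import (
	"fmt"
	"time"
)

// logBatcher accumulates events and flushes them when the batch reaches a
// size limit or a time limit expires, so recipients receive a bounded number
// of deliveries rather than one per event.
type logBatcher struct {
	batch    []string
	maxSize  int
	interval time.Duration
	flush    func([]string)
}

func (b *logBatcher) run(events <-chan string) {
	ticker := time.NewTicker(b.interval)
	defer ticker.Stop()
	for {
		select {
		case ev, ok := <-events:
			if !ok { // input closed: deliver whatever remains
				if len(b.batch) > 0 {
					b.flush(b.batch)
				}
				return
			}
			b.batch = append(b.batch, ev)
			if len(b.batch) >= b.maxSize {
				b.flush(b.batch)
				b.batch = nil
			}
		case <-ticker.C: // time limit reached: deliver a partial batch
			if len(b.batch) > 0 {
				b.flush(b.batch)
				b.batch = nil
			}
		}
	}
}

func main() {
	events := make(chan string)
	b := &logBatcher{maxSize: 3, interval: time.Second,
		flush: func(batch []string) { fmt.Println("deliver", len(batch), "events") }}
	go func() {
		for i := 0; i < 7; i++ {
			events <- fmt.Sprintf("event-%d", i)
		}
		close(events)
	}()
	b.run(events)
}
```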
The incident began with a configuration update aimed at enabling support for an additional dataset within Logpush. A bug in the configuration generation system caused Logfwdr, a component responsible for forwarding logs, to receive an empty configuration. This error indicated to Logfwdr that no logs needed to be transmitted. Cloudflare identified the problem within minutes and reverted the change.
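The failure mode described here is easy to picture: a forwarder that simply iterates over whatever configuration it is handed will silently do nothing if a buggy generator hands it an empty list. The sketch below is a hypothetical illustration of that behaviour, not Cloudflare’s actual Logfwdr code; the type and function names are invented.

```go
package main

import "fmt"

// customerConfig describes which datasets a customer has enabled for delivery.
type customerConfig struct {
	CustomerID string
	Datasets   []string
}

// forwardLogs iterates over its configuration; given an empty list it
// concludes there is nothing to send, even though logs are still being
// generated upstream.
func forwardLogs(configs []customerConfig) {
	if len(configs) == 0 {
		fmt.Println("no configurations received; forwarding nothing")
		return
	}
	for _, c := range configs {
		fmt.Printf("forwarding %v for %s\n", c.Datasets, c.CustomerID)
	}
}

func main() {
	// A bug in configuration generation yields an empty slice instead of
	// the real customer list.
	forwardLogs([]customerConfig{})
}
```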
However, reverting the update activated a secondary, pre-existing bug in Logfwdr. This bug, tied to a fail-safe mechanism designed to “fail open” in case of configuration errors, caused Logfwdr to process and attempt to transmit logs for all customers, rather than only those with active configurations.
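A fail-open policy of this kind is straightforward to express, and its danger is equally plain: when the configuration looks broken, fall back to serving everyone. The Go sketch below is a simplified, assumed rendering of that logic, not the real Logfwdr implementation; the customer list and function names are placeholders.

```go
package main

import "fmt"

// allCustomers stands in for the full customer list known to the forwarder.
var allCustomers = []string{"cust-a", "cust-b", "cust-c" /* ...millions more */}

// selectCustomers applies a fail-open policy: if the pushed configuration
// looks invalid (here, empty), forward logs for every customer rather than
// risk dropping data. Reasonable with a handful of customers, catastrophic
// at today's scale, which is the trap the incident report describes.
func selectCustomers(configured []string) []string {
	if len(configured) == 0 {
		// Fail open: assume the configuration is broken and serve everyone.
		return allCustomers
	}
	return configured
}

func main() {
	active := selectCustomers(nil) // an empty or reverted config triggers the fallback
	fmt.Printf("forwarding logs for %d customers instead of the configured subset\n", len(active))
}
```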
The unexpected surge in log processing overwhelmed Buftee, Cloudflare’s log buffering system. Buftee is designed to maintain separate buffers for individual customers to ensure data integrity and avoid interference between log jobs. Under normal circumstances, Buftee handles millions of buffers globally. The massive influx of data following the Logfwdr error increased buffer demand fortyfold, exceeding Buftee’s capacity and rendering the system unresponsive.
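Per-customer buffering usually means creating buffers on demand as new customer jobs appear, which is exactly what makes an unbounded surge dangerous. The sketch below, again hypothetical rather than Buftee’s real design, shows buffers created per customer with a global cap of the kind that, per Cloudflare’s write-up, existed as a feature but was not enabled.

```go
package main

import (
	"errors"
	"fmt"
)

// bufferPool keeps one buffer per customer job so that a slow or failing job
// cannot interfere with others. maxBuffers models a global capacity guard;
// without it, a fortyfold jump in distinct buffers simply exhausts the service.
type bufferPool struct {
	buffers    map[string][]string
	maxBuffers int
}

func (p *bufferPool) append(customer, event string) error {
	if _, ok := p.buffers[customer]; !ok {
		if p.maxBuffers > 0 && len(p.buffers) >= p.maxBuffers {
			return errors.New("buffer capacity exceeded; refusing new buffers")
		}
		p.buffers[customer] = nil
	}
	p.buffers[customer] = append(p.buffers[customer], event)
	return nil
}

func main() {
	p := &bufferPool{buffers: map[string][]string{}, maxBuffers: 5}
	for i := 0; i < 200; i++ { // a sudden surge of distinct customer jobs
		if err := p.append(fmt.Sprintf("cust-%d", i), "event"); err != nil {
			fmt.Println("customer", i, ":", err)
			break
		}
	}
}
```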
Cloudflare stated that resolving the issue required a full system reset and several hours of recovery efforts. During this period, the company was unable to transmit or recover the affected logs, resulting in permanent data loss.
Cloudflare attributed the incident to gaps in its system safeguards and configuration protocols. While mechanisms were in place to manage similar errors, they had not been configured to handle such a large-scale failure. For example, Buftee includes features designed to manage sudden increases in buffer demand, but these features were not activated, leaving the system vulnerable to overload.
The company also highlighted that the fail-open mechanism in Logfwdr, implemented during the early development of the service, had not been updated to reflect the significantly larger customer base and traffic levels. This oversight allowed the system to send logs for all customers, creating a spike in resource usage that exceeded operational limits.
Cloudflare acknowledged that while the original bug in Logfwdr’s configuration system was corrected quickly, the broader system failures highlighted the need for more comprehensive testing and validation of failover mechanisms.
Response and future measures
Cloudflare has apologised for the disruption and announced plans to prevent similar incidents in the future. The company is introducing new alerts to detect configuration errors more effectively, updating its failover mechanisms to handle larger-scale failures, and conducting simulations to test system resilience under overload conditions.
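Cloudflare has not detailed what the new alerts will check, but one plausible guard implied by the post-mortem is validating a generated configuration before it is rolled out. The sketch below is an assumed example of such a check; the 50% threshold and function name are illustrative, not Cloudflare’s.

```go
package main

import (
	"fmt"
	"math"
)

// validateConfig refuses to deploy a generated configuration that is empty,
// or that drops an implausible share of customers relative to the previous
// version, and surfaces the anomaly for review instead.
func validateConfig(previous, next int) error {
	if next == 0 {
		return fmt.Errorf("generated configuration is empty; refusing to deploy")
	}
	drop := float64(previous-next) / math.Max(float64(previous), 1)
	if drop > 0.5 {
		return fmt.Errorf("configuration shrank by %.0f%%; flagging for review", drop*100)
	}
	return nil
}

func main() {
	if err := validateConfig(100000, 0); err != nil {
		fmt.Println("blocked:", err) // the empty-config bug would be caught here
	}
	if err := validateConfig(100000, 98000); err == nil {
		fmt.Println("normal churn passes")
	}
}
```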
“On a typical day, Cloudflare sends about 4.5 trillion individual event logs to customers,” reads the company’s blog post about the incident. “Although this represents less than 10% of the over 50 trillion total customer event logs processed, it presents unique challenges of scale when building a reliable and fault-tolerant system.”
Additionally, Cloudflare is enhancing its logging architecture to ensure individual system components can better handle cascading failures. The company stated that while failures in complex systems are inevitable, its focus is on mitigating their impact and ensuring services recover quickly.
Last month, Cloudflare reported successfully mitigating the largest recorded distributed denial-of-service (DDoS) attack, which peaked at 3.8 terabits per second (Tbps). The attack was part of a broader campaign targeting industries such as internet services, financial services, and telecommunications. The campaign consisted of more than 100 hyper-volumetric DDoS attacks, sustained over the course of a month, that sought to overwhelm network infrastructure with excessive volumes of data.