Content delivery network Cloudflare has provided details of an outage that caused service disruptions at hundreds of websites this morning. The company said that a change to its network configuration triggered an outage across 19 of its data centres. This impacted nearly 50% of the HTTP requests on its network.
Cloudflare says it has identified a number of areas of improvement that would prevent such outages in future, including refinements to its network architecture and increased automation.
According to DownDetector.com, users reported issues at sites including Google, AWS and Twitter, starting at around 7:20am this morning UK time. At 7:43am, Cloudflare said on its status monitoring site that it was investigating a “critical P0 incident” (P0 refers to the maximum possible priority).
Cloudflare later said it had identified the fault, implemented a fix and was monitoring the results. As of 9:30am, Cloudflare said all of its services were operational.
What caused the Cloudflare outage?
The Cloudflare outage was caused by a change in the way internet traffic is routed between IP addresses through its network, the company explained in a blog post.
Over the last 18 months, Cloudflare has been rolling out a new network architecture that establishes a ‘mesh layer’. “This mesh allows us to easily disable and enable parts of the internal network in a data centre for maintenance or to deal with a problem,” it said.
Earlier today, a change to the network management policy that determines which IP addresses are reachable on the internet caused many websites to become unavailable. Because this affected the mesh layer of the network, it impacted nearly 50% of the HTTP requests it was handling.
Cloudflare said that the network policy change was implemented at 4:56am UK time, but didn’t reach the mesh layer until 7:27am, when the outage began. Cloudflare spotted the outage within five minutes, it said, and within half an hour had identified the cause. It then began reverting the problematic network policy update to a previous version, which was complete by 8:45am UK time.
The company identified three areas where improvements could help prevent a similar outage from happening again. These include a change to its upgrade procedure, so that network changes are staggered; a change to its architecture; and increased use of automation in managing its network.
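The intervals implied by that timeline can be worked out with a few lines of Python. This is only an illustrative calculation using the UK times quoted above; the helper name `minutes_between` is made up for this example.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day times given as HH:MM (24-hour)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Times as reported by Cloudflare (UK time), written in 24-hour form.
policy_change = "04:56"    # network policy change implemented
outage_start = "07:27"     # change reaches the mesh layer, outage begins
revert_complete = "08:45"  # rollback to the previous policy finished

propagation_delay = minutes_between(policy_change, outage_start)
outage_length = minutes_between(outage_start, revert_complete)

print(f"Change to outage: {propagation_delay} minutes")
print(f"Outage to completed revert: {outage_length} minutes")
```

On these figures, roughly two and a half hours passed between the change and the start of the outage, and a little under 80 minutes between the start of the outage and the completed revert.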
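As a rough illustration of what a staggered change procedure can look like, the sketch below applies a change to one batch of sites at a time and rolls everything back if a batch fails its health check. It is a generic pattern, not Cloudflare's actual tooling: the function, the site names and the health check are all hypothetical.

```python
from typing import Callable, List, Sequence


def staggered_rollout(
    batches: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Apply a change batch by batch; undo it everywhere if any batch is unhealthy."""
    applied: List[str] = []
    for batch in batches:
        for site in batch:
            apply_change(site)
            applied.append(site)
        # Check the batch before moving on, so a bad change stops early.
        if not all(is_healthy(site) for site in batch):
            for site in reversed(applied):
                rollback(site)
            return False
    return True


# Hypothetical example: the change breaks "dc-b", so the third batch is never touched.
state = {}
ok = staggered_rollout(
    batches=[["dc-canary"], ["dc-a", "dc-b"], ["dc-c", "dc-d", "dc-e"]],
    apply_change=lambda site: state.update({site: "new"}),
    is_healthy=lambda site: site != "dc-b",
    rollback=lambda site: state.update({site: "old"}),
)
print(ok, state)
```

The point of the pattern is that a faulty change is caught while it affects only a few locations, rather than after it has reached every data centre at once.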
What is Cloudflare?
Cloudflare is a content delivery network and edge computing provider that many web services use to improve performance and security. The company operates a network of data centres from which it delivers customers’ web content, such as images and videos. The proximity of these data centres to users means websites load faster, improving the user experience.
The company also helps protect its clients’ websites from distributed denial of service (DDoS) attacks.
Cloudflare has a 39% share of the content delivery network market, according to figures from enlyft, with Amazon’s CloudFront in second place with 23%. More than a million companies use its services, enlyft says.
What does the Cloudflare outage mean for users?
Today's outage is likely to reignite questions about the centralisation of the web. A year ago, an outage at Fastly, another CDN provider, disrupted websites including Reddit and Amazon.
At the time, Gartner analyst Mike Dorosh said the incident may prompt CDNs to invest more in their resilience. “Everything in this space lately has been about performance and making things faster,” he said. “This may raise resiliency again, and you might see the vendors starting to talk about how resilient their tools are.”
But, he added, it was also a wake-up call to companies whose services rely on a single CDN. “Whatever you’re doing in technology, it’s only as good as the single points of failure.”
The question of cloud concentration was also raised in December 2020, after a string of outages at AWS. Back then, cloud computing adviser Ian Moyes said that for many industries, the benefits of cloud services outweigh the risk of outages.
“When there’s an outage everyone thinks it’s the end of the world, but you have to consider if the gain outweighs the risk,” he told Tech Monitor. “For some industries, the price and flexibility [of public cloud] make it worth the risk of an outage, but for others like gaming, even a few minutes of downtime isn’t tolerable for users.”
The UK's financial regulators have raised alarm about the potential impact of cloud concentration on the UK's financial system. According to the Bank of England, 65% of UK firms use the same four cloud providers.
To address this risk, HM Treasury recently proposed giving regulators the power to inspect cloud providers' data centres to audit their security and resilience measures.
Cloudflare in Russia
Today's outage is not the only reason Cloudflare has been in the headlines recently. The company is one of only a few Western technology companies that continue to operate in Russia following the country's invasion of Ukraine.
In a blog post in April, Cloudflare CEO Matthew Prince said that the company was helping Russian citizens circumvent internet controls through its WARP virtual private network service, and described the company’s in-country edge servers as a “frontline against cyberattacks” originating from inside Russia.