Amazon, Reddit and the UK government website were among the high-profile victims of an outage which took down a large chunk of the internet, with a failure of a content delivery network (CDN) provided by Fastly pinpointed as the cause of the disruption. CDNs play a crucial role for most websites, and the industry is ready to evolve to help ensure disruption caused by such outages is kept to a minimum.
Tuesday’s incident was identified by Fastly at 10.58 UK time, with users in many territories receiving an “Error 503 service unavailable”. It was resolved an hour later.
“We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration. Our global network is coming back online. Continued status is available at https://t.co/RIQWX0LWwl”
Fastly (@fastly), June 8, 2021
A blog post shared later on Tuesday by Fastly’s senior vice president of engineering and infrastructure, Nick Rockwell, attributed the incident to a customer inadvertently triggering a bug introduced in a recent software update. “A customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors,” Rockwell explained, apologising to the company’s customers. “Even though there were specific conditions that triggered this outage, we should have anticipated it.”
Nevertheless, the swift resolution of the situation appears to have played in Fastly’s favour, with its share price climbing 4% when the markets in New York opened on Tuesday morning.
What is a content delivery network?
CDNs such as Fastly sit between websites and the end user to help speed up the delivery of content and to ensure sites perform consistently during periods of heavy traffic.
With revenue of $291m in 2020, Fastly is a relatively small player in a market worth $10.7bn, and has just over 2,000 paying customers. Mike Dorosh, senior research director at Gartner, says the reason yesterday’s outage made such a splash is the high-profile nature of some of those customers. “Fastly has a concentration of customers in media, entertainment and retail,” he says.
Dorosh says the incident may cause CDN vendors to consider the importance of building greater resilience into their networks. “Everything in this space lately has been about performance and making things faster,” he says. “This may raise resiliency again, and you might see the vendors starting to talk about how resilient their tools are.” But, he says, “this incident isn’t so much about whether Fastly’s infrastructure is brittle, but more that, whatever you’re doing in technology, it’s only as good as the single points of failure.”
The Fastly outage and the evolution of CDNs
Tech leaders know outages are part of life, Dorosh says, but with most companies relying on a single CDN through which all their web traffic is processed, he believes the industry is set to evolve to help businesses mitigate the risk of downtime. Some organisations have started to implement multiple CDNs, and though this can be more complex logistically, Dorosh expects it to become a common strategy in future.
“Multi-CDN has been around for a while, but the maturity is only just beginning to ramp up,” he says. “Customers have been burned by having servers knocked out by things like outages and cyberattacks. With multi-CDN, if one vendor’s servers are down you can use another’s to reach the outside world.”
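The failover idea Dorosh describes can be illustrated with a minimal Python sketch. It is not how any particular vendor or broker implements multi-CDN. The hostnames, the `pick_cdn` helper and the rule that any 5xx status counts as "down" are all assumptions made for illustration, and the health check is simulated rather than sent over the network.

```python
from typing import Callable, Iterable, Optional


def pick_cdn(hosts: Iterable[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first host whose health check passes, or None if all fail."""
    for host in hosts:
        try:
            if is_healthy(host):
                return host
        except Exception:
            # A health check that raises is treated as an unhealthy host.
            continue
    return None


# Hypothetical CDN hostnames in order of preference (not real endpoints).
CDN_HOSTS = ["cdn-a.example.com", "cdn-b.example.com"]

# Simulated status codes: the preferred CDN returns 503, as users saw on Tuesday.
SIMULATED_STATUS = {"cdn-a.example.com": 503, "cdn-b.example.com": 200}


def simulated_check(host: str) -> bool:
    # Assumption: any 5xx response means the CDN should be skipped.
    return SIMULATED_STATUS[host] < 500


chosen = pick_cdn(CDN_HOSTS, simulated_check)
print(chosen)  # cdn-b.example.com
```

In practice the switch is more often made at the DNS or load-balancing layer than in application code, which is part of the logistical complexity mentioned above.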
Dorosh also predicts there will be a rise in CDN brokers, who will sell packages of CDNs from multiple vendors. “These companies will wholesale and resell CDN packages in the same way brokers sell cloud capacity,” he says. “Whenever there’s an incident like this there’s a re-evaluation of risk, and companies will be doing risk versus benefit analysis. If 40% of your revenue comes from digital channels, can you afford to be down for an hour? Or two? Or eight?”
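Dorosh’s question can be turned into rough arithmetic. The sketch below assumes, unrealistically, that digital revenue arrives evenly across every hour of the year, and the annual revenue figure is a made-up placeholder rather than data from any company. Only the 40% share and the one, two and eight hour durations come from the quote.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def revenue_at_risk(annual_revenue: float, digital_share: float, outage_hours: float) -> float:
    """Estimate digital revenue lost during an outage, assuming even hourly revenue."""
    digital_revenue = annual_revenue * digital_share
    return digital_revenue * outage_hours / HOURS_PER_YEAR


# Hypothetical company with $100m annual revenue, 40% of it from digital channels.
ANNUAL_REVENUE = 100_000_000
DIGITAL_SHARE = 0.40

for hours in (1, 2, 8):
    loss = revenue_at_risk(ANNUAL_REVENUE, DIGITAL_SHARE, hours)
    print(f"{hours}h outage: about ${loss:,.0f} at risk")
```

A real analysis would weight the estimate by time of day and season, since an hour of downtime during peak trading costs far more than the yearly average suggests.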
How can tech leaders mitigate risk of a CDN outage?
With multi-CDN still in its infancy, Dorosh says businesses are likely to be hampered by similar outages in future. The use of third-party SaaS tools, which are embedded in many websites, is also an issue. “The problem is that the single points of failure that bite you are often the services you rely on but which aren’t totally in your control,” he explains. “We’ve heard a lot about companies that have been partially affected by this problem, because though their websites were up and running, things like the sales tax calculator they use were down so they couldn’t process any orders.”
The only way for tech leaders to plan for such problems is to have full oversight of the vendors they use so they are prepared for potential issues. “Really all you can do is go through your environment and ask all your vendors what will happen if there’s an outage and what they will do to mitigate it,” Dorosh says.
Home page image by Jarretera/Shutterstock