When Things Go Awry in the Cloud: A Closer Look at an AWS Outage

The cloud, of course, is not in the sky, but in somebody else’s data centre. And no matter the myriad benefits of spinning up VMs on shared servers in some hyperscale data centre, or some bare metal in a well-resourced, well-networked and well-secured third-party site, things can – as they can anywhere – go awry in the cloud for many reasons.

Over the past year Computer Business Review recalls Microsoft Azure outages being caused by a lightning strike and an overloaded Redis cache, a Facebook data centre outage being caused by a “server configuration change” and (arguably less “cloudy”, but also as broadly impactful) Cloudflare being briefly crippled by a “regular expression that backtracked enormously” in its web application firewall managed rules.

As we await autopsies from both Google Cloud Platform (for an as-yet unexplained wobble on October 22) and, more intriguingly, AWS on a DDoS attack on its infrastructure that caused issues for a subset of users over an eight-hour period on the same date, Computer Business Review took a look at AWS’s most recent outage analysis, following a four-outage for users in one of AWS’s Japanese regions.

AWS Outage: A Cooling Failure Cooks Hardware

On August 23, 2019, a “small percentage” of EC2 servers in a single Availability Zone in the Tokyo (AP-NORTHEAST-1) region shut down, causing EC2 instance failures and degraded EBS volume performance for some users in that zone. (Some other services built on the underlying EC2 instances, such as RDS, Redshift, ElastiCache and WorkSpaces, were also hit.)

AWS also saw a “few isolated cases where customers’ applications running across multiple Availability Zones saw unexpected impact: some customers using Application Load Balancer in combination with AWS Web Application Firewall or sticky sessions, saw a higher than expected percent of requests return an Internal Server Error.”

What Happened?

Racks of servers started overheating, the company explains in a summary of the AWS outage, after a control system failure that caused “multiple, redundant cooling systems to fail in parts of the affected Availability Zone”. And, as temperatures soared, the company was forced to cut off power to the affected areas.

Before this happened, temperatures rose to a point at which some hardware effectively got cooked and had to be junked: “A small number of instances and volumes were hosted on hardware which was adversely affected by the loss of power and excessive heat. It took longer to recover these instances and volumes and some needed to be retired as a result of failures to the underlying hardware.”

How did various failovers not kick in?

AWS explained: “This event was caused by a failure of our datacenter control system, which is used to control and optimize the various cooling systems used in our datacenters. The control system runs on multiple hosts for high availability. This control system contains third-party code which allows it to communicate with third-party devices such as fans, chillers, and temperature sensors.

“It communicates either directly or through embedded Programmable Logic Controllers (PLC) which in turn communicate with the actual devices. Just prior to the event, the datacenter control system was in the process of failing away from one of the control hosts. During this kind of failover, the control system has to exchange information with other control systems and the datacenter equipment it controls (e.g., the cooling equipment and temperature sensors throughout the datacenter) to ensure that the new control host has the most up-to-date information about the state of the datacenter.

“A bug in the third-party control system logic…”

“Due to a bug in the third-party control system logic, this exchange resulted in excessive interactions between the control system and the devices in the datacenter which ultimately resulted in the control system becoming unresponsive.”
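AWS has not published the details of the bug, so the following is only an illustrative sketch of one general way to keep a failover state exchange from flooding devices: a token bucket that caps how fast a new control host polls them. Every name here (TokenBucket, plan_resync, the device IDs) is hypothetical, the clock is simulated, and none of it reflects AWS’s or its vendor’s actual code.

```python
from typing import List, Tuple


class TokenBucket:
    """Hypothetical rate limiter: allows `burst` requests at once,
    then refills at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0) -> None:
        if rate_per_sec <= 0 or burst < 1:
            raise ValueError("rate_per_sec must be > 0 and burst must be >= 1")
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = now

    def try_acquire(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def plan_resync(device_ids: List[str], bucket: TokenBucket,
                tick: float = 0.5) -> List[Tuple[str, float]]:
    """Return (device_id, simulated_time) pairs showing when each device
    would be polled if the new control host respects the rate limit.

    The clock is simulated: when the bucket is empty, time advances by
    `tick` seconds instead of actually sleeping.
    """
    now = 0.0
    schedule: List[Tuple[str, float]] = []
    for device_id in device_ids:
        while not bucket.try_acquire(now):
            now += tick  # assumed back-off step; a real system would sleep
        schedule.append((device_id, now))
    return schedule


# Example: five hypothetical devices, burst of 2, one request per second.
devices = ["fan-1", "fan-2", "chiller-1", "sensor-1", "sensor-2"]
for device_id, polled_at in plan_resync(devices, TokenBucket(1.0, 2)):
    print(f"{device_id} polled at t={polled_at:.1f}s")
```

With a burst of two and a refill rate of one request per second, the five devices are polled at simulated times 0, 0, 1, 2 and 3 seconds, rather than all at once. The trade-off is a slower resync in exchange for a bounded load on the devices and the control host.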

Because the datacenter control system was unavailable, AWS’s operations team on the ground had “minimum visibility into the health and state of the datacenter cooling systems”. They had to manually investigate and reset all equipment, finding unresponsive PLCs controlling some of the air handling units along the way.

AWS, apologising to customers and saying “we are never satisfied with operational performance that is anything less than perfect”, said it is “working with our third-party vendors to understand the bug, and subsequent interactions, that caused both the control system and the impacted PLCs to become unresponsive.

“We have disabled the failover mode that triggered this bug on our control systems to ensure we do not have a recurrence of this issue. We have also trained our local operations teams to quickly identify and remediate this situation if it were to recur.”

The simple lesson? In the cloud, as with any other theatre of IT operations, things can escalate fast. With the cloud, at least, fixes also come hard and fast – as one would expect given diverse industry reliance on such services – and most hyperscale providers have a solid track record of sharing the learnings from any such issues.

AWS’s full write-up of the August incident is here.