October 29, 2019, updated 30 Oct 2019 9:24am

When Things go Awry in the Cloud: A Closer Look at a Recent AWS Outage

"Some needed to be retired as a result of failures to the underlying hardware"

By CBR Staff Writer

The cloud, of course, is not in the sky, but in somebody else’s data centre. And no matter the myriad benefits of spinning up VMs on shared servers in some hyperscale data centre, or some bare metal in a well-resourced, well-networked and well-secured third-party site, things can – as they can anywhere – go awry in the cloud for many reasons.

Over the past year Computer Business Review recalls Microsoft Azure outages being caused by a lightning strike and an overloaded Redis cache, a Facebook data centre outage being caused by a “server configuration change” and (arguably less “cloudy”, but also as broadly impactful) Cloudflare being briefly crippled by a “regular expression that backtracked enormously” in its web application firewall managed rules.

As we await autopsies from both Google Cloud Platform (for an as-yet unexplained wobble on October 22) and, more intriguingly, AWS on a DDoS attack on its infrastructure that caused issues for a subset of users over an eight-hour period on the same date, Computer Business Review took a look at AWS’s most recent outage analysis, following a four-hour outage for users in one of AWS’s Japanese regions.

AWS Outage: A Cooling Failure Cooks Hardware 

On August 23, 2019 a “small percentage” of EC2 servers in the Tokyo (AP-NORTHEAST-1) region shut down, causing EC2 instance failures and degraded EBS volume performance for some users in the availability zone. (Some other services like RDS, Redshift, ElastiCache, and Workspaces based on the underlying EC2 instances were also hit.)

AWS also saw a “few isolated cases where customers’ applications running across multiple Availability Zones saw unexpected impact: some customers using Application Load Balancer in combination with AWS Web Application Firewall or sticky sessions, saw a higher than expected percent of requests return an Internal Server Error.”

What Happened?

Racks of servers started overheating, the company explains in a summary of the AWS outage, after a control system failure that caused “multiple, redundant cooling systems to fail in parts of the affected Availability Zone”. And, as temperatures soared, the company was forced to cut off power to the affected areas.

Before this happened, temperatures rose to a point at which some hardware effectively got cooked and had to be junked: “A small number of instances and volumes were hosted on hardware which was adversely affected by the loss of power and excessive heat. It took longer to recover these instances and volumes and some needed to be retired as a result of failures to the underlying hardware.”

How did various failovers not kick in?

AWS explained: “This event was caused by a failure of our datacenter control system, which is used to control and optimize the various cooling systems used in our datacenters. The control system runs on multiple hosts for high availability. This control system contains third-party code which allows it to communicate with third-party devices such as fans, chillers, and temperature sensors.

“It communicates either directly or through embedded Programmable Logic Controllers (PLC) which in turn communicate with the actual devices. Just prior to the event, the datacenter control system was in the process of failing away from one of the control hosts. During this kind of failover, the control system has to exchange information with other control systems and the datacenter equipment it controls (e.g., the cooling equipment and temperature sensors throughout the datacenter) to ensure that the new control host has the most up-to-date information about the state of the datacenter.

“A bug in the third-party control system logic…”

“Due to a bug in the third-party control system logic, this exchange resulted in excessive interactions between the control system and the devices in the datacenter which ultimately resulted in the control system becoming unresponsive.”

Because the datacenter control system was unavailable, AWS’s operations team on the ground had “minimum visibility into the health and state of the datacenter cooling systems”. They had to manually investigate and reset all equipment, finding unresponsive PLCs controlling some of the air handling units along the way.

AWS, apologising to customers and saying “we are never satisfied with operational performance that is anything less than perfect”, said it is “working with our third-party vendors to understand the bug, and subsequent interactions, that caused both the control system and the impacted PLCs to become unresponsive.

“We have disabled the failover mode that triggered this bug on our control systems to ensure we do not have a recurrence of this issue. We have also trained our local operations teams to quickly identify and remediate this situation if it were to recur.”
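AWS has not published the vendor's code or the exact nature of the bug, so the following is only an illustrative Python sketch, not a reconstruction. It shows one common way to keep a failover state exchange from generating unbounded traffic: the new control host queries each device a capped number of times and records the ones that do not answer, rather than retrying indefinitely. The `Device` class, the `sync_state` function, the device names and the `max_requests` capacity are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical stand-in for a fan, chiller or sensor behind a PLC."""
    name: str
    max_requests: int = 3   # assumed capacity before the device stops answering
    requests_seen: int = 0

    def read_state(self) -> dict:
        self.requests_seen += 1
        if self.requests_seen > self.max_requests:
            raise TimeoutError(f"{self.name} unresponsive")
        return {"name": self.name, "ok": True}


def sync_state(devices, max_attempts_per_device=2):
    """Collect device state for a new control host with a hard cap on queries.

    Returns (state, failed): state maps device name to its reported state,
    failed lists devices that never answered within the cap.
    """
    state, failed = {}, []
    for dev in devices:
        for _ in range(max_attempts_per_device):
            try:
                state[dev.name] = dev.read_state()
                break
            except TimeoutError:
                continue
        else:
            # Cap reached: flag the device for manual inspection instead of retrying.
            failed.append(dev.name)
    return state, failed


if __name__ == "__main__":
    devices = [Device("chiller-1"), Device("fan-7", max_requests=0)]
    state, failed = sync_state(devices)
    print("synced:", sorted(state), "needs manual check:", failed)
```

The design choice here is to give up early and surface the unresponsive devices to operators, which keeps the control host itself responsive even when some equipment is not.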

The simple lesson? In the cloud, as with any other theatre of IT operations, things can escalate fast. With the cloud, at least, fixes also come hard and fast – as one would expect given diverse industry reliance on such services – and most hyperscale providers have a solid track record of sharing the learnings from any such issues.

AWS’s full write-up of the August incident is here

