Google Cloud Down: GCP Suffers "Major" Global Outage

Here for April 2020’s outage? We’re covering that here. Want to understand what caused GCP’s last major borkage? We’ve got that covered here.

Google Cloud Platform (GCP) services down. Issue global in scale. Numerous services affected, including Kubernetes and IoT services like Nest.

Google Cloud Platform (GCP) says it is experiencing a “major issue”, with services including App Engine, Compute Engine, Cloud Storage, Dataflow, Dataproc, Pub/Sub, BigQuery and Networking all failing today as of 9.14 am BST.

“Multiple products are affected globally,” Google Cloud said today.

Engineers are working to mitigate the incident, the company said in a status update. Users of connected home service Nest were among those facing issues.

UPDATED 12.44 BST: “We are investigating an issue with an infrastructure component impacting multiple products. We believe we have identified the cause and are currently rolling out mitigation,” GCP said.

UPDATED 22:00 BST. GCP engineers resolved the issue in approximately 2 hours, 15 minutes. The company says the issue hit “some Google Cloud APIs across us-east1, us-east4 and southamerica-east1, with some APIs impacted globally. This includes the APIs for Compute Engine, Cloud Storage, BigQuery, Dataflow, Dataproc, and Pub/Sub. App Engine applications in those regions [were] also impacted.”

@googlenest it appears Google cloud services are down, according to Google Home, WiFi and Nest Thermostat apps. Update?

— David Strickland (@davstrick) November 11, 2019

Google Cloud Down

The issue comes 21 days after users faced 100 percent packet loss to and from ~20 percent of instances in GCP’s us-west1-b region for two-and-a-half hours.

That outage was blamed on “failure in the underlying leader election system” (its “Chubby lock system”), which “resulted in components in the control plane losing and gaining leadership in short succession.”

More Details: Google Cloud’s Little “Chubby” Outage

@googlecloud down? getting timeout on compute engine for 10 minutes now

— Johnny ⚡️ (@johnny_leo) November 11, 2019

The issue follows a string of public cloud outages, a reminder that even the best-resourced IaaS companies are not immune to development and infrastructure borkage.

AWS, Azure and GCP have all suffered high-profile incidents in the past six months, with AWS services interrupted by a DDoS attack for eight hours on October 22, the same day that GCP suffered its US west coast issue.

Azure has also struggled with a string of well-documented outages, with an overloaded Redis cache triggering a 17-hour multi-factor authentication outage in November and Office 365 failing in January, something Microsoft blamed on a “subset of mailbox database infrastructure [that] became degraded, causing impact”.