
Google Cloud in Major Global Outage: Numerous Services Fail


  • Google Cloud Platform (GCP) services down. Issue global in scale. Numerous services affected, including Kubernetes and IoT services like Nest.

Google Cloud Platform (GCP) says it is experiencing a “major issue”, with services including Cloud Dataflow, App Engine, Compute Engine, Cloud Storage, Dataproc, Pub/Sub, BigQuery and Networking all failing today as of 9.14 am BST.

“Multiple products are affected globally,” Google Cloud said today.

Engineers are working to mitigate the incident, the company said in a status update. Users of connected home service Nest were among those facing issues.


UPDATED 12.44 BST: “We are investigating an issue with an infrastructure component impacting multiple products. We believe we have identified the cause and are currently rolling out mitigation,” GCP said.

UPDATED 22.00 BST: GCP engineers resolved the issue in approximately 2 hours, 15 minutes. The company says the issue hit “some Google Cloud APIs across us-east1, us-east4 and southamerica-east1, with some APIs impacted globally. This includes the APIs for Compute Engine, Cloud Storage, BigQuery, Dataflow, Dataproc, and Pub/Sub. App Engine applications in those regions [were] also impacted.”

Google Cloud Down

The issue comes 21 days after users faced 100 percent packet loss to and from ~20 percent of instances in GCP’s us-west1-b zone for two-and-a-half hours.

That outage was blamed on “failure in the underlying leader election system” (its “Chubby lock system”) which “resulted in components in the control plane losing and gaining leadership in short succession.”


The issue follows a string of public cloud outages, a reminder that even the best-resourced IaaS companies are not immune to development and infrastructure borkage.

AWS, Azure and GCP have all suffered high-profile incidents in the past six months, with AWS services interrupted by a DDoS attack for eight hours on October 22, the same day that GCP suffered its US west coast issue.

Azure has also struggled with a string of well-documented outages, with an overloaded Redis cache triggering a 17-hour multi-factor authentication outage in November and Office 365 failing in January, something Microsoft blamed on a “subset of mailbox database infrastructure [that] became degraded, causing impact”.





This article is from the CBROnline archive: some formatting and images may not be present.

CBR Staff Writer

CBR Online legacy content.