November 1, 2019

Google Cloud’s Little “Chubby” Outage

In our latest installment of when-things-go-awry-in-the-cloud, we look at Google Cloud's October 22 outage

By CBR Staff Writer

A Google Cloud outage on October 22 was quickly linked on social media to a reported DDoS attack on AWS the same day. That was not the case, the company confirmed to Computer Business Review.

So what triggered the issue, which caused 100 percent packet loss to and from ~20 percent of instances in its us-west1-b zone for two-and-a-half hours? (It also affected Cloud SQL, Cloud VPN, and other services.)

Customers started losing access when the Google Cloud Networking control plane “experienced failures in programming its virtualised networking stack”, Google Cloud explained in an issue summary published today.

“New or migrated instances would have been unable to obtain network addresses and routes, making them unavailable,” it notes in the write-up, adding that “existing instances should have seen no impact; however, an additional software bug [was] triggered by the programming failure.”

In terms of underlying cause, Google today pointed the finger at a “failure in the underlying leader election system” (its “Chubby lock system”) which “resulted in components in the control plane losing and gaining leadership in short succession.”

“These frequent leadership changes halted network programming, preventing VM instances from being created or modified,” it said.

What’s “Chubby”?

The Chubby lock system is a way of automatically selecting which servers do what work in a large network of roughly similar servers. It handles what is known as the “distributed consensus problem”: getting a group of machines to agree on a single value, such as which of them is currently in charge.

As Google explains in a detailed paper on the system: “The Google File System uses a Chubby lock to appoint a GFS master server, and Bigtable uses Chubby in several ways: to elect a master, to allow the master to discover the servers it controls, and to permit clients to find the master.

“In addition, both GFS and Bigtable use Chubby as a well-known and available location to store a small amount of meta-data; in effect they use Chubby as the root of their distributed data structures. Some services use locks to partition work (at a coarse grain) between several servers.”
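To make the idea of electing a master with a lock concrete, here is a minimal sketch in Python. It is not Chubby's API: `ToyLockService`, `try_acquire`, `leader` and `elect` are hypothetical names, and the whole thing runs in one process with a simple time-limited lease. It assumes only the general pattern described above, in which whichever server holds the lock acts as master until its lease runs out.

```python
import time


class ToyLockService:
    """In-memory stand-in for a lock service (hypothetical, not Chubby's API)."""

    def __init__(self):
        self._holder = None      # name of the server currently holding the lock
        self._expires_at = 0.0   # time at which the holder's lease runs out

    def try_acquire(self, candidate, lease_seconds, now=None):
        """Grant the lock if it is free, expired, or already held by candidate."""
        now = time.monotonic() if now is None else now
        free = self._holder is None or now >= self._expires_at
        if free or self._holder == candidate:
            self._holder = candidate
            self._expires_at = now + lease_seconds
            return True
        return False

    def leader(self, now=None):
        """Return the current lock holder, or None if the lease has expired."""
        now = time.monotonic() if now is None else now
        return self._holder if now < self._expires_at else None


def elect(service, candidates, lease_seconds, now=None):
    """Every candidate tries for the lock; whoever holds it is the master."""
    for candidate in candidates:
        service.try_acquire(candidate, lease_seconds, now)
    return service.leader(now)
```

In this toy version the first candidate to ask wins, and a rival can only take over once the lease lapses. If the lock service itself misbehaves, holders can lose and regain the lock in quick succession, which is the kind of leadership churn Google describes.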

Google Cloud Outage: The Remediation

Existing network routes should continue to work normally when programming fails, Google Cloud noted in a blog today.

On this occasion they did not, because a “race condition in the code which handles leadership changes caused programming updates to contain invalid configurations, resulting in packet loss for impacted instances.”

The bug has been fixed “and a rollout of this fix was coincidentally in progress at the time of the outage…”
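Google has not published the code involved or the details of its fix. Purely as an illustration of one common safeguard against stale updates during leadership changes, the sketch below tags each update with a leadership epoch (sometimes called a fencing token) and has the receiver ignore anything from a superseded leader. `NetworkProgrammer`, `apply` and the route format are hypothetical and are not drawn from Google's systems.

```python
class NetworkProgrammer:
    """Hypothetical receiver of route updates; not Google's implementation."""

    def __init__(self):
        self.highest_epoch = 0   # newest leadership epoch seen so far
        self.routes = {}         # last accepted routing configuration

    def apply(self, epoch, routes):
        """Accept an update only if it comes from the newest known leader."""
        if epoch < self.highest_epoch:
            # Stale leader: keep the existing routes untouched.
            return False
        self.highest_epoch = epoch
        self.routes = dict(routes)
        return True


programmer = NetworkProgrammer()
programmer.apply(1, {"10.0.0.0/24": "vm-a"})            # first leader
programmer.apply(2, {"10.0.0.0/24": "vm-b"})            # new leader takes over
accepted = programmer.apply(1, {"10.0.0.0/24": "vm-a"})  # late update, old leader
```

Here the late update from the old leader is rejected and the routes programmed by the newer leader stay in place, which mirrors the intended behaviour that existing routes keep working when programming fails.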

Google engineers were alerted to the bug at 16:30 US/Pacific, started investigating immediately, and began mitigation within 50 minutes. The networking control plane was fully recovered by 18:51.

“Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization,” it added.
