November 1, 2019

Google Cloud’s Little “Chubby” Outage

In the latest installment of when-things-go-awry-in-the-cloud, we look at Google Cloud's October 22 outage

By CBR Staff Writer

A Google Cloud outage on October 22 was rapidly linked on social media to a reported DDoS attack on AWS the same day. That was not the case, the company rapidly confirmed to Computer Business Review.

So what had happened to trigger the issue, which caused 100 percent packet loss to and from ~20 percent of instances in its us-west1-b region for two-and-a-half hours? (It also affected Cloud SQL, Cloud VPN, and other services).

Customers started losing access when the Google Cloud Networking control plane “experienced failures in programming its virtualised networking stack”, Google Cloud explained in an issue summary published today.

“New or migrated instances would have been unable to obtain network addresses and routes, making them unavailable,” it notes in the write-up, adding that “existing instances should have seen no impact; however, an additional software bug [was] triggered by the programming failure.”

See also: When Things go Awry in the Cloud: A Closer Look at a Recent AWS Outage

In terms of underlying cause, Google today pointed the finger at a “failure in the underlying leader election system” (its “Chubby lock system”) which “resulted in components in the control plane losing and gaining leadership in short succession.”

“These frequent leadership changes halted network programming, preventing VM instances from being created or modified,” it said.

What’s “Chubby”?

Chubby is Google's lock service: a way of automatically deciding which servers in a fleet of roughly interchangeable machines do which work. It handles what is known as the “distributed consensus problem”: getting a group of machines to agree on a single value, such as which of them is the current leader.

As Google explains in a detailed paper on the system: “The Google File System uses a Chubby lock to appoint a GFS master server, and Bigtable uses Chubby in several ways: to elect a master, to allow the master to discover the servers it controls, and to permit clients to find the master.

“In addition, both GFS and Bigtable use Chubby as a well-known and available location to store a small amount of meta-data; in effect they use Chubby as the root of their distributed data structures. Some services use locks to partition work (at a coarse grain) between several servers.”
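To make that pattern concrete, here is a minimal Python sketch of lock-based leader election. The `LockService` class is a toy stand-in for a Chubby-like service, and all names (`try_acquire`, `gfs-master`, the server IDs) are illustrative assumptions, not Google's real API:

```python
import threading

class LockService:
    """Toy stand-in for a Chubby-like lock service: one exclusive
    lock per name, held by at most one client at a time."""
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def try_acquire(self, name, client_id):
        # Grant the named lock to client_id only if nobody holds it.
        with self._guard:
            if self._locks.get(name) is None:
                self._locks[name] = client_id
                return True
            return False

    def release(self, name, client_id):
        # Only the current holder may release its lock.
        with self._guard:
            if self._locks.get(name) == client_id:
                del self._locks[name]

def elect_leader(service, candidates, lock_name="gfs-master"):
    """Each candidate races to acquire the lock; whoever succeeds
    is the leader, mirroring how GFS appoints a master via Chubby."""
    for candidate in candidates:
        if service.try_acquire(lock_name, candidate):
            return candidate
    return None

service = LockService()
leader = elect_leader(service, ["server-a", "server-b", "server-c"])
print(leader)  # server-a acquires first; the others would watch and retry
```

In a real deployment the losing candidates keep a watch on the lock and re-run the election when the holder fails, which is exactly the "losing and gaining leadership in short succession" behaviour the outage report describes going wrong.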

Google Cloud Outage: The Remediation

Existing network routes should continue to work normally when programming fails, Google Cloud noted in a blog today.

On this occasion it didn’t, because a “race condition in the code which handles leadership changes caused programming updates to contain invalid configurations, resulting in packet loss for impacted instances.”

The bug has been fixed “and a rollout of this fix was coincidentally in progress at the time of the outage…”
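The write-up does not describe the fix itself, but a common generic defense against this class of race is an epoch (fencing-token) check: each leadership term gets a strictly larger number, and the programming layer rejects updates stamped with an older term. The Python sketch below is purely illustrative; the class and method names are assumptions, not Google's code:

```python
import itertools

class ControlPlane:
    """Sketch: a network-programming layer that accepts an update only
    if it carries the latest leadership epoch, so a leader deposed by a
    rapid re-election cannot push a stale (invalid) configuration."""
    def __init__(self):
        self._epochs = itertools.count(1)
        self.current_epoch = 0
        self.config = None

    def become_leader(self):
        # Each new leadership term receives a strictly larger epoch.
        self.current_epoch = next(self._epochs)
        return self.current_epoch

    def apply_update(self, epoch, config):
        # Drop writes from a leader whose term has already ended.
        if epoch < self.current_epoch:
            return False
        self.config = config
        return True

cp = ControlPlane()
old_epoch = cp.become_leader()   # first leader elected
new_epoch = cp.become_leader()   # rapid re-election starts a new term
assert cp.apply_update(old_epoch, {"route": "stale"}) is False
assert cp.apply_update(new_epoch, {"route": "valid"}) is True
```

Without such a guard, two terms' updates can interleave, which is one way "programming updates [come] to contain invalid configurations".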

Google engineers were alerted to the bug at 16:30 US/Pacific, started investigating immediately, and began mitigation within 50 minutes. The networking control plane had fully recovered by 18:51.

“Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization,” it added.

See also: Redis Overload to Blame for 17-Hour Azure MFA Login Crisis
