A Google Cloud outage on October 22 was quickly linked on social media to a reported DDoS attack on AWS the same day. The two were not connected, the company promptly confirmed to Computer Business Review.
So what triggered the issue, which caused 100 percent packet loss to and from ~20 percent of instances in its us-west1-b zone for two-and-a-half hours? (It also affected Cloud SQL, Cloud VPN, and other services.)
Customers started losing access when the Google Cloud Networking control plane “experienced failures in programming its virtualised networking stack”, Google Cloud explained in an issue summary published today.
“New or migrated instances would have been unable to obtain network addresses and routes, making them unavailable,” it notes in the write-up, adding that “existing instances should have seen no impact; however, an additional software bug [was] triggered by the programming failure.”
In terms of underlying cause, Google today pointed the finger at a “failure in the underlying leader election system” (its “Chubby lock system”) which “resulted in components in the control plane losing and gaining leadership in short succession.”
“These frequent leadership changes halted network programming, preventing VM instances from being created or modified,” it said.
What’s “Chubby”?
The Chubby lock system is a way of automatically selecting which servers do what work in a diverse network of roughly similar servers, and handles what is known as the “distributed consensus problem.”
As Google explains in a detailed paper on the system: “The Google File System uses a Chubby lock to appoint a GFS master server, and Bigtable uses Chubby in several ways: to elect a master, to allow the master to discover the servers it controls, and to permit clients to find the master.
“In addition, both GFS and Bigtable use Chubby as a well-known and available location to store a small amount of meta-data; in effect they use Chubby as the root of their distributed data structures. Some services use locks to partition work (at a coarse grain) between several servers.”
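The lock-based election described above can be illustrated with a minimal sketch. It is not Chubby's API or Google's implementation: the `LockService` class, its method names, and the lease and epoch fields are all hypothetical, and the service is a single in-memory object rather than a replicated one. It only shows the general idea that whichever server holds an unexpired lock acts as leader, and that leadership changes when the lease lapses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LockService:
    """Hypothetical in-memory stand-in for a coarse-grained lock service."""

    holder: Optional[str] = None
    lease_expires_at: float = 0.0
    epoch: int = 0  # incremented every time the lock changes hands

    def try_acquire(self, candidate: str, now: float, lease_seconds: float) -> bool:
        """Grant or renew the lock unless another server holds a live lease."""
        held_by_other = self.holder is not None and self.holder != candidate
        if held_by_other and now < self.lease_expires_at:
            return False
        if self.holder != candidate:
            self.epoch += 1  # a new leader starts a new epoch
        self.holder = candidate
        self.lease_expires_at = now + lease_seconds
        return True

    def leader(self, now: float) -> Optional[str]:
        """Return the current leader, or None if the lease has lapsed."""
        return self.holder if now < self.lease_expires_at else None


lock = LockService()
lock.try_acquire("server-a", now=0.0, lease_seconds=10.0)   # server-a becomes leader
lock.try_acquire("server-b", now=5.0, lease_seconds=10.0)   # refused: lease still live
lock.try_acquire("server-b", now=12.0, lease_seconds=10.0)  # lease lapsed, server-b takes over
```

In this sketch, a lock service that repeatedly let leases lapse would make `leader()` flip between servers, which is the kind of rapid leadership churn the incident summary describes.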
Google Cloud Outage: The Remediation
Existing network routes should continue to work normally when programming fails, Google Cloud noted in a blog today.
On this occasion they did not, because a “race condition in the code which handles leadership changes caused programming updates to contain invalid configurations, resulting in packet loss for impacted instances.”
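Google has not published the code involved or the details of its fix. As a general illustration only, one common defence against updates racing with a leadership change is to tag every update with the leader's epoch and reject anything older than the newest epoch already seen (sometimes called a fencing token). The `RouteTable` class, `StaleLeaderError`, and the example addresses below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


class StaleLeaderError(Exception):
    """Raised when an update carries an epoch older than one already applied."""


@dataclass
class RouteTable:
    """Hypothetical route store that only accepts updates from the newest leader."""

    highest_epoch: int = 0
    routes: Dict[str, str] = field(default_factory=dict)

    def apply(self, epoch: int, updates: Dict[str, str]) -> None:
        # An update from a deposed leader may be based on stale state, so drop it
        # rather than overwrite routes that are currently working.
        if epoch < self.highest_epoch:
            raise StaleLeaderError(f"epoch {epoch} < {self.highest_epoch}")
        self.highest_epoch = epoch
        self.routes.update(updates)


table = RouteTable()
table.apply(1, {"10.0.0.0/24": "host-a"})      # leader in epoch 1 programs a route
table.apply(2, {"10.0.1.0/24": "host-b"})      # a new leader takes over in epoch 2
try:
    table.apply(1, {"10.0.0.0/24": "invalid"})  # late update from the old leader
except StaleLeaderError:
    pass                                        # existing routes are left untouched
```

The design choice here is to fail closed: a rejected update leaves the last known-good routes in place, which matches the stated expectation that existing routes keep working when programming fails.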
The bug has been fixed “and a rollout of this fix was coincidentally in progress at the time of the outage…”
Google engineers were alerted to the bug at 16:30 US/Pacific, started investigating immediately and began mitigation within 50 minutes. The networking control plane was fully recovered by 18:51.
“Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization,” it added.