Codeanywhere, a San Francisco-based company that provides a cloud-hosted coding collaboration platform, has blamed a Google Cloud Platform (GCP) outage for the disappearance of customers’ projects – with anger among its user base growing this week amid fears that potentially thousands of coding projects are gone for good.
Codeanywhere, founded in 2013, claims a user base of over 1.5 million.
The company provides an Integrated Development Environment (IDE) that lets users work on projects from a web browser – e.g. editing source code, building executables, and debugging – with work based in containers spun up on GCP, via Codeanywhere’s portal.
Users were alerted to emerging issues on October 31, when the company told customers in a social media post: “We have a blackout from GCP and [sic] trying to work with them on it to get it fixed asap. Very very sorry about this.”
Apart from a handful of responses to customers trying to get their code back, the company has not returned to Twitter to update affected customers with a formal comment since. Many users say they fear that their projects may be lost for good.
Jeff Gordon is among those affected.
“I use the service for my own projects, but have that work backed up. My students’ accounts, however, were not backed up prior to the outage, because it’s logistically difficult to set up repositories for them.
“I have about 20 student accounts with lost work, including several high school coding portfolio websites that students were planning on using to apply to college with eventually.”
These have now vanished into thin air.
He added: “There is no status page for Codeanywhere.
“Since they were initially claiming a quick fix, I waited one week to submit my ticket and did so on November 6th.
“The auto-response indicates that there is a ticketing system, but the link is dead. I’ve pointed this out to them on previous tickets, but have never received a response, and I’m not able to access or locate the ticketing system.”
On November 11th, Codeanywhere closed the ticket as unfixed, with no indication of when a fix – if there is to be one – would be coming.
The company has since charged Gordon his $50 subscription fee.
“I’m afraid to cancel, because I’m scared they’ll just delete my containers,” he told Computer Business Review.
“But I don’t really believe they’ll restore them at this point.”
(He adds: “I was using Cloud 9 until Amazon bought and shuttered the service. Though they tried to replace it with AWS containers, those didn’t work for my needs, since I need an easy sign-up process that doesn’t require billing information. When Cloud 9 closed, many educators moved to Codeanywhere, which offers relatively similar service. They’re basically the only one in this market space.”)
Codeanywhere Failure: Google Cloud’s Fault?
The desperately poor communication from Codeanywhere, of course, rests on the company’s own shoulders. (It does not provide a phone number. A number listed on Crunchbase for the company does not ring. Its founders, CEO Ivan Burazin and CTO Vedran Jukic, did not respond to requests for comment made via LinkedIn. A support ticket raised by Computer Business Review also went unanswered.)
And Google Cloud sources say they have struggled to reach the company themselves to help try to fix the problem.
The more charitably minded might argue, however, that whatever the company’s many apparent serious flaws, it was offering a functioning service until GCP suffered a pronounced case of cloud-wobble two weeks ago.
So, what happened?
The GCP outage that triggered the issue appears to have been “Google Compute Engine Incident #19008” – an incident that stretched on for over three days.
This started when eight GCP services, including Google’s Compute Engine, Kubernetes Engine, App Engine, Cloud Filestore and Cloud Composer, abruptly faced “up to a 100% error rate” for over four hours on October 31. GCP blamed a serious software bug that triggered a cascade of problems at the heart of the public cloud provider’s services.
Google Cloud provided a summary on November 8 which showcased how much “growing up” even major cloud companies need to do before they are fully trusted.* (While many enterprises have, of course, gone all-in on the cloud, there is no shortage of businesses that remain distinctly wary of it for precisely this kind of reason.)
Because this outage got messy.
As the summary details: “A performance regression introduced in a recent release of the [GCP] networking control software caused the service to begin accumulating a backlog of requests. The backlog eventually became significant enough that requests timed out, leaving some projects stuck in a state where further administrative operations could not be applied. The backlog was further exacerbated by the retry policy in the system sending the requests, which increased load still further.
“Manual intervention was required to clear the stuck projects.”
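The summary does not describe the retry policy in detail, so the following is only a general illustration of why retries matter here, not a reconstruction of GCP's system. A client that retries immediately on every timeout adds load to a service that is already falling behind; a common mitigation is capped exponential backoff with random jitter. In this minimal Python sketch, the function names, the 0.5-second base and the 30-second cap are all hypothetical:

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Delay before retry number `attempt` (0-based).

    The upper bound doubles with each attempt up to `cap`, and the actual
    delay is drawn uniformly below that bound ("full jitter").
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=5, sleep=time.sleep, rng=random):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Waiting a growing, randomised interval between attempts keeps many
    clients that failed together from all retrying at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying forever
            sleep(backoff_delay(attempt, rng=rng))
```

The point of the cap on attempts and the jitter is the same: a struggling service sees fewer, more spread-out requests, rather than a synchronised wave of retries on top of its existing backlog.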
Despite an automatic failover kicking in, an hour later the “overload condition returned and error rates again increased”, GCP said, as its engineers rushed to allot additional resources to the (unspecified) overloaded components.
They were ultimately forced to add a rate limit designed to throttle requests to the network programming distribution service, and to force a hard restart that allowed the team to begin focusing on the cleanup of “stuck” projects.
Manual intervention was needed to handle a backlog of customer operations (the fix, as GCP notes, “was unique to each failed operation type”). This took until 14:00 on November 2 (three days later) to work through.
GCP says it is taking six immediate steps to prevent a recurrence:
“Implementing continuous load testing as part of the deployment pipeline of the component which suffered the performance regression, so that such issues are identified before they reach production in future.”
“Rate-limiting the traffic between the impacted control plane components to avoid the congestion collapse experienced during this incident.”
“Further sharding the global network programming distribution service to allow for graceful horizontal scaling under high traffic.”
“Automating the steps taken to unstick administrative operations, to eliminate the need for manual cleanup after failures such as this one.”
“Adding alerting to the network programming distribution service, to reduce response time in the event of a similar problem in the future.”
“Changing the way the control plane processes requests to allow forward progress even when there is a significant backlog.”
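GCP has not said how its rate limiting is implemented. For readers unfamiliar with the idea, one widely used approach is a token bucket, which allows short bursts but caps the sustained request rate. The sketch below is a generic Python illustration under that assumption; the class name and parameters are hypothetical and have no connection to Google's control plane code:

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed now, False if it should be shed."""
        now = self.clock()
        # Top up tokens for the time elapsed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check `allow()` before forwarding each request and reject or delay the request when it returns `False`, so a surge from one component cannot push the next one into the kind of congestion collapse described in the summary.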
Whether it can track down Codeanywhere’s owners and help Jeff Gordon and his students get their work back remains an open question.
Customers, meanwhile, are being left dangling.
*As Microsoft Azure CTO Mark Russinovich put it in July, following a trio of Azure failures: “Outages and other service incidents are a challenge for all public cloud providers, and we continue to improve our understanding of the complex ways in which factors such as operational processes, architectural designs, hardware issues, software flaws, and human factors can align to cause service incidents.”