
GitHub Outage Impacts Millions of Developers: Are Gaps Between Staging and Production Still an Issue?

Updated 09:35. GitHub says the issue was resolved at 09:31 BST. Computer Business Review will update with a post-incident review when we have one.

A major GitHub outage has developers starting the week with gritted teeth, after the software development platform went down this morning.

GitHub says it is working on the outage, which began at 04:06 UTC (05:06 BST) and had lasted for over four hours as we published.

The incident has raised fresh questions about GitHub’s resilience in the wake of three separate outages in April 2020 alone.


“We are investigating reports of degraded performance and increased error rates,” the platform said early this morning.

The source of “elevated errors” was spotted at 6:53am BST. GitHub added at 8:18am BST that it was working on the recovery of its services.


GitHub attributed April’s three outages respectively to:

a) a misconfiguration of software load balancers that disrupted internal routing of traffic between applications that serve GitHub.com and the internal services they depend on;

b) a misconfiguration of database connections, related to ongoing data partitioning efforts, that “made it unexpectedly to production”; and

c) a networking configuration that was “inadvertently applied to our production network” (Yikes).

GitHub admitted in April that it had issues with its staging labs environment.

“This staging environment does not set up the databases and database connections the same way as the production environment. This can lead to limited testability of connection changes specific to the production environment. We will be addressing this issue in the coming months”, the company said.
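To illustrate the kind of gap GitHub describes, here is a minimal sketch of a parity check that lists where a staging configuration differs from production. The function name `config_drift` and every setting shown are hypothetical and chosen for illustration only; they do not reflect GitHub's actual configuration or tooling.

```python
def config_drift(staging, production, prefix=""):
    """Return a list of dotted key paths where two nested config dicts differ."""
    drift = []
    # Walk the union of keys so settings present on only one side are reported.
    for key in sorted(set(staging) | set(production)):
        path = f"{prefix}{key}"
        if key not in staging:
            drift.append(f"{path}: missing in staging")
        elif key not in production:
            drift.append(f"{path}: missing in production")
        else:
            s, p = staging[key], production[key]
            if isinstance(s, dict) and isinstance(p, dict):
                # Recurse into nested sections such as "database".
                drift.extend(config_drift(s, p, prefix=f"{path}."))
            elif s != p:
                drift.append(f"{path}: staging={s!r} production={p!r}")
    return drift


# Hypothetical settings, invented for this example.
staging = {"database": {"pool_size": 5, "partitioned": False}}
production = {
    "database": {"pool_size": 50, "partitioned": True, "replica_hosts": 3}
}

for line in config_drift(staging, production):
    print(line)
```

A check like this only catches differences that are visible in configuration. It would not, on its own, prove that a connection change behaves the same way under production load.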

GitHub runs most of its platform on its own bare metal infrastructure, with networking infrastructure “built around a Clos network topology with each network device sharing routes via Border Gateway Protocol (BGP).”

GitHub, bought by Microsoft in 2018 for $7.5 billion, is home to over 50 million developers. Given the workloads it supports and widespread reliance on it for high availability, large-scale outages like this can have a major impact.

Owner Microsoft, like many other large infrastructure providers, has also had to rapidly scale up its data centre infrastructure as workloads surged with the rise in remote working during the pandemic. It admitted in April that it had faced some supply chain issues after the outbreak.

The COVID-19 pandemic rocked the server hardware supply chain globally as factories around the world shut down just as large enterprises and hyperscalers needed to overhaul data centres. (Dropbox’s CTO said his company’s data center team “proactively swapped out 30,000 components in eight weeks” to safely reduce on-site staffing).

Chipmaker AMD meanwhile said in a Q1 earnings call that one unnamed cloud provider had added 10,000 servers to their data centres in just 10 days during the crisis, in a frantic bid to scale up their infrastructure as workloads soared.

GitHub’s problems, however, appear to be related more to gaps between its staging/canary environments and production.

This article is from the CBROnline archive: some formatting and images may not be present.

CBR Staff Writer

CBR Online legacy content.