Updated 09:35. GitHub says the issue was resolved at 09:31 BST. Computer Business Review will update this story with GitHub’s post-incident review when we have one.
A major GitHub outage has developers starting the week with gritted teeth, after the software development platform went down this morning.
GitHub says it is working on the outage, which had lasted for over four hours as we published, having begun at 04:06 UTC (05:06 BST).
The incident has raised fresh questions about GitHub’s resilience in the wake of three separate outages in April 2020 alone.
Work continues on the recovery of our services. https://t.co/RI6AL3tYM4
— GitHub Status (@githubstatus) July 13, 2020
“We are investigating reports of degraded performance and increased error rates,” the platform said early this morning.
The source of the “elevated errors” was spotted at 06:53 BST. GitHub added at 08:18 BST that it was “working on the recovery of our services”.
GitHub attributed April’s three outages respectively to:
a) a misconfiguration of software load balancers that disrupted internal routing of traffic between applications serving GitHub.com and the internal services they depend on;
b) a misconfiguration of database connections, related to ongoing data partitioning efforts, that “made it unexpectedly to production”; and
c) a networking configuration that was “inadvertently applied to our production network” (yikes).
GitHub admitted in April that it had issues with its staging labs environment.
“This staging environment does not set up the databases and database connections the same way as the production environment. This can lead to limited testability of connection changes specific to the production environment. We will be addressing this issue in the coming months”, the company said.
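The parity gap GitHub describes is easy to picture: settings that only exist, or only take certain values, in production are never exercised in staging, so connection changes pass testing and still fail when deployed. As a loose illustration only, with made-up setting names and nothing to do with GitHub’s actual tooling, a minimal Python sketch of a drift check between the two environments’ database connection settings might look like this:

# Hypothetical illustration: compare database connection settings between a
# staging and a production environment and report any keys that differ, the
# kind of drift that limits testability of connection changes.

STAGING = {
    "pool_size": 5,
    "read_replicas": 0,
    "partitioning": "disabled",
    "timeout_ms": 500,
}

PRODUCTION = {
    "pool_size": 200,
    "read_replicas": 3,
    "partitioning": "enabled",
    "timeout_ms": 500,
}


def config_drift(staging: dict, production: dict) -> dict:
    """Return settings that differ (or are missing) between the two environments."""
    keys = staging.keys() | production.keys()
    return {
        key: (staging.get(key), production.get(key))
        for key in keys
        if staging.get(key) != production.get(key)
    }


if __name__ == "__main__":
    for key, (stage_value, prod_value) in sorted(config_drift(STAGING, PRODUCTION).items()):
        print(f"{key}: staging={stage_value!r} production={prod_value!r}")

Running the sketch flags pool_size, read_replicas and partitioning as drift, exactly the sort of production-only difference a staging run would never catch.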
GitHub runs most of its platform on its own bare metal infrastructure, with networking infrastructure “built around a Clos network topology with each network device sharing routes via Border Gateway Protocol (BGP).”
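That design helps explain why a single bad networking change is so consequential: in a two-tier Clos (leaf-spine) fabric, every leaf switch peers with every spine and the spines re-advertise what they learn, so one bad advertisement propagates fabric-wide. The Python below is only a toy model of that propagation pattern, with invented device names and prefixes, not GitHub’s actual network or BGP configuration:

# Toy model of route sharing in a two-tier Clos (leaf-spine) fabric.
# Each leaf advertises its local prefixes to every spine; each spine
# re-advertises everything it has learned back down to every leaf.

from collections import defaultdict

LEAF_PREFIXES = {
    "leaf-1": ["10.0.1.0/24"],
    "leaf-2": ["10.0.2.0/24"],
    "leaf-3": ["10.0.3.0/24"],
}
SPINES = ["spine-1", "spine-2"]

# Leaves advertise their prefixes to every spine (BGP-style peering).
spine_routes = defaultdict(set)
for leaf, prefixes in LEAF_PREFIXES.items():
    for spine in SPINES:
        spine_routes[spine].update(prefixes)

# Spines re-advertise learned routes to every leaf, so each leaf ends up
# with a route to every other leaf's prefixes.
leaf_routes = defaultdict(set)
for spine, prefixes in spine_routes.items():
    for leaf in LEAF_PREFIXES:
        leaf_routes[leaf].update(p for p in prefixes if p not in LEAF_PREFIXES[leaf])

for leaf, routes in sorted(leaf_routes.items()):
    print(leaf, "->", sorted(routes))

The same fan-out that gives the fabric its redundancy also means a misapplied configuration, like the one GitHub described in April, reaches every device that shares routes with it.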
GitHub, bought by Microsoft in 2018 for $7.5 billion, is home to over 50 million developers. Given the workloads it supports and the widespread reliance on it for high availability, large-scale outages like this can have a major impact.
Owner Microsoft, like many other large infrastructure providers, has also had to scale up its data centre infrastructure rapidly as the pandemic pushed staff into remote working and workloads surged, admitting in April that it had faced some supply chain issues after the outbreak.
GitHub is temporarily down
I just realized how heavily this affects my programming routine since part of my digital brain extension is missing. (same for Stackoverflow, Google, YouTube, ..)
We are all cyborgs, we just don't realize it untill our digital extensions stop working! pic.twitter.com/IjM5aF7o5K
— Xander Steenbrugge (@xsteenbrugge) July 13, 2020
The COVID-19 pandemic rocked the server hardware supply chain globally as factories around the world shut down just as large enterprises and hyperscalers needed to overhaul data centres. (Dropbox’s CTO said his company’s data center team “proactively swapped out 30,000 components in eight weeks” to safely reduce on-site staffing).
Chipmaker AMD meanwhile said in a Q1 earnings call that one unnamed cloud provider had added 10,000 servers to its data centres in just 10 days during the crisis, in a frantic bid to scale up its infrastructure as workloads soared.
GitHub’s issues, however, appear to relate less to hardware scaling and more to gaps between its staging/canary environments and production.
See also: Microsoft Feels the Squeeze: Throttles 365 Services, Migration Bandwidth