April 16, 2020

Cloudflare Admits Outage Came After Technician Unplugged Cables

Oops?

By CBR Staff Writer

A major Cloudflare outage late Wednesday was caused by a technician unplugging a switchboard of cables that provided “all external connectivity to other Cloudflare data centers” as they decommissioned hardware in an unused rack.

While many core services, including the Cloudflare network and the company’s security services, were left running, the error left customers unable for over four hours to “create or update” Cloudflare Workers (the company’s serverless computing platform), log into their dashboard, use the API, or make configuration changes such as updating DNS records.

CEO Matthew Prince described the series of errors as “painful” and admitted it should “never have happened”. (The company is well known, and generally appreciated, for providing sometimes wince-inducingly frank post-mortems of issues.)

Cloudflare CTO John Graham-Cumming admitted to fairly substantial design, documentation and process failures, in a report that may worry customers.

He wrote: “While the external connectivity used diverse providers and led to diverse data centers, we had all the connections going through only one patch panel, creating a single physical point of failure”. He acknowledged that poor cable labelling also played a part in slowing a fix, adding: “we should take steps to ensure the various cables and panels are labeled for quick identification by anyone working to remediate the problem. This should expedite our ability to access the needed documentation.”

The wheels come off at Google Cloud

How did it happen to start with? “While sending our technicians instructions to retire hardware, we should call out clearly the cabling that should not be touched…”

Cloudflare is not alone in suffering recent data centre borkage.

Google Cloud recently admitted that “evidence of packet loss, isolated to a single rack of machines” initially seemed to be a mystery, with technicians uncovering “kernel messages in the GFE machines’ base system log” that indicated strange CPU throttling.

A closer physical investigation revealed the answer: the plastic casters at the rear of the rack had failed, and the machines were “overheating as a consequence of being tilted”.
