Leap second causes ‘panic’ for Cloudflare servers

The leap second added at the end of 2016 caught out Cloudflare, causing some of its servers to fail.

The web firm, which says “we make the Internet work the way it should”, offers CDN, DNS, DDoS protection and security services, but found that some of its servers failed to handle the added second.

As a result, users saw an error message saying that servers could not be reached instead of the page they wanted to visit.

Cloudflare said that it patched the most affected machines within 90 minutes and explained the cause by saying: “At midnight UTC on New Year’s Day, deep inside Cloudflare’s custom RRDNS software, a number went negative when it should always have been, at worst, zero.

“A little later this negative value caused RRDNS to panic. This panic was caught using the recover feature of the Go language. The net effect was that some DNS resolutions to some Cloudflare managed web properties failed.”
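The panic-and-recover behaviour described in the quote can be illustrated with a minimal Go sketch. The function names `pickWithJitter` and `resolve` are hypothetical, and the article does not say which call panicked inside RRDNS; `rand.Int63n` is used here only because it is a standard-library function documented to panic when given a non-positive argument.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickWithJitter is a hypothetical helper that draws a random value in [0, n).
// rand.Int63n panics when n <= 0, standing in for "a number went negative".
func pickWithJitter(n int64) int64 {
	return rand.Int63n(n)
}

// resolve is a hypothetical request handler. The deferred function uses
// recover to turn a panic into an ordinary error, so one bad request
// fails without taking down the whole process.
func resolve(n int64) (result int64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("resolution failed: %v", r)
		}
	}()
	return pickWithJitter(n), nil
}

func main() {
	// A negative input triggers the panic, which is caught and reported.
	if _, err := resolve(-1); err != nil {
		fmt.Println("recovered:", err)
	}
	// A valid input still works after the earlier panic was recovered.
	v, err := resolve(10)
	fmt.Println(v, err)
}
```

Recovering keeps the process alive, but the individual request still fails, which matches the article's account of some DNS resolutions failing while the service as a whole stayed up.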

Servers were unable to handle the leap second.

Cloudflare customers use the company’s DNS service to serve the authoritative answers for their domains. In effect, the company sits between websites and their visitors, speeding up access to a site while also blocking malicious traffic.

The problem is said to have affected about 1% of the requests its servers processed during the glitch.

Analysis of the problem revealed that a mismatch between the time-stamps Cloudflare servers were expecting and the ones they got caused the system to ‘panic’.

The trigger for the issue was the leap second added at the end of 2016. Leap seconds compensate for a slowdown in the earth’s rotation and keep Coordinated Universal Time (UTC), the world’s civil time standard, in step with solar time.

Cloudflare said: “This problem was quickly identified. The most affected machines were patched in 90 minutes and the fix was rolled out worldwide by 0645 UTC. We are sorry that our customers were affected, but we thought it was worth writing up the root cause for others to understand.”
This article is from the CBROnline archive: some formatting and images may not be present.