November 24, 2015

7 ways disastrous downtime can hit data centres

List: There are only three parties to blame when a site goes down: humans, machines and nature.

The threat that data centre failures pose to humans was highlighted last week by Ed Ansett, chairman of i3 Solutions Group. He said: "We [data centre industry] still have a long way to go. It is only a matter of time until failure in our industry starts killing people."

CBR lists seven reasons that lead to downtime.

1. Generators failing to start

Ansett gave an example of a serious data centre failure caused by a malfunction in the generators, but without disclosing the name of the affected operator.

He said: "It was a hot summer day and there was a utility power outage. The data centre was at full load, 7.2MW. The site had four 2.5MW generators installed, N+1 configured. One generator failed to start, but the hub was running on three generators. Thirty minutes later another generator failed.

"The data centre was then at 5MW capacity supporting a 7.2MW load. The remaining generators overloaded and the cooling plant had no power. IT equipment started to shut down over temperature. The data centre ran on UPS for another 30 minutes, 2N, 15 minutes each side. The total data centre failure came 30 minutes later."

It took six hours to restore the utility supply, and the data centre was only fully brought back online eight hours later.
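The capacity arithmetic in Ansett's example can be checked with a minimal sketch. The figures (four 2.5MW generators, a 7.2MW load) come from the quote above; the function names are hypothetical and chosen only for illustration.

```python
# Figures taken from Ansett's example; function names are illustrative only.
UNIT_MW = 2.5      # rating of each generator
INSTALLED = 4      # generators on site (N+1 configured)
LOAD_MW = 7.2      # data centre load at the time of the outage


def surviving_capacity_mw(unit_mw: float, installed: int, failed: int) -> float:
    """Total capacity of the generators that are still running."""
    return unit_mw * (installed - failed)


def can_carry_load(capacity_mw: float, load_mw: float) -> bool:
    """True when the running generators can supply the full load."""
    return capacity_mw >= load_mw


if __name__ == "__main__":
    for failed in range(3):
        capacity = surviving_capacity_mw(UNIT_MW, INSTALLED, failed)
        status = "OK" if can_carry_load(capacity, LOAD_MW) else "OVERLOAD"
        print(f"{failed} failed: {capacity:.1f}MW vs {LOAD_MW}MW load -> {status}")
```

With one generator down the site still has 7.5MW against a 7.2MW load, which is what N+1 is meant to guarantee. The second failure leaves 5MW, and the load can no longer be carried.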

Generator failures account for an average cost of $463,890, according to Emerson.

2. Uncoordinated circuit protection & switching errors

Inadequately rated or uncoordinated protective equipment can cause extensive damage to systems, lead to prolonged downtime, and even result in personnel injury.

Ansett said that uncoordinated circuit protection is primarily a design and commissioning problem.

Standard operation switching errors are caused by humans and are the third most common occurrence in data centres (51%), according to Emerson.

Loose connections, for example in the switchgear, can also send the whole data centre into darkness.

3. UPS battery failure & exceeded capacity

An Emerson survey of 450 data centre operators has found that UPS battery failure is the most common reason for an outage (affecting 55% of those surveyed).

UPS systems provide clean, regulated and continual power to IT equipment, utilising batteries to bridge the gap between any mains failure and the start-up of the generator.

Exceeded UPS capacity is the second most common reason for data centres to fail, according to the same Emerson study, which found that 53% of respondents have been affected by this sort of problem.

As IT demand grows, data centre infrastructure and services have to grow with it rather than overload existing systems.

The average cost of a UPS failure to data centre operators is $687,700, according to Emerson.

4. Water leaks

Water and IT are still a no-go combination, but water in data centres has led to some outages. In Emerson’s study, 35% of those surveyed said this was the reason behind some of their failures.

Water leaks – and moisture – can be caused by a series of events: weather, broken pipes, computer room air conditioning (CRAC) leaks and so on.

These sorts of issues can be prevented by ensuring crucial parts of the IT system are sealed and by installing monitoring systems that detect water.

Water, heat or CRAC failure cost colos on average $489,100.

5. Maintenance operation errors

Poor maintenance of data centres, even of simple things like batteries or UPS systems, can have serious consequences. Robust programmable logic controllers (PLCs), used in many industrial control and safety applications, can help to improve uptime.

In August 2009, Internap Network Services (INAP) saw its Boston data centre go offline due to poor battery maintenance.

Internap said in a statement at the time that the failure was caused by a utility company power interruption that caused the DC plant to fail over to battery backup.

6. Design errors

According to Schneider Electric, for years, the data centre industry has accepted that human operational error, not poor data centre design or engineering, is the number one cause of data centre downtime.

Not including the operations team in the facility’s design is colos’ first big mistake. A second mistake is to rely too much on data centre design. SE said that providers need to fully qualify the people who will be performing data centre operations from the start. Humans in this case take centre stage.

Other mistakes include failure to correctly address the staffing requirement, failure to train and develop staff, failing to consistently drill and test skills, and failure to overlay operations programs with documented processes and procedures.

On top of this, there is still a failure to implement appropriate processes and procedures in the design space, to develop and implement Quality Systems, and to use software management tools, like control systems that help to keep things running by intelligently measuring performance on an ongoing basis.

7. Natural disasters

Natural disasters cannot be blamed on humans or machines: "they are an act of God," Ansett said.

Jumbo colo facilities are usually built in areas where natural disasters – like hurricanes, earthquakes or monsoons – rarely happen. However, a large number of data centres are still located in critical areas.

For example, Hurricane Sandy in 2012 was powerful enough to shut down several hubs in New York. Those that did not get flooded lost their power.

In the wake of the disaster, local electricity supplier Consolidated Edison had to shut down its power grid in lower Manhattan to avoid further damage to their data centre. Thousands of customers were affected.

Emerson has found that on average, operators are faced with a bill of $395,065 for weather-related events.

What are the costs?

Data centre failures – aside from the potentially deadly threat they may pose to humans in the future – represent a big cost to operators. Emerson has found that data centre failures in 2013 cost 41% more per minute than in 2010: more than $7,900 per minute, up from more than $5,600. The highest cost for a single organisation was $1.7 million.
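The 41% figure can be reproduced from the two per-minute costs quoted by Emerson. The sketch below uses the rounded figures from the text, so the result is approximate; the function name is hypothetical.

```python
# Rounded per-minute outage costs quoted from Emerson in the text above.
COST_PER_MINUTE_2010 = 5_600
COST_PER_MINUTE_2013 = 7_900


def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, expressed as a percentage."""
    return (new - old) / old * 100


if __name__ == "__main__":
    increase = percent_increase(COST_PER_MINUTE_2010, COST_PER_MINUTE_2013)
    print(f"Cost per minute rose by about {increase:.0f}% between 2010 and 2013")
```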