Q&A: Identifying the Cause of IT Outages Early

When the London Stock Exchange (LSE) suffered its worst outage in seven years last Thursday, it was just the latest in a series of recent IT glitches that have hit a range of sectors.

The software issue, which prevented LSE members from entering orders into the auction system, delayed the opening of trading by just one hour.

Coming just days after hardware issues caused a blackout of Visa’s European network, however, and amid TSB’s ongoing core banking system migration issues, it focussed minds anew on the challenges of keeping IT systems stable.

Computer Business Review asked Dave Anderson, a digital performance expert at software intelligence specialists Dynatrace, what lessons can be learned from the recent outages.

Finding the problem can be like spotting the needle in a whole lot of disparate haystacks.

What Are Your Thoughts on the LSE Outage?

The glitch that delayed the opening of the London Stock Exchange (LSE) was certainly just the latest example of the immense challenge that every organisation faces in keeping modern digital services up and running. In recent months, we’ve seen a whole host of IT outages, from train-ticketing sites going down, to airport systems causing flight delays, online banking problems, and messaging apps going offline.

Whatever the cause, at the root of all these problems is the heavy reliance that today’s businesses have on software delivery.

This is of global concern, and CIOs are struggling to cope with application modernisation, cloud migration and the basics of keeping systems running.

That’s why organisations are increasingly turning to monitoring and intelligence platforms that can provide real-time situational awareness. At the heart of these platforms is AI that can pinpoint and even predict problems before they hit. To take this even further, some organisations are already using that intelligence to enable application self-healing, which removes the need for manual human intervention.

This can’t completely remove the possibility of an IT outage, but real-time software intelligence can help to minimise the frequency and longevity of these incidents.

There Seems to Be a Growing Assumption in Some Circles that Cloud Migration Resolves All These Issues. Do You Agree?

Taking full advantage of the benefits of cloud requires more than just “lifting and shifting” everything to it.

For true digital transformation, companies need to break their old, monolithic applications into smaller services, giving them more agility and making it easier to scale. This will be a gradual process involving multiple cloud technologies and providers, as different approaches are more appropriate for different functionalities.

The result is that IT will be required to manage a hybrid, multi-cloud environment for the foreseeable future, bringing with it enormous complexity and constant change, so much so that it is beyond human ability to understand everything and to maintain visibility and control.

That’s why it’s so important to have highly automated, AI-powered monitoring and analytics to ensure performance and availability of these critical services.

You Mentioned Application “Self-Healing” Earlier. Could You Expand on That?

Application self-healing is where IT teams build automated processes into their IT systems to ensure that performance problems are identified in real-time and then instantly resolved, without the need for human intervention.

That process is underpinned by deterministic AI that can identify problems by baselining ‘normal’ behaviour and instantly detecting any deviation from that along with the root cause of the problem. This creates a type of software intelligence that allows systems to learn from previous problems and then automatically resolve recurring issues based on the solution that was identified previously.

That means that potential problems are nipped in the bud before the user feels any impact or the issue escalates into a full-scale outage.
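
As a rough illustration of the idea, here is a minimal Python sketch of baselining and self-healing. It is not how Dynatrace or any particular product works: the simple rolling mean and standard deviation stand in for the deterministic AI described above, the root cause is assumed to be supplied by some separate diagnosis step, and every name in it (`BaselineMonitor`, `REMEDIATIONS`, `self_heal`, the latency figures) is hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class BaselineMonitor:
    """Tracks a rolling baseline for one metric and flags deviations."""

    def __init__(self, window=30, threshold=3.0):
        self.samples = deque(maxlen=window)  # recent 'normal' readings
        self.threshold = threshold           # allowed deviation, in std devs

    def observe(self, value):
        """Return True if value deviates from the baseline; otherwise learn it."""
        if len(self.samples) >= 5:  # need a few readings before judging
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            if abs(value - mu) > self.threshold * max(sigma, 1e-9):
                return True  # anomalous readings are kept out of the baseline
        self.samples.append(value)
        return False


# Hypothetical registry: a previously diagnosed root cause mapped to its fix.
REMEDIATIONS = {
    "connection_pool_exhausted": lambda: "restarted connection pool",
}


def self_heal(root_cause):
    """Apply the stored fix for a known root cause, or escalate to a human."""
    action = REMEDIATIONS.get(root_cause)
    return action() if action else "escalated to on-call engineer"


monitor = BaselineMonitor()
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 450]  # made-up readings
flags = [monitor.observe(v) for v in latencies_ms]
```

In this sketch only the final reading is flagged, and a recurring problem with a known fix is resolved automatically while anything unrecognised still goes to a person.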

Is There a Typical Culprit Behind an Outage like the LSE’s?

To be honest, IT outages can be caused by anything and anyone – there’s no common culprit. People are more reliant on IT than ever before, but the technology and systems we rely on are becoming increasingly complex and also dynamic.

This means that an outage could be caused by any one of a myriad of factors, from developers deploying updates to code, to third-party services, or even a cyber-attack.

And because these systems are dynamic, the virtual server or host where the problem occurred may not even be running any more by the time you find out there’s a problem. Finding the source of problems used to be like finding a needle in a haystack, but now there are a hundred haystacks and they’re all swirling around inside a tornado.

This reinforces the need for monitoring built on deterministic AI. This technology can help people make sense of complex and dynamic environments to identify and resolve problems before they affect user experience.
This article is from the CBROnline archive: some formatting and images may not be present.

CBR Staff Writer

CBR Online legacy content.