How to prevent your software update from triggering a world-stopping outage

The world is woven ever more tightly together by digital networks. It takes just one bad software update to undo that unity, says Dynatrace’s Bob Wambach. (Image: Shutterstock)

The increasing frequency and severity of software outages illustrate how far many aspects of our lives, and the stability of commercial organisations, now depend on the integrity of one or more software platforms. While the cause of each outage may vary, there is an overarching trend behind them. Modern IT systems and cloud environments have become too complex to control and manage using siloed toolsets. In fact, 86% of technology leaders say their technology stack has gone beyond human ability to manage, which makes it easier to make mistakes or overlook issues.

As worrisome as the potential revenue loss are the real-world consequences that outages have for those who depend on digital services. For example, if a payment system at a supermarket is unavailable for any length of time, consumers may be unable to buy essential groceries or fill up their cars. A technical glitch at a hospital could delay patients from receiving life-saving care.

Digital dominoes

Organisations today exist in a world where IT systems behave very differently to the way they once did. As businesses continue to transform, their digital environments have become hyper-connected. A single disruption can trigger a chain reaction, rippling across multiple interconnected systems and services. Organisations can no longer think about the health of their systems in silos; they must consider how those systems interact across hybrid and multicloud environments and third-party services.

Take e-gates, for example. In partnership with the government, airports have introduced technology to ease the flow of travellers, reduce the reliance on staff, and have a more accurate view of who is entering and leaving the country. However, even if the e-gates themselves work, there is a whole chain of events that could impact the user experience. A flight may be delayed, disrupting planned passenger flow and leading to long queues and a poor experience for passengers.

If the airport monitors the user experience holistically – analysing the health of the e-gates in concert with other factors, such as flight arrival times and passenger footfall – it can make better decisions to optimise travellers’ journeys.

Beyond the reactive

There is no denying that the digital world is becoming more complex, but organisations need to continue to innovate without compromising service reliability or introducing unforeseen risk to their customers or business.

This can best be achieved with a proactive approach to managing the health of digital services. Observability platforms aim to provide an array of monitoring, analytics and automation capabilities that enable teams to reduce the risk of outages and minimise the impact when outages do occur. For example, synthetic monitoring, which runs scripted checks against a service on a schedule, can help teams detect and resolve potential user experience issues before they become an outage, and act quickly if an incident does occur.
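To make the idea of a synthetic check concrete, the following is a minimal Python sketch rather than any vendor's implementation. The names run_synthetic_check, should_alert and CheckResult, the two-second latency budget and the three-failure alert rule are all assumptions chosen for illustration. The probe is passed in as a plain function so that the same logic could wrap an HTTP request, a scripted login or a test payment.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CheckResult:
    name: str
    ok: bool
    latency_s: float
    detail: str


def run_synthetic_check(
    name: str,
    probe: Callable[[], bool],
    max_latency_s: float = 2.0,  # assumed latency budget, not a standard value
    clock: Callable[[], float] = time.perf_counter,
) -> CheckResult:
    """Run one scripted probe and classify it as healthy or unhealthy."""
    start = clock()
    try:
        passed = bool(probe())
        detail = "passed" if passed else "probe reported failure"
    except Exception as exc:  # a crashing probe counts as a failed check
        passed = False
        detail = f"probe raised {type(exc).__name__}: {exc}"
    latency = clock() - start

    # A correct but slow response is still treated as a failed check.
    if passed and latency > max_latency_s:
        passed = False
        detail = f"too slow: {latency:.2f}s exceeds {max_latency_s:.2f}s"
    return CheckResult(name, passed, latency, detail)


def should_alert(results: Sequence[CheckResult], consecutive_failures: int = 3) -> bool:
    """Alert only when the most recent N checks all failed, to damp one-off blips."""
    recent = list(results)[-consecutive_failures:]
    return len(recent) == consecutive_failures and all(not r.ok for r in recent)
```

In practice a scheduler would call run_synthetic_check every minute or so from several locations and pass the accumulated results to should_alert. Requiring several consecutive failures is one simple way to trade a little detection speed for fewer false alarms.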

Seeing through complexity

Part of the challenge in large IT environments is that there are so many problems that can lead to IT outages, including hardware failures, software bugs, cyber-attacks, and human error. As we’ve recently seen, even a routine software update can trigger a major point of failure. Organisations need a way to see the smoke before the fire starts to burn and take preventative action.

In this respect, AI-driven approaches to monitoring and observability are essential. Such solutions, if deployed correctly, can give teams real-time insights into system health and help them to prioritise the actions they take to minimise disruption during an incident.

In addition to revealing the source and cause of problems, these insights need to illustrate the impact of outages so that IT leaders have the information the C-suite needs to keep shareholders and other key stakeholders informed of their response efforts.

But it’s not enough to know that an application was offline for a given length of time – business leaders need to understand the impact on the outcomes they are measured against, such as the number of customers impacted, or the amount of revenue lost. Real User Monitoring (RUM) capabilities can be invaluable for meeting this need, giving teams a detailed view of user journeys and conversion rates to better understand the financial impact of an incident.
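As a rough illustration of how user-journey data might be turned into the business figures described above, the sketch below estimates lost orders and revenue for an incident window. It is a simplified model under stated assumptions, not a description of how any RUM product calculates impact: the IncidentWindow fields, the estimate_impact function, and the baseline conversion rate and average order value inputs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IncidentWindow:
    sessions: int         # user sessions observed during the incident
    failed_sessions: int  # sessions that hit an error
    orders: int           # orders actually completed during the incident


def estimate_impact(
    window: IncidentWindow,
    baseline_conversion_rate: float,  # assumed: typical orders per session
    average_order_value: float,       # assumed: typical revenue per order
) -> dict:
    """Compare observed orders with what the baseline would have predicted."""
    expected_orders = window.sessions * baseline_conversion_rate
    # Never report a negative loss if the incident window outperformed baseline.
    lost_orders = max(0.0, expected_orders - window.orders)
    return {
        "users_affected": window.failed_sessions,
        "lost_orders": lost_orders,
        "estimated_lost_revenue": lost_orders * average_order_value,
    }
```

For example, 10,000 sessions at a 3% baseline conversion rate would normally yield about 300 orders; if only 100 were completed and the average order is worth 50, the estimate is roughly 200 lost orders and 10,000 in lost revenue. The model only counts users who reached the service, so it is likely to understate the impact of an outage that stops sessions from starting at all.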

Ideally, then, software providers should prevent the kind of crippling, sector-wide outages the world is increasingly witnessing. Every organisation should consider the types of incidents that could impact their services and identify how they can be ready to respond quickly and minimise disruption when the next major outage strikes. To succeed, they need to change the way they manage and deliver IT services, taking a proactive approach supported by a holistic view of the business. Those organisations that succeed in making this shift will likely be amongst those leading the pack in an increasingly connected digital future.

Bob Wambach is the vice president for product portfolio at Dynatrace