An IBM Cloud outage that hit 80 data centres globally for well over three hours late Tuesday has been blamed by Big Blue on an unnamed “issue introduced by a 3rd party provider” that it says it fixed by “adjusting routing policies”.
The sweeping outage began on June 9 at 11.00pm and was fixed by June 10, 2.39am, IBM said in an update posted at 12.18pm BST.
Rubbing salt in the wound for customers, the IBM status page is also served on IBM Cloud and was returning an internal service error to concerned users. (This is a surprisingly common issue that, naturally, means when there is an outage, people can’t learn a great deal about it…)
IBM, when pressed for comment by Computer Business Review, merely told us: “All IBM Cloud services have been restored”.
We’ll eagerly await the autopsy.
(Quite how a third-party provider managed to knock even one multi-carrier data centre offline, let alone a global network, remains an open question; some observers have suggested that it may have involved a BGP hijacking or a routing mistake by a major carrier.)
Updated June 11 09.00: IBM says an “external network provider flooded the IBM Cloud network with incorrect routing, resulting in severe congestion of traffic and impacting IBM Cloud services and our data centers. Mitigation steps have been taken to prevent a reoccurrence. Root cause analysis has not identified any data loss or cybersecurity issues.”
IBM Cloud promises “global load balancing to ensure a redundant, highly available platform is available for you to host your workloads”.
The outage forced customers to turn to Twitter, and to companies whose services run on IBM Cloud, for news. Autopilot was among those that piped up to tell customers that IBM had told it the outage “appears to be a networking issue”.
IBM Cloud, which has limited market share compared to the hyperscalers, is mothballing data centres in Dallas, Houston, Seattle and Melbourne this year as part of a modernisation strategy.
It said in a June 9 status update: “We have made significant investments in rolling out new datacenters and Multizone Regions (MZRs) designed to deliver a more resilient architecture with higher levels of network throughput and redundancy. As part of this modernization strategy, we have determined it is necessary to close select older datacenters unsuitable for upgrading.”
Customers will need to migrate workloads to “one of our new IBM Cloud datacenters to avoid service interruptions”.
They’ll also need to cancel old servers after migration. These otherwise will, IBM notes, “continue to be invoiced until cancelled”.
Fail-overs failing this badly are a rarity, however.
More to follow.
Know more about the outage? Get in touch on claudia dot glover at cbronline dot com