The first of a two-part article from Computer Business Review, a sister publication.
The larger the project, the greater the risk of catastrophic failure. Denise Danks asks why so many mammoth IT projects go badly wrong.
Failure is hard to disguise when it is writ large all over the world’s largest sporting event, with the world’s press on hand to record every miserable second. With the attention of the world upon it, IBM flew 2,500 users and their partners to the 1996 Olympic Games in Atlanta, Georgia, to impress them with its technology and systems integration skills. To run the Games’ information system, IBM had put in place 750 person-years of software support and analysis, four S/390 servers, 16 RS/6000s and 80 AS/400s. There were 7,000 PCs, and 1,000 desktop printers ran on 300 local area networks. A PC ‘factory’ was built on site to commission and decommission the 7,000 PCs, and to carry out hardware integration with 13,000 telephones, 11,500 televisions, 6,000 pagers, 9,500 mobile radios, 80,000 cable installations and 35 accreditation stations. Ten different companies supplied 450 security access points, including Swatch (timing), Xerox (document reproduction), BellSouth (networking), Sensormatic (security), AT&T (telecommunications), Motorola (paging), Panasonic (monitors), and Kodak (image processing). The IBM Technology Operations Center overseeing this huge system encompassed 200 systems staff and 5,000 technical volunteers.
IBM crowed about its $80 million ‘in kind’ sponsorship of the Olympics in TV and radio ads throughout the Games. So responsibility for the systems that insisted that a gymnast was 97 years old and an Angolan basketball player was three feet tall was never in doubt, despite the ads in USA Today and The Wall Street Journal being pulled. In fact, the only result that came through accurately and fast was that IBM’s performance was an unmitigated public relations disaster. What happened? IBM’s results management system, which fed results to 11 major national and international wire news services, used slow 9600 baud modems. Data was transmitted serially, with the consequence that information arrived late: on the first day, for example, it came through so late that the US Sunday papers could not publish that day’s results. There were programming errors on the news feed, and the Olympic World Wide Web system and local information kiosks were also affected. The post mortem revealed that the system had never been tested live because none of the agencies involved wanted to pay to test network communications, and so testing was carried out by diskette. IBM – which is apparently planning to spend more time verifying user requirements for the 1998 Winter Games in Japan – counted the cost in time, money, inconvenience and embarrassment. It was lucky. Sometimes the cost of IT failure is measured in an unsalvageable project and even in lost human lives.
At Kourou, French Guiana, on the morning of 4 June 1996, disaster struck after poor visibility delayed maiden flight 501 of the European Space Agency’s Ariane 5 launcher. After a four-hour delay, the Vulcain engine and the two solid boosters were ignited. At an altitude of 3,700 meters, 37 seconds after lift-off, the launcher veered off its flight path and exploded, taking its valuable scientific payload with it. During the Ariane 5 Development Programme, there had been neither adequate analysis nor adequate testing of the inertial reference system (IRS) or of the complete flight control system. The repercussions of this lapse were confirmed by the subsequent inquiry, which found specification and design errors in the software of the IRS. As a result, when the rocket was airborne, the software thought that it was still on the ground and compensated for the tilt by aborting the mission. There had been a complete loss of guidance and attitude information in flight. In other words, the rocket had no idea where it was. Unfortunately, this kind of disaster is not uncommon. At least three US air disasters have been blamed on US government air traffic control system development delays. Two years after the start of the US Federal Aviation Administration’s (FAA’s) $4.3 billion advanced automation system (AAS) air traffic control project in 1990, the General Accounting Office warned that delays in implementing a key component of the system could affect the FAA’s ability to safely handle increases in traffic into the next century. The project was delayed by 19 months, with the FAA blaming underestimates of the development and testing time for the software, as well as the difficulty of accommodating changes in requirements. By 1994, the FAA had declared AAS ‘out of control’. The cost of the project was forecast at more than $7 billion, with a further delay of 31 months.
A more modest AAS program, display system replacement, is set to become operational in 1998, some 17 years after it was first defined as a $2.5 billion program, shortly after President Reagan fired 11,000 striking controllers. It is already running $5.6 billion over budget and, until 1998, the public will have to rely on a breakdown-prone air traffic control system based on computer systems that are 30 years old. Unions claim that the FAA has cut the ranks of its own technicians by half and is increasingly turning to private electronic service contractors to maintain the old equipment, potentially creating a danger to security. The FAA, meanwhile, insists that the existing system is safe – mid-air collisions have declined – and that no accident has been caused by the old equipment. Whether the old system does affect safety is in dispute, however. FAA data shows that flight delays attributable to the system rose 19% in the first nine months of 1996 compared to the same period in 1995, and are running at a rate of a quarter of a million a year. The question is, would new technology have prevented the 1990 collision between two Northwest Airlines jets on a Detroit runway, the 1991 collision between a US Air 737 and a Skywest commuter plane at Los Angeles International, and the 1994 crash of a US Air DC9 near Charlotte, NC?
There is one thread common to all these examples: major IT projects incur significant risks. Figures produced by many research studies bear this out with relentless consistency. Projects overrun, are often over budget, fail to perform as expected, strain the participating organizations, and are too often canceled prior to completion after large sums of money have been spent. The Standish Group, a Boston-based market research firm, found that more than 84% of US IT projects fail to meet original expectations and 94% of those started have had to be restarted. The average project cost for large organizations is put at $2.3 million; more than 50% of projects cost almost double the original budget, and $80 billion is spent on canceled projects.
Similarly, Coopers & Lybrand’s Computer Assurance Services risk management group surveyed 80 of the largest enterprises in the UK for its report Managing information and systems risks. The sample included 22 of Britain’s top 100 companies, and a number of large building societies, government agencies and local authorities; 85% reported problems with systems projects that were late, over budget or failed to deliver planned benefits.
Odds against success
The odds against IT success, then, are so great that users might as well toss a coin – a point of view espoused by Dr Bob Charette, president of the ITABHI Corporation, a high technology risk management company headquartered in Fairfax, Virginia. He is currently chairman of the Risk Advisory Board at the US Software Engineering Institute at Carnegie-Mellon University and describes the situation as ‘deja vu to the nth power’. ‘The same things occur over and over. A realistic view of large IT projects is that there is a high probability of failure. You shouldn’t be doing these things unless you understand the risks. Unfortunately, projects are handled like medicine was handled in the 15th century, and as a result there are a tremendous amount of corpses out there,’ he says, blaming failure on the lack of proper risk management at the outset. He adds that these projects are poor candidates for normal methods of project control. ‘Projects are not problems. Problems have solutions. Projects are dilemmas that you manage your way through. Large scale projects are changing all the time. Instead of accepting change, people try to prevent it.’
The Coopers & Lybrand study found that a lack of expertise in managing project risks and designing controls was a major contributory factor. Only 36% of organizations had conducted formal business risk analysis and mapped it onto internal controls, or had produced a formal risk management and quality plan on systems projects. Some 80% of organizations had not given any formal training in control concepts and design to their business and IT staff. Other failure factors included insufficient understanding of the business issues to be addressed by the system and a lack of independence in progress and quality reporting. The discomfiture was apparent. Those heads of audit who were polled had little confidence that adequate financial and project control exists over any significant systems projects. Neither did they feel that controls are in place to limit business risks to acceptable levels, nor that senior management are aware of all significant business risks. ‘They are sitting on IT disasters waiting to happen, and they know it,’ Coopers concluded.
The IT factor makes large projects inherently complex. Companies fail to recognize that they are at the leading, or so-called bleeding, edge, and that they are trying to do things they have not done before. No other industry faces the pace of change that the IT industry does. For civil engineers, building a dam today is not that different from 25 years ago. Large IT projects, unfortunately, bear no such comparison.
Dale Kutnick, CEO of analysts the Meta Group, describes the problem that besets all large projects as ‘scope creep’. ‘Projects are not nailed down as hard as they should be in terms of information requirements and delivery. Once the project has been started, there are compromises, and because of the timescale, they make these compromises and find the nature of the business has changed.’ He points out that, over time, business cycles have shrunk from years to months to weeks and finally, today, to ‘Internet time’. As business cycles are compressed, the time to market required for the projects is cut in half. But as the project starts to take longer, business requirements are added and information requirements change because of competition and market conditions in general. This makes large projects much less attractive nowadays.
In his book Software Failure: Management Failure – Amazing Stories and Cautionary Tales, Stephen Flowers analyzes legendary IT howlers. Again and again, the failures are attributed to information errors, inadequate testing, complexity, lack of communication, high levels of uncertainty and high stakes. ‘When I was working on the book it was almost like peering into a parallel universe where none of the normal rules of management apply,’ says Flowers. Indeed.
Who would believe, for example, that the Sabre computer reservation system (CRS) – such a success for American Airlines (AA) that it became more profitable than the airline itself – would transmogrify into a fiasco for Confirm, the joint venture of Hilton Hotels, Marriott, Budget Rent-a-Car and AMR Information Services? The development budget for the Confirm CRS for booking hotel rooms and rental cars was $125 million in 1988. By 1992, however, instead of making 10 times the estimated sales of $3.1 million, the system was written off and the partners sued each other for $810 million. They settled discreetly out of court. The French railway system, SNCF, world famous for its advanced high-speed TGV trains and the excellence of its engineering, spent FF1.3 billion ($232 million) on its Sabre-based Socrates CRS. It was intended to cope with a predicted 100 million increase in reservations from 50 million over the period 1992 to 1995. SNCF also wanted to maximize revenue on ticket sales by applying airline marketing techniques to its unsuspecting customers. The CRS would, in theory, identify peaks of demand and force prices up, thus diverting passengers to half-empty trains – a familiar practice in air travel, but not on the railways.