Supercomputer designers must be the most successful – and lucky – engineers of the twentieth century: they have effortlessly improved the performance of their machines year after year, so that their machines are now literally a million times faster than the original ENIAC of 1946. But this remarkable achievement seems to have been more the result of accident than design. Every time they started banging their heads against a brick wall, having squeezed the last drop of performance from the current technology, someone obligingly developed a brand new technology and handed it to them on a plate.

Kitchen freezer

The original electromechanical relays were replaced first by vacuum tubes, then transistors, then integrated circuits, then VLSI… In parallel with these developments came the go-faster acronyms: ECL, CMOS, FET, FPU, ROM, RAM, EPROM, GaAs, CPU, ALU, DSP… all leading to MIPS, KLIPS and megaflops. The engineers have been remarkably successful in shrinking the early machines – some of which were the size of small factories and used as much electrical power as a small town – down to a box the size of a kitchen freezer.

And that has been part of the problem: the designers have come up with remarkable engineering solutions to the engineering problems of the early machines, but they are still building the same machine. A designer from 30 or 40 years ago might have difficulty believing that it all fitted into such a tiny box, but he would have no difficulty in understanding how a modern commercial computer worked – it works in exactly the same way that his did. It has a central processor, stored instructions, binary logic, an instruction counter, a clock, and a large, slow back-up memory whose contents are transferred into a fast working memory as needed – all the parts that filled his room are still there inside the modern machine.

Supercomputer designers have been slightly more adventurous, but only slightly. Apart from a few honourable exceptions, even computer science departments and big company research labs have behaved as if the standard von Neumann architecture had been carried down the mountain carved in tablets of stone. Only now, when the computer designers seem to have run out of rabbits to pull out of their hats, are they starting to look seriously at the basic architecture of the computer and trying to rethink how it tackles computational problems. (There seems to have been a general feeling that John von Neumann, Alan Turing and others said everything there was to say on the subject in the 1930s, 1940s and early 1950s.) At long last money is starting to flow into basic research on different architectures through initiatives such as the Alvey project, but, as usual in the UK, it is too little and too late. And in the case of the most promising project, the Imperial College-based Flagship, the money looks like running out just as it successfully enters its final phase.

The most fundamental problem the supercomputer designers are banging their heads against is the speed of light. In one nanosecond an electrical signal will travel only about 11 inches along a copper wire, and supercomputers such as the Cray 2 have clock periods of around four nanoseconds, so all the components must be within about 44 inches of the clock if they are to keep in step with the rest of the system.

The first generation of supercomputer processors (from Cray Research, the Control Data Cyber 205, the Fujitsu VP-200, the Hitachi S-810 and the NEC SX systems) boosted their performance by pipelining arithmetic functions, using a single instruction to act on each element of a vector, and employing array processing units: arrays of simple arithmetic processors all performing the same calculation on the different elements of a vector in parallel. Each segment of a pipeline executes a one-cycle instruction on the result fed to it from the previous segment – if the pipeline is kept full and fresh data is fed to it each cycle, an eight-segment pipeline will appear to execute eight one-cycle instructions each cycle.
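
To put numbers on that, here is a minimal sketch in Python – the pipeline depth and vector length are figures chosen purely for illustration, not taken from any of the machines above – showing how cheap the start-up cost becomes once the pipeline fills, and how completely the throughput depends on feeding it every cycle:

    # Toy model of an arithmetic pipeline; all figures are illustrative.
    SEGMENTS = 8       # assumed pipeline depth
    ELEMENTS = 1000    # vector elements fed through the pipeline

    # The first result appears after SEGMENTS cycles; after that a new
    # result emerges every cycle.
    cycles = SEGMENTS + ELEMENTS - 1
    print(f"{ELEMENTS} results in {cycles} cycles = "
          f"{ELEMENTS / cycles:.2f} results per cycle")

    # If fresh operands cannot be supplied on some cycles, exactly that
    # many cycles later nothing comes out of the far end either.
    unfed = 100
    print(f"with {unfed} unfed cycles: "
          f"{ELEMENTS / (cycles + unfed):.2f} results per cycle")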

But pipelines have been pushed to their limits: if the length of a pipeline is increased beyond six or eight segments it becomes difficult, if not impossible, to keep the pipeline full. This produces a bubble in the pipeline – one or more cycles when the segments have nothing to do and nothing comes out of the end. And a pipeline only earns its keep, in terms of its complex hardware and the software needed to control it, if it delivers a result every cycle. A smart compiler can help by shuffling the instructions so that operations that take several cycles to complete, such as LOAD or STORE, are separated from the instructions that depend on their results, to keep the pipeline full. But when the compiler chooses the wrong arm of a conditional branch, as it inevitably must sometimes, the pipeline must be flushed and reloaded with instructions from the right branch, causing a delay equal to the length of the pipeline. Despite these quibbles, these machines can achieve a peak performance around the 1 gigaflops (10⁹ floating point operations per second) mark, but unfortunately the delivered performance in actual applications is usually only about 10% of that peak.

Even with faster chips and a more efficient architecture, most observers believe that the days of the uniprocessor are over in supercomputers – the only way forward is to use large numbers of CPUs working in parallel. But there are many problems with parallel processing: how do you decompose programs to spread them across the available CPUs? How are the processors to be connected, and how are they to communicate when they are working on different parts of the same problem? If you are not careful they actually slow down when more CPUs are added, spending all their time chattering to one another rather than working on the programs.
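
The chattering problem can be put in the same toy-model terms. Everything below is an illustrative assumption – a fixed-size job split evenly across the processors, with each processor paying a small communication cost per partner – rather than a measurement of any real machine:

    # Toy model of parallel speedup throttled by communication overhead.
    # The communication cost per partner is an illustrative assumption.
    def speedup(n_cpus, comm_cost=0.02):
        # Time on one CPU is 1.0; each CPU does 1/n of the work but also
        # spends comm_cost units talking to each of the other CPUs.
        return 1.0 / (1.0 / n_cpus + comm_cost * (n_cpus - 1))

    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:3d} CPUs: speedup {speedup(n):5.2f}")

With these assumed figures the speedup peaks at seven or eight processors and then falls away; by 64 processors the machine is actually slower than a single CPU – exactly the trap described above.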

Gene Amdahl

There are some doubters, of course: Gene Amdahl, who started out designing IBM’s 360 series and then went on to bigger things, recently told an audience at the Institution of Electrical Engineers in London that he thought the maximum number of processors that could usefully be used in parallel was 20 to 25 – more than that and the increase in performance would be minimal. He also demonstrated to his own satisfaction that, theoretically, vector processors could only achieve double or treble the performance of a standard scalar uniprocessor – a mathematical proof that should provoke a string of broken hearts and bankrupt a fair number of high-tech companies (the arithmetic behind claims of this kind is sketched below).

But exotic designs are pouring out of the universities and research labs as if a dam has broken: some using large numbers of standard chips connected together, and to memory, in unusual ways; some using proprietary chips; and some using semi-standard superchips such as the Inmos Transputer. Most of them run under highly customised versions of Unix, but the only other thing they have in common is the difficulty they all have in decomposing programs to run efficiently on dozens, hundreds or thousands of CPUs. For the first time, researchers into advanced parallel processing have test beds for their theories. But if they cannot quickly turn their theories into practical programs and methodologies, they may find the exotic architectures have blossomed and died, starved of life-giving software.
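
Amdahl’s own reasoning is not reproduced here, but the usual form of what is now known as Amdahl’s law gives its flavour. In the sketch below the serial fraction is an assumed figure chosen purely for illustration, not one taken from his talk:

    # Amdahl's law: if a fraction s of a job is inherently serial, the best
    # possible speedup on n processors is 1 / (s + (1 - s) / n).
    def amdahl(n_cpus, serial_fraction):
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

    s = 0.05  # assume 5% of the work cannot be parallelised (illustrative)
    for n in (1, 4, 16, 25, 100, 1000):
        print(f"{n:5d} CPUs: speedup {amdahl(n, s):6.2f}  (ceiling {1 / s:.0f})")

Whatever figure is assumed for the serial fraction, the speedup can never exceed 1/s however many processors are added – which is the ceiling Amdahl was pointing at.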