The odd thing about Intel Corp’s new P6 is not how innovative it is, but rather how familiar. The company that gave the world the iAPX-86 architecture is following the same path already outlined by Advanced Micro Devices Inc with its K5 and NexGen Microsystems Inc with its Nx586 family. Intel has abandoned the idea of trying to execute the complex iAPX-86 instruction set directly; instead the chip translates the instructions on the fly into RISC instructions dubbed micro-operations, or uops (pronounced you-op). Saying that it followed the two companies is a bit misleading, however, since Intel describes the P6 as its first parallel design effort – at the same time that one group of engineers was beavering away on Pentium, another, based in Hillsboro, Oregon, began work in 1990 on a bottom-to-top redesign of Intel’s execution model, which led this year to the P6. Such an eerie degree of convergence between NexGen, Advanced Micro and Intel will no doubt be pointed to by RISC proponents as proof of the final victory of RISC over complex instruction sets: expect a spate of the same ‘this vindicates our whole approach’ statements that followed the announcement of the Hewlett-Packard Co-Intel love-match early last year.
Tinkering
Of course, Intel won’t care. As long as it can, by hook or by crook, keep cranking up the speed while maintaining compatibility and keeping the size, cost and heat of the chip down, why should it? One lone voice argues that the native Intel architecture is not dead, and says it is possible to keep building better and faster chips while still executing old-style Intel architecture instructions. That voice is Cyrix Corp. The cynical among you are already leaping up and down shouting ‘well, they would say that, wouldn’t they?’ Well, yes, they would. When Cyrix designed its M1 processor, it designed the kind of chip that Intel should have designed – it used ingenious technology to avoid bottlenecks in the chip’s dual pipelines, and it did its damnedest to overcome the limited set of registers and the instruction dependencies that hamper efforts to get iAPX-86 code running in parallel. The M1 won plaudits from the industry. But, in designing the M1, Cyrix committed itself very firmly to the ‘Campaign for Real iAPX-86 Instructions,’ while its competitors, including Intel, have already had several years thinking about and tinkering with ‘pseudo-iAPX-86’ processors – essentially simple RISC cores, running hard-wired iAPX-86 emulation. Either Cyrix has missed the boat, or it knows something that the rest of them don’t. The company’s chief designer is very clear on the matter – he looked at the RISC emulation approach and he found it wanting. Cyrix’s scheme is perfectly scalable, he says, and successors to the M1 will stick with the real Intel instructions. So why have all the others jumped ship? Quite simply, he says, because the other companies had a lot of RISC designers around and when they looked at the horrible old Intel instruction set, they shivered and went ‘bleurgh, let’s break this down into a more manageable problem.’
By Chris Rose
Certainly, that seems to be the case at Advanced Micro, which states quite openly that the K5 core borrows heavily from its RISC chips. Intel is saying very little about the nature of its uops, but Cyrix insiders say that up to 70% of the P6 team previously worked on Intel’s 80960 RISC. But Cyrix’s director of superscalar products, Mark Bluhm, claims that RISC-ising the problem can actually cause extra problems. His contention is that breaking Intel Architecture instructions into uops doesn’t actually solve anything – you haven’t got rid of the problem, you’ve just shuffled it. Moreover, the act of translation actually introduces yet more dependencies, he says. It’s what another Cyrix insider colourfully calls turd-squeezing – no matter how much you manipulate the problem, you are still left with the object in question. Intel would certainly rebut these claims, but the problem is that it has, by its own admission, released only the minimum of information about the P6 required for its International Solid State Circuits Conference presentation.
The best overview of the chip’s architecture is given in the conference paper and another document, titled A Tour of the P6 Microarchitecture. Both can be found at Intel’s Web site – http://www.intel.com/ – so we won’t bother reproducing the details in full here. There is also a splendid little application at the same site which gives a clear graphical representation of how the chip’s ‘dynamic execution’ technology helps increase the effective number of instructions per clock cycle. Essentially, the P6 has three main functional blocks. The Fetch/Decode unit grabs the iAPX-86 instructions, converts them into uops and dumps them into a pool to be executed. The Dispatch/Execute unit takes the uops from the pool and sends them to the appropriate execution unit (integer, floating point, jump, load or store). It then returns them to the pool. Finally, the Retire unit scans the uop pool, looking for groups of uops that have finished executing and can be removed. The Retire unit also has to re-impose the original program order, and do all this in the face of interrupts, traps, faults, break-points and mispredictions. There is a lot of clever stuff going on, both in the Retire unit and in the Dispatch/Execute unit, which has to handle dependencies – watching for which uops have all their parameters ready, which are still waiting, and so forth. Dynamic Execution is the name Intel has chosen for its mixture of program flow prediction, speculative execution and such, and it makes much of the chip’s improved instructions-per-clock-cycle ratio. Unfortunately, when we asked, the company couldn’t put any quantitative measure on how much better it is, and instead pointed to the chip’s SPECint figures. Likewise, there’s not much information on how many uops it takes to translate the average Intel Architecture instruction.
The documents available say most IA instructions are converted directly into single uops, some instructions are decoded into one-to-four uops and the complex instructions require microcode…
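For readers who find code easier to follow than block diagrams, the pool model described above can be caricatured in a few lines of Python. This is a toy sketch under loudly stated assumptions, not Intel's design: the four uops, their names, their latencies and the two-per-cycle dispatch width are all invented for illustration (Intel has not published how a read-modify-write instruction is actually split), and only true data dependencies are tracked, as if register renaming had already removed the false ones.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Uop:
    name: str                  # label used in the printed orders
    dest: Optional[str]        # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]      # registers read
    latency: int = 1           # cycles until the result is available


def simulate(program: List[Uop], width: int = 2):
    """Toy uop pool: dispatch out of order, retire in program order."""
    # Link each source register to the most recent earlier uop that writes it.
    last_writer = {}
    producers = []
    for i, u in enumerate(program):
        producers.append([last_writer[s] for s in u.srcs if s in last_writer])
        if u.dest is not None:
            last_writer[u.dest] = i

    done_at = [None] * len(program)   # cycle at which each result is ready
    issue_order, retire_order = [], []
    head = 0                          # oldest uop not yet retired
    cycle = 0
    while head < len(program):
        # Retire: strictly in program order, and only finished uops.
        while (head < len(program) and done_at[head] is not None
               and done_at[head] <= cycle):
            retire_order.append(program[head].name)
            head += 1
        # Dispatch: any waiting uop whose producers have all finished.
        issued = 0
        for i, u in enumerate(program):
            if issued == width:
                break
            ready = all(done_at[p] is not None and done_at[p] <= cycle
                        for p in producers[i])
            if done_at[i] is None and ready:
                done_at[i] = cycle + u.latency
                issue_order.append(u.name)
                issued += 1
        cycle += 1
    return issue_order, retire_order, cycle


# Hypothetical decode of "add [mem], eax" (load, add, store) with an
# unrelated "inc ebx" sitting in the middle of it in program order.
PROGRAM = [
    Uop("load", dest="tmp", srcs=(), latency=3),
    Uop("add", dest="tmp2", srcs=("tmp", "eax")),
    Uop("inc", dest="ebx", srcs=("ebx",)),
    Uop("store", dest=None, srcs=("tmp2",)),
]

if __name__ == "__main__":
    issued, retired, cycles = simulate(PROGRAM)
    print("issue order: ", issued)
    print("retire order:", retired)
    print("cycles:      ", cycles)
```

Run as written, the independent inc is dispatched ahead of the add that is still waiting on the slow load (issue order load, inc, add, store), yet it retires after the add (retire order load, add, inc, store), which is the re-imposition of program order the Retire unit is responsible for. The three-uop chain also shows, in miniature, the point Cyrix makes: splitting one instruction into uops creates dependencies between those uops that still have to be waited out.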
Bedevilled
Some instructions, called prefix bytes, modify the instruction following them, giving the decoder a lot of work to do. Though we don’t know for sure, an excellent overview of the chip in the Microprocessor Report reckons that the average instruction takes between 1.5 and 2 uops. What is clear, however, is that the P6 will avoid many of the compiler optimisation questions that have bedevilled the Pentium. Since Pentium came out, there has been a debate about the extent to which you need to turn on compiler optimisations to get the best from the chip: getting Pentium’s dual pipelines working at full speed requires some fancy footwork when it comes to instruction ordering. The P6 should do away with this totally, since uops are dumped into a pool and execute out of order anyway. So at least that is one argument fewer when it comes to the forthcoming ‘which is faster, PowerPC or P6?’ debate.
Chris Rose is editor of PowerPC News