Initial reaction to the PowerPC 604 has been extremely favourable and, as always when a new chip is launched, the world made a bee-line for Michael Slater, editorial director of the Microprocessor Report, for his view. The Report’s lead headline last week was “PPC 604 Powers Past Pentium – PowerPC Chip Will Open Performance Gap, Possibly Permanently”. We guess he thinks the 604 is OK. Not only does the 604 leave announced Pentia in the dust, but IBM Corp and Motorola Inc believe that it will give the forthcoming P6 a run for its money. Additionally, there are plans to make the 604 and the 603 smaller by incorporating a fifth layer of metal, but more of that later. There are now nearly 400 engineers working at the Somerset design facility in Austin, and when their latest meisterwerk was launched last month, it was accompanied by a wealth of technical detail which goes some way to explaining how the chip achieves its performance improvement.
Intriguing holes
However, there are a few intriguing holes that still have to be filled. For example, the processor apparently plays host to some new graphics operations, which were not included in the 601. One IBM source suggests that these won’t be supported by any general-purpose compiler, but will be used by some graphics-intensive libraries. In general, this is what we do know: the 604 has six functional units: a floating point unit; a branch processing unit; a load-store unit; two standard integer units; and one multiple-cycle integer unit, used for the rarer division and multiplication requirements. The chip tries to despatch up to four instructions on each clock cycle. It has a 64-bit external data bus and a 32-bit address bus. Overall, the capabilities are closer to the PowerPC 603 than the 601. The Load/Store unit first appeared in the 603, for example, and the 603 and 604 share a register renaming capability not present in the earlier chip. Where the 603 and 604 differ, however, is in size: where the 603 is a petit, bijou chip-ette, measuring 85 square mm and integrating 1.6m transistors, the 604 weighs in at 196 square mm with 3.6m. This is bigger than the 601 and substantially bigger than Intel Corp’s P54C Pentium. At 8 to 10 Watts, it is also pretty power-hungry. A photo of the chip’s layout reveals that one of the biggest items is the despatch and completion unit, which has to work out how to issue four instructions simultaneously and cope with out-of-order execution.
By Chris Rose
Following the path of an instruction as it wends its way through the processor gives a good feel for how the various parts interact: we start at the 604’s Fetch Unit. Its task is to grab instructions from the 16KB on-board instruction cache and dump them into the eight-entry instruction queue. Under the simplest circumstances the instructions would simply be fetched sequentially, but the branch prediction unit will often kick in and offer its own suggestion of where to go next. The 604 is the first PowerPC processor to incorporate dynamic branch prediction – its predictions adapt as time goes on and the unit records which jumps were taken previously. When the 604 finds a branch, it predicts the outcome and executes the resultant code, storing the results in a parallel set of rename registers until it is certain whether the prediction was correct. The chip can go two-deep in its predictions. Dynamic branch prediction is based around two structures: the Branch History Table and the Branch Target Address Cache. The latter holds the target addresses for 64 branches that have been taken in the past. The History Table, by comparison, is used to predict conditional branches – the 512 entries are each assigned a two-bit value indicating four levels of dynamic prediction: strongly not taken, not taken, taken and strongly taken. Each time the branch is taken, the value is incremented. Each time it is not taken, it is decremented.
The despatch unit is also responsible for allocating the decoded instruction to the appropriate execution unit and for allocating a place in the completion unit’s reorder buffer, while checking for dependencies between instructions in the despatch queue. We haven’t room to go through a full description of all the functional units here, but both the integer and floating point units have had two-entry reservation stations added to their front end, which store despatched instructions that cannot be executed until all the source operands are supplied. These reduce stalls in the chip. The floating point unit is the first single-pass double precision unit to be incorporated into the PowerPC line. This means that both single and double precision operations can be issued at a rate of one per clock cycle, each with a latency of three cycles. Since instructions can finish out of order, the completion unit has to store executed instructions in the reorder buffer until all instructions ahead of them have been completed. Once everything is in order, the unit writes the instruction execution’s results to the appropriate register file and updates any other resources that are affected.
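For readers who want a feel for how the Branch History Table described earlier behaves, here is a rough sketch in Python. The table size of 512 and the four prediction levels come from the details above. The way the table is indexed, the starting state, the counters stopping at either extreme, and all the names are assumptions made purely for illustration; nothing here is taken from the Somerset documentation.

```python
# Illustrative sketch of a two-bit branch history table.
# From the article: 512 entries, four levels, increment on taken,
# decrement on not taken.
# Assumed for illustration: indexing scheme, initial state, saturation
# at the extremes, and every name used below.

STATES = ["strongly not taken", "not taken", "taken", "strongly taken"]


class BranchHistoryTable:
    def __init__(self, entries=512, initial=1):
        self.entries = entries
        # One two-bit counter (0..3) per entry.
        self.counters = [initial] * entries

    def _index(self, address):
        # Assumption: discard the two low address bits (4-byte
        # instructions) and wrap around the table size.
        return (address >> 2) % self.entries

    def predict(self, address):
        """Return True if the branch is predicted taken."""
        return self.counters[self._index(address)] >= 2

    def state(self, address):
        """Return the name of the current prediction level."""
        return STATES[self.counters[self._index(address)]]

    def update(self, address, taken):
        """Record the real outcome of a branch once it is known."""
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable()
branch = 0x1000  # made-up branch address
for outcome in [True, True, True, False, True]:
    bht.update(branch, outcome)
print(bht.state(branch), bht.predict(branch))
```

The point of having two bits rather than one is visible in the loop: a branch that is nearly always taken keeps its taken prediction through a single stray not-taken outcome, instead of flipping back and forth.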
Doesn’t have fabs
Several instructions may complete simultaneously. As we said previously, the 604 is built using the older 0.5 micron technology and it doesn’t incorporate the new transistor design that appeared in the 100MHz PowerPC 601: apparently Motorola doesn’t have fabs that can cope with this new process yet. However, this is one way in which the 604 could get smaller and faster in the future. Certainly the 100MHz clock speed is seen as being in the middle of the 604’s range, so we should expect both slower and faster parts. In addition, Ian Ferguson, Motorola Ltd’s RISC product marketing engineer, suggests that the companies are looking to produce future versions of the 604 and 603 with an additional, fifth layer of metal for chip interconnects. This will make the chip between 5% and 10% smaller, he says. The shorter interconnects also have an impact on latency and therefore achievable clock speed. Compiler technology is critical to getting the best out of the PowerPC, and code optimised for one member of the PowerPC family may well cause performance problems when run on another, but discussion of those issues will have to wait for another day.