POWERPC 620: A LOOK UNDER THE HOOD SHOWS SAME NUMBER OF FUNCTIONAL UNITS, A BIGGER CACHE AND OTHER NIPS AND TUCKS

If there is one thing that the PowerPC 620 proves, it is that there are other ways to increase the speed of a chip than simply cramming in more functional units. After all, it contains only the same number of execution units as the 604, yet it provides substantial speed improvements. Factoring out clock-speed differences, a 133MHz 604 would reach up to SPECint 212.8 and SPECfp 219.5, compared with the 225 and 300 respectively that the PowerPC 620 is designed to achieve. SPEC marks are not the only bench test, of course. Time and time again, when the Somerset designers talk about their baby, they emphasise that it is a processor optimised for commercial transaction processing work. They say the 133MHz 620 will deliver twice the transaction processing performance of the 100MHz 604.
But how? A lot of this speed improvement is down to the enlarged cache, twice the size of that of the 604 and eight-way set-associative, as opposed to the 604’s four-way effort. A fractional improvement also comes from a slightly faster transistor design. The rest comes through a number of relatively small nips and tucks.
To begin at the beginning, the 620 adds a pre-decode stage to the instruction pipeline. The stage categorises instructions as they are pulled from the instruction cache in terms of the resources that they will use: operands required, registers used and so on. The data needed by the pre-decode stage is actually held within an additional 7Kb of cache. This, together with 1Kb used for parity information, means that the chip’s instruction cache is really 40Kb in size, though only 32Kb is visible to the outside world. The pre-decode is designed to eliminate an entire stage from the instruction pipeline, reducing the performance penalty if branch prediction screws up and the processor takes a wrong turn. That is only an outside chance, however, according to the designers.
Motorola Inc’s Brad Beavers says that simulations show the 620 will get it right about 90% of the time. The branch prediction capabilities have been improved over the 604 through the simple act of bumping up the size of the branch history table to 2,048 entries from 512. The branch history table predicts the likelihood of any branch being taken from its past behaviour. At the same time, the branch target address cache is increased to 256 entries from 64, so that the chip knows not only whether it should branch but also where it should branch to. Speculative execution is also improved: the new chip can run past four unresolved branches, where the 604 could manage just two. Processor stalls have also been reduced somewhat by the addition of extra reservation stations in front of the Load/Store Unit and the Branch Processing Unit.
Big data sets
But other than that, the main functional units look virtually identical from the raw specifications. The one area where the 620 really differs from its predecessor, other than the 64-bit data extensions, is in its cache, its memory handling and the system bus. Big data sets and multiprocessing should be the 620’s forte. Beavers says they expect to be able to stick six PowerPC 620s into a machine with no additional glue logic. In addition to the bigger on-board L1 cache, the chip also has its L2 cache controller on-board, meaning that an external cache can be added with the minimum of glue logic. The L2 cache can be configured at anywhere between 1Mb and a chunky 128Mb.
Address bus capacity is a classic problem with large multiprocessor configurations: each processor needs to keep an eye on the memory that the other processors are modifying, to avoid clashing when trying to access the same piece of data. Usually this is done through a system of ‘snoops’, querying other processors’ caches.
The 620 can essentially pipeline snoop queries and responses so that it can put out new addresses every other cycle without having to wait for responses from the other processors. Bus width has also been extended to 128 bits, though it can also work in a 64-bit mode.
The 620’s designers tend to brush aside any queries over the processor’s performance by saying that SPECints really do not do justice to its transaction processing-aimed performance. The trouble is, some of the other RISC manufacturers are saying exactly the same.
– Chris Rose
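
For readers who want to see how a branch history table and a branch target address cache fit together, the Python sketch below models the two structures at the sizes quoted above (2,048 and 256 entries). It is an illustration only, not a description of the 620’s actual circuitry: the two-bit saturating counters, the indexing by low-order address bits, and the names `BranchPredictor`, `predict`, `update` and `accuracy` are all assumptions made for the example.

```python
class BranchPredictor:
    """Toy model of a branch history table (BHT) plus a branch target
    address cache (BTAC). Sizes follow the article; everything else
    (2-bit counters, index function) is an assumption for illustration."""

    def __init__(self, bht_entries=2048, btac_entries=256):
        # Each BHT slot is a 2-bit saturating counter:
        # 0-1 predict "not taken", 2-3 predict "taken" (assumed encoding).
        self.bht = [1] * bht_entries
        # Each BTAC slot holds (branch_address, target_address) or None.
        self.btac = [None] * btac_entries

    def _bht_index(self, addr):
        # Instructions are 4 bytes wide, so drop the two low bits.
        return (addr >> 2) % len(self.bht)

    def _btac_index(self, addr):
        return (addr >> 2) % len(self.btac)

    def predict(self, addr):
        """Return (predicted_taken, predicted_target_or_None)."""
        taken = self.bht[self._bht_index(addr)] >= 2
        slot = self.btac[self._btac_index(addr)]
        target = slot[1] if slot is not None and slot[0] == addr else None
        return taken, target

    def update(self, addr, taken, target):
        """Record the real outcome once the branch has been resolved."""
        i = self._bht_index(addr)
        if taken:
            self.bht[i] = min(3, self.bht[i] + 1)
            self.btac[self._btac_index(addr)] = (addr, target)
        else:
            self.bht[i] = max(0, self.bht[i] - 1)


def accuracy(predictor, trace):
    """Fraction of branches in trace whose direction was predicted correctly.
    trace is a list of (branch_address, taken, target_address) tuples."""
    correct = 0
    for addr, taken, target in trace:
        predicted_taken, _ = predictor.predict(addr)
        correct += predicted_taken == taken
        predictor.update(addr, taken, target)
    return correct / len(trace)


if __name__ == "__main__":
    # A made-up loop branch: taken nine times, then falls through, ten passes.
    loop = [(0x1000, i % 10 != 9, 0x0F00) for i in range(100)]
    print(accuracy(BranchPredictor(), loop))
```

Fed a loop branch that is taken nine times and then falls through, the model mispredicts once per pass after the counter has warmed up. A bigger table does not change that figure for a single branch; what it buys is room to keep such histories for more branches at once without them overwriting each other.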
