By Timothy Prickett Morgan
The forthcoming Power4 processor, which will be used in AS/400 servers and RS/6000 servers and workstations by the second half of 2001, is shaping up to be a very serious contender as the fastest processor in the Unix market. Jim Kahle, the chief architect of the Power4 chip, gave a presentation at the Microprocessor Forum that shows once and for all that IBM’s chip designers finally understand that they not only have to offer elegant 64-bit designs, but that they have to be leaders in clock speeds and in the bandwidth needed to feed these higher clock speeds. By all indications, the Power4, which is set to go to silicon sometime in the first quarter of 2000 and will have some 15 months or so of test time before it reaches commercial production, will pack lots of punch. And it will do so with a clever design that gives the Power4 chips bandwidth that even supercomputers these days wish they had, and for good reason, since Power4 processors will be the basis of future IBM supercomputers as well as servers and workstations.
The Power4 chip will actually include two whole processors with a shared on-chip L2 cache memory on a single die, which is more of a microSMP than it is a microprocessor. This microSMP design is similar to Sun Microsystems’ MAJC processor. The dual processors in the Power4 will be heavily based on the existing Power3 core and will include all the PowerPC instructions and commercial workload caching algorithms that were developed for the Apache/Northstar/Pulsar lines. The Power4 will be implemented in IBM’s 0.18 micron CMOS-8S2 copper and silicon-on-insulator processes.
The Power4 will include hardware-assisted multithreading and smart caching technologies that IBM’s chip designers in Rochester, Minnesota have perfected for the AS/400. The Power4 apparently will not include some of the early Power RISC chip instructions, which impede commercial performance. Each of the Power3-based cores on the Power4 microSMP will have three integer units, and the combined six units should be able to process ten instructions per clock cycle (if not more). This would yield a raw integer performance of 11,000 MIPS at the 1.1GHz expected speed of the initial Power4s.
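The raw integer figure is simply instructions per clock multiplied by clock rate. A minimal sketch of that arithmetic, using only the numbers quoted above (the function and variable names are ours, chosen for illustration):

```python
# Figures quoted in the article; names are illustrative only.
INSTRUCTIONS_PER_CYCLE = 10   # combined across both cores, per the article
CLOCK_MHZ = 1100              # 1.1GHz expected initial clock speed


def peak_mips(instructions_per_cycle: int, clock_mhz: int) -> int:
    """Peak millions of instructions per second.

    One MHz is a million cycles per second, so MIPS is just
    instructions-per-cycle times the clock rate in MHz.
    """
    return instructions_per_cycle * clock_mhz


power4_mips = peak_mips(INSTRUCTIONS_PER_CYCLE, CLOCK_MHZ)
print(f"Peak integer rate: {power4_mips:,} MIPS")
```

This is a theoretical peak; it assumes every issue slot is filled on every cycle, which real workloads do not achieve.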
The Power3 chip running at 200MHz had a SPECint95 rating of 13.2, and with compiler and pipelining improvements, the Power4 at 1.1GHz should have a SPECint95 rating of at least 130 and perhaps as high as 150 (this is our estimate, not IBM’s). The 262MHz Northstar chip had a SPECint95 rating of 15, the 125MHz Apache a rating of 8, and the new 450MHz Pulsar will probably come in at around 26. The 600MHz Pentium III has a SPECint95 rating of 24, a 400MHz UltraSparc-II comes in at 17.4, and a 667MHz EV67 Alpha 21264 EV6 comes in at more than double that at about 37. Next year’s EV7 at 1GHz will only rate about 80 on the SPECint95 test, and it won’t be until the EV8s, if they ever see the light of day, that the Alphas will hit 200 SPECint95.
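One way to put these ratings on a common footing is to divide each by its clock speed, and to see how far clock scaling alone would carry the Power3 rating. The sketch below uses only the figures quoted above; the per-MHz comparison and the clock-only projection are our illustration, not a method attributed to IBM or to the estimate in the text.

```python
# (clock in MHz, SPECint95) for shipping chips, as quoted in the article.
CHIPS = {
    "Apache": (125, 8.0),
    "Power3": (200, 13.2),
    "Northstar": (262, 15.0),
    "UltraSparc-II": (400, 17.4),
    "Pentium III": (600, 24.0),
    "Alpha 21264": (667, 37.0),
}

# SPECint95 delivered per MHz of clock, a rough per-cycle efficiency measure.
specint_per_mhz = {name: score / mhz for name, (mhz, score) in CHIPS.items()}

for name, ratio in sorted(specint_per_mhz.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {ratio:.4f} SPECint95 per MHz")

# Scale the 200MHz Power3 rating to 1.1GHz by clock speed alone.
clock_only_estimate = 13.2 * 1100 / 200

# Extra gain, beyond clock speed, implied by the 130-to-150 estimate.
implied_gain_low = 130 / clock_only_estimate
implied_gain_high = 150 / clock_only_estimate
print(f"Clock-only projection: {clock_only_estimate:.1f}")
print(f"Implied additional gain: {implied_gain_low:.2f}x to {implied_gain_high:.2f}x")
```

On these numbers, the higher clock accounts for a bit more than half of the projected jump, and the rest would have to come from the compiler and pipelining improvements mentioned above.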
The chip will also include four math units (two per Power3 core) based on the math coprocessor that was developed by the Rochester engineers for the AS/400 Muskie PowerPC processor, which was used in the 1995-1996 line of 530 and 53S servers and which is still the fastest math unit ever developed by IBM. The Muskie math unit was eventually incorporated into the ill-fated PowerPC 630 chip, which eventually saw the light of day as the Power3 chip.
These math units in the Power4 are capable of processing a total of eight floating point operations per clock cycle, and at 1.1GHz that works out to 8.8 gigaflops peak theoretical performance for a single chip. The 200MHz Power3 chip had a SPECfp95 rating of 30.1, which was based on processing about 630 megaflops out of the peak 800 megaflops for the Power3’s math units. Barring any substantial changes in the math units and compilers, that means the Power4 should have a SPECfp95 rating of about 330 (again, this is our estimate). The EV8 is expected to be in the same range, maybe even a little lower. In case you haven’t gotten it yet, these power ratings for the Power4s are big numbers.
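The floating point estimate follows from scaling the Power3 rating by the ratio of peak rates, reading the eight operations as a per-clock figure (which is what the 8.8 gigaflops total implies). A minimal sketch of that arithmetic, assuming the Power4 sustains the same fraction of peak as the Power3; the variable names are ours:

```python
# Figures quoted in the article; names are illustrative only.
POWER3_PEAK_MFLOPS = 800        # peak for the 200MHz Power3
POWER3_SUSTAINED_MFLOPS = 630   # roughly what the SPECfp95 run achieved
POWER3_SPECFP95 = 30.1

POWER4_CLOCK_MHZ = 1100         # 1.1GHz
POWER4_FLOPS_PER_CYCLE = 8      # four math units combined

# Peak rate is operations per cycle times cycles per second.
power4_peak_mflops = POWER4_FLOPS_PER_CYCLE * POWER4_CLOCK_MHZ

# Fraction of peak the Power3 sustained on the benchmark.
power3_efficiency = POWER3_SUSTAINED_MFLOPS / POWER3_PEAK_MFLOPS

# Assumption: unchanged math units and compilers, so the rating
# scales linearly with peak floating point rate.
peak_ratio = power4_peak_mflops / POWER3_PEAK_MFLOPS
power4_specfp95_estimate = POWER3_SPECFP95 * peak_ratio

print(f"Power4 peak: {power4_peak_mflops / 1000:.1f} gigaflops")
print(f"Power3 sustained fraction of peak: {power3_efficiency:.1%}")
print(f"Estimated Power4 SPECfp95: {power4_specfp95_estimate:.0f}")
```

The linear scaling is the weak point of the estimate: it holds only if memory bandwidth keeps pace with the math units, which is why the bandwidth figures matter so much.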
But uniprocessor (or, in this case, microSMP) power is not going to be possible without copious amounts of bandwidth, and IBM has leveraged its supercomputer expertise to build a high-bandwidth memory and I/O infrastructure to keep those Power4 processors fed. The two Power3-based cores in the Power4 will communicate with a shared on-chip L2 cache through a 100GBps L2 bus; this L2 cache will probably be a mere 1.5MB based on processor transistor counts and relative area sizes in the Power4 diagram. The Power4 will also have L3 caches sitting in front of main memory. The L3 cache is expected to range from 8MB to 32MB, but could grow as big as 128MB for NUMA clusters.
Main memory on the Power4 servers is expected to be 512GB max. The Power4 processor will communicate with L3/main memory through dual 128-bit, 11GBps, 333MHz memory buses. Each Power4 chip will include two chip-to-chip interconnect interfaces, each covering one side of the microSMP. This interface, which will run at half the clock speed and have nearly 40GBps of bandwidth, will enable IBM to link four Power4 processors into a single SMP-ready multichip module with eight effective processors. Four of these MCMs will comprise the 32-way Power4 server IBM has been talking about for about six months, and will have 44GBps of aggregate bandwidth going to L3/main memory and another 44GBps going out to peripherals.
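The memory bus and processor counts can be cross-checked with a little arithmetic. The sketch below assumes one transfer per bus clock (our assumption, not stated in the presentation) and treats the 11GBps figure as the combined rate of the two buses on one chip, which is how the numbers appear to line up; all names are illustrative.

```python
# Figures quoted in the article; names are illustrative only.
BUS_WIDTH_BITS = 128
BUS_CLOCK_MHZ = 333
BUSES_PER_CHIP = 2
CORES_PER_CHIP = 2
CHIPS_PER_MCM = 4
MCMS_PER_SERVER = 4


def bus_gbps(width_bits: int, clock_mhz: int) -> float:
    """Bandwidth of one bus in GBps, assuming one transfer per clock."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz / 1000


per_bus_gbps = bus_gbps(BUS_WIDTH_BITS, BUS_CLOCK_MHZ)
per_chip_gbps = per_bus_gbps * BUSES_PER_CHIP   # rounds to the quoted 11GBps

# Four chips at the quoted (rounded) 11GBps each give the 44GBps figure.
quoted_aggregate_gbps = 11 * CHIPS_PER_MCM

cores_per_mcm = CORES_PER_CHIP * CHIPS_PER_MCM
cores_per_server = cores_per_mcm * MCMS_PER_SERVER

print(f"Per bus: {per_bus_gbps:.2f} GBps, per chip: {per_chip_gbps:.2f} GBps")
print(f"Four chips at 11GBps: {quoted_aggregate_gbps} GBps")
print(f"Processors per MCM: {cores_per_mcm}, per server: {cores_per_server}")
```

Under these assumptions the 44GBps aggregate corresponds to four chips' worth of memory buses, that is, one multichip module.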
Each Power4 will also include a 500MHz wave-pipelined expansion bus for further processor clustering and for communicating with peripherals. The chip is expected to include hardware support for cache coherent NUMA clustering as well, which looks as if it will be the preferred clustering mechanism for servers with more than 32 processors. It may also be more advantageous to use NUMA for Power4 servers with more than eight effective processors, depending on how the SMP ratios work out. The Power4 design sounds like wheels within wheels, but IBM’s approach seems to be a good way to create a switch fabric to link all the processors together while at the same time minimizing the overhead on the I/O backplane.
If all this processing power was not a compelling enough reason for IBM, which is keen on selling Java and the MIPS to support it, to keep on the Power roadmap rather than jumping ship to IA-64, then the relatively low cost of producing the Power4 chip may be. Keith Diefendorff of MicroDesign Resources estimates that it will only cost IBM about $2,500 apiece to make the initial Power4s, and that price can come down significantly if IBM reuses half-broken Power4s in uniprocessor AS/400s and RS/6000s. Merced is a small-cache, low-clock-speed, low-bandwidth chip compared to Power4, and even if McKinley, also due in 2001, runs at 1.2GHz and delivers twice the performance and three times the bandwidth of Merced as many expect, it will still probably be considerably less powerful than Power4. And no one knows if Merced and McKinley will be any cheaper to procure than building Power4s, so switching to IA-64 doesn’t make a lot of sense unless something radical happens in the server market, such as the AS/400 business drying up.