For years companies have been boasting that their technology allows members of a team working together to be scattered around the world, with all their workstations linked to a common server and the team members communicating by electronic mail, fax, telephone, telex, and possibly telepathy. But in almost every case their own design teams have been crowded into a single office or building, just as in the good old days when they used slide rules and paper. An honourable exception is National Semiconductor, which has scattered the design team for its third-generation 32-bit chip, the 32532, not just around the US but across continent and ocean, from Silicon Valley through to Israel.

NatSemi was the first company to market a commercial 32-bit microprocessor, back in 1983, but the 32032 and 32332 series were never as popular as Motorola's 68000 family. The new chip has a completely new internal architecture, but is software-compatible with the earlier designs. Two versions of the 370,000-transistor chip will be available later this year, running at 20MHz and 30MHz. The 20MHz version will be built with a 1.5 micron double-metal CMOS process and will deliver a peak execution rate of 10 million instructions per second and a sustained rate of 6 to 8 MIPS. NatSemi will shrink the process to 1.25 micron for the 30MHz version, which it claims will deliver a peak of 15 MIPS and sustain an execution rate of 8 to 10 MIPS – way ahead of any of its complex instruction set microprocessor rivals. (Exotic thoroughbreds such as AMD's enhanced RISC microprocessor, which claims a sustained throughput of 17 MIPS and a peak execution rate of 25 MIPS, may be faster, but they are no more intended for the mass market than a Ferrari is.)
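The peak figures quoted are consistent with a simple clock-rate-over-cycles-per-instruction calculation. This back-of-envelope check is our own, not NatSemi's published methodology:

```python
def peak_mips(clock_mhz, cycles_per_instruction):
    # MIPS = (millions of clock cycles per second) / (cycles per instruction)
    return clock_mhz / cycles_per_instruction

# A peak of 10 MIPS at 20MHz implies a best case of 2 cycles per
# instruction, and the same 2-cycle figure gives 15 MIPS at 30MHz:
assert peak_mips(20, 2) == 10.0
assert peak_mips(30, 2) == 15.0
```

The sustained figures of 6 to 8 MIPS at 20MHz then correspond to the 2.5 to 3.3 cycles per instruction that real instruction mixes and memory traffic impose.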
The 32532, like its predecessors, has been designed as a high-performance Unix engine, and the company hopes it will break into the market for transaction processing and fault-tolerant systems and improve the 32000 family's share of the market for military systems, embedded controllers in laser printers, and robotics. NatSemi will also be pushing it hard in the fast-growing multiprocessor, parallel processing field. The designers have used all the extra transistors on the 32532 to extend the pipeline to four stages, provide separate on-chip instruction and data caches, bring a paged memory management unit on board, and improve the bus interface unit.

Cache coherency (the process that ensures that the data held in the caches is always either a true copy of the data held in memory, or invalidated when the memory location is overwritten) is also implemented in hardware rather than software. This allows it to operate in parallel with the other chip functions and speeds up applications running on the processor, as a software implementation would have to steal CPU cycles with an interrupt every time a read or write operation to main memory was performed. The cache controller is also connected to a port that allows off-chip monitoring of the internal operations of the data cache: in multiprocessor applications the caches need to monitor each other to ensure global cache coherency. With multiple 32532 systems, the designers will be able to connect the caches directly, freeing valuable cycles on the external data bus. Overloading the data bus is one of the major problems with multiprocessor systems, and maintaining consistency over multiple caches is a major source of bus traffic.
The 32532 uses a write-through-with-invalidation strategy to ensure cache coherency: every time a write is made to a location in main memory, the memory management unit/cache controller checks to see whether the data held in that memory location has been copied into the cache. If it has, the cache entry is flagged as invalid to prevent the processor using stale data. And every time the processor tries to read data from a location in memory, the cache controller must check whether the data has been copied into the cache and updated there, leaving the copy in memory stale. If this has happened, the read request must be intercepted and the data returned to the processor from the cache. In a multiprocessor system it becomes even more complicated, as each processor has its own cache, and they all have to be checked each time a read or write request is made – a process that generates a lot of unwanted and often unnecessary bus traffic. Systems designers have developed a number of complicated strategies to try to cut down the unnecessary bus traffic while still maintaining coherency across multiple caches, but with variable success. The 32532's separate port for the cache controllers should enable them to be connected directly via a separate, dedicated bus, allowing all the caches to be checked and coherency maintained without increasing traffic on the data bus.

The instruction cache contains 512 bytes of very fast, direct-mapped storage and a 16-byte buffer that can transfer an instruction to the pipeline's loader each clock cycle. The separate data cache can hold 1,024 bytes with a two-way set-associative organisation. Three separate buses connect the caches and memory management unit, while three more connect them to the external bus interface and control logic, which has its own three-entry buffer allowing it simultaneously to accept and hold requests for memory reads, writes and instruction fetches while it is busy controlling the current bus cycle. The four-stage pipeline – instruction loader, address unit, register file and execution unit – can operate on seven instructions simultaneously. By extending the pipeline (while maintaining the 32000 series register structure to ensure software compatibility with existing applications) and efficiently integrating the memory management unit and twin caches, the designers have cut the average number of clock cycles needed per instruction. Earlier architectures executed the basic instructions – add, subtract, move, load and store – in 2.4 to 2.8 cycles.
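The write-through-with-invalidation scheme can be sketched in a few lines of Python. This is a toy model to show the bookkeeping, not a description of NatSemi's hardware; the class and method names are invented for illustration:

```python
class WriteThroughCache:
    """Toy model of write-through-with-invalidation coherency.

    Writes go straight through to main memory, and any cached copy
    of the written address is invalidated, so a later read can never
    return stale data.
    """

    def __init__(self, memory):
        self.memory = memory      # backing store: dict of addr -> value
        self.lines = {}           # cached copies: addr -> value

    def write(self, addr, value):
        self.memory[addr] = value # write-through: memory updated immediately
        self.lines.pop(addr, None)  # invalidate any cached copy of this addr

    def read(self, addr):
        if addr in self.lines:    # cache hit: no memory traffic needed
            return self.lines[addr]
        value = self.memory[addr] # cache miss: fetch from memory and fill
        self.lines[addr] = value
        return value


memory = {0x100: 1}
cache = WriteThroughCache(memory)
assert cache.read(0x100) == 1     # first read fills the cache line
cache.write(0x100, 2)             # write invalidates the cached copy
assert cache.read(0x100) == 2     # next read re-fetches the fresh value
```

In the real chip, of course, the invalidation check runs in dedicated hardware in parallel with execution; the point of the sketch is only the invariant that a write leaves no stale cached copy behind.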
The 32532, with the help of three key pieces of systems software, achieves a throughput of 2.1 to 2.15 cycles per instruction. Two of the key software refinements operate within the pipeline: branch prediction logic and a hazard detection mechanism. The third is a set of efficient, high-level-language optimising compilers for applications written in C, Fortran, Pascal and Modula-2. These smart compilers generate efficient machine code by optimising the use of the chip's pipeline structure, branch-prediction logic and dual caches.

Hazard detection

The branch prediction logic is used to select between sequential and non-sequential instruction streams whenever a branch instruction is decoded. NatSemi will say only that it chooses between the two possible destination addresses calculated by the loader using criteria based on branch condition and direction. However, it claims these criteria correctly predict the next instruction in 80% of all branches, cutting the average delay from four cycles to two cycles.

The hazard detection mechanism in the address unit helps avoid data loss due to read and write overlaps in the pipeline. As the separate pipeline stages operate in parallel, at some point, if nothing is done to prevent it, the instruction unit will read ahead for the next instruction at the same moment that a later pipeline stage is performing a write to memory that crosses a page boundary, so that the write has to access two separate pages of virtual memory. But a read operation pre-empts a write operation, and the pre-empted write can retain only one of its two pages of virtual memory when the read pushes past it. In the resulting confusion data is lost, the pipeline locks, or the system crashes – none of which improves the performance of the chip. This sorry state of affairs can be avoided by building in an automatic delay, but even though it would rarely be needed, that too degrades the performance.
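NatSemi's branch figures can be checked with a simple expected-value calculation. The article gives the hit rate, the unpredicted delay and the average delay but not the cost of a correct prediction, so the figure derived below is our inference, not a NatSemi number:

```python
# Expected branch delay = p_correct * d_correct + p_wrong * d_wrong.
p_correct = 0.80       # claimed prediction hit rate
d_wrong = 4.0          # full four-cycle delay when the prediction is wrong
d_average = 2.0        # claimed average delay with prediction enabled

# Solve for the delay on a correctly predicted branch:
d_correct = (d_average - (1 - p_correct) * d_wrong) / p_correct
# d_correct comes out at about 1.5 cycles per correctly predicted branch
```

In other words, for the claimed figures to be self-consistent, a correctly predicted branch must still cost around a cycle and a half on average.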
The hazard detection mechanism, however, is able to steal cycles from a pipeline stage that is waiting for the write to complete without affecting the performance. It checks the page tables in the memory management unit for non-aligned data crossing page boundaries, and only if it finds any does it delay the read operation. Thus it takes evasive action that degrades the performance of the chip only when absolutely necessary to avoid data being corrupted. NatSemi is taking a gamble by introducing all these untried innovations at the same time on the same chip, but if they work reliably it may just be the winning design that has eluded it so far.
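The hazard check itself reduces to a simple address calculation: does a pending write straddle a page boundary? A minimal sketch, assuming a 4Kbyte page for illustration (the article does not give the 32532's page size):

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only

def crosses_page_boundary(addr, size):
    """True if a transfer of `size` bytes starting at `addr`
    spans two virtual-memory pages."""
    return addr // PAGE_SIZE != (addr + size - 1) // PAGE_SIZE

def must_delay_read(pending_write_addr, pending_write_size):
    # The detector delays the instruction fetch only when a pending
    # non-aligned write straddles a page boundary; writes contained
    # within a single page let the read proceed immediately.
    return crosses_page_boundary(pending_write_addr, pending_write_size)

assert not must_delay_read(0x1000, 4)  # contained within one page: no delay
assert must_delay_read(0x1FFE, 4)      # 4-byte write straddles 0x2000: delay
```

The rarity of the second case is exactly why detecting it on demand beats building in an unconditional delay.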