At the Hot Chips show this week at Stanford University, Newisys showed off a new chipset code-named Horus that will allow it to create the big iron systems that it was hinting it could deliver when it burst onto the scene in January 2003.
Newisys may not be a household name, but Sun Microsystems Inc. is reselling its machines as the Sun Fire V20z and V40z and Verari Systems Inc. (formerly known as RackSaver) is also peddling its designs.
Hewlett-Packard Co. and IBM Corp. have opted for their own Opteron server designs, so far at least. IBM’s choice is somewhat mysterious, given that the founders of Newisys back in August 2000, Phil Hester and Clay Cipione, were key designers of IBM’s RISC and Intel workstation products, and that Rich Oehler, the chief technology officer at Newisys, was the lead architect for IBM’s RISC and then PowerPC processors in the 1980s and 1990s.
Oehler was also one of the designers of IBM’s Summit family of server chipsets for Xeon and Itanium processors. Why IBM has not just rebadged Newisys machines (and moved beyond the two-way eServer 325 it currently sells) is just plain odd, especially considering that the xSeries line, with the exception of the BladeCenter blade servers and xSeries Summit boxes, is actually manufactured by Sanmina-SCI, the parent company of Newisys.
You might be thinking that the advent of truly powerful Opteron servers from Newisys that scale from four to 32 processors might encourage IBM and HP to more fully embrace Opteron and just adopt the future Horus designs from Newisys. But with their own Power-Squadron and Itanium-Integrity server lines to protect, IBM and HP will put off adopting and endorsing these Horus machines. (It is truly funny that Horus is the Egyptian god of the rising sun, so maybe that portends that Sun Microsystems will stop trying to build its own big Opteron servers and just keep on using Newisys designs. Maybe Newisys has gallows humor, or Sun had a change in plans once it bought startup Kealia to get founder Andy Bechtolsheim back in the business of creating servers, in this case based on Opteron processors instead of Sparcs.)
According to Oehler, who gave the presentation at Hot Chips this week, Newisys has been working on the ASIC that makes up the Horus chipset for almost three years. He says that while the Opteron design, with its integrated memory controller and HyperTransport interconnect, can gluelessly scale (meaning it doesn’t need a sophisticated chipset like Horus) from two to eight processors in a single system image, the ring architecture of the resulting systems does not scale linearly as processors are added to the machine.
The main problem is keeping cache memories coherent, which means ensuring that any data in a cache has not been updated by any of the local processors on the cell board to which it is physically attached, or by processors on remote cell boards adjacent to it in the system (and which have access to that cache). This cache coherency is what makes many processors look like one virtual processor to the operating system, and as you add processors gluelessly using HyperTransport, the overhead from managing the extra caches stresses the system and it does not scale as linearly as many server vendors would like. (Which is probably why no major server vendors are making eight-way Opteron machines, why eight-way Pentium and Xeon machines were difficult to sell back in the 1990s, and why it takes a clever architecture like IBM’s Summit or Unisys Corp.’s ES7000 to scale well above four x86 processors.)
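To get a rough feel for why coherency overhead grows faster than the processor count, consider a deliberately simplified model in which every cache miss sends a probe to every other processor. The broadcast assumption, the function names, and the miss count are illustrative only and were not part of Oehler's presentation.

```python
# Toy model (an assumption, not Newisys data): each cache miss probes
# every other processor in a glueless system.

def probes_per_miss(processors):
    """Probes one miss generates when all other processors are asked."""
    return max(processors - 1, 0)

def total_probes(processors, misses_per_processor):
    """Total probe messages if every processor misses equally often."""
    return processors * misses_per_processor * probes_per_miss(processors)

if __name__ == "__main__":
    for n in (2, 4, 8):
        # The miss count of 1000 is a hypothetical workload figure.
        print(n, "processors:", total_probes(n, misses_per_processor=1000), "probes")
```

In this sketch, doubling the processor count roughly quadruples the probe traffic, which is the kind of growth a directory-style chipset tries to avoid.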
Moreover, while HyperTransport can scale to eight processors, Oehler says that HyperTransport was created for only short links, which means vendors have to pack all the main memory (which is dedicated in blocks to each CPU on the cell board) and processors for an eight-way machine in a very tight space, which is then tough to cool.
The Horus chipset takes a different approach. Instead of creating one giant communications ring structure on which all of the processors and their cache memories are linked to each other, Newisys has decided to adopt a four-way cell board architecture based on standard Opteron chips and then use the Horus chipset as an intermediary. Each cell board uses HyperTransport to cluster four Opterons and keep their caches coherent, and also uses the Horus chipset to keep track of the state of caches on remote cell boards in the system.
Conceptually, this is very similar to the architectures IBM is using in its Summit xSeries and various Power machines, and bears some resemblance to the approach Unisys uses in its ES7000s and indeed to most modern Unix architectures. Horus is a ring that can support up to 32 sockets, which means up to eight four-socket cell boards. Today, Advanced Micro Devices only sells single-core Opterons, but when AMD jumps to dual-core Opterons in 2005, the Horus servers will scale to 64 cores. This will be as big a box as any other server vendor can put on the market.
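The scaling arithmetic is simple enough to check in a few lines. The sketch below uses only the figures quoted above (four sockets per cell board, a 32-socket ceiling); the function and constant names are made up for illustration.

```python
# Figures from the article; the names are illustrative.
SOCKETS_PER_CELL_BOARD = 4   # four-way Opteron cell boards
MAX_SOCKETS = 32             # ceiling of the Horus ring

def max_cell_boards(max_sockets=MAX_SOCKETS,
                    sockets_per_board=SOCKETS_PER_CELL_BOARD):
    """Number of cell boards that fit under the socket ceiling."""
    return max_sockets // sockets_per_board

def total_cores(cell_boards, cores_per_socket,
                sockets_per_board=SOCKETS_PER_CELL_BOARD):
    """Core count for a given number of boards and cores per socket."""
    return cell_boards * sockets_per_board * cores_per_socket

if __name__ == "__main__":
    boards = max_cell_boards()
    print("cell boards:", boards)
    print("single-core Opterons:", total_cores(boards, 1), "cores")
    print("dual-core Opterons:", total_cores(boards, 2), "cores")
```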
The trick to any NUMA-like architecture, says Oehler, is keeping the latency between the cache memory on the cell boards down. To keep from having to rewrite an operating system and its applications, Oehler says a server design has to have a 3:1 ratio or less between the time it takes for a CPU to reach into the cache of a cell board on the other side of the server compared to the time it takes for that CPU to reach into the local cache memory on its own cell board.
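As a quick illustration of the 3:1 rule of thumb, the sketch below compares a remote access time to a local one. The latency numbers in the usage lines are hypothetical, since Oehler did not disclose the Horus figures, and the function names are invented for the example.

```python
def numa_ratio(remote_ns, local_ns):
    """Ratio of remote access time to local access time."""
    if local_ns <= 0:
        raise ValueError("local latency must be positive")
    return remote_ns / local_ns

def within_target(remote_ns, local_ns, max_ratio=3.0):
    """True if the design meets the 3:1 (or better) guideline."""
    return numa_ratio(remote_ns, local_ns) <= max_ratio

if __name__ == "__main__":
    # Hypothetical latencies in nanoseconds, not Newisys measurements.
    print(within_target(remote_ns=240, local_ns=100))
    print(within_target(remote_ns=450, local_ns=100))
```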
Oehler won’t say how low the Horus designs will go, but he has said that Newisys has added a 64MB L3 cache to the Opteron architecture – Opterons include a main memory controller as well as L1 and L2 caches on chip – that it uses as a remote data cache to keep track of which CPU is using which cache lines. The ring of Horus ASICs is basically reading this cache very quickly and allowing the cell boards to work through it to reach the cache in adjacent cell boards.
The Horus ASIC also has a remote directory, which keeps track of what cache lines are being accessed and controlled by cells outside of a given cell board. He also adds that AMD helped Newisys minimize the cache coherency traffic, which further reduced latencies. Since the Horus chipset creates another cache hierarchy above that built into the Opteron chips, Newisys will be calling it the Extended Scale Architecture when it becomes a product next year.
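Newisys has not published the details of its protocol, but the general idea of a remote directory can be sketched generically. In the toy example below, a cell board records which remote boards have read one of its cache lines, so that a later write only has to notify those boards rather than every board in the system. The class, method names, and board numbering are all hypothetical, and this is a textbook-style directory, not the Horus design.

```python
from collections import defaultdict

class RemoteDirectory:
    """Toy directory for one cell board (illustrative, not the Horus protocol)."""

    def __init__(self, home_board):
        self.home_board = home_board
        # cache line address -> set of remote board ids holding a copy
        self._sharers = defaultdict(set)

    def record_remote_read(self, line_addr, board_id):
        """Note that a remote board has fetched a copy of this line."""
        if board_id != self.home_board:
            self._sharers[line_addr].add(board_id)

    def boards_to_invalidate(self, line_addr, writer_board):
        """Return the remote boards whose copies go stale on a write."""
        targets = self._sharers.get(line_addr, set()) - {writer_board}
        if writer_board != self.home_board:
            # A remote writer becomes the only remote holder of the line.
            self._sharers[line_addr] = {writer_board}
        else:
            # A local write leaves no valid remote copies.
            self._sharers.pop(line_addr, None)
        return sorted(targets)

if __name__ == "__main__":
    directory = RemoteDirectory(home_board=0)
    directory.record_remote_read(0x1000, board_id=2)
    directory.record_remote_read(0x1000, board_id=5)
    print(directory.boards_to_invalidate(0x1000, writer_board=2))
```

The point of the sketch is that a write touches only the boards listed in the directory, instead of probing all of them.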
Oehler says that the Horus chipset has taped out, and that Newisys expects to get ASICs back from the foundry in early 2005 and into systems for OEM customers to examine by the middle of 2005. If all goes well, server makers could OEM the product and have it for sale by the end of next year. “My first job is to get a server built,” says Oehler.
“It will be straightforward for us to get to eight, 12, and even 16 processors, but getting to 32 processors will be more of a challenge because of software. The hardware always leads the software in the right direction,” he adds, and he knows a thing or two about that trend. “This has the potential to change the game big time against RISC/Unix systems,” says Oehler. That is something that Sanmina-SCI is clearly counting on. It will be interesting to see what IBM, Dell, Sun, and HP do.