From our sister publication Computer Business Review

Vendors spend millions of dollars on benchmarking every year. But what does it prove? Tim Witham, manager of the performance evaluation group at Unix systems vendor Sequent, is matter-of-fact, resigned almost. "It is a game we have to play," he says. The game to which he is referring is 'benchmarketeering', the production of wildly optimized 'industry standard' benchmark figures which are supposed to show the relative performance of different vendors' systems, covering hardware, databases and application software.

Witham, like many others in the business, believes it is a largely pointless and increasingly expensive exercise. The marketing-induced benchmark ratings provide scant indication of how a system will manage real workloads, yet for a vendor to produce a single, publishable result can cost more than $1 million, tying up a team of top-flight engineers and valuable equipment for several months. Sequent's annual benchmarking bill stretches to around $6 million. Industry-wide, hundreds of millions of dollars worth of virtually meaningless data will be spewed out this year.

And it looks set to get worse. As the computer industry becomes increasingly commoditized, the pressure on Witham and others like him to perfect the art of benchmarketeering and provide better headline figures with which to differentiate systems will grow more intense. But the end result, say consultants and analysts, could be far more destructive than simply serving to distort figures that are already treated with disdain in many quarters. The cost of running benchmarks, especially for some of the smaller software vendors, they say, is now acting as a barrier to competition and, in some cases, is diverting funds away from core research and development. Benchmarking, they say, isn't just worthless, it is positively injurious.

Benchmarking has a troubled history. But since 1993, when technology advisory firm The Standish Group slammed database giant Oracle for producing TPC (Transaction Processing Council) benchmark results which, it alleged, were invalid, seriously misleading and abused TPC ethics, the benchmark industry has been on a very public crusade to clean up its image and prove its worth. As a result of the Oracle and other disputes, the major industry consortium-led benchmarking bodies TPC, SPEC (the Systems Performance Evaluation Committee) and BAPCo (Business Applications Performance Corporation) (see box) rewrote the rules governing the submission of benchmarks and tightened up the policing of vendors.

Although the Oracle debacle received the most attention, benchmark abuse was rife. Some vendors reported results for systems that were not commercially available; others produced special, one-off, ultra-low-cost system components (such as terminals) to artificially inflate price/performance figures. "It became a game to produce the lowest cost terminal," admits Kim Shanley, Chief Operations Officer of the TPC. The new versions of the industry benchmarks (such as TPC-C, TPC-D and Spec95), say the consortia, have eradicated the loopholes in system testing, restoring equilibrium and providing users with credible performance indicators. The TPC, for example, has abolished terminal pricing and every benchmark is now independently audited to ensure it meets the test conditions. Copies of the test specifications and the modifications carried out by the vendor are freely available to both users and other vendors in a full disclosure report – although digesting and understanding the data is a complex task.
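To make the terminal-pricing game concrete, the sketch below uses entirely hypothetical prices and throughput (not figures from any audited result) to show that a TPC-style price/performance number is simply the cost of the priced configuration divided by measured throughput, and how a one-off, ultra-low-cost terminal line item flatters it.

    # Illustrative sketch only: all prices and the throughput figure below are
    # invented, not taken from any audited TPC result. It shows why a one-off,
    # ultra-low-cost terminal line item flattered price/performance figures.

    def price_performance(component_costs_usd, transactions_per_minute):
        """Return dollars per transaction-per-minute for a priced configuration."""
        return sum(component_costs_usd.values()) / transactions_per_minute

    tpm = 1_500  # hypothetical measured throughput

    standard_config = {
        "server": 400_000,
        "database_licences": 250_000,
        "storage": 150_000,
        "terminals": 200 * 1_000,  # 200 terminals at a realistic list price
    }

    # The same system, but priced with a special one-off $50 terminal.
    gamed_config = dict(standard_config, terminals=200 * 50)

    print(f"standard pricing: ${price_performance(standard_config, tpm):,.0f} per tpm")
    print(f"gamed pricing:    ${price_performance(gamed_config, tpm):,.0f} per tpm")

On these made-up numbers the headline figure drops from roughly $667 to $540 per tpm without the system getting any faster – the loophole that the TPC's abolition of terminal pricing was intended to close.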
SPEC has taken similar measures with the current release of its benchmark, Spec95. "We asked all the vendors involved to try and break it. If they could, we threw it away," says Kaivalya Dixit, president of SPEC. "Everything is public. There are no secrets." Users can take the SPEC CD-ROM and run tests themselves to check results.

Most observers admit that, as a result of these and other measures, flagrant benchmarking violations have largely been curtailed. Where disagreements do arise over tests, says Shanley, they tend to be the result of 'sour grapes' and misunderstandings between the vendors. "Vendors police one another pretty effectively," says Curtis Franklin, lab director at independent benchmarking specialist Client/Server Labs. But despite the best efforts of the benchmarking bodies, no-one will state categorically that the cheating has been stamped out. "If you open a bank, no matter how secure, someone will try and rob it," says Dixit. "People who do benchmarks learn the tricks and get better at doing them." "I don't think that anyone is convinced that vendors aren't still spoofing results," says Franklin. "There has been a definite step forward but it isn't foolproof."

Self Interest

Even if a larger proportion of the benchmark tests are now 'legal', say analysts, it does not necessarily mean that the results are any more useful. Indeed, Herb Edelstein of Two Crows, an independent consulting firm, says the clean-up operation initiated by the likes of Shanley and Dixit may well have been a disservice to the computer community at large. "They have kept things from becoming outrageous. Maybe if they hadn't, the whole thing would be discredited by now."

The most common complaint about the industry standard benchmarks such as TPC and SPEC is that they do not test anything meaningful and that the results have no comparative value. The reason, say analysts, is that they are devised by committees of vendors. "The TPC would strongly object, but we view it as a vendor club," says Brian Richardson, head of the Meta Group's open computing and server strategies service. "If you look at who sponsors the benchmark, it will give you an idea of the bias," says Richard Finkelstein of Performance Computing, an independent systems testing specialist. Both Dixit and Shanley admit that when 20 or 30 companies are involved, it is very difficult to get a new benchmark signed off. And the higher up the system scale the benchmark goes, the more difficult it is to thrash out an agreement on what constitutes a fair test. "Before a benchmark can come out, all vendors have to be convinced that it won't show their systems at a disadvantage," says Franklin. Various parts of TPC-E have been under discussion for a year. With product lifecycles measured in months, benchmarks are often obsolete before they have even been approved.

More frustrating for users is that suppliers, and particularly the database companies, engineer the tests so that like-with-like comparisons are impossible. "Benchmarks can be useful as a broad brush for relative price/performance, but it is misleading to try and extrapolate from the results," says Richardson. Vendors are clever at changing just enough components to make comparisons between them and competitors difficult. "They try and pitch results for projected products against last year's product from a rival vendor," he says. According to Witham of Sequent, database vendors do not publish results based on a system which is identical to a competitor's as a point of principle. "You can get very close to publishing [a benchmarking figure] and then the relational database company says 'no'. You either publish something and anger someone or you don't publish and throw away money. There is lots of backroom politics." Database companies deny that they deliberately avoid directly competitive benchmarks.

A further problem: in the case of TPC at least, vendors do not publish bad results. "Vendors are selective about the results they allow to be published," admits Shanley. "There have been discussions about making people publish – it is a constant source of unhappiness – but no-one has come up with a formula to publish figures which are disadvantageous," he says. The majority of the results that users get to see, and on which they are increasingly being encouraged to base their purchasing decisions, measure systems at the absolute limit of their performance. But do the transaction ratings that engineering experts can tease out of supercharged systems, working with virtually unlimited budgets, bear much correlation to how fast Acme Co can process the payroll? "Databases cannot handle thousands of transactions a minute. It is hundreds or dozens of transactions – that is the reality," says Finkelstein.
"Benchmark systems are like Indy 500 racing cars. Suppliers spend more time and money tuning them than users ever will," agrees Richardson.

Creative Warping

In many cases, it is the marketing department directing the effort. "We go to the technical department and say we want that benchmark," says Kevin Joyce, head of technical marketing at Sequent. "They come back and say it will take this long and cost that much. Usually the cost is signed off." In the multi-billion-dollar Unix systems market, it is too risky not to have a competitive benchmark.

Increasingly, hardware and software vendors are teaming up to share benchmarking costs. But the choice of partner is seen as incredibly strategic. Hardware vendors only want to partner with the best-performing database suppliers, and database suppliers want to be on the market-leading platforms. The decision of who to partner with often goes all the way up to the CEO, say analysts. The practices both sets of vendors use to eke out performance are far removed from everyday computing. "Everyone hard codes screens in benchmarks. Nobody in their right mind would do that," says Witham. Despite the new rules, he admits there is still scope for 'creative warping'. According to Finkelstein, there is an easy, cost-free way for users to work out who is the most creative with performance ratings. "Just look at the revenue figures and you can extrapolate the benchmark," he says. "Oracle has the most money so it can tweak the most." James Johnson, president of The Standish Group, is even more damning. "All vendors lie. All vendors cheat. Benchmarking is just a marketing ploy and it is impossible to make it fair," he says. He refuses, however, to discuss the outcome of his company's legal wranglings with Oracle over the database vendor's alleged abuse and misuse of TPC-A results.

This mistrust of industry standard benchmarks has led to increased use of third-party, independent benchmarks such as those offered by AIM and Client/Server Labs (see box). The difference between the independents and the consortia is clear. "With lots of the consortia it is the wolves guarding the henhouse," says Phil Piserchio of AIM. "Our business perception lies with the ultimate customer – the end user." "We are less susceptible to vendor demands," says Franklin. But even in the independent space, tests are subject to manipulation. "If someone wants to cheat, they are going to find a way to cheat. We have more controls but we are still vulnerable," admits Piserchio.

The proliferation of independent tests, coupled with moves by the industry consortia to offer users greater benchmarking choice, may end up damaging the credibility of the benchmarking process further. One of the key reasons why ratings such as TPC-C and Spec95 have a value is that results are available from virtually all of the mainstream vendors. Although they may not be directly comparable, they do give some idea of ballpark ratings. With the vast increase in the number of available tests, most of which have been tailored to suit the latest marketing vogue, the ability to compare performance between systems is being eroded. "There is a danger that benchmarking will get too fragmented. It beats the purpose of it," says Shanley. Nevertheless, TPC now has TPC-C for OLTP work, TPC-D for decision support and is considering new benchmarks for Web servers, client/server systems and other more application-specific workloads. "Servers are becoming very specific so benchmarks have to, too," says Dixit. SPEC is set to launch a Web server benchmark and is also working on a Java language test. The increasing crossover of Intel and RISC-based architectures in the market is adding to the confusion.
"Some manufacturers are now doing TPC-C with their high-level PCs, so we are doing an Intel-based, high-end, database-specific benchmark," says John Sammons, president of BAPCo, the PC benchmark consortium. BAPCo will overlap with TPC in the coming years, says Sammons.

From all this confusion, the larger vendors, at least, will benefit. With their sizeable resources, they will be able to produce multiple, highly optimized benchmarks, while others may be forced out of the frame. It is, therefore, in the interest of the wealthier vendors to encourage users to demand that industry standard benchmark figures form part of the pre-qualification process of any sale. Over the past year, admits Shanley, certain suppliers have spent millions of dollars advertising their TPC results. "There is much more resource going into benchmarks since Oracle got back in [after its dispute, it temporarily stopped competitive benchmarking]. There are three to four major ad campaigns on TPC results at any one time." "It is a necessary evil. Vendors who don't participate in benchmarks do tend to suffer," says Richardson. "If Oracle does it, Sybase has to," says Finkelstein. "Lots of smaller vendors want to stop but they feel they have to keep playing the game."

All of this raises the question – who uses benchmarks? At a recent Sybase user conference, Edelstein of Two Crows asked the 3,000 or so assembled users exactly that. No-one raised their hand. A survey by The Standish Group puts across a similar message. Of the 367 IT executives questioned by the company, the majority said all benchmarks were unimportant. But a survey by the magazine ComputerWorld found completely the opposite. More than 70% of respondents said they found benchmarks to be useful and relevant. The evidence suggests that most users do pay attention to industry standard benchmarks, even if their interest is only cursory.

It is a fact that infuriates Finkelstein. "Benchmarks have no predictive value at all. In fact, they can be very dangerous," he says. What should users do instead? "Call up other customers and look for people with similar workloads. Go to user groups, join newsgroups and do some in-house benchmarking." Unsurprisingly, Piserchio of AIM also advocates customized, in-house benchmarking. "The ultimate is to say 'test my system live on four platforms'," he says. Edelstein of Two Crows disagrees. For the costs involved, he says, the benefits are often minimal. "The performance differential among the major database vendors is not dramatic." Richardson concurs. "At the low-end, what is the point of benchmarking for a couple of decimal points? Just buy Intel's SHV processor boards and double your power."

By paying even scant attention to benchmarks, users are perpetuating the cycle, say analysts, adding millions to companies' sales and general administration costs, money that inevitably ends up on system price tags. There is another, more worrying aspect. "Benchmarking siphons money away from [the development of] new features and functions," says Johnson. In some cases, the quest to score the ultimate benchmark ends up driving overall engineering plans, regardless of whether or not those improvements provide usable benefits. For example, when systems were measured in terms of Dhrystones and flops, hardware suppliers simply cranked up the clock speed to get better performance, despite the fact that users did not see a corresponding increase in the computational ability of their systems. When SPEC started, the aim was to build a benchmark that tested more than just clock speed by making sure the test program did not fit in memory.
But as the size of the SPEC benchmark grew, so vendors increased the cache on their systems. "Most technical performance people are really into producing better performance and making generic improvements, but they do get pressure from marketing people," says Sammons of BAPCo. For example, he says, before BAPCo, some vendors were tuning their PCs to synthetic benchmarks and using the benchmark specifications in the design of their machines. "Tomorrow's designers will design around Spec. If we screw up, we screw up the future of computer design," says Dixit. "The real purpose should be to make the breed better, not just the numbers better," says Witham. "We have a hard policy – if it doesn't help the real world, it doesn't get done." Most rival vendors say much the same thing.

But sometimes this stance gets overruled by commercial reality. In release 7.3 of its database, for example, Oracle added a new 'concurrency' option so that it could comply with the conditions for running TPC-C tests. According to Bailey, it is not the default mode and it is not used by most Oracle customers, but was added solely to satisfy the TPC requirements.

Until advertising campaigns pitching the benchmark results of one vendor against another stop having an effect, and until users stop viewing benchmark results as preliminary selection factors, these practices will undoubtedly continue. Only when benchmarking is no longer seen as a marketing issue can it get back to being used as it was first intended – as one of many tools designed to help internal engineers test development enhancements, say analysts. In the meantime, the circus continues.

by Joanna Mancey
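For readers tempted by the consultants' advice to do some in-house benchmarking, the fragment below is a minimal, hypothetical sketch of what such a harness might look like; run_transaction is a stand-in for whatever the organisation's real workload does (posting a payroll record, say), and the point is simply to time a sustained run of that workload and report transactions per minute for each candidate platform.

    # Minimal in-house benchmarking sketch: time your own representative
    # workload rather than relying on a vendor's headline figure.
    # run_transaction() is a hypothetical placeholder for the real work
    # (e.g. posting one payroll record through the production database).
    import random
    import time

    def run_transaction():
        # Placeholder: simulate one unit of the organisation's own workload.
        time.sleep(random.uniform(0.01, 0.05))

    def measure_tpm(duration_seconds=60):
        """Run the workload for a fixed period and report transactions per minute."""
        completed = 0
        start = time.monotonic()
        while time.monotonic() - start < duration_seconds:
            run_transaction()
            completed += 1
        elapsed_minutes = (time.monotonic() - start) / 60
        return completed / elapsed_minutes

    if __name__ == "__main__":
        print(f"sustained throughput: {measure_tpm():.0f} transactions per minute")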

TPC

TPC-C figures, measuring the online transaction processing performance of large system servers, are the most widely published performance statistics in the industry. They are also the most fiercely contested, arguably the most abused and definitely the most expensive (often costing more than $1 million). Since the row over a batch of Oracle TPC-A figures (the predecessor to TPC-C), the Transaction Processing Council, a largely vendor-driven organisation, has been trying to overhaul its practices. In 1993, it introduced Clause 0, essentially a stricter code of ethics, and made independent auditing of all test results statutory – but disputes over fair use of results continue to rage. The latest addition to the TPC stable is TPC-D, designed to test systems running complex decision support queries. Only one result has been reported thus far – for a $4 million, 32-node parallel system from IBM. "TPC-D allows such a wide range of database sizes, it will be particularly susceptible to misinterpretation," says Brian Richardson of the Meta Group.

SPEC

While TPC figures dominate the high-end server space, lower down the Unix spectrum the talk is of SpecInt (integer) and SpecFP (floating point), the benchmarks backed by the industry-based Systems Performance Evaluation Committee. Although SPEC does not publish price/performance figures, sticking only to raw processor and compiler performance, its benchmarks have also been subject to allegations of optimization and abuse. It is a point that Kaivalya Dixit, president of SPEC, is happy to concede. "Benchmarks have to be changed every two years or else vendors learn how to manipulate them," he says. "Spec92 went on too long." The new Spec95 test, introduced last year, says Dixit, has resolved the problems. Vendors are forced to publish baseline figures alongside optimized figures, and everyone who undertakes the test has to publish the results, good or bad. But detractors say the methodology is too simple, only really testing CPU performance.

BAPCo

Magazine-based PC benchmarks are "strictly component level and synthetic," says John Sammons of BAPCo (Business Applications Performance Corporation), the industry-based PC testing consortium. BAPCo, he says, is the only one to test real-life, application-level system performance. BAPCo offers a range of test suites – SYSmark95 for emulating Windows 95 performance, SYSmark NT for emulating Windows NT, and SYSmark FS for evaluating networked file servers. The total cost of running a test, including three months' worth of configuration work, is less than $100,000, says Sammons. Now that Intel is starting to move up the value chain with its SHV processor boards, BAPCo is also looking to extend its reach beyond the traditional PC space. "We will eventually cross over with TPC," says Sammons.

AIM

AIM is commonly referred to as the vendor-independent version of SPEC with added pricing information. According to a study by The Standish Group, it is also the one that users value most. What is unusual about AIM is that the suppliers do not run the benchmark themselves. AIM Technology logs into vendors' systems remotely, runs the test and then provides a full report and rating. This approach, says Piserchio, helps to produce more accurate figures, although he admits that pricing information can still be manipulated. "We are not trying to be the price police but it is better than having no price at all." The test costs around $300,000.

If AIM is the independent equivalent to SPEC, then Client/Server Labs (CSL) is the independent equivalent of TPC. Its RPMark benchmark, which includes a subset of TPC-C, is designed to test transaction processing, file serving and decision support performance in a client/server environment, while RPMdbs is designed to emulate high-end enterprise workloads. Overall, the reaction to CSL, which took the IBM RAMP-C (standing for Rochester's attempt at measuring performance) as its base, has been largely positive. Like SPEC, CSL measures both baseline and optimized performance. Richardson of the Meta Group, however, says its results need to be treated with caution. "RPMark can be particularly misleading. It uses a slimmed-down version of TPC-C with a very small database which gives a phenomenally high rating for file and print services," he says.
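As a rough illustration of the baseline-versus-optimized distinction SPEC now insists on, the sketch below (with invented reference and measured times, not real Spec95 data) computes a SPEC-style rating as the geometric mean of reference-time/measured-time ratios, once for conservative 'baseline' compiler settings and once for aggressively tuned 'peak' runs; the gap between the two figures is a crude measure of how much benchmark-specific tuning contributes.

    # Hedged illustration of a SPEC-style rating: each component test gets a
    # ratio of reference time to measured time, and the overall figure is the
    # geometric mean of those ratios. All times below are invented.
    from math import prod

    reference_times = {"compress": 180.0, "gcc": 250.0, "perl": 140.0, "vortex": 300.0}

    baseline_times = {"compress": 60.0, "gcc": 95.0, "perl": 50.0, "vortex": 110.0}  # conservative flags
    peak_times     = {"compress": 48.0, "gcc": 80.0, "perl": 41.0, "vortex": 90.0}   # per-test tuning

    def spec_style_rating(measured_times):
        ratios = [reference_times[name] / measured_times[name] for name in reference_times]
        return prod(ratios) ** (1 / len(ratios))  # geometric mean of the ratios

    print(f"baseline rating: {spec_style_rating(baseline_times):.2f}")
    print(f"peak rating:     {spec_style_rating(peak_times):.2f}")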

Application Benchmarking

Translating inflated 'transaction per minute' ratings into real-world performance indicators is very unreliable. Instead, many vendors are switching to application-specific benchmarks from business software suppliers including SAP, Baan and PeopleSoft, to try to give users figures that they can more readily understand. "We've found [application-specific benchmarks] a whole lot easier to present to customers and, from the customers' point of view, a lot more understandable," says Mike Saranga, senior vice president of product management and development at Informix.

But are these tools reliable? First, the test specifications are not in the public domain and, second, many question whether application vendors will avoid showing their hardware and database partners in a bad light. "I think there would be less bias from application vendors, but apples-to-apples comparison is very difficult. You cannot extrapolate SAP results for PeopleSoft," says Brian Richardson of the Meta Group. Nevertheless, he says, "At the high-end, benchmarking will become application specific – SAP, PeopleSoft and Oracle Financials." Richard Finkelstein of Performance Computing is more sceptical. "PeopleSoft will not embarrass its partners," he says. "You always have the credibility issue when it is a one-vendor benchmark – it is the same with application benchmarks," says John Sammons of BAPCo.