At IBM UK’s DB2 briefing last week, the company made it clear that there were plenty of improvements in the DB2 release 2.2 that ships this month – even if a big improvement in raw transaction throughput was not one of them. So what are they? The answer is query speed and distributed database.Well, as the entire world knows distributed database means many things to many people, and most of them don’t work. However on the query front, IBM seems to have come up trumps, with a 10-fold overall improvement and a fivefold reduction in processor time needed to handle a query. It has done this by using multiple index access paths or multi-index searching.Imagine a query that wants to explore three separate fields, dictating a maximum or minumum value to each and slim down the records just to those that comply. It is the sort of query function that relational databases seemed to be invented for, for instance Find me all the employees that have been here more than 10 years, that have sales experience and that earn less than UKP25,000 a year.

Multiple index

Perhaps the criteria that John Akers himself looks for when fleshing out the IBM sales team these days. Any such query would do. What single index searching does is exactly what the question asks, it finds the data pages with the length of service over 10 years and asks what next. Multiple index searching first looks at the whole question, and sets up three separate jobs, looking for length of service, at all the records of historical job titles and down the salary column as well. Instead of reading in those data pages it just loads the pointers to each set of qualifying records, then compares them. IBM is no longer allergic to the word pointer, when used in conjunction with relational databases, which it was when Codd was at the helm, but it prefers to call them Row Identifiers or RIds. By comparing them, it can reduce the RIds to the final answer before ever looking at a data page. That cuts down a lot of input-output as well as a lot of work. It isn’t the fact that the computer has three eyes looking down the three columns at once that speeds the process up, because it still does it one job at a time. But using an intelligent pre-fetch instruction (an asynchronous fetch that invokes an input-output without tying up a processor waiting for the disk to respond) it at least has the three disk locations being addressed in parallel, even if the final processing can’t be done that way. That’s where the statement of direction on multiprocessors comes in – IBM should be able to speed up those equiries further once it can work one enquiry across more than one processor. IBM got the 10-fold increase by searching a 10m row table looking for two predicates connected by the AND operator, where the result was 9,851 rows that qualified. Another contributor to the improved performance has been the issue of query optimisation, and here IBM was handing out some clues to the future. All of its relational database products have some form of optimiser which looks at data about the files sizes, types and structure (isn’t this part of what we used to call meta-data) and decides upon the most sensible way to answer the SQL query. IBM calls these data distribution statistics and the additional ones it is keeping include the 10 most frequently used multi-index searches and non-unique indexes. These can be adjusted by the database administrator at the site and invoked or not using the DB2 Runstats utility. Will all of this type of meta-data become the province of the Repository Manager? Yes, but only in the fullness of time. These improved optimisation algorithms have created a headache for IBM and others for as long as anyone can remember. Chris Date tells an anecdote about trying to design an optimiser that will work across an entire distributed database. The same anecdote, slightly updated, is still doing the rounds of IBM presentations and like all good jokes it weakens with the number of people who have told it before. IBM’s current joke about this is that you can have two chunks of data,

one on an MVS mainframe, another on an OS/2 machine.The MVS system has 10m rows and the PS/2 has just 100, and an SQL query needs a relational join done on the combined databases. Well the first thing that it tried to do was download all 10m records from the remote mainframe to the OS/2 machine. Some canned laughter. Date’s version was simply that We turned the algorithm off and asked the question with the variables in three different orders. One time it took three seconds, another time it took three hours and they’re still waiting for the third one to finish. You had to be there, but however you cut it, it is still the same joke, and the same problem.The solution however is to have each of the local optimisers in constant touch with each other, and make sure that all SQL queries are fully compiled first and not taken on the fly, first search term first. It is easy to establish the last part, but optimisers that talk reliably and which take account of each other’s statistics are another matter, especially where the decision to feed or not to feed all the available statistics to the optimiser is made at database administrator level. Human error will definitely find a way of creeping through into a seriously complex network. Apart from these improvements in performance what can IBM deliver that’s new this month?

Distributed database

Distributed database means many things to many people. To IBM it means four things, and roughly it plans to deliver two and a half of these this month. The first is to give Systems Application Architecture SQL requests remote access across a network to any database; the second, to provide transaction integrity between one local database and one other remote database, and process SQL requests on either one of them but not both; the third is to pre-compile and bind an SQL request, and have it extract data from both databases; and the fourth, to have fully distributed requests search multiple databases and deliver the answer to one database user transparently.IBM reckons that in DB2 Version 2.2 it gets you somewhere between two and three, by getting data from two databases at once, but at the cost of guaranteed transaction integrity, two phase commit has to be handled in user programming if you’re using more than one database right now, and without the help of synchronised optimisers. For those of us that can’t help thinking of distributed database as item four and item four alone, what does IBM suggest we do with our data at in the meantime? Put it where it makes most sense, said Starkey firmly. It was impossible to believe that IBM meant anything other than Keep it on the mainframe.