25 Jan 98

Nested iteration is still used, particularly for non-equality joins that dont lend themselves to the other methods.


Hash Aggregation is used for SUM, etc. In this technique, the algorithm is

Read a row from the input table

Hash the key value to a bucket; if key is already there, compute aggregate. If not, put it in a bucket.

Loop through input

Output key values and computed aggregates

Note: this produces unordered output. This means that you can no longer depend on GROUP BY to produce ordered output with out an ORDER BY clause. (Ive been trying to tell you all for years that you should, by ANSI standard, use the ORDER BY!) There is a mechanism in place so that when you upgrade a 6.5 database you will get ordered results.

There are changes in the I/O strategies as well. First of all, pages are now 8K and extents are 64KB (see the Storage Architecture section for more details). During table scans, I/Os are also 64KB. Unordered scans can do parallel reads from files. Read-ahead works on heaps as well as the clustered index. The QP drives the read-ahead.

Bulk update processing has been improved. The QP does the index maintenance. On insert, update or delete, changes are sorted in index order and applied in a single pass over each index. Constraint processing is integrated into bulk updating as well. This technique is used in DBCC and bcp. Microsoft reports that a typical "data massaging" application ran 10 times faster.

The query processor understands partitioning solutions. This means that if you were to define tables to hold monthly data (Jan, Feb, Mar, etc.) and combine them with a UNIONING view, then just request one month from that view, it will only search that table. It appears that you need to define a check constraint limiting the month table to a particular month for it to be able to do this. SQL Server does not support partitioning by value, that is, spreading a table across several different storage areas based on key values as is done in DB2, for example.

Query optimization is still cost-based. I/O is still the dominant factor.

There is a lot of support for parallel operations on SMP machines. This is automatic, you dont have to use any configuration or hinting directives. The degree of parallelism will be adjusted down on loaded systems or systems with insufficient memory. For a single query, the optimizer runs on one cpu. However, multiple queries can be optimized simultaneously if there are multiple cpus.

There is support for distributed queries. This support uses OLE DB. It accesses SQL Server directly. With the Jet OLE DB provider it can access Jet files, Excel, FoxPro and dbase. With the OLE Provider for ODBC, it can access Sybase, Informix, DB2, etc. Microsoft has written an OLE DB provider for Oracle which it uses to access Oracle 7 & 8. There is also interaction with Active Directory Services (NT 5) and Index Server. It uses something called a "Linked Server" for these. I am not clear on the difference between a Linked Server and the current Remote Server concept. It is also possible to use something called the OpenRowset function to retrieve data from an OLE DB data source without making it a linked server. [Interaction with Index Server was mentioned a few times at the briefing. I got the impression that there was some integration between Index Server and text data types but this was not clearly stated. Perhaps my total lack of knowledge of Index Server interfered with my comprehension.]


Many thanks to Sharon for providing these notes - drop her a note at sharond@compuserve.com and tell her thanks!