Set of Best Practices (dis)proved by Benchmarking
The market for MPP engines is broad, and the big cloud players constantly evolve their offerings, so it is really interesting to get a better understanding of their capabilities and how they perform.
Let’s review Azure Synapse Dedicated SQL Pool, an MPP database platform that is the evolution of Microsoft’s PDW appliance-based data warehouse and is intended to serve data warehouse needs.
The motivation for this benchmark is to find simple answers to simple questions: how to tweak the physical schema of an Azure Dedicated SQL Pool built on top of a classic star schema, with realistic enterprise-wide data volumes, to answer the most common analytics queries.
There are some decent benchmarks worth reviewing based on the TPC-H schema, for example the one performed by Gigaom. The latest one is really worth examining but, in my view, it doesn’t provide answers to the following questions:
- How do index tuning and distribution options impact performance?
- How does cardinality impact response time?
- How many resources are required to keep the queries within a 3–5 second response-time interval for a given data volume?
The current benchmark is intended to give answers to these questions and prove or disprove some of the general recommendations.
Let’s consider a classic star schema with a fact table surrounded by a set of dimension tables.
For the simplicity of the test, the fact table references copies of the same dimension at different sizes.
Queries to Test
Let’s examine the simple analytics queries split into 3 simple categories:
- Full, big-range, and small-range scans, as well as random access, implemented as filtering based on equality and range predicates;
- Join operations between a bigger and smaller table (fact and dimension);
- Big table to big table joins
For the sake of testing different profiles, input tables of different sizes are examined:
- Smaller dimensions with 100, 10K, and 100K records
- Larger dimensions: 1M, 10M, and 100M records
- Fact tables with 10M, 100M, and 1B records
The following cluster configurations and preparation steps were used:
- DW200c, DW1000c, DW3000c, and DW6000c capacity/performance levels;
- no concurrent queries; all queries were run sequentially;
- the queries were run under the xlargerc resource class, which ensures a 70% memory allocation;
- an index rebuild was performed before the benchmark query runs on all dimension tables using the replicated distribution type;
- statistics were updated for all tables participating in the queries;
- result set caching was switched off;
- before each test the following code was run: DBCC DROPCLEANBUFFERS; DBCC FREEPROCCACHE;
- the displayed elapsed-time values are averages of 5 consecutive runs, excluding the minimum and maximum.
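As a sketch, the per-run preparation could look like the following T-SQL. The database name, login, and table names here are illustrative assumptions, not the exact scripts used in the benchmark:

```sql
-- Assign the benchmark login to the xlargerc resource class (run once)
EXEC sp_addrolemember 'xlargerc', 'benchmark_user';

-- Rebuild indexes on the replicated dimension tables and refresh statistics
ALTER INDEX ALL ON dbo.dim_customer_10M REBUILD;
UPDATE STATISTICS dbo.dim_customer_10M;
UPDATE STATISTICS dbo.fact_order_line_1B;

-- Disable result set caching (run from the master database)
ALTER DATABASE benchmark_dw SET RESULT_SET_CACHING OFF;

-- Clear buffer and plan caches before each test run
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
```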
Dedicated SQL Pool Schema Design Options
The intention of the benchmark is to test different options on the physical schema level with the fixed star-schema-based logical data model.
So let’s consider the following two main directions: indexing options and data distribution:
- Default clustered columnstore index (referred to as CCI below)
- Ordered clustered columnstore index (Ordered CCI below)
- Clustered index (non-columnstore)
- Secondary B-Tree index
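In Dedicated SQL Pool, these options are declared in the WITH clause of CREATE TABLE. A minimal sketch, with illustrative table and column names:

```sql
-- Default CCI (also the default when no index option is given)
CREATE TABLE dbo.fact_cci
(order_id bigint NOT NULL, order_timestamp datetime2 NOT NULL, total_amount decimal(18,2))
WITH (DISTRIBUTION = HASH(order_id), CLUSTERED COLUMNSTORE INDEX);

-- Ordered CCI: columnstore segments are sorted by the ORDER column
CREATE TABLE dbo.fact_occi
(order_id bigint NOT NULL, order_timestamp datetime2 NOT NULL, total_amount decimal(18,2))
WITH (DISTRIBUTION = HASH(order_id), CLUSTERED COLUMNSTORE INDEX ORDER (order_timestamp));

-- Clustered (non-columnstore) row-store index
CREATE TABLE dbo.dim_ci
(customer_sk bigint NOT NULL, customer_bk varchar(50) NOT NULL, category varchar(100))
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (customer_sk));

-- Secondary B-Tree index on top of the clustered table
CREATE INDEX ix_dim_ci_customer_bk ON dbo.dim_ci (customer_bk);
```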
Test #1: Small, Big, and Full Table Scan (no Joins)
In the scope of test #1, a set of simple filter queries was run, scanning datasets of different sizes:
- small — locating a couple of records;
- big — 10–30% of the whole table;
- full scan — the whole table analyzed.
Each individual test has a certain objective to verify. The full list of queries run during the benchmark can be found at the very bottom of the article.
Test #1.1: Default CCI Index
The first test is performed on tables with the default index configuration — the Clustered Columnstore Index. Below you can find the results at different capacity levels: DW200c, DW1000c, and DW3000c:
- Starting from the DW200c performance level, Dedicated SQL Pool provides a response time within 3 seconds for the majority of the queries.
- The exceptions are queries on really large tables, such as those containing one billion rows.
- Interestingly enough, the timing of the query “Small_Range_Dim_10M_02” is worse on DW1000c than on DW200c. I assume this is because, starting from DW1000c, the cluster has two compute nodes, which introduces some inter-node communication overhead and leads to some performance degradation for such queries.
Test #1.2: Default CCI vs Ordered CCI
Recently Azure introduced an interesting performance enhancement feature of Dedicated SQL Pool called Ordered Clustered Columnstore Index.
At first glance, it may look like a game-changer, but in fact it is not. I recommend reading the article from Microsoft to better understand the rather unique circumstances under which it can bring significant benefits.
In the scope of the current test, a range scan based on a timestamp was tested, covering either 10% or 30% of all the records. Three options for the table structure have been tested:
- Default CCI index
- “Ordered CCI 1” index — an ordered CCI index on the order_timestamp field, in conjunction with hash distribution on the same order_timestamp field
- “Ordered CCI 2” index — an ordered CCI index on the order_timestamp field, with the same hash distribution as in the Default CCI case, based on the order_id field
The query pattern tested:
select count(*), sum(total_amount)
from fact_order_line_<size>
where order_timestamp between <range_of_different_size>
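As a sketch, the two ordered variants could be created via CTAS like this. The source table name is an illustrative assumption:

```sql
-- "Ordered CCI 1": ordered and hash-distributed on order_timestamp
CREATE TABLE dbo.fact_order_line_occi1
WITH
(
    DISTRIBUTION = HASH(order_timestamp),
    CLUSTERED COLUMNSTORE INDEX ORDER (order_timestamp)
)
AS SELECT * FROM dbo.fact_order_line_100M;

-- "Ordered CCI 2": ordered on order_timestamp, but hash-distributed
-- on order_id (the same distribution as the default CCI table)
CREATE TABLE dbo.fact_order_line_occi2
WITH
(
    DISTRIBUTION = HASH(order_id),
    CLUSTERED COLUMNSTORE INDEX ORDER (order_timestamp)
)
AS SELECT * FROM dbo.fact_order_line_100M;
```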
- Even though Ordered CCI 1 can bring significant performance benefits for smaller range scans, it requires ideal circumstances that are almost impossible to achieve in real-life cases
- In general, the Ordered CCI index doesn’t bring significant benefits
Test #1.3: Clustered (non-columnstore) and B-Tree Indexes
The Dedicated SQL Pool has the flexibility to choose between different types of indexes. Besides the default clustered columnstore index, there is a possibility to create the usual clustered one.
So it is interesting to understand when it is worth using it.
In the scope of this test, the same queries were run on a dimension table with 10 million rows, built on either the default CCI index or a clustered (non-columnstore) index with B-Tree indexes built on top of it.
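A sketch of the clustered-index variant of the dimension, with a secondary B-Tree index on top. Table and column names are illustrative assumptions:

```sql
-- Clustered (non-columnstore) index on the surrogate key
CREATE TABLE dbo.dim_customer_10M_ci
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (customer_sk)
)
AS SELECT * FROM dbo.dim_customer_10M;

-- Secondary B-Tree (nonclustered) index for lookups on another column
CREATE INDEX ix_dim_customer_10M_ci_bk
ON dbo.dim_customer_10M_ci (customer_bk);
```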
Results are reflected on the chart below:
- The clustered index, as well as secondary B-Tree indexes, gives the best response time for random access and small range scans;
- The clustered index, as well as secondary B-Tree indexes, works slightly worse for bigger range and full scans.
Test #1.4: Different Number of Columns
It is well known that the default CCI index works best for analytics scenarios that involve only a few columns in the query.
The objective of the test is to verify how the number of columns involved in a select statement impacts the timing.
The query pattern:
select <calculation list>
from fact_order_line_<size>
where order_timestamp between '2021-02-01' and '2021-03-01'
- 1 Column: count(*)
- 2 Columns: count(*), sum(total_amount)
- 3 Columns: count(*), sum(total_amount), sum(shipping_amount)
- 4 Columns: count(*), sum(total_amount), sum(shipping_amount), …
The statistics on query timing can be found below:
- Adding more columns to the select statement significantly impacts response time;
- There is about a 40% difference between 2 and 4 columns involved in the calculations.
Test #1.5: Replicated vs Distributed by Hash
The main objective of test #1.5 is to verify how the data distribution type impacts the performance of the range query of different types.
The same data has been stored in the tables of 3 data distribution types: 1) round-robin; 2) replicated; 3) distributed by hash.
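A sketch of the three variants via CTAS. Table names are illustrative assumptions:

```sql
-- 1) round-robin: rows spread evenly, with no distribution column
CREATE TABLE dbo.fact_rr
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.fact_order_line_100M;

-- 2) replicated: a full copy of the table on every compute node
CREATE TABLE dbo.fact_rep
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.fact_order_line_100M;

-- 3) hash-distributed: rows assigned to distributions by a hash of order_id
CREATE TABLE dbo.fact_hash
WITH (DISTRIBUTION = HASH(order_id), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.fact_order_line_100M;
```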
You can see the results of the benchmarking at 2 performance levels, DW1000c and DW3000c:
- The query timing is really close at the DW1000c performance level for all 3 options;
- Starting from DW3000c, the hash-distributed type, which I assume allows better parallelization, starts to perform significantly better.
Test #2: Large to Small Tables Join (Fact to Dimensions)
As stated above, the aim of test #2 is to benchmark different aspects of join operations. So let’s review the variety of statistics gathered.
Test #2.1: Defaults
The aim of the first test in this section is to benchmark the basic join scenario which involves:
- The fact to dimension table join operation;
- Fact and dimension tables of different sizes;
- Grouping on one of the dimension attributes along with a couple of aggregations
- Larger fact tables are distributed by hash while smaller dimensions are replicated;
- All tables are based on default CCI indexes;
- Benchmarked on the 3 performance levels: DW1000c, DW3000c, and DW6000c
The query template is reflected below:
select top 50
d.category_id, max(d.category), avg(f.tax_amount), sum(f.total_amount)
from fact_order_line_<size> f
inner join <dimension_of_different_size> d on <join_criteria>
group by d.category_id
order by sum(f.total_amount) desc
- The majority of the queries complete within 5 seconds starting from the DW1000c performance level;
- The exceptions are the last 2 queries, which include joins of larger tables: 1B to 100K and 1B to 1M;
- Scaling resources from DW1000c to DW6000c does not dramatically impact the lighter queries, but for the last 2, which are significantly heavier, it improves the timing almost linearly.
Test #2.2: More Dimension Tables to Join
The objective of test #2.2 is a verification of how the join of an additional dimension table of the same size impacts overall timing.
The comparison has been performed between join of the fact table to 1 dimension table vs join to 2 dimension tables vs join to 3 dimension tables.
The results for performance level DW1000c are reflected below:
- The number of dimensions to join significantly impacts overall timing: joining three dimension tables instead of one takes 3–4 times longer
Test #2.3: Dimension tables with Clustered non columnstore index
Some database engines that support both storage profiles — row-based and column-based — recommend storing dimension tables in a row-based format.
It was interesting to verify how join performance differs when comparing a fact-to-dimension join where the dimension table is built on CCI vs a dimension using a simple clustered (non-columnstore) index.
- As you can see from the chart above the overall performance is almost the same;
- So, other factors should drive your decision about which index approach to use for the dimension table.
Test #2.4: Hash-Replicated vs Replicated-Replicated Joins
One of the key best practices on MPP storage is to keep the larger fact table to be distributed evenly between all the nodes while storing the smaller dimension table replicated on all the nodes.
On the other hand, Azure recommends replicating smaller tables up to roughly 2GB of space. The question is, if a fact table still fits into this size, is it really worth keeping it replicated as well?
So the test below compares the timing of join queries where the (relatively small) fact table is either replicated or distributed by hash.
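Related to this choice, it is worth verifying that a hash-distributed fact table is actually spread evenly across the distributions. A sketch, assuming an illustrative table name:

```sql
-- Row counts and space per distribution; heavy skew would suggest
-- a different distribution column (or REPLICATE for a small table)
DBCC PDW_SHOWSPACEUSED("dbo.fact_order_line_10M");
```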
- The distributed version of the fact table works better in all the cases.
- For the larger table joins, the difference is extremely significant.
Test #3: Big table to Big table Join
The last test is dedicated to the big table to big table join.
Test #3.1: Distribution Types
In the scope of this test, a large fact table with 1 billion rows is joined to a large dimension table containing 100 million rows.
The fact table is distributed by hash on dimension foreign key (customer_sk). The following dimension table configurations are used:
- dimension table distributed by hash on the primary key (customer_sk)
- dimension table distributed by hash on an alternative key (customer_bk)
- dimension table configured as replicated
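A sketch of the three dimension variants via CTAS. The source table name is an illustrative assumption:

```sql
-- 1) hash-distributed on the primary key used in the join
CREATE TABLE dbo.dim_customer_hash_sk
WITH (DISTRIBUTION = HASH(customer_sk), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.dim_customer_100M;

-- 2) hash-distributed on an alternative key (not the join key)
CREATE TABLE dbo.dim_customer_hash_bk
WITH (DISTRIBUTION = HASH(customer_bk), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.dim_customer_100M;

-- 3) replicated on every compute node
CREATE TABLE dbo.dim_customer_rep
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.dim_customer_100M;
```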
The comparison of the timing using DW1000c performance level is reflected below:
- Surprisingly, the distribution based on the joining keys doesn’t bring significant benefits;
- The join to the replicated dimension table outperforms the other options.
Test #3.2 Distribution vs Replicated and Scaling
The last test is intended to verify the same distributed vs replicated options at a bigger scale: DW3000c and DW6000c:
- At a bigger scale, the replicated version of the dimension table shows even more significant benefits in comparison to the distributed version;
- The DW6000c performance level is just enough to handle the 1 billion to 100 million join within a 5-second interval (for the replicated version).
Despite a lot of similarity between different MPP engines, each of them, including Azure Dedicated SQL Pool, has many specifics. Leveraging it to the full requires going deeper and understanding well all the available indexing and data distribution options.
In this article, different aspects and best practices of indexing and data distribution have been proved or disproved by benchmarking.
Some of the highlights to remember:
- Always try to join larger distributed tables with smaller replicated ones;
- Avoid joining 2 big tables as much as you can; otherwise, you will need a lot of expensive resources to provide a reasonable querying experience;
- Watch the number of tables joined and try to reduce it;
- Reduce the number of columns returned in select queries, or consider a clustered-index-based dimension for that purpose.
Azure Synapse Dedicated SQL Pool Schema Design Options Benchmark was originally published in Towards Data Science on Medium.