Tuesday, 23 December 2014

Selecting proper Distribution Key In Netezza for better query performance

Selecting  proper Distribution Key In Netezza for better query performance

Netezza distributes data across the data slices based on the distribution key. This distribution of data impacts the query performance. The distribution key can be specified during table creation (up to 4 columns can be part of distribution key) or it can be specified to be random which mean Netezza will use round robin algorithm to distributed data on the data slices or if not specified Netezza will pick one of the column as distribution key. Hence, Never create table without a distribution key since that does not guarantee optimal performance.

The data has to distributed uniformly across all the data slices so that all the Netezza SPU (snippets) have same amount of work to do. If the data is some data slices are more than others then some SPU will do more work than others in which case the data distribution is skewed which is bad for Netezza query performance.

In the below example, employee_id column is used as distribution key. The distribution keys are defined using the DISTRIBUTE ON clause.

create table employee (
employee_id integer not null ,
employee_name varchar(25),
employee_dept_id integer,
employee_age integer,
employee_address varchar(25)
) distribute on employee_id;

1) When selecting distribution key, choose the columns that have unique values and high cardinality. Hence, primary keys or part of primary key are good choice for distribution key. In the above example, employee ID is unique for all the records hence a good candidate for distribution key.

2) Selecting data ID is not good choice for distribution key. The data might get uniformly distributed across data slices with data ID but if the query requires data from a particular month then only few SPUs might be busy and hence impacting query performance. 

3) If two big tables are joined on certain columns, then better to used those columns as distribution keys.

4) When creating table as (Create table as select * from t1 join t2 on t1.c1=t2.c2) join from two other tables then the distribution keys will be the columns used for joining. If there is a group by clause in the select statement used in  then those group by clause will be the distribution keys. 

5) If there is no good choice for distribution key, then select distribution random instead of not defining any distribution key.

6) Try not to use varchar as a distribution key. Big Int data type or integer data type might be better choice than Varchar. 

We can check distribution of data using performance portal or the netezza administration console, or the explain plan, or nz_skew dba script. 

Explain distribution command gives the data distribution information. For example:

explain distribution select * from employee_table;