Monday, 17 November 2014

Performance tuning in amazon redshift - Simple tricks (table/query design)

Performance tuning in amazon redshift - Simple tricks

The performance tuning of a query in amazon redshift just like any database depends on how much the query is optimised, the design of the table, distribution key and sort key,  the type of cluster (number of nodes, disk space,etc) which is basically the support hardware of redshift, concurrent queries, number of users, etc. However, the good news is it is not that hard compared to database such as oracle to do performance tuning.Amazon Redshift is a peta byte  scale and massively parallel database and very high performance can be achieved with simple configuration steps. In this blog, we have explained in detailed how to achieve high performance in amazon redshift.

To give an idea about the performance issues we were facing, have a look at the cost of the query and the disk usage that were resulting from the queries we were running on redshift. Explain command can be used to get the cost of the query and the execution plan. The execution plan will show those parts of the query that are very costly and needs to be tuned. In the below query, you can see just by looking at the numbers beside the cost the first two outer joins are very expensive and the last inner join is not that costly. It is not only because of the outer join, but because of the amount of data distribution that is happening in the query.

explain select * from table1  a11 outer join table2 a12 on (a11.EMPLOYEE_KEY = a12.EMPLOYEE_KEY) outer join table3 a13 on ( = ) join table3 a14 on (a11.dept_key = a14.dept_key)

.>XN Hash Left Join DS_BCAST_INNER  (cost=683864529053.53..9912121211212999999999999996733616880411.00 rows=8226899613 Width=1388)                                                                                                                                  ,"Hash Cond: ("outer"".employee_key="inner".employee_key)
   ->  XN Hash Left Join DS_DIST_BOTH  (cost=    683646717104.93..999999121211121999999999967336168804116 rows=8226899613 width=1372)
              ,"Hash Cond: ("outer"".name="inner".name)
       ->  XN Hash Left Join DS_DIST_NONE  (cost= 346363.65 rows=822689 width=1372)
                                  ,"Hash Cond: ("outer"".dept_key="inner".dept_key)

Data is distributed among various computing nodes in amazon redshift and the amount of data movement between nodes plays a large part in query performance. In  your query, you see a lot of DS_DIST_NONE in your query then you have least amount of data distribution and if you have other keywords  like DS_DIST_BOTH or DS_BCAST_INNER then there are more data redistribution happening and the performance can be bad.

The meaning of Explain plan attributes as per amazon redshift website is below:

DS_BCAST_INNER- means that broadcast a copy of the entire inner table to all compute nodes.
DS_DIST_ALL_NONE-No redistribution is required because the inner table was distributed to every node using DISTSTYLE ALL
DS_DIST_NONE- No tables are redistributed: collocated joins are possible because corresponding slices are joined without moving data between nodes.
DS_DIST_INNER-    The inner table is redistributed.
DS_DIST_ALL_INNER -    The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL
DS_DIST_BOTH-  Both tables are redistributed.

The disk usage for the above query was looking like below. The disk usage can be seen from the amazon redshift web console in the performance tab. The disk space usage was reaching close to 100% and this is not good news at all since that single query is consuming all the disk space. This is indication of poor query performance. This query was run on a dw2_8xLarge cluster which is a large node type with relatively high disk space.

Also, another way you can check if your query is performing poorly is check if there are lot of writes to the disk using query such as below: 2044 below is the query id of the query.

select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = 2044 order by workmem desc;


We had to make a lot of changes to fix the performance issues. All the factors that influence the query performance and the required changes in discussed below:

1) Table design (Data distribution and Sorting) - In redshift the data is distributed on all the nodes. Whenever it is executing the query it has to bring the data from the different nodes and use it in joins, aggregation, sorting, group by, etc. If the amount of data that has to grabbed from these nodes is high then it results in lot of traffic and poor query performance. To reduce the amount of traffic between nodes, the data has to be distributed properly between the nodes. This can be achieved through the proper table design, data compression,  distribution key, sort keys, and query joins. First lets looks at the table design. Below is the table create statement. The distribution style (diststyle), distribution key (distkeys), sort keys are very important in defining how the data is distributed among the nodes.

TABLE table_name [ ( column_name1 encode ,  ]
| DISTKEY ( distkey_identifier )
| SORTKEY ( sortkey_identifier [, ... ] )

DISTSTYLE ALL will put a copy of the data in all the nodes and make it available for joins without having to move the data between nodes. This is ideal for dimension tables which does not have many records in it. DISTSTYLE KEY will distribute data based on a defined key column and better for tables that are huge and you can use a column that is used in joins to distribute the data. DISTSTYLE EVEN is used mostly to evenly distribute the data and when you cant deside between ALL or KEY.

*Trick*: USE DISTSTYLE ALL for all the dimension tables that are not big.

DISTKEY (column name)- The column name used here should be the one that is used in joins and should be defined for all the large fact tables. The
distribution key (distkey) will help in distributing the data based on the distkey column and during joins only the required data is brought the nodes. There can be only one distkey (distribution key) column defined.

*Trick*: Define Distkey for all the large fact tables and the dist key column should be the one used in the joins.

If the distribution key from the large fact table is joined to the any column of a dimension table that is in diststyle ALL then in your explain plan you will see the attribute DS_DIST_ALL_NONE which mean there is no data redistribution and the cost of the query is very low.  This is the state you want most part of the query to be in for best performance.

SORT KEY (column name1, ..):  There can be multiple sort keys and the sort keys help a lot in storing the data in the nodes in sorted order which helps in group by or order by operation.

*Trick*: If there are columns such as date or timestamp that you use in order by operation then define those as sort keys.

2) Table design (Compression) - In the amazon redshift tables the encoding that can be used for compression can be defined along with the column name. Format shown below. There are various encodings available.

CREATE TABLE table_name (column_name
            data_type ENCODE encoding-type)[, ...]

*Trick*: Easy way to decide on the encoding if you are using the copy command to load data  from s3to redshift is to use COMPUPDATE ON option set in the copy command as shown below. The copy command choosed the best compression to use for the columns that it is loading data to. 

copy <tablename> from 's3://mybucket/files.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' COMPUPDATE ON

If you want to explicitly define the encoding like when you are inserting data from another table or set of tables, then load some 200K records to the table and use the command ANALYZE COMPRESSION <tablename> to make redshift suggest the best compression for each of the columns. You can use those suggestion while recreating the table.

3) Table design (Constraints) - Defining the primary keys and foreign keys though is not enforced by redshift apparently is used by the query planner to execute the query efficiently. Hence, define those keys in your create table statements for better performance.

4) Query design -
Now we have discussed about table design, the actual query design will all use all those table design features to execute the query. As discussed before the trick in designing good queries is to have proper joins. i.e use distribution key from the large fact table to join to  any column of a dimension table that is in diststyle ALL. If dimension table is not in (distribution style) diststyle all, then use the distkey (
distribution key) from the dimension table in the joins.

5) WLM queue setting:

The WLM queue setting on your cluster configuration determines the number of queries running on your cluster and essential part of performance tuning. Keeping it low means not many queries can run at the same time and provided more memory for each query. Set the concurrency and memory setting for each user group in the cluster for better performance

6) Copy Command Performance tuning:
If you want your copy command to be faster then split the files into multiple files so that they can get loaded in parallel to the redshift database.

All the above discussed steps should help in running the queries more efficiently. If you need more details refer to aws link:

Also check: