Friday, 5 December 2014

How to evaluate a query performance in Amazon Redshift - Explain Plan and Understanding Performance

How to evaluate a query performance in Amazon Redshift - Explain Plan and Understanding Performance?

To understand how your query is performing, use explain plan and check the disk/cpu usage of your query along with lot of other parameter.  Explain command can be used to get the cost of the query and the execution plan. The execution plan will show those parts of the query that are very costly and needs to be tuned. In the below query, you can see just by looking at the numbers beside the cost the first two outer joins are very expensive and the last inner join is not that costly. It is not only because of the outer join, but because of the amount of data distribution that is happening in the query and the number of rows each of those joins are retrieving. The below query is retrieving 8 billion records in each of those joins. Also the data distribution is high.

explain select * from table1  a11 outer join table2 a12 on (a11.EMPLOYEE_KEY = a12.EMPLOYEE_KEY) outer join table3 a13 on ( = ) join table3 a14 on (a11.dept_key = a14.dept_key)

.>XN Hash Left Join DS_BCAST_INNER  (cost=683864529053.53..9912121211212999999999999996733616880411.00 rows=8226899613 Width=1388)                                                                                                                                  ,"Hash Cond: ("outer"".employee_key="inner".employee_key)
   ->  XN Hash Left Join DS_DIST_BOTH  (cost=    683646717104.93..999999121211121999999999967336168804116 rows=8226899613 width=1372)
              ,"Hash Cond: ("outer"".name="inner".name)
       ->  XN Hash Left Join DS_DIST_NONE  (cost= 346363.65 rows=822689 width=1372)
                                  ,"Hash Cond: ("outer"".dept_key="inner".dept_key)

Data is distributed among various computing nodes in amazon redshift and the amount of data movement between nodes plays a large part in query performance. In  your query, you see a lot of DS_DIST_NONE in your query then you have least amount of data distribution and if you have other keywords  like DS_DIST_BOTH or DS_BCAST_INNER then there are more data redistribution happening and the performance can be bad.

The meaning of Explain plan attributes as per amazon redshift website is below:

DS_BCAST_INNER- means that broadcast a copy of the entire inner table to all compute nodes.
DS_DIST_ALL_NONE-No redistribution is required because the inner table was distributed to every node using DISTSTYLE ALL
DS_DIST_NONE- No tables are redistributed: collocated joins are possible because corresponding slices are joined without moving data between nodes.
DS_DIST_INNER-    The inner table is redistributed.
DS_DIST_ALL_INNER -    The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL
DS_DIST_BOTH-  Both tables are redistributed.

The disk usage for the above query was looking like below. The disk usage can be seen from the amazon redshift web console in the performance tab. The disk space usage was reaching close to 100% and this is not good news at all since that single query is consuming all the disk space. This is indication of poor query performance. This query was run on a dw2_8xLarge cluster which is a large node type with relatively high disk space.
Also, you can see the CPU usage from the amazon redshift web console

Also, another way you can check how your query is performing is to check if there are lot of writes to the disk using query such as below: 2044 below is the query id of the query.

select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = 2044 order by workmem desc;