Friday, 2 October 2015

Hadoop: How to dynamically partition table in Hive and insert data into partitioned table for better query performance?

Hadoop: How to dynamically partition table in Hive and insert data into partitioned table for better query performance?

Partitioning in Hive just like in any database allows for better query performance since it allows only sections on data to read instead of the complete table. Hive allows single or  multiple columns to be used for partitioning data. It allows dynamic and both static partitioning of tables. In Hive, since data is stored as files on HDFS, whenever you partition the table it creates sub directories using the partition key. For example, if date and batch key are the partition keys , then it creates directories by date and then batch key sub directory within than date directory. This allows to read data by a certain date and batch key.

In dynamic partitioning of hive, the data is inserted into the respective partition dynamically without you having explicitly create the partitions. Do not use columns with high cardinality as partition keys since that will create more subdirectories which will deteriorate the performance.
Below example shows one of the example:
 
/*Below is the ddl for creating the main table for which you want to insert data. 
It is partitioned by date and batchid. The partitions keys apparently has to be at the end of the table */
  
create table IF NOT EXISTS mytable ( id int, age smallint, name string, joining_date timestamp, location string ) partitioned by (roweffectivedate string, batchid int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS SEQUENCEFILE;
 
/*Below is the ddl for the temp hive table from which you want to insert data to the main table*/
  
create table IF NOT EXISTS temp_mytable ( id int, age smallint, name string, joining_date timestamp, location string, roweffectivedate string, batchid int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS SEQUENCEFILE;

/* Load data from a local file to the temp hive table */ 

load data local inpath '/etc/home/mydirectory/test.txt' overwrite into table temp_table;

/* The below set statment allows all the partitions to be dynamic in Hive */
  
set hive.exec.dynamic.partition.mode=nonstrict;

/* Now insert data from the temp hive table to the main hive table that stores the data by date and batch id */ insert overwrite table mytable partition(roweffectivedate string, batchid int) select * from temp_mytable; 

If you want to manually add the partitions to Hive then use the alter statements such as below:
  
ALTER TABLE test ADD PARTITION (roweffectivedate='2014-03-05',batchid='10') location '/usr/datawarehouse/hive/db/mytable/'