Friday, 2 October 2015

How to update records in Hive for type2 tables?

How to update records in Hive for type2 tables?

It is easy in Hive to append data but it is not easy to update data in any hive table. In fact older versions of hive i.e. prior to 0.12 does not support update operations at all. In those scenarios, how do you update data in Hive?. There is no other way than to rebuild the whole table.  This is important if your table is a type2 table which contains both the historical record and the changed record.

For example: Lets say your main table is tweets_main and all the changed/new  records are in tweets_staging table. How do you update the tweets_main table in hive?

Answer: Take all the unchanged records from the tweets_main table and all the changed records from the tweets_staging table and overwrite the tweets_main table. The below step should work. Lets say ID is the field that is the primary key in both the tables.

insert overwrite table tweets_main as  select * from tweets_main t1 left outer join tweets_staging t2 on t1.ID=t2.ID where t2.ID is null UNION ALL select * from tweets_staging;

So what the above step has done is take all the records from main table that have not changed and then union with all the change records from the change table tweets_staging and reinsert back into the tweets_main table. If you cannot create it in one step then put all the records from the select statement into temp table and then overwrite the main table.

The above steps can also be used for partitioned tables in hive except that is a bit more complicated since you want to overwrite only the impacted partitions in hive. You will have to find out all the impacted partitions, then take the records from the impacted partitions that have not changed, union it will the change records and then overwrite the partitions again in Hive. Check the below article on how to dynamically partitions the table.