Friday, 26 December 2014

How to add only non existing records to Hive table and implement NOT IN operation?

Implementing left outer join and  NOT IN operation in Hadoop Hive

&

How to add only non existing records to Hive table?

a) Now lets start with how to do a left outer join?.


Consider you have two tables table1 and table2. Doing a normal left outer join is simple. The below syntax should do it:

select * from table2 left outer join table1 on table1.dates=table2.dates 


b) How to implement a not in operation in Hive?


Consider you have two tables table1 and table2. To implement a not in operation you have to do a left outer join and then apply a filter.

In normal SQL, if it would look like:

select * from table2 where table2.dates not in ( select dates from table1)

In apache hive SQL, it would look like:

select * from table2 left outer join table1 on table1.dates=table2.dates where table2.dates is null 


 c)  How to add only non existing records to Hive table?


 Now lets say you have a scenario where you have to check if a record exists in your  table and only insert those records that do not exist in your hive table then you have to take a two step approach.

1) Create a landing table which is same as the target table to which you have to insert the records. Insert all your new records to this landing table. You can truncate this landing table and insert the new records to this landing table. Lets call this table as LND_TABLE.

b) Now lets say your target table is TGT_TABLE to which you want to insert only the non existing records. You would use the syntax below to insert non existing records from the LND_TABLE. Here dates is the column that is used to identify existing records.


INSERT INTO TGT_TABLE select * from LND_TABLE left outer join TGT_TABLE on LND_TABLE.dates=TGT_TABLE.dates where TGT_TABLE.dates is null