Tuesday, 25 November 2014

How to load a local data file into Apache Hive? An example of loading tweets into Hive.

What is Apache Hive?
Apache Hive is a data warehouse system originally built by Facebook that runs on top of Hadoop/HDFS: the actual data is stored as files on HDFS, and Hive is used to query and manage that data. Hive keeps metadata such as table structures and definitions in its metastore database, which is usually Derby or MySQL. You can run SQL-like queries, called HiveQL, against the Hive warehouse just as you would in any other SQL-based data warehouse.

In the example below, a sample tweets file is loaded into hive.

Step1: Download the sample file of Sachin Tendulkar tweets and save it in a local folder on your Unix system. Let's say you saved it in /home/desktop. The sample tweet file captures only three fields: the screen name, the tweet text, and a field indicating whether the tweet was retweeted.

The sample tweets file can be found at the link below. Copy and save the file as sampletweets.txt in /home/desktop. If you save it in another folder, change the folder path in Step3 accordingly.
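Since the contents of the sample file aren't reproduced here, the snippet below creates a small illustrative sampletweets.txt with the same three pipe-delimited fields (screen name, tweet text, retweeted flag). The tweet lines are made up for demonstration; use the real downloaded file in practice.

```shell
# Create an illustrative pipe-delimited sample file.
# These tweet lines are made up; replace this file with the
# real sample tweets file from the link above.
cat > sampletweets.txt <<'EOF'
sachin_rt|Great innings today!|true
cricketfan99|Watching the master at work|false
tendulkar_fan|100 centuries, what a legend|true
EOF

# Each line should have exactly three pipe-delimited fields.
awk -F'|' '{ print NF }' sampletweets.txt
# prints 3 for each line
```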


Step2: Create a table in Hive using the sample DDL below. The DDL covers the three fields present in the sample file; if your file has more fields, add them to the table definition. The sample file is pipe-delimited. You can execute the SQL below from AQT or from the Hive command-line interface.

CREATE DATABASE IF NOT EXISTS twitter;

USE twitter;

CREATE TABLE IF NOT EXISTS twitter.tweets (
  ScreenName VARCHAR(100),
  Text VARCHAR(300),
  isRetweeted VARCHAR(10)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
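If you prefer to keep the DDL in a script rather than typing it into the Hive CLI, you can save it to a file and run it with `hive -f`. The filename create_tweets.sql below is just an illustrative choice.

```shell
# Save the Step2 DDL into a script file (filename is illustrative).
cat > create_tweets.sql <<'EOF'
CREATE DATABASE IF NOT EXISTS twitter;
USE twitter;
CREATE TABLE IF NOT EXISTS twitter.tweets (
  ScreenName VARCHAR(100),
  Text VARCHAR(300),
  isRetweeted VARCHAR(10)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
EOF

# Then execute it (requires a working Hive installation):
# hive -f create_tweets.sql
```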

Step3: Execute the Hive command below to load the sample Sachin Tendulkar tweets file into the table created in Step2.

hive -e "use twitter; load data local inpath '/home/desktop/sampletweets.txt' into table twitter.tweets;"
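One thing to watch for: Hive's LOAD DATA does no validation at load time, so lines with the wrong number of pipe-delimited fields are loaded anyway, with the missing columns showing up as NULL. A quick awk check before loading catches such rows. The demo file and its lines below are made up for illustration; run the awk command against /home/desktop/sampletweets.txt in practice.

```shell
# Flag lines that don't have exactly three pipe-delimited fields.
# LOAD DATA won't reject them; bad rows would silently load with
# NULL columns. (This demo file and its lines are made up.)
printf 'user1|hello|true\nbadline_missing_fields\nuser2|world|false\n' > demo_tweets.txt

awk -F'|' 'NF != 3 { print "line " NR ": " NF " field(s)" }' demo_tweets.txt
# prints: line 2: 1 field(s)
```

You can also compare row counts after the load with `hive -e "SELECT COUNT(*) FROM twitter.tweets;"` against `wc -l` on the local file.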