
Wednesday, 18 September 2019

Setting for insert/update on a Hive table in Hortonworks



To support insert and update operations, a Hive table has to be set up as a transactional table.

The properties you will need are below:


set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=2;

The Hive table needs to be created as follows:

CREATE TABLE `test.test_zyx1`(
  `id` int, 
  `mm` int, 
  `data_as_of_dt` string, 
  `initial_dt` string )
STORED AS ORC TBLPROPERTIES ("transactional"="true");
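
Once the table is created as a transactional table, insert, update, and delete statements can be run against it. A minimal sketch, assuming a Hive version with full ACID support; the values below are only illustrative:

insert into test.test_zyx1 values (1, 10, '2019-09-18', '2019-09-18');

update test.test_zyx1 set mm = 20 where id = 1;

delete from test.test_zyx1 where id = 1;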

Tuesday, 10 September 2019

Converting string to map data type in Hive




If you have a string with key-value pairs and want to convert it to a map data type, use the str_to_map function.


str_to_map(text, delimiter1, delimiter2) - Creates a map by parsing the text, splitting it into key-value pairs using two delimiters. The first delimiter separates pairs, and the second delimiter separates the key and the value. If only one parameter is given, default delimiters are used: ',' as delimiter1 and '=' as delimiter2.

Example:

with t1 as (
select "k1:v1|k2:v2|k3:v3" as c1
)

select str_to_map(c1, "\\|", ":") as m1, c1 from t1
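
Once the string is converted to a map, a value can be looked up by key directly. A small sketch using the same sample string; the key 'k2' simply comes from the example above:

with t1 as (
select "k1:v1|k2:v2|k3:v3" as c1
)

select str_to_map(c1, "\\|", ":")['k2'] as v2 from t1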



Monday, 9 September 2019

Difference between posexplode and explode in Hive




In the below example, the numbers column has a list of values.


with t1 as (
select 1 as id, 'deepak' as name, '12345, 56789, 7892334, 6263636, 7181818, 1761761717' as numbers )

select * from t1



Explode is used to create multiple rows from a list value in a column. In the below example, the list in the numbers field is broken into multiple rows.


with t1 as (
select 1 as id, 'deepak' as name, '12345, 56789, 7892334, 6263636, 7181818, 1761761717' as numbers )

select * from t1
LATERAL VIEW explode(split(numbers, ',')) test


Posexplode is used to create multiple rows from a list value in a column. In the below example, the list in the numbers field is broken into multiple rows along with a column called pos that contains the position of each value in the list.

with t1 as (
select 1 as id, 'deepak' as name, '12345, 56789, 7892334, 6263636, 7181818, 1761761717' as numbers )

select * from t1
LATERAL VIEW posexplode(split(numbers, ',')) test
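
If you want explicit names for the position and value columns, you can alias them in the lateral view. A minimal sketch based on the same example; the aliases pos and num are illustrative:

with t1 as (
select 1 as id, 'deepak' as name, '12345, 56789, 7892334, 6263636' as numbers )

select id, name, pos, num from t1
LATERAL VIEW posexplode(split(numbers, ',')) test AS pos, num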


Thursday, 15 November 2018

How to make Hive queries run faster?

The options available are

1) Use ORC file format which provides better performance than Text.

2) Use Tez execution engine

3) Use cost-based optimization (CBO)

4) Use vectorized query execution

5) Write sensible queries that avoid unnecessary joins. (A sketch of the session settings for options 2 to 4 follows below.)
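
Options 2 to 4 are typically enabled through session-level set statements before running the query. A rough sketch is below; exact property names and defaults vary between Hive versions, so treat it as indicative rather than definitive:

set hive.execution.engine=tez;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;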




Friday, 8 July 2016

Performance tuning of Informatica Big Data Edition Mapping




Below is a list of performance tuning steps that can be done in Informatica Big Data Edition:


1) Use a lookup transformation only when the lookup table is small. Lookup data is copied to each node, and hence lookups on large tables are slow.


2) Use joiners instead of lookups for large data sets.


3) Join large data sets before small datasets. Reduce the number of times the large datasets are joined in Informatica BDE.


4) Since Hadoop does not allow updates, you will have to rebuild the target table whenever a record is updated in the target table. Instead of rebuilding the whole table, consider rebuilding only the impacted partitions.


5) Hive is slower with non-string data types, because it needs to create temp tables to convert between the non-string and string data types. Use non-string data types only when required.


6) Use a data type precision close to the actual data. Using higher precision slows down the performance of Informatica BDE.


7) Map only the ports that are required in the mapping transformations or loaded to the target. Fewer ports mean better performance and fewer data reads.


Saturday, 24 October 2015

Limitation of Hive mode in Informatica BDE (Informatica 9.6.1) version hotfix 1


Informatica BDE in Hive mode has some limitations compared to the normal PowerCenter mode, called native mode, basically because in Hive mode the whole mapping logic is converted to Hive SQL. Hive has some limitations, such as no support for updates, that get reflected in Informatica BDE in Hive mode. Below is the list of limitations of Informatica BDE in Hive mode:

1) There is no support for unconnected lookups, sequence generators, or sorters in Hive mode. The reason is that Hive mode does not support statefulness, i.e. it can't keep track of values from previous records. Write properties such as insert, update, and delete are ignored, and it is always a truncate and load.

2) For the same reason as above, it does not support stateful variables, i.e. you cannot have variables that keep values from previous records.

3) Some Hive versions do not support the timestamp and date data types. Hence, if you are working with those Hive versions, you cannot use timestamp or date data types.

4) Functions such as abort, error, md5, crc32, max, first, last, aesdecrypt, aesencrypt, etc. are not valid in the Hive environment or have some limitations.

5) Pre-SQL and post-SQL do not work in Hive mode.

6) Data types such as arrays, maps, and unions might be converted to strings in the Hive data object.

7) Mapping variables and parameters are not supported.

8) The mappings always run in low precision mode.

 

Even with all these limitations, BDE is still a very useful tool for building mappings in a Hadoop environment and reduces development time to a great extent. The developers just need to be a bit innovative when they work around these limitations. One has to be aware that all the ETL logic gets converted to Hive SQL, so whatever cannot be converted to Hive SQL cannot be supported by this tool.

Friday, 2 October 2015

How to update records in Hive for type2 tables?


It is easy to append data in Hive, but it is not easy to update data in a Hive table. In fact, older versions of Hive, i.e. prior to 0.12, do not support update operations at all. In those scenarios, how do you update data in Hive? There is no other way than to rebuild the whole table. This is important if your table is a type2 table, which contains both the historical records and the changed records.

For example: let's say your main table is tweets_main and all the changed/new records are in the tweets_staging table. How do you update the tweets_main table in Hive?

Answer: Take all the unchanged records from the tweets_main table and all the changed records from the tweets_staging table and overwrite the tweets_main table. The below step should work. Let's say ID is the field that is the primary key in both tables.


insert overwrite table tweets_main select u.* from (select t1.* from tweets_main t1 left outer join tweets_staging t2 on t1.ID=t2.ID where t2.ID is null union all select s.* from tweets_staging s) u;

So what the above step has done is take all the records from the main table that have not changed, union them with all the changed records from the change table tweets_staging, and reinsert them back into the tweets_main table. If you cannot do it in one step, then put all the records from the select statement into a temp table and then overwrite the main table, as sketched below.
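
A minimal sketch of that two-step approach; the table name tweets_tmp is only illustrative:

create table tweets_tmp as
select u.* from (
select t1.* from tweets_main t1 left outer join tweets_staging t2 on t1.ID=t2.ID where t2.ID is null
union all
select s.* from tweets_staging s
) u;

insert overwrite table tweets_main select * from tweets_tmp;

drop table tweets_tmp;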

The above steps can also be used for partitioned tables in Hive, except that it is a bit more complicated, since you want to overwrite only the impacted partitions in Hive. You will have to find out all the impacted partitions, take the records from the impacted partitions that have not changed, union them with the changed records, and then overwrite those partitions again in Hive; a sketch follows below. Check the below article on how to dynamically partition a table.
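
As a rough sketch for the partitioned case: a dynamic partition overwrite only replaces the partitions that appear in the select output, so restricting the unchanged-record side to the impacted partitions limits the rebuild. The partition column dt and the partition values below are assumptions, and the partition column has to be the last column in both tables:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table tweets_main partition (dt)
select u.* from (
select t1.* from tweets_main t1 left outer join tweets_staging t2 on t1.ID=t2.ID
where t2.ID is null and t1.dt in ('2015-10-01', '2015-10-02')
union all
select s.* from tweets_staging s
) u;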



Hadoop: How to dynamically partition table in Hive and insert data into partitioned table for better query performance?


Partitioning in Hive, just like in any database, allows for better query performance since it allows only sections of data to be read instead of the complete table. Hive allows single or multiple columns to be used for partitioning data, and it allows both dynamic and static partitioning of tables. In Hive, since data is stored as files on HDFS, whenever you partition the table it creates subdirectories using the partition key. For example, if date and batch key are the partition keys, then it creates directories by date and then a batch key subdirectory within that date directory. This allows data to be read by a certain date and batch key.

In dynamic partitioning in Hive, the data is inserted into the respective partition dynamically without you having to explicitly create the partitions. Do not use columns with high cardinality as partition keys, since that will create many subdirectories and deteriorate performance.
The example below shows how this works:
 
/* Below is the ddl for creating the main table into which you want to insert data. 
It is partitioned by date and batchid. The partition keys have to be at the end of the table */
  
create table IF NOT EXISTS mytable ( id int, age smallint, name string, joining_date timestamp, location string ) partitioned by (roweffectivedate string, batchid int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS SEQUENCEFILE;
 
/*Below is the ddl for the temp hive table from which you want to insert data to the main table*/
  
create table IF NOT EXISTS temp_mytable ( id int, age smallint, name string, joining_date timestamp, location string, roweffectivedate string, batchid int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS SEQUENCEFILE;

/* Load data from a local file to the temp hive table */ 

load data local inpath '/etc/home/mydirectory/test.txt' overwrite into table temp_mytable;

/* The below set statement allows all the partitions to be dynamic in Hive */
  
set hive.exec.dynamic.partition.mode=nonstrict;

/* Now insert data from the temp hive table to the main hive table that stores the data by date and batch id */

insert overwrite table mytable partition(roweffectivedate, batchid) select * from temp_mytable;

If you want to manually add the partitions to Hive then use the alter statements such as below:
  
ALTER TABLE mytable ADD PARTITION (roweffectivedate='2014-03-05', batchid='10') location '/usr/datawarehouse/hive/db/mytable/';
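
After inserting data or adding partitions, it is easy to check which partitions exist and read one of them back; a small sketch against the example table above:

show partitions mytable;

select * from mytable where roweffectivedate='2014-03-05' and batchid=10 limit 10;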

Friday, 18 September 2015

Creating a table in Parquet, Sequence, RCFILE and TextFile format and enabling compression in Hive



My intention was to write an article on the different file formats in Hive, but I happened to notice an article already posted.
 
We are using Parquet these days mainly because of the compression options and performance with large tables in Hive. Parquet is optimized to work with large data sets and provides good performance when doing aggregation functions such as max or sum.

If you prefer to see the HDFS file in clear text format, then you need to store the file in textfile format. This takes more space than the binary formats used by sequencefile and rcfile.


Below are examples for creating a table in Parquet, Sequence, RCfile, and TextFile formats. Location is optional.

create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS parquet location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS sequencefile location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS rcfile location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS textfile location 'hdfspathname';


The default storage format is a TEXTFILE delimited by the Ctrl-A ('\001') character, unless you specify a different delimiter.

Enabling compression in a Hive or Impala table:


Apart from the storage formats supported in Hive, the data can also be compressed using codecs such as LZO, Snappy, etc. To enable compression in Hive, we use set statements such as the ones below. The example below is for LZO compression; a Snappy variant is sketched after it. After the set statements, you can use insert statements to insert data into a table and the data will be compressed as per the set statements.



set mapreduce.output.fileoutputformat.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

insert overwrite table tablename partition(dateid) select * from table2;
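
A similar sketch for Snappy, assuming the Snappy codec is available on the cluster; note that the older mapred.* and newer mapreduce.* property names differ slightly between Hadoop versions:

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

insert overwrite table tablename partition(dateid) select * from table2;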




Thursday, 6 August 2015

What is Informatica big data edition (BDE) ?



Informatica Big Data Edition (BDE) is a product from Informatica Corp that can be used like an ETL tool for working in a Hadoop environment along with traditional RDBMS tools. There are now a lot of ETL products in the market that make it easier to integrate with Hadoop, to name a few: Talend, Pentaho, etc. Informatica is one of the leading ETL tool vendors, and the Informatica PowerCenter tool has been famous as an ETL tool for many years. Traditionally this tool was used to extract, transform, and load data into traditional databases such as Oracle, SQL Server, and Netezza, to name a few. With the advent of Hadoop for storing petabyte volumes of data, building ETL tools that can work with Hadoop became more important. It requires a lot of hand coding and knowledge to work directly with Hadoop and build MapReduce jobs. Hadoop tools such as Hive made it easier to write SQL queries on top of a Hive database and convert them to MapReduce jobs. Hence, a lot of companies started using Hive as a data warehouse tool, storing data in Hadoop just like traditional databases and writing queries on Hive. How do we now extract, transform, and load the data in Hadoop? That's where Informatica BDE comes into the picture. It is a tool that you can use for ETL or ELT on Hadoop infrastructure. Informatica BDE can run in two modes: native mode and Hive mode.

In native mode, it runs as normal PowerCenter, but in Hive mode you can push the whole mapping logic down to Hive and make it run on the Hadoop cluster, thereby using the parallelism provided by Hadoop. However, there are some limitations when running in Hive mode, but that is more because of the limitations of Hive itself. For example, Hive does not allow updates in older versions.

With Informatica BDE you can do the following at a very high level:

a) Just like any ETL tool, you can extract, transform, and load between traditional RDBMS or Hive/HDFS sources and targets.
b) Push the whole ETL logic to the Hadoop cluster and make use of the MapReduce framework. Basically, it makes building Hadoop jobs easier.
c) Makes it easy to create connections to all the different sources and integrate data from those sources.
d) Makes it easier to ingest complex files such as JSON, XML, COBOL, Avro, Parquet, etc.


Informatica BDE uses the Informatica Developer interface to build the mappings, deploy them, and create applications. Anyone who has used IDQ before might be familiar with the Informatica Developer interface. Informatica BDE can be found in Informatica versions 9.6 onwards.


Friday, 26 December 2014

Big Data - Good Books for Hadoop, Hive, Pig, Impala, Hbase.

1) Hadoop: The Definitive Guide


 

Hadoop: The Definitive Guide: Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).
  • Store large datasets with the Hadoop Distributed File System (HDFS)
  • Run distributed computations with MapReduce
  • Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
  • Discover common pitfalls and advanced features for writing real-world MapReduce programs
  • Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
  • Load data from relational databases into HDFS, using Sqoop
  • Perform large-scale data processing with the Pig query language
  • Analyze datasets with Hive, Hadoop’s data warehousing system
  • Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems

2)  Programming Hive



Programming Hive: Need to move a relational database application to Hadoop? This comprehensive guide introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem.
This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem. You’ll also find real-world case studies that describe how companies have used Hive to solve unique problems involving petabytes of data.
  • Use Hive to create, alter, and drop databases, tables, views, functions, and indexes
  • Customize data formats and storage options, from files to external databases
  • Load and extract data from tables—and use queries, grouping, filtering, joining, and other conventional query methods
  • Gain best practices for creating user defined functions (UDFs)
  • Learn Hive patterns you should use and anti-patterns you should avoid
  • Integrate Hive with other data processing programs
  • Use storage handlers for NoSQL databases and other datastores
  • Learn the pros and cons of running Hive on Amazon’s Elastic MapReduce

  3) Programming Pig




Programming Pig: This guide is an ideal learning tool and reference for Apache Pig, the open source engine for executing parallel data flows on Hadoop. With Pig, you can batch-process data without having to create a full-fledged application—making it easy for you to experiment with new datasets.

Programming Pig introduces new users to Pig, and provides experienced users with comprehensive coverage on key features such as the Pig Latin scripting language, the Grunt shell, and User Defined Functions (UDFs) for extending Pig. If you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig.
  • Delve into Pig’s data model, including scalar and complex data types
  • Write Pig Latin scripts to sort, group, join, project, and filter your data
  • Use Grunt to work with the Hadoop Distributed File System (HDFS)
  • Build complex data processing pipelines with Pig’s macros and modularity features
  • Embed Pig Latin in Python for iterative processing and other advanced tasks
  • Create your own load and store functions to handle data formats and storage mechanisms
  • Get performance tips for running scripts on Hadoop clusters in less time

4) HBase: The Definitive Guide 
 


HBase: The Definitive Guide: If you're looking for a scalable storage solution to accommodate a virtually endless amount of data, this book shows you how Apache HBase can fulfill your needs. As the open source implementation of Google's BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. Many IT executives are asking pointed questions about HBase. This book provides meaningful answers, whether you’re evaluating this non-relational database or planning to put it into practice right away.
  • Discover how tight integration with Hadoop makes scalability with HBase easier
  • Distribute large datasets across an inexpensive cluster of commodity servers
  • Access HBase with native Java clients, or with gateway servers providing REST, Avro, or Thrift APIs
  • Get details on HBase’s architecture, including the storage format, write-ahead log, background processes, and more
  • Integrate HBase with Hadoop's MapReduce framework for massively parallelized data processing jobs
  • Learn how to tune clusters, design schemas, copy tables, import bulk data, decommission nodes, and many other tasks

5) Getting Started with Impala: Interactive SQL for Apache Hadoop
 

Getting Started with Impala: Interactive SQL for Apache Hadoop: Learn how to write, tune, and port SQL queries and other statements for a Big Data environment, using Impala—the massively parallel processing SQL query engine for Apache Hadoop. The best practices in this practical guide help you design database schemas that not only interoperate with other Hadoop components, and are convenient for administrators to manage and monitor, but also accommodate future expansion in data size and evolution of software capabilities.
Ideal for database developers and business analysts, Getting Started with Impala includes advice from Cloudera’s development team, as well as insights from its consulting engagements with customers.
  • Learn how Impala integrates with a wide range of Hadoop components
  • Attain high performance and scalability for huge data sets on production clusters
  • Explore common developer tasks, such as porting code to Impala and optimizing performance
  • Use tutorials for working with billion-row tables, date- and time-based values, and other techniques
  • Learn how to transition from rigid schemas to a flexible model that evolves as needs change
  • Take a deep dive into joins and the roles of statistics




How to add only non-existing records to a Hive table and implement the NOT IN operation?

Implementing left outer join and NOT IN operation in Hadoop Hive

&

How to add only non-existing records to a Hive table?

a) Let's start with how to do a left outer join.


Consider you have two tables table1 and table2. Doing a normal left outer join is simple. The below syntax should do it:

select * from table2 left outer join table1 on table1.dates=table2.dates 


b) How to implement a not in operation in Hive?


Consider you have two tables table1 and table2. To implement a not in operation you have to do a left outer join and then apply a filter.

In normal SQL, it would look like:

select * from table2 where table2.dates not in ( select dates from table1)

In Apache Hive SQL, it would look like:

select * from table2 left outer join table1 on table1.dates=table2.dates where table1.dates is null 


c) How to add only non-existing records to a Hive table?


Now let's say you have a scenario where you have to check if a record exists in your Hive table and only insert those records that do not exist. You then have to take a two-step approach.

1) Create a landing table which is the same as the target table into which you have to insert the records. Insert all your new records into this landing table; you can truncate it first and then insert the new records. Let's call this table LND_TABLE.

2) Now let's say your target table is TGT_TABLE, into which you want to insert only the non-existing records. You would use the syntax below to insert the non-existing records from LND_TABLE. Here dates is the column that is used to identify existing records.


INSERT INTO TABLE TGT_TABLE select LND_TABLE.* from LND_TABLE left outer join TGT_TABLE on LND_TABLE.dates=TGT_TABLE.dates where TGT_TABLE.dates is null;


Friday, 19 December 2014

How to select between Hadoop vs Netezza vs Redshift- comparison?


A common question that we hear is: among Hadoop, Netezza, and Redshift, which one is better and which one is going to replace everything else? All these technologies have their own pros and cons, and it is essential to know them before we try to find ideal use cases for them and decide how to use them in your ecosystem.

Hadoop:


Pros:
  • As we know, Hadoop is mainly used for big data purposes, i.e. purposes where huge volumes are involved, often petabyte scale or more.
  • The data can be unstructured, structured, or semi-structured.
  • The data in Hadoop can be stored on commodity hardware, and the cost of storing and processing it is low considering the volume of data it can handle. Now let's say we are receiving a huge number of XML files and flat files from a huge number of sources (for example, sensors from airplanes, logs from firewalls/IDS systems, tweets from Twitter, etc.) that need to be loaded at a rapid pace and processed; then Hadoop is an ideal candidate.
  • Hadoop clusters are highly scalable and can be scaled horizontally or vertically by adding more commodity hardware and nodes that are not expensive.
  • Hadoop clusters can also be made to handle streaming data, collect data from sensors, store the streaming data, play it back, etc.
  • Data redundancy is built into the cluster.

Cons:
  • The point to be noted here is that files stored on HDFS are in flat-file format, and though you can use Hive or Impala to process these files as structured data, there are some limitations to it. Hive SQL does not offer the full functionality found in traditional databases.
  • Updating records will be tough, and you will have to use a tool like HBase to get this update functionality.
  • Response times of Hadoop Hive queries are not going to be as fast as Netezza, because in the background these queries run MapReduce jobs that are not as fast as Netezza. Also, they are meant for huge volumes and used in places where response times are not that important, i.e. for batch jobs.
  • Administration of the cluster is not as simple as administration of Netezza.
Conclusion:

  • If you are doing ELT (Extract Load Transform) with huge data loads that can be stored as flat files on HDFS, then Hadoop is a good candidate. You can load all the files and then process or transform them.
  • Volume and scalability are also deciding factors, and Hadoop is ideal for big data purposes.

Netezza:


Pros:
  • It is a traditional data warehouse appliance that supports traditional SQL statements, i.e. all DML and DDL statements. Inserts and updates are all possible.
  • Very fast and made for gigabyte-scale data warehousing purposes. Depending on the number of SPUs, it can handle gigabytes of data and process it at super-fast speed. The data is distributed over multiple SPUs and they all work together to crunch the data.
  • Administration overhead is much less compared to Oracle, DB2, and other databases.
Cons:
  • Mainly meant for structured data warehouse applications.
  • Probably cannot handle unstructured data of petabyte volumes as effectively as Hadoop.

Conclusion:

If you are building a traditional data warehouse where you also need to do ETL (Extract - Transform - Load), then Netezza is better, since you can insert, update, aggregate, etc. with ease and write your SQL for reporting and building cubes.

Redshift:


Pros:
  • Petabyte-scale computing with an on-demand cost model.
  • Traditional data warehousing and SQL are supported.
  • Can handle volumes even greater than Netezza, and is built on the same model as Netezza.
  • On the cloud, with minimal administration required.
  • Easily scalable; nodes (computing power) and space can be added with a few mouse clicks.
  • Easy to create data snapshots and bring the Redshift cluster up and down as and when required.

Cons:
  • It is on the cloud, and data has to be transported to the cloud Redshift cluster.
  • Reading huge volumes over an ODBC/JDBC connection is still a problem.
  • If there is a lot of back-and-forth movement between the cluster and the local network, then it creates some latency.
  • You have to get security clearance to get data out of the network.
Conclusion:
  • Good for ELT purposes where data is loaded to the Redshift cluster once, processing is done on the cluster, and aggregate/summarized data is read from the cluster, or where other downstream applications source the data directly from the Redshift cluster.
  • Good for traditional data warehousing purposes.