Saturday 24 October 2015

Limitations of Hive mode in Informatica Big Data Edition (Informatica 9.6.1) hotfix 1

Informatica BDE in Hive mode has some limitations compared to the normal PowerCenter mode, called native mode, mainly because in Hive mode the whole mapping logic is converted to Hive SQL. Hive itself has limitations, such as no support for updates, and these are reflected in Informatica BDE in Hive mode. Below is the list of limitations of Informatica BDE in Hive mode:

1) There is no support for unconnected lookups, sequence generators or sorters in Hive mode. The reason is that Hive mode does not support statefulness, i.e. it cannot keep track of values from previous records. Write properties such as insert, update and delete are ignored, and it is always a truncate and load.

2) For the same reason as above, it does not support stateful variables, i.e. you cannot have variables that keep values from previous records.

3) Some Hive versions do not support the timestamp and date datatypes. Hence, if you are working with those Hive versions, you cannot use timestamp or date datatypes.

4) Functions such as abort, error, md5, crc32, max, first, last, aesdecrypt, aesencrypt, etc. are either not valid in the Hive environment or have some limitations.

5) Pre-SQL and post-SQL do not work in Hive mode.

6) Datatypes such as arrays, maps and unions might be converted to strings in the Hive data object.

7) Mapping variables and parameters are not supported.

8) The mappings always run in low precision mode.

 

Even with all these limitations, BDE is still a very useful tool for building mappings on the Hadoop environment and reduces development time to a great extent. Developers just need to be a bit innovative when they work around these limitations. One has to be aware that all the ETL logic gets converted to Hive SQL, so whatever cannot be converted to Hive SQL cannot be supported by this tool.

Friday 2 October 2015

How to use SQL*Plus to connect to Oracle and use it in unix scripts? SQL*Plus connect strings and commands:

SQL*Plus is a command-line tool that comes with the Oracle client and can be used to connect to an Oracle database. It can also be used in scripts to send commands to the database.

The common SQL*Plus connect string is:

sqlplus username/password@servername:port/databasename 

Go to the directory where sqlplus is located and enter the above command with the proper username, password, server name, port and database name, as shown in the connection string.

To find the list of tables in the database, use the following command in sqlplus:

select table_name from user_tables;
 
The rest of the Oracle DML and DDL statements also work in SQL*Plus.
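For example, the statements below can be typed directly at the sqlplus prompt; the table and column names here are made up, purely for illustration:

create table my_test_table (id number, name varchar2(50));
insert into my_test_table values (1, 'first row');
commit;
select * from my_test_table;
drop table my_test_table;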

If you want to use sqlplus in a unix script, then the below sample unix script should help:

#!/usr/bin/ksh
# Run the insert through sqlplus; exit with code 2 if any SQL error occurs
sqlplus username/password@servername:port/databasename << EOF
WHENEVER SQLERROR EXIT 2;
insert into mytable select * from mytemptable;
EOF
RTN=$?
if [ $RTN -ne 0 ]; then
    echo "sqlplus step failed with return code $RTN"
fi




Informatica Java transformation to parse comma separated columns and generate new rows - normalizer limitations

The Informatica normalizer transformation is normally used to convert a column containing multiple values into separate rows, i.e. something like ID, 1,2,3,4 can be converted to something like below.
ID , 1
ID, 2
ID, 3
ID, 4

The problem with the normalizer, however, is that it has to know the number of occurrences. If the number of occurrences is very high, it is hard to create output ports for that many occurrences.

The easy way in these cases is to use a Java transformation to parse the column containing multiple values and generate a separate row for each of them.

The java code is shown below:

// Split the comma separated input value into individual elements
String str_var = INPUT_PORT_NAME;
String[] arr;
String delimiter = ",";
arr = str_var.split(delimiter);
// Emit one output row per element; generateRow() works only in an active Java transformation
for (int i = 0; i < arr.length; i++) {
    INPUT_PORT_NAME = arr[i];
    generateRow();
}


How to update records in Hive for type2 tables?

It is easy to append data in Hive, but it is not easy to update data in a Hive table. In fact, older versions of Hive, i.e. prior to 0.14, do not support update operations at all. In those scenarios, how do you update data in Hive? There is no other way than to rebuild the whole table. This is important if your table is a type 2 table, which contains both the historical records and the changed records.

For example: let's say your main table is tweets_main and all the changed/new records are in the tweets_staging table. How do you update the tweets_main table in Hive?

Answer: Take all the unchanged records from the tweets_main table and all the changed records from the tweets_staging table, and overwrite the tweets_main table. The step below should work. Let's say ID is the field that is the primary key in both tables.


insert overwrite table tweets_main
select u.* from (
  select t1.* from tweets_main t1 left outer join tweets_staging t2 on t1.ID = t2.ID where t2.ID is null
  union all
  select s.* from tweets_staging s
) u;

So what the above step has done is take all the records from the main table that have not changed, union them with all the changed records from the change table tweets_staging, and reinsert them back into the tweets_main table. If you cannot do it in one step, then put all the records from the select statement into a temp table and then overwrite the main table, as sketched below.
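The two-step variant could look something like this; the intermediate table name tweets_merged is made up purely for illustration.

create table tweets_merged as
select u.* from (
  select t1.* from tweets_main t1 left outer join tweets_staging t2 on t1.ID = t2.ID where t2.ID is null
  union all
  select s.* from tweets_staging s
) u;

insert overwrite table tweets_main select * from tweets_merged;

drop table tweets_merged;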

The above steps can also be used for partitioned tables in Hive, except that it is a bit more complicated since you want to overwrite only the impacted partitions. You will have to find out all the impacted partitions, take the records from the impacted partitions that have not changed, union them with the change records, and then overwrite those partitions again in Hive, as in the sketch below. Check the next article on how to dynamically partition tables in Hive.
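A rough sketch of that partition level rebuild is shown below. It assumes tweets_main is partitioned by a string column called load_date, that tweets_staging has the same columns with load_date as its last column, and that your Hive version supports IN subqueries in the where clause; adjust the names and the filter to your own tables.

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table tweets_main partition (load_date)
select u.* from (
  -- unchanged records, restricted to the partitions touched by the staging data
  select t1.* from tweets_main t1
  left outer join tweets_staging t2 on t1.ID = t2.ID
  where t2.ID is null
  and t1.load_date in (select distinct load_date from tweets_staging)
  union all
  -- new and changed records
  select s.* from tweets_staging s
) u;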



Hadoop: How to dynamically partition table in Hive and insert data into partitioned table for better query performance?

Partitioning in Hive, just like in any database, allows for better query performance, since only sections of the data are read instead of the complete table. Hive allows single or multiple columns to be used for partitioning data, and it supports both dynamic and static partitioning of tables. In Hive, since data is stored as files on HDFS, whenever you partition a table it creates sub-directories using the partition keys. For example, if date and batch key are the partition keys, then it creates a directory for each date and then a batch key sub-directory within that date directory. This allows data to be read for a certain date and batch key.

In dynamic partitioning in Hive, the data is inserted into the respective partitions dynamically, without you having to explicitly create the partitions. Do not use columns with high cardinality as partition keys, since that will create many sub-directories and deteriorate performance.
The example below shows how this is done:
 
/* Below is the DDL for creating the main table into which you want to insert data.
It is partitioned by date and batchid. The partition key columns have to come at the end of the select list when the table is loaded. */
  
create table IF NOT EXISTS mytable ( id int, age smallint, name string, joining_date timestamp, location string ) partitioned by (roweffectivedate string, batchid int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS SEQUENCEFILE;
 
/* Below is the DDL for the temp Hive table from which you want to insert data into the main table */
  
create table IF NOT EXISTS temp_mytable ( id int, age smallint, name string, joining_date timestamp, location string, roweffectivedate string, batchid int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS SEQUENCEFILE;

/* Load data from a local file to the temp hive table */ 

load data local inpath '/etc/home/mydirectory/test.txt' overwrite into table temp_mytable;

/* The below set statements enable dynamic partitioning and allow all the partitions to be dynamic in Hive */
  
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

/* Now insert data from the temp hive table to the main hive table that stores the data by date and batch id */

insert overwrite table mytable partition(roweffectivedate, batchid) select * from temp_mytable;

If you want to manually add partitions to Hive, then use an alter statement such as the one below:
  
ALTER TABLE mytable ADD PARTITION (roweffectivedate='2014-03-05', batchid=10) location '/usr/datawarehouse/hive/db/mytable/roweffectivedate=2014-03-05/batchid=10';
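To confirm the partitions were created as expected, something like the below can be used; the partition values are just the ones from the example above.

show partitions mytable;

select * from mytable where roweffectivedate='2014-03-05' and batchid=10 limit 10;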