Friday 11 December 2015

SQL override in Source qualifier and lookup override in Informatica

SQL override in Source qualifiers and lookups in Informatica



SQL override is commonly used by ETL developers in Source Qualifier and Lookup transformations. To override the SQL in a source qualifier, generate the SQL from the Properties tab of the source qualifier and make the necessary modifications to it. Keep in mind that the columns in the SELECT statement must match the order of the ports in the source qualifier, and the ports have to be mapped to the transformation next to the source qualifier. The diagram below shows a simple SQL override. SQL can be overridden in lookups too.
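For illustration, a minimal sketch of an overridden source qualifier query is below (the table and column names are made up for this example); note that the column order in the SELECT matches the port order in the source qualifier:

SELECT E.EMP_ID,
       E.EMP_NAME,
       D.DEPT_NAME
FROM   EMPLOYEE E
JOIN   DEPARTMENT D ON E.DEPT_ID = D.DEPT_ID
WHERE  E.STATUS = 'ACTIVE'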



The advantages of SQL override in Informatica:

a) You can push a lot of logic into the SQL override. The SQL is executed on the source database and the results are pulled in by Informatica PowerCenter. Instead of reading all the records from the table, a SQL override lets you filter, aggregate, etc. on the source database and pull in only the result set.

b) Similarly, in lookups, instead of reading all the columns of a table, you can read only the required columns by overriding the SQL and removing the unnecessary ports. You can also aggregate, filter, etc. the records that need to be read (see the sketch after this list).

c) If you are comfortable with SQL, you may prefer this route instead of building in transformations a lot of logic that can be achieved with SQL.
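As an example of point b), below is a rough sketch of a lookup SQL override that reads only the required columns and pre-aggregates the rows on the database side (the table and column names are illustrative); only the CUST_ID and LAST_ORDER_DATE ports would then remain in the lookup:

SELECT CUST_ID,
       MAX(ORDER_DATE) AS LAST_ORDER_DATE
FROM   ORDERS
WHERE  ORDER_STATUS = 'COMPLETE'
GROUP  BY CUST_ID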



Thursday 12 November 2015

Indirect file processing in Informatica Cloud - Workaround

Indirect file processing ( Filelist) in Informatica Cloud (ICS)

I had a scenario where I needed to read a list of files (filelist) in Informatica Cloud. My research pointed to a couple of articles on the Informatica community. Apparently there is a template that you can work with, or you need to use a filelist connector. I could not find either, so I had to come up with a different strategy.

The strategy was:
1) Use PowerCenter to create the mapping and a workflow that reads from the filelist and writes to an Oracle target.
2) Export it from the Repository Manager as an XML file.
3) Import it into Informatica Cloud as a PowerCenter task. Map the connections to ICS connections and run it from Informatica Cloud.
 
In Informatica PowerCenter we read a filelist by setting the source file type property in the session to Indirect, as shown in the screenshot below. Using the same strategy, I created my mapping and wrote the records to the Oracle target. Once I had tested the workflow, I exported it to Informatica Cloud and it worked.

When you import into Informatica Cloud you can map the connection of the Oracle target to the connection used in the cloud. Make sure your mapping has only the supported cloud transformations.

Also, the source filelist and the source files have to be in a location that is accessible to Informatica Cloud (ICS).



Tuesday 10 November 2015

User management - Creating groups, users, and owners in Amazon Redshift

User management - Creating groups, users, and owners in Amazon Redshift

Below are some sample commands to create user groups, add users, and manage permissions on tables in Amazon Redshift.

 1) To create a user use the below command in amazon redshift:

create user my_user password 'anaconda123'; 

In the above example, a user with the name my_user is created with the password anaconda123.

You can later change the password using the command:

alter user my_user password 'mynewpassword123';

2) To create a user group use the below command in Amazon Redshift:


create group my_group with user my_user;

In the above command, a new group with name my_group is created and the user my_user is added to that group.

 3) To create a database schema use the below command in Amazon redshift:

 create schema myschema authorization my_user;

In the above example, the schema myschema is created with my_user as the owner of the schema.

4) To grant a user group access to a schema use the below command in Amazon Redshift:

 grant all on schema myschema to group my_group;



5) To add or drop users in a user group, or rename the group, use the below commands in Amazon Redshift:

ALTER GROUP group_name ADD USER username;

ALTER GROUP group_name DROP USER username;

ALTER GROUP group_name  RENAME TO new_name;


6) To check table owners, users, and groups in Amazon Redshift, query the catalog tables:

select * from pg_tables;  /* provides table owner info */
select * from pg_views;   /* provides view owner info */
select * from pg_shadow;  /* provides user info */
select * from pg_group;   /* provides user group info */
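For example, to narrow the output down to the tables owned by a particular user (the user name below is a placeholder):

select schemaname, tablename, tableowner
from pg_tables
where tableowner = 'my_user';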



7) To grant permissions on tables in Amazon Redshift:

Use the sample below to create a table, grant permissions on it, and change its owner in Redshift:

Create table MYTABLE (
    COLUMN1          CHAR(1)                       NOT NULL,
    COLUMN2     CHARACTER VARYING(100)        NOT NULL)
distkey (COLUMN1);

GRANT  ALL ON TABLE MYTABLE TO mstr_user;

ALTER TABLE  MYTABLE owner TO my_user;





Monday 9 November 2015

SQL transformation in Informatica for executing SQL scripts and dynamic SQLs

SQL transformation in Informatica for executing SQL scripts and dynamic SQLs

If you have some SQL statements in a file and need to execute them on a database using Informatica, or if you want a transformation to build dynamic SQL, i.e. SQL that takes a parameter and gives you output based on the input parameter, then the SQL transformation is the way to go.

The SQL transformation operates in Script mode or Query mode. In Query mode you write a query in the SQL transformation and specify a parameter in the query that is bound to an input port, so depending on the value of the input port the result can change. For example, if NAME is an input port of the SQL transformation, the query could be "SELECT DEPARTMENT FROM DEPARTMENT_TABLE WHERE FIRST_NAME = ?NAME?". This returns the department name for every NAME that goes through the SQL transformation.

If you want to execute a bunch of SQL statements that are in a file, then create the SQL transformation in Script mode. Read the script file name from a flat file source and pass the field containing the script name to the ScriptName port of the SQL transformation. The SQL transformation then executes every SQL statement in each script it receives.

For example, if the script file contains:

CREATE TABLE MYTABLE as SELECT * from MYSTAGING_TABLE;

INSERT into MYTABLE values ('xxx', 1, 2, 3);

The SQL transformation will execute both statements one after another and return the script result and any error outputs. A sample mapping is shown in the screenshot below:





Friday 6 November 2015

How to identify last record in Informatica cloud or powercenter

How to identify last record in Informatica cloud or powercenter?

I had a scenario where I had to add a comma (,) at the end of each line that I generate and load to the target, except for the last line. That is not as easy as it sounds: Informatica PowerCenter does not provide any indicator on the last line read from the source. How do we accomplish this? If you have access to Unix scripts, you would most likely write a script to remove the comma from the target file after it is generated. I work in a Windows environment these days and don't have the option of writing complex scripts. This is how I solved the problem, using an aggregator with the LAST function and a joiner to join the last-record indicator with the rest of the records.


I created two source definitions reading from the same file. The first source reads all the records; after the second source I added an aggregator that passes only the last line using the LAST function.


The LAST function in the aggregator sends out the last record. Basically, I used a field called LAST_RECORD that holds the value LAST(ID1 || ID2) in the aggregator. ID1 and ID2 are the two primary key fields of every record read from the source, and the LAST function returns the last value of ID1 concatenated with ID2. A single record comes out of this aggregator, corresponding to the last record read from the source.


For all the records read from the first source, I created a field called FIELD_CONCAT that holds the value ID1 || ID2, i.e. the concatenation of the two values.


Now all the records coming from the two sources are joined on the FIELD_CONCAT field from the first source and the LAST_RECORD field from the second source. This join is a full outer join, so all the records coming out of it have a NULL value in LAST_RECORD except the one record that equals the LAST(ID1 || ID2) value. Now you can check which record has a non-null LAST_RECORD value and remove the comma from that last record, which was the original requirement.

One problem was that the joiner was not sending the records in the same order as they were read from the first source. Hence I had to add a sequence number field to each record and then sort on that field so that the output is written in the same order as the input. A sample mapping for identifying the last line read is shown below; it is an Informatica Cloud mapping.



Monday 2 November 2015

Estimating database and table size in Netezza

Estimating database and table size and skew in Netezza


The nz_db_size utility, present in /nz/support/contrib/3.1/bin/, can be used to estimate the total number of bytes on disk that a given database and the tables belonging to it occupy:


Usage is as follows:

nz_db_size [ -s ] [ database [ table ] ]

E.g.:

[nz@nps12345 bin]$ ./nz_db_size MYTESTDB MYTABLE


 Object    | Name     | Bytes       | KB        | MB      | GB    | TB
-----------+----------+-------------+-----------+---------+-------+-----
 Appliance | nps1234  | 134,934,002 | 136,083   | 132,894 | 192.8 | .1
 Database  | MYTESTDB | 123,245,56  | 12,345,67 | 1000    | 1230  | .0
 Table     | MYTABLE  | 1,206,720   | 1,178     | 1       | .0    | .0


To get summary stats for a database use the below command:

[nz@mpc-nps]$ nz_db_size -summary -s

  Object   |               Name               |        Bytes         |        KB        |      MB      |     GB     |   TB
-----------+----------------------------------+----------------------+------------------+--------------+------------+--------
 Appliance | mpc-nps                   |    1,34343,2434,24 |   1,232,32,23232,34 |    2,167,756 |    4,2267 |    2.1
 Database  | MYTESTDB                          |      123,2323,3232,33 |      123,345,343|      456,334|      686.0 |     .7


nzstats is another command that can be used to get the full statistics. For example:

nzstats show -type table


Using SQLCMD to execute SQL scripts from a file on SQL server machine in Informatica

Using SQLCMD to execute SQL scripts from a file on SQL server machine


I was wondering how to execute a SQL script on SQL Server using Informatica Cloud. The obvious approach that came to mind was the SQL transformation in Script mode. That approach works in PowerCenter but cannot be implemented in Informatica Cloud, since the SQL transformation is not supported there. The next approach was to use some tool that can execute a SQL script on SQL Server. I happened to find a tool called SQLCMD that can do the task for you.
You can put all the commands you want to execute in a .sql or .txt file and then execute the script using the SQLCMD utility. In the below examples, all the SQL commands are in C:\test_sql.txt.

You will most likely find the tool in the below location on a machine that has Microsoft Management Studio installed.

cd C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\110\Tools\Bin

If you want to execute the SQLCMD command in Windows authentication mode use the below command:

SQLCMD -S myservermachine -i C:\test_sql.txt -o C:\out.txt

If you want to execute the SQLCMD command in Windows authentication mode as a different user, then use the below command:

RUNAS /user:Mydomain\myusername "SQLCMD -S myservermachine -i C:\test_sql.txt -o C:\out.txt"

If you want to execute the SQLCMD command in SQL Server authentication mode use the below command:

SQLCMD -U myusername -P mypass -S myservermachine -i C:\test_sql.txt -o C:\out.txt
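In all three cases the input file is just a plain SQL script. A hypothetical C:\test_sql.txt could look like the below (the database and table names are made up; GO is the batch separator understood by SQLCMD):

USE mydatabase;
GO
INSERT INTO mytable (col1, col2) VALUES ('xxx', 1);
GO
SELECT COUNT(*) FROM mytable;
GO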

Using NZSQL to execute SQL scripts from a file on Netezza machine

Similarly, the nzsql tool can be used to execute SQL scripts from a file on a Netezza machine. The command is:

nzsql -f <sql_file_name>


Saturday 24 October 2015

Limitation of Hive mode in Informatica BDE (Informatica 9.6.1) version hotfix 1

Limitation of Hive mode in Informatica Big data edition (Informatica 9.6.1) version hotfix 1

Informatica BDE in Hive mode has some limitations compared to the normal PowerCenter mode (called native mode), basically because in Hive mode the whole mapping logic is converted to Hive SQL. Hive has limitations, such as no support for updates, and these get reflected in Informatica BDE in Hive mode. Below is a list of limitations of Informatica BDE in Hive mode:

1) There is no support for unconnected lookups, sequence generators, or sorters in Hive mode. The reason is that Hive mode does not support statefulness, i.e. it can't keep track of values from previous records. Write properties such as insert, update, and delete are ignored and it is always a truncate and load.

2) For the same reason, it does not support stateful variables, i.e. you cannot have variables that keep values from previous records.

3) Some Hive versions do not support timestamp and date. Hence, if you are working with those Hive versions, you cannot use timestamp or date datatypes.

4) Functions such as abort, error, md5, crc32, max, first, last, aesdecrypt, aesencrypt, etc. are either not valid in the Hive environment or have some limitations.

5) PreSQL and PostSQL do not work in Hive mode.

6) Datatypes such as arrays, maps, and unions may be converted to strings in the Hive data object.

7) Mapping variables and parameters are not supported.

8) The mappings always run in low precision mode.

 

Even with all these limitations, BDE is still a very useful tool for building mappings in a Hadoop environment and reduces development time to a great extent. Developers just need to be a bit innovative when they work around these limitations. One has to be aware that all the ETL logic gets converted to Hive SQL, so whatever cannot be converted to Hive SQL cannot be supported by this tool.

Friday 2 October 2015

How to use SQLPlus to connect to oracle and use it in unix scripts? Connect strings

How to use SQLPlus to connect to oracle and use it in unix scripts? SQL Plus Connect strings and commands :

Sqlplus is a command-line tool that comes with any Oracle client and can be used to connect to an Oracle database. It can also be used in scripts to send commands to the Oracle database.

The common sqlplus connection string is:

sqlplus username/password@servername:port/databasename 

Go to the directory where sqlplus is located and enter the above command with the proper username and password, supplied as shown in the connection string.

To find list of tables in the database use the command in sqlplus:

select table_name from user_tables;
 
Rest of the DML and DDL statements from oracle work in SQL plus.

If you want to use sqlplus in a Unix script, then the below sample script should help:

#!/usr/bin/ksh
sqlplus username/password@servername:port/databasename << EOF
WHENEVER SQLERROR EXIT 2;
insert into mytable select * from mytemptable;
EOF
RTN=$?
if [ $RTN -ne 0 ] ; then
    echo "failed"
fi




Informatica Java transformation to parse comma separated columns and generate new rows - normalizer

Informatica Java transformation to parse comma separated columns and generate new rows - normalizer limitations

The Informatica Normalizer transformation is normally used to convert a column containing multiple values into separate rows, i.e. something like ID, 1,2,3,4 can be converted to the below:
ID , 1
ID, 2
ID, 3
ID, 4

The problem with the normalizer, however, is that it has to know the number of occurrences. If the number of occurrences is very high, then it is hard to create that many output ports.

The easy way in these cases is to use a Java transformation to parse the column containing multiple values and generate a separate row for each of them.

The java code is shown below:

// Split the comma-separated input value and emit one output row per element.
String str_var = INPUT_PORT_NAME;
String[] arr;
String delimiter = ",";
arr = str_var.split(delimiter);
for (int i = 0; i < arr.length; i++) {
    INPUT_PORT_NAME = arr[i];
    generateRow();   // generate a separate output row for this element
}


How to update records in Hive for type2 tables?

How to update records in Hive for type2 tables?

It is easy to append data in Hive, but it is not easy to update data in a Hive table. In fact, older versions of Hive, i.e. prior to 0.14, do not support update operations at all. In those scenarios, how do you update data in Hive? There is no other way than to rebuild the whole table. This matters if your table is a type 2 table, which contains both the historical records and the changed records.

For example: let's say your main table is tweets_main and all the changed/new records are in the tweets_staging table. How do you update the tweets_main table in Hive?

Answer: Take all the unchanged records from the tweets_main table and all the changed records from the tweets_staging table, and overwrite the tweets_main table. The step below should work; let's say ID is the primary key field in both tables.


insert overwrite table tweets_main
select * from (
    select t1.* from tweets_main t1 left outer join tweets_staging t2 on t1.ID = t2.ID where t2.ID is null
    union all
    select * from tweets_staging
) merged;

So what the above step does is take all the records from the main table that have not changed, union them with all the changed records from the change table tweets_staging, and reinsert the result back into the tweets_main table. If you cannot do it in one step, then put the records from the select statement into a temp table and then overwrite the main table from the temp table, as sketched below.
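A minimal sketch of that two-step variant, assuming a scratch table called tweets_tmp, could look like this:

drop table if exists tweets_tmp;

create table tweets_tmp as
select t1.*
from tweets_main t1
left outer join tweets_staging t2 on t1.ID = t2.ID
where t2.ID is null;

insert overwrite table tweets_main
select * from (
    select * from tweets_tmp
    union all
    select * from tweets_staging
) merged;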

The above steps can also be used for partitioned tables in Hive, except that it is a bit more complicated since you want to overwrite only the impacted partitions. You have to find all the impacted partitions, take the records from the impacted partitions that have not changed, union them with the changed records, and then overwrite those partitions again in Hive (a sketch follows below). Check the below article on how to dynamically partition a table.
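A rough sketch of the partitioned case is below, assuming both tables are partitioned by a column called dt and that your Hive version supports IN subqueries (Hive 0.13 or later); the partition column has to be the last column selected for the dynamic partition insert:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table tweets_main partition (dt)
select * from (
    select t1.*
    from tweets_main t1
    left outer join tweets_staging t2 on t1.ID = t2.ID
    where t2.ID is null
      and t1.dt in (select distinct dt from tweets_staging)
    union all
    select * from tweets_staging
) merged;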



Hadoop: How to dynamically partition table in Hive and insert data into partitioned table for better query performance?

Hadoop: How to dynamically partition table in Hive and insert data into partitioned table for better query performance?

Partitioning in Hive, just like in any database, allows for better query performance since it lets queries read only sections of the data instead of the complete table. Hive allows single or multiple columns to be used for partitioning data, and it supports both dynamic and static partitioning of tables. In Hive, since data is stored as files on HDFS, whenever you partition a table it creates sub-directories using the partition keys. For example, if date and batch key are the partition keys, then it creates directories by date and then batch key sub-directories within each date directory. This allows reading data by a certain date and batch key.

In dynamic partitioning in Hive, the data is inserted into the respective partitions dynamically without you having to explicitly create the partitions. Do not use columns with high cardinality as partition keys, since that creates many sub-directories and deteriorates performance.
The example below shows how this works:
 
/* Below is the DDL for creating the main table into which you want to insert data.
It is partitioned by date and batchid. The partition columns are declared separately
in the partitioned by clause and are not part of the regular column list. */

create table IF NOT EXISTS mytable
( id int, age smallint, name string, joining_date timestamp, location string )
partitioned by (roweffectivedate string, batchid int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS SEQUENCEFILE;
 
/* Below is the DDL for the temp hive table from which you want to insert data into the main table.
Here roweffectivedate and batchid are ordinary columns and have to be the last columns. */

create table IF NOT EXISTS temp_mytable
( id int, age smallint, name string, joining_date timestamp, location string, roweffectivedate string, batchid int )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS SEQUENCEFILE;

/* Load data from a local file to the temp hive table */ 

load data local inpath '/etc/home/mydirectory/test.txt' overwrite into table temp_mytable;

/* The below set statements allow all the partitions to be dynamic in Hive */

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

/* Now insert data from the temp hive table to the main hive table that stores the data by date and batch id */

insert overwrite table mytable partition(roweffectivedate, batchid) select * from temp_mytable;

If you want to manually add partitions to Hive then use alter statements such as the below:

ALTER TABLE mytable ADD PARTITION (roweffectivedate='2014-03-05', batchid='10') location '/usr/datawarehouse/hive/db/mytable/';
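To verify the partitions that were created, and to see the benefit, you can list them and filter queries on the partition keys (the values below are just the ones from the example above); such queries only read the matching sub-directories:

show partitions mytable;

select * from mytable where roweffectivedate = '2014-03-05' and batchid = 10;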

Friday 18 September 2015

Creating a table in Parquet, Sequence, RCFILE and TextFile format and enabling compression in Hive

Creating a table in Parquet, Sequence, RCFILE and TextFile format in Hive.


My intention was to write an article on the different file formats in Hive, but I happened to notice an article already posted.
 
We are using Parquet these days mainly because of the compression options and the performance with large tables in Hive. Parquet is optimized to work with large data sets and provides good performance for aggregation functions such as max or sum.

If you prefer to see the HDFS file in clear text format, then you need to store the file in textfile format. This takes more space than the binary formats used by sequencefile and rcfile.


Below are examples of creating a table in Parquet, Sequence, RCFile, and TextFile format. The location clause is optional.

create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS parquet location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS sequencefile location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS rcfile location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS textfile location 'hdfspathname';


The default storage format is TEXTFILE, with Ctrl-A ('\001') as the default field delimiter.

Enabling compression in Hive or Impala table:


Apart from the storage formats supported in Hive, the data can also be compressed using codecs such as LZO, Snappy, etc. To enable compression in Hive, we use set statements such as the ones below. The example is for LZO compression. After the set statements, you can use insert statements to insert data into a table and the data will be compressed as per the set statements.



set mapreduce.output.fileoutputformat.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

insert overwrite table tablename partition(dateid) select * from table2;
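For Parquet tables, compression is usually requested per table rather than through the codec settings above. A hedged example is below; the property name parquet.compression is the commonly used one, but verify it against your Hive version:

create table IF NOT EXISTS mytable_compressed
(
id int,
name string
) STORED AS parquet
TBLPROPERTIES ('parquet.compression'='SNAPPY');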




Monday 14 September 2015

Why a career in Data warehousing ? Best jobs and salary in Data warehousing.

Why a career in Data warehousing?


Firstly, what is data warehousing? Data warehousing (DW) is a term used in information technology to refer to data repositories created to store historical data that is mostly used for reporting and analytical purposes. This could be an enterprise data warehouse that has data integrated from various sources, or it could be subject-specific data repositories called data marts. This is just a short definition of data warehousing.


These days data warehousing is gaining more popularity and attention thanks to big data and other high-profile web service technologies. Big data technologies allow huge volumes of data to be processed at lower cost for analytical purposes. Big companies like Facebook, Google, Twitter, and Yahoo have invested in these technologies and use them for processing the huge amounts of data they gather on a daily basis. Cloud services have also become popular and allow huge data warehouses to be hosted in the cloud; Amazon Web Services and Microsoft Azure, to name a few, are gaining popularity. With these kinds of technologies making news and gaining popularity, data warehousing is becoming more exciting and a place where a lot of IT jobs are being created, bringing in talented IT professionals.

Having worked in data warehousing for the last 10 years, I have seen increased interest from companies investing in data warehousing and, in general, increased interest among IT folks in data warehousing related technologies. I'm writing this article to introduce people to data warehousing and let them know about some of the jobs in the field.

Best jobs and salary in Data warehousing.

Enterprise Data Architect/ ETL Architect/ Solution Architect / Hadoop Architect - 
As the name implies these guys are responsible for architecture. This could be enterprise architecture which means they decide how all the pieces of the data warehouse work together, which products need to be used, interactions between different systems, etc. ETL architect could be more involved with data integration and creation of data warehouses. Hadoop architects are more concerned about the big data architecture in an organization. Architects also work with upper management and business teams to provide optimum data warehousing solutions. 

Estimated Salary: 80,000 to 140,000+

Team/Project/Account Managers -
As with any team, there will be managers in data warehousing teams too. These guys are usually expected to have some experience with data warehousing. Team managers are more involved in resourcing, budgets, etc. Project managers are more involved in managing projects till the end. Account managers are involved if the company is a vendor providing services to another organization.

Estimated Salary: 80,000 to 120,000+ 

Big data / Hadoop developers and Admins-
The big data developers are involved in developing Hadoop jobs, which could be writing MapReduce jobs, developing Hive, Pig, or Oozie scripts, or working on complex Apache Spark code. The Hadoop admins are involved in managing the Hadoop infrastructure, fine tuning it for performance, user administration, security, etc.

Estimated Salary: 60,000 to 100,000+ 

Data modellers -
In some organizations the data modelling is done by data architects or even ETL developers. Other organizations have dedicated people for data modelling. Data modelling refers to developing the table structures and defining the relationships between them.

Estimated Salary: 60,000 to 100,000+ 

ETL/Data warehouse Developer -
ETL stands for Extract, Transform, and Load, and ETL professionals create jobs to integrate data from multiple sources, transform it, and load it to relational or flat-file targets. These days ETL developers also work with ETL tools created for big data platforms. There are many big vendors in this area; Informatica and Pentaho are some of the big names.

Estimated Salary: 70,000 to 100,000+ 

Report Developer -
Report developers build various reports and dashboards for the business teams using tools from vendors such as Cognos, Business Objects, Microstrategy, etc. Report developers need to know SQL and be proficient with one of the report development tools.


Estimated Salary: 70,000 to 100,000+ 


Tester/ Quality Assurance -
Well, every field in software development needs testers to make sure the software works as per the requirements. Testers do all the testing, write test cases, etc. Data warehouse testing might need testers who know a bit of ETL tools, SQL, and some data warehousing concepts.

Estimated Salary: 60,000 to 100,000+

Database Administrators -
Database administrators are responsible for managing and administering the databases and servers that store the data in a data warehouse. There are many jobs for Oracle, SQL Server, Netezza, and similar databases.

Estimated Salary: 70,000 to 100,000+ 

Production and Infrastructure Support - 
Production support folks are responsible for managing and supporting all the jobs in production. Only these guys have access to the production servers, and they make sure all the production jobs run smoothly. They have to carry pagers and sometimes work overnight shifts.

Estimated Salary: 50,000 to 80,000+ 










Monday 17 August 2015

Difference between native and hive mode in Informatica big data edition (BDE)

Difference between native and hive modes in Informatica big data edition BDE : 

 

a) Native mode - In native mode, BDE works like normal PowerCenter. It can be used to read from and write to traditional RDBMS databases, and it can also be used to write to HDFS and Hive. It works like PowerCenter because the execution of the mapping logic happens on the PowerCenter server, i.e. the source data is read into the Informatica server, the transformations are applied, and then the data is loaded to the target.

This mode is stateful, i.e. you can keep track of data from previous records and use sequence generators, sorters, etc., just like in normal PowerCenter.


b) Hive mode - In Hive mode, like in native mode, you can have similar sources and targets; however, the whole mapping logic is pushed down to Hive, i.e. the Hadoop cluster. Informatica BDE in this mode converts the mapping logic into Hive SQL queries and executes them directly on the Hadoop cluster as Hive queries, thereby converting them all into MapReduce jobs.

This mode is not stateful, i.e. you cannot keep track of data in previous records using stateful variables. Transformations like sorters and sequence generators won't work fully or properly.

The update strategy transformation will not work in Hive mode simply because Hive does not allow updates; you can only insert records into a Hive database.

In this mode the data gets read from the source into temporary Hive tables, transformed, and loaded into temp Hive tables before being inserted into the final target, which can be an RDBMS database like Oracle or a Hive database. Hence the limitations of Hive also carry over to Hive mode in Informatica BDE.

However, if your volume of data is huge and you want to push all the processing to Hive, then Hive mode is a better option. There are workarounds to do type 2 kind of updates in Hive mode.

Thursday 6 August 2015

What is Informatica big data edition (BDE) ?

What is Informatica big data edition BDE ?


Informatica Big Data Edition (BDE) is a product from Informatica Corp that can be used as an ETL tool for working in a Hadoop environment along with traditional RDBMS tools. There are now a lot of ETL products in the market that make it easier to integrate with Hadoop; to name a few, Talend, Pentaho, etc. Informatica is one of the leading ETL tool vendors, and Informatica PowerCenter has been famous as an ETL tool for many years. Traditionally this tool was used to extract, transform, and load data into traditional databases such as Oracle, SQL Server, and Netezza, to name a few. With the advent of Hadoop for storing petabyte volumes of data, building ETL tools that can work with Hadoop became more important. It requires a lot of hand-coding and knowledge to work directly with Hadoop and build MapReduce jobs. Hadoop tools such as Hive made it easier to write SQL queries on top of a Hive database and have them converted to MapReduce jobs. Hence, a lot of companies started using Hive as a data warehouse tool, storing data in Hadoop just like in traditional databases and writing queries on Hive. How do we now extract, transform, and load the data in Hadoop? That's where Informatica BDE comes into the picture. It is a tool that you can use for ETL or ELT on Hadoop infrastructure. Informatica BDE can run in two modes: native mode and Hive mode.

In native mode it runs like normal PowerCenter, but in Hive mode you can push the whole mapping logic down to Hive and make it run on the Hadoop cluster, thereby using the parallelism provided by Hadoop. There are some limitations when running in Hive mode, but those are mostly because of the limitations of Hive itself. For example, Hive does not allow updates in older versions.

With Informatica BDE you can do the following, at a very high level:

a) Just like with any ETL tool, you can extract, transform, and load between traditional RDBMS or Hive/HDFS sources and targets.
b) Push the whole ETL logic to the Hadoop cluster and make use of the MapReduce framework. Basically it makes building Hadoop jobs easier.
c) It makes it easy to create connections to all the different sources and integrate data from those sources.
d) It makes it easier to ingest complex files such as JSON, XML, COBOL, Avro, Parquet, etc.


Informatica BDE uses the Informatica Developer interface to build the mappings and to deploy and create applications. Anyone who has used IDQ before will be familiar with the Informatica Developer interface. Informatica BDE is available from Informatica 9.6 versions onwards.


Tuesday 14 July 2015

How to monitor Netezza performance?

 How to monitor Netezza performance?

The performance of Netezza depends on various factors such as the distribution keys on the tables, query performance, hardware factors such as the number of SPUs, and data skew, i.e. how the data is distributed across the SPUs.

There are different ways to check how your Netezza box is performing. First, you can use the Netezza administration tool to see the number of queries running, the queries getting queued, and the performance of individual queries. If there are too many queries running, then obviously it is going to slow down your server. If there are any bad queries pulling a lot of data, they will be one of the culprits. Put a threshold timeout on the queries that users are running so that they do not run forever. You can also check out the Netezza performance portal, which provides some extra information on performance.

Data skew on Netezza refers to how well the data is distributed across the different SPUs of the Netezza server. If some of the SPUs have more data than the others, it is going to impact performance. Make sure the data is properly distributed by using appropriate distribution keys.

You can also use certain queries to see the performance of the netezza machine. Some of them are listed below.

To check all the queries running on the system use the below query

select * from SYSTEM..NZ_QUERY_HISTORY

To check the list of successful queries and the rest of the historical queries, also try out the below queries:

SELECT * FROM  HIST_admin."$v_hist_successful_queries"

SELECT * FROM  HIST_admin."$v_hist_queries"

select * from   HISTDB.."$v_hist_table_access_stats"

select * from _v_qryhist

select * from _v_qrystat;
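For example, to narrow the history down to a single user and the most recent queries, something like the below should work; the _v_qryhist column names (QH_USER, QH_TSTART, QH_TEND, QH_SQL) are based on typical NPS versions, so verify them on your system:

select qh_user, qh_tstart, qh_tend, qh_sql
from _v_qryhist
where qh_user = 'MYUSER'
order by qh_tstart desc;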




You can check the distribution of data using the performance portal, the Netezza administration console, the explain plan, or the nz_skew DBA script.

Sunday 14 June 2015

How to connect to amazon redshift using Informatica power center?

How to connect to  amazon redshift using Informatica power center?


You can connect to Amazon Redshift using the drivers available in Informatica Cloud. However, if you want to connect Informatica PowerCenter 9.1 or prior versions to Redshift, using a Postgres ODBC driver is an option. You can use the DataDirect 6.1 PostgreSQL Wire Protocol driver and later versions for this purpose.

Step 1: Install the Postgres driver and create an entry in .odbc.ini.


Some of the parameters you might have to set in your ODBC connection (.odbc.ini) are below. Check your driver documentation for the complete settings:

[PostgreSQL]
Driver=/informatica/server91/ODBC6.1/lib/DWpsql25.so
Description=DataDirect 6.1 PostgreSQL Wire Protocol
Database=mydatabasename
HostName=myreshifthost.cqa0adasda-east-1.redshift.amazonaws.com
LogonID=myuser_name
Password=mypassword
PortNumber=yourportnumber
TrustStore=/export/redshift-ssl-ca-cert.pem
PacketSize=32


You can contact Informatica or read the driver documentation for the full settings.

Step 2: Create a relational ODBC connection from the Informatica Workflow Manager as shown below:



Wednesday 10 June 2015

How to enforce primary key constraint in Informatica when loading to netezza?

Does netezza enforce primary key constraint?


Netezza does not enforce primary key constraints. The application that loads data to Netezza has to enforce them.

How to enforce primary key constraint in Informatica when loading to netezza?


If Informatica is loading data to Netezza, you can enforce the primary key constraint by setting the primary key in the target definition of the Informatica mapping. This allows it to enforce the primary key constraint within the current load, but it does not work if there is already data in the Netezza table. For example, if record1 and record2 are duplicates in the current load, then Informatica detects it when loading to the Netezza table. However, if record2 already exists in the Netezza table and you are now loading only record1, then Informatica cannot find out about record2.

In this circumstance you have to put additional logic in Informatica, such as a lookup or filter condition, to check for existing records in the Netezza target table. Hope this helps!
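A simple way to check whether duplicates have already made it into the Netezza target is to group on the primary key column after the load (the table and column names below are placeholders):

select primary_key_column, count(*)
from my_netezza_table
group by primary_key_column
having count(*) > 1;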




Monday 8 June 2015

How to install UDF functions on Netezza?


How to install a user-defined function (UDF) in Netezza?


Let's say you have got hold of a UDF from Netezza and want to install it; the below steps should help. In the example below, the reverse function is installed on the Netezza server.


How to install UDF functions on Netezza?


In the folder where you have placed the tgz file, execute the below instructions.

[netezzahost]$ gunzip reverse.tgz
[netezzahost]$ tar -xvf reverse.tar

[netezzahost]$ cd reverse

[netezzahost]$ ./install <databasename>

CREATE FUNCTION
Created udf
Done


After this step you might have to execute a grant command to give users permission on the function.

For example:

grant execute on reverse to <mygroup>;


Login to database and test the function:

select reverse('mydog');
 REVERSE
---------
 godmy
(1 row)

How to connect to microsoft sqlserver from Informatica using odbc driver ?

How to connect to sqlserver from Informatica using odbc driver?

Step 1: Set up the .odbc.ini file with the below entry. You will have to enter the proper path for your driver and the information for your database.

[MYSQLSERVER]
Driver=/data/informatica/ODBC6.1/lib/DWmsss25.so
Description=DataDirect SQL Server Wire Protocol
Database=mydatabase
LogonId=myuserid
Password=mypassword
Address=machinename,1433
QuotedId=No
AnsiNPW=No

Step 2: Set up the relational ODBC connection from your Workflow Manager as shown in the screenshot below.

Tuesday 12 May 2015

Connect to DB2 database using Informatica

Setting up DB2 ODBC or Native DB2 database Connection in Informatica

There are two ways to create a connection to DB2 in Informatica. The first way is to use DB2 in native mode using the DB2 PowerConnect module. The second way is to use ODBC drivers.

To connect to DB2 in native mode, you need the DB2 PowerConnect module installed on your Informatica server machine.

The steps to create a DB2 connection using the DB2 PowerConnect module are below:

1)    Create a remote DB2 database connection entry  using command:
db2 CATALOG TCPIP NODE <nodename> REMOTE <hostname_or_address> SERVER <port number>
2)    Create Catalog entries for the database in unix using command:
 db2 CATALOG DATABASE <dbname> as <dbalias> at NODE <nodename>
3)    Commands to check the database entries are below:
db2 list database directory
db2 list node directory
4)    Verify the connection to the DB2 database using command:
CONNECT TO <dbalias> USER <username> USING <password>
5)    Create a relational connection (type: DB2) with the connect string as “nodename” in the Informatica Workflow Manager (sample below):





To connect to DB2 using ODBC, you need an ODBC driver for DB2 installed on your Informatica server machine.

The steps to create a DB2 connection using ODBC are below:

1)    Create an entry in the .odbc.ini file on your Informatica server machine.

[mydb2database]
Driver=/informatica/ODBC6.1/lib/DWdb225.so
Description=DataDirect 5.2 DB2 Wire Protocol
ApplicationUsingThreads=1
ConnectionRetryCount=0
ConnectionRetryDelay=3
#Database applies to DB2 UDB only
Database=MYDB2DB
DynamicSections=200
GrantAuthid=PUBLIC
GrantExecute=1
IpAddress=101.101.701.101
LoadBalancing=0
#Location applies to OS/390 and AS/400 only
Location=
LogonID=<yourid>
 Password=<yourpasswd>
PackageOwner=
ReportCodePageConversionErrors=0
SecurityMechanism=0
TcpPort=4456
UseCurrentSchema=1
WithHold=1

2)    Create a relational database connection of type ODBC using the connection entry (DSN) created in step 1.

Thursday 23 April 2015

Getting view names and definition of view in Netezza

To get a list of all the views in Netezza use the below statement:


SELECT viewname FROM _v_view WHERE viewname NOT LIKE '\_%' AND UPPER(objtype) = 'VIEW' ORDER BY viewname;


To get the definition of a view in Netezza  use the below statement. 


SELECT definition FROM _v_view WHERE viewname = <VIEWNAME> AND UPPER(objtype) = 'VIEW';

To get the unique object id associated with a view in Netezza use the below statement:


SELECT objid FROM _v_view WHERE viewname = <VIEWNAME> AND UPPER(objtype) = 'VIEW';


The general syntax to create a view in Netezza is:

create view vendor_view as select vendor_id, vendor_name from vendor order by vendor_id;

To read about materialized views, check:


http://dwbitechguru.blogspot.ca/2014/12/what-is-materialized-mat-view-in.html 


Wednesday 1 April 2015

Simple data conversion, string handling, and testing conditions in Informatica expression transformation

Simple data conversion, string handling, and conditions in Informatica expression transformation

To test conditions in an Informatica expression transformation, use:

1) If condition in Informatica:

IIF(condition, true, false)

2) To test for null

ISNULL(<portname>)

When used along with IIF statement, it would look like:

IIF(ISNULL(NAME), '', NAME)

NAME is the port that you are testing to check if it is null. If it is null you pass a blank character '', otherwise you pass NAME as it is.

3) To test if the incoming port is a decimal or integer you can use the below functions:

IS_DECIMAL() or IS_INTEGER() or IsNumber()

For example:

IIF( Not IS_INTEGER(i_PRICE) OR ISNULL(i_PRICE),'Invalid Price', '')

4) To check the length of a port which is of string data type:

IIF(LENGTH(i_NAME) > 0, i_NAME, '')


For string handling in an Informatica expression transformation, use:

1) To search for a string in a port and get the position of the string, you can use the INSTR function. The SUBSTR function can be used to extract part of a string:

substr(NAME,1,INSTR(NAME,'-',1)-1)

In the above example, everything up to '-' is extracted from the incoming port NAME.

2) To concatenate strings from multiple ports use the || operator. For example:

i_port1 || i_port2

3) The REPLACESTR function can replace characters in a string. For example, CHR(10) and CHR(13) are replaced with blanks in the port NAME:

IIF(isnull(NAME),' ', REPLACESTR (1, NAME,CHR(10),CHR(13), ''))

In the below example, the '-' characters are removed from a date ID, which is then converted to an integer:

TO_INTEGER(REPLACECHR(FALSE, v_DATE_ID, '-', ''))

4) To remove blanks from a port you can use the LTRIM or RTRIM functions. In the example below, blank spaces are removed before checking if the value of port NAME is equal to '?':

IIF(LTRIM(RTRIM(NAME)) = '?', 'N', 'Y')


Date manipulation in Informatica expression transformation:

In the examples below, the port TIMESTAMP is converted to a YYYY-MM-DD string and to a date in YYYY-MM-DD HH24:MI:SS format, respectively.

TO_CHAR(trunc(TIMESTAMP), 'YYYY-MM-DD')

TO_DATE(TIMESTAMP, 'YYYY-MM-DD HH24:MI:SS')

Netezza datatype conversion function

Netezza datatype conversion function


To convert a decimal to an integer use cast(value as int) in Netezza:


select cast(12345.30 as int)

In the above example, in place of 12345.30 you can use any decimal or numeric column.
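A few more examples of the same idea in Netezza are below; the literals are arbitrary and the formats can be adjusted as needed:

select 12345.30::int;

select cast('2015-01-01' as date);

select to_char(current_date, 'YYYY-MM-DD');

select to_number('12345.30', '99999.99');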

Wednesday 25 March 2015

Difference between Lookup and Joiner Transformation in Informatica

Difference between Lookup and joiner transformation in Informatica


Look up transformation :
a) A Lookup transformation can be used on a single table or a query to search for records that match the incoming records. The matching condition is specified in the lookup transformation. The result returned can be multiple columns or a single column.
b) A lookup can be a connected or unconnected transformation. An unconnected lookup can return only a single value.
c) A lookup can use a static or dynamic cache. Dynamic lookups are active transformations.
d) A lookup condition can use relational operators other than just equality, such as > and =.

 

Joiner transformation :

a) The Joiner transformation is usually used to join data coming from two separate tables or source qualifiers.
b) The join can be a left outer join, inner join, right outer join, etc.
c) The joiner returns all the records that match the join condition.
d) The master records in the joiner transformation are cached; the detail records are not cached. The joiner is an active transformation.


When do you use joiner or lookup transformation?


a) If the lookup table is not too large, it is preferable to use a lookup.
b) If the result from a single matching record needs to be returned, use a lookup. If a query needs to be used to find the lookup result, a lookup is also a good fit.
c) If you are doing a lookup on a table that is updated in the same session, use a lookup (a dynamic lookup can keep its cache in sync); a joiner caches the master data when the session starts, so it is not preferable.
d) If the lookup table data does not change, the lookup cache can be made persistent, which gives even better performance.
e) If data from two different source qualifiers needs to be joined, use a joiner.
f) If data from two different databases is read and needs to be joined with an outer or inner join, use a joiner transformation.

Thursday 19 March 2015

Creating DATABASE, GROUPS, USERS , GRANTING PERMISSIONS in Netezza

Creating DATABASE, GROUPS, USERS , GRANTING PERMISSIONS in Netezza

Some sample Netezza commands to create a database, user groups, and users, add a user to a group, and grant permissions on Netezza objects are below:


Creating a Netezza database and group, and granting user group permissions on that database


\c SYSTEM
CREATE DATABASE <databasename>;

CREATE GROUP <groupname> WITH USER <username1>, <username2>;

GRANT LIST ON <databasename> TO <groupname> WITH GRANT OPTION;

GRANT Create External Table, Create Sequence, Create Synonym, Create Table, Create Temp Table, Create View TO <groupname>;

GRANT backup, create sequence, create synonym, create table, create external table, create temp table, create view, restore TO <groupname>;

To create a user group in netezza database:


Syntax: CREATE GROUP <groupname>

Example: CREATE GROUP  MYTESTGROUP1;

To alter the settings of the group:


ALTER GROUP "<groupname>" WITH ROWSETLIMIT 0 SESSIONTIMEOUT 0 QUERYTIMEOUT 50 DEFPRIORITY High MAXPRIORITY High;

To create a user and assign to a netezza database user group:


CREATE USER  <username>  WITH PASSWORD '??' IN GROUP "<groupname>";

If the user already exists, or after you have added a user, you can use the command below to add that user to a group:


ALTER GROUP "<groupname>" ADD USER "<username>";

To provide permissions on Netezza objects such as tables, views, databases, etc. to a user group or a user, use the syntax below:


GRANT List ON "<databasename>" TO "<groupname>";

GRANT List, Select ON "<tablename>" TO "<groupname>";


GRANT delete, genstats, groom, insert, list, select, truncate, update ON <tablename> TO <groupname>;
 

To copy permissions from one table to another in Netezza:


ALTER TABLE <tablename> SET PRIVILEGES TO <other_tablename>;

To change the ownership of a Netezza table:

alter table <tablename> owner to <newownername>;

The above command changes the owner associated with the table.