
Friday, 3 February 2017


Sample parameter file and using parameter file in Informatica

Below is a sample of a simple parameter file. Variables shared by all the sessions are defined under the [Global] section, and session-specific overrides are defined under the individual session names.

-----------------------------------------------------
[Global]
$DBConnection_VSTP=VSTPDW_HIST_US

[s_MYSQL_EXT_Load5]
$$BATCH_KEY=204
$$YEAR=2016
$InputFile_Name=$$PMRootDir\SrcFiles\SRC1.EXT.G2016.txt

[s_SRC2_EXT_Load]
$$BATCH_KEY=204
$$YEAR=2016
$InputFile_Name=$$PMRootDir\SrcFiles\SRC2.EXT.G2016.txt

[s_SRC3_EXT_Load]
$$BATCH_KEY=204
$$YEAR=2016
$InputFile_Name=$$PMRootDir\SrcFiles\SRC2.EXT.G2016.txt

[s_SRC4_EXT_Load]
$$BATCH_KEY=204
$$YEAR=2016
$InputFile_Name=$$PMRootDir\SrcFiles\SRC14.EXT.G2016.txt
-------------------------------------------------------------------------

The parameter file can be set at the session level or at the workflow level. A workflow-level parameter file applies to all the sessions in the workflow unless it is overridden by another parameter file specified at the session level.
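Once the parameters are defined in the parameter file and declared in the mapping, they can be referenced anywhere an override or expression is allowed. As a rough sketch (SALES_FACT and its columns are made-up names for illustration), a source qualifier SQL override could filter on the parameter values like this:

SELECT * FROM SALES_FACT
WHERE BATCH_KEY = $$BATCH_KEY   /* replaced with 204 at run time */
AND SALES_YEAR = $$YEAR         /* replaced with 2016 at run time */

The Integration Service substitutes the values from the parameter file before the query is sent to the database.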


Wednesday, 26 October 2016

Static vs Dynamic Lookup in Informatica

Difference between Static and Dynamic Lookups in Informatica

Static Lookups: 

These are used when you want to look up a value in a static table, or via a SQL query, based on certain matching conditions. The lookup values are cached when the static lookup is built at the beginning of the session run. If the values in the lookup table are changing and you want the changed values to be reflected during the execution of the session, then you need to use a dynamic cache.
When you create a lookup transformation, it creates a static cache by default. Static lookups can be connected or unconnected, depending on whether you want multiple values or a single value returned from the lookup.

Dynamic Lookups: 

This kind of lookup is required when you want changes happening in the lookup table to be reflected during the execution of the mapping/session. You can create a dynamic lookup by setting the Dynamic Lookup Cache property in the properties section of the lookup transformation. When you make the lookup dynamic, you will notice in the ports tab that a new port called NewLookupRow is created automatically. This NewLookupRow port indicates whether a row has been inserted into or updated in the cache. With a dynamic lookup, records get inserted into or updated in the cache depending on whether the row already existed in the cache.


Use of NewLookupRow in Dynamic Cache:

The NewLookupRow port can have the following values:
0 - Indicates that PowerCenter did not update or insert the row in the cache.
1 - Indicates that PowerCenter inserted the row into the cache because it is a new row.
2 - Indicates that PowerCenter updated the row in the cache because the lookup value has changed.


Using the above values you can decide, with an update strategy or filter transformation, what to do with the rows coming from the source, for example whether to insert them into the target table or update them: if NewLookupRow = 1 then insert the row into the target table, and if it is equal to 2 then update the record in the target table.




Thursday, 25 August 2016

Improving performance of Inserts, updates and deletes in SQL server

How to make inserts, updates and deletes faster in SQL Server - Microsoft Azure?

I tried to delete some records from a table with 100 million records and it took forever. One reason was that there was no index on the column used in the WHERE clause of the DELETE statement; the other was that the table was simply very big. Deleting a big chunk of data from such a large table can take forever because every delete is written to the transaction log and the table has to be scanned for all the records matching the WHERE clause.

How to delete records quickly from huge tables in SQL Server on Azure?

1) Use only a column that has an index defined on it in the WHERE clause condition. This helps find the records to be deleted quickly.
2) Delete records in small chunks using the loop below. This prevents the transaction log from growing rapidly. @@ROWCOUNT returns the number of records affected by the previous DELETE.

DELETE TOP (100000) FROM dbo.mytable WHERE name = 'dwbi'
WHILE @@ROWCOUNT > 0
BEGIN
    DELETE TOP (100000) FROM dbo.mytable
    WHERE name = 'dwbi'
END
 
3) Insert the rows that you want to keep into a temp table with no index on it, drop the original table, rename the temp table back to the original table name, and add back the index on the newly renamed table.

select * into tmp_mytable from mytable where name != 'dwbi'
drop table mytable
exec sp_rename 'tmp_mytable', 'mytable'
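To complete step 3, recreate the index on the newly renamed table. A minimal sketch, assuming a hypothetical index name and column:

create index IX_mytable_name on dbo.mytable (name)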


How to insert records quickly into huge tables in SQL Server on Azure?

My insert performance into a huge table improved a lot after I disabled the non-clustered indexes and rebuilt them after the insert. The steps are below:

ALTER INDEX [IX_MYTABLE_User_SID] ON dbo.[MYTABLE] DISABLE

INSERT INTO [dbo].[MYTABLE] SELECT * FROM [dbo].[stg_table]

ALTER INDEX [IX_MYTABLE_User_SID] ON dbo.[MYTABLE] REBUILD

How to update records quickly in huge tables in SQL Server on Azure?

My usual way of updating records used to be something like this:

update dbo.mytablename set name= (
select dbo.TMP_table_name.name
from TMP_table_name
where dbo.mytablename.party_sid=TMP_table_name.party_sid)

The above update statement takes a lot of time because it updates all the records, even those that do not match (non-matching rows are set to NULL by the subquery). The easier and faster way is the statement below: it updates only the matching records and is much quicker.

update dbo.mytablename set
dbo.mytablename.name=dbo.TMP_table_name.name
from TMP_table_name
where dbo.mytablename.party_sid=TMP_table_name.party_sid


Thursday, 28 April 2016

How to run stored procedure in Informatica cloud

How to run stored procedure in Informatica cloud?

There are multiple ways to run a stored procedure in Informatica cloud.

If you want to call a stored procedure by passing an input value, reading the output from the stored procedure, and processing it in a mapping, then it is better to use the SQL transformation in the Informatica cloud mapping designer, as shown in the picture below:


If you want to read the output of a stored procedure as a source in a mapping, then you can call the stored procedure as a query in the source, as shown in the screenshot below. Here the exec sp_help stored procedure is called in the source query. The stored procedure should return some data for this method to work, and you might have to define the output variables in the stored procedure.
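As a rough sketch, assuming a hypothetical stored procedure dbo.usp_get_orders that takes a year parameter and returns a result set, the source query would simply be:

exec dbo.usp_get_orders @order_year = 2016

The columns returned by the procedure are then mapped to the source fields of the mapping.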



If you just want to execute a stored procedure once, before or after running a mapping, then you can call the stored procedure as a pre- or post-processing command in the mapping configuration task, as shown in the screenshot below. The SQLCMD utility of SQL Server is used to execute the stored procedure and its output is written to an output file. The -b flag makes SQLCMD return an error code, so the pre-processing command fails when the stored procedure fails. Set the session property 'Preprocessing command failure' to Stop if you want the failure to stop the processing of other jobs.




Tuesday, 29 March 2016

Importing Cobol copy book definition, and reading and troubleshooting cobol file issues in Informatica

Importing Cobol copy book definition in Informatica and troubleshooting cobol file issues

To read a COBOL file, you first need to create the source definition and then read the data from the COBOL input file. Source definitions are created from COBOL copybook definitions. A COBOL copybook has the definition of all the fields present in the data file along with the level and occurrence (OCCURS) information.

How to import a cobol copy book in Informatica?


Create the source definition using the Source -> Import from COBOL File option; this builds the source definition from the COBOL copybook.

Insert the source definition into the mapping. Instead of a source qualifier, Informatica creates a normalizer transformation. Read from the normalizer and write to any target.


Details on how to load data from a COBOL copybook are in this video that I found on YouTube:

https://www.youtube.com/watch?v=WeACh8AWQv8 
 

Troubleshooting copybook import in Informatica:

Reading from a COBOL file is not always straightforward. You will sometimes have to edit the copybook to get it working. Some of the error messages I got while creating source definitions are below:

Error messages seen:

05 Error at line 176 : parse error

Error: FD FILE-ONE ignored - invalid or incomplete.

05 Error: Record not starting with 01 level at line 7


Resolution for copybook import issues in Informatica:

1) Add the headers and footers shown below (the division and section lines surrounding the record definition) even if they are blank.

        identification division.
        program-id. mead.
        environment division.
        select file-one assign to "fname".
       data division.
       file section.
       fd  FILE-ONE.

        01  HOSPITAL-REC.
           02  DEPT-ID                PIC 9(15).
           02  DEPT-NM                PIC X(25).
           02  DEPT-REC                OCCURS 2 TIMES.
               03  AREA-ID             PIC 9(5).
               03  AREA-NM             PIC X(25).
               03  FLOOR-REC       OCCURS 5 TIMES.
                   04  FLOOR-ID       PIC 9(15).                                
                   04  FLOOR-NM       PIC X(25).
      working-storage section.
       procedure division.
           stop run.


2) All the fields should start from column 8, so add the leading spaces before each field.
3) Make sure the field entries do not start with *. Each entry has to have a level number at the beginning.
4) Level 01 has to be defined as the first record. If level 01 is not there, the .cbl file will not import.
5) Special levels such as 88 and 66 might not get imported. This apparently is a limitation of PowerCenter and I could not find a way around it.


Resolution for COBOL file issues in Informatica:

I had a hard time reading a COBOL file that was in EBCDIC format with COMP-3 fields in it. I thought the issues were mostly in importing the copybook, but there were even more issues when reading the file itself, even though it is supposed to be straightforward in Informatica PowerCenter. The list of issues I faced:

1) When reading from the COBOL mainframe file, the COMP-3 fields were not read properly.
2) The alignment of the fields was not correct. Values were shifting into the next columns, with a cascading effect.

To resolve the above two issues I asked for the mainframe file to be sent in binary format. Files FTP'ed in ASCII format from the mainframe cause both of the above issues: the COMP-3 fields are not transferred properly in ASCII mode.

With the binary file and the properties shown in the screenshot it was supposed to work correctly, but it still did not; the output was all garbage because of an odd issue. The code page I set in the source properties in the Source Designer was for some reason not reflected in the session properties for that source. I had to manually change it in the session (Mapping -> Source -> File properties -> Advanced tab) to the code page IBM EBCDIC US English IBM037 and set Number of bytes to skip between records to 0, and it finally worked.

   
To summarize, when reading a COBOL file:
a) Ask for the file to be FTP'ed in binary format from the mainframe.
b) Set the code page and the number of bytes to skip appropriately, as shown in the screenshot, in the session properties.

This should solve most of your problems with COMP-3 fields, alignment, etc.

If not, try unchecking the IBM COMP-3 option in the source properties in the Source Designer, and try changing the code page, the number of bytes to skip, etc., and hopefully it works.

If you are seeing dots at the end of some COMP-3 fields, then most likely you are writing numeric data into a varchar field. If you write it to numeric fields in the database, those dots should disappear.

Another issue is that Informatica PowerCenter can read only fixed-width files. If you have an OCCURS section with a DEPENDING ON clause, then most likely you have a variable-length file, and PowerCenter does not work correctly with variable-length files. Ask the mainframe team to convert it to a fixed-width file.

  

Friday, 4 March 2016

Mapping variables and parameters in Informatica Mapping and session

Mapping variables and parameters in Informatica Mapping

1) Informatica Mapping Parameters:

You use mapping parameters to supply parameter values to Informatica mappings; these could be values such as Batch_Key, Batch_Name, etc. The value of a mapping parameter does not change for the entire execution of a session. Mapping parameters are defined in the mapping designer as shown in the screenshot below, and the same parameters need to be defined in the parameter file too.

For example, the parameter file would have:

$$BATCH_KEY=301

In the mapping designer, you would define $$BATCH_KEY in the mapping parameters and variables window. During execution, the session substitutes the value 301 wherever it finds $$BATCH_KEY in the mapping. It might also be necessary to define the same parameter in the workflow under the Variables tab.



2) Informatica Mapping Variables:

A mapping variable represents a value that can change during the session. Mapping variables are defined in the same mapping parameters and variables window of the mapping designer as parameters (shown in the screenshot above), and they can also be initialized using the parameter file. However, the value of a variable can change during execution. The value of a mapping variable can be changed in an expression transformation using the SETVARIABLE function. For example:

SETVARIABLE($$BATCH_KEY, 302)  would change the value of $$BATCH_KEY to 302 during execution.

The SETVARIABLE function can be used in the expression, filter, router, and update strategy transformations. The Integration Service saves the value of a mapping variable to the repository at the end of each successful session run and uses that value the next time you run the session.

Post-session and pre-session variable assignments in the session can be used to pass the value of a mapping variable to session/workflow variables. This allows the values of the variable to be used at the workflow level, as shown in the screenshot below.

3) Informatica Session Parameters:

Session parameters represent values that can change between session runs, such as database connections or source and target files, e.g. $DBConnectionName, $InputFileName, etc. These parameters need to be defined in the parameter file.

4) Informatica Workflow Variables:

All the parameters and variables that are used in the workflow need to be defined in the workflow in the window shown below. This can include all the mapping parameters, session parameters, etc.




Monday, 22 February 2016

Pushdown Optimization in Informatica Power Center and Big data edition





Pushdown optimization is a feature of Informatica that converts the mapping logic into SQL and executes it on the database. It is useful when you are dealing with huge volumes of source data: instead of reading all the data into the Informatica server, it can be better to perform transformation logic such as filtering, aggregation and joins on the database itself. It is also useful when you have a powerful data warehouse appliance such as Netezza or Teradata.
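As a rough illustration (the table and column names below are made up), a mapping that filters and aggregates source rows before loading a target could be pushed down as a single statement along these lines:

INSERT INTO SALES_SUMMARY (DEPT_ID, TOTAL_AMOUNT)
SELECT DEPT_ID, SUM(AMOUNT)
FROM SALES
WHERE SALES_YEAR = 2016
GROUP BY DEPT_ID;

The filtering and aggregation then happen inside the database and only the summarized result is written, instead of streaming every source row through the Informatica server.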

Friday, 12 February 2016

What is informatica cloud services?

What is Informatica cloud services?

Informatica cloud services is an Informatica product that provides a web-based, cloud-hosted solution for data integration. Mappings and other data integration tasks are created, executed and monitored from the web-based tool. A secure agent installed in the local environment allows data to be read from and written to the cloud. Connections can be created to databases such as Amazon Redshift, SQL Server, Oracle, Netezza, etc. and to network folders. This product allows ETL tasks to be created on the web without having a PowerCenter server installed on your premises, hence providing a cheaper option for building Informatica tasks.



The different components of the Informatica cloud services (ICS) tool set are listed below:

Data synchronization and Data replication: 

As the name implies, these allow database synchronization. Data is copied from the source to the targets, with any data filters applied. One table or multiple tables can be synchronized. ICS provides built-in connectors with which connections to the source and target tables can be created.



Mapping configuration:  

Allows a task to be created from a mapping. The parameter files, variables, post-processing commands, session settings, etc. can be defined for the task. This is similar to the session in Informatica PowerCenter.

Power Center:  

This component allows Informatica PowerCenter workflows to be imported and executed from Informatica cloud services. The source and target connections from the PowerCenter workflow can be mapped to the connections available in Informatica cloud services. When importing PowerCenter workflows, make sure the workflow is exported from the Repository Manager of PowerCenter; otherwise an error is shown while importing the task.

Mappings

This component allows you to create Informatica ETL mappings similar to those in Informatica PowerCenter. Not all the features of PowerCenter are available in Informatica cloud services; features like the SQL transformation, union, etc. are not provided in ICS. If you need additional features, you may have to build a mapplet in PowerCenter, import it as a PowerCenter mapplet, and add it to the ICS mapping. ICS provides transformations such as expression, joiner, filter, lookup, sorter, aggregator, mapplet, and normalizer.




Task Flows

Task flows allow you to create a sequence of jobs to be executed. All the mapping configuration tasks you have created can be made to execute in a sequence. This is similar to the workflow in Informatica PowerCenter.

Integration templates

This component allows Informatica mapping templates created using Microsoft Visio to be imported and applied to mappings in ICS.

Activity Log/Monitor

This component allows you to monitor all the executing tasks as well as see the completed tasks. It provides information on the number of source and target rows, session logs, etc.

Mapplets

ICS mapplets work like Informatica PowerCenter mapplets. Mostly this component allows PowerCenter mapplets to be imported into ICS.
  

Connections

This component allows connections to be created to flat file folders, databases, etc. A wide range of connectors is available.

Friday, 11 December 2015

SQL override in Source qualifier and lookup override in Informatica

SQL override in Source qualifier and lookups in Informatica



SQL override is commonly used by ETL developers in source qualifiers and lookup transformations. To override the SQL in a source qualifier, you can generate the SQL from the properties tab of the source qualifier and make the necessary modifications to it. The main thing to keep in mind is that the columns in the SELECT statement must match the order of the ports in the source qualifier, and the ports have to be connected to the transformation next to the source qualifier. The diagram below shows a simple SQL override. SQL can be overridden in lookups in a similar way.
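For example, a minimal override could look like the query below (the table and column names are made up for illustration). Note that the column order of the SELECT matches the port order of the source qualifier:

SELECT C.CUSTOMER_ID,
       C.CUSTOMER_NAME,
       O.ORDER_AMOUNT
FROM   CUSTOMERS C
JOIN   ORDERS O ON O.CUSTOMER_ID = C.CUSTOMER_ID
WHERE  O.ORDER_STATUS = 'OPEN'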



The advantages of SQL override in Informatica:

a) You can put a lot of logic in the SQL override. The SQL is executed on the source database and only the results are pulled in by Informatica PowerCenter. Instead of reading all the records from the table, a SQL override lets you filter, aggregate, etc. on the source database and pull in just the result set.

b) Similarly, in lookups, instead of reading all the columns from a table, you can read only the required columns by doing a SQL override and removing all the unnecessary ports. You can also aggregate, filter, etc. the records that need to be read.

c) If you are comfortable with SQL, you may prefer this approach instead of building transformation logic that could be achieved with SQL.



Thursday, 12 November 2015

Indirect file processing in Informatica Cloud - Workaround

Indirect file processing (Filelist) in Informatica Cloud (ICS)

I had a scenario where I needed to read a list of files (a filelist) in Informatica cloud. My research pointed to a couple of articles on the Informatica community site; apparently there is a template to work with, or a filelist connector to use. I could not find either, so I had to come up with a different strategy.

The strategy was:
1) Use PowerCenter to create the mapping and a workflow that reads from the filelist and writes to an Oracle target.
2) Export the workflow from the Repository Manager as an XML file.
3) Import it into Informatica cloud as a PowerCenter task, map the connections to ICS connections, and run it from Informatica cloud.
 
In Informatica PowerCenter we read a filelist by setting the source file type property in the session to Indirect, as shown in the screenshot below. Using this approach I created my mapping and wrote the records to the Oracle target. Once I had tested the workflow, I exported it to Informatica cloud and it worked.

When you import the workflow in Informatica cloud you can map the connection of the Oracle target to the connection used in the cloud. Make sure your mapping uses only the supported cloud transformations.

Also, the filelist and the source files have to be in a location that is accessible to Informatica cloud (ICS).



Tuesday, 10 November 2015


User management - Creating groups, users, owners in Amazon Redshift

Below are some sample commands for creating user groups, adding users, and managing permissions on tables in Amazon Redshift.

1) To create a user, use the command below in Amazon Redshift:

create user my_user password 'anaconda123'; 

In the above example, a user named my_user is created with the password anaconda123.

You can later change the password using the command:

alter user my_user password 'mynewpassword123';

2) To create a user group, use the command below in Amazon Redshift:


create group my_group with user my_user;

In the above command, a new group named my_group is created and the user my_user is added to that group.

3) To create a database schema, use the command below in Amazon Redshift:

 create schema myschema authorization my_user;

In the above example, the schema myschema is created with my_user as its owner, giving that user full permissions on the schema.

4) To assign a user group to a schema, use the command below in Amazon Redshift:

 grant all on schema myschema to group my_group;



5) To add users to or drop users from a user group, or to rename a group, use the commands below in Amazon Redshift:

ALTER GROUP group_name ADD USER username;

ALTER GROUP group_name DROP USER username;

ALTER GROUP group_name  RENAME TO new_name;


6) To check permissions on the tables in Amazon Redshift:

select * from pg_tables   /* provides table owner info */
select * from pg_views    /* provides view owner info */
select * from pg_shadow   /* provides user info */
select * from pg_group    /* provides user group info */



7) To grant permissions on the tables in Amazon Redshift:

Use the sample below to grant permissions on a table in Redshift:

Create table MYTABLE (
    COLUMN1          CHAR(1)                       NOT NULL,
    COLUMN2     CHARACTER VARYING(100)        NOT NULL)
distkey (COLUMN1);

GRANT  ALL ON TABLE MYTABLE TO mstr_user;

ALTER TABLE  MYTABLE owner TO my_user;
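If you want to grant access to a whole group rather than a single user, Redshift also lets you grant on every table in a schema at once. A small sketch using the group and schema created earlier:

GRANT SELECT ON ALL TABLES IN SCHEMA myschema TO GROUP my_group;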





Monday, 9 November 2015


SQL transformation in Informatica for executing SQL scripts and dynamic SQLs

If you have some SQL statements in a file that need to be executed on a database using Informatica, or if you want to build dynamic SQL, i.e. SQL that takes a parameter and returns output based on that input value, then the SQL transformation is the way to go.

The SQL transformation operates in script mode or query mode. In query mode you write a query in the SQL transformation and reference a parameter in the query that is bound to an input port, so the result changes depending on the value of that port. For example, if NAME is an input port of the SQL transformation, the query in the SQL transformation could be "SELECT DEPARTMENT FROM DEPARTMENT_TABLE WHERE FIRST_NAME = ?NAME?". This returns the department name for every NAME value that passes through the SQL transformation.

If you want to execute a set of SQL statements that are in a file, create the SQL transformation in script mode. Read the script file name from a flat file source and pass the field containing it to the ScriptName port of the SQL transformation. The SQL transformation then executes every statement in each script file it receives.

For example, if the script file contains:

CREATE TABLE MYTABLE AS SELECT * FROM MYSTAGING_TABLE;

INSERT INTO MYTABLE VALUES ('xxx', 1, 2, 3);

then the SQL transformation will execute both statements one after another and return the script result and any error outputs. A sample mapping is shown in the screenshot below:





Friday, 6 November 2015

How to identify last record in Informatica cloud or powercenter

How to identify last record in Informatica cloud or powercenter?

I had a scenario where I had to add a comma (,) at the end of each line I was generating and loading to the target, except for the last line. That is not as easy as it sounds, because Informatica PowerCenter gives no indicator on the last line read from the source. How do we accomplish this? If you have access to Unix scripts, you can most likely write a script to remove the comma from the target file after it is generated. I work in a Windows environment these days and don't have the option of writing complex scripts, so this is how I solved the problem: an aggregator with the LAST function and a joiner to join the last-record indicator back to the rest of the records.


I created two source definitions reading from the same file. The first source reads all the records, but after the second source I added an aggregator that passes only the last line, using the LAST function.


The LAST function in the aggregator returns the last record. I used a field called LAST_RECORD that holds the value LAST(ID1 || ID2) in the aggregator, where ID1 and ID2 are the two primary key fields of every record read from the source, so the LAST function returns the value of ID1 concatenated with ID2 for the last row. Only one record comes out of this aggregator, corresponding to the last record read from the source.


For all the records read from the first source, I created a field called FIELD_CONCAT that holds the value ID1 || ID2, i.e. the concatenation of the two values.


All the records coming from the two sources are then joined on the FIELD_CONCAT field from the first source and the LAST_RECORD field from the second source, using a full outer join. All the records coming out of this full outer join will have a NULL LAST_RECORD value except the one record whose key equals the LAST(ID1 || ID2) value. Now you can check which record has a non-null LAST_RECORD value and remove the comma from that last record, which was my original requirement.

One problem was that the joiner was not sending the records in the same order as they were read from the first source. Hence I had to add a sequence number field to each record and then sort on that field so that the output is written in the same order as the input. A sample mapping for identifying the last line read is shown below; it is an Informatica cloud mapping.



Monday, 2 November 2015

Using SQLCMD to execute SQL scripts from a file on SQL server machine in Informatica

Using SQLCMD to execute SQL scripts from a file on SQL server machine


I was wondering how to execute a SQL script on SQL Server using Informatica cloud. The obvious approach that came to mind was the SQL transformation in script mode. That approach works in PowerCenter but cannot be implemented in Informatica cloud, since the SQL transformation is not supported there. The next idea was to use a tool that can execute a SQL script on SQL Server, and the SQLCMD utility can do exactly that.
You can put all the commands you want to execute in a .sql or .txt file and then execute the script using the SQLCMD utility. In the example below, all the SQL commands are in C:\test_sql.txt.

You will most likely find the tool in the location below on a machine that has Microsoft SQL Server Management Studio installed:

cd C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\110\Tools\Bin

If you want to execute the SQLCMD command in Windows authentication mode, use the command below:

SQLCMD -S myservermachine -i C:\test_sql.txt -o C:\out.txt

If you want to execute the SQLCMD command in Windows authentication mode as a different user, then use the command below:

RUNAS /user:Mydomain\myusername "SQLCMD -S myservermachine -i C:\test_sql.txt -o C:\out.txt"

If you want to execute the SQLCMD command in SQL Server authentication mode, use the command below:

SQLCMD -U myusername -P mypass -S myservermachine -i C:\test_sql.txt -o C:\out.txt

Using NZSQL to execute SQL scripts from a file on a Netezza machine

Similarly, the nzsql tool can be used to execute SQL scripts from a file on a Netezza machine. The command is:

nzsql -f <sql_file_name>


Saturday, 24 October 2015

Limitation of Hive mode in Informatica BDE (Informatica 9.6.1) version hotfix 1

Limitation of Hive mode in Informatica Big data edition (Informatica 9.6.1) version hotfix 1

Informatica BDE in Hive mode has some limitations compared to the normal PowerCenter mode (called native mode), basically because in Hive mode the whole mapping logic is converted to Hive SQL. Hive has limitations of its own, such as no support for updates, which get reflected in Informatica BDE in Hive mode. Below is the list of limitations of Informatica BDE in Hive mode:

1) There is no support for unconnected lookups, sequence generators or sorters in Hive mode. The reason is that Hive mode does not support statefulness, i.e. it cannot keep track of values from previous records. Write properties such as insert, update and delete are ignored, and it is always a truncate and load.

2) For the same reason, it does not support stateful variables, i.e. you cannot have variables that keep values from previous records.

3) Some Hive versions do not support timestamp and date. If you are working with those Hive versions, then you cannot use the timestamp or date datatypes.

4) Functions such as ABORT, ERROR, MD5, CRC32, MAX, FIRST, LAST, AES_DECRYPT, AES_ENCRYPT, etc. are either not valid in the Hive environment or have limitations.

5) Pre-SQL and post-SQL do not work in Hive mode.

6) Datatypes such as arrays, maps and unions might be converted to strings in the Hive data object.

7) Mapping variables and parameters are not supported.

8) The mappings always run in low precision mode.

 

Even with all these limitations, BDE is still a very useful tool for building mappings in a Hadoop environment and reduces development time to a great extent. Developers just need to be a bit creative when working around these limitations. One has to be aware that all the ETL logic gets converted to Hive SQL, so whatever cannot be expressed in Hive SQL cannot be supported by this tool.

Friday, 2 October 2015

Informatica Java transformation to parse comma separated columns and generate new rows - normalizer

Informatica Java transformation to parse comma separated columns and generate new rows - normalizer limitations

The Informatica normalizer transformation is normally used to convert a column containing multiple values into separate rows, i.e. something like ID, 1,2,3,4 can be converted to the rows below.
ID, 1
ID, 2
ID, 3
ID, 4

The problem with the normalizer, however, is that it has to know the number of occurrences in advance. If the number of occurrences is very high, it becomes impractical to create that many output ports.

The easy way in these cases is to use a Java transformation to parse the column containing multiple values and generate a separate row for each value.

The Java code (entered in the On Input Row section of the Java transformation) is shown below:

// INPUT_PORT_NAME is the port holding the comma separated values
String str_var = INPUT_PORT_NAME;
String delimiter = ",";
String[] arr = str_var.split(delimiter);   // split into individual values
for (int i = 0; i < arr.length; i++) {
    INPUT_PORT_NAME = arr[i];              // overwrite the port with a single value
    generateRow();                         // emit one output row per value
}


Monday, 17 August 2015

Difference between native and hive mode in Informatica big data edition (BDE)

Difference between native and Hive modes in Informatica big data edition (BDE):

 

a) Native mode - In native mode BDE works like normal PowerCenter. It can be used to read from and write to traditional RDBMS databases, and it can also be used to write to HDFS and Hive. It works like PowerCenter because the execution of the mapping logic happens on the PowerCenter server, i.e. the source data is read into the Informatica server, the transformations are applied, and the data is then loaded to the target.

This mode is stateful, i.e. you can keep track of data from previous records and use sequence generators, sorters, etc., just like in normal PowerCenter.


b) Hive mode - In Hive mode you can have similar sources and targets as in native mode, but the whole mapping logic is pushed down to Hive, i.e. to the Hadoop cluster. In this mode Informatica BDE converts the mapping logic into Hive SQL queries and executes them directly on the Hadoop cluster as Hive queries, thereby turning them all into MapReduce jobs.

This mode is not stateful, i.e. you cannot keep track of data in previous records using stateful variables, and transformations like sorters and sequence generators won't work fully or properly.

The update strategy transformation will not work in Hive mode simply because Hive does not allow updates; you can only insert records into a Hive database.
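Because only inserts are possible, loads in Hive mode effectively become insert or truncate-and-load operations. A minimal sketch of what such a load could look like in Hive SQL (the table names are made up for illustration):

INSERT OVERWRITE TABLE target_table
SELECT *
FROM staging_table;

INSERT OVERWRITE replaces the existing contents of the target table, which is why update-style writes are not available in this mode.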

In this mode the data is read from the source into temporary Hive tables, transformed, and the results are also staged in temporary Hive tables before being inserted into the final target, which can be an RDBMS database like Oracle or a Hive database. Hence the limitations of Hive also carry over to Hive mode in Informatica BDE.

However, if your volume of data is huge and you want to push all the processing to Hive, then Hive mode is the better option. There are workarounds to do type 2 style updates in Hive mode.