
Friday, 8 July 2016

Performance tuning of Informatica Big Data Edition Mapping




Below is a list of performance tuning steps that can be applied in Informatica Big Data Edition:


1) Use a Lookup transformation only when the lookup table is small. Lookup data is copied to each node, so lookups on large tables are slow.


2) Use Joiner transformations instead of lookups for large data sets.


3) Join large data sets before small data sets, and reduce the number of times the large data sets are joined in Informatica BDE.


4) Since Hadoop does not allow updates, you have to rebuild the target table whenever a record in it is updated. Instead of rebuilding the whole table, consider rebuilding only the impacted partitions (a sketch follows this list).


5) Hive is slower with non-string data types: it needs to create temp tables to convert between the non-string data types and strings. Use non-string data types only when required.


6) Use a data type precision close to the actual data; using higher precision slows down Informatica BDE.


7) Map only the ports that are required by the mapping transformations or loaded to the target. Fewer ports mean better performance and fewer data reads.
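
As a minimal sketch of point 4, assuming a Hive target table partitioned by a date column (the table and column names below are made up), a single impacted partition can be rebuilt with INSERT OVERWRITE while the rest of the table is left untouched:

-- Rebuild only the partition whose records changed
INSERT OVERWRITE TABLE sales_target PARTITION (load_date = '2016-07-08')
SELECT cust_id, product_id, amount
FROM sales_staging
WHERE load_date = '2016-07-08';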









Workarounds for mapping variables, parameters, sequence generators, and sorters in BDE



Since there are no mapping variables/parameters or sequence generators in BDE, you can use the following workarounds:

For mapping variables and parameters, you can use a control table or file instead: read the control table or file in the mapping and use its contents in place of the parameters. Create a lookup on the control table to get the values of all the parameters defined in it. If a parameter needs to be updated at the end of the run, update the control table, as sketched below.
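
As a minimal sketch, assuming a Hive control table (the table and column names are illustrative, and INSERT ... VALUES requires Hive 0.14 or later):

CREATE TABLE IF NOT EXISTS etl_control (
  param_name  STRING,
  param_value STRING
);

-- Seed the value the mapping would otherwise get from a parameter file
INSERT INTO TABLE etl_control VALUES ('BATCH_KEY', '301');

-- The lookup in the mapping resolves a parameter with a query like:
SELECT param_value FROM etl_control WHERE param_name = 'BATCH_KEY';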



For the Sequence Generator, you can use UUID (Universally Unique Identifier) functions instead. These UUID functions return alphanumeric values; if you need numeric values only, use Java functions instead.
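
For illustration, one way to generate a UUID in Hive SQL is through Java reflection (newer Hive versions also ship a built-in uuid() function); the table name below is made up:

SELECT reflect('java.util.UUID', 'randomUUID') AS row_key
FROM source_table;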

Hive SQL sorts by default, i.e. Hadoop does the sorting, so you do not need a Sorter unless you are using it with an Aggregator that has sorted input. In that case you need to add a Sorter for the mapping to validate.

Friday, 4 March 2016

Mapping variables and parameters in Informatica mappings and sessions


1) Informatica Mapping Parameters:

You use mapping parameters to supply parameter values to Informatica mappings. These could be values such as Batch_Key, Batch_Name, etc. The value of a mapping parameter does not change for the entire execution of a session. Mapping parameters can be defined in the mapping in the Mapping Designer, as shown in the screenshot below. The same mapping parameters need to be defined in the parameter file too.

For example, the parameter file would have:

$$BATCH_KEY=301

In the Mapping Designer, you define $$BATCH_KEY in the Mapping and Parameters window. During execution the session substitutes the value 301 wherever it finds $$BATCH_KEY in the mapping. It might also be necessary to define the same parameter in the workflow under the Variables tab.
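
For reference, a fuller parameter file scopes values to a folder, workflow, and session with a section header; the folder, workflow, and session names below are illustrative (the session parameters shown here are described in section 3):

[MyFolder.WF:wf_load_sales.ST:s_m_load_sales]
$$BATCH_KEY=301
$DBConnectionSource=Oracle_Src_Conn
$InputFileName=customers.csv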



2) Informatica Mapping Variables:

A mapping variable represents a value that can change during the session. Mapping variables are defined under the Mapping and Parameters window of the Mapping Designer, just like parameters (shown in the screenshot above), and can also be initialized from the parameter file. However, the value of a variable can change during execution: it can be set in an Expression transformation using the SETVARIABLE function. For example:

SETVARIABLE($$BATCH_KEY, 302)  would change the value of $$BATCH_KEY to 302 during execution.

The SETVARIABLE function can be used in Expression, Filter, Router, and Update Strategy transformations. The Integration Service saves the value of a mapping variable to the repository at the end of each successful session run and uses that value the next time you run the session.

Pre-session and post-session variable assignments in the session can be used to pass the value of a mapping variable to session/workflow variables. This allows the values of the variable to be used at the workflow level, as shown in the screenshot below.

3) Informatica Session Parameters:

Session parameters represent values that can change between session runs, such as database connections or source and target files, e.g. $DBConnectionName, $InputFileName, etc. These parameters need to be defined in the parameter file.

4) Informatica Workflow Variables:

All the parameters and variables that are used in the workflow need to be defined in the workflow in the window shown below. This includes all the mapping parameters, session parameters, etc.




Monday, 22 February 2016

Pushdown Optimization in Informatica PowerCenter and Big Data Edition





Pushdown optimization is a feature in Informatica that converts an Informatica mapping into SQL and executes it on the database. It is useful when you are dealing with huge volumes of source data: instead of reading all the data into the Informatica server, it can be better to perform some of the transformation logic, such as filtering, aggregation, and joins, on the database itself. It is also useful when you have a powerful data warehouse appliance such as Netezza or Teradata.
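
For illustration, a mapping that filters and then aggregates its source might be pushed down to the source database as a single statement along these lines (the table and column names are made up):

SELECT cust_id, SUM(amount) AS total_amount
FROM sales
WHERE region = 'EU'
GROUP BY cust_id;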

Friday, 12 February 2016

What is Informatica Cloud Services?


Informatica Cloud Services is a web-based Informatica product: an on-the-cloud solution for data integration. Mappings and other data integration tasks can be created, executed, and monitored from the web-based tool. A secure agent installed in the local environment allows data to be read from and written to the cloud. Connections can be created to databases such as Amazon Redshift, SQL Server, Oracle, Netezza, etc., and to network folders. This product allows ETL tasks to be created on the web without having a PowerCenter server on your premises, hence providing a cheaper option for creating Informatica tasks.



The different components of the Informatica Cloud Services (ICS) tool set are listed below:

Data synchronization and Data replication: 

As the name implies, these allow database synchronization. Data is copied from the source to the targets, with any data filters applied. One table or multiple tables can be synchronized. ICS provides built-in connectors with which connections to the source and target tables can be created.



Mapping configuration:  

Allows a task to be created from a mapping. Parameter files, variables, post-processing commands, session settings, etc. can be defined for the task. This is similar to a session in Informatica PowerCenter.

Power Center:  

This component allows Informatica PowerCenter workflows to be imported and executed from Informatica Cloud Services. The source and target connections from the PowerCenter workflow can be mapped to the connections available in Informatica Cloud Services. When importing PowerCenter workflows, make sure the workflow was exported from the PowerCenter Repository Manager; otherwise the task import fails with errors.

Mappings

This component allows you to create Informatica ETL mappings similar to those in Informatica PowerCenter. Not all features of Informatica PowerCenter are available in Informatica Cloud Services: features like the SQL transformation, Union, etc. are not provided in ICS. If you need additional features, you can create a mapplet in PowerCenter, import it into ICS, and add it to the ICS mapping. ICS provides transformations such as Expression, Joiner, Filter, Lookup, Sorter, Aggregator, Mapplet, and Normalizer.




Task Flows

Task flows allow you to create a sequence of jobs to be executed: all the mapping configuration tasks you have created can be made to run in sequence. This is similar to a workflow in Informatica PowerCenter.

Integration templates

This component allows Informatica mapping templates created using Microsoft Visio to be imported and applied to mappings in ICS.

Activity Log/Monitor

This component allows you to monitor all executing tasks as well as see completed tasks. It provides information on the number of source and target rows, session logs, etc.

Mapplets

ICS mapplets work similarly to Informatica PowerCenter mapplets; mostly, this component allows PowerCenter mapplets to be imported into ICS.
  

Connections

This component allows connections to be created to flat file folders, databases, etc. A wide range of connectors is available.

Friday, 11 December 2015

SQL override in Source Qualifier and lookup override in Informatica




SQL override is commonly used by ETL developers in Source Qualifiers and Lookup transformations. To override the SQL in a Source Qualifier, you can generate the SQL from the Properties tab of the Source Qualifier and make all the necessary modifications to it. Keep in mind that the columns in the SELECT statement must match the order of the ports in the Source Qualifier, and the ports have to be mapped to the transformation next to the Source Qualifier. The diagram below shows a simple SQL override. SQL can be overridden in lookups in the same way.
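
For illustration, an override might filter and aggregate at the source like the sketch below (the table and column names are made up); remember that the SELECT list has to line up with the Source Qualifier ports:

SELECT C.CUST_ID,
       C.CUST_NAME,
       SUM(O.ORDER_AMOUNT) AS TOTAL_AMOUNT
FROM CUSTOMERS C
JOIN ORDERS O ON O.CUST_ID = C.CUST_ID
WHERE O.ORDER_DATE >= DATE '2015-01-01'
GROUP BY C.CUST_ID, C.CUST_NAME;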



The advantages of SQL override in Informatica:

a) You can put a lot of logic in the SQL override. The SQL is executed on the source database and the results are pulled in by Informatica PowerCenter. Instead of reading all the records from the table, a SQL override lets you filter, aggregate, etc. on the source database and pull in only the result set.

b) Similarly, in lookups, instead of reading all the columns of a table, you can read only the required columns by overriding the SQL and removing all the unnecessary ports. You can also aggregate, filter, etc. the records that need to be read.

c) If you are comfortable with SQL, you may prefer this path instead of implementing in the mapping a lot of logic that can be achieved with SQL.



Thursday, 12 November 2015

Indirect file processing (filelist) in Informatica Cloud (ICS) - Workaround


I had a scenario where I needed to read a list of files (a filelist) in Informatica Cloud. My research pointed to a couple of articles on the Informatica community: apparently there is a template we can work with, or we need to use a filelist connector. I could not find either, and so I had to come up with a different strategy.

The strategy was:
1) Use PowerCenter to create the mapping and a workflow that reads from the filelist and writes to an Oracle target.
2) Export the workflow from the Repository Manager as an XML file.
3) Import it into Informatica Cloud as a PowerCenter task, map the connections to ICS connections, and run it from Informatica Cloud.
 
In Informatica PowerCenter we read a filelist by setting the source file type property in the session to Indirect, as shown in the screenshot below. Using the same strategy, I created my mapping and wrote the records to the Oracle target. Once I had tested the workflow, I exported it to Informatica Cloud and it worked.
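
For reference, the filelist itself is just a text file with one source file per line (full paths are allowed); the file names below are made up:

C:\data\customers_20151101.csv
C:\data\customers_20151102.csv
C:\data\customers_20151103.csv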

When you import into Informatica Cloud, you can map the connection of the Oracle target to the connection used in the cloud. Make sure your mapping has only the supported cloud transformations.

Also, the source filelist and the source files have to be in a location that is accessible to Informatica Cloud (ICS).



Monday, 9 November 2015

SQL transformation in Informatica for executing SQL scripts and dynamic SQLs


If you have some SQL statements in a file and need to execute them on a database using Informatica, or if you want a transformation that creates dynamic SQL, i.e. SQL that takes a parameter and returns output based on that input parameter, then the SQL transformation is the way to go.

The SQL transformation operates in Script mode or Query mode. In Query mode you write a query in the SQL transformation and specify a parameter in the query that is bound to an input port; depending on the value of the input port, the result changes. For example, if NAME is an input port of the SQL transformation, the query could be SELECT DEPARTMENT FROM DEPARTMENT_TABLE WHERE FIRST_NAME = ?NAME?. This returns the department name for every NAME that passes through the SQL transformation.

If you want to execute a bunch of SQL statements stored in script files, create the SQL transformation in Script mode. Read the script file names from a flat file and pass the corresponding field to the ScriptName port of the SQL transformation. The SQL transformation then executes every script whose name it receives.

For example, if a script file contains:

CREATE TABLE MYTABLE AS SELECT * FROM MYSTAGING_TABLE;

INSERT INTO MYTABLE VALUES ('xxx', 1, 2, 3);

The SQL transformation will execute both statements one after the other and return the script result and any error outputs. A sample mapping is shown in the screenshot below:





Friday, 6 November 2015

How to identify the last record in Informatica Cloud or PowerCenter?


I had a scenario where I had to add a comma (,) at the end of each line I was generating and loading to the target, except for the last line, which should not get the comma. That is not as easy as it sounds: Informatica PowerCenter has no indicator for the last line read from the source. How do we accomplish this? If you have access to Unix scripts, you would most likely write a script to remove the comma from the target file after it is generated. I work in a Windows environment these days and don't have the option of writing complex scripts, so this is how I solved the problem: using an Aggregator with the LAST function and a Joiner to join the last-record indicator with the rest of the records.


I created two source definitions reading from the same file. The first source reads all the records; after the second source I added an Aggregator that passes only the last line, using the LAST function.


The LAST function in the Aggregator sends the last record. I used a field called LAST_RECORD whose value is LAST(ID1 || ID2) in the Aggregator. ID1 and ID2 are the two primary key fields of every record read from the source, and the LAST function returns the last value of ID1 concatenated with ID2. A single record comes out of this Aggregator: the last record read from the source.


For all the records read from the first source, I created a field called FIELD_CONCAT that holds the value ID1 || ID2, the concatenation of the two values.


Now all the records coming from the two sources are joined on the FIELD_CONCAT field from the first source and the LAST_RECORD field from the second source. This is a full outer join, so every record coming out of it has a NULL value for LAST_RECORD except the one record whose key equals the LAST(ID1 || ID2) value. You can then check which record has a non-null LAST_RECORD value and remove the comma from that record, which was my original requirement for the last record.

One remaining problem was that the Joiner was not sending the records in the same order as they were read from the first source. Hence I had to add a sequence number field to each record and sort on that field so that the output is written in the same order as the input. A sample mapping for identifying the last line read is shown below; it is an Informatica Cloud mapping.
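
For illustration, the key expressions in the mapping look roughly like this (ID1, ID2, FIELD_CONCAT, and LAST_RECORD are the ports described above; LINE and OUT_LINE are made-up ports holding the output text):

-- On every record from the first source:
FIELD_CONCAT = ID1 || ID2
-- In the Aggregator on the second source (outputs a single row):
LAST_RECORD = LAST(ID1 || ID2)
-- After the full outer join on FIELD_CONCAT = LAST_RECORD:
OUT_LINE = IIF(ISNULL(LAST_RECORD), LINE || ',', LINE)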



Monday, 2 November 2015

Estimating database and table size in Netezza



The nz_db_size utility, present in /nz/support/contrib/3.1/bin/, can be used to estimate the total number of bytes on disk occupied by a given database and the tables belonging to it:


Usage is as follows:

nz_db_size [ -s ] [ database [ table ] ]

E.g.:

[nz@nps12345 bin]$ ./nz_db_size MYTESTDB MYTABLE


Object    | Name     | Bytes       | KB        | MB      | GB    | TB
----------+----------+-------------+-----------+---------+-------+-----
Appliance | nps1234  | 134,934,002 | 136,083   | 132,894 | 192.8 | .1
Database  | MYTESTDB | 123,245,56  | 12,345,67 | 1000    | 1230  | .0
Table     | MYTABLE  | 1,206,720   | 1,178     | 1       | .0    | .0


To get summary stats for a database, use the command below:

[nz@mpc-nps]$ nz_db_size -summary -s

Object    | Name     | Bytes            | KB                | MB        | GB     | TB
----------+----------+------------------+-------------------+-----------+--------+-----
Appliance | mpc-nps  | 1,34343,2434,24  | 1,232,32,23232,34 | 2,167,756 | 4,2267 | 2.1
Database  | MYTESTDB | 123,2323,3232,33 | 123,345,343       | 456,334   | 686.0  | .7


nzstats is another command that can be used to get the full statistics:

For example: nzstats show -type table


Saturday, 24 October 2015

Limitations of Hive mode in Informatica BDE (Informatica 9.6.1 Hotfix 1)


Informatica BDE in Hive mode has some limitations compared to the normal PowerCenter mode, called native mode, basically because in Hive mode the whole mapping logic is converted to Hive SQL. Hive has limitations, such as no support for updates, which are reflected in Informatica BDE in Hive mode. Below is the list of limitations of Informatica BDE in Hive mode:

1) There is no support for unconnected lookups, sequence generators, or sorters in Hive mode. The reason is that Hive mode does not support statefulness, i.e. it cannot keep track of values from previous records. Write properties such as insert, update, and delete are ignored, and the load is always a truncate-and-load.

2) For the same reason, it does not support stateful variables, i.e. you cannot have variables that keep values from previous records.

3) Some Hive versions do not support the timestamp and date types. If you are working with those Hive versions, you cannot use the timestamp or date data types.

4) Functions such as abort, error, md5, crc32, max, first, last, aesdecrypt, aesencrypt, etc. are not valid in the Hive environment or have some limitations.

5) Pre-SQL and post-SQL do not work in Hive mode.

6) Data types such as arrays, maps, and unions may be converted to strings in the Hive data object.

7) Mapping variables and parameters are not supported.

8) The mappings always run in low precision mode.

 

Even with all these limitations, BDE is still a very useful tool for building mappings on the Hadoop environment and reduces development time to a great extent. Developers just need to be a bit innovative when working around these limitations. One has to be aware that all the ETL logic gets converted to Hive SQL, so whatever cannot be converted to Hive SQL cannot be supported by this tool.