Showing posts with label ETL. Show all posts
Showing posts with label ETL. Show all posts

Monday, 9 September 2019

Difference between posexplode and explode in Hive



Difference between posexplode and explode in Hive

In the below example, the numbers column has a list of values.


with t1 as (
select 1 as id, 'deepak' as name, '12345, 56789, 7892334, 6263636, 7181818, 1761761717' as numbers )

select * from t1



Explode is used to create multiple rows from a list value in a column. In the below example, you have broken the list in numbers fields into multiple rows.


with t1 as (
select 1 as id, 'deepak' as name, '12345, 56789, 7892334, 6263636, 7181818, 1761761717' as numbers )

select * from t1
LATERAL VIEW explode(split(numbers, ',')) test










PosExplode is used to create multiple rows from a list value in a column. In the below example, you have broken the list in numbers fields into multiple rows along with the column called pos that contains the  position of the value in the list.

with t1 as (
select 1 as id, 'deepak' as name, '12345, 56789, 7892334, 6263636, 7181818, 1761761717' as numbers )

select * from t1
LATERAL VIEW posexplode(split(numbers, ',')) test






Friday, 3 February 2017

Sample parameter file and using parameter file in Informatica

Sample parameter file and using parameter file in Informatica

Below is a sample of a simple parameter file. All the global variables used by all the sessions are defined under [Global] and the session specific variable overrides are defined under session names.

-----------------------------------------------------
[Global]
$DBConnection_VSTP=VSTPDW_HIST_US

[s_MYSQL_EXT_Load5]
$$BATCH_KEY=204
$$YEAR=2016
$InputFile_Name=$$PMRootDir\SrcFiles\SRC1.EXT.G2016.txt

[s_SRC2_EXT_Load]
$$BATCH_KEY=204
$$YEAR=2016
$InputFile_Name=$$PMRootDir\SrcFiles\SRC2.EXT.G2016.txt

[s_SRC3_EXT_Load]
$$BATCH_KEY=204
$$YEAR=2016
$InputFile_Name=$$PMRootDir\SrcFiles\SRC2.EXT.G2016.txt

[s_SRC4_EXT_Load]
$$BATCH_KEY=204
$$YEAR=2016
$InputFile_Name=$$PMRootDir\SrcFiles\SRC14.EXT.G2016.txt
-------------------------------------------------------------------------

The parameter file can be set at session level or at workflow level. The worklevel parameter applies to all the session unless it is over riden by another parameter file at session level.


Thursday, 3 November 2016

Find all columns with only nulls in SQL Server table

Find all column fields with only nulls in SQL Server table


The piece of code below will give list of all columns that have only nulls in a give table. Replace YourTableName with your tablename in the code below. This is variation of code that I found here (
http://stackoverflow.com/questions/63291/sql-select-columns-with-null-values-only) 

CREATE TABLE #NullColumns (ColumnName Varchar(100))
DECLARE @TableName Varchar(100)
SET @TableName='YourTableName'
DECLARE crs CURSOR LOCAL FAST_FORWARD FOR SELECT name FROM syscolumns WHERE id=OBJECT_ID(@TableName)
OPEN crs
DECLARE @name sysname
FETCH NEXT FROM crs INTO @name
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC('INSERT INTO #NullColumns (ColumnName) SELECT ''' + @name + ''' WHERE NOT EXISTS (SELECT * FROM ' + @TableName + ' WHERE ' + @name + ' IS NOT NULL)')
    FETCH NEXT FROM crs INTO @name
END
CLOSE crs
DEALLOCATE crs
SELECT * FROM #NullColumns
DROP TABLE #NullColumns

Wednesday, 26 October 2016

Static vs Dynamic Lookup in Informatica

Difference between Static vs Dynamic Lookup In Informatica

Static Lookups: 

These are are used when you want to lookup a value on a static table or a SQL query based on certain matching conditions. It caches the values when the static lookup in created at the begining of the execution of the session. If the values in the lookup tables are changing and you want the changed values to be reflected during execution of the session then you need to use dynamic cache. 
When you create a lookup transformation it by defaults creates a static cache. Static lookups can be connected or unconnected depending on weather you want multiple or single value to be returned from the lookup.  

Dynamic Lookups: 

These kind of lookups are required when you want the changes happening in the lookup table to be reflected during the execution of the mapping/session. You can create dynamic lookups by setting Dynamic lookup Cache property in the properties section of the lookup transformation. When you make the lookup dynamic, you will observe in ports tab that a new field called NewLookupRow is created automatically. This NewlookupRow port will indicate if a row has been inserted or updated in the cache. In the dynamic lookup the records gets inserted or updated in the cache depending on weather the rows existed or not existed in the cached. 


Use of NewLookupRow in Dynamic Cache:

NewLookupRow port can have the following values:-
0 - Indicates that powercenter did not update or insert the row in the cache.
1 - Indicates that powercenter inserted the row into the cache because it is a new row
2-  Indicates that powercenter updated the row in the cache because the lookup value has changed.


Using the above values you can decide using update or filter transformation what you want to do with those rows coming of the source. Like for example if you want to insert it into the target table or update it? If NewLookupRow=1 then insert it to the target table or if it equal to 2 then update the record in the target table.




Thursday, 6 October 2016

SQL server and azure troubleshooting - common issues

 SQL server and azure troubleshooting - common issues

1) Msg 245, Level 16, State 1, Line 2

Conversion failed when converting the varchar value '000005EW84' to data type int.


Resolution: this issue happens when there is character in the column that you are converting to integer. Use a case statement like below to remove those records that have a column. In the example below policy_number is the column that has characters.

case when isnumeric(policy_number) = 1 then cast(policy_number as int) else 0 end as policy_num


2) Cannot insert explicit value for identity column in table 'Mytable_Dim' when IDENTITY_INSERT is set to OFF.

Resolution: You're inserting values for OperationId that is an identity column.

You can turn on identity insert on the table like this so that you can specify your own identity values.

SET IDENTITY_INSERT Table1 ON

INSERT INTO Table1
/*Note the column list is REQUIRED here, not optional*/
            (Mydata_SID,
             MyDate_Desc)
VALUES      (20,
             ''My data")

SET IDENTITY_INSERT Table1 OFF

3)Msg 515, Level 16, State 2, Line 45

Cannot insert the value NULL into column 'Party1_SID', table 'dbo.Party_Dim'; column does not allow nulls. INSERT fails.


Resolution: The column Party1_Sid does not accept null values.  Insert a value of -1 to the column whenever the value is null. Isnull() function can be used to test if a column value is null.

4) Msg 120, Level 15, State 1, Line 45 The select list for the INSERT statement contains fewer items than the insert list.


Resolution:The number of SELECT values must match the number of INSERT columns.
Match the number of columns on select and insert statements..

5) Msg 264, Level 16, State 1, Line 45

The column name 'Party1_SID' is specified more than once in the SET clause or column list of an INSERT.



Resolution: Remove any duplication of columns in the query. A column cannot be assigned more than one value in the same clause. Modify the clause to make sure that a column is updated only once. If this statement updates or inserts columns into a view, column aliasing can conceal the duplication in your code.

6) Arithmetic overflow error converting numeric to data type numeric.


Resolution: Increase the numeric precision. If there are datatype such as  numeric(11,2) you might have to make it high so that all the values can fit into the datatype. For example, make it numeric(20,2).


7) Msg 1033, Level 15, State 1, Procedure test_vw, Line 56
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.

 Resolution: Remove order by from the view statement.

 

Thursday, 25 August 2016

Improving performance of Inserts, updates and deletes in SQL server

How to make Inserts , updates and deletes faster in SQL server  - Microsoft Azure?

I tried to delete some records from a table with 100 millions records and it took forever. One of the reason being there was no index on the column I used in the where clause of the delete statement and other reason being the table was so big. So deleting a big chunk of the data from such big tables can take forever since the data gets written to transaction logs and the table has to scan for all the records that match the where clause condition.

How to delete records quickly from huge tables in sql server on Azure? 

1) Use only the column that has index defined in the where clause condition. May be this will help find the records to be deleted quickly.
2) Delete records in small chunks using the loop below. This prevents transaction logs from growing rapidly.  @@rowcount always returns the number of records affected from delete.

DELETE TOP (100000) from dbo.mytable WHERE name='dwbi'
WHILE @@rowcount > 0
BEGIN
 
  DELETE TOP (100000) from dbo.mytable
    WHERE name='dwbi'
 
END
 
3)  Insert data into temp table with no index on it using the columns that need not be deleted , delete the original table, rename the temp table back to the original table and add back the index on the newly renamed temp table.

select * into tmp_mytable from mytable where name!='dwbi'
drop mytable
sp_rename 'tmp_mytable', 'mytable'


How to insert records quickly from huge tables in sql server on Azure? 

My records to insert records into a huge table improved a lot after I disabled the non clustered indexes and rebuild it after the insert. The steps are below:

ALTER INDEX [IX_MYTABLE_User_SID] ON dbo.[MYTABLE] DISABLE
Insert into [dbo].[mytable] select * from [dbo].[stg_table]

ALTER INDEX [IX_MYTABLE_UserSID] ON dbo.[MYTABLE] REBUILD

How to update records quickly from huge tables in sql server on Azure? 

My  usual way of updating records used to be something like this:

update dbo.mytablename set name= (
select dbo.TMP_table_name.name
from TMP_table_name
where dbo.mytablename.party_sid=TMP_table_name.party_sid)

The above update statement takes a lot of time because it updates all the records even those that do not match. The easier and faster way to update is using the statement below:  This will update only the matching records and it is pretty quicker.

update dbo.mytablename set
dbo.mytablename.name=dbo.TMP_table_name.name
from TMP_table_name
where dbo.mytablename.party_sid=TMP_table_name.party_sid


Friday, 8 July 2016

Performance tuning of Informatica Big Data Edition Mapping

Performance tuning of Informatica Big Data Edition Mapping



Below are the list of performance tuning steps that can be done in Informatica Big Data Edition:


1)  When using a look up transformation only when the lookup table is small. Lookup data is copied to each node and hence it is slow.


2) Use Joiners instead of lookup for large data sets.


3) Join large data sets before small datasets. Reduce the number of times the large datasets are joined in Informatica BDE.


4) Since Hadoop does not allow updates, you will have to rebuild the target table whenever the record is updated in a target table. Instead of rebuilding  the whole table, consider rebuilding only the impacted partitions.


5) Hive slower with any non string data type. It needs to create temp tables to do the conversion to and from the non string data type to string data type. Use non string data type only when required.


6) Use the data type precision close the actual data. Using higher precision slows down the performance of Informatica BDE.


7) Map only the ports that are required in the mapping transformation or loaded to target. Less number of ports means better performance and less data reads.









Thursday, 7 July 2016

Useful Queries for troubleshooting amazon redshift

USEFUL QUERIES FOR TROUBLESHOOTING IN AMAZON REDSHIFT 

Here are some of my queries for troubleshooting in amazon redshift. I have collected this from different sources.

TO CHECK LIST OF RUNNING QUERIES AND USERNAMES:

select a.userid, cast(u.usename as varchar(100)), a.query, a.label, a.pid, a.starttime, b.duration,
b.duration/1000000 as duration_sec, b.query as querytext
from stv_inflight a, stv_recents b, pg_user u
where a.pid = b.pid and a.userid = u.usesysid




select pid, trim(user_name), starttime, substring(query,1,20) from stv_recents where status='Running'

TO CANCEL A RUNNING QUERY:

cancel <pid>


You can get pid from one of the queries above used to check running queries.


TO LOOK FOR ALERTS:

select * from STL_ALERT_EVENT_LOG
where query = 1011
order by event_time desc
limit 100;


TO CHECK TABLE SIZE:

select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,
trim(a.name) as Table, b.mbytes, a.rows
from ( select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name ) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
order by b.mbytes desc, a.db_id, a.name;


TO CHECK FOR TABLE COMPRESSION:

analyze <tablename>;
analyze compression <tablename>;



TO ANALYZE ENCODING:

select "column", type, encoding
from pg_table_def where tablename = 'biglist';



TO CHECK LIST OF FILES COPIED:

select * from stl_load_errors

select * from stl_load_commits


select query, trim(filename) as file, curtime as updated, *
from stl_load_commits
where query = pg_last_copy_id();


TO CHECK LOAD ERRORS

select d.query, substring(d.filename,14,20),
d.line_number as line,
substring(d.value,1,16) as value,
substring(le.err_reason,1,48) as err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
and d.query = pg_last_copy_id();


TO CHECK FOR DISKSPACE USED IN REDSHIFT:

select owner as node, diskno, used, capacity
from stv_partitions
order by 1, 2, 3, 4;
select query, trim(querytxt) as sqlquery
from stl_query
order by query desc limit 5;


SOME IMPORTANT AWS COMMANDS:

To resize the redshift cluster (node type and number of nodes always required):

aws  redshift modify-cluster --cluster-identifier <cluster name> --node-type dw2.8xlarge --number-of-nodes 3

To get filelist on S3:

aws s3 ls $BUCKET/  > ./filecount.out

To get status of cluster and other information of cluster in text format:

aws redshift describe-clusters --output text   


Tuesday, 29 March 2016

Importing Cobol copy book definition , and reading and troubleshooting cobol file issues in Informatica

Importing Cobol copy book definition in Informatica and troubleshooting cobol file issues

 To read a cobol file, first you need to create the source definition and then read the file the data from the cobol input file. Source definitions are created from cobol copy book definitions.  Cobol copy book has definition of all the fields present in the data file along with the level and occurence information.

How to import a cobol copy book in Informatica?


Create source definition using Source-> Importing From Cobol  file option to create the source definition from cobol copy book.

Insert the source definition into mapping. Instead of source qualifier it is going to create normalizer transformation.Read from the normalizer and write to any target.


Details on how to load data from cobol copy book is in this video that I found on youtube:

https://www.youtube.com/watch?v=WeACh8AWQv8 
 

Troubleshooting copy book import in informatica:

Reading from cobol file is not always straightforward. You will have to sometimes edit the copy book to get it working. Some of the messages I had while creating source definitions are below:

Error messages seen:

05 Error at line 176 : parse error

Error: FD FILE-ONE ignored - invalid or incomplete.

05 Error: Record not starting with 01 level at line 7


Resolution for copy book import issues In Informatica:

1) Add the headers and footers shown in blue below even if they are blank

        identification division.
        program-id. mead.
        environment division.
        select file-one assign to "fname".
       data division.
       file section.
       fd  FILE-ONE.

        01  HOSPITAL-REC.
           02  DEPT-ID                PIC 9(15).
           02  DEPT-NM                PIC X(25).
           02  DEPT-REC                OCCURS 2 TIMES.
               03  AREA-ID             PIC 9(5).
               03  AREA-NM             PIC X(25).
               03  FLOOR-REC       OCCURS 5 TIMES.
                   04  FLOOR-ID       PIC 9(15).                                
                   04  FLOOR-NM       PIC X(25).
      working-storage section.
       procedure division.
           stop run.


2)  All the fields should start from column 8. Create 8 spaces before each field.
3) Make sure the field names do not start with *. It has to have a level number at the beginning.
4) Level 01 has to be defined as the first record. If level 01 not there then it would not import the cbl file.
5) Special levels such as 88 , 66 might not get imported. This apparently is limitation of powercenter. I could not find a way around it. 


Resolution for cobol file issues In Informatica:

I had such a hard time reading a cobol file that was in EBCDIC format with comp-3 fields in it. I thought the issues are mostly in importing the cobol copy book but even more issues when reading the cobol file. It is supposed to be straight forward in Informatica powercenter. The list of issues I faced.

1) When reading from cobol mainfram file the comp-3 fields were not read properly.
2) The alignments of the fields were not correct. The values were moving to next columns and had a cascading effect.

To resolve the above two issues I asked the mainframe file to be sent in binary format. Apparently files ftp'ed in ascii format from mainframe causes all the above issue. The comp-3 fields are not transferred properly in Ascii mode of ftp. 

Now with the binary file and below properties shown in screen it was supposed to work correctly. It was not. It was all garbage because of a weird issue. The properties I set for code page in source properties in source designer was for some reason not getting reflected in the session properties for that source. I had to manually change in the session- mapping -> source- file properties -advanced tab to code page IBM EBCDIC US English IBM037 and Number of Bytes to skip between records to 0 and it finally worked.

   
 To summarize when reading cobol file:
a) Ask for the file to be ftp'ed in binary format from the mainframe.
2) Set the code page and number of bytes to skip appropriately as shown in the screenshot in the session properties.

This should solve most of your problems with comp-3 fields and alignments ,etc. 

If not try unchecking IBM comp-3 option in the source properties from source designer. Try changing the code page and number of bytes to skip,  etc and hopefully it works.    

If you are seeing dots at the end of some comp-3 fields then most likely you are writing numeric data into varchar field. If you write it to numeric fields in the database those dots should disappear. 

Another issue with Informatica power center is that it can read only fixed width files. if you have occurs section with depending on condition then most likely you have a variable length file and Informatica power center does not work correctly with variable length files. Ask the mainframe guys to convert it to fixed width files. 

  

Friday, 4 March 2016

Mapping variables and parameters in Informatica Mapping and session

Mapping variables and parameters in Informatica Mapping

1) Informatica Mapping Parameters:

You use mapping parameters to supply parameter values to informatica mappings. It could be values such as Batch_Key, Batch_Name etc.The value of the mapping parameters do not change for the entire execution time of a session. The mapping parameters can be defined in the mapping as shown in the screenshot below in the mapping designer.The same mapping parameters needs to be defined in the parameter file too.

For example, the parameter file would have:

$$BATCH_KEY=301

In the mapping designer, you would define the $$BATCH_KEY in the mapping and parameters window. During execution the session will associate the value 301 where it finds $$BATCH_KEY in the mapping. It might also be necessary to define the same parameter in the worflow under the variables tab.



2) Informatica Mapping Variables:

Mapping variable represents a value that can change through the session. Mapping variables can also be defined under the mapping and parameters window of the mapping designer just like parameters (shown in screenshot above). It can also be initialized using the parameter file. However the values of a variable can be changed during execution. The value of the mapping variables can be changed in an expression transformation using setvariable functions. For example:

SETVARIABLE($$BATCH_KEY, 302)  would change the value of $$BATCH_KEY to 302 during execution.

SETVARIABLE function can be used in expression, filter, router, and update strategy.Integration Service saves the value of a mapping variable to the repository at the end of each successful session run and uses that value the next time you run the session.

Postsession and presession variable assignment in session can be used to pass the value of the mapping variable to the session/workflow variables.This allows using the values of the variable at workflow level as shown in screenshot below.   

3)Informatica Session Parameters:

Session parameters represent values that can change between session runs, such as database connections or source and target files. E.g.: $DBConnectionName, $InputFileName, etc. These paramters needs to defined in the parameter file.

4)Informatica Workflow variables:

All the parameters, variables that have to be used in the workflow needs to be defined in the workflow in the window below. This can be all the mapping parameters, session parameters, etc.




Friday, 12 February 2016

What is informatica cloud services?

What is Informatica cloud services ?

Informatica cloud service is an Informatica product that is web based and on the cloud solution for data integration. The mappings and any data integration tasks can be created on the web based tool and executed/monitored from the web based tool. A secure agent is installed on the local environment that allows reading/writing data  to/from the cloud. Connections can be created to database such as amazon redshift, sql server, oracle, netezza etc and network folders. This product allows ETL tasks to be created on the web without having a power center server location on your premise. Hence, providing a cheaper option for creating Informatica tasks.



The different components of the Informatica cloud service (ICS) tool sets are listed below:
.

Data synchronization and Data replication: 

As the name implies it allows database synchronization. Data is copied from the source to the targets along with application of any data filters. One table or multiple tables can be synchronized. ICS provides inbuilt connectors using which connections to the source and target tables can be created.



Mapping configuration:  

Allows a task to be created from a mappings. The parameters files, variables, post processing commands, sessions settings etc can be defined for the task. This is similar to the session in the informatica power center.

Power Center:  

This component allows Informatica power center workflows to be imported and executed from the Informatica cloud service. The source and target connections from the power center workflow can be mapped to the connections available on Informatica cloud service. When importing informatica power center workflows make sure the workflow is exported from the repository manager of power center. Otherwise it shows some error while importing the task.

Mappings

This component allows to create informatica ETL mappings similar to Informatica power center. Not all the features of Informatica power center is available in Informatica cloud service. Features like SQL transformation, union, etc are not provided in ICS. If you need additional features then you might have to create a mapplet in power center and import that task as a mapplet from powercenter and add it to the ICS mapping. ICS provides transformations such as expression, joiner, filter, lookup, sorter, aggregator, mapplet, and normalizer.




Task Flows

Task flows allows to create a sequence of jobs to be executed. All the mapping configuration tasks created can be made to execute in a sequence. This is similar to the workflow in informatica power center.

Integration templates

This component allows Informatica mapping templates created using Microsoft visio to be imported and applied to mappings in ICS.

Activity Log/Monitor

This component allows to monitor all the executing tasks and as well see the completed tasks. It provides information on the number of source and target rows, session logs, etc.

Mapplets

ICS mapplets work similar to Informatica power center mapplets. Mostly this allows powercenter mapplets to be imported to ICS.
  

Connections

 This components allows connections to be created to flat files folders, databases, etc. Wide range of connectors are available.

Friday, 11 December 2015

SQL override in Source qualifier and look up overide in Informatica

SQL override in Source qualifier and look ups in Informatica



SQL override is commonly used by ETL developers in Source qualifiers and Lookup transformation. To override SQL in source qualifier you can generate the SQL from the properties tab of the source qualifier and do all the necessary modifications to the SQL. All you have to keep in mind is the columns in the select statement match the order of the ports in the source qualifier. The ports have to be mapped to the transformation next to the source qualifier. The diagram below shows a simple SQL override. Similarly SQLs can be override in the lookups too.



The advantages of SQL override in Informatica:

a) You can put a lot of logic in the SQL override. The SQL will be executed in the source database and the results will be pulled in by the Informatica powercenter. Instead of reading all the records from the table, by writing SQL override you can filter, aggregate, etc on the source database and pull in the result set.

b) Similarly in the Lookups instead of reading all the columns from a table, you can read only the required columns from the table by doing SQL override and removing all the unnecessary ports. Also you can aggregate , filter, etc on the records that needs to be read.

c) If you are comfortable with SQL you would prefer this path instead of implementing a lot of stuff that can be achieved with SQLs.