Thursday, 3 November 2016

Find all columns with only nulls in SQL Server table

Find all column fields with only nulls in SQL Server table


The piece of code below will give a list of all columns that contain only nulls in a given table. Replace YourTableName with your table name in the code below. This is a variation of the code that I found here:
http://stackoverflow.com/questions/63291/sql-select-columns-with-null-values-only

CREATE TABLE #NullColumns (ColumnName VARCHAR(100))
DECLARE @TableName VARCHAR(100)
SET @TableName = 'YourTableName'   -- replace with your table name
-- Cursor over every column of the table
DECLARE crs CURSOR LOCAL FAST_FORWARD FOR SELECT name FROM syscolumns WHERE id = OBJECT_ID(@TableName)
OPEN crs
DECLARE @name sysname
FETCH NEXT FROM crs INTO @name
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Record the column name only if no row has a non-NULL value in that column
    EXEC('INSERT INTO #NullColumns (ColumnName) SELECT ''' + @name + ''' WHERE NOT EXISTS (SELECT * FROM ' + @TableName + ' WHERE ' + @name + ' IS NOT NULL)')
    FETCH NEXT FROM crs INTO @name
END
CLOSE crs
DEALLOCATE crs
SELECT * FROM #NullColumns
DROP TABLE #NullColumns
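
If you prefer to avoid the cursor, the same idea can be sketched with the sys.columns catalog view and one dynamic statement per column. This is only a sketch: dbo.YourTableName is a placeholder, and QUOTENAME is used in case a column name needs bracketing.

DECLARE @TableName sysname = N'dbo.YourTableName'   -- placeholder, replace with your table
DECLARE @sql NVARCHAR(MAX) = N''
CREATE TABLE #NullOnlyColumns (ColumnName sysname)

-- Build one INSERT ... WHERE NOT EXISTS statement per column
SELECT @sql = @sql
    + 'INSERT INTO #NullOnlyColumns SELECT ''' + c.name + ''' WHERE NOT EXISTS '
    + '(SELECT 1 FROM ' + @TableName + ' WHERE ' + QUOTENAME(c.name) + ' IS NOT NULL);' + CHAR(10)
FROM sys.columns c
WHERE c.object_id = OBJECT_ID(@TableName)

EXEC sp_executesql @sql
SELECT ColumnName FROM #NullOnlyColumns
DROP TABLE #NullOnlyColumns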

Wednesday, 26 October 2016

Static vs Dynamic Lookup in Informatica

Difference between Static vs Dynamic Lookup In Informatica

Static Lookups: 

These are used when you want to look up a value in a static table or a SQL query based on certain matching conditions. The lookup values are cached when the static lookup is created at the beginning of the session execution. If the values in the lookup table are changing and you want the changed values to be reflected during the execution of the session, then you need to use a dynamic cache.
When you create a lookup transformation it creates a static cache by default. Static lookups can be connected or unconnected depending on whether you want multiple values or a single value to be returned from the lookup.

Dynamic Lookups: 

These kinds of lookups are required when you want the changes happening in the lookup table to be reflected during the execution of the mapping/session. You can create dynamic lookups by setting the Dynamic Lookup Cache property in the properties section of the lookup transformation. When you make the lookup dynamic, you will observe in the ports tab that a new field called NewLookupRow is created automatically. This NewLookupRow port indicates whether a row has been inserted or updated in the cache. In a dynamic lookup the records get inserted or updated in the cache depending on whether the rows already existed in the cache.


Use of NewLookupRow in Dynamic Cache:

NewLookupRow port can have the following values:
0 - Indicates that PowerCenter did not update or insert the row in the cache.
1 - Indicates that PowerCenter inserted the row into the cache because it is a new row.
2 - Indicates that PowerCenter updated the row in the cache because the lookup value has changed.


Using the above values, you can decide with an update strategy or filter transformation what to do with the rows coming from the source, for example whether to insert them into the target table or update them: if NewLookupRow = 1 then insert the row into the target table, and if it is equal to 2 then update the record in the target table. A sketch of such an update strategy expression is shown below.
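
For illustration, a hedged sketch of an update strategy expression based on NewLookupRow (DD_INSERT, DD_UPDATE and DD_REJECT are the standard update strategy constants; the else branch here is an assumption, you may prefer a different default):

IIF(NewLookupRow = 1, DD_INSERT, IIF(NewLookupRow = 2, DD_UPDATE, DD_REJECT))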




Monday, 24 October 2016

How to Loop in SQL server?

How to Loop in SQL server? 


The example below shows how to loop in Microsoft SQL Server and insert records in chunks, which makes large inserts faster.

SET IDENTITY_INSERT dbo.table1 ON /* This is required only if you are inserting explicit values into an identity column */

DECLARE @cnt INT = 2006;

WHILE @cnt < 2016
BEGIN
    /* One chunk (one reporting period) per iteration; replace this SELECT with your
       chunked INSERT ... SELECT so that each pass inserts only one period of data */
    SELECT * FROM table1 WHERE [Reporting_Period_SID] = @cnt
    SET @cnt = @cnt + 1
END

SET IDENTITY_INSERT dbo.table1 OFF

Selecting top 3 (n) or Max 3 (n) records for each group by condition


Use the example below to select the top 3 (n) records for each group. The row number is partitioned by Sales_Reporting_Period and then used to filter the top 3 records.

SELECT Company_Code, Run_SID, Curr_Ind, Sales_Reporting_Period FROM
(SELECT AF.Company_Code,
        BB.Run_SID,
        BB.Curr_Ind,
        BB.Sales_Reporting_Period,
        ROW_NUMBER() OVER (
            PARTITION BY BB.Sales_Reporting_Period
            ORDER BY BB.Sales_Reporting_Period, BB.Run_SID DESC
        ) AS RowNo
 FROM [dbo].[Run_Dim] BB,
      [dbo].[My_Reserve_Fact] AF
 WHERE AF.Run_SID = BB.Run_SID
 GROUP BY AF.Company_Code,
          BB.Sales_Reporting_Period,
          BB.Curr_Ind,
          BB.Run_SID) Q1
WHERE RowNo <= 3


Thursday, 6 October 2016

SQL server and azure troubleshooting - common issues

 SQL server and azure troubleshooting - common issues

1) Msg 245, Level 16, State 1, Line 2

Conversion failed when converting the varchar value '000005EW84' to data type int.


Resolution: This issue happens when there is a character value in the column that you are converting to integer. Use a CASE expression like the one below to handle the records that contain non-numeric values. In the example below, policy_number is the column that has characters.

case when isnumeric(policy_number) = 1 then cast(policy_number as int) else 0 end as policy_num
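
As a hedged illustration, here is the same expression inside a full query; dbo.policy_stg is a made-up table name. Note that ISNUMERIC can return 1 for values such as '1e5' that still fail a cast to INT, so on SQL Server 2012 or later TRY_CAST(policy_number AS INT) is a safer alternative.

SELECT CASE WHEN ISNUMERIC(policy_number) = 1
            THEN CAST(policy_number AS INT)
            ELSE 0
       END AS policy_num
FROM dbo.policy_stg   -- placeholder source table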


2) Cannot insert explicit value for identity column in table 'Mytable_Dim' when IDENTITY_INSERT is set to OFF.

Resolution: You are inserting explicit values into a column that is defined as an identity column.

You can turn on identity insert on the table like this so that you can specify your own identity values.

SET IDENTITY_INSERT Table1 ON

INSERT INTO Table1
/* Note: the column list is REQUIRED here, not optional */
            (Mydata_SID,
             MyDate_Desc)
VALUES      (20,
             'My data')

SET IDENTITY_INSERT Table1 OFF

3)Msg 515, Level 16, State 2, Line 45

Cannot insert the value NULL into column 'Party1_SID', table 'dbo.Party_Dim'; column does not allow nulls. INSERT fails.


Resolution: The column Party1_SID does not accept NULL values. Insert a value such as -1 into the column whenever the source value is null. The ISNULL() function can be used to substitute a default when a column value is null.
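
A minimal sketch, assuming a staging table named dbo.Party_Stg (a made-up name):

SELECT ISNULL(Party1_SID, -1) AS Party1_SID   -- substitute -1 when the source value is NULL
FROM dbo.Party_Stg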

4) Msg 120, Level 15, State 1, Line 45 The select list for the INSERT statement contains fewer items than the insert list.


Resolution: The number of SELECT values must match the number of INSERT columns. Match the number of columns in the SELECT and INSERT statements, as in the sketch below.
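
For example, a hedged sketch with made-up table and column names:

INSERT INTO dbo.Table1 (col_a, col_b)   -- two columns listed
SELECT src_a, src_b                     -- so the SELECT must return exactly two values
FROM dbo.Stage1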

5) Msg 264, Level 16, State 1, Line 45

The column name 'Party1_SID' is specified more than once in the SET clause or column list of an INSERT.



Resolution: Remove any duplication of columns in the query. A column cannot be assigned more than one value in the same clause. Modify the clause to make sure that a column is updated only once. If this statement updates or inserts columns into a view, column aliasing can conceal the duplication in your code.

6) Arithmetic overflow error converting numeric to data type numeric.


Resolution: Increase the numeric precision. If a column is defined with a data type such as numeric(11,2), you might have to make the precision higher so that all the values can fit. For example, make it numeric(20,2).
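
A minimal sketch of widening such a column; the table and column names are placeholders:

ALTER TABLE dbo.mytable ALTER COLUMN amount NUMERIC(20,2)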


7) Msg 1033, Level 15, State 1, Procedure test_vw, Line 56
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.

Resolution: Remove the ORDER BY clause from the view definition and apply it in the query that selects from the view instead.
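
For example (test_vw comes from the error message above; the column name is a placeholder):

SELECT * FROM dbo.test_vw ORDER BY some_column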

 

Thursday, 25 August 2016

Improving performance of Inserts, updates and deletes in SQL server

How to make inserts, updates and deletes faster in SQL Server - Microsoft Azure?

I tried to delete some records from a table with 100 million records and it took forever. One of the reasons was that there was no index on the column I used in the WHERE clause of the delete statement, and the other was that the table was so big. Deleting a big chunk of data from such big tables can take forever, since the data gets written to the transaction log and the table has to be scanned for all the records that match the WHERE clause condition.

How to delete records quickly from huge tables in sql server on Azure? 

1) Use a column that has an index defined on it in the WHERE clause condition. This helps SQL Server find the records to be deleted quickly.
2) Delete records in small chunks using the loop below. This prevents the transaction log from growing rapidly. @@ROWCOUNT returns the number of records affected by the last delete.

DELETE TOP (100000) FROM dbo.mytable WHERE name = 'dwbi'
WHILE @@ROWCOUNT > 0
BEGIN
    -- Keep deleting 100,000 rows at a time until no matching rows remain
    DELETE TOP (100000) FROM dbo.mytable
    WHERE name = 'dwbi'
END
 
3) Insert the rows that should be kept into a temp table with no index on it, drop the original table, rename the temp table back to the original table name, and add back the index on the newly renamed table.

select * into tmp_mytable from mytable where name != 'dwbi'
drop table mytable
exec sp_rename 'tmp_mytable', 'mytable'


How to insert records quickly from huge tables in sql server on Azure? 

My insert performance into a huge table improved a lot after I disabled the non-clustered indexes and rebuilt them after the insert. The steps are below:

ALTER INDEX [IX_MYTABLE_User_SID] ON dbo.[MYTABLE] DISABLE

INSERT INTO [dbo].[mytable] SELECT * FROM [dbo].[stg_table]

ALTER INDEX [IX_MYTABLE_User_SID] ON dbo.[MYTABLE] REBUILD

How to update records quickly from huge tables in sql server on Azure? 

My usual way of updating records used to be something like this:

update dbo.mytablename set name= (
select dbo.TMP_table_name.name
from TMP_table_name
where dbo.mytablename.party_sid=TMP_table_name.party_sid)

The above update statement takes a lot of time because it updates every record in the table, even those that do not match (setting name to NULL for rows with no match). The easier and faster way to update is the statement below: it updates only the matching records and is much quicker.

update dbo.mytablename set
dbo.mytablename.name=dbo.TMP_table_name.name
from TMP_table_name
where dbo.mytablename.party_sid=TMP_table_name.party_sid


Friday, 8 July 2016

Performance tuning of Informatica Big Data Edition Mapping

Performance tuning of Informatica Big Data Edition Mapping



Below are the list of performance tuning steps that can be done in Informatica Big Data Edition:


1) Use a lookup transformation only when the lookup table is small. Lookup data is copied to each node, so it is slow for large lookup tables.


2) Use Joiners instead of lookup for large data sets.


3) Join large data sets before small datasets. Reduce the number of times the large datasets are joined in Informatica BDE.


4) Since Hadoop does not allow updates, you will have to rebuild the target table whenever the record is updated in a target table. Instead of rebuilding  the whole table, consider rebuilding only the impacted partitions.


5) Hive is slower with any non-string data type. It needs to create temp tables to do the conversion to and from the non-string data type and the string data type. Use non-string data types only when required.


6) Use a data type precision close to the actual data. Using a higher precision than needed slows down the performance of Informatica BDE.


7) Map only the ports that are required in the mapping transformations or loaded to the target. Fewer ports mean better performance and fewer data reads.









Workarounds for Mapping variables and parameters and sequence generators and sorters in BDE

Mapping variables and parameters and sequence generators and sorters in BDE


Since there are no mapping variables/parameters and sequence generators in BDE, you can use the following workarounds:

For mapping variables and parameters, you can use a control table or a file instead: read the control table or file in the mapping and use its contents in your mapping. Create a lookup on the control table to get the values for all the parameters defined in it. You can update the control table at the end of the run if a parameter needs to be updated.



For the sequence generator, you can use UUID (Universally Unique Identifier) functions instead. These UUID values are alphanumeric; if you need numeric values only, then use Java functions.

HiveQL does sorting by default, i.e. Hadoop does the sorting, so you do not need a sorter unless you are using it with an aggregator that has the sorted input option; in that case you need to add a sorter to validate the mapping.

Thursday, 7 July 2016

Useful Queries for troubleshooting amazon redshift

USEFUL QUERIES FOR TROUBLESHOOTING IN AMAZON REDSHIFT 

Here are some of my queries for troubleshooting in Amazon Redshift. I have collected these from different sources.

TO CHECK LIST OF RUNNING QUERIES AND USERNAMES:

select a.userid, cast(u.usename as varchar(100)), a.query, a.label, a.pid, a.starttime, b.duration,
b.duration/1000000 as duration_sec, b.query as querytext
from stv_inflight a, stv_recents b, pg_user u
where a.pid = b.pid and a.userid = u.usesysid




select pid, trim(user_name), starttime, substring(query,1,20) from stv_recents where status='Running'

TO CANCEL A RUNNING QUERY:

cancel <pid>


You can get pid from one of the queries above used to check running queries.


TO LOOK FOR ALERTS:

select * from STL_ALERT_EVENT_LOG
where query = 1011
order by event_time desc
limit 100;


TO CHECK TABLE SIZE:

select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,
trim(a.name) as Table, b.mbytes, a.rows
from ( select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name ) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
order by b.mbytes desc, a.db_id, a.name;


TO CHECK FOR TABLE COMPRESSION:

analyze <tablename>;
analyze compression <tablename>;



TO ANALYZE ENCODING:

select "column", type, encoding
from pg_table_def where tablename = 'biglist';



TO CHECK LIST OF FILES COPIED:

select * from stl_load_errors

select * from stl_load_commits


select query, trim(filename) as file, curtime as updated, *
from stl_load_commits
where query = pg_last_copy_id();


TO CHECK LOAD ERRORS

select d.query, substring(d.filename,14,20),
d.line_number as line,
substring(d.value,1,16) as value,
substring(le.err_reason,1,48) as err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
and d.query = pg_last_copy_id();


TO CHECK FOR DISKSPACE USED IN REDSHIFT:

select owner as node, diskno, used, capacity
from stv_partitions
order by 1, 2, 3, 4;

TO CHECK THE LAST QUERIES EXECUTED:

select query, trim(querytxt) as sqlquery
from stl_query
order by query desc limit 5;


SOME IMPORTANT AWS COMMANDS:

To resize the redshift cluster (node type and number of nodes always required):

aws  redshift modify-cluster --cluster-identifier <cluster name> --node-type dw2.8xlarge --number-of-nodes 3

To get filelist on S3:

aws s3 ls $BUCKET/  > ./filecount.out

To get status of cluster and other information of cluster in text format:

aws redshift describe-clusters --output text   


Thursday, 28 April 2016

How to run stored procedure in Informatica cloud

How to run stored procedure in Informatica cloud?

There are multiple ways to run a stored procedure in Informatica Cloud.

If you want to call a stored procedure by passing an input value, reading the output from the stored procedure, and processing it in a mapping, then it is better to use the SQL transformation in the Informatica Cloud mapping designer, as shown in the picture below:


If you want to read the output of a stored procedure as a source in a mapping, then you can call the stored procedure as a query in the source, as shown in the screenshot below. The exec sp_help stored procedure is called in the source query. The stored procedure should return some data for this method to work. You might have to define the output variables in the stored procedure.
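
As a hedged sketch, the source query is just an EXEC statement; dbo.usp_get_sales and its parameter are made-up names, and any procedure that returns a result set would work:

EXEC dbo.usp_get_sales @report_year = 2016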



If you want to just execute a stored procedure once before or after executing a mapping, then you can call the stored procedure as a pre- or post-processing command in the mapping configuration, as shown in the screenshot below. SQLCMD of SQL Server is used to execute the stored procedure. The output of the stored procedure is written to an output file. The -b flag is used to return an error code and make the pre-processing command fail when the stored procedure fails. Use the session property 'Preprocessing command failure' and set it to stop if you want a failure to stop the processing of other jobs.
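
A minimal sketch of such a pre-processing command; the server, database, procedure, and output path are all made-up values:

sqlcmd -S myserver -d mydb -Q "EXEC dbo.usp_prepare_load" -b -o C:\temp\sp_output.txt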




Tuesday, 29 March 2016

Importing Cobol copy book definition , and reading and troubleshooting cobol file issues in Informatica

Importing Cobol copy book definition in Informatica and troubleshooting cobol file issues

To read a cobol file, first you need to create the source definition and then read the data from the cobol input file. Source definitions are created from cobol copy book definitions. A cobol copy book has the definition of all the fields present in the data file along with the level and occurrence information.

How to import a cobol copy book in Informatica?


Create the source definition using the Source -> Import from Cobol File option to create the source definition from the cobol copy book.

Insert the source definition into a mapping. Instead of a source qualifier it is going to create a normalizer transformation. Read from the normalizer and write to any target.


Details on how to load data from a cobol copy book are in this video that I found on YouTube:

https://www.youtube.com/watch?v=WeACh8AWQv8 
 

Troubleshooting copy book import in informatica:

Reading from a cobol file is not always straightforward. You will sometimes have to edit the copy book to get it working. Some of the messages I got while creating source definitions are below:

Error messages seen:

05 Error at line 176 : parse error

Error: FD FILE-ONE ignored - invalid or incomplete.

05 Error: Record not starting with 01 level at line 7


Resolution for copy book import issues In Informatica:

1) Add the headers and footers shown below even if they are blank

        identification division.
        program-id. mead.
        environment division.
        select file-one assign to "fname".
       data division.
       file section.
       fd  FILE-ONE.

        01  HOSPITAL-REC.
           02  DEPT-ID                PIC 9(15).
           02  DEPT-NM                PIC X(25).
           02  DEPT-REC                OCCURS 2 TIMES.
               03  AREA-ID             PIC 9(5).
               03  AREA-NM             PIC X(25).
               03  FLOOR-REC       OCCURS 5 TIMES.
                   04  FLOOR-ID       PIC 9(15).                                
                   04  FLOOR-NM       PIC X(25).
      working-storage section.
       procedure division.
           stop run.


2) All the fields should start from column 8. Create 8 spaces before each field.
3) Make sure the field names do not start with *. Each line has to have a level number at the beginning.
4) Level 01 has to be defined as the first record. If level 01 is not there, then it will not import the cbl file.
5) Special levels such as 88 and 66 might not get imported. This apparently is a limitation of PowerCenter. I could not find a way around it.


Resolution for cobol file issues In Informatica:

I had such a hard time reading a cobol file that was in EBCDIC format with comp-3 fields in it. I thought the issues were mostly in importing the cobol copy book, but there were even more issues when reading the cobol file itself. It is supposed to be straightforward in Informatica PowerCenter. The list of issues I faced:

1) When reading from the cobol mainframe file, the comp-3 fields were not read properly.
2) The alignment of the fields was not correct. The values were moving into the next columns and had a cascading effect.

To resolve the above two issues I asked for the mainframe file to be sent in binary format. Apparently files FTP'ed in ASCII format from the mainframe cause all of the above issues; the comp-3 fields are not transferred properly in ASCII mode of FTP.

Now with the binary file and the properties shown on the screen it was supposed to work correctly. It was not; it was all garbage because of a weird issue. The code page I set in the source properties in the source designer was for some reason not getting reflected in the session properties for that source. I had to manually change it in the session (Mapping -> Source -> File properties -> Advanced tab) to code page IBM EBCDIC US English IBM037 and set Number of bytes to skip between records to 0, and it finally worked.

   
To summarize, when reading a cobol file:
a) Ask for the file to be FTP'ed in binary format from the mainframe.
b) Set the code page and the number of bytes to skip appropriately, as shown in the screenshot, in the session properties.

This should solve most of your problems with comp-3 fields, alignment, etc.

If not, try unchecking the IBM comp-3 option in the source properties in the source designer. Try changing the code page, the number of bytes to skip, etc. and hopefully it works.

If you are seeing dots at the end of some comp-3 fields then most likely you are writing numeric data into a varchar field. If you write it to numeric fields in the database those dots should disappear.

Another issue with Informatica PowerCenter is that it can read only fixed-width files. If you have an OCCURS section with a DEPENDING ON condition then most likely you have a variable-length file, and Informatica PowerCenter does not work correctly with variable-length files. Ask the mainframe team to convert it to fixed-width files.

  

Friday, 4 March 2016

Mapping variables and parameters in Informatica Mapping and session

Mapping variables and parameters in Informatica Mapping

1) Informatica Mapping Parameters:

You use mapping parameters to supply parameter values to Informatica mappings. These could be values such as Batch_Key, Batch_Name, etc. The value of a mapping parameter does not change for the entire execution time of a session. The mapping parameters can be defined in the mapping in the mapping designer, as shown in the screenshot below. The same mapping parameters need to be defined in the parameter file too.

For example, the parameter file would have:

$$BATCH_KEY=301

In the mapping designer, you would define $$BATCH_KEY in the Mapping and Parameters window. During execution the session will associate the value 301 wherever it finds $$BATCH_KEY in the mapping. It might also be necessary to define the same parameter in the workflow under the Variables tab.
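
For illustration, a minimal parameter file entry might look like the sketch below; the folder, workflow, and session names are made up:

[MyFolder.WF:wf_load_batch.ST:s_m_load_batch]
$$BATCH_KEY=301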



2) Informatica Mapping Variables:

A mapping variable represents a value that can change during the session run. Mapping variables can also be defined in the Mapping and Parameters window of the mapping designer, just like parameters (shown in the screenshot above). They can also be initialized using the parameter file. However, the value of a variable can be changed during execution. The value of a mapping variable can be changed in an expression transformation using the SETVARIABLE function. For example:

SETVARIABLE($$BATCH_KEY, 302)  would change the value of $$BATCH_KEY to 302 during execution.

The SETVARIABLE function can be used in the expression, filter, router, and update strategy transformations. The Integration Service saves the value of a mapping variable to the repository at the end of each successful session run and uses that value the next time you run the session.

Post-session and pre-session variable assignment in the session can be used to pass the value of a mapping variable to session/workflow variables. This allows using the values of the variable at the workflow level, as shown in the screenshot below.

3)Informatica Session Parameters:

Session parameters represent values that can change between session runs, such as database connections or source and target files, e.g. $DBConnectionName, $InputFileName, etc. These parameters need to be defined in the parameter file.

4)Informatica Workflow variables:

All the parameters and variables that have to be used in the workflow need to be defined in the workflow in the window shown below. These can be mapping parameters, session parameters, etc.




Monday, 22 February 2016

Pushdown Optimization in Informatica Power Center and Big data edition



Pushdown Optimization in Informatica Power Center and Big data edition


Pushdown optimization is a feature in Informatica which enables an Informatica mapping to be converted into SQL code and executed on the database. It is useful when you are dealing with huge volumes of data from the source: instead of reading all the data to the Informatica server, it might be better to do some of the transformation logic, such as filtering, aggregation, and joins, on the database itself. It is also useful when you have a powerful data warehouse appliance such as Netezza or Teradata.

Friday, 12 February 2016

What is informatica cloud services?

What is Informatica cloud services ?

Informatica Cloud Services is an Informatica product that provides a web-based, cloud-hosted solution for data integration. Mappings and other data integration tasks can be created on the web-based tool and executed/monitored from it. A secure agent is installed in the local environment that allows reading/writing data to/from the cloud. Connections can be created to databases such as Amazon Redshift, SQL Server, Oracle, Netezza, etc. and to network folders. This product allows ETL tasks to be created on the web without having a PowerCenter server on your premises, hence providing a cheaper option for creating Informatica tasks.



The different components of the Informatica Cloud Services (ICS) tool set are listed below:

Data synchronization and Data replication: 

As the name implies, these allow database synchronization. Data is copied from the source to the target along with the application of any data filters. One table or multiple tables can be synchronized. ICS provides built-in connectors with which connections to the source and target tables can be created.



Mapping configuration:  

Allows a task to be created from a mapping. The parameter files, variables, post-processing commands, session settings, etc. can be defined for the task. This is similar to the session in Informatica PowerCenter.

Power Center:  

This component allows Informatica PowerCenter workflows to be imported and executed from Informatica Cloud Services. The source and target connections from the PowerCenter workflow can be mapped to the connections available on Informatica Cloud Services. When importing Informatica PowerCenter workflows, make sure the workflow is exported from the Repository Manager of PowerCenter; otherwise it shows an error while importing the task.

Mappings

This component allows you to create Informatica ETL mappings similar to Informatica PowerCenter. Not all the features of Informatica PowerCenter are available in Informatica Cloud Services. Features like the SQL transformation, union, etc. are not provided in ICS. If you need additional features, you might have to create a mapplet in PowerCenter, import it into ICS as a mapplet, and add it to the ICS mapping. ICS provides transformations such as expression, joiner, filter, lookup, sorter, aggregator, mapplet, and normalizer.




Task Flows

Task flows allow you to create a sequence of jobs to be executed. All the mapping configuration tasks created can be made to execute in a sequence. This is similar to the workflow in Informatica PowerCenter.

Integration templates

This component allows Informatica mapping templates created using Microsoft Visio to be imported and applied to mappings in ICS.

Activity Log/Monitor

This component allows you to monitor all the executing tasks as well as see the completed tasks. It provides information on the number of source and target rows, session logs, etc.

Mapplets

ICS mapplets work similarly to Informatica PowerCenter mapplets. Mostly this allows PowerCenter mapplets to be imported into ICS.
  

Connections

This component allows connections to be created to flat file folders, databases, etc. A wide range of connectors is available.