
Tuesday, 10 November 2015

User management - Creating groups, users, and owners in Amazon Redshift


Below are some sample commands for creating user groups, adding users, and managing permissions on tables in Amazon Redshift.

 1) To create a user, use the below command in Amazon Redshift:

create user my_user password 'Anaconda123';

In the above example, a user named my_user is created with the password Anaconda123 (Redshift passwords must contain at least one uppercase letter, one lowercase letter, and one number).

You can later change the password using the command:

alter user my_user password 'Mynewpassword123';

 2) To create a user group, use the below command in Amazon Redshift:


create group my_group with user my_user;

In the above command, a new group named my_group is created and the user my_user is added to it.

 3) To create a database schema, use the below command in Amazon Redshift:

 create schema myschema authorization my_user;

In the above example, the schema myschema is created with my_user as its owner, which gives my_user all permissions on the schema.

 4) To grant a user group access to a schema, use the below command in Amazon Redshift:

 grant all on schema myschema to group my_group;



 5) To add users to, drop users from, or rename a user group, use the below commands in Amazon Redshift:

ALTER GROUP group_name ADD USER username;

ALTER GROUP group_name DROP USER username;

ALTER GROUP group_name RENAME TO new_name;


 6) To check permissions on tables in Amazon Redshift:

select * from pg_tables;  /* provides table owner info */
select * from pg_views;   /* provides view owner info */
select * from pg_shadow;  /* provides user info */
select * from pg_group;   /* provides user group info */
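
You can also check whether a specific user holds a specific privilege on a table using the has_table_privilege function. A minimal sketch, assuming a table named mytable and the user my_user created above:

select has_table_privilege('my_user', 'mytable', 'select');
select has_table_privilege('my_user', 'mytable', 'insert');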



 7) To add permissions on tables in Amazon Redshift:

Use the sample below to create a table, grant permissions on it, and change its owner in Redshift:

Create table MYTABLE (
    COLUMN1          CHAR(1)                       NOT NULL,
    COLUMN2     CHARACTER VARYING(100)        NOT NULL)
distkey (COLUMN1);

GRANT ALL ON TABLE MYTABLE TO mstr_user;

ALTER TABLE MYTABLE owner TO my_user;
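
Permissions can also be granted to or revoked from a whole user group instead of an individual user. A small sketch using the group created earlier in this post:

GRANT SELECT ON TABLE MYTABLE TO GROUP my_group;
REVOKE ALL ON TABLE MYTABLE FROM GROUP my_group;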





Friday, 2 October 2015

Informatica Java transformation to parse comma separated columns and generate new rows - normalizer limitations

The Informatica Normalizer transformation is normally used to convert a column containing multiple values into separate rows, i.e. something like ID, 1,2,3,4 can be converted to the rows below.
ID, 1
ID, 2
ID, 3
ID, 4

The problem with the Normalizer, however, is that it has to know the number of occurrences. If the number of occurrences is very high, it is hard to create that many output ports.

The easy way in these cases is to use a Java transformation to parse the column containing multiple values and generate a separate row for each value.

The Java code is shown below:

String str_var = INPUT_PORT_NAME;   // value of the input port containing comma separated values
String[] arr;
String delimiter = ",";
arr = str_var.split(delimiter);     // split the value on the delimiter
for (int i = 0; i < arr.length; i++) {
    INPUT_PORT_NAME = arr[i];       // assign one value to the port
    generateRow();                  // emit a new output row for each value
}


Hadoop: How to dynamically partition a table in Hive and insert data into a partitioned table for better query performance?

Partitioning in Hive, just like in any database, allows for better query performance, since it lets queries read only sections of the data instead of the complete table. Hive allows single or multiple columns to be used for partitioning data, and it supports both dynamic and static partitioning of tables. In Hive, since data is stored as files on HDFS, whenever you partition a table it creates subdirectories using the partition keys. For example, if date and batch key are the partition keys, it creates directories by date and then batch key subdirectories within each date directory. This allows data to be read for a certain date and batch key.

In dynamic partitioning in Hive, the data is inserted into the respective partition dynamically without you having to explicitly create the partitions. Do not use columns with high cardinality as partition keys, since that will create many subdirectories and deteriorate performance.
The example below shows how this works:
 
/* Below is the ddl for creating the main table into which you want to insert data.
It is partitioned by date and batchid. When inserting with dynamic partitions, the partition key columns have to come last in the select list (here, at the end of the temp table) */
  
create table IF NOT EXISTS mytable ( id int, age smallint, name string, joining_date timestamp, location string ) partitioned by (roweffectivedate string, batchid int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS SEQUENCEFILE;
 
/*Below is the ddl for the temp hive table from which you want to insert data to the main table*/
  
create table IF NOT EXISTS temp_mytable ( id int, age smallint, name string, joining_date timestamp, location string, roweffectivedate string, batchid int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS SEQUENCEFILE;

/* Load data from a local file to the temp hive table */ 

load data local inpath '/etc/home/mydirectory/test.txt' overwrite into table temp_mytable;

/* The below set statement allows all the partitions to be dynamic in Hive */
  
set hive.exec.dynamic.partition.mode=nonstrict;

/* Now insert data from the temp hive table into the main hive table that stores the data by date and batch id */

insert overwrite table mytable partition(roweffectivedate, batchid) select * from temp_mytable;

If you want to manually add partitions to the Hive table, then use an alter statement such as the one below:
  
ALTER TABLE test ADD PARTITION (roweffectivedate='2014-03-05', batchid='10') location '/usr/datawarehouse/hive/db/mytable/';
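
Once the partitions exist, whether created dynamically or added manually, a quick way to verify them is shown below. This is just a sketch using the table name from this example:

show partitions mytable;

select count(*) from mytable where roweffectivedate='2014-03-05' and batchid=10;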

Friday, 18 September 2015

Creating a table in Parquet, SequenceFile, RCFile, and TextFile format and enabling compression in Hive


My intention was to write an article on the different file formats in Hive, but I happened to notice an article already posted.
 
We are using Parquet these days mainly because of its compression options and its performance with large tables in Hive. Parquet is optimized to work with large data sets and provides good performance for aggregation functions such as max or sum.

If you prefer to see the HDFS file in clear text format, then you need to store the file in TextFile format. This takes more space than the binary formats used by SequenceFile and RCFile.


Below are examples of creating a table in Parquet, SequenceFile, RCFile, and TextFile format. The location clause is optional.

create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS parquet location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS sequencefile location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS rcfile location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS textfile location 'hdfspathname';


The default storage format is TEXTFILE (note that the default field delimiter is Ctrl-A, '\001', rather than a comma).

Enabling compression in Hive or Impala tables:


Apart from the storage formats supported in Hive, the data can also be compressed using codecs such as LZO, Snappy, etc. To enable compression in Hive, we use set statements such as the ones below. The example is for LZO compression. After the set statements, you can use insert statements to insert data into a table, and the data will be compressed as per the set statements.



set mapreduce.output.fileoutputformat.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

insert overwrite table tablename partition(dateid) select * from table2;
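
Snappy compression can be enabled in a similar way. The sketch below shows one common approach for a table stored as Parquet; the property names are the usual Hadoop/Parquet ones, and tablename/table2 are just the placeholder names from the LZO example above:

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- for tables stored as parquet, the parquet codec can be set directly
set parquet.compression=SNAPPY;

insert overwrite table tablename partition(dateid) select * from table2;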




Monday, 14 September 2015

Why a career in Data warehousing? Best jobs and salary in Data warehousing.


Firstly, what is data warehousing? Data warehousing (DW) is a term used in information technology to refer to data repositories created to store historical data that is mostly used for reporting and analytical purposes. This could be an enterprise data warehouse that has data integrated from various sources, or it could be subject-specific data repositories called data marts. This is just a brief definition of data warehousing.


These days data warehousing is gaining more popularity and attention thanks to big data and other high-profile web service technologies. Big data refers to technologies that allow huge volumes of data to be processed at lower cost for analytical purposes. Big companies like Facebook, Google, Twitter, and Yahoo have invested in these technologies and use them to process the huge amounts of data they gather on a daily basis. Cloud services have also become popular and are allowing huge data warehouses to be hosted on the cloud; Amazon Web Services and Microsoft Azure, to name a few, are gaining popularity. With these kinds of technologies making news and gaining popularity, data warehousing is becoming more exciting and an area where a lot of IT jobs are being created, drawing in talented IT professionals.

Having worked in data warehousing for the last 10 years, I have seen increased interest from companies investing in data warehousing and, in general, increased interest among IT folks in data warehousing related technologies. I'm writing this article to introduce people to data warehousing and let them know about some of the jobs in the field.

Best jobs and salary in Data warehousing.

Enterprise Data Architect / ETL Architect / Solution Architect / Hadoop Architect -
As the name implies, these folks are responsible for architecture. This could be enterprise architecture, which means they decide how all the pieces of the data warehouse work together, which products need to be used, the interactions between different systems, etc. An ETL architect is more involved with data integration and the creation of data warehouses. Hadoop architects are more concerned with the big data architecture in an organization. Architects also work with upper management and business teams to provide optimum data warehousing solutions.

Estimated Salary: 80,000 to 140,000+

Team/Project/Account Managers -
As with any team, there will be managers in data warehousing teams too. They are usually expected to have some experience with data warehousing. Team managers are more involved in resourcing, budgets, etc. Project managers are more involved in managing projects through to completion. Account managers are involved if the company is a vendor providing services to another organization.

Estimated Salary: 80,000 to 120,000+ 

Big data / Hadoop Developers and Admins -
Big data developers are involved in developing Hadoop jobs, which could mean writing MapReduce jobs, developing Hive, Pig, or Oozie scripts, or working on complex Apache Spark code. Hadoop admins are involved in managing the Hadoop infrastructure, fine-tuning it for performance, user administration, security, etc.

Estimated Salary: 60,000 to 100,000+ 

Data Modellers -
In some organizations data modelling is done by data architects or even ETL developers, while others have dedicated data modellers. Data modelling refers to developing the table structures and defining the relationships between them.

Estimated Salary: 60,000 to 100,000+ 

ETL/Data Warehouse Developer -
ETL stands for extract, transform, and load, and ETL professionals create jobs to integrate data from multiple sources, transform it, and load it to relational or flat file targets. These days ETL developers also work with ETL tools created for big data platforms. There are many big vendors in this area; Informatica and Pentaho are some of the big names.

Estimated Salary: 70,000 to 100,000+ 

Report Developer -
Report developers build various reports and dashboards for the business teams using tools from vendors such as Cognos, Business Objects, MicroStrategy, etc. Report developers need to know SQL and be proficient with one of the report development tools.


Estimated Salary: 70,000 to 100,000+ 


Tester / Quality Assurance -
Every field in software development needs testers to make sure the software is working as per requirements. Testers do all the testing, write test cases, etc. Data warehouse testing needs testers who know a bit about ETL tools, SQL, and some data warehousing concepts.

Estimated Salary: 60,000 to 100,000+

Database Administrators -
Database administrators are responsible for managing and administering the databases and servers used to store the data in data warehousing. There are many jobs for Oracle, SQL Server, Netezza, and similar databases.

Estimated Salary: 70,000 to 100,000+ 

Production and Infrastructure Support -
Production support folks are responsible for managing and supporting all the jobs in production. Only they have access to the production servers, and they make sure all the production jobs are running smoothly. They have to carry pagers and sometimes work overnight shifts.

Estimated Salary: 50,000 to 80,000+ 










Friday, 26 December 2014

Big Data - Good Books for Hadoop, Hive, Pig, Impala, Hbase.



1) Hadoop: The Definitive Guide


 

Hadoop: The Definitive Guide: Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).
  • Store large datasets with the Hadoop Distributed File System (HDFS)
  • Run distributed computations with MapReduce
  • Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
  • Discover common pitfalls and advanced features for writing real-world MapReduce programs
  • Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
  • Load data from relational databases into HDFS, using Sqoop
  • Perform large-scale data processing with the Pig query language
  • Analyze datasets with Hive, Hadoop’s data warehousing system
  • Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems

2)  Programming Hive



Programming Hive: Need to move a relational database application to Hadoop? This comprehensive guide introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem.
This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem. You’ll also find real-world case studies that describe how companies have used Hive to solve unique problems involving petabytes of data.
  • Use Hive to create, alter, and drop databases, tables, views, functions, and indexes
  • Customize data formats and storage options, from files to external databases
  • Load and extract data from tables—and use queries, grouping, filtering, joining, and other conventional query methods
  • Gain best practices for creating user defined functions (UDFs)
  • Learn Hive patterns you should use and anti-patterns you should avoid
  • Integrate Hive with other data processing programs
  • Use storage handlers for NoSQL databases and other datastores
  • Learn the pros and cons of running Hive on Amazon’s Elastic MapReduce

  3) Programming Pig




Programming Pig: This guide is an ideal learning tool and reference for Apache Pig, the open source engine for executing parallel data flows on Hadoop. With Pig, you can batch-process data without having to create a full-fledged application—making it easy for you to experiment with new datasets.

Programming Pig introduces new users to Pig, and provides experienced users with comprehensive coverage on key features such as the Pig Latin scripting language, the Grunt shell, and User Defined Functions (UDFs) for extending Pig. If you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig.
  • Delve into Pig’s data model, including scalar and complex data types
  • Write Pig Latin scripts to sort, group, join, project, and filter your data
  • Use Grunt to work with the Hadoop Distributed File System (HDFS)
  • Build complex data processing pipelines with Pig’s macros and modularity features
  • Embed Pig Latin in Python for iterative processing and other advanced tasks
  • Create your own load and store functions to handle data formats and storage mechanisms
  • Get performance tips for running scripts on Hadoop clusters in less time

4) HBase: The Definitive Guide 
 


HBase: The Definitive Guide: If you're looking for a scalable storage solution to accommodate a virtually endless amount of data, this book shows you how Apache HBase can fulfill your needs. As the open source implementation of Google's BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. Many IT executives are asking pointed questions about HBase. This book provides meaningful answers, whether you’re evaluating this non-relational database or planning to put it into practice right away.
  • Discover how tight integration with Hadoop makes scalability with HBase easier
  • Distribute large datasets across an inexpensive cluster of commodity servers
  • Access HBase with native Java clients, or with gateway servers providing REST, Avro, or Thrift APIs
  • Get details on HBase’s architecture, including the storage format, write-ahead log, background processes, and more
  • Integrate HBase with Hadoop's MapReduce framework for massively parallelized data processing jobs
  • Learn how to tune clusters, design schemas, copy tables, import bulk data, decommission nodes, and many other tasks

5) Getting Started with Impala: Interactive SQL for Apache Hadoop
 

Getting Started with Impala: Interactive SQL for Apache Hadoop: Learn how to write, tune, and port SQL queries and other statements for a Big Data environment, using Impala, the massively parallel processing SQL query engine for Apache Hadoop. The best practices in this practical guide help you design database schemas that not only interoperate with other Hadoop components and are convenient for administrators to manage and monitor, but also accommodate future expansion in data size and evolution of software capabilities.
Ideal for database developers and business analysts, Getting Started with Impala includes advice from Cloudera’s development team, as well as insights from its consulting engagements with customers.
  • Learn how Impala integrates with a wide range of Hadoop components
  • Attain high performance and scalability for huge data sets on production clusters
  • Explore common developer tasks, such as porting code to Impala and optimizing performance
  • Use tutorials for working with billion-row tables, date- and time-based values, and other techniques
  • Learn how to transition from rigid schemas to a flexible model that evolves as needs change
  • Take a deep dive into joins and the roles of statistics




Monday, 17 November 2014

Performance tuning in Amazon Redshift - Simple tricks (table/query design)


The performance of a query in Amazon Redshift, just like in any database, depends on how well the query is optimised, the design of the tables, the distribution and sort keys, the type of cluster (number of nodes, disk space, etc.), which is basically the supporting hardware of Redshift, concurrent queries, number of users, and so on. However, the good news is that, compared to a database such as Oracle, performance tuning is not that hard. Amazon Redshift is a petabyte-scale, massively parallel database, and very high performance can be achieved with simple configuration steps. In this blog, we explain in detail how to achieve high performance in Amazon Redshift.

INVESTIGATION:
To give an idea of the performance issues we were facing, have a look at the cost and disk usage of the queries we were running on Redshift. The EXPLAIN command can be used to get the cost of a query and its execution plan. The execution plan will show those parts of the query that are very costly and need to be tuned. In the query below, you can see, just by looking at the numbers beside the cost, that the first two outer joins are very expensive and the last inner join is not that costly. It is not only because of the outer joins, but because of the amount of data distribution happening in the query.


explain select * from table1 a11 left outer join table2 a12 on (a11.EMPLOYEE_KEY = a12.EMPLOYEE_KEY) left outer join table3 a13 on (a11.name = a13.name) join table3 a14 on (a11.dept_key = a14.dept_key);


XN Hash Left Join DS_BCAST_INNER  (cost=... rows=8226899613 width=1388)
  Hash Cond: ("outer".employee_key = "inner".employee_key)
  ->  XN Hash Left Join DS_DIST_BOTH  (cost=... rows=8226899613 width=1372)
        Hash Cond: ("outer".name = "inner".name)
        ->  XN Hash Left Join DS_DIST_NONE  (cost=... rows=822689 width=1372)
              Hash Cond: ("outer".dept_key = "inner".dept_key)

 
Data is distributed among the various compute nodes in Amazon Redshift, and the amount of data movement between nodes plays a large part in query performance. If you see a lot of DS_DIST_NONE in your query plan, then you have the least amount of data redistribution; if you see other keywords like DS_DIST_BOTH or DS_BCAST_INNER, then more data redistribution is happening and performance can be bad.

The meaning of the explain plan attributes, as per the Amazon Redshift documentation, is below:

DS_BCAST_INNER - A copy of the entire inner table is broadcast to all compute nodes.
DS_DIST_ALL_NONE - No redistribution is required because the inner table was distributed to every node using DISTSTYLE ALL.
DS_DIST_NONE - No tables are redistributed; collocated joins are possible because corresponding slices are joined without moving data between nodes.
DS_DIST_INNER - The inner table is redistributed.
DS_DIST_ALL_INNER - The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
DS_DIST_BOTH - Both tables are redistributed.


The disk usage for the above query can be seen from the Amazon Redshift web console in the performance tab. The disk space usage was reaching close to 100%, which is not good news at all, since that single query was consuming all the disk space. This is an indication of poor query performance. This query was run on a dw2.8xlarge cluster, which is a large node type with relatively high disk space.




Another way to check whether your query is performing poorly is to check whether there are a lot of writes to disk, using a query such as the one below (2044 is the query id of the query):

select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = 2044 order by workmem desc;



SOLUTION:

We had to make a lot of changes to fix the performance issues. All the factors that influence query performance, and the required changes, are discussed below:

1) Table design (data distribution and sorting) - In Redshift the data is distributed across all the nodes. Whenever it executes a query, it has to bring the data from the different nodes and use it in joins, aggregations, sorting, group by, etc. If the amount of data that has to be grabbed from these nodes is high, it results in a lot of traffic and poor query performance. To reduce the amount of traffic between nodes, the data has to be distributed properly between the nodes. This can be achieved through proper table design, data compression, distribution keys, sort keys, and query joins. First, let's look at the table design. Below is the create table syntax. The distribution style (diststyle), distribution key (distkey), and sort keys are very important in defining how the data is distributed among the nodes.

CREATE [ [ LOCAL ] { TEMPORARY | TEMP } ] TABLE table_name
( column_name1 data_type [ ENCODE encoding-type ] [, ...] )
[ DISTSTYLE { EVEN | ALL | KEY } ]
[ DISTKEY ( distkey_identifier ) ]
[ SORTKEY ( sortkey_identifier [, ...] ) ]



DISTSTYLE ALL will put a copy of the data on all the nodes and make it available for joins without having to move the data between nodes. This is ideal for dimension tables that do not have many records. DISTSTYLE KEY will distribute data based on a defined key column and is better for tables that are huge, where you can use a column that is used in joins to distribute the data. DISTSTYLE EVEN is used mostly to evenly distribute the data when you can't decide between ALL and KEY.

*Trick*: USE DISTSTYLE ALL for all the dimension tables that are not big.

DISTKEY (column name) - The column used here should be one that is used in joins, and it should be defined for all the large fact tables. The distribution key (distkey) helps in distributing the data based on the distkey column, so that during joins only the required data is brought to the nodes. Only one distkey (distribution key) column can be defined.

*Trick*: Define Distkey for all the large fact tables and the dist key column should be the one used in the joins.

If the distribution key from the large fact table is joined to any column of a dimension table that is in DISTSTYLE ALL, then in your explain plan you will see the attribute DS_DIST_ALL_NONE, which means there is no data redistribution and the cost of the query is very low. This is the state you want most of the query to be in for best performance.

SORT KEY (column name1, ...): There can be multiple sort keys, and sort keys help a lot in storing the data on the nodes in sorted order, which helps with group by or order by operations.

*Trick*: If there are columns such as a date or timestamp that you use in order by operations, then define those as sort keys. A sample table design following these tricks is sketched below.
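
For illustration, here is a minimal sketch of a fact and a dimension table designed along the lines above (all table and column names are made up for the example):

create table fact_sales (
    sale_id       bigint not null,
    employee_key  integer not null,
    dept_key      integer not null,
    sale_date     date not null,
    amount        decimal(18,2)
)
diststyle key
distkey (employee_key)
sortkey (sale_date);

create table dim_department (
    dept_key   integer not null,
    dept_name  varchar(100)
)
diststyle all;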

2) Table design (compression) - In Amazon Redshift tables, the encoding used for compression can be defined along with the column name, in the format shown below. There are various encodings available.

CREATE TABLE table_name (column_name
            data_type ENCODE encoding-type)[, ...]

*Trick*: An easy way to decide on the encoding, if you are using the copy command to load data from S3 to Redshift, is to use the COMPUPDATE ON option in the copy command as shown below. The copy command then chooses the best compression for the columns it is loading data into.

copy <tablename> from 's3://mybucket/files.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' COMPUPDATE ON

If you want to explicitly define the encoding, such as when you are inserting data from another table or set of tables, then load some 200K records into the table and use the command ANALYZE COMPRESSION <tablename> to make Redshift suggest the best compression for each of the columns. You can use those suggestions while recreating the table, as sketched below.
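
A minimal sketch of what that might look like, reusing the hypothetical fact table from section 1; the encodings chosen here are just plausible examples, and in practice you would use whatever ANALYZE COMPRESSION suggests for your data:

analyze compression fact_sales;

create table fact_sales_compressed (
    sale_id       bigint encode delta,
    employee_key  integer encode lzo,
    dept_key      integer encode bytedict,
    sale_date     date encode runlength,
    amount        decimal(18,2) encode lzo
)
distkey (employee_key)
sortkey (sale_date);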


3) Table design (constraints) - Primary keys and foreign keys are not enforced by Redshift, but they are apparently used by the query planner to execute queries efficiently. Hence, define those keys in your create table statements for better performance.


4) Query design -
Now that we have discussed table design, the actual query design should make use of all those table design features when executing the query. As discussed before, the trick to designing good queries is to have proper joins, i.e. use the distribution key from the large fact table to join to any column of a dimension table that is in DISTSTYLE ALL. If the dimension table is not in DISTSTYLE ALL, then use its distribution key (distkey) in the joins. A sample join following these rules is sketched below.
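
A minimal sketch using the hypothetical tables from section 1, joining the fact table to a DISTSTYLE ALL dimension; since the dimension is copied to every node, this join needs no data redistribution and should show up as DS_DIST_ALL_NONE in the explain plan:

select d.dept_name, sum(f.amount) as total_sales
from fact_sales f
join dim_department d on (f.dept_key = d.dept_key)
where f.sale_date >= '2015-01-01'
group by d.dept_name;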

5) WLM queue settings:

The WLM queue settings in your cluster configuration determine the number of queries running concurrently on your cluster and are an essential part of performance tuning. Keeping the concurrency low means not many queries can run at the same time, which provides more memory for each query. Set the concurrency and memory settings for each user group in the cluster for better performance.





6) Copy command performance tuning:
If you want your copy command to be faster, then split the input into multiple files so that they can be loaded in parallel into the Redshift database. See the sketch below.
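
For example, if a large file is split into parts that share a common prefix in S3, a single copy command pointed at that prefix will load the parts in parallel. This is a sketch; the bucket, prefix, and credentials are placeholders in the same style as the earlier copy example:

copy mytable from 's3://mybucket/data/part_'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' gzip;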


All the above steps should help in running your queries more efficiently. If you need more details, refer to the AWS link:
http://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables.html


Also check:
http://dwbitechguru.blogspot.ca/2014/11/amazon-redshift-top-concerns.html 

Thursday, 13 November 2014

Example T-SQL script to truncate tables in SQL Server using cursors

/* The below script truncates all the tables whose names end with _CDW, using a cursor */


DECLARE @name VARCHAR(256) -- holds the generated truncate statement
DECLARE @fileName VARCHAR(256)


DECLARE db_cursor CURSOR FOR
SELECT 'TRUNCATE TABLE '+TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE '%[_]CDW' /* [_] matches a literal underscore */


OPEN db_cursor 
FETCH NEXT FROM db_cursor INTO @name 
WHILE @@FETCH_STATUS = 0 
BEGIN 
SET @fileName =  @name
exec (@fileName)
FETCH NEXT FROM db_cursor INTO @name 
END 

CLOSE db_cursor 
DEALLOCATE db_cursor

Check out SQL server date manipulation functions:
http://dwbitechguru.blogspot.ca/2014/11/sql-server-important-date-manipulation.html

Troubleshooting Amazon Redshift connection issues:


a) Could not connect or resolve host name


Resolution/Cause: This means the cluster is not up or your cluster address is wrong. Check the ODBC/JDBC settings and bring the cluster up.

b) ][ODBC PostgreSQL Wire Protocol driver]Seeding the PRNG failed, most likely because the system does not have /dev/random. Connection String.

][Database Classes][Error] Failed to connect to data source. Error message: Connect failed.

Cause: The cluster is down for maintenance or is not available.

c) [ODBC PostgresSQL Wire protocol driver] FATAL: no pg_hba.conf entry for host "10.10.140.1". user "something", database "something", SSL off


Resolution: Turn SSL mode on, i.e. set it to Allow mode. This is a client-side setting. If you are using AQT, Workbench, or any other client to connect, you need to have SSL mode on if you are getting the above error.

d) [ODBC PostgreSQL Wire Protocol driver]SSL I/O Error.


Resolution: This could be caused by the cluster not being up, the ODBC driver not being configured properly, or the connection being dropped due to a huge volume of data. The most likely cause is the connection being dropped because the ODBC driver is trying to read large chunks of data.

Tuesday, 11 November 2014

Amazon Redshift - Top concerns and issues

 

The Amazon Redshift cloud data warehouse is going to change the data warehousing landscape sooner or later. With more and more companies wanting to expand their data warehousing capabilities and wanting high-powered systems that can churn through huge loads, Amazon Redshift is a very attractive proposition. It offers a petabyte-scale data warehouse appliance that is very cheap, payable on demand on an hourly basis, with little or no administration or maintenance required, and very high performance. All we have to do is find a secure, reliable way to upload data to the Amazon Redshift cloud, which is not hard by the way, and then go crazy doing any kind of data processing. It is a great solution for sure, but before you jump on the bandwagon, a few concerns should be noted. The top Amazon Redshift concerns are listed below:

1) How much data do you have to upload and download, and at what frequency? - If you are uploading gigabytes of data to Amazon S3 and then loading it into Amazon Redshift, there is some amount of time required to upload the data, and it also consumes huge bandwidth. Again, this is not a problem if you are doing it once a week or month. If you are doing it daily or very frequently, then yes, this will add latency and network cost.

2) Integration with ETL/BI tools: If you are reading/writing huge volumes of data for ETL/BI processing directly over ODBC/JDBC, then most likely this will be challenging, since the network is going to slow it down and there are not many ODBC drivers (at least I haven't come across any so far) that can do bulk loads directly to Redshift. ODBC drivers do not compress data, and if you are reading millions of records, the driver is going to run out of buffer or the connection will fail at some point. The recommended solution for huge volumes is to upload/download data as flat files through S3 to Amazon Redshift, as sketched below.
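
As a rough sketch of the flat-file route (the bucket names and credentials here are placeholders), data is extracted to S3 with UNLOAD and loaded back with COPY rather than streamed over ODBC/JDBC:

unload ('select * from mytable')
to 's3://mybucket/extracts/mytable_'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' gzip;

copy mytable from 's3://mybucket/extracts/mytable_'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' gzip;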



3) Security clearance - Some companies, like insurance companies and banks, have policies against allowing data to be uploaded to the cloud. If you can get all the security clearance, then it is possible to upload data to the cloud in a secure and encrypted way and keep it encrypted on the cluster in such a way that even Amazon cannot see it. In any case, there are some hurdles to be crossed before companies get all the internal clearance to upload data to the Amazon Redshift cloud.

4) Special characters - Redshift stores data in the UTF-8 character set. It can copy data from files in UTF-8 or UTF-16 format, but it converts them all into UTF-8. That means some characters, like French characters with accents, may not get loaded properly. The solution Redshift offers is to replace all of them with a single replacement character while copying (see the sketch below). If anyone has figured out a way to load French characters into Amazon Redshift, please message me. At this moment this is a big concern for me.
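
The replacement behaviour mentioned above is presumably the COPY command's ACCEPTINVCHARS option, which swaps any invalid UTF-8 characters for a replacement character of your choice. A sketch with placeholder bucket and credentials:

copy mytable from 's3://mybucket/files.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' ACCEPTINVCHARS '?';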

Also, check out:
http://dwbitechguru.blogspot.ca/2014/11/performance-tuning-in-amazon-redshift.html

Monday, 21 July 2014

Data Warehouse and Business Intelligence Project Activities in Waterfall/Agile Methodology



Initiation phase

Activities:

  • Cost benefit analysis and ROI calculation
  • Project feasibility analysis
  • Business case study
  • Project cost estimation
  • Resource identification and allocation, other project management tasks
  • Vendor identification
  • Proof of concepts


Requirement Analysis

Activities:

  • Risk Assessment
  • Info Security Assessment
  • Data Dictionary
  • Capture Business requirements
  • Source Data Analysis
  • Defining source system format
  • Capture Non Functional requirements

Deliverables: Business requirement document, Functional specification document, Data dictionary, etc.


Design

Activities:

  • High Level Architecture Design - Architecture Design artifacts, Project Technical review
  • Detailed design - Application Design (ETL/Report Design), Database Design
  • Data profiling, Data Quality analysis
  • Proof of concepts
  • Source Data Analysis
  • Source Target mapping
  • Logical/Physical Data Modeling
  • Design of workflows and schedule
  • Design walkthrough, review and sign offs

Deliverables: Design documents, data models, data mapping documents, etc.


Testing

Activities:

  • Prepare test strategy, test plan and test cases
  • Test document review and sign off
  • Test data set up
  • Test scripts creation
  • System integration/functional testing
  • Data validation
  • Performance testing
  • Defect logging and tracking
  • QA testing sign off
  • User acceptance testing and sign off

Deliverables: Testing documents, Defect summary, etc.


Implementation

Activities:

  • Prepare Implementation plan or deployment guides
  • Raise change management/implementation tickets and get necessary approvals
  • Prepare runbooks
  • Support implementation group during implementation
  • Knowledge transfer
  • Provide post production support during warranty period

Deliverables: Production implemented code, run books, support manuals, change tickets, etc.





Sunday, 20 July 2014

Business Intelligence - A simple definition and overview

Business Intelligence (BI): tools and systems that give a business the capability to store, analyze, and understand its data in order to make good business decisions, and to generate reports, dashboards, and metrics for a better understanding of company data, which helps improve efficiency and drive revenue.

Typical applications of business intelligence:
a) Metrics such as gross revenue, profits, and sales, related to financial and sales analytics
b) Customer behavior profiling
c) Order management and supply chain
d) Geo-spatial analytics
e) Regulatory compliance
f) Human resource management



The sources of data for Business Intelligence:
 Operational data
 Web services
 Data marts/ODS and enterprise data warehouses
 Social media websites (Facebook/Twitter)
 B2B feeds from other businesses
 MDM data - Master data management data
 Geo-spatial information
 Pervasive computing devices such as sensors, cell phones, etc.


Business Intelligence tools:
 a) Reporting/dashboard tools such as Business Objects and MicroStrategy
 b) Data mining tools such as Weka
 c) Statistical analysis tools such as SAS/R
 d) Big data tools such as Hive, Impala, etc. for analysis of huge volumes of data

There are obviously other, more complex tools that can be used for business intelligence. The tools above are what is normally found in industry.

Recent trends:
a) Moving business analytics to the cloud, such as Amazon Redshift.
b) Using big data for handling huge volumes of data and offloading work from traditional data warehouse appliances.
c) Mobile BI applications
d) Social media analysis

Success Factors:
a) Management involvement and sponsorship
b) Focus on solving business problems
c) Stable and efficient BI environment


Required skill set for building business intelligence systems:
a) Report/dashboard developers.
b) Data modellers and data architects.
c) ETL designers/developers with an understanding of data warehousing concepts.
d) Statisticians/data scientists or data analysts.
e) Testers and business analysts with a background in data warehousing and business intelligence.


See also: http://dwbitechguru.blogspot.ca/2014/07/a-standard-etl-architecture.html



Monday, 14 July 2014

Informatica - ETL testing - Challenges


ETL testing is a bit complicated, since the complexity of testing cannot be hidden behind buttons or GUIs. The tester will have to write SQL against a variety of databases, and sometimes write their own mappings to test the ETL code, since ETL involves integration of data from heterogeneous data sources and transformations that cannot be easily coded using SQL.

Things that might have to be tested during ETL testing:

1) The data from multiple sources is properly integrated, transformed, aggregated, sorted, etc., and loaded into the target database.
2) The data is properly landed, staged, and loaded into mart tables.
3) Proper batch keys are inserted for delta loads and delta loads are working properly.
4) The data is inserted/updated in the target table.
5) Historical loads and daily loads are working as expected.
6) Testing of events, command tasks, schedules, and notifications.
7) Validation of error cases and rejected data.
8) Performance testing.
9) Impact testing: testing interfaces to upstream and downstream processes.
10) Regression testing: test that the existing data/tables/etc. are not broken.

Complications in ETL testing:

a) Availability of all test data and creation of test data for all the test cases.
b) Understanding the complexities of the ETL tool and data warehousing.
c) Working with data spread across multiple databases.

Just out of interest, if you want to check out the different roles in data warehousing, please read:
http://dwbitechguru.blogspot.ca/2015/09/career-and-hot-jobs-in-data-warehousing.html

How ETL testing can be done:

a) If you are testing Informatica code and the testing involves only a few impacted sessions, a separate workflow can be created just for the purpose of testing. This avoids searching for the right session to run to test the code.

b) Perform a run of the ETL jobs for historical data and for daily loads separately and test both loads.

c) Test the load of data into the landing area. If this is working properly, then it is easy to write SQL to validate data against the landing area, since data from multiple sources is brought into one landing area (see the sketch after this list).

d) If jobs involve files, make sure the files are deleted/archived and folders ready for next load.

e) Look at the ETL design document to understand how ETL is designed.

f) Make sure the ETL code can handle the data volume and meet all the performance parameters.

g) There might be some ETL testing tools that could make testing easier. Informatica has a set of ETL testing tools of its own.
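
As a minimal sketch of the kind of validation SQL mentioned in point (c), assuming a hypothetical source table src_orders and its landing table lnd_orders, a simple reconciliation compares row counts and checks for duplicate keys:

-- row count reconciliation between source and landing
select (select count(*) from src_orders) as source_count,
       (select count(*) from lnd_orders) as landing_count;

-- duplicate key check on the landing table
select order_id, count(*)
from lnd_orders
group by order_id
having count(*) > 1;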

See also:
http://dwbitechguru.blogspot.ca/2014/07/business-object-reports-testing.html