Friday, 26 December 2014

Big Data - Good Books for Hadoop, Hive, Pig, Impala, HBase.







1) Hadoop: The Definitive Guide


 

Hadoop: The Definitive Guide: Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).
  • Store large datasets with the Hadoop Distributed File System (HDFS)
  • Run distributed computations with MapReduce
  • Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
  • Discover common pitfalls and advanced features for writing real-world MapReduce programs
  • Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
  • Load data from relational databases into HDFS, using Sqoop
  • Perform large-scale data processing with the Pig query language
  • Analyze datasets with Hive, Hadoop’s data warehousing system
  • Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems

2) Programming Hive



Programming Hive: Need to move a relational database application to Hadoop? This comprehensive guide introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem.
This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem. You’ll also find real-world case studies that describe how companies have used Hive to solve unique problems involving petabytes of data.
  • Use Hive to create, alter, and drop databases, tables, views, functions, and indexes
  • Customize data formats and storage options, from files to external databases
  • Load and extract data from tables—and use queries, grouping, filtering, joining, and other conventional query methods
  • Gain best practices for creating user defined functions (UDFs)
  • Learn Hive patterns you should use and anti-patterns you should avoid
  • Integrate Hive with other data processing programs
  • Use storage handlers for NoSQL databases and other datastores
  • Learn the pros and cons of running Hive on Amazon’s Elastic MapReduce

3) Programming Pig




Programming Pig: This guide is an ideal learning tool and reference for Apache Pig, the open source engine for executing parallel data flows on Hadoop. With Pig, you can batch-process data without having to create a full-fledged application—making it easy for you to experiment with new datasets.

Programming Pig introduces new users to Pig, and provides experienced users with comprehensive coverage on key features such as the Pig Latin scripting language, the Grunt shell, and User Defined Functions (UDFs) for extending Pig. If you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig.
  • Delve into Pig’s data model, including scalar and complex data types
  • Write Pig Latin scripts to sort, group, join, project, and filter your data
  • Use Grunt to work with the Hadoop Distributed File System (HDFS)
  • Build complex data processing pipelines with Pig’s macros and modularity features
  • Embed Pig Latin in Python for iterative processing and other advanced tasks
  • Create your own load and store functions to handle data formats and storage mechanisms
  • Get performance tips for running scripts on Hadoop clusters in less time

4) HBase: The Definitive Guide 
 


HBase: The Definitive Guide: If you're looking for a scalable storage solution to accommodate a virtually endless amount of data, this book shows you how Apache HBase can fulfill your needs. As the open source implementation of Google's BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. Many IT executives are asking pointed questions about HBase. This book provides meaningful answers, whether you’re evaluating this non-relational database or planning to put it into practice right away.
  • Discover how tight integration with Hadoop makes scalability with HBase easier
  • Distribute large datasets across an inexpensive cluster of commodity servers
  • Access HBase with native Java clients, or with gateway servers providing REST, Avro, or Thrift APIs
  • Get details on HBase’s architecture, including the storage format, write-ahead log, background processes, and more
  • Integrate HBase with Hadoop's MapReduce framework for massively parallelized data processing jobs
  • Learn how to tune clusters, design schemas, copy tables, import bulk data, decommission nodes, and many other tasks

5) Getting Started with Impala: Interactive SQL for Apache Hadoop
 

Getting Started with Impala: Interactive SQL for Apache Hadoop: Learn how to write, tune, and port SQL queries and other statements for a Big Data environment, using Impala, the massively parallel processing (MPP) SQL query engine for Apache Hadoop. The best practices in this practical guide help you design database schemas that not only interoperate with other Hadoop components, and are convenient for administrators to manage and monitor, but also accommodate future expansion in data size and evolution of software capabilities.
Ideal for database developers and business analysts, Getting Started with Impala includes advice from Cloudera’s development team, as well as insights from its consulting engagements with customers.
  • Learn how Impala integrates with a wide range of Hadoop components
  • Attain high performance and scalability for huge data sets on production clusters
  • Explore common developer tasks, such as porting code to Impala and optimizing performance
  • Use tutorials for working with billion-row tables, date- and time-based values, and other techniques
  • Learn how to transition from rigid schemas to a flexible model that evolves as needs change
  • Take a deep dive into joins and the roles of statistics




How to add only non-existing records to a Hive table and implement a NOT IN operation?

Implementing a left outer join and a NOT IN operation in Hadoop Hive

&

How to add only non-existing records to a Hive table?

a) Let's start with how to do a left outer join.


Consider you have two tables, table1 and table2. Doing a normal left outer join is simple; the syntax below should do it:

select * from table2 left outer join table1 on table1.dates=table2.dates 


b) How to implement a NOT IN operation in Hive?


Consider you have two tables, table1 and table2. To implement a NOT IN operation you have to do a left outer join and then apply a filter.

In standard SQL, it would look like:

select * from table2 where table2.dates not in ( select dates from table1)

In Apache Hive SQL, it would look like:

select * from table2 left outer join table1 on table1.dates=table2.dates where table1.dates is null 


c) How to add only non-existing records to a Hive table?


Now let's say you have a scenario where you have to check whether a record already exists in your Hive table and insert only those records that do not exist. In that case you have to take a two-step approach.

1) Create a landing table that is the same as the target table into which you have to insert the records. Insert all your new records into this landing table; you can truncate the landing table first and then insert the new records. Let's call this table LND_TABLE.

2) Now let's say your target table is TGT_TABLE, to which you want to insert only the non-existing records. You would use the syntax below to insert the non-existing records from LND_TABLE. Here, dates is the column used to identify existing records.


INSERT INTO TGT_TABLE select LND_TABLE.* from LND_TABLE left outer join TGT_TABLE on LND_TABLE.dates=TGT_TABLE.dates where TGT_TABLE.dates is null 
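Putting the two steps together, here is a minimal end-to-end sketch. It assumes LND_TABLE is a managed (non-external) table with the same layout as TGT_TABLE, that dates identifies existing records, and that the HDFS path /staging/new_records is a hypothetical location holding the incoming files:

-- Step 1: refresh the landing table with the incoming batch
TRUNCATE TABLE LND_TABLE;
LOAD DATA INPATH '/staging/new_records' INTO TABLE LND_TABLE;   -- hypothetical staging path

-- Step 2: insert only the records whose dates are not already in the target
INSERT INTO TABLE TGT_TABLE
SELECT LND_TABLE.*
FROM LND_TABLE
LEFT OUTER JOIN TGT_TABLE ON LND_TABLE.dates = TGT_TABLE.dates
WHERE TGT_TABLE.dates IS NULL;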









Thursday, 25 December 2014

2014- Best of Boxing Day Deals for Electronic Items in Toronto - TV -Laptop-Tablet-Speaker-Camera


Stores where you can do your Boxing Day shopping:

     
  • Best Buy, Future Shop, Walmart, Staples, Target, Sears - good for buying electronics such as TVs, laptops, speakers, etc.
  • Walmart, Toys R Us - good for shopping for baby stuff.
  • Leon's, The Brick, Bad Boy, Walmart, Sears - good for furniture items.
  • All the malls are good for shopping for clothes, etc. My favourites are Scarborough Town Centre, Eaton Centre, Yorkdale Mall, Vaughan Mills, etc.
  • Home Depot, Lowe's - good for home hardware, appliances, gardening, or snow removal equipment.

Some of the best Boxing Day deals on electronics that I found on the websites are below:

Location: FutureShop, Walmart

There is a 58-inch for $699 and a 60-inch from Sony for $999.


Samsung 58" 1080p 60Hz LED Smart TV (UN58H5202AFXZC) - Black
Cost: $699.99
Save $300



LG 50" 1080p 120Hz LED TV - Silver (50LB5900)
Cost: $499.99
Save: $100
Sale Ends: December 28, 2014



Location: BestBuy/FutureShop


ASUS X Series 15.6" Laptop - Black (Intel Dual-Core Celeron N2830/500GB HDD/4GB RAM/Windows 8.1)
Cost: $279.99
Save: $100
Sale Ends: December 28, 2014



Location: Bestbuy/ FutureShop

Cost: $279.99
Save $100
Sale ends: January 1, 2015
Samsung Galaxy Tab 4 10.1" 16GB Android 4.4 Tablet With 1.2 GHz Quad-Core Processor - White



Location: BestBuy


Canon Rebel T5 18MP DSLR Camera With EF-S 18-55mm IS, EF 75-300mm Lenses & DSLR Bag
$499.99
Save: $310
Sale Ends: December 25, 2014








Location: Walmart


Sony- 5.1-Ch. 3D Smart Blu-ray Home Theater System (BDVE2100)
3D Blu-ray Home Theater with Wi-Fi

Cost: $168

Tuesday, 23 December 2014

What is Materialized (Mat or Mview) view in Netezza and how is it different from normal views?


 

  What is Materialized (Mat or Mview) view in Netezza ?

Materialized views in Netezza create a new table that contains a subset of the columns from the base table, with the data sorted by the columns in the new table. This kind of view is used to improve performance: it reduces the width of the data being scanned, since it creates a thin version of the base table, and the data can be sorted in a different order than the base table. An example of creating a mat view is below:

create materialized view vendor_mview as select vendor_id, vendor_name from vendor order by vendor_id;



The limitations of materialized views in Netezza are:

1) Only one table can be used as the base (source) table.
2) There can be no where clause in the select statement, the columns used in the select must be part of the base table, and no expressions can be used in the select statement.
3) Materialized (mat) views can be created from user tables only; they cannot be created on temporary or external tables.
4) You cannot GROOM a base table with an active materialized view created on top of it.
5) Materialized views need maintenance: when data is added to the base table, the materialized views also get updated, but the new data is appended to the end in unsorted order. Hence, the materialized (mview or mat) views need to be rebuilt periodically.


The command to rebuild a materialized view is:

ALTER VIEW <materialized_view_name> MATERIALIZE REFRESH 

How is a materialized (mat) view different from a normal view in Netezza?

Normal views in Netezza are used to select only the required columns from multiple tables and make them look like a single table. Normal views do not create a real table and do not require maintenance when data is added to the base tables.

Normal views in Netezza can be based on multiple tables, can contain a where clause in the select statement, the columns can have expressions, and they support pretty much everything you can have in a select statement.

A sample create view statement is below:

create view vendor_product_view as select vendor.vendor_id, vendor.vendor_name, product.product_id, product.product_name from vendor, product where vendor.product_id=product.product_id and product.product_category='cell phone' order by vendor.vendor_id;

Normal views do not improve the performance of a query. The use of a normal view is more for convenience: instead of writing the same select query multiple times, we can create a view and use it in place of the select statement. Also, permissions can be granted on the view so that only a few columns are exposed to certain users who do not have permission to see all the columns of the base tables.
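As a small illustration of that last point, here is a minimal sketch, with a made-up view name (vendor_summary_view) and user name (report_user): the view exposes only two columns of the vendor table, and only the view is granted to the reporting user.

-- expose only selected columns through a view, then grant access to the view alone
create view vendor_summary_view as select vendor_id, vendor_name from vendor;
grant select on vendor_summary_view to report_user;   -- report_user is a hypothetical user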

Check out selecting a distribution key in Netezza:
http://dwbitechguru.blogspot.ca/2014/12/selecting-distribution-key-in-netezza.html 

Selecting proper Distribution Key In Netezza for better query performance



Netezza distributes data across the data slices based on the distribution key, and this distribution of data impacts query performance. The distribution key can be specified during table creation (up to four columns can be part of the distribution key), or it can be specified as random, which means Netezza will use a round-robin algorithm to distribute data across the data slices; if nothing is specified, Netezza will pick one of the columns as the distribution key. Hence, never create a table without a distribution key, since that does not guarantee optimal performance.

The data has to be distributed uniformly across all the data slices so that all the Netezza SPUs (snippet processors) have the same amount of work to do. If some data slices hold more data than others, then some SPUs will do more work than others; in that case the data distribution is skewed, which is bad for Netezza query performance.

In the example below, the employee_id column is used as the distribution key. Distribution keys are defined using the DISTRIBUTE ON clause.



create table employee (
employee_id integer not null ,
employee_name varchar(25),
employee_dept_id integer,
employee_age integer,
employee_address varchar(25)
) distribute on (employee_id);

 
1) When selecting a distribution key, choose columns that have unique values and high cardinality. Hence, the primary key, or part of the primary key, is a good choice for the distribution key. In the above example, employee_id is unique for all the records and hence a good candidate for the distribution key.

2) Selecting a date ID is not a good choice for a distribution key. The data might get uniformly distributed across the data slices with a date ID, but if a query requires data from a particular month then only a few SPUs might be busy, which impacts query performance.


3) If two big tables are joined on certain columns, then it is better to use those columns as the distribution keys so that the join is collocated.

4) When creating a table with CTAS from a join of two other tables (create table as select * from t1 join t2 on t1.c1=t2.c2), the distribution keys will be the columns used for joining. If there is a group by clause in the select statement, then the group by columns will be the distribution keys.

5) If there is no good choice for a distribution key, then specify random distribution (DISTRIBUTE ON RANDOM) instead of not defining any distribution key at all (see the sketch after this list).

6) Try not to use a varchar as the distribution key. A bigint or integer data type is a better choice than a varchar.
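As a minimal sketch of point 5, assuming a hypothetical log table that has no obvious unique, frequently joined column:

create table web_clicks (
click_time timestamp,
page_url varchar(500),
browser varchar(50)
) distribute on random;   -- round-robin distribution across the data slices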


We can check the distribution of data using the performance portal, the Netezza administration console, the explain plan, or the nz_skew DBA script.


The explain distribution command gives the data distribution information. For example:

explain distribution select * from employee_table;
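Another quick way to eyeball skew (a sketch, using the employee table created above) is to count rows per data slice with the hidden datasliceid column; a well-distributed table shows roughly equal counts across slices:

select datasliceid, count(*) as row_count
from employee
group by datasliceid
order by row_count desc;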

Friday, 19 December 2014

How to select between Hadoop vs Netezza vs Redshift- comparison?


A common question that we hear is: among Hadoop, Netezza, and Redshift, which one is better and which one is going to replace everything else? All these technologies have their own pros and cons, and it is essential to know them before we try to find out the ideal use cases for each and how to use them in your ecosystem.

Hadoop:


Pros:
  • As we know, Hadoop is mainly used for big data purposes, i.e., cases where huge volumes are involved, often petabyte scale or more.
  • The data can be unstructured, structured, or semi-structured.
  • The data in Hadoop can be stored on commodity hardware, and the cost of storing and processing it is low considering the volume of data it can handle. Let's say we are receiving a huge number of XML files and flat files from a huge number of sources (for example, sensors from airplanes, logs from firewalls/IDS systems, tweets from Twitter, etc.) that need to be loaded at a rapid pace and processed; then Hadoop is an ideal candidate.
  • Hadoop clusters are highly scalable and can be scaled horizontally or vertically by adding more commodity hardware and nodes that are not expensive.
  • Hadoop clusters can also be made to handle streaming data: collect data from sensors, store the streaming data, play it back, etc.
  • Data redundancy is built into the cluster.

Cons:
  • A point to be noted here is that files stored on HDFS are in flat file format, and though you can use Hive or Impala to process them as structured data, there are some limitations to it. Hive SQL does not offer the full functionality found in traditional databases.
  • Updating records will be tough, and you will have to use a tool like HBase to get this update functionality.
  • The response time of Hadoop Hive queries is not going to be as fast as Netezza, because in the background these programs run MapReduce jobs that are not as fast as Netezza. Also, they are meant for huge volumes and used in places where response time is not that important, i.e., for batch jobs.
  • Administration of the cluster is not as simple as administration of Netezza.
Conclusion:

  • If you are doing ELT (Extract, Load, Transform) with huge data loads that can be stored as flat files on HDFS, then Hadoop is a good candidate. You can load all the files and then process or transform them.
  • Volume and scalability are also deciding factors, and Hadoop is ideal for big data purposes.

Netezza:


Pros:
  • It is a traditional data warehouse appliance that supports traditional SQL statements, i.e., all DML and DDL statements. Inserts and updates are all possible.
  • Very fast and made for gigabyte-scale data warehousing. Depending on the number of SPUs, it can handle gigabytes of data and process them at super fast speed. The data is distributed over multiple SPUs and they all work together to crunch the data.
  • Administration overhead is much lower compared to databases such as Oracle, DB2, etc.
Cons:
  • Mainly meant for structured data warehouse applications.
  • Probably cannot handle unstructured data of petabyte volumes as effectively as Hadoop.

Conclusion:

If you are building a traditional data warehouse where you also need to do ETL (Extract, Transform, Load), then Netezza is better, since you can insert, update, aggregate, etc. with ease and write your SQL for reporting and building cubes.

Redshift:


Pros:
  • Petabyte-scale computing with an on-demand cost model.
  • Traditional data warehousing and SQL are supported.
  • Can handle volumes even greater than Netezza, and it is built on the same MPP model as Netezza.
  • On the cloud, with minimal administration required.
  • Easily scalable: nodes (computing power) and space can be added with a few mouse clicks.
  • Easy to create data snapshots and bring the Redshift cluster up and down as and when required.

Cons:
  • It is on the cloud, so data has to be transported to the Redshift cluster.
  • Reading huge volumes over an ODBC/JDBC connection is still a problem.
  • If there is a lot of back-and-forth movement between the cluster and the local network, then it creates some latency.
  • You may have to get security clearance to move data out of your network.
Conclusion:
  • Good for ELT purposes where data is loaded to the Redshift cluster once, processing is done on the cluster, and aggregated/summarized data is read from the cluster, or where downstream applications source the data directly from the Redshift cluster.
  • Good for traditional data warehousing purposes.

Thursday, 18 December 2014

Java code to extract tweets since any date using Twitter4J API


/* Below is the code to extract tweets since any given date using the Twitter4J API */

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.auth.AccessToken;

        // CONSUMER_KEY and CONSUMER_KEY_SECRET are your app's consumer credentials;
        // the surrounding method should handle or declare twitter4j.TwitterException.
        Twitter twitter = new TwitterFactory().getInstance();
        twitter.setOAuthConsumer(CONSUMER_KEY, CONSUMER_KEY_SECRET);
        String accessToken = "asdasdsdsaasdsaedadasdadsadasd"; /* enter your twitter access token */
        String accessTokenSecret = "2020202-asdsadsadasdsadsadsadsa"; /* enter your twitter secret token */
        AccessToken oauthAccessToken = new AccessToken(accessToken, accessTokenSecret);
        twitter.setOAuthAccessToken(oauthAccessToken);
        Query query = new Query("Iron Man");
        Date date = new Date();
        // yyyy-MM-dd is the format Twitter expects for the since date (MM = month; mm would be minutes)
        String modifiedDate = new SimpleDateFormat("yyyy-MM-dd").format(date);
        query.setSince(modifiedDate);
        QueryResult result;
        do {
            result = twitter.search(query);
            List<Status> tweets = result.getTweets();
            for (Status tweet : tweets) {
                System.out.println("@" + tweet.getUser().getScreenName() + "|" + tweet.getText() + "|" + tweet.isRetweeted());
            }
        } while ((query = result.nextQuery()) != null);


To create twitter app account and get the access tokens, check out the link below:
http://dwbitechguru.blogspot.ca/2014/12/how-to-create-twitter-app-account-and.html

Tuesday, 16 December 2014

Issue with reading an Oracle CLOB column in Informatica and inserting to Netezza using the Netezza PowerCenter driver:



Issue: Informatica session failed with the below message when reading a table with a column of CLOB data type.

Severity    Timestamp    Node    Thread    Message Code    Message
ERROR    12/16/2014 11:40:43 AM        NZ_WRT_1_*_1    Net_1762    [ERROR] Failed to execute query : [INSERT INTO "MT TABLE"("MY_KEY", "KEY_DESC", "BATCH_KEY") ...........

The .nzbad file showed that all the columns after the CLOB column were going to the next line, i.e., the Netezza driver was not able to properly read the pipe (|) delimited file that it creates while inserting data into the target.

Resolution:
There can be two reasons for this issue: either your CLOB column is null, or it contains some newline characters. First I fixed the null issue, but the problem still persisted, which made me think that the CLOB column had some newline or carriage return characters. I fixed the issue by using the REPLACESTR function to replace all the newline and carriage return characters.

The expression is below:

IIF(isnull(CLOB_COLUMN_STRING),'  ', REPLACESTR (1,CLOB_COLUMN_STRING,CHR(10),CHR(13), ''))


CHR(10) and CHR(13) in the expression above stand for the newline and carriage return characters.



In the above example, CLOB_COLUMN_STRING is the name of the port that belongs to the CLOB table column. In Informatica, CLOB columns are read as the Text data type.

Note: Also, make sure the length of the Text port in Informatica for the CLOB column is the same as the length of the CLOB column defined in the Oracle table.

Thursday, 11 December 2014

How to create twitter app account and create twitter applications that can download/upload tweets and then load to Hadoop?


To create any Twitter application that can download or upload tweets, you first start by creating an application account on Twitter. Creating the Twitter application account is simple. Follow the steps below:

Step 1: Log in to your Twitter account and then open the dev.twitter.com URL. Go to the Manage Your Apps link on the dev.twitter.com site.


Step 2: Click on the Create App link.


Step 3: Enter the information in the Twitter create application form and press the Create Application button.
Step 4: Once you create the application, log in to the application and change the settings, access tokens, etc.


Step 5: Change the permission to read and write, so that your application can download and upload tweets. You will need the access tokens in the application you are building to access your Twitter account. Also, check the "Allow this application to be used to Sign in with Twitter" box on the settings tab.

 
Step 6: Use Java (twitter4j.org) or some other library available online to create your application that can do Twitter upload, search, and download.

The Java program I tried for reading tweets is from this link:

"http://www.javacodegeeks.com/2011/10/java-twitter-client-with-twitter4j.html"

You need to download the Twitter4J jar files from the www.twitter4j.org website and set up the jar files in your Eclipse project before you use the Java program in the link above.



Wednesday, 10 December 2014

How to create a Unix script to drop and truncate a Netezza table, or execute any SQL on Netezza

#!/bin/sh
# Unix script to drop or truncate a Netezza table
# enter your database name and table name below

dbname=exampledb
tblname=exampletbl

# use the line below to drop the table
nzsql $dbname -c "drop table $tblname"
# use the line below to truncate the table (use one or the other, not both)
nzsql $dbname -c "truncate table $tblname"

-----------------------------------------------------

#!/bin/sh
# Unix script to truncate a list of Netezza tables
# enter your database name below

dbname=exampledb
table_list=`nzsql $dbname -t -c "select TABLENAME from _V_TABLE where OBJTYPE = 'TABLE' and TABLENAME like 'yourtablelistprefix_%'"`

for table_name in $table_list; do
  echo "Truncating the table $table_name in database $dbname"
  nzsql $dbname -c "truncate table $table_name"
done

Tuesday, 9 December 2014

Netezza Groom Table to free table space - Netezza Administration

Netezza Groom Table command to reclaim table space -  Netezza Administration

In Netezza, when a record is updated, the original record is marked for deletion and a new record with the updated values is inserted. So basically it is doing a logical delete. The records marked for deletion are not visible when you do a simple select from the table; however, they are still there until you groom the table. The same thing happens when you delete records from a table: those records stay until the table is groomed. If a table is not groomed regularly it can impact performance, since those deleted records also need to be scanned. Hence, grooming tables is an essential part of Netezza administration.

If you want to remove those logically deleted records and free up space, then groom the table.

A simple groom command to reclaim all deleted records would be:


GROOM TABLE <tablename> RECORDS ALL;


The groom command to reorganize new data that is added to a previously groomed table is:

GROOM TABLE <tablename> RECORDS READY;


Note:
1) You do not have to groom a table after truncating it, since truncating a table deletes all the records permanently.

2) Groom is non-blocking; however, it is resource intensive, so it is better to schedule it weekly or monthly, or during backup operations.

3) Netezza groom replaces the reclaim operation, so you do not have to run a reclaim after groom.

You can find other options available for the groom command here:
https://www-304.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/r_dbuser_groom_table.html



Friday, 5 December 2014

How to evaluate a query performance in Amazon Redshift - Explain Plan and Understanding Performance


To understand how your query is performing, use the explain plan and check the disk/CPU usage of your query, along with a lot of other parameters. The explain command can be used to get the cost of the query and the execution plan. The execution plan will show those parts of the query that are very costly and need to be tuned. In the query below, you can see, just by looking at the numbers beside the cost, that the first two outer joins are very expensive and the last inner join is not that costly. It is not only because of the outer joins, but because of the amount of data redistribution happening in the query and the number of rows each of those joins is retrieving. The query below is retrieving 8 billion records in each of those joins, and the data redistribution is high.

explain select * from table1 a11 left outer join table2 a12 on (a11.EMPLOYEE_KEY = a12.EMPLOYEE_KEY) left outer join table3 a13 on (a11.name = a13.name) join table3 a14 on (a11.dept_key = a14.dept_key)


XN Hash Left Join DS_BCAST_INNER  (cost=683864529053.53..9912121211212999999999999996733616880411.00 rows=8226899613 width=1388)
  Hash Cond: ("outer".employee_key = "inner".employee_key)
  ->  XN Hash Left Join DS_DIST_BOTH  (cost=683646717104.93..999999121211121999999999967336168804116 rows=8226899613 width=1372)
        Hash Cond: ("outer".name = "inner".name)
        ->  XN Hash Left Join DS_DIST_NONE  (cost=346363.65 rows=822689 width=1372)
              Hash Cond: ("outer".dept_key = "inner".dept_key)

 
Data is distributed among various compute nodes in Amazon Redshift, and the amount of data movement between nodes plays a large part in query performance. If you see a lot of DS_DIST_NONE in your query, then you have the least amount of data distribution; if you see other keywords like DS_DIST_BOTH or DS_BCAST_INNER, then more data redistribution is happening and the performance can be bad.

The meaning of the explain plan attributes, as per the Amazon Redshift documentation, is below:

DS_BCAST_INNER - A copy of the entire inner table is broadcast to all compute nodes.
DS_DIST_ALL_NONE - No redistribution is required, because the inner table was distributed to every node using DISTSTYLE ALL.
DS_DIST_NONE - No tables are redistributed: collocated joins are possible because corresponding slices are joined without moving data between nodes.
DS_DIST_INNER - The inner table is redistributed.
DS_DIST_ALL_INNER - The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
DS_DIST_BOTH - Both tables are redistributed.
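One common way to turn a DS_DIST_BOTH or DS_BCAST_INNER step into DS_DIST_NONE is to give the joined tables the same distribution key. A minimal sketch, with hypothetical definitions of table1 and table2 distributed on the employee_key join column:

-- distribute both tables on the join column so the join can be collocated
create table table1 (employee_key bigint, name varchar(100), dept_key int) distkey(employee_key);
create table table2 (employee_key bigint, salary decimal(12,2)) distkey(employee_key);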



The disk usage for the above query was reaching close to 100% (the disk usage can be seen from the Amazon Redshift web console in the Performance tab). That is not good news at all, since that single query was consuming all the disk space, and it is an indication of poor query performance. This query was run on a dw2.8xlarge cluster, which is a large node type with relatively high disk space. You can also see the CPU usage from the Amazon Redshift web console.





Another way to check how your query is performing is to check whether there are a lot of writes to disk, using a query such as the one below (2044 is the query id of the query):

select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = 2044 order by workmem desc;
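To relate that query id back to the SQL text and its overall runtime, a small sketch against the stl_query system table (2044 is the same query id as above):

select query, starttime, endtime, trim(querytxt) as sql_text
from stl_query
where query = 2044;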

Monday, 1 December 2014

Unlocking Informatica Mapping/Folders




If you connect to a folder in Informatica and get a message that some user has a write-intent lock on a certain mapping and you can't edit it, then you need to either ask that user to disconnect from the folder so that the lock is released, or ask the administrator to unlock the mapping for you.


Step 1: The administrator can unlock it by going to the admin console and selecting the repository to which the folder belongs.

Step 2: On the right hand side you will see the Connections and Locks tab. Click on it and select the user who has locked your folder. In the top right hand corner you will see a button (shown in the screenshot below) that should allow you to disconnect that user's connection. Click it and the object is unlocked.