Sunday 30 November 2014

Why Hadoop? 8 good reasons to adopt it!

Why Hadoop?

 

Here are 8 reasons why organizations are adopting Hadoop:

1) Big data: The Hadoop framework supports extremely large data sets and is therefore well suited for big data workloads. Hadoop can be used with structured, semi-structured, or unstructured data, handles terabyte-scale data volumes, and is highly scalable.

2) Cost effective - commodity hardware: The Hadoop framework runs on commodity hardware, i.e. it does not need vendor-specific hardware like Teradata or Netezza appliances. This makes it cheap and easy to upgrade or scale. Almost any computer can be made part of a Hadoop cluster.

3) Scalability: The Hadoop framework is highly scalable both vertically and horizontally. Vertical scalability refers to upgrading a machine by adding more hardware, and horizontal scalability refers to adding more machines. Moreover, the machines in a Hadoop cluster can be heterogeneous, i.e. they do not all have to come from the same company/vendor.

4) Parallel/distributed processing:
The Hadoop framework supports distributed processing. The data is distributed across the Hadoop cluster, which consists of many machines, and the data is processed in parallel on multiple machines using the MapReduce programming model.

5) Redundancy: Any data file in Hadoop is broken into small blocks and stored on multiple machines. In addition, multiple copies of each block (three by default) are kept on different machines. If any machine goes down, the data can still be fetched from another machine (see the replication sketch after this list).

6) Archiving and exploratory analysis: Data from different data warehouses can be extracted and stored as files on HDFS for archiving purposes. Since HDFS runs on commodity hardware, it is a cheap and effective way to archive data. Hadoop can also be used to store files for initial exploratory analysis before a proper data warehouse is built.


7) Streaming data: Hadoop tools such as Apache Flume can be used to read streaming data and store it, for example in XML or JSON format, directly on HDFS for analysis.

8) Unstructured data: Traditional data warehouses are very good at dealing with structured data. Hadoop also handles unstructured data well, for example data in the form of flat files, XML, or JSON files.
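To illustrate the redundancy point in reason 5, here is a minimal sketch of how you might inspect and change the replication factor of a file on HDFS; the path /user/data/sample.txt is just a hypothetical example.

# Show the blocks, replica count, and locations for a file
hdfs fsck /user/data/sample.txt -files -blocks -locations

# Change the replication factor of the file to 3 (-w waits until replication completes)
hadoop fs -setrep -w 3 /user/data/sample.txt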


Some of the challenges organizations face when adopting Hadoop:


a) Finding well-trained and experienced professionals to build and develop on the Hadoop framework.


b) Finding suitable use cases and training the business to ask questions that Hadoop can answer.


Some of the big Hadoop vendors are:


Cloudera
Hortonworks
IBM
EMC
Intel
MapR
Oracle


Check out other articles on same topic:

http://dwbitechguru.blogspot.ca/2014/11/how-to-load-local-data-file-into-apache_25.html 
 

How to truncate or delete records in an Apache Hive table and remove files from HDFS?



As you know, in Apache Hive the data is stored as files on HDFS, and the Hive tables are just metadata used to read data from those files in a table format.

So when you want to truncate a table in Hive, you are basically removing all the files on HDFS for that table.

Try the below command:

Hive>  use mydatabase;
Hive> truncate table mytable;

In the above commands, mydatabase and mytable are example database and table names. Replace them with your own database and table name.
 

If the above commands do not work, you can delete the files that the table refers to directly from HDFS.

FYI: If you want to update records then check out this LINK (click HERE)

To find the location of the files, go to your NameNode UI and use the file browser to check the path where the files for the table are located.
The NameNode UI link is usually in the format http://<namenode_host>:50070/. In a multi-node cluster your NameNode might not be localhost, so get the NameNode link from your admin.

In my example, the files for my table are stored in /user/hive/warehouse/mydatabase/mytable. 
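Alternatively, you can ask Hive itself for the table location instead of browsing the NameNode UI. A quick check, reusing the example database and table names from above:

hive -e "use mydatabase; describe formatted mytable;" | grep -i location

The Location line in the output shows the HDFS path backing the table.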



Now use the below hadoop command to delete all the files belonging to that table from HDFS. This effectively truncates the table in Hive.

hadoop fs -rm /user/hive/warehouse/mydatabase/mytable/*


If you want to delete only a few files from HDFS, you can use the same command with the file name in place of the wildcard. This is roughly equivalent to deleting records in Hive. For example:


hadoop fs -rm /user/hive/warehouse/mydatabase/mytable/file1.txt


Hope this helps. If you have any questions, contact me.

Check also the below link to load files to Hive tables:

http://dwbitechguru.blogspot.ca/2014/11/how-to-load-local-data-file-into-apache_25.html

Friday 28 November 2014

How to connect to an Amazon Redshift cluster using psql in a Unix script?

How to connect to an Amazon Redshift cluster using psql in a Unix script and execute a SQL statement?


You first need to define the following variables in your Unix script (with appropriate values), and then follow them with the psql command containing whatever statement you need to run against the Amazon Redshift cluster:

AWS_ACCESS_KEY=<ENTER ACCESS KEY>
SECRET_ACCESS_KEY=<ENTER SECRET ACCESS KEY>

db=default
AWS_SERVER=mytest-cluster2.adsadsadh.us-east-1.redshift.amazonaws.com

username=admin

psql -h $AWS_SERVER -p 8443 -d $db -U $username  -c  "TRUNCATE TABLE PUBLIC.TABLENAME"



Note: if you have AWS_ACCESS_KEY and SECRET_ACCESS_KEY defined as environment variables, then you can use the psql command as shown below, entering the cluster address directly instead of using variables:


localhost> export AWS_ACCESS_KEY=<ENTER ACCESS KEY>
localhost> export SECRET_ACCESS_KEY=<ENTER SECRET ACCESS KEY>
localhost> psql -h mytest-cluster2.adsadsadh.us-east-1.redshift.amazonaws.com  -p 8443 -d default -U $username  -c  "TRUNCATE TABLE PUBLIC.TABLENAME"
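If you do not want psql to prompt for the database password inside the script, one common approach is to export PGPASSWORD (or use a ~/.pgpass file) before calling psql. The values below are placeholders:

localhost> export PGPASSWORD=<ENTER DB PASSWORD>
localhost> psql -h mytest-cluster2.adsadsadh.us-east-1.redshift.amazonaws.com -p 8443 -d default -U $username -c "select count(*) from public.tablename"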

Thursday 27 November 2014

My Top 5 cheap Android tablets for less than $150 in Canada

My Top 5 cheap Android tablets for less than $150 in Canada.


1) Samsung 7" 8GB Galaxy Tab3 Lite Tablet - White


Pros:
Samsung is a better brand
Samsung apps and better quality LCD display
1GB of RAM and a 1.2GHz dual core Marvell PXA986 processor
Front and back facing camera


Cons:
Only 8GB of storage.


   
  





2) Asus 7" 16GB MeMO Pad Tablet With Wi-Fi - Black



Pros:
7 inch WSVGA touchscreen LCD display
1.2GHz Intel Clover Trail Plus Z2520 Dual Core processor with 1GB of RAM offers great multitasking capabilities
16GB EMMC storage offers plenty of room for files, photos, videos, and more
Front and Rear facing camera
Android 4.3 (Jelly Bean)







3) Google Nexus 7 by ASUS 32GB 7" Tablet with Wi-Fi (1B32-CB)


Pros:
Cheaper than the first two tablets in this list.
32 GB storage
NVIDIA Tegra 3 processor with 1.3 GHz processor speed


Cons:
Only front facing camera
IPS Capacitive LED





4) Le Pan II 9.7" 8GB Tablet with Wi-Fi - English

 
Pros:
9.7 inch LCD display, bigger than all the other tablets in this list.
Android 4.0 (Ice Cream Sandwich) operating system with Adobe Flash support
Qualcomm APQ8060 1.2 GHz processor


Cons:
No Rear Camera
Only 8GB of storage



 

5) Acer Iconia B1-710 7" 8GB Android 4.1 Tablet With MTK Dual Core Processor - White


Pros:
Better-known brand than the lesser-known ones.
MTK dual core processor and 16GB of flash storage


Cons:
Only Front facing camera


 






Also check out the tablets below for other cheap options under $150:



a) Kobo Arc 7HD 7" 32GB Android 4.2.2 Tablet With NVIDIA Tegra 3 Processor - Black



b) Le Pan TC082A 8" 8GB Android 4.2.2 Tablet With Cortex-A7




I have owned the Le Pan TC082A for a year now and it is pretty good. Check out my review at the link below:

http://dwbitechguru.blogspot.ca/2014/10/review-le-pan-tc082a-8-8gb-android-422.html 

Microsoft Excel: How to split comma-delimited (or pipe-, colon-, etc. delimited) text in one column into separate columns?

How to split comma-delimited (or pipe-, colon-, etc. delimited) text/data in one column into separate columns in Microsoft Excel?


In the example below, all the data is in the first column. Now how do you split it into separate columns in Microsoft Excel? Easy. Look at the next screen.






Select the first column, go to the Data tab, and press the Text to Columns button; you will see the Convert Text to Columns wizard. Select the Delimited file type, press Next, and select comma as the delimiter. If you have a pipe-delimited file, select | as the delimiter, then follow the remaining steps in the wizard.



How to connect to Hadoop Hive using Beeline?


Apache Beeline is a Hive command-line client that can be used to connect to HiveServer2 and execute HiveQL statements. Beeline uses JDBC to connect.



A sample Beeline session connecting to the Hive server is shown below:

localhost> beeline

beeline> !connect jdbc:hive2://localhost:10000
scan complete in 8ms
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: admin
Enter password for jdbc:hive2://localhost:10000:
Connected to: Apache Hive (version 0.12.0-cdh5.1.2)
Driver: Hive JDBC (version 0.12.0-cdh5.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show databases;
+----------------+
| database_name  |
+----------------+
| sample         |
| default        |
| employee       |
| operational    |
| test_hive_db   |
+----------------+
5 rows selected (0.52 seconds)
0: jdbc:hive2://localhost:10000> use test_hive_db;

0: jdbc:hive2://localhost:10000> select * from employee;
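Beeline can also be run non-interactively, which is handy in shell scripts. A minimal sketch (the connection URL, user name, password, and query below are just example values):

localhost> beeline -u jdbc:hive2://localhost:10000 -n admin -p mypassword -e "show databases;"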

Check the link below to load files to hive:
http://dwbitechguru.blogspot.ca/2014/11/how-to-load-local-data-file-into-apache_25.html 

To truncate apache hive tables and delete files from hdfs:
 http://dwbitechguru.blogspot.ca/2014/11/how-to-truncatedelete-apache-hive-table.html




More details about Beeline are available at the link below:

https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients




How to change the delimiter character to pipe or semicolon in a Microsoft Excel CSV file? Also, how to change digit grouping and other format settings.

How to change the delimiter character to pipe or semicolon, and other format settings such as digit grouping, in a Microsoft Excel CSV file?


Every time you save a file as a .csv (comma separated) file in Microsoft Excel, the file will have a comma (,) as the delimiter character. In some instances, you may want to change the delimiter to ; or | or another character. In those cases, use the instructions below to change the delimiter settings. Similarly, in the last screen shown below, you can see that you can change other format settings.

Step 1: Go to Region and Language in your Control Panel.


Step 2: Press the Additional settings button in the Region and Language window.


Step 3: In the Numbers tab, you will see a field called List separator. In that field, enter the character you want to use as your delimiter. In the example below, I have set it to ; (semicolon). Press Apply and you are good to go. Also, in the screen below you can see that you can change other format settings such as digit grouping.

After you are done with these setting changes, go back to Microsoft Excel and try saving the file in .csv format. The file will now be saved with ; (semicolon) as the delimiter.

Tuesday 25 November 2014

How to load a local data file into Hadoop Apache Hive? Example of tweets getting loaded into Hive.

How to load a local data file into Apache Hive? Example of tweets getting loaded into Hive.



What is Apache Hive?
Apache Hive is a data warehouse system originally built by Facebook that runs on top of Hadoop/HDFS, i.e. the actual data is stored as files on HDFS and Hive is used to query and manage that data. Hive stores metadata such as table structures and table definitions in its metastore database, which is usually Derby or MySQL. You can run SQL-like queries, called HiveQL, against the Hive data warehouse just as in other SQL-based data warehouses.


In the example below, a sample tweets file is loaded into hive.

Step1: Download the sample file of Sachin Tendulkar tweets and save it to a local folder on your Unix system. Let's say you saved it in the /home/desktop folder. The sample tweet file captures only three fields: the screen name, the tweet text, and a field indicating whether the tweet was retweeted.


The sample tweets file can be found at the link below. Copy and save the file as sampletweets.txt in /home/desktop. If you save it in another folder, change the folder path in Step 3.

http://dwbitechguru.blogspot.ca/2014/11/sample-tweets-file-of-sachine-tendular.html

Step2: Create a table in Hive. A sample DDL is below. This DDL covers the three fields that are in the sample file. If you have more fields in your sample file, add those fields to the table. The sample file is pipe delimited. You can execute the SQL below from AQT or from the Hive command-line interface.


CREATE database twitter;

use twitter;

CREATE TABLE if not exists twitter.tweets
(
ScreenName VARCHAR(100),
Text VARCHAR(300),
isRetweeted VARCHAR(10)
)
row format delimited fields terminated by '|'
;


Step3: Execute the hive command below to load the sample file of Sachin Tendulkar tweets into the table created in Step 2.

hive -e "use twitter; load data local inpath '/home/desktop/sampletweets.txt' into table twitter.tweets;"
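Once the load completes, a quick sanity check like the one below (using the same database and table names as above) confirms the rows landed:

hive -e "use twitter; select count(*) from twitter.tweets; select * from twitter.tweets limit 5;"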

Sample pipe-delimited tweets file of Sachin Tendulkar for the Hive load (copy and save it as sampletweets.txt):

@ajaymago|@sachintendulkar book launch http://t.co/whI16I803w|false
@Shrutanwita|#SaritaDevi Boxing Row: #SachinTendulkar to meet sports minister Sarbananda Sonowal tomorrow#ndtvupdate|false
@LurthasJo|RT @PDdancing: Missing you #sachintendulkar http://t.co/kPK4JKCR1R|false
@MedhuntMbbs|RT @GK4APSPSC: *Legendary #SachinTendulkar launches his #autobiography 'Playing It My Way'.-Presented first copy to his mother. " http://t…|false
@imricky19|It is very unfortunate @BCCI for ruining cricket in INDIA by spot fixing. @SACHINTENDULKAR interrupt in this matter..|false
@AbhigyanShekhar|http://t.co/4n8OO8qDEL. The best piece of statistical analysis on the two distinct great phases of #SachinTendulkar career.|false

Friday 21 November 2014

Agile Development Methodologies- Scrum



Scrum is an agile project management methodology that provides a framework for team members to collaborate and implement a project in smaller pieces, which enables users to provide feedback and respond to changing requirements. Each smaller cycle is called a sprint. Each sprint consists of a set of requirements (user stories), also called the sprint backlog, that are implemented and result in working software at the end of each sprint. In the next sprint, another set of requirements is added to the software. The three main actors in the Scrum methodology are the product owner, the development team, and the scrum master.














Important points of Scrum are:
  • Focuses more on project management.
  • Based on a short daily meeting, known as the scrum, with all team members.
  • Iterations are called sprints.
  • A sprint is a fixed development period with a set of work items to implement.
  • Planning occurs between sprints.
  • The Product Owner determines what needs to be built in the sprint.
  • The Development Team builds the actual product and consists of developers, testers, etc.
  • The Scrum Master is responsible for keeping the scrum process running as smoothly as possible and continually helps improve the process.

Implementing Agile Scrum Projects Involves:

1) Selecting tasks for each iteration and balancing the amount of work (user stories to be included) in each iteration, based on customer priorities, tasks that developers can complete, task redistribution, and sprint duration.

2) Being co-located to engage in constant communication, and holding regular scrum meetings to discuss blockers and the daily plan for the project. Scrum meetings are usually held daily.

3) Monitoring the project tasks using tools such as charts and graphs.

4) Managing risk, user feedback, scope, quality, and change throughout the project.

See also:
http://dwbitechguru.blogspot.ca/2014/07/difference-between-waterfall-and-agile.html

Wednesday 19 November 2014

Netezza NZSQL, NZLOAD, NZ_MIGRATE, NZUNLOAD utility to Extract - Load - Migrate Files/Data


Extract - Load - Migrate Files/Data to/from Netezza


It is very easy in Netezza to extract data from Netezza tables, load data from files into Netezza target tables, or migrate data from one Netezza database to another. Below are a few commands that you can use for this purpose.

NZSQL utility: This utility can be run from any machine on which it is installed, i.e. it does not have to be run from the Netezza box itself. It can be used to run any SQL statement, such as a select or insert, against the Netezza database. In the usage below you see data extracted from a table into an output file, and in the second nzsql command the output is also gzipped.

Usage:

nzsql -host <netezza_db_servername> -d <database> -u <username> -pw <password> -c  "select * from tablename"  -o /root/home/outputfilename.txt;


nzsql -host <netezza_db_servername> -d <database> -u <username> -pw <password> -c  "select * from tablename" -F "|" -A -q -t | gzip > /root/home/outputfilename.txt.gz;

nzsql -host <netezza_db_servername> -d <database> -u <username> -pw <password> -c  'insert into tablename values (1 ,2 )'  -o /root/home/outputfilename.txt;

----------------------------------------------------------------------------------------------------

NZLOAD utility: This utility can be used to load data from files into Netezza target tables. The nzload command usage is below. In the example, a flat file delimited by | is loaded into the table specified in <target_table_name>. The user name and password can also be provided to this utility.

nzload -host <netezzahost> -db <database> -u <user name>
       -pw <password> -delim '|' -t <target_table_name> -df /root/home/outputfilename.txt



----------------------------------------------------------------------------------------------------

NZ_MIGRATE utility: This utility is used to migrate data from a source database to a target database. All the tables mentioned in the table list parameter are migrated to the target database. There are options to create target tables or truncate target tables before migrating the data. The nz_migrate utility performs the migration by running many parallel threads and hence is faster than nzload.


nz_migrate command usage is below:

./nz_migrate -shost <source_host> -thost <target_host> -sdb <source_database> -tdb <targetdatabase> -suser <source_user> -spassword <source_password> -tuser <target_user> -tpassword <target_password> -t <table1, table2, ...> -cksum fast -genStats Full  -TruncateTargetTable YES >> $log


./nz_migrate -shost <source_host> -thost <target_host> -sdb <source_database> -tdb <targetdatabase> -suser <source_user> -spassword <source_password> -tuser <target_user> -tpassword <target_password> -t <table1, table2, ...> -cksum fast -genStats Full  -CreateTargetTable YES >> $log

----------------------------------------------------------------------------------------------------

NZ_UNLOAD utility: If, instead of nzsql, you want to extract data from a Netezza table faster by running multiple threads in parallel, then use the nz_unload utility. The nz_unload command usage is below:

./nz_unload -sql '"select * from tablename"' -file /nz/dba/output/outputfilename.txt

----------------------------------------------------------------------------------------------------

External tables: Another way to migrate data is to create an external table and insert the data from the external table into the target table. External table usage would look as shown below:

CREATE EXTERNAL TABLE '/tmp/export.csv' USING (DELIM ',') AS
SELECT * from <TABLENAME>;


INSERT INTO <targettable> SELECT * FROM external '/tmp/export.csv'
USING (DELIM ',');


Monday 17 November 2014

Performance tuning in amazon redshift - Simple tricks (table/query design)

Performance tuning in amazon redshift - Simple tricks


The performance of a query in Amazon Redshift, just like in any database, depends on how well the query is optimised, the design of the tables, the distribution and sort keys, the type of cluster (number of nodes, disk space, etc.) which is basically the hardware supporting Redshift, concurrent queries, number of users, etc. The good news is that, compared to a database such as Oracle, performance tuning is not that hard. Amazon Redshift is a petabyte-scale, massively parallel database, and very high performance can be achieved with simple configuration steps. In this blog, we explain in detail how to achieve high performance in Amazon Redshift.

INVESTIGATION:
To give an idea of the performance issues we were facing, have a look at the cost of the query and the disk usage that resulted from the queries we were running on Redshift. The EXPLAIN command can be used to get the cost of a query and its execution plan. The execution plan will show the parts of the query that are very costly and need to be tuned. In the plan below, just by looking at the cost numbers you can see that the first two outer joins are very expensive and the last inner join is not that costly. This is not only because of the outer joins, but because of the amount of data redistribution happening in the query.


explain select * from table1 a11 left outer join table2 a12 on (a11.EMPLOYEE_KEY = a12.EMPLOYEE_KEY) left outer join table3 a13 on (a11.name = a13.name) join table3 a14 on (a11.dept_key = a14.dept_key)


XN Hash Left Join DS_BCAST_INNER  (cost=683864529053.53..9912121211212999999999999996733616880411.00 rows=8226899613 width=1388)
  Hash Cond: ("outer".employee_key = "inner".employee_key)
  ->  XN Hash Left Join DS_DIST_BOTH  (cost=683646717104.93..999999121211121999999999967336168804116 rows=8226899613 width=1372)
        Hash Cond: ("outer".name = "inner".name)
        ->  XN Hash Left Join DS_DIST_NONE  (cost=346363.65 rows=822689 width=1372)
              Hash Cond: ("outer".dept_key = "inner".dept_key)

 
Data is distributed among the various compute nodes in Amazon Redshift, and the amount of data movement between nodes plays a large part in query performance. If you see a lot of DS_DIST_NONE in your query plan, then you have the least amount of data redistribution; if you see other attributes like DS_DIST_BOTH or DS_BCAST_INNER, then more data redistribution is happening and performance can be poor.

The meaning of the explain plan attributes, as per the Amazon Redshift documentation, is given below:

DS_BCAST_INNER - A copy of the entire inner table is broadcast to all compute nodes.
DS_DIST_ALL_NONE - No redistribution is required because the inner table was distributed to every node using DISTSTYLE ALL.
DS_DIST_NONE - No tables are redistributed: collocated joins are possible because corresponding slices are joined without moving data between nodes.
DS_DIST_INNER - The inner table is redistributed.
DS_DIST_ALL_INNER - The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
DS_DIST_BOTH - Both tables are redistributed.


The disk usage for the above query looked like the chart below. Disk usage can be seen from the Amazon Redshift web console in the Performance tab. The disk space usage was reaching close to 100%, which is not good news at all, since that single query was consuming all the disk space. This is an indication of poor query performance. This query was run on a dw2.8xlarge cluster, which is a large node type with relatively high disk space.




Another way to check whether your query is performing poorly is to see whether there are a lot of writes to disk, using a query such as the one below (2044 is the query id of the query being investigated):

select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = 2044 order by workmem desc;



SOLUTION:

We had to make a lot of changes to fix the performance issues. All the factors that influence query performance, and the changes required, are discussed below:

1) Table design (Data distribution and sorting) - In Redshift the data is distributed across all the nodes. Whenever it executes a query it has to bring data from the different nodes and use it in joins, aggregations, sorting, group by, etc. If the amount of data that has to be grabbed from these nodes is high, it results in a lot of traffic and poor query performance. To reduce the amount of traffic between nodes, the data has to be distributed properly between the nodes. This can be achieved through proper table design, data compression, the distribution key, sort keys, and query joins. First let's look at the table design. Below is the table create statement. The distribution style (DISTSTYLE), distribution key (DISTKEY), and sort keys are very important in defining how the data is distributed among the nodes.

CREATE [ [LOCAL ] { TEMPORARY | TEMP } ] TABLE table_name
( column_name1 data_type [ ENCODE encoding-type ] [, ...] )
[ DISTSTYLE { EVEN | ALL | KEY } ]
[ DISTKEY ( distkey_identifier ) ]
[ SORTKEY ( sortkey_identifier [, ...] ) ]



DISTSTYLE ALL puts a copy of the data on all the nodes and makes it available for joins without having to move data between nodes. This is ideal for dimension tables that do not have many records. DISTSTYLE KEY distributes data based on a defined key column and is better for tables that are huge, where you can use a column that is used in joins to distribute the data. DISTSTYLE EVEN is used mostly to distribute the data evenly, and when you can't decide between ALL and KEY.

*Trick*: USE DISTSTYLE ALL for all the dimension tables that are not big.

DISTKEY (column name) - The column used here should be one that is used in joins, and a distribution key should be defined for all the large fact tables. The distribution key (distkey) helps distribute the data based on the distkey column, so that during joins only the required data is brought to each node. Only one distkey (distribution key) column can be defined per table.

*Trick*: Define Distkey for all the large fact tables and the dist key column should be the one used in the joins.

If the distribution key of a large fact table is joined to any column of a dimension table that is in DISTSTYLE ALL, then in your explain plan you will see the attribute DS_DIST_ALL_NONE, which means there is no data redistribution and the cost of that part of the query is very low. This is the state you want most of the query to be in for best performance.

SORT KEY (column_name1, ...): There can be multiple sort keys. Sort keys help a lot in storing the data on the nodes in sorted order, which helps with group by and order by operations.

*Trick*: If there are columns such as date or timestamp that you use in order by operations, then define those as sort keys.
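Putting the diststyle, distkey, and sortkey tricks together, a table pair might look like the sketch below. The table and column names are hypothetical; what matters is where DISTSTYLE, DISTKEY, and SORTKEY go in the DDL.

CREATE TABLE dim_employee
(
employee_key INTEGER,
name VARCHAR(100),
dept_key INTEGER
)
DISTSTYLE ALL;

CREATE TABLE fact_sales
(
sale_id BIGINT,
employee_key INTEGER,
sale_date DATE,
amount DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (employee_key)
SORTKEY (sale_date);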

2) Table design (Compression) - In Amazon Redshift tables, the encoding used for compression can be defined along with the column name, in the format shown below. There are various encodings available.

CREATE TABLE table_name (column_name
            data_type ENCODE encoding-type)[, ...]

*Trick*: An easy way to decide on the encoding, if you are using the copy command to load data from S3 to Redshift, is to set the COMPUPDATE ON option in the copy command as shown below. The copy command then chooses the best compression for the columns it is loading data into.

copy <tablename> from 's3://mybucket/files.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' COMPUPDATE ON

If you want to explicitly define the encoding, for example when you are inserting data from another table or set of tables, then load some 200K records into the table and use the command ANALYZE COMPRESSION <tablename> to have Redshift suggest the best compression for each column. You can use those suggestions while recreating the table.


3) Table design (Constraints) - Primary keys and foreign keys, though not enforced by Redshift, are apparently used by the query planner to execute queries more efficiently. Hence, define those keys in your create table statements for better performance.
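For example, these informational constraints can be declared right in the DDL (the table and column names below are hypothetical and continue the earlier sketch); Redshift does not enforce them, but the planner can take advantage of them:

CREATE TABLE fact_orders
(
order_id BIGINT PRIMARY KEY,
employee_key INTEGER REFERENCES dim_employee (employee_key),
order_date DATE
);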


4) Query design -
Now that we have discussed table design, the actual query design should make use of all those table design features. As discussed before, the trick in designing good queries is to have proper joins, i.e. join the distribution key of the large fact table to a column of a dimension table that is in DISTSTYLE ALL. If the dimension table is not in DISTSTYLE ALL, then use its distkey (distribution key) in the join.
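Continuing with the hypothetical tables from the earlier sketch, a query that keeps redistribution low might look like the one below: the fact table's distkey (employee_key) joins to the DISTSTYLE ALL dimension, so the explain plan should show DS_DIST_ALL_NONE rather than DS_BCAST_INNER or DS_DIST_BOTH.

select d.name, sum(f.amount) as total_amount
from fact_sales f
join dim_employee d on f.employee_key = d.employee_key
where f.sale_date >= '2014-01-01'
group by d.name;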

5) WLM queue settings:

The WLM (workload management) queue settings in your cluster configuration determine how many queries can run concurrently on your cluster and are an essential part of performance tuning. Keeping concurrency low means fewer queries run at the same time, and each query gets more memory. Set the concurrency and memory settings for each user group in the cluster for better performance.
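As a rough illustration, WLM queues are defined through the cluster parameter group (the wlm_json_configuration parameter). The snippet below is only a sketch of the shape of that JSON, with made-up user groups and numbers; check the Redshift documentation for the exact keys supported on your cluster version.

[
  {"user_group": ["etl"], "query_concurrency": 3, "memory_percent_to_use": 60},
  {"user_group": ["reporting"], "query_concurrency": 5, "memory_percent_to_use": 40}
]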





6) COPY command performance tuning:
If you want your copy command to run faster, split the input into multiple files so that they can be loaded in parallel into the Redshift database.
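A minimal sketch of that idea, assuming the data starts in a local file and lands in an S3 bucket named mybucket (placeholder credentials as in the earlier copy example): split the file into parts that share a common prefix, upload the parts, and point COPY at the prefix so Redshift loads them in parallel.

# split the big file into 1M-line parts that share the prefix bigfile_part_
split -l 1000000 bigfile.txt bigfile_part_

# upload only the split parts to S3 (assumes the AWS CLI is configured)
aws s3 cp . s3://mybucket/load/ --recursive --exclude "*" --include "bigfile_part_*"

copy <tablename> from 's3://mybucket/load/bigfile_part_'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' COMPUPDATE ON;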


All the steps discussed above should help in running queries more efficiently. If you need more details, refer to the AWS link:
http://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables.html


Also check:
http://dwbitechguru.blogspot.ca/2014/11/amazon-redshift-top-concerns.html 

Sunday 16 November 2014

Custom Properties in Informatica Session Config Properties for XML targets and parsing


a)  XMLAnyTypeToString

Noticed that when I was passing the below XML string to a Web Service Consumer transformation in Informatica, characters such as <, ", and & were getting replaced by entities such as &amp; and the XML string was not sent to the web service transformation in the proper format. I later fixed this by setting XMLAnyTypeToString=Yes in the session properties as shown in the screenshot. The XMLAnyTypeToString=Yes setting makes these characters pass through as-is.
 
'<Batch OnError="Continue">
 <Method ID="1" Cmd="Update">
<Field Name="ID">1</Field>
<Field Name="Name">John</Field>
</Method>
 </Batch>'




 
Other XML customization-related settings that are available are listed below and can be set as shown in the screenshot above:

b) WriteNullXMLFile

If you do not want an XML output file to be generated when there is no input data, use the session property WriteNullXMLFile=No. If you do not set this property, a default XML file with some metadata is created when there is no input data.

c)  SuppressNilContentMethod

If the child tags have no values, you might see parent tags with no child tags, depending on the properties you have set for the XML target. If you want to suppress those empty parent tags, use the session property SuppressNilContentMethod=ByTree.
 

d) XMLWarnDupRows

Similarly, if you are seeing huge session logs being generated from duplicate row warnings, then setting the session property XMLWarnDupRows=No will remove those warnings from the session log.

More properties can be found here: http://www.cifconsult.com/?p=743

More XML customizations in Informatica can be done in the session - mapping - target / XML parser transformation configuration: the settings shown in the yellow highlights decide the outcome of your XML.