
Thursday, 7 July 2016

Useful Queries for Troubleshooting Amazon Redshift


Here are some of the queries I use for troubleshooting in Amazon Redshift. I have collected these from different sources.

TO CHECK LIST OF RUNNING QUERIES AND USERNAMES:

select a.userid, cast(u.usename as varchar(100)), a.query, a.label, a.pid, a.starttime, b.duration,
b.duration/1000000 as duration_sec, b.query as querytext
from stv_inflight a, stv_recents b, pg_user u
where a.pid = b.pid and a.userid = u.usesysid




select pid, trim(user_name), starttime, substring(query,1,20) from stv_recents where status='Running'

TO CANCEL A RUNNING QUERY:

cancel <pid>


You can get the pid from one of the queries above used to check running queries.
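For example, if one of the queries above returned a pid of 18764 (a made-up value here), the cancel would look like the statements below. Redshift also accepts an optional message along with the cancel:

cancel 18764;
cancel 18764 'Canceling long-running query';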


TO LOOK FOR ALERTS (replace 1011 below with the query id you are investigating):

select * from STL_ALERT_EVENT_LOG
where query = 1011
order by event_time desc
limit 100;
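If you want a summary of recent alerts across all queries rather than a single query id, a sketch along these lines (using the event, solution, and event_time columns of STL_ALERT_EVENT_LOG) should also work:

select trim(event) as event, trim(solution) as solution, count(*) as occurrences
from stl_alert_event_log
where event_time >= dateadd(day, -7, current_date)
group by 1, 2
order by occurrences desc;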


TO CHECK TABLE SIZE:

select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,
trim(a.name) as Table, b.mbytes, a.rows
from ( select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name ) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
order by b.mbytes desc, a.db_id, a.name;
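On clusters that have the SVV_TABLE_INFO system view, a shorter sketch along these lines gives similar size and row information:

select "schema", "table", size as mbytes, tbl_rows
from svv_table_info
order by size desc;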


TO CHECK FOR TABLE COMPRESSION:

analyze <tablename>;
analyze compression <tablename>;
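Note that analyze compression only reports the suggested encodings; to apply them you normally recreate the table with explicit encode clauses and reload it. A rough sketch, with made-up table name, columns, and encodings:

create table mytable_new
(
id bigint encode delta,
name varchar(100) encode lzo,
created_date timestamp encode delta32k
);

insert into mytable_new select * from mytable;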



TO ANALYZE ENCODING:

select "column", type, encoding
from pg_table_def where tablename = 'biglist';



TO CHECK LIST OF FILES COPIED:

select * from stl_load_errors

select * from stl_load_commits


select query, trim(filename) as file, curtime as updated, *
from stl_load_commits
where query = pg_last_copy_id();
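To just count the files loaded by the last COPY in the current session, a small variation of the query above can be used:

select count(distinct filename) as files_loaded
from stl_load_commits
where query = pg_last_copy_id();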


TO CHECK LOAD ERRORS:

select d.query, substring(d.filename,14,20),
d.line_number as line,
substring(d.value,1,16) as value,
substring(le.err_reason,1,48) as err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
and d.query = pg_last_copy_id();


TO CHECK FOR DISKSPACE USED IN REDSHIFT:

select owner as node, diskno, used, capacity
from stv_partitions
order by 1, 2, 3, 4;
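To get an overall usage percentage per node rather than per-disk numbers, a sketch like the one below (built on the same used and capacity columns of stv_partitions) can be used:

select owner as node,
sum(used) as used_blocks,
sum(capacity) as capacity_blocks,
round(sum(used)::decimal * 100 / sum(capacity), 2) as pct_used
from stv_partitions
group by owner
order by owner;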
TO CHECK THE LAST FEW QUERIES EXECUTED:

select query, trim(querytxt) as sqlquery
from stl_query
order by query desc limit 5;


SOME IMPORTANT AWS COMMANDS:

To resize the Redshift cluster (node type and number of nodes are always required):

aws redshift modify-cluster --cluster-identifier <cluster name> --node-type dw2.8xlarge --number-of-nodes 3

To get filelist on S3:

aws s3 ls $BUCKET/  > ./filecount.out

To get the status and other information of the cluster in text format:

aws redshift describe-clusters --output text   


Friday, 4 March 2016

Mapping variables and parameters in Informatica Mapping and session


1) Informatica Mapping Parameters:

You use mapping parameters to supply parameter values to Informatica mappings. These could be values such as Batch_Key, Batch_Name, etc. The value of a mapping parameter does not change for the entire execution of a session. Mapping parameters can be defined in the mapping in the Mapping Designer, as shown in the screenshot below. The same mapping parameters need to be defined in the parameter file too.

For example, the parameter file would have:

$$BATCH_KEY=301

In the Mapping Designer, you would define $$BATCH_KEY in the mapping and parameters window. During execution, the session will substitute the value 301 wherever it finds $$BATCH_KEY in the mapping. It might also be necessary to define the same parameter in the workflow under the Variables tab.
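A bare-bones parameter file sketch would look something like the lines below. The folder, workflow, and session names in the section header are made up here, so substitute your own:

[MyFolder.WF:wf_load_sales.ST:s_m_load_sales]
$$BATCH_KEY=301
$$BATCH_NAME=DAILY_LOAD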



2) Informatica Mapping Variables:

A mapping variable represents a value that can change during the session. Mapping variables can also be defined under the mapping and parameters window of the Mapping Designer, just like parameters (shown in the screenshot above), and they can also be initialized using the parameter file. However, the value of a variable can change during execution. The value of a mapping variable can be changed in an Expression transformation using the SETVARIABLE function. For example:

SETVARIABLE($$BATCH_KEY, 302)  would change the value of $$BATCH_KEY to 302 during execution.

The SETVARIABLE function can be used in the Expression, Filter, Router, and Update Strategy transformations. The Integration Service saves the value of a mapping variable to the repository at the end of each successful session run and uses that value the next time you run the session.

Pre-session and post-session variable assignments in the session can be used to pass the value of a mapping variable to session/workflow variables. This allows the values of the variable to be used at the workflow level, as shown in the screenshot below.

3) Informatica Session Parameters:

Session parameters represent values that can change between session runs, such as database connections or source and target files. E.g.: $DBConnectionName, $InputFileName, etc. These parameters need to be defined in the parameter file.

4) Informatica Workflow Variables:

All the parameters and variables that are used in the workflow need to be defined in the workflow in the window shown below. This includes mapping parameters, session parameters, etc.




Friday, 18 September 2015

Creating a table in Parquet, Sequence, RCFILE and TextFile format and enabling compression in Hive



My intention was to write an article on the different file formats in Hive, but I noticed that an article had already been posted.
 
We are using Parquet these days, mainly because of its compression options and its performance with large tables in Hive. Parquet is optimized to work with large data sets and provides good performance for aggregation functions such as max or sum.

If you prefer to see the HDFS file in clear text format then you need to store the file in textfile format. This generally takes more space than the binary sequencefile and rcfile formats.


Below are examples of creating a table in Parquet, Sequence, RCFile, and TextFile formats. The location clause is optional.

create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS parquet location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS sequencefile location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS rcfile location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS textfile location 'hdfspathname';


The default storage format is a delimited TEXTFILE; note that Hive's default field delimiter is Ctrl-A ('\001'), not a comma.
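If you do want a comma-delimited text table, the delimiter can be specified explicitly in the DDL. A minimal sketch (the table name and columns are just placeholders):

create table IF NOT EXISTS mytable_csv
(
id int,
name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile;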

Enabling compression in a Hive or Impala table:


Apart from the storage formats supported in Hive, the data can also be compressed using codecs such as LZO, Snappy, etc. To enable compression in Hive, we use set statements such as the ones below. The example below is for LZO compression. After the set statements, you can use insert statements to load data into a table, and the data will be compressed as per the set statements.



set mapreduce.output.fileoutputformat.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

insert overwrite table tablename partition(dateid) select * from table2;
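For Snappy instead of LZO, the set statements would look something like the sketch below, assuming the Snappy codec is available on your cluster:

set mapreduce.output.fileoutputformat.compress=true;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

insert overwrite table tablename partition(dateid) select * from table2;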




Monday, 14 September 2015

Why a career in Data warehousing? Best jobs and salary in Data warehousing.

Why a career in Data warehousing?


Firstly, what is data warehousing? Data Warehousing (DW) is a term used in information technology to refer to data repositories created to store historical data that is mostly used for reporting and analytical purposes. This could be an enterprise data warehouse that has data integrated from various sources, or it could be subject-specific data repositories called data marts. This is just a short definition of data warehousing.


These days data warehousing is gaining more popularity and attention thanks to big data and other high-profile web technologies. Big data refers to technologies that allow huge volumes of data to be processed at lower cost for analytical purposes. Big companies like Facebook, Google, Twitter, and Yahoo have invested in these technologies and use them to process the huge amounts of data they gather on a daily basis. Cloud services have also become popular and are allowing huge data warehouses to be hosted on the cloud; Amazon Web Services and Microsoft Azure, to name a few, are gaining popularity. With these kinds of technologies making news and gaining popularity, data warehousing is becoming more exciting and an area where a lot of IT jobs are being created, bringing in talented IT professionals.

Having worked in data warehousing for the last 10 years, I have seen increased investment by companies in data warehousing and, in general, increased interest among IT folks in data warehousing related technologies. I'm writing this article to introduce people to data warehousing and to let them know about some of the jobs in data warehousing.

Best jobs and salary in Data warehousing.

Enterprise Data Architect/ ETL Architect/ Solution Architect / Hadoop Architect - 
As the name implies, these folks are responsible for architecture. This could be enterprise architecture, which means they decide how all the pieces of the data warehouse work together, which products need to be used, the interactions between different systems, etc. ETL architects are more involved with data integration and the creation of data warehouses. Hadoop architects are more concerned with the big data architecture in an organization. Architects also work with upper management and business teams to provide optimum data warehousing solutions.

Estimated Salary: 80,000 to 140,000+

Team/Project/Account Managers -
As with any team, there will be managers in data warehousing teams too. These folks are usually expected to have some experience with data warehousing. Team managers are more involved in resourcing, budgets, etc. Project managers are more involved in managing projects through to the end. Account managers are involved if the company is a vendor providing services to another organization.

Estimated Salary: 80,000 to 120,000+ 

Big data / Hadoop developers and Admins-
Big data developers are involved in developing Hadoop jobs, which could mean writing map reduce jobs, developing Hive, Pig, or Oozie scripts, or working on complex Apache Spark code. Hadoop admins are involved in managing the Hadoop infrastructure, fine tuning it for performance, user administration, security, etc.

Estimated Salary: 60,000 to 100,000+ 

Data modellers -
In some organizations, data modelling is done by data architects or even ETL developers. Other organizations have dedicated people for data modelling. Data modelling refers to designing the table structures and defining the relationships between them.

Estimated Salary: 60,000 to 100,000+ 

ETL/Data warehouse Developer -
ETL stands for Extract, Transform, and Load, and ETL professionals create jobs to integrate data from multiple sources, transform it, and load it to relational or flat file targets. These days ETL developers also work with ETL tools created for big data platforms. There are many big vendors in this area; Informatica and Pentaho are some of the big names.

Estimated Salary: 70,000 to 100,000+ 

Report Developer -
Report developers build various reports and dashboards for the business teams using tools developed by vendors such as Cognos, Business Objects, Microstrategy, etc. Report developers need to know SQL and be proficient with one of the report development tools.


Estimated Salary: 70,000 to 100,000+ 


Tester/ Quality Assurance -
Well, every field in software development needs testers to make sure the software is working as per the requirements. Testers do all the testing, write test cases, etc. Data warehouse testing needs testers who know a bit about ETL tools, SQL, and some data warehousing concepts.

Estimated Salary: 60,000 to 100,000+

Database Administrators -
Database administrators are responsible for managing and administering the databases and servers that are used to store the data in data warehousing. There are many jobs for Oracle, SQL Server, Netezza, and similar databases.

Estimated Salary: 70,000 to 100,000+ 

Production and Infrastructure Support - 
Production support folks are responsible for managing and supporting all the jobs in production. These are usually the only people with access to the production servers, and they make sure all the production jobs are running smoothly. They have to carry pagers and sometimes work overnight shifts.

Estimated Salary: 50,000 to 80,000+ 










Monday, 17 August 2015

Difference between native and Hive modes in Informatica Big Data Edition (BDE)


 

a) Native mode - In native mode, BDE works like normal PowerCenter. It can be used to read from and write to traditional RDBMS databases, and it can also be used to write to HDFS and Hive. It works like PowerCenter because the execution of the mapping logic happens on the PowerCenter server, i.e. the source data is read to the Informatica server, transformations are applied, and then the data is loaded to the target.

This mode is stateful, i.e. you can keep track of data from previous records and use sequence generators, sorters, etc. just like in normal PowerCenter.


b) Hive mode - In Hive mode, you can have similar sources and targets as in native mode, however the whole mapping logic is pushed down to Hive, i.e. the Hadoop cluster. In this mode Informatica BDE converts the mapping logic into Hive SQL queries and executes them directly on the Hadoop cluster as Hive queries, thereby converting them all into map reduce jobs.

This mode is not stateful, i.e. you cannot keep track of data in previous records using stateful variables. Transformations like sorters and sequence generators won't work fully or properly.

The Update Strategy transformation will not work in Hive mode simply because Hive does not allow updates; you can only insert records into a Hive database.

In this mode the data gets read from the source into temporary Hive tables, transformed, and the target also gets loaded into temporary Hive tables before being inserted into the final target, which can be an RDBMS database like Oracle or a Hive database. Hence the limitations of Hive also carry over to Hive mode in Informatica BDE.

However, if your volume of data is huge and you want to push all the processing to Hive, then Hive mode is the better option. There are workarounds to do type 2 style updates in Hive mode.

Thursday, 6 August 2015

What is Informatica Big Data Edition (BDE)?



Informatica Big Data Edition (BDE) is a product from Informatica Corp that can be used as an ETL tool for working in a Hadoop environment along with traditional RDBMS tools. There are now a lot of ETL products on the market that make it easier to integrate with Hadoop - Talend and Pentaho, to name a few. Informatica is one of the leading ETL tool vendors, and Informatica PowerCenter has been famous as an ETL tool for many years. Traditionally this tool was used to extract, transform, and load data into traditional databases such as Oracle, SQL Server, and Netezza, to name a few. With the advent of Hadoop for storing petabyte volumes of data, building ETL tools that can work with Hadoop became more important. Working directly with Hadoop and building map reduce jobs requires a lot of hand coding and knowledge. Hadoop tools such as Hive made it easier to write SQL queries on top of a Hive database and convert them into map reduce jobs. Hence, a lot of companies started using Hive as a data warehouse tool, storing data in Hadoop just like in traditional databases and writing queries on Hive. How do we now extract, transform, and load the data in Hadoop? That's where Informatica BDE comes into the picture. It is a tool that you can use for ETL or ELT on Hadoop infrastructure. Informatica BDE can run in two modes: native mode and Hive mode.

In native mode, it runs like normal PowerCenter, but in Hive mode you can push the whole mapping logic down to Hive and make it run on the Hadoop cluster, thereby using the parallelism provided by Hadoop. There are some limitations when running in Hive mode, but those are mostly because of limitations in Hive itself. For example, Hive does not allow updates in older versions.

With Informatica BDE you can do the following at a very high level:

a) Just like any ETL tool, you can extract, transform, and load between traditional RDBMS or Hive/HDFS sources and targets.
b) Push the whole ETL logic to the Hadoop cluster and make use of the map reduce framework. Basically, it makes building Hadoop jobs easier.
c) It makes it easy to create connections to all the different sources and integrate data from those sources.
d) It makes it easier to ingest complex files such as JSON, XML, COBOL, Avro, Parquet, etc.


Informatica BDE uses the Informatica Developer interface to build mappings and to deploy and create applications. Anyone who has used IDQ before will be familiar with the Informatica Developer interface. Informatica BDE is available from Informatica version 9.6 onwards.


Monday, 8 June 2015

How to install UDF functions on Netezza?




Let's say you have got hold of a UDF from Netezza and want to install it; the steps below would help. In the example below, the reverse function is installed on the Netezza server.




In the folder where you have placed the tgz file, execute the instructions below.

[netezzahost]$ gunzip reverse.tgz
[netezzahost]$ tar -xvf reverse.tar

[netezzahost]$ cd reverse

[netezzahost]$ ./install <databasename>

CREATE FUNCTION
Created udf
Done


After this step you might have to execute a grant command to give users permission to execute the function.

For example:

grant execute on reverse to <mygroup>;


Log in to the database and test the function:

select reverse('mydog');
 REVERSE
---------
 godmy
(1 row)
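If you want to confirm that the function is registered in the catalog, the _v_function system view can be queried. A rough sketch (column names can vary between Netezza versions):

select function, owner, createdate
from _v_function
where function = 'REVERSE';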

Tuesday, 12 May 2015

Connect to DB2 database using Informatica

Setting up DB2 ODBC or Native DB2 database Connection in Informatica

There are two ways to create a connection to DB2 in Informatica. The first way is to use DB2 in native mode using the DB2 PowerConnect module. The second way is to use ODBC drivers.

To connect to DB2 in native mode, you need the DB2 PowerConnect module installed on your Informatica server machine.

The steps to create a DB2 connection using the DB2 PowerConnect module are below:

1)    Create a remote DB2 database connection entry using the command:
db2 CATALOG TCPIP NODE <nodename> REMOTE <hostname_or_address> SERVER <port number>
2)    Create catalog entries for the database in UNIX using the command:
 db2 CATALOG DATABASE <dbname> as <dbalias> at NODE <nodename>
3)    Commands to check the database entries are below:
db2 list database directory
db2 list node directory
4)    Verify the connection to the DB2 database using command:
CONNECT TO <dbalias> USER <username> USING <password>
5)    Create a relational connection (type: DB2) with the connect string set to “nodename” in the Informatica Workflow Manager (sample below):





To connect to DB2 using ODBC, you need an ODBC driver for DB2 installed on your Informatica server machine.

The steps to create a DB2 connection using ODBC are below:

1)    Create an entry in the .odbc.ini file on your Informatica server machine.

[mydb2database]
Driver=/informatica/ODBC6.1/lib/DWdb225.so
Description=DataDirect 5.2 DB2 Wire Protocol
ApplicationUsingThreads=1
ConnectionRetryCount=0
ConnectionRetryDelay=3
#Database applies to DB2 UDB only
Database=MYDB2DB
DynamicSections=200
GrantAuthid=PUBLIC
GrantExecute=1
IpAddress=101.101.701.101
LoadBalancing=0
#Location applies to OS/390 and AS/400 only
Location=
LogonID=<yourid>
Password=<yourpasswd>
PackageOwner=
ReportCodePageConversionErrors=0
SecurityMechanism=0
TcpPort=4456
UseCurrentSchema=1
WithHold=1

2)    Create a relational database connection of type ODBC, using the entry created in step 1 as the connect string.

Wednesday, 7 January 2015

How to set up and sync Fitbit Flex Activity Tracker and Review

How to set up and sync Fitbit Flex Activity Tracker? Review at end of this article.




1) Unbox the Fitbit package. You should have the parts below in it.


2) Charge the Flex chip (the small chip on the left side of the pic above) by inserting it into the cable that has a USB plug at the end (the one at the top of the pic above) and connecting the cable to a USB port on your computer. You can also connect it to a USB charger that plugs into a power outlet. Once fully charged, all four light indicators on the chip turn on.


Creating fitbit account and the software

3) Now go to http://www.fitbit.com/setup and download the software for the Flex model and install the software on your computer. If you have a model other than Flex, then download the software for that model.

4)  Open your Fitbit Connect application. You can set up your device and create a fitbit account by clicking on the Set up a New Fitbit Device link.



5) Click on the New to Fitbit link and then set up your account. You will use this account to log in to fitbit.com and see all your activities.



6) After you have set up your account, you need to track your activity with the Fitbit. Insert the Fitbit chip into your Fitbit wristband. The device will start tracking your movement in terms of steps taken - basically every time you move your hand, walk, run, etc.


Syncing your fitbit

7) Now, to see the activities tracked by your Fitbit, you need to sync your Fitbit device with your Fitbit account. To do this, insert the Fitbit dongle into a USB port on your computer.


8) Start the Fitbit Connect application and click on the sync button. You need to have your Fitbit wristband with the chip close to the dongle for it to detect and read your activities from the tracker.


9) After the sync is complete, click on the Go to fitbit.com link or open fitbit.com and log in using the account created in one of the steps above. This should take you to your account dashboard, which will show all the activities tracked during the day by hour.



 10) In your dashboard, you can see the number of steps and calories burnt in the day by hour.


11) That should be it. Sync your activities every now and then. Also, charge your Fitbit tracker chip once every 2-3 days.

12) If you have an Android phone or an iPhone, you can install the Fitbit app, and that app should take care of syncing your activities without having to use the dongle.

Review:
a) Easy to set up, sync and check the daily stats. A nice and informative dashboard that gives counts of steps, calories burnt and a graph that plots number of steps by time.
b) LED indicators to indicate the amount of battery power left.
c) A bit hard to wear because of the clip that is used to lock the strap. You have to press it hard for it to lock. I would have liked it to be more like a normal watch band. Maybe it is designed this way so that it hugs the wrist tightly.
d) Comes with two wristbands of different sizes. 
e) Very light weight. Looks stylish.
f) Dongle helps to sync using normal PC. Apps available for Android and Iphone.

Friday, 26 December 2014

Big Data - Good Books for Hadoop, Hive, Pig, Impala, Hbase.







1) Hadoop: The Definitive Guide


 

Hadoop: The Definitive Guide: Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).
  • Store large datasets with the Hadoop Distributed File System (HDFS)
  • Run distributed computations with MapReduce
  • Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
  • Discover common pitfalls and advanced features for writing real-world MapReduce programs
  • Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
  • Load data from relational databases into HDFS, using Sqoop
  • Perform large-scale data processing with the Pig query language
  • Analyze datasets with Hive, Hadoop’s data warehousing system
  • Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems

2)  Programming Hive



Programming Hive: Need to move a relational database application to Hadoop? This comprehensive guide introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem.
This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem. You’ll also find real-world case studies that describe how companies have used Hive to solve unique problems involving petabytes of data.
  • Use Hive to create, alter, and drop databases, tables, views, functions, and indexes
  • Customize data formats and storage options, from files to external databases
  • Load and extract data from tables—and use queries, grouping, filtering, joining, and other conventional query methods
  • Gain best practices for creating user defined functions (UDFs)
  • Learn Hive patterns you should use and anti-patterns you should avoid
  • Integrate Hive with other data processing programs
  • Use storage handlers for NoSQL databases and other datastores
  • Learn the pros and cons of running Hive on Amazon’s Elastic MapReduce

  3) Programming Pig




Programming Pig: This guide is an ideal learning tool and reference for Apache Pig, the open source engine for executing parallel data flows on Hadoop. With Pig, you can batch-process data without having to create a full-fledged application—making it easy for you to experiment with new datasets.

Programming Pig introduces new users to Pig, and provides experienced users with comprehensive coverage on key features such as the Pig Latin scripting language, the Grunt shell, and User Defined Functions (UDFs) for extending Pig. If you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig.
  • Delve into Pig’s data model, including scalar and complex data types
  • Write Pig Latin scripts to sort, group, join, project, and filter your data
  • Use Grunt to work with the Hadoop Distributed File System (HDFS)
  • Build complex data processing pipelines with Pig’s macros and modularity features
  • Embed Pig Latin in Python for iterative processing and other advanced tasks
  • Create your own load and store functions to handle data formats and storage mechanisms
  • Get performance tips for running scripts on Hadoop clusters in less time

4) HBase: The Definitive Guide 
 


HBase: The Definitive Guide: If you're looking for a scalable storage solution to accommodate a virtually endless amount of data, this book shows you how Apache HBase can fulfill your needs. As the open source implementation of Google's BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. Many IT executives are asking pointed questions about HBase. This book provides meaningful answers, whether you’re evaluating this non-relational database or planning to put it into practice right away.
  • Discover how tight integration with Hadoop makes scalability with HBase easier
  • Distribute large datasets across an inexpensive cluster of commodity servers
  • Access HBase with native Java clients, or with gateway servers providing REST, Avro, or Thrift APIs
  • Get details on HBase’s architecture, including the storage format, write-ahead log, background processes, and more
  • Integrate HBase with Hadoop's MapReduce framework for massively parallelized data processing jobs
  • Learn how to tune clusters, design schemas, copy tables, import bulk data, decommission nodes, and many other tasks

5) Getting Started with Impala: Interactive SQL for Apache Hadoop
 

Getting Started with Impala: Interactive SQL for Apache Hadoop: Learn how to write, tune, and port SQL queries and other statements for a Big Data environment, using Impala—the massively parallel processing SQL query engine for Apache Hadoop. The best practices in this practical guide help you design database schemas that not only interoperate with other Hadoop components, and are convenient for administrators to manage and monitor, but also accommodate future expansion in data size and evolution of software capabilities.
Ideal for database developers and business analysts, Getting Started with Impala includes advice from Cloudera’s development team, as well as insights from its consulting engagements with customers.
  • Learn how Impala integrates with a wide range of Hadoop components
  • Attain high performance and scalability for huge data sets on production clusters
  • Explore common developer tasks, such as porting code to Impala and optimizing performance
  • Use tutorials for working with billion-row tables, date- and time-based values, and other techniques
  • Learn how to transition from rigid schemas to a flexible model that evolves as needs change
  • Take a deep dive into joins and the roles of statistics




Thursday, 25 December 2014

2014- Best of Boxing Day Deals for Electronic Items in Toronto - TV -Laptop-Tablet-Speaker-Camera


Stores where you can do your Boxing Day shopping:

     
  • Bestbuy, Futureshop, Walmart, Staples, Target, Sears - are good for buying electronics such as TVs, laptops, speakers, etc.
  • Walmart, Toys R Us - are good for shopping for baby stuff.
  • Leons, Brick, Badboy, Walmart, Sears - are good for furniture items.
  • All the malls are good for shopping for clothes, etc. My favourites are Scarborough Town Centre, Eaton Centre, Yorkdale Mall, Vaughan Mills, etc.
  • Home Depot, Lowes - are good for home hardware, appliances, and gardening or snow removal equipment.

Some of the best electronic item Boxing Day deals that I found on the websites are below:

Location: FutureShop, Walmart

There is a 58 inch for $699 and a 60 inch from Sony for $999.


Samsung 58" 1080p 60Hz LED Smart TV (UN58H5202AFXZC) - Black
Cost: 699.99
Save $300



LG 50" 1080p 120Hz LED TV - Silver (50LB5900)
Cost: $499.99
Save: $100
Sale Ends: December 28, 2014



Location: BestBuy/FutureShop


ASUS X Series 15.6" Laptop - Black (Intel Dual-Core Celeron N2830/500GB HDD/4GB RAM/Windows 8.1)
Cost: $279.99
Save: $100
Sale Ends: December 28, 2014



Location: Bestbuy/ FutureShop

Cost: $279.99
Save $100
Sale ends: January 1, 2015
Samsung Galaxy Tab 4 10.1" 16GB Android 4.4 Tablet With 1.2 GHz Quad-Core Processor - White



Location: BestBuy


Canon Rebel T5 18MP DSLR Camera With EF-S 18-55mm IS, EF 75-300mm Lenses & DSLR Bag
$499.99
Save: $310
Sale Ends: December 25, 2014








Location: Walmart


Sony- 5.1-Ch. 3D Smart Blu-ray Home Theater System (BDVE2100)
3D Blu-ray Home Theater with Wi-Fi

Cost: $168

Friday, 28 November 2014

How to connect to Amazon Redshift cluster using psql in a unix script?



You first need to define the following variables in your UNIX script with the appropriate values, and then follow them with the psql command and whatever SQL statement you need, to connect to the Amazon Redshift cluster using psql:

AWS_ACCESS_KEY=<ENTER ACCESS KEY>
SECRET_ACCESS_KEY=<ENTER SECRET ACCESS KEY>

db=default
AWS_SERVER_ADDRESS=mytest-cluster2.adsadsadh.us-east-1.redshift.amazonaws.com

username=admin

psql -h $AWS_SERVER_ADDRESS -p 8443 -d $db -U $username  -c  "TRUNCATE TABLE PUBLIC.TABLENAME"



Note: if you have AWS_ACCESS_KEY and SECRET_ACCESS_KEY defined as environment variables, then you can use the psql command as shown below, entering the cluster address directly instead of using variables, to connect to the Amazon Redshift cluster using psql.


localhost> export AWS_ACCESS_KEY=<ENTER ACCESS KEY>
localhost> export SECRET_ACCESS_KEY=<ENTER SECRET ACCESS KEY>
localhost> psql -h mytest-cluster2.adsadsadh.us-east-1.redshift.amazonaws.com  -p 8443 -d default -U $username  -c  "TRUNCATE TABLE PUBLIC.TABLENAME"

Thursday, 27 November 2014

My Top 5 cheap android tablets for less than 150$ in Canada



1) Samsung 7" 8GB Galaxy Tab3 Lite Tablet - White


Pros:
Samsung is a better brand
Samsung apps and better quality LCD display
1GB of RAM and a 1.2GHz dual core Marvell PXA986 processor
Front and back facing camera


Cons:
Only 8GB of storage.


   
  





2) Asus 7" 16GB MeMO Pad Tablet With Wi-Fi - Black



Pros:
7 inch WSVGA touchscreen LCD display
1.2GHz Intel Clover Trail Plus Z2520 Dual Core processor with 1GB of RAM offers great multitasking capabilities
16GB EMMC storage offers plenty of room for files, photos, videos, and more
Front and Rear facing camera
Jelly bean 4.3







3) Google Nexus 7 by ASUS 32GB 7" Tablet with Wi-Fi (1B32-CB)


Pros:
Cheaper than the first two tablets in this list.
32 GB storage
1.3 GHz NVIDIA Tegra 3 processor


Cons:
Only front facing camera
IPS Capacitive LED





4) Le Pan II 9.7" 8GB Tablet with Wi-Fi - English

 
Pros:
9.7 inch LCD display. Bigger than all the other tablets in this list.
Android 4.0 (Ice Cream Sandwich) operating system with Adobe Flash support
Qualcomm APQ8060 1.2 GHz processor


Cons:
No Rear Camera
8GB storage



 

5) Acer Iconia B1-710 7" 8GB Android 4.1 Tablet With MTK Dual Core Processor - White


Pros:
Better brand than the lesser known brands.
MTK dual core processor and 16GB of flash storage


Cons:
Only Front facing camera


 






Also check out the links below for other cheap tablets for less than $150:



a) Kobo Arc 7HD 7" 32GB Android 4.2.2 Tablet With NVIDIA Tegra 3 Processor - Black



b) Le Pan TC082A 8" 8GB Android 4.2.2 Tablet With Cortex-A7




I have owned this Le Pan TC082A for a year now and it is pretty good. Check out my review using the link below:

http://dwbitechguru.blogspot.ca/2014/10/review-le-pan-tc082a-8-8gb-android-422.html