Friday, 19 December 2014

How to choose between Hadoop, Netezza, and Redshift: a comparison

A common question that we hear is: among Hadoop, Netezza, and Redshift, which one is better, and which one is going to replace everything else? Each of these technologies has its own pros and cons, and it is essential to know them before trying to identify the ideal use cases for each and how to fit them into your ecosystem.

Hadoop:


Pros:
  • Hadoop is mainly used for big data workloads, i.e. workloads involving huge volumes, often petabyte scale or more.
  • The data can be unstructured, structured, or semi-structured.
  • Data in Hadoop is stored on commodity hardware, so the cost of storing and processing is low relative to the volume of data it can handle. Say we are receiving a huge number of XML files and flat files from a huge number of sources (for example, airplane sensors, logs from firewalls/IDS systems, tweets from Twitter, etc.) that need to be loaded at a rapid pace and then processed: Hadoop is an ideal candidate.
  • Hadoop clusters are highly scalable: they scale horizontally by adding more inexpensive commodity nodes, and individual nodes can also be upgraded (vertical scaling).
  • Hadoop clusters can also be made to handle streaming data: collecting data from sensors, storing the stream, playing it back, etc.
  • Data redundancy is built into the cluster, since HDFS replicates data blocks across nodes.

Cons:
  • Note that files stored on HDFS are in flat-file format, and although you can use Hive or Impala to process these files as structured data, there are some limitations. Hive's SQL does not offer the full functionality found in traditional databases.
  • Updating records is difficult, and you will have to use a tool like HBase to get update functionality.
  • Hive queries on Hadoop are not going to respond as fast as Netezza, because behind the scenes they run MapReduce jobs, which are slower. They are also meant for huge volumes and are used where response time is not that important, i.e. for batch jobs.
  • Administering the cluster is not as simple as administering Netezza.
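
To make the MapReduce point above more concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases that such a batch job goes through. It only illustrates the model; it is not how Hive or Hadoop is implemented. The function names and the sample log lines are hypothetical.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (source, 1) pair per log line; the source is the first field."""
    for line in lines:
        fields = line.split()
        if fields:  # skip empty lines
            yield fields[0], 1


def shuffle(pairs):
    """Group values by key, as happens between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_source(lines):
    """Count log lines per source using the three phases in order."""
    return reduce_phase(shuffle(map_phase(lines)))


# Hypothetical sample input: one log line per record.
sample_log = [
    "firewall-1 DENY 10.0.0.5",
    "ids-2 ALERT port-scan",
    "firewall-1 ALLOW 10.0.0.9",
]
counts = count_by_source(sample_log)
print(counts)
```

On a real cluster each phase is spread across many nodes and intermediate results are written out between phases, which is a large part of why these jobs suit batch processing better than interactive queries.
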
Conclusion:

  • If you are doing ELT (Extract, Load, Transform) with huge data loads that can be stored as flat files on HDFS, then Hadoop is a good candidate. You can load all the files first and then process or transform them.
  • Volume and scalability are also deciding factors, and Hadoop is ideal for big data workloads.

Netezza:


Pros:
  • It is a traditional data warehouse appliance that supports standard SQL statements, i.e. all DML and DDL statements. Inserts and updates are all possible.
  • Very fast and made for data warehousing at gigabyte to terabyte scale. Depending on the number of SPUs (snippet processing units), it can handle that volume of data and process it at very high speed. The data is distributed across multiple SPUs, and they all work together to crunch it.
  • Administration overhead is much lower than with databases such as Oracle, DB2, etc.
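
The idea of distributing data across SPUs that work together can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Netezza's actual implementation: the hash function, the slice count of 4, and all names (`spu_for`, `distribute`, `parallel_sum`, the `sales` rows) are hypothetical.

```python
import zlib
from collections import Counter


def spu_for(key_value, num_spus):
    """Pick a slice for a row by hashing its distribution key (stable hash)."""
    return zlib.crc32(str(key_value).encode("utf-8")) % num_spus


def distribute(rows, key, num_spus):
    """Spread rows across slices; rows with the same key share a slice."""
    slices = [[] for _ in range(num_spus)]
    for row in rows:
        slices[spu_for(row[key], num_spus)].append(row)
    return slices


def local_totals(slice_rows, key, measure):
    """Work done independently on each slice: a partial aggregate."""
    totals = Counter()
    for row in slice_rows:
        totals[row[key]] += row[measure]
    return totals


def parallel_sum(rows, key, measure, num_spus=4):
    """Aggregate per slice, then merge the partial results."""
    combined = Counter()
    for part in distribute(rows, key, num_spus):
        combined.update(local_totals(part, key, measure))
    return dict(combined)


# Hypothetical sample rows.
sales = [
    {"store": "north", "amount": 120},
    {"store": "south", "amount": 80},
    {"store": "north", "amount": 30},
    {"store": "east", "amount": 50},
]
totals = parallel_sum(sales, "store", "amount")
print(totals)
```

Here the slices are processed one after another in a loop; on an appliance each slice would be handled by separate hardware at the same time, which is where the speed comes from.
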
Cons:
  • Mainly meant for structured data warehouse applications.
  • Probably cannot handle petabyte volumes of unstructured data as effectively as Hadoop.

Conclusion:

If you are building a traditional data warehouse where you also need ETL (Extract, Transform, Load), then Netezza is the better choice, since you can insert, update, aggregate, etc. with ease and write your SQL for reporting and building cubes.

Redshift:


Pros:
  • Petabyte-scale computing with an on-demand cost model.
  • Supports traditional data warehousing and SQL.
  • Can handle volumes even greater than Netezza and is built on the same model as Netezza (data distributed across nodes that work in parallel).
  • Runs in the cloud, with minimal administration required.
  • Easily scalable: nodes (computing power) and storage can be added with a few mouse clicks.
  • Easy to create data snapshots and to bring the Redshift cluster up and down as and when required.

Cons:
  • Because it runs in the cloud, data has to be transported to the Redshift cluster.
  • Reading huge volumes over an ODBC/JDBC connection is still a problem.
  • If there is a lot of back-and-forth movement between the cluster and the local network, it introduces latency.
  • You have to get security clearance to move data out of your network.
Conclusion:
  • Good for ELT, where data is loaded to the Redshift cluster once, processing is done on the cluster, and only aggregated/summarized data is read back, or where downstream applications source their data directly from the Redshift cluster.
  • Good for traditional data warehousing.
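
The ELT pattern described above (load once, process on the cluster, read back only the summary) can be sketched as below. Python's built-in `sqlite3` is used purely as a local stand-in for the warehouse so the example runs anywhere; it is not Redshift, and the table names, column names, and sample readings are hypothetical.

```python
import sqlite3


def load_raw(conn, events):
    """Load step: push the raw rows into the database once."""
    conn.execute("CREATE TABLE raw_events (sensor TEXT, reading REAL)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?)", events)


def summarize_in_database(conn):
    """Transform step: aggregate inside the database, next to the data."""
    conn.execute(
        "CREATE TABLE sensor_summary AS "
        "SELECT sensor, COUNT(*) AS n, AVG(reading) AS avg_reading "
        "FROM raw_events GROUP BY sensor"
    )


def read_summary(conn):
    """Read back only the small summarized result, not the raw rows."""
    return conn.execute(
        "SELECT sensor, n, avg_reading FROM sensor_summary ORDER BY sensor"
    ).fetchall()


conn = sqlite3.connect(":memory:")  # stand-in for the remote cluster
load_raw(conn, [("a", 1.0), ("a", 3.0), ("b", 10.0)])
summarize_in_database(conn)
summary = read_summary(conn)
print(summary)
```

The point of the design is that only the summary crosses the network, which sidesteps the ODBC/JDBC and latency concerns listed in the cons.
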
