Wednesday 18 September 2019

Settings for insert/update on a Hive table in Hortonworks



To support insert and update operations, a Hive table has to be set up as a transactional table.

The session properties you will need are below:


set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=2;

The Hive table needs to be created as follows:

CREATE TABLE `test.test_zyx1`(
  `id` int, 
  `mm` int, 
  `data_as_of_dt` string, 
  `initial_dt` string )
STORED AS ORC TBLPROPERTIES ('transactional'='true');
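
With the table in place, insert and update statements can be run against it. The statements below are a minimal sketch with made-up values (not from the original post); note that on older HDP releases (before Hive 3), the table may also need a CLUSTERED BY ... INTO n BUCKETS clause for ACID support.

-- Hypothetical example: insert a row, then update it
-- (assumes the session properties above are set in the same session)
INSERT INTO test.test_zyx1 VALUES (1, 10, '2019-09-18', '2019-09-18');

UPDATE test.test_zyx1
SET mm = 20
WHERE id = 1;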

Installing and running Jupyter notebook for python



Jupyter installation requires Python 3.3 or greater, or Python 2.7. IPython 1.x, which included the parts that later became Jupyter, was the last version to support Python 3.2 and 2.6.

The steps to get Jupyter started are below:

Step 1:
Download and install Python 3.3 or greater:

https://www.python.org/downloads/

Step 2:
Install Jupyter using the below command in the command prompt:

pip3 install jupyter


Step 3:

Start Jupyter using the command:

jupyter notebook



This will open Jupyter in the browser. You can also use the below link to access it:

http://localhost:8888/tree


Tuesday 10 September 2019

Converting string to map data type in Hive




If you have a string with key-value pairs and want to convert it to a map data type, use the str_to_map function.


str_to_map(text, delimiter1, delimiter2) - Creates a map by splitting text into key-value pairs using two delimiters. The first delimiter separates pairs, and the second delimiter separates key and value. If only one parameter is given, default delimiters are used: ',' as delimiter1 and '=' as delimiter2.

Example:

with t1 as (
  select "k1:v1|k2:v2|k3:v3" as c1
)
select str_to_map(c1, "\\|", ":") as m1, c1 from t1;
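
Once the string is converted, individual values can be looked up by key using map indexing. The follow-up query below is my own sketch (not from the original post) showing how to pull out the value for a single key:

with t1 as (
  select "k1:v1|k2:v2|k3:v3" as c1
)
-- index the resulting map by key; this should return 'v2'
select str_to_map(c1, "\\|", ":")["k2"] as v2 from t1;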



Monday 9 September 2019

Difference between posexplode and explode in Hive




In the below example, the numbers column holds a comma-separated list of values.


with t1 as (
select 1 as id, 'deepak' as name, '12345, 56789, 7892334, 6263636, 7181818, 1761761717' as numbers )

select * from t1



explode is used to create multiple rows from a list value in a column. In the below example, the list in the numbers field is broken into multiple rows.


with t1 as (
select 1 as id, 'deepak' as name, '12345, 56789, 7892334, 6263636, 7181818, 1761761717' as numbers )

select * from t1
LATERAL VIEW explode(split(numbers, ',')) test










posexplode also creates multiple rows from a list value in a column, but it additionally returns a column called pos that contains the position of each value in the list. In the below example, the list in the numbers field is broken into multiple rows along with the position column.

with t1 as (
select 1 as id, 'deepak' as name, '12345, 56789, 7892334, 6263636, 7181818, 1761761717' as numbers )

select * from t1
LATERAL VIEW posexplode(split(numbers, ',')) test
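
If you want to refer to the position and the value by name in the select list, the LATERAL VIEW can also alias the generated columns explicitly. The query below is my own sketch (the aliases pos and num are not from the original post):

with t1 as (
select 1 as id, 'deepak' as name, '12345, 56789, 7892334, 6263636, 7181818, 1761761717' as numbers )

-- t is the lateral view alias; pos and num name the generated columns
select id, name, t.pos, trim(t.num) as num
from t1
LATERAL VIEW posexplode(split(numbers, ',')) t AS pos, num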






Sunday 9 December 2018

Submitting pyspark jobs on Yarn and accessing hive tables from spark




I was getting the below error while trying to access a Hive table through pyspark when submitting the job on YARN.

pyspark.sql.utils.AnalysisException: u'Table not found: `prady_retail_db_orc`.`orders`;'



I had to then access the table through the HiveContext to fix the issue.




The command to submit the job is:

spark-submit --master yarn --conf "spark.ui.port=10111"  test2.py


The above command submits the test2.py program on YARN, with the Spark UI on port 10111.

The contents of test2.py are below. It accesses a Hive table called orders and writes the contents of the table in Parquet format to an HDFS location.


----------------test2.py contents----------------



from pyspark import SparkContext
from pyspark.sql import HiveContext

# Create the Spark context and a Hive-aware SQL context
sc = SparkContext()
sqlContext = HiveContext(sc)

# Read the Hive table into a DataFrame
dailyrevDF = sqlContext.sql("select * from prady_retail_db_orc.orders")

# Write the table out to HDFS as uncompressed Parquet
dailyrevDF.write.mode("overwrite").format("parquet").option("compression", "none").save("/user/prady/data/test7")

Thursday 15 November 2018

How to make Hive queries run faster?

The options available are listed below (a sketch of the relevant session settings follows the list):

1) Use the ORC file format, which provides better performance than text.

2) Use the Tez execution engine.

3) Use cost-based optimization (CBO).

4) Use vectorized execution.

5) Write sensible queries that avoid unnecessary joins.
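
For options 1 to 4, the session-level settings below are a minimal sketch of how they are typically enabled. The property names are standard Hive settings, but the defaults and supported values depend on your Hive/HDP version, so treat this as a starting point rather than a definitive list:

-- Use the Tez execution engine
set hive.execution.engine=tez;

-- Enable cost-based optimization (collect table/column statistics for it to help)
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;

-- Enable vectorized execution
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;

-- Use ORC when creating tables, for example:
-- CREATE TABLE my_table (...) STORED AS ORC;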




Reading from Amazon S3 using NiFi


1) First create a bucket on Amazon S3 and create an access key and secret key from IAM in AWS.
2) Proper permissions should be granted so that a user with the access key and secret key can access the bucket.
3) Use an S3 client tool to test that the files are accessible.
4) Create the dataflow in NiFi using the ListS3, FetchS3Object and PutFile processors as shown in the diagrams below.





5) The settings of ListS3 are listed below. S3-kannadiga is the bucket name in the US-East region. The access key and secret key are entered in this processor.
6) The settings of FetchS3Object are given below. ${s3.bucket} is the expression used to read the bucket name from the ListS3 processor.

7) The settings of PutFile are given below.