DWBI-TECH BLOGS (Pradeep Kannadiga): 2018

Sunday, 9 December 2018

Submitting pyspark jobs on Yarn and accessing hive tables from spark

I was getting the below error while trying to access hive table through pyspark while submitting the job on spark.

yspark.sql.utils.AnalysisException: u'Table not found: `prady_retail_db_orc`.`orders`;'

I had to then start accessing the table through the hive context to fix the issue.

The Command to submit the job is:

spark-submit --master yarn --conf "spark.ui.port=10111" test2.py

The above command submit the test2.py program on yarn on port 10111.

The contents on test2.py is below. It is accessing a hive table called orders and writing the contents of the table in parquert format to hdfs location.

----------------test2.py contents----------------

from pyspark import SparkContext, SQLContext

from pyspark.sql import HiveContext

sc=SparkContext()

sqlContext = HiveContext(sc)

dailyrevDF=sqlContext.sql("select * from prady_retail_db_orc.orders")

dailyrevDF.write.mode("overwrite").format("parquet").option("compression", "none").mod

e("overwrite").save("/user/prady/data/test7")

Thursday, 15 November 2018

How to make hive queries run faster?

The options available are

1) Use ORC file format which provides better performance than Text.

2) Use Tez execution engine

3) Use Cost based optimization

4) Use Vectorization execution

5) Right sensible queries that avoids joins

Reading from Amazon S3 using NiFi

1) First create a bucket on Amazon S3 and create public and private keys from IAM in AWS

2) Proper permission should be provided so that users with the public and private keys can access the bucket

3) Use some S3 client tool to test that the files are accessible.

4) Create the dataflow on Nifi using ListS3 , FetchS3Object and PutS3 object as shown in the diagrams below.

5) Setting of ListS3 is listed below. S3-kannadiga is bucket name in US-East region. The access key and secret key is entered in this processor.

6) Setting of Fetch S3 is given below. $(s3.bucket) is the setting to read from List S3 processor.

7) The setting of PutFile is given below.