Sunday, 9 December 2018

Submitting pyspark jobs on Yarn and accessing hive tables from spark


Submitting pyspark jobs on Yarn and accessing hive tables from spark 


I was getting the below error while trying to access hive table through pyspark while submitting the job on spark.

yspark.sql.utils.AnalysisException: u'Table not found: `prady_retail_db_orc`.`orders`;'



I had to then start accessing the table through the hive context to fix the issue. 




The Command to submit the job is:

spark-submit --master yarn --conf "spark.ui.port=10111"  test2.py


The above command submit the test2.py program on yarn on port 10111.

The contents on test2.py is below. It is accessing a hive table called orders and writing the contents of the table in parquert format to hdfs location.


----------------test2.py contents----------------



from pyspark import SparkContext, SQLContext

from pyspark.sql import HiveContext

sc=SparkContext()

sqlContext = HiveContext(sc)

dailyrevDF=sqlContext.sql("select * from prady_retail_db_orc.orders")

dailyrevDF.write.mode("overwrite").format("parquet").option("compression", "none").mod

e("overwrite").save("/user/prady/data/test7")