How to connect to remote hive server from spark
JDBC is not required
Spark connects directly to the Hive metastore, not through HiveServer2. To configure this,
Put
hive-site.xml
on yourclasspath
, and specifyhive.metastore.uri
s to where your hive metastore hosted. Also see How to connect to a Hive metastore programmatically in SparkSQL?Import
org.apache.spark.sql.hive.HiveContext
, as it can perform SQL query over Hive tables.Define
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Verify
sqlContext.sql("show tables")
to see if it works
SparkSQL on Hive tables
Conclusion : If you must go with jdbc way
Have a look connecting apache spark with apache hive remotely.
Please note that beeline also connects through jdbc. from your log it self its evident.
[ml@master spark-2.0.0]$./bin/beeline Beeline version 1.2.1.spark2 by Apache Hive beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
So please have a look at this interesting article
- Method 1: Pull table into Spark using JDBC
- Method 2: Use Spark JdbcRDD with HiveServer2 JDBC driver
- Method 3: Fetch dataset on a client side, then create RDD manually
Currently HiveServer2 driver doesn't allow us to use "Sparkling" Method 1 and 2, we can rely only on Method 3
Below is example code snippet though which it can be achieved
Loading data from one Hadoop cluster (aka "remote") into another one (where my Spark lives aka "domestic") thru HiveServer2 JDBC connection.
import java.sql.Timestamp
import scala.collection.mutable.MutableList
case class StatsRec (
first_name: String,
last_name: String,
action_dtm: Timestamp,
size: Long,
size_p: Long,
size_d: Long
)
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement
.executeQuery("SELECT * FROM stats_201512301914")
val fetchedRes = MutableList[StatsRec]()
while(res.next()) {
var rec = StatsRec(res.getString("first_name"),
res.getString("last_name"),
Timestamp.valueOf(res.getString("action_dtm")),
res.getLong("size"),
res.getLong("size_p"),
res.getLong("size_d"))
fetchedRes += rec
}
conn.close()
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()
// Basically we are done. To check loaded data:
println(rddStatsDelta.count)
rddStatsDelta.collect.take(10).foreach(println)
After providing the hive-ste.xml configuration to SPARK and after starting the HIVE Metastore service,
Two things need to be configured in SPARK Session while connecting to HIVE:
- Since Spark SQL connects to Hive metastore using thrift, we need to provide the thrift server uri while creating the Spark session.
- Hive Metastore warehouse which is the directory where Spark SQL persists tables. Use Property 'spark.sql.warehouse.dir' which is corresponding to 'hive.metastore.warehouse.dir' (as this is deprecated in Spark 2.0)
Something like:
SparkSession spark=SparkSession.builder().appName("Spark_SQL_5_Save To Hive").enableHiveSupport().getOrCreate();
spark.sparkContext().conf().set("spark.sql.warehouse.dir", "/user/hive/warehouse");
spark.sparkContext().conf().set("hive.metastore.uris", "thrift://localhost:9083");
Hope this was helpful !!