Spark DataFrames: registerTempTable vs not
It's convenient to load the dataframe into a temp view in a notebook for example, where you can run exploratory queries on the data:
df.createOrReplaceTempView("myTempView")
Then in another notebook you can run a sql query and get all the nice integration features that come out of the box e.g. table and graph visualisation etc.
%sql
SELECT * FROM myTempView
The reason to use the registerTempTable( tableName )
method for a DataFrame, is so that in addition to being able to use the Spark-provided methods of a DataFrame
, you can also issue SQL queries via the sqlContext.sql( sqlQuery )
method, that use that DataFrame as an SQL table. The tableName
parameter specifies the table name to use for that DataFrame in the SQL queries.
val sc: SparkContext = ...
val hc = new HiveContext( sc )
val customerDataFrame = myCodeToCreateOrLoadDataFrame()
customerDataFrame.registerTempTable( "cust" )
val query = """SELECT custId, sum( purchaseAmount ) FROM cust GROUP BY custId"""
val salesPerCustomer: DataFrame = hc.sql( query )
salesPerCustomer.show()
Whether to use SQL or DataFrame methods like select
and groupBy
is probably largely a matter of preference. My understanding is that the SQL queries get translated into Spark execution plans.
In my case, I found that certain kinds of aggregation and windowing queries that I needed, like computing a running balance per customer, were available in the Hive SQL query language, that I suspect would have been very difficult to do in Spark.
If you want to use SQL, then you most likely will want to create a HiveContext
instead of a regular SQLContext
. The Hive query language supports a broader range of SQL than available via a plain SQLContext
.