Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?
Explanation from the Spark source code (branch-2.1):
SparkContext: Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
JavaSparkContext: A Java-friendly version of [[org.apache.spark.SparkContext]] that returns [[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
SQLContext: The entry point for working with structured data (rows and columns) in Spark 1.x.
As of Spark 2.0, this is replaced by [[SparkSession]]. However, we are keeping the class here for backward compatibility.
SparkSession: The entry point to programming Spark with the Dataset and DataFrame API.
SparkContext is a class in the Spark API and the first thing you create when building a Spark application. It lives in the driver process and handles the connection to the cluster manager: it requests the executors and cores the application runs on. SparkContext can also be used to create RDDs and shared variables (accumulators and broadcast variables). To use it, you create an instance of it.
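You can create a SparkContext like this: val sc = new SparkContext(). The no-argument constructor loads its settings from system properties, for instance when the application is launched with spark-submit.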
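When the settings are not supplied from outside, the context is usually built from an explicit SparkConf. A minimal sketch, assuming a local master and a made-up application name (`context-demo`), run as a standalone script where no other SparkContext is active:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed settings: local mode and a hypothetical application name.
val conf = new SparkConf().setMaster("local[*]").setAppName("context-demo")
val sc = new SparkContext(conf)

// Create an RDD from a local collection and run a simple action on it.
val numbers = sc.parallelize(Seq(1, 2, 3, 4))
val total = numbers.reduce(_ + _)

// Only one SparkContext may be active per JVM, so stop it when done.
sc.stop()
```

Inside spark-shell a context already exists as `sc`, so this snippet is only meant for code that manages its own context.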
SparkSession is a new object added in Spark 2.x that replaces SQLContext and HiveContext. Earlier there were two options: SQLContext, for running SQL operations on DataFrames, and HiveContext, which handled Hive connectivity and read data from and wrote data to Hive tables.
Since 2.x, you create a SparkSession for SQL operations on DataFrames. If you have any Hive-related work, call enableHiveSupport() on the builder, and the same SparkSession then serves both DataFrame and Hive SQL operations.
This is how to create a SparkSession for SQL operations on DataFrames:
val sparkSession = SparkSession.builder().getOrCreate()
The second way creates a SparkSession for SQL operations on DataFrames as well as Hive operations:
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
SparkContext is the Scala entry point, and JavaSparkContext is a Java wrapper around SparkContext.
SQLContext is the entry point of Spark SQL and can be obtained from a SparkContext. Prior to 2.x.x, RDD, DataFrame and Dataset were three different data abstractions. Since Spark 2.x.x, the DataFrame and Dataset APIs are unified (a DataFrame is a Dataset[Row]), and SparkSession is the unified entry point of Spark.
An additional note: RDDs are meant for unstructured, strongly typed data, and DataFrames are for structured, loosely typed data. You can check
Is there any method to convert or create a Context using SparkSession?
Yes: sparkSession.sparkContext() and, for SQL, sparkSession.sqlContext().
Can I completely replace all the Contexts with the single entry point SparkSession?
Yes. You can get the respective contexts from sparkSession.
Are all the functions in SQLContext, SparkContext, JavaSparkContext etc. also available in SparkSession?
Not directly. You have to get the respective context and use it; the contexts are kept for backward compatibility.
How do I use such functions in SparkSession?
Get the respective context and use it.
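As a rough illustration of "get the respective context", here is a minimal Scala sketch. The local master and the application name are assumptions made for the example; wrapping the SparkContext in `new JavaSparkContext(...)` is one way to get the Java-friendly context, since SparkSession does not expose one directly.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}

// Assumed settings: local mode and a hypothetical application name.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("contexts-from-session")
  .getOrCreate()

// The session carries the underlying SparkContext and a SQLContext.
val sc: SparkContext = spark.sparkContext
val sqlContext: SQLContext = spark.sqlContext

// JavaSparkContext is a wrapper, so it is built around the same SparkContext.
val jsc: JavaSparkContext = new JavaSparkContext(sc)
```

All three objects share the one SparkContext owned by the session, so nothing here starts a second context.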
How do I create the following using SparkSession?
- RDD: can be created with sparkSession.sparkContext.parallelize(???)
- JavaRDD: the same applies, but using the Java implementation
- JavaPairRDD: sparkSession.sparkContext.parallelize(???).map(/* turn your data into key-value pairs here */) is one way
- Dataset: what sparkSession returns is a Dataset if the data is structured.
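The list above could look roughly like this in Scala. This is a sketch under assumptions: the sample data, the local master and the application name are invented for illustration, and `toJavaRDD()` and `JavaPairRDD.fromRDD` are used as one possible route to the Java-friendly types. From Java code the more usual path would be JavaSparkContext.parallelize followed by mapToPair.

```scala
import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumed settings: local mode and a hypothetical application name.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-and-dataset")
  .getOrCreate()
import spark.implicits._ // brings in the encoders that createDataset needs

// Made-up sample data.
val words = Seq("a", "b", "a")

// RDD: created through the SparkContext held by the session.
val rdd: RDD[String] = spark.sparkContext.parallelize(words)

// JavaRDD: a Java-friendly view of the same RDD.
val javaRdd: JavaRDD[String] = rdd.toJavaRDD()

// JavaPairRDD: map to key-value pairs first, then wrap the pair RDD.
val pairs: RDD[(String, Int)] = rdd.map(w => (w, 1))
val javaPairs: JavaPairRDD[String, Int] = JavaPairRDD.fromRDD(pairs)

// Dataset: created directly from the session.
val ds: Dataset[String] = spark.createDataset(words)
```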
I will talk about Spark version 2.x only.
SparkSession: It's the main entry point of your Spark application. To run any code on Spark, this is the first thing you should create.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count")\
.config("spark.some.config.option", "some-value")\
.getOrCreate()
SparkContext: It's an inner object (property) of SparkSession. It's used to interact with the low-level API. Through SparkContext you can create RDDs, accumulators and broadcast variables.
In most cases you won't need SparkContext. You can get SparkContext from SparkSession:
sc = spark.sparkContext
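For the low-level API mentioned above, a small Scala sketch of a broadcast variable and an accumulator obtained through the session's SparkContext might look like this. The names (`factor`, `seen`), the sample data and the local master are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumed settings: local mode and a hypothetical application name.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("low-level-api")
  .getOrCreate()
val sc = spark.sparkContext

// Broadcast variable: a read-only value shipped to the executors.
val factor = sc.broadcast(10)

// Accumulator: a counter that tasks add to and the driver reads.
val seen = sc.longAccumulator("seen")

val rdd = sc.parallelize(Seq(1, 2, 3))

// Update the accumulator inside an action rather than a transformation.
rdd.foreach(_ => seen.add(1))

// Use the broadcast value inside a transformation.
val scaled = rdd.map(_ * factor.value).collect()
```

The accumulator is updated in `foreach` because updates made inside transformations such as `map` can be applied more than once if a task is re-executed.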