What is the difference between Spark Standalone, YARN and local mode?


Local mode
Think of local mode as executing a program on your laptop in a single JVM. It can be a Java, Scala or Python program in which you define and use a SparkContext, import the Spark libraries, and process data residing on your own machine.
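For example, here is a minimal sketch in Scala (the application name and file path are placeholders) of a word count that runs entirely inside one JVM because the master is set to local[*]:

    import org.apache.spark.sql.SparkSession

    // Local mode: the driver and the executor threads share a single JVM on this machine.
    val spark = SparkSession.builder()
      .appName("LocalModeExample")        // hypothetical application name
      .master("local[*]")                 // use all available cores of the local machine
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("file:///tmp/input.txt")  // hypothetical file on the local file system
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()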


YARN
In reality, Spark programs are meant to process data stored across many machines. Executors process the data stored on those machines, so we need a utility to monitor the executors and manage the resources of those machines (the cluster). Hadoop has its own resource manager for this purpose, so when you run a Spark program against data in HDFS you can leverage Hadoop's resource manager, i.e. YARN. The Hadoop configuration is picked up from HADOOP_CONF_DIR, set inside spark-env.sh or your bash_profile.
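As a hedged sketch (the executor sizing and HDFS path below are assumptions, not recommendations), the same kind of program can be pointed at a YARN cluster simply by changing the master, provided HADOOP_CONF_DIR is set so Spark can find the ResourceManager and the NameNode:

    import org.apache.spark.sql.SparkSession

    // YARN mode: YARN allocates containers for the executors across the Hadoop cluster.
    // Requires HADOOP_CONF_DIR (or YARN_CONF_DIR) to point at the cluster configuration;
    // in practice the master is usually passed via spark-submit --master yarn instead.
    val spark = SparkSession.builder()
      .appName("YarnModeExample")                // hypothetical application name
      .master("yarn")
      .config("spark.executor.instances", "4")   // hypothetical sizing
      .config("spark.executor.memory", "2g")
      .getOrCreate()

    // The input now typically lives in HDFS rather than on a single machine.
    val lines = spark.sparkContext.textFile("hdfs:///data/input.txt")  // hypothetical path
    println(lines.count())
    spark.stop()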


Spark Standalone
The Spark distribution also comes with its own resource manager. When your program uses Spark's resource manager, the execution mode is called standalone. Moreover, Spark lets you set up a distributed master-worker architecture by configuring the properties files under the $SPARK_HOME/conf directory. By default it is set up as a single-node cluster, just like Hadoop's pseudo-distributed mode.
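For illustration (the master URL, port and memory setting are placeholder values), after starting the standalone master with $SPARK_HOME/sbin/start-master.sh and a worker with start-worker.sh (start-slave.sh on older releases), an application connects to it through a spark:// master URL:

    import org.apache.spark.sql.SparkSession

    // Standalone mode: Spark's own master/worker daemons manage the resources.
    val spark = SparkSession.builder()
      .appName("StandaloneModeExample")        // hypothetical application name
      .master("spark://master-host:7077")      // hypothetical host; 7077 is the default master port
      .config("spark.executor.memory", "2g")   // hypothetical per-executor memory
      .getOrCreate()

    println(spark.sparkContext.parallelize(1 to 100).sum())
    spark.stop()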


You are confusing Hadoop YARN with Spark.

YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications.

With the introduction of YARN, Hadoop opened up to running other applications on the platform.

In short, YARN is a "pluggable data-parallel framework".

Apache Spark

Apache Spark is a batch, interactive and streaming framework. Spark has a "pluggable persistent store": it can run with any persistence layer.

For Spark to run, it needs resources. In standalone mode you start the workers and the Spark master yourself, and the persistence layer can be anything: HDFS, the local file system, Cassandra, etc. In YARN mode you ask the YARN/Hadoop cluster to manage the resource allocation and bookkeeping.
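To make the "pluggable persistent store" point concrete, here is a small sketch (all paths are illustrative, and reading from Cassandra would additionally need the spark-cassandra-connector on the classpath): the storage layer is chosen by the input URI or data source, independently of which master manages the resources:

    import org.apache.spark.sql.SparkSession

    // The persistence layer is independent of the resource manager:
    // the same job can read from the local file system, HDFS, or an external store.
    val spark = SparkSession.builder()
      .appName("PluggableStorageExample")
      .master("local[*]")                     // could equally be "yarn" or "spark://master-host:7077"
      .getOrCreate()

    val fromLocalFs = spark.read.textFile("file:///tmp/input.txt")   // hypothetical local path
    val fromHdfs    = spark.read.textFile("hdfs:///data/input.txt")  // hypothetical HDFS path
    // Cassandra, S3, etc. work the same way through their respective data sources.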

When you set the master to local[2], you ask Spark to use 2 cores and to run the driver and the executors in the same JVM. In local mode, all tasks of a Spark job run in that single JVM.

So the only difference between standalone and local mode is that in standalone mode you define "containers" for the workers and the Spark master to run on your machine (so you can have 2 workers, and your tasks can be distributed across the JVMs of those two workers?), whereas in local mode you are just running everything in one JVM on your local machine.
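To summarize, only the master URL changes between the three modes (the host name below is a placeholder); the application code itself stays the same:

    import org.apache.spark.sql.SparkSession

    val builder = SparkSession.builder().appName("MasterUrlComparison")

    val spark = builder.master("local[2]").getOrCreate()         // local mode: 2 cores, one JVM
    // builder.master("spark://master-host:7077").getOrCreate()  // standalone: Spark's own master/workers
    // builder.master("yarn").getOrCreate()                      // YARN: Hadoop manages the resources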
