How to set amount of Spark executors?

OK, got it. Number of executors is not actually Spark property itself but rather driver used to place job on YARN. So as I'm using SparkSubmit class as driver and it has appropriate --num-executors parameter which is exactly what I need.

UPDATE:

For some jobs I don't follow SparkSubmit approach anymore. I cannot do it primarily for applications where Spark job is only one of application component (and is even optional). For these cases I use spark-defaults.conf attached to cluster configuration and spark.executor.instances property inside it. This approach is much more universal allowing me to balance resources properly depending on cluster (developer workstation, staging, production).


We had a similar problem in my lab running Spark on Yarn with data on hdfs, but no matter which of the above solutions I tried, I could not increase the number of Spark executors beyond two.

Turns out the dataset was too small (less than the hdfs block size of 128 MB), and only existed on two of the data nodes (1 master, 7 data nodes in my cluster) due to hadoop's default data replication heuristic.

Once my lab-mates and I had more files (and larger files) and the data was spread on all nodes, we could set the number of Spark executors, and finally see an inverse relationship between --num-executors and time to completion.

Hope this helps someone else in a similar situation.


In Spark 2.0+ version

use spark session variable to set number of executors dynamically (from within program)

spark.conf.set("spark.executor.instances", 4)
spark.conf.set("spark.executor.cores", 4)

In above case maximum 16 tasks will be executed at any given time.

other option is dynamic allocation of executors as below -

spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.executor.cores", 4)
spark.conf.set("spark.dynamicAllocation.minExecutors","1")
spark.conf.set("spark.dynamicAllocation.maxExecutors","5")

This was you can let spark decide on allocating number of executors based on processing and memory requirements for running job.

I feel second option works better that first option and is widely used.

Hope this will help.


You could also do it programmatically by setting the parameters "spark.executor.instances" and "spark.executor.cores" on the SparkConf object.

Example:

SparkConf conf = new SparkConf()
      // 4 executor per instance of each worker 
      .set("spark.executor.instances", "4")
      // 5 cores on each executor
      .set("spark.executor.cores", "5");

The second parameter is only for YARN and standalone mode. It allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker.