getting number of visible nodes in PySpark
sc.defaultParallelism
is just a hint. Depending on the configuration it may not have a relation to the number of nodes. This is the number of partitions if you use an operation that takes a partition count argument but you don't provide it. For example sc.parallelize
will make a new RDD from a list. You can tell it how many partitions to create in the RDD with the second argument. But the default value for this argument is sc.defaultParallelism
.
You can get the number of executors with sc.getExecutorMemoryStatus
in the Scala API, but this is not exposed in the Python API.
In general the recommendation is to have around 4 times as many partitions in an RDD as you have executors. This is a good tip, because if there is variance in how much time the tasks take this will even it out. Some executors will process 5 faster tasks while others process 3 slower tasks, for example.
You don't need to be very accurate with this. If you have a rough idea, you can go with an estimate. Like if you know you have less than 200 CPUs, you can say 500 partitions will be fine.
So try to create RDDs with this number of partitions:
rdd = sc.parallelize(data, 500) # If distributing local data.
rdd = sc.textFile('file.csv', 500) # If loading data from a file.
Or repartition the RDD before the computation if you don't control the creation of the RDD:
rdd = rdd.repartition(500)
You can check the number of partitions in an RDD with rdd.getNumPartitions()
.
On pyspark you could still call the scala getExecutorMemoryStatus
API using pyspark's py4j bridge:
sc._jsc.sc().getExecutorMemoryStatus().size()