Why does a Spark job fail with "too many open files"?

This has been answered on the Spark user list:

The best way is definitely just to increase the ulimit if possible; this is sort of an assumption we make in Spark that clusters will be able to move it around.

You might be able to hack around this by decreasing the number of reducers [or cores used by each node], but this could have some performance implications for your job.

In general, if a node in your cluster has C assigned cores and you run a job with X reducers, then Spark will open C*X files in parallel and start writing. Shuffle consolidation will help decrease the total number of files created, but the number of file handles open at any time doesn't change, so it won't help the ulimit problem.

-Patrick Wendell
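
To put concrete numbers on that, here is a quick back-of-the-envelope sketch; the core and reducer counts are hypothetical, chosen only to show how quickly C*X blows past the default limit:

# hypothetical example: 16 cores per node, 200 reducers
cores_per_node = 16    # C: cores assigned to each worker node (assumed)
num_reducers = 200     # X: reducers / shuffle partitions in the job (assumed)

open_files_per_node = cores_per_node * num_reducers
print(open_files_per_node)   # 3200 -- well above the common default ulimit of 1024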


Another solution for this error is to reduce the number of partitions.

Check whether you have a lot of partitions with:

someBigSDF.rdd.getNumPartitions()

Out[]: 200

# if you need to keep the repartitioned DataFrame around, assign it back
someBigSDF = someBigSDF.repartition(20)

# if you just need it for one transformation/action,
# you can do the repartition inline like this
from pyspark.sql.functions import count

someBigSDF.repartition(20).groupBy("SomeDt").agg(count("SomeQty")).orderBy("SomeDt").show()
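
The 200 partitions shown above are Spark's default shuffle partition count, controlled by spark.sql.shuffle.partitions. If you'd rather lower that default for all DataFrame shuffles instead of repartitioning each DataFrame by hand, a sketch like this should work (assuming an existing SparkSession named spark):

# lower the default number of partitions used for DataFrame shuffles
spark.conf.set("spark.sql.shuffle.partitions", "20")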

The default ulimit is 1024, which is ridiculously low for large-scale applications. HBase recommends up to 64K; modern Linux systems don't seem to have trouble with that many open files.

Use

ulimit -a

to see your current limits, including the maximum number of open files.

ulimit -n <new limit>

temporarily raises the number of open files for the current shell session; to make the change permanent, you need to update the system-wide configuration and the per-user limits. On CentOS and Red Hat systems, those live in

/etc/sysctl.conf
/etc/security/limits.conf
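
As an illustrative sketch of what those changes might look like (the "spark" user name and the 65536 / 2097152 values are examples only; adjust them for your own cluster):

# temporary: raise the soft limit for the current shell session only
ulimit -n 65536

# permanent, per-user: add entries like these to /etc/security/limits.conf
spark   soft   nofile   65536
spark   hard   nofile   65536

# permanent, system-wide: raise the kernel's open-file cap in /etc/sysctl.conf if needed
fs.file-max = 2097152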

Tags:

Apache Spark