How to prevent Spark Executors from getting Lost when using YARN client mode?
I had a very similar problem. I had many executors being lost no matter how much memory we allocated to them.
The solution if you're using yarn was to set --conf spark.yarn.executor.memoryOverhead=600
, alternatively if your cluster uses mesos you can try --conf spark.mesos.executor.memoryOverhead=600
instead.
In spark 2.3.1+ the configuration option is now --conf spark.yarn.executor.memoryOverhead=600
It seems like we were not leaving sufficient memory for YARN itself and containers were being killed because of it. After setting that we've had different out of memory errors, but not the same lost executor problem.
You can follow this AWS post to calculate memory overhead (and other spark configs to tune): best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr