spark off heap memory config and tungsten
What for are spark.memory.offheap.size and spark.memory.offheap.enabled? spark.memory.offHeap.enabled: Parameter to enable/disable the use of off-heap memory. spark.memory.offHeap.size: The total amount of memory in bytes for off-heap allocation(from the native memory). It has no impact on heap memory usage, also make sure not to exceed your executor’s total limits.
Do I manually need to specify the amount of off-heap memory for Tungsten here? Yes. Besides enabling OffHeap memory, you need to manually set its size to use Off-Heap memory for spark Applications. Note that Off-heap memory model includes only Storage memory and Execution memory.
The Image below is the abstract Concept when Off-Heap memory is in action.
• If the Off-heap memory is enabled, there will be both On-heap and Off-heap memory in the Executor.
• The storage memory of the Executor = Storage Memory On-Heap + Storage Memory Off-Heap
• The Execution memory of the Executor = Execution memory On-Heap + Execution memory Off-Heap
Spark/Tungsten use Encoders/Decoders to represent JVM objects as a highly specialized Spark SQL Types objects which then can be serialized and operated on in a highly performant way. Internal format representation is highly efficient and friendly to GC memory utilization.
Thus, even operating in the default on-heap mode Tungsten alleviates the great overhead of JVM objects memory layout and the GC operating time. Tungsten in that mode does allocate objects on heap for its internal purposes and the allocation memory chunks might be huge but it happens much less frequently and survives GC generation transitions smoothly. This almost eliminates the need to consider moving this internal structure off-heap.
In our experiments with this mode on and off we did not see a considerable run time improvements. But what you get with off-heap mode on is that one need to carefully design for the memory allocation outside of you JVM process. This might impose some difficulties within container managers like YARN, Mesos etc when you will need to allow and plan for additional memory chunks besides your JVM process configuration.
Also in off-heap mode Tungsten uses sun.misc.Unsafe which might not be a desired or even possible in your deployment scenarios (with restrictive java security manager configuration for example).
I am also sharing a time tagged video conference talk from Josh Rosen when he is being asked the similar question.