Keras occupies an indefinitely increasing amount of memory with each epoch
Since the memory leak still seems to be present in TensorFlow 2.4.1 when using built-in functions such as `model.fit()`, here is my take on it.
Issues
- High RAM usage even though I am training on NVIDIA GeForce RTX 2080 Ti GPUs.
- Increasing epoch times as training progresses.
- Some kind of memory leak (the growth felt roughly linear).
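To check what the leak actually looks like on your machine, a small diagnostic callback that logs the process's resident memory at the end of every epoch can help. This is just a quick sketch (it assumes `psutil` is installed and is not part of the solutions below):
import psutil
from tensorflow.keras.callbacks import Callback

class MemoryLogger(Callback):
    # Print the resident set size (RSS) of the training process after each epoch.
    def on_epoch_end(self, epoch, logs=None):
        rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
        print(f"Epoch {epoch + 1}: process RSS = {rss_mb:.1f} MiB")
If the logged values keep climbing although the model and batch size stay fixed, you are most likely looking at the leak described here.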
Solutions
- Add the `run_eagerly=True` argument to the `model.compile()` function. However, doing so might cause TensorFlow's graph optimization to stop working, which could decrease performance (reference).
- Create a custom callback that garbage collects and clears the Keras backend at the end of each epoch (reference).
- Do not use the `activation` parameter of the `tf.keras.layers` layers. Add the activation function as a separate layer instead (reference).
- Use `LeakyReLU` instead of `ReLU` as the activation function (reference).
Note: Since all of the bullet points can be implemented individually, you can mix and match them until you get a result that works for you. Anyway, here is a code snippet showing all of the solutions together:
import gc
from tensorflow.keras import backend as k
from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU
from tensorflow.keras.callbacks import Callback
class ConvNet:
    ...
    x = Conv2D(
        ...,
        activation=None
    )(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)  # or LeakyReLU
    ...
#--------------------------------------------------------------------------------
class ClearMemory(Callback):
    # Garbage collect and clear the Keras backend state at the end of
    # every epoch so that leaked memory is released.
    def on_epoch_end(self, epoch, logs=None):
        gc.collect()
        k.clear_session()
#--------------------------------------------------------------------------------
model.compile(
    ...,
    run_eagerly=True
)
#--------------------------------------------------------------------------------
model.fit(
    ...,
    callbacks=[ClearMemory()]
)
With these solutions in place I am now able to train with less RAM occupied, epoch times stay constant, and if there still is a memory leak, it is negligible.
Thanks to @Hongtao Yang for providing the link to one of the related GitHub issues and to rschiewer over at GitHub for his comment.
Notes
- If none of the above works for you, you might want to try writing your own training loop in TensorFlow. Here is a guide on how to do it (a minimal sketch follows this list).
- People have also reported that using `tcmalloc` instead of the default `malloc` allocator (typically by preloading it via `LD_PRELOAD`) alleviated the memory leakage to some degree. For references see here or here.
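Regarding the first note, a bare-bones custom training loop could look roughly like the sketch below. This is only a minimal version, not the linked guide; `model`, `optimizer`, `loss_fn` and `train_dataset` are assumed to be defined elsewhere:
import gc
import tensorflow as tf

# Assumed to exist already: model, optimizer, loss_fn, train_dataset.
@tf.function
def train_step(x, y):
    # One forward/backward pass, compiled into a graph.
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss = loss_fn(y, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

epochs = 10  # example value
for epoch in range(epochs):
    for x_batch, y_batch in train_dataset:
        loss = train_step(x_batch, y_batch)
    gc.collect()  # release whatever Python still holds on to after the epoch
    print(f"Epoch {epoch + 1}/{epochs}, last batch loss: {float(loss):.4f}")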
I hope this helps others too and saves you some frustrating hours of research on the internet.
Allocating (nearly) all of the available GPU memory up front is the default behavior of TF.
You can restrict how much memory TF consumes with the following code (note that this uses the TF1 / standalone-Keras API):
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9 # fraction of memory
config.gpu_options.visible_device_list = "0"
set_session(tf.Session(config=config))
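On TensorFlow 2.x, where `tf.ConfigProto` and `tf.Session` are no longer available at the top level, a rough equivalent using the `tf.config` API could look like the sketch below (the 4096 MiB cap is only an example value):
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Let TF allocate GPU memory on demand instead of grabbing it all up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Alternatively, put a hard cap on how much GPU memory TF may use:
    # tf.config.experimental.set_virtual_device_configuration(
    #     gpus[0],
    #     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)]
    # )
Both calls have to run before the GPU is initialized, i.e. before the first operation touches it, otherwise TensorFlow raises a RuntimeError.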