What is the canonical way to split tf.Dataset into test and validation subsets?
I don't think there's a canonical way (typically, the data is already split up front, e.g. into separate directories). But here's a recipe that will let you do it dynamically:
# Caveat: cache list_ds, otherwise it will perform the directory listing twice.
ds = list_ds.cache()
# Add some indices.
ds = ds.enumerate()
# Do a roughly 70-30 split.
train_list_ds = ds.filter(lambda i, data: i % 10 < 7)
test_list_ds = ds.filter(lambda i, data: i % 10 >= 7)
# Drop indices.
train_list_ds = train_list_ds.map(lambda i, data: data)
test_list_ds = test_list_ds.map(lambda i, data: data)
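To see why the comment says "roughly", the enumerate/filter/map steps above can be mirrored in plain Python without TensorFlow. This is only an illustrative sketch: `modulo_split` and the file names are hypothetical and stand in for the elements of `list_ds`.

```python
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def modulo_split(items: Iterable[T], train_per_ten: int = 7) -> Tuple[List[T], List[T]]:
    """Plain-Python mirror of enumerate() -> filter() -> map() on a dataset."""
    train: List[T] = []
    test: List[T] = []
    for i, item in enumerate(items):
        # Same rule as the filters: the first 7 of every 10 indices go to train.
        if i % 10 < train_per_ten:
            train.append(item)
        else:
            test.append(item)
    return train, test


# Hypothetical file list with 25 entries (not a multiple of 10).
files = [f"file_{n:02d}.png" for n in range(25)]
train_files, test_files = modulo_split(files)
print(len(train_files), len(test_files))  # 19 6
```

With 25 elements the split comes out as 19 to 6 (76% to 24%), because the last, incomplete block of ten contributes only training indices. The ratio gets closer to 70-30 as the dataset grows.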
Based on Dan Moldovan's answer I created a reusable function. Maybe this is useful to other people.
import tensorflow as tf


def split_dataset(dataset: tf.data.Dataset, validation_data_fraction: float):
    """
    Splits a dataset of type tf.data.Dataset into a training and validation dataset using the given ratio.
    The fraction is rounded to the nearest whole percent.

    @param dataset: the input dataset to split.
    @param validation_data_fraction: the fraction of the validation data as a float between 0 and 1.
    @return: a tuple of two tf.data.Datasets as (training, validation)
    """
    validation_data_percent = round(validation_data_fraction * 100)
    if not (0 <= validation_data_percent <= 100):
        raise ValueError("validation data fraction must be ∈ [0,1]")

    dataset = dataset.enumerate()
    # Indices with (index % 100) below the percentage go to validation, the rest to training.
    # Strict comparison: `<=` would put one extra index per 100 into validation.
    train_dataset = dataset.filter(lambda f, data: f % 100 >= validation_data_percent)
    validation_dataset = dataset.filter(lambda f, data: f % 100 < validation_data_percent)

    # remove enumeration
    train_dataset = train_dataset.map(lambda f, data: data)
    validation_dataset = validation_dataset.map(lambda f, data: data)

    return train_dataset, validation_dataset
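One consequence of keying the split on `index % 100` is that the realized fraction only matches the requested one when the dataset length is a multiple of 100. The plain-Python sketch below counts how many elements would land in validation, assuming an element is assigned to validation when `index % 100` is strictly below the validation percentage. `validation_count` is a hypothetical helper for illustration, not part of TensorFlow.

```python
def validation_count(num_elements: int, validation_percent: int) -> int:
    """Count indices i in range(num_elements) with i % 100 < validation_percent."""
    full_blocks, remainder = divmod(num_elements, 100)
    # Each full block of 100 contributes validation_percent indices;
    # a trailing partial block contributes at most that many.
    return full_blocks * validation_percent + min(remainder, validation_percent)


for n in (50, 100, 250, 1000):
    k = validation_count(n, 20)
    print(n, k, round(k / n, 3))
```

For a requested 20% validation share, a 50-element dataset ends up with 20 validation elements (40%), and a 250-element dataset with 60 (24%). For small datasets, a modulus that matches the desired ratio more tightly (for example `index % 5 == 0` for 20%) keeps the realized fraction closer to the target.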