How to obtain reproducible but distinct instances of GroupKFold
KFold is only randomized if shuffle=True. Some datasets should not be shuffled. GroupKFold is not randomized at all, hence the random_state=None. GroupShuffleSplit may be closer to what you're looking for.
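To make the GroupShuffleSplit suggestion concrete, here is a minimal sketch of seeding it through random_state. The toy data and the helper name held_out_groups are made up for illustration; only GroupShuffleSplit and its n_splits, test_size and random_state parameters come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data (made up): 12 samples in 6 groups of 2 samples each.
X = np.arange(24).reshape(12, 2)
groups = np.repeat(np.arange(6), 2)


def held_out_groups(seed):
    """Return the test groups of each split for a given seed (hypothetical helper)."""
    # An integer test_size is the absolute number of groups held out per split.
    gss = GroupShuffleSplit(n_splits=3, test_size=2, random_state=seed)
    return [tuple(np.unique(groups[test])) for _, test in gss.split(X, groups=groups)]


same_a = held_out_groups(0)
same_b = held_out_groups(0)  # same seed, so the splits are identical
other = held_out_groups(1)   # a different seed gives an independent draw
```

Reusing a seed reproduces the splits exactly, while changing it gives a fresh set. Note that the three test sets within one run may still share groups, since each split is drawn independently.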
A comparison of the group-based splitters:
- In GroupKFold, the test sets form a complete partition of all the data.
- LeavePGroupsOut leaves out every possible subset of P groups, combinatorially; test sets will overlap for P > 1. Since this means C(n_groups, P) (that is, n_groups choose P) splits altogether, you often want a small P, and most often want LeaveOneGroupOut, which is basically the same as GroupKFold with n_splits set to the number of groups.
- GroupShuffleSplit makes no statement about the relationship between successive test sets; each train/test split is performed independently.
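The first two points can be checked on a small example. This is only a sketch on made-up data (4 groups of 2 samples); the variable names are mine, the splitter classes are scikit-learn's.

```python
from math import comb

import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, LeavePGroupsOut

# Toy data (made up): 8 samples in 4 groups of 2 samples each.
X = np.zeros((8, 1))
groups = np.repeat(np.arange(4), 2)
n_groups = len(np.unique(groups))

# GroupKFold: the test sets together cover every sample exactly once.
gkf_tests = [test for _, test in GroupKFold(n_splits=n_groups).split(X, groups=groups)]
covered = np.sort(np.concatenate(gkf_tests))

# LeavePGroupsOut with P = 2: one split per 2-subset of the groups.
n_lpgo_splits = sum(1 for _ in LeavePGroupsOut(n_groups=2).split(X, groups=groups))

# LeaveOneGroupOut: one split per group, matching GroupKFold when
# n_splits equals the number of groups (compared as sets, ignoring order).
logo_tests = [test for _, test in LeaveOneGroupOut().split(X, groups=groups)]
as_sets = lambda tests: {frozenset(t.tolist()) for t in tests}
same_folds = as_sets(gkf_tests) == as_sets(logo_tests)

print(covered, n_lpgo_splits, comb(n_groups, 2), same_folds)
```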
As an aside, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified test_size.
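To see why that matters, here is a sketch of the problem only, not of the proposed algorithm. With very uneven group sizes (made-up data below), the stock GroupShuffleSplit hits the requested fraction of groups, but the fraction of samples can be far off.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data (made up): one group of 7 samples plus three groups of 1 sample.
groups = np.array([0] * 7 + [1, 2, 3])
X = np.zeros((len(groups), 1))

# test_size=0.5 asks for half of the *groups*, i.e. 2 of the 4 groups.
gss = GroupShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

test_group_counts = []
test_sample_counts = []
for _, test in gss.split(X, groups=groups):
    test_group_counts.append(len(np.unique(groups[test])))
    test_sample_counts.append(len(test))

# Every split holds out exactly 2 groups, but that is either 2 or 8 of the
# 10 samples, never the 5 that a sample-based test_size=0.5 would suggest.
print(test_group_counts, test_sample_counts)
```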
Inspired by user0's answer (can't comment) but faster:
import numpy as np
import pandas as pd


def RandomGroupKFold_split(groups, n, seed=None):  # noqa: N802
    """
    Randomized analogue of sklearn.model_selection.GroupKFold.split.

    :return: list of (train, test) indices
    """
    groups = pd.Series(groups)
    ix = np.arange(len(groups))
    unique = np.unique(groups)
    # Shuffle the unique group labels reproducibly, then cut them into n folds.
    np.random.RandomState(seed).shuffle(unique)
    result = []
    for split in np.array_split(unique, n):
        # Samples whose group landed in this fold form the test set.
        mask = groups.isin(split)
        train, test = ix[~mask], ix[mask]
        result.append((train, test))
    return result
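For a quick sanity check of the idea, here is a self-contained NumPy-only variant with a usage example. The name random_group_kfold and the toy groups are mine, and it uses np.random.default_rng rather than RandomState, so a given seed will not reproduce the same folds as the function above.

```python
import numpy as np


def random_group_kfold(groups, n_splits, seed=None):
    """Shuffle the unique groups with a seed, then cut them into n_splits folds."""
    groups = np.asarray(groups)
    ix = np.arange(len(groups))
    unique = np.unique(groups)
    np.random.default_rng(seed).shuffle(unique)
    folds = []
    for fold_groups in np.array_split(unique, n_splits):
        mask = np.isin(groups, fold_groups)
        folds.append((ix[~mask], ix[mask]))
    return folds


# Toy data (made up): 6 groups of 3 samples each.
groups = np.repeat(list("abcdef"), 3)

run_1 = random_group_kfold(groups, n_splits=3, seed=42)
run_2 = random_group_kfold(groups, n_splits=3, seed=42)  # same seed, same folds

# The test sets partition the samples, and no group straddles train and test.
all_test = np.sort(np.concatenate([test for _, test in run_1]))
leaks = [set(groups[train]) & set(groups[test]) for train, test in run_1]
```

Passing a different seed gives another valid partition of the groups, which is the "reproducible but distinct" behaviour asked for.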