Sklearn StratifiedKFold: ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead
keras.utils.to_categorical produces a one-hot encoded class vector, i.e. the multilabel-indicator mentioned in the error message. StratifiedKFold is not designed to work with such input; from the split method docs:
split(X, y, groups=None)
[...]
y : array-like, shape (n_samples,)
The target variable for supervised learning problems. Stratification is done based on the y labels.
i.e. your y must be a 1-D array of your class labels.
Essentially, all you have to do is invert the order of the operations: split first (using your initial y_train), and apply to_categorical afterwards.
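A minimal sketch of that reordering, on made-up toy data. The arrays, the fold count, and the one_hot helper are assumptions for illustration; one_hot is a NumPy stand-in for keras.utils.to_categorical so the snippet runs without Keras installed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 12 samples, 3 balanced classes, integer labels.
x_train = np.arange(24, dtype=float).reshape(12, 2)
y_train = np.array([0, 1, 2] * 4)
n_classes = 3


def one_hot(labels, num_classes):
    # Stand-in for keras.utils.to_categorical (assumption, not the Keras call).
    return np.eye(num_classes, dtype="float32")[labels]


kf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

folds = []
# Split on the 1-D integer labels, then one-hot encode each fold.
for train_index, val_index in kf.split(x_train, y_train):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf = one_hot(y_train[train_index], n_classes)
    y_val_kf = one_hot(y_train[val_index], n_classes)
    folds.append((x_train_kf, y_train_kf, x_val_kf, y_val_kf))
```

With real data you would swap one_hot for to_categorical and feed each fold to your model inside the loop.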
Alternatively, keep the one-hot matrix and call split() on the class indices recovered with argmax:
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical.argmax(1))):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
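As a quick check of why argmax(1) works here: each one-hot row has a single 1 at the position of the class index, so taking the argmax along axis 1 gives back the original 1-D labels. The arrays below are made up for illustration, and np.eye is used as a stand-in for to_categorical.

```python
import numpy as np

# Hypothetical integer labels and their one-hot encoding.
y_train = np.array([2, 0, 1, 1, 0, 2])
y_train_categorical = np.eye(3, dtype=int)[y_train]

# argmax over axis 1 picks the column holding the 1 in each row.
recovered = y_train_categorical.argmax(1)
```

Note that this only holds for a true one-hot encoding; a genuine multilabel matrix with several 1s per row cannot be reduced to a single class this way.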
I bumped into the same problem and found out that you can check the type of the target with this util function:
from sklearn.utils.multiclass import type_of_target
type_of_target(y)
'multilabel-indicator'
From its docstring:
- 'binary': y contains <= 2 discrete values and is 1d or a column vector.
- 'multiclass': y contains more than two discrete values, is not a sequence of sequences, and is 1d or a column vector.
- 'multiclass-multioutput': y is a 2d array that contains more than two discrete values, is not a sequence of sequences, and both dimensions are of size > 1.
- 'multilabel-indicator': y is a label indicator matrix, an array of two dimensions with at least two columns, and at most 2 unique values.
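To make those four categories concrete, here is a small sketch that runs type_of_target on one hand-made array per category. The arrays and the `examples` dict are illustrative assumptions; the last entry is what a one-hot encoded target looks like.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# One hypothetical target array per category from the docstring.
examples = {
    "binary": np.array([0, 1, 1, 0]),
    "multiclass": np.array([0, 1, 2, 1]),
    "multiclass-multioutput": np.array([[1, 2], [3, 1]]),
    # One-hot rows for labels [0, 1, 2, 1]: a label indicator matrix.
    "multilabel-indicator": np.eye(3, dtype=int)[[0, 1, 2, 1]],
}

detected = {name: type_of_target(y) for name, y in examples.items()}
for name, kind in detected.items():
    print(f"{name}: {kind}")
```

Only the first two kinds are accepted by StratifiedKFold, which is why the one-hot matrix triggers the error.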
With LabelEncoder you can transform your classes into a 1d array of integers (provided your target labels are in a 1d array of categoricals/objects):
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target_labels)
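A runnable version of the same idea on made-up string labels (target_labels here is a hypothetical array, not data from the question). LabelEncoder assigns integers in sorted order of the class names, and inverse_transform maps them back.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels in a 1-D object array.
target_labels = np.array(["cat", "dog", "bird", "dog", "cat"], dtype=object)

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target_labels)

# classes_ holds the sorted class names; index i is the code for classes_[i].
restored = label_encoder.inverse_transform(y)
```

The resulting y is 1-D and integer-valued, so it can be passed directly as the y argument of StratifiedKFold.split.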