Are the k-fold cross-validation scores from scikit-learn's `cross_val_score` and `GridSearchCV` biased if we include transformers in the pipeline?
No, scikit-learn does not call `fit_transform` on the entire dataset. When the transformer is part of a `Pipeline` passed to `cross_val_score`, the pipeline is re-fitted on each training fold, so the scaler never sees the corresponding test fold.
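To make this concrete, here is a minimal sketch of what `cross_val_score` effectively does with a pipeline. It is a simplification assuming a plain `KFold` split; the real implementation also handles stratification, scoring parameters, and parallelism:

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def sketch_cross_val_score(pipe, X, y, cv=5):
    # Simplified stand-in for cross_val_score's core loop (hypothetical helper).
    scores = []
    for train_idx, test_idx in KFold(n_splits=cv).split(X):
        fold_pipe = clone(pipe)  # fresh, unfitted copy for this fold
        fold_pipe.fit(X[train_idx], y[train_idx])  # transformer fitted on training fold only
        scores.append(fold_pipe.score(X[test_idx], y[test_idx]))
    return np.array(scores)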
To check this, I subclassed `StandardScaler` to print the size of the dataset sent to it:
from sklearn.preprocessing import StandardScaler

class StScaler(StandardScaler):
    def fit_transform(self, X, y=None):
        print(len(X))  # number of samples the scaler is being fitted on
        return super().fit_transform(X, y)
If you now use `StScaler` in place of `StandardScaler` in your code, you'll see that the dataset passed to the scaler in the first case is actually bigger.
But why does the accuracy remain exactly the same? I think this is because `LogisticRegression` is not very sensitive to feature scale. If we instead use a classifier that is very sensitive to scale, `KNeighborsClassifier` for example, the accuracies of the two cases start to differ.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_sc = StScaler().fit_transform(X)  # scaler fitted on the full dataset up front
knn = KNeighborsClassifier(n_neighbors=1)
print(cross_val_score(knn, X_sc, y, cv=5))
Outputs (569 is the size of the whole breast-cancer dataset, so the scaler was fitted on every sample at once):
569
[0.94782609 0.96521739 0.97345133 0.92920354 0.9380531 ]
And the second case, with the scaler inside the pipeline:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('sc', StScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=1))
])
print(cross_val_score(pipe, X, y, cv=5))
Outputs (the scaler is now fitted once per training fold, each roughly 4/5 of the 569 samples):
454
454
456
456
456
[0.95652174 0.97391304 0.97345133 0.92920354 0.9380531 ]
Not a big change accuracy-wise, but a change nonetheless.
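Since the question also mentions `GridSearchCV`: the same per-fold fitting applies there, so keeping the scaler inside the pipeline keeps the hyperparameter search unbiased as well. A minimal sketch, reusing the pipeline above (the `knn__n_neighbors` grid is purely illustrative):

from sklearn.model_selection import GridSearchCV

# For every candidate parameter setting, the whole pipeline (scaler included)
# is re-fitted on each training fold, so no test-fold statistics leak in.
param_grid = {'knn__n_neighbors': [1, 3, 5]}  # illustrative values, not from the question
gs = GridSearchCV(pipe, param_grid, cv=5)
gs.fit(X, y)
print(gs.best_params_, gs.best_score_)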