Cross validation with grid search returns worse results than default

Running cross-validation on your entire dataset for parameter and/or feature selection can definitely cause problems when you test on the same dataset. It looks like that's at least part of the problem here. Running CV on a subset of your data for parameter optimization, and leaving a holdout set for testing, is good practice.

Assuming you're using the iris dataset (that's the dataset used in the example in your comment link), here's an example of how GridSearchCV parameter optimization is affected by first making a holdout set with train_test_split:

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
gbc = GradientBoostingClassifier()
parameters = {'learning_rate':[0.01, 0.05, 0.1, 0.5, 1], 
              'min_samples_split':[2,5,10,20], 
              'max_depth':[2,3,5,10]}

clf = GridSearchCV(gbc, parameters)
clf.fit(iris.data, iris.target)

print(clf.best_params_)
# {'learning_rate': 1, 'max_depth': 2, 'min_samples_split': 2}

Now repeat the grid search using a random training subset:

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(iris.data, iris.target, 
                                                 test_size=0.33, 
                                                 random_state=42)

clf = GridSearchCV(gbc, parameters)
clf.fit(X_train, y_train)

print(clf.best_params_)
# {'learning_rate': 0.01, 'max_depth': 5, 'min_samples_split': 2}

I'm seeing much higher classification accuracy with both of these approaches, which makes me think maybe you're using different data - but the basic point about performing parameter selection while maintaining a holdout set is demonstrated here. Hope it helps.

Cross validation with grid search returns worse results than default

Tags:

Python

Machine Learning

Scikit Learn

Cross Validation

Grid Search

Related

Recent Posts