Cross validation with grid search returns worse results than default
Running cross-validation on your entire dataset for parameter and/or feature selection can definitely cause problems when you then test on that same dataset: the test score is optimistically biased, because the test data has already influenced which parameters were chosen. It looks like that's at least part of the problem here. Running CV on a subset of your data for parameter optimization, and leaving a holdout set for testing, is good practice.
Assuming you're using the iris dataset (that's the dataset used in the example in your comment link), here's an example of how GridSearchCV parameter optimization is affected by first making a holdout set with train_test_split:
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
iris = datasets.load_iris()
gbc = GradientBoostingClassifier()
parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10, 20],
              'max_depth': [2, 3, 5, 10]}
clf = GridSearchCV(gbc, parameters)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
# {'learning_rate': 1, 'max_depth': 2, 'min_samples_split': 2}
Now repeat the grid search using a random training subset:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.33,
                                                    random_state=42)
clf = GridSearchCV(gbc, parameters)
clf.fit(X_train, y_train)
print(clf.best_params_)
# {'learning_rate': 0.01, 'max_depth': 5, 'min_samples_split': 2}
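The holdout split only pays off once the tuned model is actually scored on it. Here is a minimal, self-contained sketch of that last step. The reduced grid (`small_grid`) and the fixed `random_state=0` on the classifier are my own assumptions to keep the run short and repeatable; they are not part of the code above, and the numbers you get will depend on them.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42)

# Assumption: a smaller grid than the one above, only to keep this quick to run.
small_grid = {'learning_rate': [0.01, 0.1, 1], 'max_depth': [2, 3]}

# Assumption: random_state is fixed here for repeatability; the code above leaves it unset.
search = GridSearchCV(GradientBoostingClassifier(random_state=0), small_grid)
search.fit(X_train, y_train)

# Mean cross-validated accuracy of the best parameter set (training data only).
cv_accuracy = search.best_score_
# Accuracy of the refit best model on data the search never saw.
holdout_accuracy = search.score(X_test, y_test)

print(search.best_params_)
print('CV accuracy: %.3f' % cv_accuracy)
print('Holdout accuracy: %.3f' % holdout_accuracy)
```

The holdout accuracy is the number to compare against the default model, provided the default model is scored on the same X_test and y_test.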
I'm seeing much higher classification accuracy with both of these approaches than you reported, which makes me think maybe you're using different data - but the basic point about performing parameter selection while maintaining a holdout set is demonstrated here. Hope it helps.